Programming and Timing Analysis of Parallel Programs on Multicores
Eugene Yip, Partha Roop, Morteza Biglari-Abhari, Alain Girault
ACSD 2013
Introduction
• Safety-critical systems:
– Perform specific real-time tasks.
– Strict safety standards (IEC 61508, DO-178).
– Time-predictability useful in real-time designs.
– Shift towards multicore designs.
[Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures.
[Pellizzoni et al 2009] Handling Mixed-Criticality in SoC-Based Real-Time Embedded Systems.
[Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.
Figure labels: Embedded Systems; Safety-critical concerns.
Introduction
• Designing safety-critical systems:
– Certified Real-Time Operating Systems (RTOS)
• E.g., VxWorks, LynxOS, and SafeRTOS.
• Programmer manages shared variables.
• Hard to verify timing.
[VxWorks] http://www.windriver.com/products/vxworks/
[LynxOS] http://www.lynuxworks.com/rtos/rtos-178.php
[SafeRTOS] http://www.freertos.org/FreeRTOS-Plus/Safety_Critical_Certified/SafeRTOS.shtml
[Sandell et al 2006] Static Timing Analysis of Real-Time Operating System Code.
– Synchronous Languages
• E.g., Esterel, Esterel C Language (ECL), and PRET-C.
• Deterministic concurrency (Synchrony hypothesis).
• Difficult to distribute: Instantaneous communication or sequential semantics.
[Benveniste et al 2003] The Synchronous Languages 12 Years Later.
[Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.
[Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
[Girault 2005] A Survey of Automatic Distribution Method for Synchronous Programs.
Research Objective
• To design a C-based, parallel programming language that:
– has deterministic execution behaviour,
– can take advantage of multicore execution, and
– is amenable to static timing analysis.
Outline
• Introduction
• ForeC Language
• Timing Analysis
• Results
• Conclusions
ForeC (Foresee) Language
• C-based, multi-threaded, synchronous language. Inspired by Esterel and PRET-C.
• Minimal set of synchronous constructs.
• Fork/join parallelism and shared memory thread communication.
• Structured preemption.
Execution Example
shared int sum = 1 combine with plus;

int plus(int copy1, int copy2) {
  return (copy1 + copy2);
}

void main(void) {
  par(f(1), f(2));
}

void f(int i) {
  sum = sum + i;
  pause;
  ...
}
• shared ... combine with: Shared variable and its combine function.
• par: Fork-join. Blocking statement. Arbitrary thread execution order.
• pause: Global synchronisation barrier.
At the global tick start, the global value of the shared variable is sum = 1.
At the global tick start, threads f1 and f2 each hold a copy of the global sum = 1: sum1 = 1 and sum2 = 1.
Threads get a conceptual copy of the shared variables at the start of every global tick.
During the tick, f1 updates its copy from sum1 = 1 to sum1 = 2, and f2 updates its copy from sum2 = 1 to sum2 = 3.
Threads modify their own copy during execution.
At the global tick end, the copies sum1 = 2 and sum2 = 3 are combined into the global sum = 5.
When a global tick ends, the modified copies are combined and assigned to the actual shared variables.
Combine function is defined by the programmer and must be commutative and associative.
• Modifications are isolated.
• Interleaving does not matter.
• Do not need locks or critical sections.
• But, the programmer has to specify the combine function and placement of pauses.
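The copy-and-combine behaviour walked through above can be imitated in plain C. The sketch below is not ForeC and not the authors' runtime: `run_global_tick` is a hypothetical helper that gives each thread a private copy of the shared value, lets it apply `sum = sum + i`, and then folds the copies together with the `plus` combine function from the example.

```c
#include <assert.h>
#include <stddef.h>

/* Combine function from the slide example; commutative and associative. */
static int plus(int copy1, int copy2) {
    return copy1 + copy2;
}

/*
 * Hypothetical helper (not part of ForeC): simulates one global tick in
 * which thread t executes "sum = sum + increments[t]" on its own copy.
 * Returns the value assigned to the shared variable at the tick end.
 */
static int run_global_tick(int shared, const int *increments, size_t n_threads) {
    assert(increments != NULL || n_threads == 0);
    if (n_threads == 0) {
        return shared;                 /* no thread ran: value unchanged */
    }
    int combined = 0;
    for (size_t t = 0; t < n_threads; t++) {
        int copy = shared;             /* copy taken at the tick start */
        copy = copy + increments[t];   /* thread modifies only its copy */
        combined = (t == 0) ? copy : plus(combined, copy);
    }
    return combined;
}
```

Calling `run_global_tick(1, (int[]){1, 2}, 2)` reproduces the slide values: the copies become 2 and 3 and the combined result is 5. Swapping the increments gives the same result, which is why the thread interleaving does not matter.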
At the next global tick start, the threads' copies are refreshed from the combined value: sum1 = 5 and sum2 = 5.
ForeC (Foresee) Language
Preemption construct:
int x = 1;
abort {
  x = 2;
  pause;
  x = 3;
} when (x > 0);
...
Execution steps:
1. Initialise variable x.
2. Abort body starts executing.
3. Check the abort condition.
4. The abort body is preempted.
5. Execution continues.
ForeC (Foresee) Language
Preemption construct:
[weak] abort {
  st
} when [immediate] (cond)
• immediate: The abort condition is checked when execution first reaches the abort.
• weak: Lets the abort body execute one last time before it is preempted.
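To make the four abort variants concrete, the following plain C sketch simulates the earlier example (`x = 1; abort { x = 2; pause; x = 3; } when (x > 0);`) tick by tick. It is an interpretation of the two bullet points above, modelled on Esterel-style preemption, and `simulate_abort` is a hypothetical name; the actual ForeC semantics may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical simulation of:
 *   int x = 1;
 *   [weak] abort { x = 2; pause; x = 3; } when [immediate] (x > 0);
 * Returns the value of x when execution leaves the abort.
 */
static int simulate_abort(bool weak, bool immediate) {
    int x = 1;
    int segment = 0;        /* next body segment: 0 is "x = 2", 1 is "x = 3" */
    bool first_tick = true; /* tick in which execution first reaches the abort */

    while (segment < 2) {
        /* Assumption: the condition is checked at the start of every later
         * tick, and in the first tick only for the immediate variant. */
        bool check = immediate || !first_tick;
        bool preempt = check && (x > 0);

        if (preempt && !weak) {
            break;          /* strong abort: body does not run in this tick */
        }
        if (segment == 0) {
            x = 2;          /* body runs up to the pause */
        } else {
            x = 3;          /* body runs to its end */
        }
        segment++;
        if (preempt) {
            break;          /* weak abort: body ran one last time */
        }
        first_tick = false;
    }
    assert(x >= 1 && x <= 3);
    return x;
}
```

Under these assumptions the plain abort leaves `x` at 2 (the body is preempted in the second tick), the weak abort leaves it at 3, the immediate abort leaves it at 1, and the weak immediate abort leaves it at 2.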
ForeC (Foresee) Language
Variable type-qualifiers: input and output
• Declares a variable whose value is updated or emitted to the environment at each global tick.
E.g., input int x;
Scheduling
Light-weight static scheduling:
– Take advantage of multicore performance while delivering time-predictability (ease static timing analysis).
– Thread allocation and scheduling order on each core decided at compile time by the programmer.
– Cooperative (non-preemptive) scheduling.
– Fork/join semantics and notion of a global tick is preserved via synchronisation.
Scheduling
Light-weight static scheduling:
– One core to perform housekeeping tasks at the end of the global tick.
• Combining of shared variables.
• Emitting outputs and sampling inputs.
• Starting the next global tick.
Timing Analysis
Compute the program’s Worst-Case Reaction Time (WCRT).
Figure: reaction times plotted against physical time (1s to 4s), with the maximum time allowed (design specification) marked.
WCRT = max(Reaction times)
Must validate: WCRT ≤ Maximum time allowed
[Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.
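The validation condition above is a maximum followed by a comparison. A minimal C sketch, with hypothetical function names and the reaction times supplied by the caller:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* WCRT = max(reaction times). Expects at least one reaction time. */
static unsigned long worst_case_reaction_time(const unsigned long *rt, size_t n) {
    assert(rt != NULL && n > 0);
    unsigned long wcrt = rt[0];
    for (size_t i = 1; i < n; i++) {
        if (rt[i] > wcrt) {
            wcrt = rt[i];
        }
    }
    return wcrt;
}

/* Must validate: WCRT <= maximum time allowed by the design specification. */
static bool meets_timing_requirement(const unsigned long *rt, size_t n,
                                     unsigned long max_allowed) {
    return worst_case_reaction_time(rt, n) <= max_allowed;
}
```

In a static analysis the reaction times are computed bounds for every reachable global tick rather than measurements, but the final check has the same shape.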
Timing Analysis
Construct a Concurrent CFG (CCFG) of the executable binary.
shared int sum = 1 combine with plus;

int plus(int copy1, int copy2) {
  return (copy1 + copy2);
}

void main(void) {
  par(f(1), f(2));
}

void f(int i) {
  sum = sum + i;
  pause;
  ...
}
CCFG node types: Graph Start, Graph End, Fork, Join, Computation, Condition, Pause, Abort.
Figure: CCFG of the example, with threads main, f1 and f2.
Timing Analysis
One existing approach for multicores:
• [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors.
• Uses ILP which is NP-Complete, no tightness result, analysis results are only for a 4-core processor.
Existing approaches for single-core:
– Integer Linear Programming (ILP)
– Model Checking/Reachability
– Max-Plus
[P. S. Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
[M. Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.
Reachability
Figure: tree of reachable global ticks g1, g2, g3a, g3b, g4a, g4b and g4c, labelled with their reaction times RT1 (reaction time of g1), RT2, RT3a, RT3b, RT4a, RT4b and RT4c.
WCRT = MAX(RT1 … RT4c)
• Traverse the CCFG to find all possible global ticks.
• State-space explosion.
• Precision vs. Analysis time.
Reachability
Figure: the tree of global ticks g1 to g4c with the worst case highlighted: WCRT = RT4b.
Identify the path leading to the WCRT. Good for understanding the timing behaviour.
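As an illustration of the reachability idea, the C sketch below walks a small tree of global ticks and returns the tick with the largest reaction time. The tree edges and all reaction-time values are hypothetical (the slides give only the tick names and that RT4b is the worst case), and `find_worst_tick` is an illustrative name rather than the authors' tool. A real analysis would generate the ticks from the CCFG and would need a visited set, because program states can repeat.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SUCC 3

typedef struct Tick {
    const char *name;
    unsigned reaction_time;             /* hypothetical cost in clock cycles */
    const struct Tick *succ[MAX_SUCC];  /* global ticks that may follow */
    size_t n_succ;
} Tick;

/* Hypothetical tree shaped like the slide: g1, g2, g3a/g3b, g4a/g4b/g4c. */
static const Tick g4a = {"g4a", 40, {NULL}, 0};
static const Tick g4b = {"g4b", 95, {NULL}, 0};
static const Tick g4c = {"g4c", 60, {NULL}, 0};
static const Tick g3a = {"g3a", 55, {&g4a, &g4b}, 2};
static const Tick g3b = {"g3b", 70, {&g4c}, 1};
static const Tick g2  = {"g2",  30, {&g3a, &g3b}, 2};
static const Tick g1  = {"g1",  20, {&g2}, 1};

/* Depth-first traversal that keeps the tick with the largest reaction time. */
static const Tick *find_worst_tick(const Tick *tick, const Tick *worst_so_far) {
    assert(tick != NULL);
    const Tick *worst = worst_so_far;
    if (worst == NULL || tick->reaction_time > worst->reaction_time) {
        worst = tick;
    }
    for (size_t i = 0; i < tick->n_succ; i++) {
        worst = find_worst_tick(tick->succ[i], worst);
    }
    return worst;
}
```

Starting from `find_worst_tick(&g1, NULL)`, the traversal visits all seven ticks and returns `g4b`. Recording the chain of ticks on the way down would give the path leading to the WCRT.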
Max-Plus
• Makes the safe assumption that the program’s WCRT occurs when all threads execute their longest reaction together.
– Compute the WCRT of each thread separately.
– Compute the program’s WCRT by using WCRT of the threads.
– Fast analysis time but over-estimation could be large.
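A small C sketch of why this assumption is safe but can over-estimate. It assumes two threads sharing one core, so that their costs add up within a tick; the per-tick costs and the function names are hypothetical and this is not the published Max-Plus formulation.

```c
#include <assert.h>
#include <stddef.h>

/* WCRT of one thread analysed on its own: its most expensive tick. */
static unsigned thread_wcrt(const unsigned *tick_costs, size_t n_ticks) {
    assert(tick_costs != NULL && n_ticks > 0);
    unsigned worst = tick_costs[0];
    for (size_t i = 1; i < n_ticks; i++) {
        if (tick_costs[i] > worst) {
            worst = tick_costs[i];
        }
    }
    return worst;
}

/* Max-Plus style bound: assume both worst ticks happen together. */
static unsigned maxplus_bound(const unsigned *a, const unsigned *b, size_t n_ticks) {
    return thread_wcrt(a, n_ticks) + thread_wcrt(b, n_ticks);
}

/* Tick-by-tick bound: only add costs that occur in the same global tick. */
static unsigned tick_aligned_wcrt(const unsigned *a, const unsigned *b, size_t n_ticks) {
    assert(a != NULL && b != NULL && n_ticks > 0);
    unsigned worst = a[0] + b[0];
    for (size_t i = 1; i < n_ticks; i++) {
        if (a[i] + b[i] > worst) {
            worst = a[i] + b[i];
        }
    }
    return worst;
}
```

With thread costs {10, 2} and {2, 10}, the Max-Plus style bound is 20 while the tick-aligned worst case is 12: the bound is safe, but the worst ticks of the two threads never coincide.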
Timing Analysis
Propose the use of Reachability for multicore analysis:
– Trade off analysis time for higher precision.
– Analyse inter-core synchronisations in detail.
– Handle state-space explosion by reducing the program’s CCFG before reachability analysis.
Analysis flow: Program binary (annotated) → Program’s reduced CCFG → Compute each global tick → WCRT.
Timing Analysis
CCFG optimisations:
– merge: Reduces the number of CFG nodes that need to be traversed.
– merge-b: Reduces the number of alternate paths in the CFG. (Reduces the number of global ticks)
Figure: example CFG with node costs 1, 4, 3 and 1. After merge, the two alternate paths have costs 1 + 3 = 4 and 1 + 4 + 1 = 6. After merge-b, a single node with cost 6 remains.
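The cost arithmetic in the figure can be sketched in C as follows. Reading merge as a sum over a straight-line path and merge-b as keeping the most expensive alternative is an interpretation of the slide's numbers, and the function names are illustrative rather than taken from the authors' tool.

```c
#include <assert.h>
#include <stddef.h>

/* merge: collapse a straight-line sequence of nodes into one node
 * whose cost is the sum of the individual costs. */
static unsigned merge(const unsigned *node_costs, size_t n_nodes) {
    assert(node_costs != NULL && n_nodes > 0);
    unsigned total = 0;
    for (size_t i = 0; i < n_nodes; i++) {
        total += node_costs[i];
    }
    return total;
}

/* merge-b: collapse alternate paths into one node that keeps only
 * the most expensive path, which is all a worst-case analysis needs. */
static unsigned merge_b(const unsigned *path_costs, size_t n_paths) {
    assert(path_costs != NULL && n_paths > 0);
    unsigned worst = path_costs[0];
    for (size_t i = 1; i < n_paths; i++) {
        if (path_costs[i] > worst) {
            worst = path_costs[i];
        }
    }
    return worst;
}
```

For the figure's values, `merge` turns the paths {1, 3} and {1, 4, 1} into costs 4 and 6, and `merge_b` then reduces the two alternatives to the single cost 6.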
Timing Analysis
• Computing each global tick:
1. Parallel thread execution and inter-core synchronisations.
2. Scheduling overheads.
3. Variable delay in accessing the shared bus.
Timing Analysis
1. Parallel thread execution and inter-core synchronisations.
• An integer counter to track each core’s execution time.
• Static scheduling allows us to determine the thread execution order on each core.
• Synchronisation at fork/join, and end of the global tick.
Figure: thread allocation for the example. Core 1 runs main and f1; Core 2 runs f2.
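The per-core counters can be sketched in C as below. The thread costs are hypothetical inputs, overheads and bus delays are ignored, and the names `CoreClocks`, `execute`, `synchronise` and `example_tick` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES 2

/* One integer counter per core, tracking elapsed cycles in this tick. */
typedef struct {
    unsigned long elapsed[NUM_CORES];
} CoreClocks;

/* A thread segment runs on one core and advances only that core's counter. */
static void execute(CoreClocks *clocks, size_t core, unsigned long cycles) {
    assert(clocks != NULL && core < NUM_CORES);
    clocks->elapsed[core] += cycles;
}

/* At a fork, join or tick end every core waits for the slowest one. */
static unsigned long synchronise(CoreClocks *clocks) {
    assert(clocks != NULL);
    unsigned long latest = 0;
    for (size_t c = 0; c < NUM_CORES; c++) {
        if (clocks->elapsed[c] > latest) {
            latest = clocks->elapsed[c];
        }
    }
    for (size_t c = 0; c < NUM_CORES; c++) {
        clocks->elapsed[c] = latest;
    }
    return latest;
}

/* Example allocation from the slide: main and f1 on Core 1, f2 on Core 2. */
static unsigned long example_tick(unsigned long main_cost,
                                  unsigned long f1_cost,
                                  unsigned long f2_cost) {
    CoreClocks clocks = {{0, 0}};
    execute(&clocks, 0, main_cost);  /* main runs up to the par statement */
    synchronise(&clocks);            /* fork */
    execute(&clocks, 0, f1_cost);    /* f1 on Core 1 */
    execute(&clocks, 1, f2_cost);    /* f2 on Core 2, in parallel */
    return synchronise(&clocks);     /* join at the end of the tick */
}
```

With costs 5, 10 and 20 for main, f1 and f2, the tick takes 25 cycles because Core 1 waits at the join for f2 on Core 2.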
Timing Analysis
2. Scheduling overheads.
– Synchronisation: Fork/join and global tick.
• Via global memory.
– Thread context-switching.
• Copying of shared variables at the start of the thread’s local tick via global memory.
Figure: timelines of Core 1 (main, f1) and Core 2 (f2) within a global tick, marking the synchronisation and thread context-switch overheads.
Timing Analysis
2. Scheduling overheads.
– Required scheduling routines statically known.
– Analyse the control-flow of the routines.
– Compute the execution time for each scheduling overhead.
Figure: Core 1 and Core 2 timelines for main, f1 and f2, with the scheduling overheads added.
Timing Analysis
3. Variable delay in accessing the shared bus.
– Global memory accessed by scheduling routines.
– TDMA bus delay has to be considered.
Figure: Core 1 and Core 2 timelines for main, f1 and f2.
Figure: TDMA bus slots alternate between Core 1 and Core 2 (1 2 1 2 ...), shown against the core timelines for main, f1 and f2.
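The variable bus delay can be estimated with a short C sketch. It assumes a simple TDMA model that is not spelled out on the slides: cores own equal-length slots in core order, and an access must start and finish inside the issuing core's own slot. The function name `tdma_wait` and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Cycles a core must wait before a global memory access can start.
 * now:        current time in cycles since the bus schedule began
 * core:       index of the issuing core (0-based)
 * n_cores:    number of cores sharing the bus
 * slot_len:   length of each core's TDMA slot in cycles
 * access_len: duration of the access in cycles (at most slot_len)
 */
static unsigned long tdma_wait(unsigned long now, unsigned long core,
                               unsigned long n_cores, unsigned long slot_len,
                               unsigned long access_len) {
    assert(n_cores > 0 && core < n_cores);
    assert(slot_len > 0 && access_len > 0 && access_len <= slot_len);

    unsigned long round = slot_len * n_cores;   /* one full bus schedule */
    unsigned long pos = now % round;            /* position in the round */
    unsigned long slot_start = core * slot_len;
    unsigned long slot_end = slot_start + slot_len;

    if (pos >= slot_start && pos + access_len <= slot_end) {
        return 0;                               /* fits in the current slot */
    }
    if (pos < slot_start) {
        return slot_start - pos;                /* own slot is still ahead */
    }
    return round - pos + slot_start;            /* wait for the next round */
}
```

For two cores with 5-cycle slots and a 5-cycle access, a request from the first core at cycle 1 just misses its slot and waits 9 cycles, the worst case for this configuration. A static analysis that tracks each core's cycle counter can compute this delay exactly instead of always assuming the worst case.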
Results
• For the proposed reachability-based timing analysis, we demonstrate:
– the precision of the computed WCRT.
– the efficiency of the analysis, in terms of analysis time.
Results
• Timing analysis tool:
Program binary (annotated) → Program CCFG (optimisations) → Proposed Reachability or Max-Plus → WCRT
Results
• Multicore simulator (Xilinx MicroBlaze):
– Based on http://www.jwhitham.org/c/smmu.html and extended to be cycle-accurate and support multiple cores and a TDMA bus.
Figure: Core 0 to Core n, each with its own data memory and instruction memory, connected by a TDMA shared bus to a data memory. Labels shown: 16KB, 16KB, 32KB; 1 cycle, 5 cycles; 5 cycles/core (Bus schedule round = 5 * no. cores).
Results
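• Mix of control/data computations, thread structure and computation load.
Table: Benchmark programs (table contents not recoverable; entries marked * are from [Pop et al 2011] and the entry marked # is from [Nemer et al 2006]).
* [Pop et al 2011] A Stream-Computing Extension to OpenMP.
# [Nemer et al 2006] A Free Real-Time Benchmark.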
Results
• Each benchmark program was distributed over 1 to n cores.
– n = maximum number of parallel threads.
• Observed the WCRT:
– Input vectors to elicit the worst case execution path identified by Reachability analysis.
• Computed the WCRT:
– Reachability
– Max-Plus
802.11a Results
Observed:
• WCRT decreases until 5 cores.
• TDMA Bus is a bottleneck: Global memory becomes more expensive.
• Synchronisation overheads.
Figure: WCRT (clock cycles) against number of cores (1 to 10) for Observed, Reachability and MaxPlus.
802.11a Results
Reachability:
• ~2% over-estimation.
• Benefit of explicit path exploration.
802.11a Results
Max-Plus:
• Assumes one global tick where all threads execute their worst-case.
• Loss of thread execution context: Max execution time of the scheduling routines.
802.11a Results
Both approaches:
• Estimation of synchronisation cost is conservative. Assumed that the receive only starts after the last sender.
802.11a Results
Figure: analysis time (seconds) against number of cores (1 to 10) for Reachability.
Max-Plus takes less than 2 seconds.
802.11a Results
Figure: analysis time (seconds) against number of cores (1 to 10) for Reachability, Reachability (merge) and Reachability (merge-b).
merge:
• Reduction of ~9.34x
merge-b:
• Reduction of ~342x
• Less than 7 sec.
802.11a Results
Reduction in states leads to a reduction in analysis time.
Table: Number of global ticks explored (table contents not recoverable).
Results
Reachability:
• ~1 to 8% over-estimation.
• Loss in precision mainly from over-estimating the synchronisation costs.
Figure: four panels (FmRadio, 1 to 4 cores; Fly by Wire, 1 to 7 cores; Life, 1 to 8 cores; Matrix, 1 to 8 cores) plotting Observed, Reachability and MaxPlus against the number of cores.
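The over-estimation percentages quoted in these results are presumably the relative gap between the computed and the observed WCRT. A minimal C sketch of that calculation follows; the formula is an assumption and the function name is hypothetical.

```c
#include <assert.h>

/* Relative over-estimation of a computed WCRT bound, in percent.
 * A safe bound is never below the observed WCRT. */
static double over_estimation_percent(unsigned long computed_wcrt,
                                      unsigned long observed_wcrt) {
    assert(observed_wcrt > 0);
    assert(computed_wcrt >= observed_wcrt);
    return 100.0 * (double)(computed_wcrt - observed_wcrt)
                 / (double)observed_wcrt;
}
```

For example, a computed bound of 102,000 cycles against an observed 100,000 cycles is a 2% over-estimation.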
Results
Max-Plus:
• Over-estimation very dependent on program structure.
• FmRadio and Life very imprecise. Loops can “amplify” over-estimations.
• Matrix quite precise. Executes in one global tick. Thus, Max-Plus assumption is valid.
Results
• Our tool generates a timing trace for the computed WCRT:
– For each core: Thread start/end time, context-switching, fork/join, ...
– Can be used to tune the thread distribution.
• Was used to find good thread distributions for each benchmark program.
Conclusions
• ForeC language for deterministic parallel programming.
• Based on synchronous framework.
• Able to achieve WCRT speedup while providing time-predictability.
• Precise, fast and scalable timing analysis for multicore programs using reachability.
Future work
Implementation:
• WCRT-guided, automatic thread distribution.
• Decrease global synchronisation overhead without increasing analysis complexity.
Analysis:
• Prune additional infeasible paths using value analysis.
• Include the use of caches/scratchpads in the multicore memory hierarchy.
Questions?
Introduction
• Existing parallel programming solutions.
– Shared memory model.
• OpenMP, Pthreads
• Intel Cilk Plus, Threading Building Blocks
• Unified Parallel C, ParC, X10
– Message passing model.
• MPI, SHIM
– Provides ways to manage shared resources but not prevent concurrency errors.
[OpenMP] http://openmp.org
[Pthreads] https://computing.llnl.gov/tutorials/pthreads/
[X10] http://x10-lang.org/
[Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus
[Intel Threading Building Blocks] http://threadingbuildingblocks.org/
[Unified Parallel C] http://upc.lbl.gov/
[Ben-Asher et al] ParC – An Extension of C for Shared Memory Parallel Processing.
[MPI] http://www.mcs.anl.gov/research/projects/mpi/
[SHIM] SHIM: A Language for Hardware/Software Integration.
Introduction
– Desktop variants optimised for average-case performance (FLOPS), not time-predictability.
– Threaded programming model.
• Non-deterministic thread interleaving makes understanding and debugging hard.
[Lee 2006] The Problem with Threads.
Introduction
• Parallel programming:
– Programmer manages the shared resources.
– Concurrency errors:
• Deadlock, Race condition, Atomicity violation, Order violation.
[McDowell et al 1989] Debugging Concurrent Programs.
[Lu et al 2008] Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics.
Introduction
• Deterministic runtime support.
– Pthreads
• dOS, Grace, Kendo, CoreDet, Dthreads.
– OpenMP
• Deterministic OMP
– Concept of logical time.
– Each logical time step broken into an execution and communication phase.
[Bergan et al 2010] Deterministic Process Groups in dOS.
[Olszewski et al 2009] Kendo: Efficient Deterministic Multithreading in Software.
[Bergan et al 2010] CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution.
[Liu et al 2011] Dthreads: Efficient Deterministic Multithreading.
[Aviram 2012] Deterministic OpenMP.
ForeC language
• Behaviour of shared variables is similar to:
– Esterel (Valued signals)
– Intel Cilk+ (Reducers)
– Unified Parallel C (Collectives)
– DOMP (Workspace consistency)
– Grace (Copy-on-write)
– Dthreads (Copy-on-write)
ForeC language
• Parallel programming patterns:
– Specifying an appropriate combine function.
– Sacrifice for deterministic parallel programs.
– Map-reduce
– Scatter-gather
– Software pipelining
– Delayed broadcast or point-to-point communication.