Automatic Parallelization of Simulation Code from Equation Based Simulation Languages
Peter Aronsson
Automatic Parallelization of Simulation Code from Equation Based Simulation Languages
Peter Aronsson,
Industrial PhD student, PELAB, SaS, IDA
Linköping University, Sweden
Based on Licentiate presentation & CPC’03 Presentation
Outline
• Introduction
• Task Graphs
• Related work on Scheduling & Clustering
• Parallelization Tool
• Contributions
• Results
• Conclusion & Future Work
Introduction
• Modelica
  – Object Oriented, Equation Based, Modeling Language
• Modelica enables modeling and simulation of large and complex multi-domain systems
• Large need for parallel computation
  – To decrease time of executing simulations
  – To make large models possible to simulate at all.
  – To meet hard real time demands in hardware-in-the-loop simulations
Examples of large complex systems in Modelica
Modelica Example - DCmotor
[Figure: connection diagram of the DCmotor model with the components R1, I1, emf, ground, step and load]
Modelica example
model DCMotor
  import Modelica.Electrical.Analog.Basic.*;
  import Modelica.Electrical.Sources.StepVoltage;
  Resistor R1(R=10);
  Inductor I1(L=0.1);
  EMF emf(k=5.4);
  Ground ground;
  StepVoltage step(V=10);
  Modelica.Mechanics.Rotational.Inertia load(J=2.25);
equation
  connect(R1.n, I1.p);
  connect(I1.n, emf.p);
  connect(emf.n, ground.p);
  connect(emf.flange_b, load.flange_a);
  connect(step.p, R1.p);
  connect(step.n, ground.p);
end DCMotor;
Example – Flat set of Equations
R1.v = -R1.n.v+R1.p.v
0 = R1.n.i+R1.p.i
R1.i = R1.p.i
R1.i*R1.R = R1.v
I1.v = -I1.n.v+I1.p.v
0 = I1.n.i+I1.p.i
I1.i = I1.p.i
I1.L*I1.der(i) = I1.v
emf.v = -emf.n.v+emf.p.v
0 = emf.n.i+emf.p.i
emf.i = emf.p.i
emf.w = emf.flange_b.der(phi)
emf.k*emf.w = emf.v
emf.flange_b.tau = -emf.i*emf.k
ground.p.v = 0
step.v = -step.n.v+step.p.v
0 = step.n.i+step.p.i
step.i = step.p.i
step.signalSource.outPort.signal[1] = (if time < step.signalSource.p_startTime[1] then 0 else step.signalSource.p_height[1])+step.signalSource.p_offset[1]
step.v = step.signalSource.outPort.signal[1]
load.flange_a.phi = load.phi
load.flange_b.phi = load.phi
load.w = load.der(phi)
load.a = load.der(w)
load.a*load.J = load.flange_a.tau+load.flange_b.tau
R1.n.v = I1.p.v
I1.p.i+R1.n.i = 0
I1.n.v = emf.p.v
emf.p.i+I1.n.i = 0
emf.n.v = step.n.v
step.n.v = ground.p.v
emf.n.i+ground.p.i+step.n.i = 0
emf.flange_b.phi = load.flange_a.phi
emf.flange_b.tau+load.flange_a.tau = 0
step.p.v = R1.p.v
R1.p.i+step.p.i = 0
load.flange_b.tau = 0
step.signalSource.y = step.signalSource.outPort.signal
Plot of Simulation result
[Figure: plot of load.flange_a.tau and load.w over simulation time]
Task Graphs
• Directed Acyclic Graph (DAG) $G = (V, E, \tau, c)$
  – $V$: set of nodes, representing computational tasks
  – $E$: set of edges, representing communication of data between tasks
  – $\tau(v)$: execution cost for node $v$
  – $c(i,j)$: communication cost for edge $(i,j)$
• Referred to as the delay model (macro dataflow model)
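A task graph of this kind can be held in a very small data structure. The following is a minimal sketch, not the tool's actual representation; the class name `TaskGraph`, its methods and the example costs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """Task graph G = (V, E, tau, c) under the delay model (hypothetical helper)."""
    tau: dict = field(default_factory=dict)  # node -> execution cost tau(v)
    c: dict = field(default_factory=dict)    # (i, j) -> communication cost c(i,j)

    def add_task(self, v, cost):
        self.tau[v] = cost

    def add_edge(self, i, j, cost):
        if i not in self.tau or j not in self.tau:
            raise KeyError("both endpoints must be added as tasks first")
        self.c[(i, j)] = cost

    def pred(self, v):
        return [i for (i, j) in self.c if j == v]

    def succ(self, v):
        return [j for (i, j) in self.c if i == v]


# Illustrative graph (made-up costs): task 1 forks to 2 and 3, which join in 4.
g = TaskGraph()
for v, cost in [(1, 2), (2, 1), (3, 2), (4, 1)]:
    g.add_task(v, cost)
for i, j, cost in [(1, 2, 5), (1, 3, 10), (2, 4, 5), (3, 4, 10)]:
    g.add_edge(i, j, cost)
```

Storing the costs in two dictionaries keeps the later computations (granularity, levels, clustering) to a few lines each.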
Small Task Graph Example
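[Figure: task graph with eight nodes (1 to 8), each annotated with an execution cost of 1 or 2; edges are annotated with communication costs of 5 or 10]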
Task Scheduling Algorithms
• Multiprocessor Scheduling Problem
  – For each task, assign
    • Starting time
    • Processor assignment (P1,...PN)
  – Goal: minimize execution time, given
    • Precedence constraints
    • Execution cost
    • Communication cost
• Algorithms in literature
  – List Scheduling approaches (ERT, FLB)
  – Critical Path scheduling approaches (TDS, MCP)
• Categories: Fixed No. of Proc, fixed c and/or $\tau$, ...
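To make the scheduling problem concrete, here is a minimal greedy list scheduler. It is a generic sketch in the spirit of list scheduling, not an implementation of ERT, FLB, TDS or MCP; the function name and the example costs are assumptions.

```python
def list_schedule(tau, c, num_procs):
    """Greedy list scheduling: start each ready task as early as possible.

    tau: node -> execution cost, c: (i, j) -> communication cost (graph must be a DAG).
    Returns node -> (processor, start_time).
    """
    preds = {v: [] for v in tau}
    for (i, j) in c:
        preds[j].append(i)

    proc_free = [0] * num_procs  # time at which each processor becomes idle
    placed = {}                  # node -> (processor, start, finish)
    remaining = set(tau)
    while remaining:
        # Ready tasks: all predecessors are already scheduled.
        ready = sorted(v for v in remaining if all(u in placed for u in preds[v]))
        best = None
        for v in ready:
            for p in range(num_procs):
                # Data from a predecessor on another processor arrives c(u,v) later.
                arrival = max(
                    (placed[u][2] + (0 if placed[u][0] == p else c[(u, v)])
                     for u in preds[v]),
                    default=0,
                )
                start = max(arrival, proc_free[p])
                if best is None or start < best[0]:
                    best = (start, v, p)
        start, v, p = best
        placed[v] = (p, start, start + tau[v])
        proc_free[p] = start + tau[v]
        remaining.remove(v)
    return {v: (p, s) for v, (p, s, _) in placed.items()}


# Made-up fork-join graph with communication much more expensive than execution.
tau = {1: 2, 2: 1, 3: 2, 4: 1}
c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 4): 10}
schedule = list_schedule(tau, c, num_procs=2)
```

With these costs the scheduler places all four tasks on processor 0: moving a task to the second processor would cost more in communication than it saves, which is the typical situation for fine-grained task graphs.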
Granularity
• Granularity $g = \min(\tau(v)) / \max(c(i,j))$
• Affects scheduling result
  – E.g. TDS works best for high values of g, i.e. low communication cost
• Solutions:
  – Clustering algorithms
    • IDEA: build clusters of nodes where nodes in the same cluster are executed on the same processor
  – Merging algorithms
    • Merge tasks to increase computational cost.
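The granularity measure is simple to compute from the two cost maps. A small sketch with made-up costs (the function name and the handling of graphs without communication are assumptions):

```python
def granularity(tau, c):
    """g = smallest execution cost divided by largest communication cost."""
    if not c or max(c.values()) == 0:
        return float("inf")  # no communication at all
    return min(tau.values()) / max(c.values())


tau = {1: 2, 2: 1, 3: 2, 4: 1}
c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 4): 10}
g = granularity(tau, c)  # 1 / 10
```

A value far below 1, as here, indicates a fine-grained graph in which communication dominates.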
Task Clustering/Merging Algorithms
• Task Clustering Problem:
  – Build clusters of nodes such that parallel time decreases
  – PT(n) = tlevel(n)+blevel(n)
  – By zeroing edges, i.e. putting several nodes into the same cluster => zero communication cost.
• Literature:
  – Sarkar's Internalization alg., Yang's DSC alg.
• Task Merging Problem
  – Transform the Task Graph by merging nodes
• Literature: E.g. Grain Packing alg.
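The effect of zeroing edges on parallel time can be checked with a short computation. This sketch assumes the usual delay-model definitions (tlevel is the longest path into a node, excluding the node itself; blevel is the longest path from the node to an exit, including it) and takes the parallel time as the maximum of tlevel(n)+blevel(n) over all nodes. Function names and costs are made up.

```python
from functools import lru_cache


def levels(tau, c):
    """Return (tlevel, blevel) dictionaries for a DAG under the delay model."""
    preds = {v: [] for v in tau}
    succs = {v: [] for v in tau}
    for (i, j) in c:
        preds[j].append(i)
        succs[i].append(j)

    @lru_cache(maxsize=None)
    def tlevel(n):
        return max((tlevel(k) + tau[k] + c[(k, n)] for k in preds[n]), default=0)

    @lru_cache(maxsize=None)
    def blevel(n):
        return tau[n] + max((c[(n, s)] + blevel(s) for s in succs[n]), default=0)

    return {v: tlevel(v) for v in tau}, {v: blevel(v) for v in tau}


def parallel_time(tau, c):
    t, b = levels(tau, c)
    return max(t[v] + b[v] for v in tau)


tau = {1: 2, 2: 1, 3: 2, 4: 1}
c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 4): 10}
pt_before = parallel_time(tau, c)

# Put the chain 1 -> 3 -> 4 into one cluster: its internal edges cost nothing.
zeroed = dict(c)
zeroed[(1, 3)] = 0
zeroed[(3, 4)] = 0
pt_after = parallel_time(tau, zeroed)
```

Here the parallel time drops from 25 to 14. The sketch only zeroes edge costs; it does not model that tasks in the same cluster execute one after another, which a real clustering algorithm has to account for.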
Clustering vs. Merging
[Figure: the small task graph example next to its Clustered Task Graph, where communication costs on edges inside a cluster are set to 0, and its Merged Task Graph, where groups of nodes have been merged into single larger nodes]
DSC algorithm
1. Initially, put each node in a separate cluster.
2. Traverse Task Graph
  – Merge clusters as long as Parallel Time does not increase.
• Low complexity O((n+e) log n)
• Previously used by Andersson in ObjectMath (PELAB)
Modelica Compilation
[Figure: compilation pipeline. A Modelica model (.mo) is translated using the Modelica semantics into Flat Modelica (.mof), then into an equation system (DAE), which is optimized and emitted as C code consisting of right-hand-side calculations and a numerical solver]
Structure of simulation code:
for t=0; t<stopTime; t+=stepSize {
  x_dot[t+1] = f(x_dot[t], x[t], t);
  x[t+1] = ODESolver(x_dot[t+1]);
}
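The loop structure above can be made runnable with a concrete solver. The sketch below swaps in a fixed-step explicit Euler method for the unspecified `ODESolver` and simplifies `f` to take only the state and the time; both choices, and the test system, are assumptions for illustration.

```python
def simulate(f, x0, stop_time, step_size):
    """Fixed-step explicit Euler: evaluate the right-hand side, then advance the state."""
    t, x = 0.0, list(x0)
    n_steps = round(stop_time / step_size)
    for _ in range(n_steps):
        x_dot = f(x, t)  # right-hand side: the part a parallelizer would target
        x = [xi + step_size * di for xi, di in zip(x, x_dot)]
        t += step_size
    return x


# Hypothetical test system: dx/dt = -x with x(0) = 1, so x(1) is close to exp(-1).
x_end = simulate(lambda x, t: [-x[0]], [1.0], stop_time=1.0, step_size=0.001)
```

Almost all work per step is in the call to `f`, which is why the task graph is built from the right-hand-side calculations.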
Optimizations on equations
• Simplification of equations
  – E.g. a=b, b=c => eliminate b
• BLT transformation, i.e. topological sorting into strongly connected components (BLT = Block Lower Triangular form)
• Index reduction. The index is how many times an equation needs to be differentiated in order to solve the equation system.
• Mixed Mode / Inline Integration, methods of optimizing equations by reducing the size of equation systems
[Figure: block lower triangular matrix with blocks a, b, c, d, e and zeros above the diagonal]
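The BLT transformation amounts to finding the strongly connected components of the equation dependency graph and ordering them topologically. A minimal sketch using Tarjan's algorithm follows; the dependency map is invented, and a real compiler would first match each equation to the variable it solves for.

```python
def blt_blocks(deps):
    """Tarjan's algorithm. deps maps each equation to the equations it depends on.

    Returns the strongly connected components ordered so that every block
    appears after the blocks it depends on.
    """
    index, low, on_stack = {}, {}, set()
    stack, blocks = [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            block = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                block.append(w)
                if w == v:
                    break
            blocks.append(sorted(block))

    for v in deps:
        if v not in index:
            visit(v)
    return blocks


# Hypothetical dependencies: e2 and e3 form an algebraic loop that needs e1; e4 needs the loop.
deps = {"e1": [], "e2": ["e1", "e3"], "e3": ["e2"], "e4": ["e3"]}
blocks = blt_blocks(deps)
```

Blocks with a single equation can be solved by assignment, while larger blocks such as `["e2", "e3"]` correspond to subsystems that must be solved simultaneously.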
Generated C Code Content
• Assignment statements
• Arithmetic expressions (+,-,*,/), if-expressions
• Function calls
  – Standard Math functions
    • Sin, Cos, Log
  – Modelica Functions
    • User defined, side effect free
  – External Modelica Functions
    • In External lib, written in Fortran or C
  – Call function for solving subsystems of equations
    • Linear or non-linear
• Example Application
  – Robot simulation has 27 000 lines of generated C code
Parallelization Tool Overview
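[Figure: Model.mo is translated by the Modelica Compiler into C code. The C code is either compiled by a C compiler and linked with the solver lib into a sequential executable (Seq exe), or passed through the Parallelizer, whose parallel C code is compiled by a C compiler and linked with the solver lib and the MPI lib into a parallel executable (Parallel exe)]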
Parallelization Tool Internal Structure
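[Figure: internal structure of the tool. Sequential C code is read by the Parser; the Task Graph Builder, Symbol Table, Scheduler and Code Generator produce Parallel C code, with a Debug & Statistics component alongside]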
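Task Graph building
• First graph: corresponds to individual arithmetic operations, assignments, function calls and variable definitions in the C code
• Second graph: Clusters of tasks from first task graph
[Figure: example of a first task graph on expression level (operators +, -, *, /, a call to foo, and the definitions of a, b, c, d) and the second graph with the clustered nodes "defs", "+,-,*", "+,*", "foo" and "/,-"]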
Investigated Scheduling Algorithms
• Parallelization Tool
  – TDS (Task Duplication Scheduling Algorithm)
  – Pre-Clustering Method
  – Full Task Duplication Method
• Experimental Framework (Mathematica)
  – ERT
  – DSC
  – TDS
  – Full Task Duplication Method
  – Task Merging approaches (Graph Rewrite Systems)
Method 1: Pre Clustering algorithm
– buildCluster(n:node, l:list of nodes, size:Integer)
– Adds n to a new cluster
– Repeatedly adds nodes until size(cluster)=size:
  – Children to n
  – One in-degree children to cluster
  – Siblings to n
  – Parents to n
  – Arbitrary nodes
Managing cycles
• When adding a node to a cluster the resulting graph might have cycles
• Resulting graph when clustering a and b is cyclic since you can reach {a,b} from c
• Resulting graph not a DAG
  – Cannot use standard scheduling algorithms
[Figure: task graph with nodes a, b, c, d and e in which clustering a and b creates a cycle through c]
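Whether putting two nodes into the same cluster introduces a cycle can be tested by contracting them and checking the result for acyclicity. The sketch below uses Kahn's algorithm; the edge list is a hypothetical reconstruction of the situation described (a path from a to b through c), not the graph from the slide.

```python
def merge_creates_cycle(edges, a, b):
    """Return True if contracting nodes a and b into one cluster makes the graph cyclic."""
    merged = (a, b)

    def rep(v):
        return merged if v in (a, b) else v

    # Quotient graph: edges inside the new cluster disappear.
    quotient = {(rep(i), rep(j)) for (i, j) in edges if rep(i) != rep(j)}
    nodes = {v for e in quotient for v in e}
    indeg = {v: 0 for v in nodes}
    for (_, j) in quotient:
        indeg[j] += 1

    # Kahn's algorithm: repeatedly remove nodes without incoming edges.
    queue = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for (i, j) in quotient:
            if i == v:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return removed < len(nodes)  # leftover nodes lie on a cycle


# Hypothetical edges: a feeds b directly and also through c.
edges = [("a", "b"), ("a", "c"), ("c", "b")]
cyclic_ab = merge_creates_cycle(edges, "a", "b")
cyclic_ac = merge_creates_cycle(edges, "a", "c")
```

Clustering a with b leaves the cluster both sending to and receiving from c, so `cyclic_ab` is true, while clustering a with c keeps the graph a DAG.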
Pre Clustering Results
• Did not produce Speedup
  – Introduced far too many dependencies in resulting task graph
  – Sequentialized schedule
• Conclusion:
  – For fine grained task graphs:
    • Such an algorithm needs task duplication to succeed
Method 2: Full Task Duplication
• For each node n with successor(n) = {}
  – Put all pred(n) in one cluster
    • Repeat for all nodes in cluster
  – Rationale: If the depth of the graph is limited, task duplication will be kept at a reasonable level and cluster size reasonably small.
  – Works well when communication cost >> execution cost
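The cluster-building step of full task duplication can be sketched as a transitive predecessor collection per sink node. This is an illustrative reading of the bullets above, with a made-up graph; the function name is hypothetical.

```python
def full_task_duplication(tau, edges):
    """One cluster per sink node, containing the sink and all of its ancestors."""
    preds = {v: set() for v in tau}
    has_succ = set()
    for (i, j) in edges:
        preds[j].add(i)
        has_succ.add(i)

    clusters = {}
    for sink in tau:
        if sink in has_succ:
            continue  # only nodes without successors start a cluster
        cluster, todo = set(), [sink]
        while todo:  # collect predecessors transitively
            v = todo.pop()
            if v not in cluster:
                cluster.add(v)
                todo.extend(preds[v])
        clusters[sink] = cluster
    return clusters


# Made-up graph with two sinks (4 and 5) that share the ancestors 1 and 3.
tau = {1: 2, 2: 1, 3: 2, 4: 1, 5: 1}
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]
clusters = full_task_duplication(tau, edges)
```

Tasks 1 and 3 end up in both clusters, so each cluster can run without any communication at the price of executing the shared tasks twice.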
Full Task Duplication (2)
• Merging clusters
  1. Merge clusters with load balancing strategy, without increasing maximum cluster size
  2. Merge clusters with greatest number of common nodes
• Repeat (2) until number of processors requirement is met
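Step 2 of the merging phase, repeated until the processor count is met, could look as follows. The sketch covers only the common-node criterion and leaves out the load balancing of step 1; the function name and the example clusters are assumptions.

```python
def merge_to_processors(clusters, num_procs):
    """Repeatedly merge the two clusters sharing the most nodes until num_procs remain."""
    if num_procs < 1:
        raise ValueError("need at least one processor")
    clusters = [set(cl) for cl in clusters]
    while len(clusters) > num_procs:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                common = len(clusters[i] & clusters[j])
                if best is None or common > best[0]:
                    best = (common, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]  # shared (duplicated) tasks are kept only once
        del clusters[j]
    return clusters


merged = merge_to_processors([{1, 2, 3, 4}, {1, 3, 5}, {6, 7}], num_procs=2)
```

Merging the clusters with the largest overlap removes the most duplicated work, since tasks present in both clusters only need to be executed once afterwards.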
Full Task Duplication Results
• Computed measurements
  – Execution cost of largest cluster + communication cost
• Measured speedup
  – Executed on a PC Linux cluster with SCI network interface, using SCAMPI
Robot Example Computed Speedup
• Mixed Mode / Inline Integration
[Figure: two plots, "With MM/II" and "Without MM/II", of computed speedup versus number of processors for the series c10, c100 and c1000]
Thermofluid pipe executed on PC Cluster
• Pressurewavedemo in Thermofluid package, 50 discretization points
[Figure: speedup (axis 0.25 to 2) versus number of processors (1, 2, 4, 8, 16)]
Thermofluid pipe executed on PC Cluster
• Pressurewavedemo in Thermofluid package, 100 discretization points
[Figure: speedup (axis 0.5 to 3) versus number of processors (1, 2, 4, 8, 16)]
Task Merging using GRS
• Idea: A set of simple rules to transform a task graph to increase its granularity (and decrease Parallel Time)
• Use top level (and bottom level) as metric:
• Parallel Time = max tlevel + max blevel
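$$tlevel(n) = \begin{cases} 0, & pred(n) = \emptyset \\ \max_{k \in pred(n)} \left( tlevel(k) + \tau(k) + L + \frac{c_{k,n}}{B} \right), & pred(n) \neq \emptyset \end{cases}$$
(L = latency, B = bandwidth)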
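The top level with latency L and bandwidth B, i.e. tlevel(n) = 0 for nodes without predecessors and otherwise the maximum over predecessors k of tlevel(k) + τ(k) + L + c(k,n)/B, can be computed by a memoized recursion. The function name and the example values below are made up.

```python
def tlevels(tau, c, L, B):
    """Top level of every node, with latency L and bandwidth B on each edge."""
    preds = {v: [] for v in tau}
    for (k, n) in c:
        preds[n].append(k)

    memo = {}

    def tlevel(n):
        if n not in memo:
            memo[n] = max(
                (tlevel(k) + tau[k] + L + c[(k, n)] / B for k in preds[n]),
                default=0,
            )
        return memo[n]

    return {n: tlevel(n) for n in tau}


# Chain 1 -> 2 -> 3 with made-up costs, L = 1 and B = 2.
t = tlevels({1: 2, 2: 1, 3: 1}, {(1, 2): 4, (2, 3): 8}, L=1, B=2)
```

For the chain, tlevel(2) = 0 + 2 + 1 + 4/2 = 5 and tlevel(3) = 5 + 1 + 1 + 8/2 = 11.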
Rule 1
• Merge a single child that has only one parent with that parent.
• Motivation: The merge does not decrease the amount of parallelism in the task graph, and granularity can possibly increase.
[Figure: parent p and its single child c are merged into one node p']
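Rule 1 is purely structural, so it can be applied with a simple fixed-point loop over the edges. The following sketch keeps the parent's name for the merged node p'; the function name and the example graph are hypothetical.

```python
def apply_rule1(tau, c):
    """Merge every child that is the only child of its parent and has no other parent."""
    tau, c = dict(tau), dict(c)
    changed = True
    while changed:
        changed = False
        for (p, ch) in list(c):
            only_child = sum(1 for (i, _) in c if i == p) == 1
            only_parent = sum(1 for (_, j) in c if j == ch) == 1
            if only_child and only_parent:
                tau[p] += tau.pop(ch)   # merged task p' keeps the name p
                del c[(p, ch)]          # the internal edge disappears
                for (i, j) in list(c):  # outgoing edges of the child now leave p'
                    if i == ch:
                        c[(p, j)] = c.pop((i, j))
                changed = True
                break
    return tau, c


# Chain 1 -> 2 -> 3 followed by a fork from 3 to 4 and 5 (made-up costs).
tau = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
c = {(1, 2): 10, (2, 3): 10, (3, 4): 10, (3, 5): 10}
new_tau, new_c = apply_rule1(tau, c)
```

The chain collapses into a single task of cost 3, while the fork to tasks 4 and 5 is left alone because merging there would reduce parallelism.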
Rule 2
• Merge all parents of a node together with the node itself.
• Motivation: If the top level does not increase by the merge, the resulting task will increase in size, potentially increasing granularity.
• Condition:
$$tlevel(c) \ge \max_{p_i \in pred(c)} tlevel(p_i) + \sum_{p_j \in pred(c)} \tau(p_j)$$
[Figure: parents p1, p2, ..., pn and their child c are merged into a single node c']
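Rule 3
• Duplicate parent and merge into each child node
• Motivation: As long as each child's tlevel does not increase, duplicating p into the child will reduce the number of nodes and increase granularity.
• Conditions:
$$\tau(p) \le L + \frac{c_{p,c_i}}{B}, \quad \forall i = 1..n$$
$$tlevel(c_i) \ge tlevel(p_j) + L + \frac{c_{p_j,c_i}}{B} + \tau(p), \quad \forall i = 1..n,\ \forall p_j \in pred(c_i),\ p_j \ne p$$
[Figure: parent p with children c1, c2, ..., cn; p is duplicated into each child, giving c1', c2', ..., cn']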
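Rule 4
• Merge siblings into a single node as long as a parameterized maximum execution cost is not exceeded.
• Motivation: This rule can be useful if several small predecessor nodes exist together with a larger predecessor node that prevents a complete merge. Does not guarantee a decrease of PT.
• Condition: with the siblings ordered such that
$$\tau(p_1) \le \tau(p_2) \le \ldots \le \tau(p_n)$$
merge p1, ..., pk for
$$\max k, \quad \sum_{i=1}^{k} \tau(p_i) \le maxSize, \quad 1 \le k \le n$$
[Figure: predecessors p1, p2, ..., pn of c; p1, ..., pk are merged into p', leaving p', pk+1, ..., pn as predecessors of c]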
Results – Example
• Task graph from Modelica simulation code
  – Small example from the mechanical domain.
  – About 100 nodes built on expression level, originating from 84 equations & variables
Result Task Merging example
• B=1, L=1
Result Task Merging example
– B=1, L=10
– B=1, L=100
Conclusions
• Pre Clustering approach did not work well for the fine grained task graphs produced by our parallelization tool
• FTD Method
  – Works reasonably well for some examples
• However, in general:
  – Need for better scheduling/clustering algorithms for fine grained task graphs
Conclusions (2)
• Simple delay model may not be enough
  – A more advanced model requires more complex scheduling and clustering algorithms
• Simulation code from equation based models
  – Hard to extract parallelism from
  – Need new optimization methods on DAEs or ODEs to increase parallelism
Conclusions Task Merging using GRS
• A task merging algorithm using GRS has been proposed
  – Four rules with simple patterns => fast pattern matching
• Can easily be integrated in existing scheduling tools.
• Successfully merges tasks considering
  – Bandwidth & Latency
  – Task duplication
  – Merging criterion: decrease Parallel Time (PT) by decreasing tlevel
• Tested on examples from simulation code
Future Work
• Designing and Implementing Better Scheduling and Clustering Algorithms
  – Support for more advanced task graph models
  – Work better for high granularity values
• Try larger examples
• Test on different architectures
  – Shared Memory machines
  – Dual processor machines
Future Work (2)
• Heterogeneous multiprocessor systems
  – Mixed DSP processors, RISC, CISC, etc.
• Enhancing Modelica language with data parallelism
  – e.g. parallel loops, vector operations
• Parallelize e.g. combined PDE and ODE problems in Modelica.
• Using e.g. SCALAPACK for solving subsystems of linear equations. How to integrate into scheduling algorithms?