component frameworks:
DESCRIPTION
Component Frameworks:. Laxmikant (Sanjay) Kale Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign http://charm.cs.uiuc.edu. Motivation. Parallel Computing in Science and Engineering Competitive advantage Pain in the neck - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/1.jpg)
PPL-Dept of Computer Science, UIUC
Component Frameworks:
Laxmikant (Sanjay) KaleParallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu
![Page 2: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/2.jpg)
PPL-Dept of Computer Science, UIUC
Motivation
• Parallel Computing in Science and Engineering– Competitive advantage
– Pain in the neck
– Necessary evil
• It is not so difficult– But tedious, and error-prone
– New issues: race conditions, load imbalances, modularity in presence of concurrency,..
– Just have to bite the bullet, right?
![Page 3: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/3.jpg)
PPL-Dept of Computer Science, UIUC
But wait…
• Parallel computation structures– The set of the parallel applications is diverse and
complex
– Yet, the underlying parallel data structures and communication structures are small in number
• Structured and unstructured grids, trees (AMR,..), particles, interactions between these, space-time
• One should be able to reuse those– Avoid doing the same parallel programming again and
again
– Domain specific frameworks
![Page 4: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/4.jpg)
PPL-Dept of Computer Science, UIUC
A Unique Twist
• Many apps require dynamic load balancing– Reuse load re-balancing strategies
• It should be possible to separate load balancing code from application code
• This strategy is embodied in Charm++– Express the program as a collection of interacting
entities (objects).
– Let the system control mapping to processors
![Page 5: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/5.jpg)
PPL-Dept of Computer Science, UIUC
Multi-partition decomposition
• Idea: divide the computation into a large number of pieces– Independent of number of processors
– typically larger than number of processors
– Let the system map entities to processors
![Page 6: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/6.jpg)
PPL-Dept of Computer Science, UIUC
Object-based Parallelization
User View
System implementation
User is only concerned with interaction between objects
![Page 7: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/7.jpg)
PPL-Dept of Computer Science, UIUC
Charm Component Frameworks
Object based decomposition
Reuse of Specialized
Parallel Strucutres
Component Frameworks
Load balancing
Auto. Checkpointing
Flexible use of clusters
Out-of-core execn.
![Page 8: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/8.jpg)
PPL-Dept of Computer Science, UIUC
Goals for Our Frameworks
• Ease of use:– C++ and Fortran versions
– Retain “look-and-feel” of sequential programs
– Provide commonly needed features
– Application-driven development
– Portability
• Performance:– Low overhead
– Dynamic load balancing via Charm++
– Cache performance
![Page 9: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/9.jpg)
PPL-Dept of Computer Science, UIUC
Current Set of Component Frameworks
• FEM / unstructured meshes:– “Mature”, with several applications already
• Multiblock: multiple structured grids– New, but very promising
• AMR:– Oct and Quad-trees
![Page 10: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/10.jpg)
PPL-Dept of Computer Science, UIUC
Charm++
Converse
Load database + balancer
MPI-on-Charm Irecv+
AutomaticConversion from
MPI
FEM Structured
Cross module interpolation
Migration path
Frameworkpath
Using the Load Balancing Framework
![Page 11: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/11.jpg)
PPL-Dept of Computer Science, UIUC
begin time loop
compute forces
update node positions
end time loop
begin time loop
compute forces
communicate shared nodes
update node positions
end time loopSerial Code
for entire meshFramework Code for mesh partition
Finite Element Framework Goals
• Hide parallel implementation in the runtime system
• Allow adaptive parallel computation and dynamic automatic load balancing
• Leave physics and numerics to user
• Present clean, “almost serial” interface:
![Page 12: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/12.jpg)
PPL-Dept of Computer Science, UIUC
FEM Framework: Responsibilities
Charm++(Dynamic Load Balancing, Communication)
FEM Framework(Update of Nodal properties, Reductions over nodes or partitions)
FEM Application(Initialize, Registration of Nodal Attributes, Loops Over Elements, Finalize)
METIS I/O
Partitioner Combiner
![Page 13: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/13.jpg)
PPL-Dept of Computer Science, UIUC
Structure of an FEM Program
• Serial init() and finalize() subroutines– Do serial I/O, read serial mesh and call FEM_Set_Mesh
• Parallel driver() main routine:– One driver per partitioned mesh chunk
– Runs in a thread: time-loop looks like serial version
– Does computation and call FEM_Update_Field
• Framework handles partitioning, parallelization, and communication
![Page 14: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/14.jpg)
PPL-Dept of Computer Science, UIUC
Structure of an FEM Application
init()
Update Update Update
finalize()
driver driver driver
Shared Nodes Shared Nodes
![Page 15: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/15.jpg)
PPL-Dept of Computer Science, UIUC
Framework Calls
• FEM_Set_Mesh – Called from initialization to set the serial mesh
– Framework partitions mesh into chunks
• FEM_Create_Field– Registers a node data field with the framework, supports user data types
• FEM_Update_Field– Updates node data field across all processors
– Handles all parallel communication
• Other parallel calls (Reductions, etc.)
![Page 16: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/16.jpg)
PPL-Dept of Computer Science, UIUC
Dendritic Growth
• Studies evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid
• Adaptive refinement and coarsening of grid involves re-partitioning
![Page 17: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/17.jpg)
PPL-Dept of Computer Science, UIUC
Crack Propagation
• Explicit FEM code
• Zero-volume Cohesive Elements inserted near the crack
• As the crack propagates, more cohesive elements added near the crack, which leads to severe load imbalance
Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Pictures: S. Breitenfeld, and P. Geubelle
0
5
10
15
20
25
30
35
40
45
50
1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91
Iteration Number
Num
ber
of It
erat
ions
Per
sec
ond
![Page 18: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/18.jpg)
PPL-Dept of Computer Science, UIUC
Crack Propagation
Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld, and P. Geubelle
![Page 19: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/19.jpg)
PPL-Dept of Computer Science, UIUC
“Overhead” of Multipartitioning
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
1 2 4 8 16 32 64 128 256 512 1024 2048
Number of Chunks Per Processor
Tim
e (
Se
co
nd
s) p
er
Ite
rati
on
![Page 20: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/20.jpg)
PPL-Dept of Computer Science, UIUC
Load balancer in action
0
5
10
15
20
25
30
35
40
45
501 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91
Iteration Number
Nu
mb
er
of
Ite
rati
on
s P
er
se
con
dAutomatic Load Balancing in Crack Propagation
1. ElementsAdded 3. Chunks
Migrated
2. Load Balancer Invoked
![Page 21: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/21.jpg)
PPL-Dept of Computer Science, UIUC
Scalability of FEM Framework
Speedup of Crack Propagation
0
5
10
15
20
25
30
35
40
1 2 4 8 16 32
Number of Processors
Spee
dup
Actual Speedup
Ideal Speedup
![Page 22: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/22.jpg)
PPL-Dept of Computer Science, UIUC
Scalability of FEM Framework
1.E-3
1.E-2
1.E-1
1.E+0
1.E+1
1 10 100 1000
Processors
Tim
e/S
tep
(s)
3.1M elements
1.5M Nodes
ASCI Red
1 processor time : 8.24 secs
1024 processors time:7.13 msecs
![Page 23: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/23.jpg)
PPL-Dept of Computer Science, UIUC
Parallel Collision Detection
• Detect collisions (intersections) between objects scattered across processors
Approach, based on Charm++ ArraysOverlay regular, sparse 3D grid of voxels (boxes)
Send objects to all voxels they touch
Collide voxels independently and collect results
Leave collision response to user code
![Page 24: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/24.jpg)
PPL-Dept of Computer Science, UIUC
Collision Detection Speed• O(n) serial performance
Good speedups to 1000s of processors
ASCI Red, 65,000 polygons per processor scaling problem
(to 100 million polygons)
Single Linux PC
2us per polygon serial performance
![Page 25: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/25.jpg)
PPL-Dept of Computer Science, UIUC
FEM: Future Plans
• Better support for implicit computations– Interface to Solvers: e.g. ESI (PETSC), ScaLAPACK
or POOMA’s Linear Solvers
• Better discontinuous Galerkin method support• Fully distributed startup• Fully distributed insertion
– Eliminate serial bottleneck in insertion phase• Abstraction to allow multiple active meshes
– Needed for multigrid methods
![Page 26: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/26.jpg)
PPL-Dept of Computer Science, UIUC
Multiblock framework
• For collection of structured grids– Older versions:
• (Gengbin Zheng, 1999-2000)
– Recent completely new version:
• Motivated by ROCFLO
– Like FEM:
• User writes driver subroutines that deal with the life-cycle of a single chunk of the grid
• Ghost arrays managed by the framework– Based on registration of data by the user program
• Support for “connecting up” multiple blocks– makemblock processes geometry info
![Page 27: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/27.jpg)
PPL-Dept of Computer Science, UIUC
Multiblock Constituents
![Page 28: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/28.jpg)
PPL-Dept of Computer Science, UIUC
Terminology
![Page 29: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/29.jpg)
PPL-Dept of Computer Science, UIUC
Mutiblock structure
• Steps:– Feed geometry information to makemblock
• Input: top level blocks, number of partitions desired
• Output: block file containing list of partitions, and communication structure
– Run parallel application
• Reads the block file
• Initialization of data
• Manual and info:• http://charm.cs.uiuc.edu/ppl_research/mblock/
![Page 30: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/30.jpg)
PPL-Dept of Computer Science, UIUC
Multiblock code example: main loop
do tStep=1,nSteps
call MBLK_Apply_bc_All(grid, size, err) call MBLK_Update_field(fid,ghostWidth,grid,err)
do k=sk,ek do j=sj,ej do i=si,ei ! Only relax along I and J directions-- not K newGrid(i,j,k)=cenWeight*grid(i,j,k) & &+neighWeight*(grid(i+1,j,k)+grid(i,j+1,k) &
&+grid(i-1,j,k)+grid(i,j-1,k)) end do end do end do
![Page 31: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/31.jpg)
PPL-Dept of Computer Science, UIUC
Multiblock Driver
subroutine driver()implicit noneinclude 'mblockf.h’… call MBLK_Get_myblock(blockNo,err) call MBLK_Get_blocksize(size,err)...call MBLK_Create_field(& &size,1, MBLK_DOUBLE,1,& &offsetof(grid(1,1,1),grid(si,sj,sk)),& &offsetof(grid(1,1,1),grid(2,1,1)),fid,err)
! Register boundary condition functions call MBLK_Register_bc(0,ghostWidth, BC_imposed, err) … Time Loopend
![Page 32: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/32.jpg)
PPL-Dept of Computer Science, UIUC
Multiblock: Future work
• Support other stencils– Currently diagonal elements are not used
• Applications– We need volunteers!
– We will write demo apps ourselves
![Page 33: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/33.jpg)
PPL-Dept of Computer Science, UIUC
Adaptive Mesh Refinement
• Used in various engineering applications where there are regions of greater interest– e.g.http://www.damtp.cam.ac.uk/user/sdh20/amr/amr.html
– Global Atmospheric modeling
– Numerical Cosmology
– Hyperbolic partial differential equations (M.J. Berger and J. Oliger)
• Problems with uniformly refined meshes for above– Grid is too fine grained thus wasting resources
– Grid is too coarse thus the results are not accurate
![Page 34: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/34.jpg)
PPL-Dept of Computer Science, UIUC
AMR Library
• Implements the distributed grid which can be dynamically adapted at runtime
• Uses the arbitrary bit indexing of arrays• Requires synchronization only before refinement
or coarsening• Interoperability because of Charm++• Uses the dynamic load balancing capability of the
chare arrays
![Page 35: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/35.jpg)
PPL-Dept of Computer Science, UIUC
Indexing of array elements
• Case of 2D mesh (4x4)
0,0,0
0,0,2 0,1,2 1,0,2 1,1,2
0,0,4 0,1,4 1,0,4 1,1,4 0,2,4 0,3,4 1,2,4 1,3,4
Node or root
Leaf
Virtual Leaf
Question: Who are my neighbors
![Page 36: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/36.jpg)
PPL-Dept of Computer Science, UIUC
Indexing of array elements (contd.)
• Mathematicaly: (for 2D)
if parent is x,y using n bits then,
child1 – 2x , 2y using n+2 bits
child2 – 2x ,2y+1 using n+2 bits
child3 – 2x+1, 2y using n+2 bits
child4 – 2x+1,2y+1 using n+2 bits
![Page 37: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/37.jpg)
PPL-Dept of Computer Science, UIUC
Pictorially
0,0,4
![Page 38: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/38.jpg)
PPL-Dept of Computer Science, UIUC
Communication with Nbors
• In dimension x the two nbors can be obtained by
- nbor --- x-1 where x is not equal to 0
+ nbor --- x+1 where x is not equal to 2n
• In dimension y the two nbors can be obtained by
- nbor --- y-1 where y is not equal to 0
+ nbor--- y+1 where y is not equal to 2n
![Page 39: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/39.jpg)
PPL-Dept of Computer Science, UIUC
Case 1
Nbors of 1,1,2
Y dimension : -nbor 1,0,2
0,0,4
0,0,0
0,0,2 0,1,2 1,0,2 1,1,2
0,1,4 1,0,4 1,1,4 0,2,4 0,3,4 1,2,4 1,3,4
Case 2Nbors of 1,1,2X dimension : -nbor 0,1,2
Case 3
Nbors of 1,3,4
X dimension : +nbor 2,3,4
Nbor of 1,2,4
X Dimension : +nbor 2,2,4
![Page 40: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/40.jpg)
PPL-Dept of Computer Science, UIUC
Communication (contd.)
• Assumption : The level of refinement of adjacent cells differs at maximum by one (requirement of the indexing scheme used)
• Indexing scheme is similar for 1D and 3D cases
![Page 41: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/41.jpg)
PPL-Dept of Computer Science, UIUC
AMR Interface
• Library Tasks- Creation of Tree
- Creation of Data at cells- Communication between cells- Calling the appropriate user routines in each iteration- Refining – Refine on Criteria (specified by user)
• User Tasks- Writing the user data structure to be kept by each cell- Fragmenting + Combining of data for the Neighbors- Fragmenting of the data of the cell for refine- Writing the sequential computation code at each cell
![Page 42: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/42.jpg)
PPL-Dept of Computer Science, UIUC
Some Related Work
• PARAMESH Peter MacNeice et al. http://sdcd.gsfc.nasa.gov/RIB/repositories/inhouse_gsfc/Users_manual/amr.htm
- This library is implemented in Fortran 90
- Supported on CrayT3E and SGI’s
• Parallel Algorithms for Adaptive Mesh Refinement, Mark T. Jones and Paul E. Plassmann, SIAM J. on Scientific Computing, 18,(1997) pp. 686-708. (Also MSC Preprint p 421-0394. )
http://www-unix.mcs.anl.gov/sumaa3d/Papers/papers.html
• DAGH-Dynamic Adaptive Grid Hierarchies– By Manish Parashar & James C. Browne– In C++ using MPI
![Page 43: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/43.jpg)
PPL-Dept of Computer Science, UIUC
Future work
• Specialized version for structured grids– Integration with multiblock
• Fortran interface– Current version is C++ only
• unlike FEM and Multiblock frameworks, which support Fortran 90
– Relatively easy to do
![Page 44: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/44.jpg)
PPL-Dept of Computer Science, UIUC
Summary
• Frameworks are ripe for use– Well tested in some cases
• Questions and answers:– MPI libraries?
– Performance issues?
• Future plans:– Provide all features of Charm++
![Page 45: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/45.jpg)
PPL-Dept of Computer Science, UIUC
AMPI: Goals
• Runtime adaptivity for MPI programs– Based on multi-domain decomposition and dynamic
load balancing features of Charm++
– Minimal changes to the original MPI code
– Full MPI 1.1 standard compliance
– Additional support for coupled codes
– Automatic conversion of existing MPI programs
Original MPI Code AMPI Code
AMPI Runtime
AMPIzer
![Page 46: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/46.jpg)
PPL-Dept of Computer Science, UIUC
Charm++
• Parallel C++ with Data Driven Objects• Object Arrays/ Object Collections• Object Groups:
– Global object with a “representative” on each PE
• Asynchronous method invocation• Prioritized scheduling• Mature, robust, portable• http://charm.cs.uiuc.edu
![Page 47: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/47.jpg)
PPL-Dept of Computer Science, UIUC
Data driven execution
Scheduler Scheduler
Message Q Message Q
![Page 48: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/48.jpg)
PPL-Dept of Computer Science, UIUC
Load Balancing Framework
• Based on object migration and measurement of load information
• Partition problem more finely than the number of available processors
• Partitions implemented as objects (or threads) and mapped to available processors by LB framework
• Runtime system measures actual computation times of every partition, as well as communication patterns
• Variety of “plug-in” LB strategies available
![Page 49: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/49.jpg)
PPL-Dept of Computer Science, UIUC
Load Balancing Framework
![Page 50: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/50.jpg)
PPL-Dept of Computer Science, UIUC
Building on Object-based Parallelism
• Application induced load imbalances• Environment induced performance issues:
– Dealing with extraneous loads on shared m/cs
– Vacating workstations
– Automatic checkpointing
– Automatic prefetching for out-of-core execution
– Heterogeneous clusters
• Reuse: object based components• But: Must use Charm++!
![Page 51: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/51.jpg)
PPL-Dept of Computer Science, UIUC
AMPI: Goals• Runtime adaptivity for MPI programs
– Based on multi-domain decomposition and dynamic load balancing features of Charm++
– Minimal changes to the original MPI code
– Full MPI 1.1 standard compliance
– Additional support for coupled codes
– Automatic conversion of existing MPI programs
Original MPI Code AMPI Code
AMPI Runtime
AMPIzer
![Page 52: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/52.jpg)
PPL-Dept of Computer Science, UIUC
Adaptive MPI
• A bridge between legacy MPI codes and dynamic load balancing capabilities of Charm++
• AMPI = MPI + dynamic load balancing
• Based on Charm++ object arrays and Converse’s migratable threads
• Minimal modification needed to convert existing MPI programs (to be automated in future)
• Bindings for C, C++, and Fortran90
• Currently supports most of the MPI 1.1 standard
![Page 53: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/53.jpg)
PPL-Dept of Computer Science, UIUC
AMPI Features
• Over 70+ common MPI routines– C, C++, and Fortran 90 bindings
– Tested on IBM SP, SGI Origin 2000, Linux clusters
• Automatic conversion: AMPIzer– Based on Polaris front-end
– Source-to-source translator for converting MPI programs to AMPI
– Generates supporting code for migration
Very low “overhead” compared with native MPI
48
50
52
54
56
58
60
62
64
1 8 16 32 64 128
Number of Processors
Tim
e (s
eco
nd
s)
AMPI
MPI
-3
-2
-1
0
1
2
3
4
5
1 8 16 32 64 128
Number of Processors
Pe
rce
nt
Ov
erh
ea
dOverhead
![Page 54: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/54.jpg)
PPL-Dept of Computer Science, UIUC
AMPI Extensions
• Integration of multiple MPI-based modules– Example: integrated rocket simulation
• ROCFLO, ROCSOLID, ROCBURN, ROCFACE• Each module gets its own MPI_COMM_WORLD
– All COMM_WORLDs form MPI_COMM_UNIVERSE• Point-to-point communication among different
MPI_COMM_WORLDs using the same AMPI functions• Communication across modules also considered for balancing load
• Automatic checkpoint-and-restart– On different number of processors
– Number of virtual processors remain the same, but can be mapped to different number of physical processors
![Page 55: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/55.jpg)
PPL-Dept of Computer Science, UIUC
Charm++
Converse
![Page 56: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/56.jpg)
PPL-Dept of Computer Science, UIUC
Application Areas and Collaborations
• Molecular Dynamics: – Simulation of biomolecules
– Material properties and electronic structures
• CSE applications: – Rocket Simulation
– Industrial process simulation
– Cosmology visualizer
• Combinatorial Search:– State space search, game tree search, optimization
![Page 57: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/57.jpg)
PPL-Dept of Computer Science, UIUC
Molecular Dynamics
• Collection of [charged] atoms, with bonds• Newtonian mechanics• At each time-step
– Calculate forces on each atom
• Bonds:
• Non-bonded: electrostatic and van der Waal’s
– Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!• Thousands of atoms (1,000 - 100,000)
![Page 58: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/58.jpg)
PPL-Dept of Computer Science, UIUC
![Page 59: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/59.jpg)
PPL-Dept of Computer Science, UIUC
![Page 60: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/60.jpg)
PPL-Dept of Computer Science, UIUC
BC1 complex: 200k atoms
![Page 61: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/61.jpg)
PPL-Dept of Computer Science, UIUC
Performance Data: SC2000
Speedup on ASCI Red: BC1 (200k atoms)
0
200
400
600
800
1000
1200
1400
0 500 1000 1500 2000 2500
Processors
Spe
edup
![Page 62: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/62.jpg)
PPL-Dept of Computer Science, UIUC
Charm++
Converse
Load database + balancer
MPI-on-Charm Irecv+
AutomaticConversion from
MPI
FEM Structured
Cross module interpolation
Migration path
Frameworkpath
Component Frameworks:Using the Load Balancing Framework
![Page 63: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/63.jpg)
PPL-Dept of Computer Science, UIUC
Finite Element Framework Goals
• Hide parallel implementation in the runtime system• Allow adaptive parallel computation and dynamic
automatic load balancing• Leave physics and numerics to user• Present clean, “almost serial” interface:
begin time loop
compute forces
update node positions
end time loop
begin time loop
compute forces
communicate shared nodes
update node positions
end time loop
Serial Codefor entire mesh
Framework Code for mesh partition
![Page 64: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/64.jpg)
PPL-Dept of Computer Science, UIUC
FEM Framework: Responsibilities
Charm++(Dynamic Load Balancing, Communication)
FEM Framework(Update of Nodal properties, Reductions over nodes or partitions)
FEM Application(Initialize, Registration of Nodal Attributes, Loops Over Elements, Finalize)
METIS I/O
Partitioner Combiner
![Page 65: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/65.jpg)
PPL-Dept of Computer Science, UIUC
Structure of an FEM Application
init()
Update Update Update
finalize()
driver driver driver
Shared Nodes Shared Nodes
![Page 66: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/66.jpg)
PPL-Dept of Computer Science, UIUC
Dendritic Growth
• Studies evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid
• Adaptive refinement and coarsening of grid involves re-partitioning
![Page 67: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/67.jpg)
PPL-Dept of Computer Science, UIUC
Crack Propagation
Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld, and P. Geubelle
![Page 68: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/68.jpg)
PPL-Dept of Computer Science, UIUC
“Overhead” of Multipartitioning
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
1 2 4 8 16 32 64 128 256 512 1024 2048
Number of Chunks Per Processor
Tim
e (
Se
co
nd
s) p
er
Ite
rati
on
![Page 69: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/69.jpg)
PPL-Dept of Computer Science, UIUC
Load balancer in action
0
5
10
15
20
25
30
35
40
45
501 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91
Iteration Number
Nu
mb
er
of
Ite
rati
on
s P
er
se
con
dAutomatic Load Balancing in Crack Propagation
1. ElementsAdded 3. Chunks
Migrated
2. Load Balancer Invoked
![Page 70: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/70.jpg)
PPL-Dept of Computer Science, UIUC
Parallel Collision Detection
• Detect collisions (intersections) between objects scattered across processors
Approach, based on Charm++ ArraysOverlay regular, sparse 3D grid of voxels (boxes)
Send objects to all voxels they touch
Collide voxels independently and collect results
Leave collision response to user code
![Page 71: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/71.jpg)
PPL-Dept of Computer Science, UIUC
Collision Detection Speed
• O(n) serial performance
Good speedups to 1000s of processors
ASCI Red, 65,000 polygons per processor scaling problem
(to 100 million polygons)
Single Linux PC
2us per polygon serial performance
![Page 72: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/72.jpg)
PPL-Dept of Computer Science, UIUC
Rocket Simulation
• Our Approach:– Multi-partition decomposition
– Data-driven objects (Charm++)
– Automatic load balancing framework
• AMPI: Migration path for existing MPI+Fortran90 codes– ROCFLO, ROCSOLID, and
ROCFACE
"Overhead" of multipartition decomposition
0
5
10
15
20
25
30
35
40
1 10 100 1000
Number of partitions
![Page 73: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/73.jpg)
PPL-Dept of Computer Science, UIUC
Timeshared parallel machines
• How to use parallel machines effectively?• Need resource management
– Shrink and expand individual jobs to available sets of processors
– Example: Machine with 100 processors
• Job1 arrives, can use 20-150 processors
• Assign 100 processors to it
• Job2 arrives, can use 30-70 processors, – and will pay more if we meet its deadline
• We can do this with migratable objects!
![Page 74: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/74.jpg)
PPL-Dept of Computer Science, UIUC
Faucets: Multiple Parallel Machines
• Faucet submits a request, with a QoS contract:– CPU seconds, min-max cpus, deadline, interacive?
• Parallel machines submit bids:– A job for 100 cpu hours may get a lower price bid if:
• It has less tight deadline,
• more flexible PE range
– A job that requires 15 cpu minutes and a deadline of 1 minute
• Will generate a variety of bids
• A machine with idle time on its hand: low bid
![Page 75: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/75.jpg)
PPL-Dept of Computer Science, UIUC
Faucets QoS and Architecture
•User specifies desired job parameters such as:
•min PE, max PE, estimated CPU-seconds, priority, etc.
•User does not specify machine. .
•Planned: Integration with Globus
Central Server
Faucet Client
Web Browser
Workstation Cluster
Workstation Cluster
Workstation Cluster
![Page 76: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/76.jpg)
PPL-Dept of Computer Science, UIUC
How to make all of this work?
• The key: fine-grained resource management model– Work units are objects and threads
• rather than processes
– Data units are object data, thread stacks, ..
• Rather than pages
– Work/Data units can be migrated automatically
• during a run
![Page 77: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/77.jpg)
PPL-Dept of Computer Science, UIUC
Time-Shared Parallel Machines
![Page 78: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/78.jpg)
PPL-Dept of Computer Science, UIUC
Appspector: Web-based Monitoring and Steering of Parallel Programs
• Parallel Jobs submitted via a server– Server maintains database of running programs
– Charm++ client-server interface
• Allows one to inject messages into a running application
• From any web browser:– You can attach to a job (if authenticated)
– Monitor performance
– Monitor behavior
– Interact and steer job (send commands)
![Page 79: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/79.jpg)
PPL-Dept of Computer Science, UIUC
BioCoRE
•Project Based•Workbench for Modeling•Conferences/Chat Rooms•Lab Notebook•Joint Document Preparation
Goal: Provide a web-based way to virtually bring scientists together.
http://www.ks.uiuc.edu/Research/biocore/
![Page 80: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/80.jpg)
PPL-Dept of Computer Science, UIUC
Some New Projects
• Load Balancing for really large machines:– 30k-128k processors
• Million-processor Petaflops class machines– Emulation for software development
– Simulation for Performance Prediction
• Operations Research– Combinatorial optiization
• Parallel Discrete Event Simulation
![Page 81: Component Frameworks:](https://reader035.vdocuments.site/reader035/viewer/2022062502/568145ea550346895db2ed17/html5/thumbnails/81.jpg)
PPL-Dept of Computer Science, UIUC
Summary
• Exciting times for parallel computing ahead• We are preparing an object based infrastructure
– To exploit future apps on future machines
• Charm++, AMPI, automatic load balancing• Application-oriented research that produces enabling
CS technology• Rich set of collaborations• More information: http://charm.cs.uiuc.edu