
Multiresolution Analysis, Computational Chemistry, and Implications for High Productivity Parallel Programming

Aniruddha G. Shet, James Dinan, Robert J. Harrison, and P. Sadayappan

Background

• Multiresolution Analysis (MRA)

• Mathematical technique of function approximation

• Representation is a hierarchy of coefficients

• Dynamically adapts to guarantee the accuracy of the approximation

• Varying degree of information granularity in the hierarchy

• Trade numerical accuracy for computation time (see the sketch below)

MRA can represent different areas in an object at different levels of detail.
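As a hedged sketch of the underlying mathematics (standard multiresolution notation, not taken verbatim from the poster): the approximation lives in a hierarchy of nested function spaces, each level adding detail,

\[ V_0 \subset V_1 \subset \cdots, \qquad V_{n+1} = V_n \oplus W_n \]

\[ f(x) \approx \sum_{l} s^{n}_{l}\, \phi_{nl}(x) = \sum_{l} s^{0}_{l}\, \phi_{0l}(x) + \sum_{m=0}^{n-1} \sum_{l} d^{m}_{l}\, \psi_{ml}(x) \]

where the scaling ("sum") coefficients s carry the smoothed function at a level and the wavelet ("difference") coefficients d carry the detail between levels. Refinement stops locally once the d coefficients fall below the accuracy threshold, which is exactly the accuracy-for-time trade.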

What is (the) MADNESS (about)?

• Multiresolution Adaptive Numerical Environment for Scientific Simulation

• Programming environment for the solution of integral and differential equations

• Built on adaptive multiresolution analysis in multiwavelet basis and low separation rank methods for scaling to higher dimensions

• Fast algorithms with guaranteed precision

• Trade precision for speed

• High-level composition of numerical codes

• Work with functions and operators

• Target applications

• Quantum chemistry, atomic and molecular physics, material science, nuclear structure

A molecular orbital of the benzene molecule with the adaptive mesh also displayed.

Implementation Issues with MADNESS

• Multi-dimensional tree distribution

• Multiresolution adaptive properties produce unbalanced coefficient trees (binary tree in 1-d, quadtree in 2-d, octree in 3-d, etc.); node indexing for such trees is sketched below

• Tree structure evolves in unscheduled ways due to very flexible adaptive refinement

• Need a scheme to partition the complete tree, since some algorithms use the entire tree and not just the leaf nodes

• Two main types of applications

• Few large trees (billions of nodes) - time evolution of wavepackets in molecular physics

• Many (thousands) smaller trees (millions of nodes) - materials and electronic structure

• Nodes range from 1 KB to 1 MB in size

• Algorithmic characteristics

• Some are recursive, tree-walking algorithms that move data up/down the tree structure

• Number of levels navigated varies dynamically, and may constitute a data dependence chain

• Some move data laterally within the same level of the tree, i.e. between neighboring nodes

• Some algorithms involve applying mathematical functions to the collection of coefficient tensors, and possibly combining individual results

• Certain algorithms operate on multiple trees having different refinement characteristics and produce a new tree

Need to express and manage huge amounts of nested hierarchical concurrency on distributed many-core petascale machines
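As a minimal sketch of the node indexing such trees typically use (illustrative names in the poster's Chapel dialect, not code from the poster), a node can be keyed by its level and translation so that parents and children are computable arithmetically:

class Node {
  var n: int;  // refinement level (depth in the tree)
  var l: int;  // translation index within the level; 0 <= l < 2**n in 1-d
}

def Node.getChild() {
  // in a binary tree, (n, l) has children (n+1, 2l) and (n+1, 2l+1)
  return (new Node(n+1, 2*l), new Node(n+1, 2*l+1));
}

def Node.getParent() {
  // integer division recovers the parent translation
  return new Node(n-1, l/2);
}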

Binary tree numerical form of a 1-d analytical function. Note that some intervals are not sub-divided due to the adaptive nature of refinement.

Sub-trees from decomposition of function space

Function interval

Coefficient tensor

Tree node

Panels: adaptively refined tree; compress tree algorithm; reconstruct tree algorithm.

Multiplication of differently refined trees

Dashed arrows depict the flow of data between operations on tree nodes. The input data to a node operation is indicated by an arrow pointing into the node, and the output from the operation is shown by an outgoing arrow.

A coefficient tensor that is added to the tree during a tree algorithm.

A coefficient tensor that is removed from the tree after it has been operated upon.

Parallel Programming Challenges

• Shared Memory Model

• Cilk-style fork-join task parallelism with a work-stealing runtime doing dynamic load balancing is a plausible solution, but…

• it is not targeted at distributed data

• the model mandates that a parent task await the completion of its child tasks, which constrains the full expression of available parallelism

• Message Passing Model

• Hard to express dynamic, irregular computations

• Two-sided communication model introduces unnecessary overhead in reading and writing distributed tree data

• Does not address the need for dynamic load balancing

• Partitioned Global Address Space (PGAS) Models

• e.g. Co-Array Fortran, UPC, Titanium

• Static SPMD model of parallelism, lacking flexible threading capabilities

• Maintenance of distributed data structures without remote operations requires complex low-level remote memory references

• Does not address the need for dynamic load balancing

Chapel Programming Model

• Multithreaded parallel programming

• Global view of computation, data structures

• Abstractions for data and task parallelism (see the sketch after this list)

• data: domain, forall, iterators

• task: begin, cobegin, coforall, sync variables, atomic

• Composition of parallelism

• Virtualization of threads

• Locality-aware programming

• locale: machine unit of storage and processing

• domains may be distributed across locales

• on keyword binds computation to locale(s)

• Object-oriented programming

• OOP can help manage program complexity

• Classes and objects are provided in Chapel, but their use is typically not required

• Advanced language features (e.g. distributions) expressed using classes

• Generic programming and type inference

• Type parameters

• Latent types

• Variables are statically-typed
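A minimal sketch (illustrative only, in the poster's Chapel dialect, not code from the poster) tying several of these constructs together:

const D: domain(1) = [1..100];     // domain: a first-class index set
var A: [D] real;

forall i in D do                   // data parallelism across the domain
  A(i) = 1.0 / i;

var left, right: sync real;        // sync variables: reads block until written
begin left = + reduce A(1..50);    // begin: spawn an asynchronous task
begin on Locales(0) do             // on: bind the task to a locale
  right = + reduce A(51..100);

writeln("total = ", left + right); // the sync reads wait for both tasks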

Solution Building Blocks

class FTree {
  const tree: [LocaleSpace] SubTree;

  def FTree(order: int) {
    coforall loc in Locales do
      on loc do tree[loc.id] = new SubTree(order);
  }

  def this(node: Node) {
    const t = tree[node2loc(node).id];
    return t[node];
  }

  /* Global tree access methods */
}

class SubTree {
  const coeffDom: domain(1);
  var nodes: domain(Node);
  var coeff: [nodes] [coeffDom] real;

  /* Local tree access methods */
}

• Global-view container

• Container that stores the tree and presents a global view and one-sided access to tree algorithms

• Internally, maintains a directory of the distributed collection of sub-trees and transparently maps an indexed node to the host locale

• Sub-trees are structured as associative arrays of node-coefficient key-value pairs

• The mapping scheme could be a simple hash function (sketched below) or driven by a specialized partitioning strategy with better locality properties

• Concurrency

• Operations on tree nodes are created as asynchronous tasks

• Tasks are chained together in a hierarchical nested manner to express recursive parallelism in tree algorithms

• Dependencies are handled by letting tasks synchronize on the completion of spawned tasks

• Locality control permits running tasks under an owner-computes policy, launching each task where its target data resides

• Plan to support a work-stealing technique for dynamic load balancing of a distributed set of tasks in the language runtime
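The tree-walk code below relies on a node2loc owner lookup; a minimal sketch of the simple hash-based mapping mentioned above (node.hash() is a hypothetical helper, not on the poster):

def FTree.node2loc(node: Node) {
  // hash the node key onto the locale set; a specialized partitioning
  // strategy could substitute a locality-aware mapping here
  return Locales(node.hash() % numLocales);
}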

const myFTree = new FTree(order=5);

def walkDownOp(node: Node) {
  /* Perform the operation on node */
  for child in node.getChild() do
    on myFTree.node2loc(child) do
      begin walkDownOp(child);
}

sync on myFTree.node2loc(root) do begin walkDownOp(root);

def walkUpOp(node: Node) {
  coforall child in node.getChild() do
    on myFTree.node2loc(child) do
      walkUpOp(child);

  /* Perform the operation on node */
}

sync on myFTree.node2loc(root) do begin walkUpOp(root);

• Coefficient tensor arithmetic

• Chapel provides ZPL-style “array language” to simplify working with multi-dimensional arrays

• Domains, a first-class language concept denoting an index set, define the size and shape of arrays, and support data parallel iteration in creating and slicing arrays

• A range of array operators facilitates parallel operations on whole arrays or array slices, eliminating the need for tedious array indexing

• Mathematical functions involving coefficient tensors are easily expressed in the array language

/* Overloaded multiplication operator to perform vector-matrix multiplication */
def *(V: [] real, M: [] real) where V.rank == 1 && M.rank == 2 {
  const R: [i in M.domain.dim(2)] real = + reduce (V * M(..,i));
  return R;
}

/* Overloaded multiplication operator to perform matrix-vector multiplication */
def *(M: [] real, V: [] real) where M.rank == 2 && V.rank == 1 {
  const R: [i in M.domain.dim(1)] real = + reduce (M(i,..) * V);
  return R;
}

/* Norm of a vector */
def normf(V) where V.rank == 1 {
  return sqrt(+ reduce V**2);
}
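A brief usage sketch of these operators (illustrative values, not from the poster):

const k = 5;
var v: [0..k-1] real;
v = 1.0;                                  // whole-array scalar assignment
var M: [0..k-1, 0..k-1] real;
forall (i, j) in M.domain do
  M(i, j) = if i == j then 2.0 else 0.0;  // M = 2 * identity

const w = v * M;    // vector-matrix product via the overload above
writeln(normf(w));  // sqrt(5 * 2**2) = sqrt(20) for these values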

• Python-like high-level programming

• Better usability: end-user codes are written in terms of functions and operators rather than their underlying representations

• Key enabling language features are object orientation and type inference

• Future work will explore writing dimension-independent programs

/* Fn_Test1 class wraps an analytic function */
var f = new Fn_Test1();

/* Function class holds the numerical tree form */
var F = new Function(k=5, thresh=1e-5, f=f);

/* Overloaded arithmetic operators on the Function class invoke various tree algorithms */
var H = F + F;
H = F * F;

/* Print the numerical value at a given point */
writeln("Numerical value: ", format(" %0.8f", F(0.5)));
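As a hedged sketch of how such an overloaded operator might dispatch to a tree algorithm (the constructor arguments and treeAdd are assumed names, not the poster's API):

/* Hypothetical: add two Functions by combining their coefficient trees */
def +(F: Function, G: Function) {
  var R = new Function(k=F.k, thresh=F.thresh);
  R.treeAdd(F, G);  // walk both trees, summing coefficient tensors node by node
  return R;
}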


Future Work

• Recursive task parallelism

• Language constructs to distinguish between tasks that may vs. must run in parallel

• “May” construct permits runtime management of parallelism

• “Must” construct for patterns like producer-consumer (contrasted in the sketch after this list)

• Application vs. compiler control over the granularity and degree of parallelism

• DAG-based dynamic scheduling of tasks inside Chapel runtime

• Provide execution guarantees for a class of DAGs

• Compiler support for parallel distribution/iteration/algebraic operations on associative array structures
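An illustrative contrast between the two, phrased with today's constructs (the proposed "may"/"must" syntax is future work, so this is only an analogy):

// "may" parallelism: the runtime may run these iterations serially or in parallel
forall i in 1..4 do
  writeln("may-task ", i);

// "must" parallelism: producer and consumer must truly run concurrently,
// since the consumer blocks on a sync variable until the producer fills it
var buf: sync int;
cobegin {
  buf = 42;              // producer fills the sync variable
  writeln("got ", buf);  // consumer read blocks until the variable is full
}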

Putting the Pieces Together

def refine(node = root) {
  const child = node.getChild();

  var sc: [0..2*k-1] real;
  sc[0..k-1] = project(child(1));
  sc[k..2*k-1] = project(child(2));

  const dc = sc*hgT;
  const nf = normf(dc[k..2*k-1]);
  if (nf < thresh) {
    sumC[child(1)] = sc[0..k-1];
    sumC[child(2)] = sc[k..2*k-1];
  } else {
    on sumC.node2loc(child(1)) do begin refine(child(1));
    on sumC.node2loc(child(2)) do begin refine(child(2));
  }
}

sync on sumC.node2loc(node) do begin refine(node);

def compress(node = root) {
  const child = node.getChild();
  cobegin {
    on sumC.node2loc(child(1)) do
      if !sumC.hasCoeffs(child(1)) then compress(child(1));
    on sumC.node2loc(child(2)) do
      if !sumC.hasCoeffs(child(2)) then compress(child(2));
  }

  var sc: [0..2*k-1] real;
  sc[0..k-1] = sumC[child(1)];
  sc[k..2*k-1] = sumC[child(2)];

  const dc = sc*hgT;
  sumC[node] = dc[0..k-1];
  diffC[node] = dc[k..2*k-1];

  sumC.remove(child(1));
  sumC.remove(child(2));
}

sync on sumC.node2loc(root) do begin compress(root);

• Parallel refine algorithm

• Parallel compress algorithm

sumC and diffC are global-view tree containers
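The figures above also name a reconstruct tree algorithm, the inverse of compress; a hedged sketch of it, assuming hg is the inverse of the hgT transform used above and that after compress only the root holds sum coefficients:

def reconstruct(node = root) {
  // a leaf has no difference coefficients; its sum coefficients stay in place
  if !diffC.hasCoeffs(node) then return;
  const child = node.getChild();

  var dc: [0..2*k-1] real;
  dc[0..k-1] = sumC[node];
  dc[k..2*k-1] = diffC[node];

  const sc = dc*hg;  // undo the two-scale transform
  sumC[child(1)] = sc[0..k-1];
  sumC[child(2)] = sc[k..2*k-1];
  sumC.remove(node);
  diffC.remove(node);

  cobegin {
    on sumC.node2loc(child(1)) do reconstruct(child(1));
    on sumC.node2loc(child(2)) do reconstruct(child(2));
  }
}

sync on sumC.node2loc(root) do begin reconstruct(root);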

Research sponsored in part by the Laboratory Directed Research and Development Program and Post Masters Research Participation Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the U. S. Department of Energy under Contract No. DE-AC05-00OR22725, and DOE grant #DE-FC02-06ER25755 and NSF grant #0403342.
