Use of a High Level Language in High Performance Biomechanics Simulations
Katherine Yelick, Armando Solar-Lezama, Jimmy Su, Dan Bonachea, Amir Kamil
U.C. Berkeley and LBNL
Collaborators: S. Graham, P. Hilfinger, P. Colella, K. Datta, E. Givelberg, N. Mai, T. Wen, C. Bell, P. Hargrove, J. Duell, C. Iancu, W. Chen, P. Husbands, M. Welcome, R. Nishtala
A New World for Computing
[Figure: uniprocessor SpecInt performance (vs. VAX-11/780), 1978-2006, log scale, showing growth eras of 25%/year, 52%/year, and ??%/year]
• VAX: 25% per year, 1978 to 1986
• RISC + x86: 52% per year, 1986 to 2002
• RISC + x86: 18% per year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
Sea change in chip design: multiple “cores” or processors per chip from IBM, Sun, AMD, and Intel today
Slide Source: Dave Patterson
Why Is the Computer Industry Worried?
• For 20 years, hardware designers have taken care of performance
• Now they will produce only parallel processors
  – Double number of cores every 18-24 months
  – Uniprocessor performance relatively flat
Performance is a software problem
All software will be parallel
• Programming options:
  – Libraries: OpenMP (scalability?), MPI (usability?)
  – Languages: parallel C, Fortran, Java, Matlab
Titanium: High Level Language for Scientific Computing
• Titanium is an object-oriented language based on Java
• Additional language support:
  – Multidimensional arrays
  – Value classes (Complex type)
  – Fast memory management
  – Scalable parallelism model with locality
• Implementation strategy
  – Titanium compiler translates Titanium to C with calls to a communication library (GASNet); no JVM
  – Portable across machines with C compilers
  – Cross-language calls to C/Fortran/MPI possible
Joint work with Titanium group
Titanium Array Operations
• Titanium arrays have a rich set of operations: translate, restrict, slice (n dim to n-1)
• None of these modify the original array; they just create another view of the data in that array
• Iterate over an array without worrying about bounds
  – Bounds checking done to prevent errors (can be turned off)
RectDomain<2> r = [[0,0]:[11,11]];           // 2D index space with corners (0,0) and (11,11)
double [2d] a = new double [r];
double [2d] b = new double [[1,1]:[10,10]];  // interior cells only
foreach (p in b.domain()) { b[p] = 2.0*a[p]; }
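To make the view operations concrete, here is a minimal sketch of translate, restrict, and slice on the array a above (method names follow the Titanium array library; the variables shifted, interior, and row0 are ours). Each call returns a new view of a's data without copying:

double [2d] shifted  = a.translate([1,1]);           // same data, indices shifted by (1,1)
double [2d] interior = a.restrict([[1,1]:[10,10]]);  // view of the interior sub-domain
double [1d] row0     = a.slice(1, 0);                // fix dimension 1 at index 0: a 1-d view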
Titanium Small Object Example
immutable class Complex {
  private double real;
  private double imag;
  public Complex(double r, double i) {
    real = r; imag = i;
  }
  public Complex op+(Complex c) {
    return new Complex(c.real + real, c.imag + imag);
  }
}
• Support for small objects, like Complex
  – In Java these would be objects (not built-in)
    • Extra indirection, poor memory locality
  – Titanium immutable classes are for small objects
    • No indirection is used; like C structs
• Overloading is available for convenience
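A quick usage sketch (variable names ours): because Complex is immutable, op+ yields a fresh value and the operands are untouched, with struct-like layout and no pointer chasing:

Complex a = new Complex(1.0, 2.0);
Complex b = new Complex(3.0, -1.0);
Complex c = a + b;   // invokes op+; c is (4.0, 1.0), a and b unchanged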
Titanium Templates
• Many applications use containers:
  – Parameterized by dimensions, element types, …
  – Java supports parameterization through inheritance
    • Inefficient for small parameter types
• Titanium provides a template mechanism closer to C++
  – Instantiated with objects and non-object types (double, Complex)
• Example:
template <class Element> class Stack {
  ...
  public Element pop() {...}
  public void push( Element arrival ) {...}
}
_____________________________________________________
template Stack<int> list = new template Stack<int>();
list.push( 1 );
Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local (near) or global (possibly far); programmer controls layout
[Figure: global address space spanning processors p0, p1, …, pn; each thread holds local (l) and global (g) pointers, and global pointers may refer to data on other processors]
By default:
• Object heaps are shared
• Program stacks are private
• Besides Titanium, Unified Parallel C and Co-Array Fortran use this parallelism model
Joint work with Titanium group
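A minimal Titanium-style sketch of this model (variable names ours): every process allocates in the shared heap, one reference is broadcast, and any process can then read through it, with far reads becoming communication under the hood:

double [1d] mine = new double [0:9];                // allocated in this process's shared heap
foreach (p in mine.domain()) { mine[p] = Ti.thisProc(); }
double [1d] root = broadcast mine from 0;           // every process gets a global pointer to proc 0's array
double x = root[3];                                 // direct read; remote for every process but 0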
Arrays in a Global Address Space
• Key features of Titanium arrays
  – Generality: indices may start/end at any point
  – Domain calculus allows slicing, subarray, transpose and other operations without data copies (F90 arrays and more)
• Domain calculus to identify boundaries and iterate:
  foreach (p in gridA.shrink(1).domain()) ...
• Array copies automatically work on intersection:
  gridB.copy(gridA.shrink(1));
[Figure: gridA and gridB overlap; “restricted” (non-ghost) cells, ghost cells, and the intersection (copied area) are marked]
Joint work with Titanium group
Useful in grid-based computations
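A two-process sketch of how this is typically used for a ghost-cell update (names ours; a real code would use the exchange idiom for many processes): each side gets a reference to the other's grid, then one copy call fills the ghost cells over the domain intersection:

double [3d] myGrid = new double [[-1,-1,-1]:[n,n,n]];  // n^3 owned cells plus 1 ghost layer
double [3d] grid0 = broadcast myGrid from 0;           // publish both grids
double [3d] grid1 = broadcast myGrid from 1;
double [3d] nbr = (Ti.thisProc() == 0) ? grid1 : grid0;
myGrid.copy(nbr.shrink(1));                            // fill my ghosts from nbr's owned (non-ghost) cells
Ti.barrier();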
Immersed Boundaries in Biomechanics
• Fluid flow within the body is one of the major challenges, e.g.,
  – Blood through the heart
  – Coagulation of platelets in clots
  – Effect of sound waves on the inner ear
  – Movement of bacteria
• A key problem is modeling an elastic structure immersed in a fluid
  – Irregular moving boundaries
  – Wide range of scales
  – Vary by structure, connectivity, viscosity, external forces, internally-generated forces, etc.
Software Architecture
• Application models: Heart (Titanium), Cochlea (Titanium+C), Flagellate Swimming, …
• Generic Immersed Boundary Method (Titanium)
• Solvers: Spectral (Titanium + FFTW), Multigrid (Titanium), AMR
• Extensible simulation:
  – Can add new models by extending material points
  – Can add new Navier-Stokes solvers
IB software and Cochlea by E. Givelberg; Heart by A. Solar based on Peskin/McQueen
Source: www.psc.org
Immersed Boundary Method
1. Compute the force f the immersed material applies to the fluid.
2. Spread the force onto the fluid grid:
   F(x,t) = ∫ f(s,t) δ(x − X(s,t)) ds
3. Solve the Navier-Stokes equations for the fluid velocity u:
   ρ (∂u/∂t + u·∇u) = −∇p + μ∇²u + F,   ∇·u = 0
4. Move the material at the local fluid velocity:
   ∂X(s,t)/∂t = ∫ u(x,t) δ(x − X(s,t)) dx
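To illustrate step 2 concretely, a heavily simplified 1-d sketch (our names; uses Peskin's standard 4-point smoothed delta function) of spreading one material point's force f at position X onto a fluid grid F with spacing h; it assumes the 4-cell stencil stays inside F's domain:

static double delta(double r) {               // Peskin's 4-point smoothed delta
  double a = Math.abs(r);
  if (a >= 2.0) return 0.0;
  if (a <= 1.0) return (3.0 - 2.0*a + Math.sqrt(1.0 + 4.0*a - 4.0*a*a)) / 8.0;
  return (5.0 - 2.0*a - Math.sqrt(-7.0 + 12.0*a - 4.0*a*a)) / 8.0;
}

static void spread(double [1d] F, double X, double f, double h) {
  int base = (int) Math.floor(X / h);
  for (int i = base - 1; i <= base + 2; i++)  // 4-cell support around X
    F[i] += f * delta(X/h - i) / h;           // F(x_i) += f * delta_h(x_i - X)
}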
Immersed Boundary Method Structure
4 steps in each timestep:
1. Material activation & force calculation
2. Spread force
3. Navier-Stokes solver
4. Interpolate & move material
[Figure: material points couple to the fluid lattice through interaction via a 2D Dirac delta function]
Challenges to Parallelization
• Irregular material points need to interact with regular fluid lattice
  – Efficient “scatter-gather” across processors
• Placement of materials across processors
  – Locality: store material points with underlying fluid and with nearby material points
  – Load balance: distribute points evenly
• Scalable fluid solver
  – Currently based on 3D FFT
  – Communication optimized using overlap (not yet in full IB code)
Improving Communication Performance 1: Material Interaction
• Communication within a material can be high
  – E.g., spring force law in heart fibers to contract
  – Instead, replicate point; uses linearity in spread
• Use graph partitioning (Metis) on materials
  – Improve locality in interaction
  – Nearby points on same proc
• Take advantage of hierarchical machines
  – Shared memory “nodes” within network
[Figure: material points split across processors P1 and P2; replication trades redundant work for communication]
Joint work with A. Solar, J. Su
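A hypothetical sketch of the replication idea (Fiber, distance, addSpringForce, and k are our illustrative names, not the IB code's): keeping a local replica of a spring's remote endpoint makes the force law local, trading a little redundant work for communication; linearity of the spread step then lets each processor's partial spread sum to the correct grid force:

foreach (i in myFibers.domain()) {
  Fiber fib = myFibers[i];          // fib.right may be a replica of a point owned elsewhere
  double stretch = distance(fib.left.X, fib.right.X) - fib.restLen;
  addSpringForce(fib, k * stretch); // accumulate force on both endpoints, all locally
}
// each processor spreads only the forces it accumulated; by linearity the partial spreads sum correctly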
Improving Communication Performance 2: Use Lightweight Communication
[Figure: flood bandwidth for 4KB messages as a percent of HW peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed]
GASNet excels at small to mid-range sizes
Joint work with UPC Group; GASNet design by Dan Bonachea
Improving Communication Performance 3: Fast FFTs with overlapped communication
[Figure: FFT performance in MFlops/proc for size/procs C/64, D/256, D/512 on Myrinet, Infiniband, Elan3, and Elan4 networks, comparing MPI/Fortran without overlap, MPI with overlap, and GASNet with overlap; best configurations reach roughly 0.5 Tflop/s aggregate]
• Better performance in GASNet version than MPI
• This code is in UPC, not Titanium; not yet in full IB code
Immersed Boundary Parallel Scaling
Joint work with Ed Givelberg, Armando Solar-Lezama, Jimmy Su
Hand-Optimized (planes, 2004)
[Figure: time (secs) per timestep vs. procs (1-128): 256^3 and 512^3 on Power3/Colony, 512^2x256 on Pentium4/Myrinet]
• ½ the code size of the serial/vector Fortran version
• 1 sec/timestep → roughly 1 day per simulation (heart/cochlea)
Runtime-Optimized (sphere, 2006)
[Figure: time (secs) per timestep vs. procs (16-128): 128^3 and 256^3 on IBM Power4, 128^3 and 256^3 on Itanium/Quadrics, 256^3 on IBM Power3; 2004 data on planes shown for comparison]
Use of Adaptive Mesh Refinement
• Adaptive Mesh Refinement (AMR)
  – Improves scalability
  – Fine mesh only where needed
• PhD thesis by Boyce Griffith at NYU for use in the heart
  – Uses PETSc and SAMRAI in parallel implementation
• AMR in Titanium
  – IB code is not yet adaptive
  – Separate study on AMR in Titanium
Image source: B. Griffith, http://www.math.nyu.edu/~griffith/
Adaptive Mesh Refinement in Titanium
C++/Fortran/MPI AMR
• Chombo package from LBNL
• Bulk-synchronous communication:
  – Pack boundary data between procs
Titanium AMR
• Entirely in Titanium
• Finer-grained communication:
  – No explicit pack/unpack code
  – Automated in runtime system
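A sketch of the contrast (variable names ours): in the global address space the boundary update is a single copy over the domain intersection and the runtime does any packing, where the MPI version needs explicit pack/send/receive/unpack code:

double [3d] myGhost = myPatch.restrict(ghostRegion);  // view of my ghost cells on this boundary
myGhost.copy(neighborPatch);                          // runtime packs, transfers, and unpacks as needed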
Code Size in Lines
                       C++/F/MPI   Titanium
AMR data structures       35000       2000
AMR operations             6500       1200
Elliptic PDE solver        4200*      1500
10X reduction in lines of code!
* Somewhat more functionality in PDE part of Chombo code
AMR Work by Tong Wen and Philip Colella
Performance of Titanium AMR
[Figure: speedup (0-80) vs. #procs (16, 28, 36, 56, 112) for Ti and Chombo AMR]
• Serial: Titanium is within a few % of C++/F; sometimes faster!
• Parallel: Titanium scaling is comparable, with generic optimizations
  – optimizations (SMP-aware) that are not in the MPI code
  – additional optimizations (namely overlap) not yet implemented
Comparable parallel performance
Joint work with Tong Wen, Jimmy Su, Phil Colella
Towards a Higher Level Language
• Domain-specific language for particle-mesh computations
• Basic language concepts and use
  – Particle, Mesh (1d, 2d, 3d), Particle group
  – Optimizer re-uses communication information (schedule) and overlaps communication
• Results on simple test case
User Input → Program Synthesizer → Parallel Titanium → Titanium Compiler → Machine
            Base   Re-use   Re-use + Overlap
Time (ms)    8.7      6.7                6.5
Conclusions
• All software will soon be parallel
  – End of the single processor scaling era
• Titanium is a high level parallel language
  – Support for scientific computing
  – High performance and scalable
  – Highly portable across serial / parallel machines
  – Download: http://titanium.cs.berkeley.edu
• Immersed boundary method framework
  – Designed for extensibility
  – Demonstrations on heart and cochlea simulations
  – Some optimizations done in compiler / runtime
  – Contact: {jimmysu,yelick}@cs.berkeley.edu