
Massively Parallel Phase-Field Simulations for Ternary Eutectic Directional Solidification

Martin Bauer1*, Johannes Hötzer2,3, Philipp Steinmetz2, Marcus Jainta2,3, Marco Berghoff2, Florian Schornbaum1, Christian Godenschwager1, Harald Köstler1, Britta Nestler2,3, Ulrich Rüde1

1 Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Cauerstraße 11, Erlangen, Germany
2 Institute of Materials and Processes, Karlsruhe University of Applied Sciences (HSKA), Haid-und-Neu-Str. 7, 76131 Karlsruhe, Germany
3 Institute for Applied Materials - Computational Materials Science (IAM-CMS), Karlsruhe Institute of Technology (KIT), Haid-und-Neu-Str. 7, 76131 Karlsruhe, Germany

*Corresponding Author. Email: [email protected]

ABSTRACT
Microstructures forming during ternary eutectic directional solidification processes have significant influence on the macroscopic mechanical properties of metal alloys. For a realistic simulation, we use the well established thermodynamically consistent phase-field method and improve it with a new grand potential formulation to couple the concentration evolution. This extension is very compute intensive due to a temperature dependent diffusive concentration. We significantly extend previous simulations that have used simpler phase-field models or were performed on smaller domain sizes. The new method has been implemented within the massively parallel HPC framework waLBerla that is designed to exploit current supercomputers efficiently. We apply various optimization techniques, including buffering techniques, explicit SIMD kernel vectorization, and communication hiding. Simulations utilizing up to 262,144 cores have been run on three different supercomputing architectures and weak scalability results are shown. Additionally, a hierarchical, mesh-based data reduction strategy is developed to keep the I/O problem manageable at scale.

1. INTRODUCTION
The microstructure in alloys significantly influences the macroscopic properties. In order to predict and optimize the resulting macroscopic material parameters, it is important to get a deeper insight into this microscopic structure formation. Especially in ternary systems with three chemical species, a wide range of patterns can form in the microstructure, depending on the physical and process parameters [25, 19, 13]. In this work, we show simulation results for the pattern formation in ternary eutectic directional solidification of Ag-Al-Cu alloys. The pattern formation of this ternary eutectic system during directional solidification was studied experimentally by Genau, Dennstedt, and Ratke in [13, 8, 9]. The thermodynamic data are reported in the Calphad database and are derived from [35, 36]. This system is of special interest because many essential aspects of solidification processes can be studied due to the low temperature of the ternary eutectic point and similar phase fractions in micrographs. In technical applications, this alloy is used for lead-free solders in micro-electronics [17]. However, an experimental study of the three dimensional structure requires significant technical effort, e.g. with synchrotron tomography. Simulations of the solidification process offer another way of studying these structures, providing the whole microstructure evolution.

Phase-field models are well established to simulate solidification problems. In 2011, Shimokawabe et al. [29] presented the first peta-scale phase-field simulation on the TSUBAME 2.0 GPU cluster. They studied binary dendritic growth, the evolution of multiple nuclei in three dimensions under consideration of temperature, and concentration instabilities. Further investigations of this binary dendritic growth were performed by Yamanaka et al. [37] and Takaki et al. [31, 30] on the TSUBAME 2.0 and TSUBAME 2.5 GPU supercomputers. In the work awarded the Gordon Bell Prize 2011, [29] used a phase-field model with two phases and two components based on a simplified free-energy approach, neglecting the temperature dependence in the concentration evolution equation and hence the slope of the solidus and liquidus lines. [30] continued the work of [29] on binary dendritic growth in 2D by using an improved model. There, an anti-trapping current was added to obtain better quantitative results, but also leading to a more complex model. This issue of solving multiple non-linear partial differential equations, and hence the increase of required memory and computational time, was also mentioned by Yamanaka et al. [38].

In our work we focus on simulations of ternary eutectic directional solidification with a phase-field model based on the grand potential approach [7]. Ternary eutectic growth describes the transformation of a melt into three solid phases at a defined ternary eutectic point.


To study this ternary solidification, with structures approximately two orders of magnitude smaller than in dendritic growth, four phases and three components have to be considered in the model. The grand potentials, derived from thermodynamic databases, are coupled to the phase-field to ensure a physical driving force with the slopes of the solidus and liquidus planes [7, 6, 5].

In our simulations, we use an extension to the well established thermodynamically consistent phase-field method, including an anti-trapping current, based on a newly formulated grand potential approach [16]. The phase-field and chemical potential formulations consist of a relatively large number of complicated terms and parameters, making the resulting algorithm significantly more compute intensive compared to other stencil codes such as advection-diffusion calculations. An implementation of the described method was already available in the general purpose phase-field framework PACE3D [33], which provides a wide range of phase-field models coupling structural [27, 28, 26], fluid [10], and thermal effects [24]. Due to its general approach, the available code was not optimized to study the considered phenomena in the most efficient way. Simulations of large three dimensional domains are required to capture certain physically relevant effects which cannot be seen in smaller domains due to strong boundary influences [16]. Therefore, we re-implemented and optimized the specific model in waLBerla, a massively parallel framework for stencil-based algorithms on block-structured grids [11]. The framework has been shown to scale efficiently up to 1.8 million threads and to run lattice Boltzmann simulations with up to one trillion cells [15].

Several layers of parallelism have to be exploited to use modern supercomputer architectures efficiently: SIMD instructions at the core level, multithreading at the intra-node level, and inter-node parallelism via MPI. We apply several optimization techniques at all these levels, resulting in an overall relative speedup of a factor of 80 compared to the original code, and reach approximately 25% of the peak FLOP rate on the SuperMUC petascale system located at LRZ Munich. We show scaling results on three of the largest supercomputers in Germany: SuperMUC, JUQUEEN, and HORNET.

The results obtained in our simulations show excellentagreement with experimental two dimensional micrographsand three dimensional tomography [16].

2. MODEL
To simulate the solidification process of ternary eutectic systems, a thermodynamically consistent phase-field model is used. The presented phase-field model of Allen–Cahn type consists of two coupled evolution equations, for the vector of order parameters φ and the vector of chemical potentials µ. The equations are solved using finite difference methods and an explicit Euler scheme for the time discretization. The domain Ω is discretized in equidistant regular cells with the spatial position x. To indicate the spatial discretization scheme of the evolution equations, we introduce the "DnCm" notation for a stencil with n dimensions and total access to m cells. In the domain Ω, the order parameters φα(x, t), α ∈ [1, N], describe the fractions of the N thermodynamic phases in each cell and each time step t. We define φ(x, t) on the regular N − 1 simplex ∆N−1 for all x, t. The evolution equations of the phase-fields are written as

$$\frac{\partial \phi_\alpha}{\partial t} = \frac{1}{\tau_\alpha \varepsilon} \left( \mathrm{rhs}_\alpha - \frac{1}{N} \sum_{\beta=1}^{N} \mathrm{rhs}_\beta \right) \qquad (1)$$

with

$$\mathrm{rhs}_\alpha = \underbrace{\varepsilon T \left( \frac{\partial a(\phi,\nabla\phi)}{\partial \phi_\alpha} - \nabla \cdot \frac{\partial a(\phi,\nabla\phi)}{\partial \nabla\phi_\alpha} \right)}_{D3C7} + \underbrace{\frac{1}{\varepsilon}\, T\, \frac{\partial \omega(\phi)}{\partial \phi_\alpha}}_{D3C1} + \underbrace{\frac{\partial \psi(\phi,\mu,T)}{\partial \phi_\alpha}}_{D3C1}. \qquad (2)$$

The reciprocal relaxation parameter τα couples the phase-field evolution with the physical time scale, and ε is related to the interface width. The gradient energy density a(φ,∇φ) and the potential energy density ω(φ) describe the interfacial energy part in (2). The driving force ψ(φ,µ,T) describes the thermodynamic process of the phase transition caused by the undercooling and connects the evolution of the order parameter to the chemical potential. This force is derived from parabolically fitted Gibbs energies, which in turn are derived from the thermodynamic Calphad databases [5]. The evolution equations of the K chemical potentials µ(x, t) are derived from Fick's law and the variational derivation of the concentrations c(x, t). Due to the conservation of mass, one concentration can be derived from the others. The number of evolution equations for c(x, t) as well as the number of components of µ(x, t) can therefore be reduced to K − 1.

The µ evolution is written as

$$\frac{\partial \mu}{\partial t} = \underbrace{\left[ \left( \frac{\partial c}{\partial \mu} \right)_{T,\phi} \right]^{-1}}_{D3C1} \left( \underbrace{ -\left( \frac{\partial c}{\partial \phi} \right)_{T,\mu} \frac{\partial \phi}{\partial t} - \left( \frac{\partial c}{\partial T} \right)_{\mu,\phi} \frac{\partial T}{\partial t} }_{D3C1} + \underbrace{\nabla \cdot \big( M(\phi,T)\, \nabla\mu \big)}_{D3C7} - \underbrace{\nabla \cdot \big( J_{\mathrm{at}}(\phi,\mu,T) \big)}_{D3C19} \right) \qquad (3)$$

The Jacobian matrix ∂c/∂µ describes the susceptibility and is derived from parabolic free energies. The two flux terms are a gradient flux of the chemical potential, depending on the mobility M(φ,T), and a flux called the anti-trapping current [7]. It is derived as

$$J_{\mathrm{at}} = \frac{\pi\varepsilon}{4} \sum_{\substack{\alpha=1 \\ (\alpha\neq\ell)}}^{N} g_\alpha(\phi)\, h_\ell(\phi)\, \sqrt{\phi_\alpha \phi_\ell}\, \frac{\partial \phi_\alpha}{\partial t} \left( \frac{\nabla\phi_\alpha}{|\nabla\phi_\alpha|} \cdot \frac{\nabla\phi_\ell}{|\nabla\phi_\ell|} \right) \left( \big( c_\ell(\mu) - c_\alpha(\mu) \big) \otimes \frac{\nabla\phi_\alpha}{|\nabla\phi_\alpha|} \right) \qquad (4)$$

with the interpolation function hℓ [23].

For the directional solidification, we use a frozen temperature assumption by imprinting an analytical temperature gradient with a defined velocity.

For the description of the numerical optimizations, we distinguish different regions in the simulation domain. The region Bα := {x ∈ Ω | φα(x, t) = 1 ∧ |∇φα| = 0}, where only one phase α exists, is called the bulk region. In this region, the time derivative of the order parameter and the anti-trapping current (4) are zero.

The diffuse interface IΩ := Ω ∖ ⊍α∈N Bα is located between bulk regions. The interfacial energy part and the driving force in (1) are exclusively calculated in IΩ. The solidification front is defined as FΩ := {x ∈ IΩ | φℓ(x) > 0}. The liquid region is defined as LΩ := Bℓ and the solid one as SΩ := Ω ∖ (LΩ ∪ FΩ). For a more detailed description of the model we refer to [16, 7].

2.1 Phase-field Algorithm
The algorithm to calculate the phase-field model is decomposed into two kernels: one for updating values of the phase-field itself (1) and one to update the chemical potential (3). Two lattices are allocated for each variable: two destination fields, denoted by φdst and µdst, and two source fields, denoted by φsrc and µsrc. The destination fields store the values for the next time step t + ∆t, while the source fields hold the values of the current time step. Each kernel performs a loop over the local simulation domain and updates all cell values of the φ or µ field. After each kernel run, the ghost layers are synchronized with neighboring blocks and boundaries are updated using Dirichlet, Neumann and periodic boundary conditions as shown in Figure 2. The details of this update scheme are depicted in Algorithm 1.

Algorithm 1 Timestep
1: φdst ← φ-kernel(φsrc, µsrc) (see (1))
2: φdst ghost layer communication
3: φdst boundary handling
4: µdst ← µ-kernel(µsrc, φsrc, φdst) (see (3))
5: µdst ghost layer communication
6: µdst boundary handling
7: Swap φsrc ↔ φdst and µsrc ↔ µdst

The φ-kernel needs the previous phase-field values φsrc and chemical potential values µsrc as input. Direct neighborhood values are required to compute gradients of the phase-field, leading to a D3C7 stencil for the φ-field, while only local values are required for µ (D3C1). For updating the chemical potential, the previous chemical potential values µsrc as well as the phase-field values of two time steps (φsrc and φdst) are necessary. The two phase-field time steps and the D3C19 stencil, including diagonal neighbors, are required to calculate (4). Figure 1 shows these data dependencies in detail for the two kernels.
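To make the update scheme concrete, the following C++ sketch mirrors Algorithm 1 with double-buffered fields. It is a minimal illustration, not the waLBerla implementation: the Field type, the kernel bodies, and the communication and boundary routines are placeholders.

```cpp
// Minimal sketch of Algorithm 1 with double-buffered fields. The Field type,
// the kernel bodies and the communication/boundary routines are illustrative
// placeholders, not the actual waLBerla interfaces.
#include <utility>
#include <vector>

struct Field {
    int nx = 0, ny = 0, nz = 0, components = 1;   // local block size incl. ghost layers
    std::vector<double> data;
};

void phiKernel(Field& phiDst, const Field& phiSrc, const Field& muSrc) { /* eq. (1), cell by cell */ }
void muKernel(Field& muDst, const Field& muSrc,
              const Field& phiSrc, const Field& phiDst) { /* eq. (3), cell by cell */ }
void exchangeGhostLayers(Field&) { /* MPI ghost layer exchange */ }
void applyBoundaries(Field&)     { /* Dirichlet / Neumann / periodic */ }

void timestep(Field& phiSrc, Field& phiDst, Field& muSrc, Field& muDst)
{
    phiKernel(phiDst, phiSrc, muSrc);        // 1: phase-field update
    exchangeGhostLayers(phiDst);             // 2: ghost layer communication
    applyBoundaries(phiDst);                 // 3: boundary handling

    muKernel(muDst, muSrc, phiSrc, phiDst);  // 4: chemical potential update
    exchangeGhostLayers(muDst);              // 5
    applyBoundaries(muDst);                  // 6

    std::swap(phiSrc, phiDst);               // 7: swap source and destination
    std::swap(muSrc, muDst);
}
```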

Figure 1: Data dependencies of the two kernels: (a) φ-sweep, (b) µ-sweep.

As initial setup we use solid nuclei at the bottom of a liquid-filled domain as shown in Figure 2. These solid nuclei are created by a Voronoi tessellation with respect to the given volume fractions of the phases. For our studies with the presented model, the simulation parameters of [16], describing the system Ag-Al-Cu, are used.

3. IMPLEMENTATION
In the following section, we present our implementation of the phase-field model and the optimizations necessary to run efficiently on current supercomputing systems.

Figure 2: Simulation setting for the directional solidification of a ternary eutectic system, including the boundary conditions (periodic, Neumann, and Dirichlet), the solid and liquid regions, and the moving window. Below, the analytical temperature gradient G with the velocity v and its direction is shown.

3.1 The waLBerla Framework
The phase-field method described above was implemented in the waLBerla framework (widely applicable lattice Boltzmann from Erlangen). As the name suggests, the framework was initially developed for simulations using the lattice Boltzmann method. Over time, it evolved into a multi-physics framework for efficient implementations of stencil-based algorithms on block-structured grids in a massively parallel way. waLBerla partitions the simulation domain into equally sized chunks, called blocks. On each block, a regular grid is allocated, extended by one or more ghost layers for communication in MPI-parallel simulations. This concept provides the flexibility for supporting complex geometries [15] and optional mesh refinement. The regular structure inside the blocks allows for highly efficient compute kernels. The data structure storing the blocks is fully distributed: every process holds information only about local and adjacent blocks. Thus, the memory usage of a process does not depend on the total size of the simulation but only on the size of local and neighboring blocks. Only during the startup phase, which is responsible for block setup and load balancing, does one process have to store global domain information. This initialization can be executed independently of the actual simulation. The resulting block structure is then stored in a file to be loaded by the simulation at runtime. The framework is entirely written in C++ with small portions of the code being automatically generated. For performance critical code sections, we make use of template meta-programming techniques to achieve good performance without losing flexibility and usability.

3.2 Input/Output
A big challenge in massively parallel programs is dealing with the huge amount of output data. In many big simulations, I/O operations can become a significant bottleneck. Therefore, we try to minimize the amount of data that is read from and written to the file system. Since the initial domain setup is generated using a Voronoi approach, we do not have to load big, voxel-based input files describing the domain. I/O is only necessary for checkpointing and for result output. For generating checkpoints, the complete simulation state has to be stored on disk, containing four φ values and two µ values per cell. While all computations are carried out in double precision, checkpoints use only single precision to save disk space and I/O bandwidth. Writing a checkpoint can take a significant amount of time compared to a simulation time step; therefore, checkpoints are written infrequently.
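As an illustration of the single precision checkpointing, the sketch below converts double precision field data to floats before writing. The file layout and function names are assumptions for this example, not the framework's actual checkpoint format.

```cpp
// Sketch of the checkpoint idea: simulation state is kept in double precision,
// but written to disk as single precision floats to halve checkpoint size.
// File layout and names are illustrative, not the framework's actual format.
#include <cstdint>
#include <cstdio>
#include <vector>

void writeCheckpoint(const char* path,
                     const std::vector<double>& phi,   // 4 values per cell
                     const std::vector<double>& mu)    // 2 values per cell
{
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return;

    auto writeField = [&](const std::vector<double>& field) {
        std::vector<float> buf(field.begin(), field.end());  // double -> float
        std::uint64_t n = buf.size();
        std::fwrite(&n, sizeof(n), 1, f);
        std::fwrite(buf.data(), sizeof(float), buf.size(), f);
    };
    writeField(phi);
    writeField(mu);
    std::fclose(f);
}
```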

Simulation results have to be written more often, thus a faster technique is required for this task. Instead of writing all values of a cell, we only store the position of the interfaces using a triangle surface mesh. For this purpose, a custom marching cubes algorithm based on [21] was developed that generates meshes locally on each block, using the φ values as input. For each phase, a mesh is generated, describing the interface between this phase and all other phases. The marching cubes algorithm extends to the ghost regions such that the local meshes can be stitched together to a single mesh describing the complete domain. These local meshes could be written to disk, leaving the stitching for postprocessing. Since the marching cubes algorithm creates triangles with edge lengths in the order of dx, these meshes are still unnecessarily fine and could be adaptively coarsened without losing much accuracy. This coarsening step can optionally be performed before writing the mesh to keep output file sizes small. For mesh coarsening, we use the quadric-error edge-collapse-based simplification algorithm [12] from the Visualization and Computer Graphics Library (VCG) [4]. The coarsening is done in a hierarchical way: in a first step, each process calls the edge-collapse algorithm on its local mesh. By assigning a high weight to all vertices that are located on block boundaries, the boundaries are preserved such that the later stitching step can work correctly. Then, two local meshes are gathered on a process, stitched together, and again coarsened in the stitched region. This step is repeated log2(processes) times, where in each step only half of the processes take part in the reduction. This procedure is stopped if the mesh is either fully reduced or has reached a size that cannot be stored in the memory of a single node. In the latter case, postprocessing can be resumed on a machine with more memory than a compute node.
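The hierarchical reduction can be pictured as a binary tree over the MPI ranks: in each round, half of the remaining processes send their local mesh to a partner, which stitches and re-coarsens it. The sketch below shows this communication pattern only; the Mesh type, serialization, stitching, and the coarsening call (quadric-error edge collapse via VCG in the actual code) are placeholder stubs.

```cpp
// Sketch of the hierarchical mesh reduction over log2(P) rounds.
// All mesh operations are placeholder stubs; only the MPI pattern is real.
#include <mpi.h>
#include <vector>

struct Mesh { std::vector<double> vertices; std::vector<int> triangles; };

std::vector<char> serialize(const Mesh&)            { return {}; }          // placeholder
Mesh deserialize(const std::vector<char>&)          { return {}; }          // placeholder
Mesh stitch(const Mesh& a, const Mesh&)             { return a; }           // merge along block boundary (placeholder)
void coarsenMesh(Mesh&)                             { }                     // edge-collapse simplification (placeholder)

void reduceMesh(Mesh& local, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == step) {              // sender in this round
            std::vector<char> buf = serialize(local);
            int count = static_cast<int>(buf.size());
            MPI_Send(&count, 1, MPI_INT, rank - step, 0, comm);
            MPI_Send(buf.data(), count, MPI_CHAR, rank - step, 1, comm);
            return;                                   // sender drops out of further rounds
        } else if (rank % (2 * step) == 0 && rank + step < size) {
            int count = 0;
            MPI_Recv(&count, 1, MPI_INT, rank + step, 0, comm, MPI_STATUS_IGNORE);
            std::vector<char> buf(count);
            MPI_Recv(buf.data(), count, MPI_CHAR, rank + step, 1, comm, MPI_STATUS_IGNORE);
            local = stitch(local, deserialize(buf));
            coarsenMesh(local);                       // in practice, coarsen only the stitched region
        }
    }
    // rank 0 now holds the reduced global mesh (or the procedure stops earlier
    // once the mesh no longer fits into a single node's memory).
}
```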

3.3 Optimizations
Starting from a very general phase-field model, the goal of this work was to develop a highly optimized code for the specific model presented in [7], which can efficiently simulate sufficiently large 3D domains. Optimizations are applied on different levels by exploiting physical, mathematical, and computational facts to decrease the computation time.

For directional solidification, we can reduce the effective domain size in the solidification direction (Nz) [16] by using a moving window technique [33] as shown in Figure 2. The evolution in the solid is multiple orders of magnitude slower than in the liquid, so we neglect the evolution in the solid. In order to produce physically correct microstructures, the base size of the domain Nx × Ny must exceed a minimal size to reduce the influence of the periodic boundaries.

The evolution of the temperature can be described by a frozen temperature ansatz in the solidification direction, by exploiting the time scales of the different evolution equations [16, 7]. At a given time t, we assume the temperature to be constant in slices orthogonal to the solidification direction.

The grand potentials for the driving force are only needed in the range of the ternary eutectic point; therefore, we use fitted parabolic Gibbs energies to derive the potentials instead of describing the total ternary system. This simplifies the calculations involving the concentrations c and the chemical potentials µ [5].

The model equations can be simplified in certain parts of the domain for both evolution equations. The computation of the anti-trapping current in the µ-evolution is only required in the interface region FΩ of the solidification front. Since evaluating this current is computationally expensive, we detect as early as possible whether the complete term has to be evaluated by testing critical subexpressions for zeros. When computing Jat, we first check if φℓ is zero, since then hℓ(φ) and thus Jat are zero. By introducing one additional check, we can omit the calculation of the expensive Jat in all cells which do not contain liquid. A similar check can be added for ∇φℓ: for cells with zero liquid phase gradient, the computation of the anti-trapping current can be skipped as well. The interface region IΩ is bounded due to a sinus-shaped interface profile in which ∇φα ≠ 0. Consequently, the calculation of the φ evolution is only required in this small interface region. Adding these kinds of checks introduces possibly expensive if conditions in the kernel, so there may be a trade-off between the saved computations and the peak performance. As shown in Section 5, the introduced checks reduce the total runtime considerably.
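The following fragment sketches these checks; it is only meant to show the control flow. The helper evaluateJat() stands for the full, costly evaluation of eq. (4) and is a hypothetical placeholder, not the actual kernel code.

```cpp
// Sketch of the shortcut logic: the expensive anti-trapping term of eq. (4) is
// only evaluated where liquid is present and its gradient does not vanish.
struct Vec3 { double x = 0.0, y = 0.0, z = 0.0; };

Vec3 evaluateJat(double phiLiquid, const Vec3& gradPhiLiquid);   // full eq. (4), expensive (placeholder)

inline Vec3 antiTrappingShortcut(double phiLiquid, const Vec3& gradPhiLiquid)
{
    if (phiLiquid == 0.0)                      // no liquid in this cell: h_l(phi) = 0, J_at = 0
        return Vec3{};
    if (gradPhiLiquid.x == 0.0 && gradPhiLiquid.y == 0.0 && gradPhiLiquid.z == 0.0)
        return Vec3{};                         // zero liquid phase gradient: skip as well
    return evaluateJat(phiLiquid, gradPhiLiquid);   // only near the solidification front
}
```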

We conducted further optimizations on the source code and hardware level using processor extensions available on the target machines. On these levels, we used a systematic, performance-model-driven approach to identify and eliminate relevant bottlenecks. As we show in Section 5, our code is bound by in-core execution; therefore, the goal is to reduce the total number of floating point operations, potentially at the cost of increased memory traffic.

The fact that in our specific scenario the temperature is a function of the z coordinate led us to choose the z iteration as the outermost loop in the computation kernels, followed by loops over y and x. Since many values which are required to calculate the driving force ψ of (2) or the anti-trapping current Jat of (4) depend on the analytic temperature only, these values can be pre-calculated once for each z-slice instead of computing them again in every cell.
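A sketch of the resulting loop structure is shown below; temperatureAt(), precomputeCoefficients(), and updateCell() are hypothetical stand-ins for the analytic temperature profile, the temperature dependent terms of (2) and (4), and the per-cell update.

```cpp
// Sketch of the z-outermost loop ordering with per-slice precomputation: the
// frozen temperature depends only on z (and t), so all purely temperature
// dependent coefficients are evaluated once per x-y-slice instead of per cell.
#include <vector>

struct SliceCoefficients { std::vector<double> values; };

double temperatureAt(int z, double t);                               // analytic T(z, t)
SliceCoefficients precomputeCoefficients(double T);                  // T-dependent terms
void updateCell(int x, int y, int z, const SliceCoefficients& c);    // phi or mu update

void sweep(int Nx, int Ny, int Nz, double t)
{
    for (int z = 0; z < Nz; ++z) {                       // outermost loop over z
        SliceCoefficients c = precomputeCoefficients(temperatureAt(z, t));
        for (int y = 0; y < Ny; ++y)
            for (int x = 0; x < Nx; ++x)
                updateCell(x, y, z, c);                  // reuses the per-slice values
    }
}
```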

Following the strategy of saving floating point operations at the cost of memory transfers, we identify expressions in the model equations that are evaluated multiple times and can therefore be buffered. In our discretization scheme we also evaluate quantities at staggered grid positions. To avoid evaluation in each cell, they are written to a buffer after computation and are reused. Since the outermost loop in the computation kernels iterates over the z coordinate, a buffer of the size Nx × Ny is needed.

The computationally most intensive part of equation (3) is the calculation of the divergence of vbuf := (M∇µ − Jat), even if the evaluation of the anti-trapping current can be skipped for many cells. In order to update one cell, the six values of vbuf at neighboring staggered positions are required. However, three of them can be buffered and reused since they have already been calculated during the update of previous cells (Figure 3). Only in cells located at block boundaries do all six values have to be computed explicitly. Using this buffering approach, we effectively halve the required floating point operations at the cost of additional memory accesses for reading and writing the buffered values.

The same technique of buffering staggered values is also applied for evaluating the divergence of ∂a(φ,∇φ)/∂∇φα in the phase-field update step.
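One way to organize this reuse is sketched below for a scalar face flux: each cell computes only the fluxes through its three "upper" faces and stores them, while the three "lower" faces are read back from buffers filled by previously visited cells. faceFlux() stands for the expensive evaluation of M∇µ − Jat at a staggered position; the additional row and single-value buffers are an implementation detail of this sketch, the text above only requires the Nx × Ny slice buffer.

```cpp
// Sketch of staggered-value buffering for the divergence in eq. (3): only the
// three "upper" faces of each cell are evaluated, the "lower" faces are reused.
#include <vector>

double faceFlux(int x, int y, int z, int axis);   // flux through upper face along axis (placeholder)

void divergenceSweep(int Nx, int Ny, int Nz, double dx, std::vector<double>& div)
{
    std::vector<double> zbuf(Nx * Ny), ybuf(Nx);
    double xbuf = 0.0;

    for (int z = 0; z < Nz; ++z)
        for (int y = 0; y < Ny; ++y)
            for (int x = 0; x < Nx; ++x) {
                // lower faces: recompute explicitly only at block boundaries, otherwise reuse
                double fxLo = (x == 0) ? faceFlux(x - 1, y, z, 0) : xbuf;
                double fyLo = (y == 0) ? faceFlux(x, y - 1, z, 1) : ybuf[x];
                double fzLo = (z == 0) ? faceFlux(x, y, z - 1, 2) : zbuf[y * Nx + x];

                // upper faces: computed once and stored for the following cells
                double fxHi = faceFlux(x, y, z, 0);
                double fyHi = faceFlux(x, y, z, 1);
                double fzHi = faceFlux(x, y, z, 2);
                xbuf = fxHi;  ybuf[x] = fyHi;  zbuf[y * Nx + x] = fzHi;

                div[(z * Ny + y) * Nx + x] =
                    (fxHi - fxLo + fyHi - fyLo + fzHi - fzLo) / dx;
            }
}
```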

Figure 3: Buffering values at staggered positions (D3C7). Circles mark cell midpoints, lines indicate cell boundaries.

While adhering to the established data-parallel approach on the inter-process level, the waLBerla framework offers task-parallel programming models on the intra-process level to support overlapping communication with computation. The computation kernels as well as the ghost layer exchange routines are implemented as C++ functors, which are registered at a "Timeloop" class that manages the communication hiding.

The communication of the chemical potential field can be hidden in a straightforward way since the following update of the phase-field only depends on local µ values (Figure 1). This allows us to communicate the µ boundaries while updating the φ-field. In our implementation the order of communication and boundary handling routines can also be interchanged without altering the results.

Since updating µ accesses neighborhood values of φ for the current and the new time step, the communication of the phase-field values cannot simply be overlapped with the µ computation. In order to hide the phase-field communication as well, the µ update step has to be split up into two kernels. The obvious method would be to update inner values first and, after communication has finished, update the values at the border. Implementing this solution would require a staggered value buffer of the same size as the complete field, or the recomputation of staggered values. Thus, we split up the µ update into a part with local and a part with non-local φ dependencies, as indicated in (3). This splitting simplifies the data dependencies (Figure 4) and does not affect the buffering of staggered values, but still has some overhead, since now the temperature dependent values have to be computed twice for each z-slice. Therefore, we calculate the µ evolution without the anti-trapping current (4) first. After the transfer of the ghost layers, the anti-trapping current is calculated and added to the µ evolution. Algorithm 2 shows the resulting time step of the phase-field method using fully overlapping communication.

Algorithm 2 Timestep with communication overlap
1: communicate µsrc
2: φdst ← φ-sweep(φsrc, µsrc)
3: end communicate
4: φdst boundary handling
5: communicate φdst
6: µdst ← µ-sweep-local(µsrc, φsrc, φdst)
7: end communicate
8: µdst ← µ-sweep-neighbor(µsrc, φsrc, φdst)
9: µdst boundary handling
10: Swap φsrc ↔ φdst and µsrc ↔ µdst
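A generic way to realize such an overlap with non-blocking MPI is sketched below for a single neighbor. The actual code registers functors at waLBerla's timeloop rather than issuing raw MPI calls, and the packing routines are placeholders.

```cpp
// Sketch of the overlap pattern of Algorithm 2 with non-blocking MPI, reduced
// to one neighbor. packGhostLayers()/unpackGhostLayers() are placeholders.
#include <mpi.h>
#include <vector>

struct Field;                                               // block-local field storage
void packGhostLayers(const Field& f, std::vector<double>& buf);
void unpackGhostLayers(Field& f, const std::vector<double>& buf);

template <class Compute>
void exchangeOverlapped(Field& f, int neighborRank, MPI_Comm comm, Compute compute)
{
    std::vector<double> sendBuf, recvBuf;
    packGhostLayers(f, sendBuf);
    recvBuf.resize(sendBuf.size());

    MPI_Request reqs[2];
    MPI_Isend(sendBuf.data(), static_cast<int>(sendBuf.size()), MPI_DOUBLE,
              neighborRank, 0, comm, &reqs[0]);
    MPI_Irecv(recvBuf.data(), static_cast<int>(recvBuf.size()), MPI_DOUBLE,
              neighborRank, 0, comm, &reqs[1]);

    compute();                                 // kernel work hidden behind the transfer

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    unpackGhostLayers(f, recvBuf);
}

// usage in the spirit of Algorithm 2 (steps 1-3):
//   exchangeOverlapped(muSrc, nb, comm, [&]{ phiSweep(phiDst, phiSrc, muSrc); });
```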

Figure 4: Data dependencies of the split µ update (with communication overlap).

In order to fully utilize modern hardware architectures, especially when running compute intensive codes, it is important to make full use of their SIMD (single instruction, multiple data) capabilities. While most compilers can do automatic loop vectorization, it is still necessary to provide additional information to the compiler regarding aliasing, data alignment, and typical loop sizes using non-portable pragma directives. An analysis with the performance measurement tool LIKWID [32], which can measure the amount of vectorized and non-vectorized floating point operations, showed that large portions of our code were not automatically vectorized. While the pragma approach proves useful for small, concise loop kernels, we could not improve the more complex compute kernels for φ and µ of several hundred lines of code using this approach. Instead, we explicitly vectorize these two main compute kernels with intrinsic compiler functions. Using intrinsics requires a manual reformulation of the algorithm in terms of vector data types. Since these intrinsic functions directly map to assembler instructions, they are not portable across different architectures. In order to keep the code portable, a lightweight abstraction layer was developed that provides a common API supporting the Intel processor extensions SSE2, SSE4, AVX, and AVX2 as well as the QPX extension on Blue Gene/Q cores. Not all functions of this API directly map to a single intrinsic function, and therefore to a single assembler instruction, on each instruction set. For example, AVX2 provides instructions for permuting elements of vector data types which require two or more instructions in the SSE extension. Our API hides these differences, providing functions for all AVX2 and QPX intrinsics, which are emulated on older extensions in the most efficient way. Manual inspection of the assembler code shows that calls to this thin SIMD API are inlined by the compiler and therefore no additional overhead is introduced.
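The following header-style sketch illustrates such an abstraction layer for a small subset of operations, with an AVX path and a scalar fallback. The names and scope are assumptions for this example; the actual layer covers far more operations and also the QPX instruction set.

```cpp
// Sketch of a thin SIMD abstraction layer: a common set of inline functions
// maps to AVX intrinsics where available and to plain scalar code otherwise.
#if defined(__AVX__)
  #include <immintrin.h>
  using simd_double = __m256d;                        // four double precision lanes
  inline simd_double simd_set1(double a)              { return _mm256_set1_pd(a); }
  inline simd_double simd_add(simd_double a, simd_double b) { return _mm256_add_pd(a, b); }
  inline simd_double simd_mul(simd_double a, simd_double b) { return _mm256_mul_pd(a, b); }
  inline simd_double simd_load(const double* p)       { return _mm256_loadu_pd(p); }
  inline void simd_store(double* p, simd_double a)    { _mm256_storeu_pd(p, a); }
#else
  // scalar fallback with the same interface (one lane instead of four)
  using simd_double = double;
  inline simd_double simd_set1(double a)              { return a; }
  inline simd_double simd_add(simd_double a, simd_double b) { return a + b; }
  inline simd_double simd_mul(simd_double a, simd_double b) { return a * b; }
  inline simd_double simd_load(const double* p)       { return *p; }
  inline void simd_store(double* p, simd_double a)    { *p = a; }
#endif

// usage: y[0..lanes) = a * x[0..lanes) + y[0..lanes), written once against the abstract API
inline void axpyLane(double a, const double* x, double* y)
{
    simd_double va = simd_set1(a);
    simd_store(y, simd_add(simd_mul(va, simd_load(x)), simd_load(y)));
}
```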

4. HPC SYSTEMS
In this section we briefly describe the three Tier-0/1 cluster systems at the High Performance Computing Center Stuttgart (HLRS), the Leibniz Supercomputing Center (LRZ), and the Jülich Supercomputing Centre (JSC). All systems support vectorization, either using AVX(2) on the Intel based chips or QPX on Blue Gene/Q cores.

4.1 SuperMUC
The SuperMUC system located at the Leibniz Supercomputing Center in Munich is built out of 18,432 Intel Xeon E5-2680 processors running at 2.7 GHz, resulting in a total number of 147,456 cores [3]. One compute node of SuperMUC consists of 2 sockets, each equipped with 8 cores. 512 nodes are grouped into one island. Within each island, SuperMUC uses a non-blocking tree network topology, whereas all 18 islands are connected via a pruned tree (4:1) [3].

On SuperMUC we used the Intel compiler in version 14.0.3 together with IBM MPI. We compiled our code on optimization level 3 together with optimization flags enabling interprocedural optimization at link time and fast floating point operations.

4.2 Cray XC40 (Hornet)
The Cray XC40 system at the HLRS consists of 3944 nodes with two sockets, each containing an Intel Haswell E5-2680v3 with 12 cores. The system contains 94,656 cores in total, which are interconnected by a Dragonfly network [18] called Cray Aries. The Haswell core clock rate is 2.5 GHz, and it supports Hyper-Threading as well as the newer AVX2 instruction set. On Hornet we have chosen the same compiler and compiler options as on SuperMUC, with additional flags to make use of the AVX2 extension.

4.3 JUQUEEN
JUQUEEN is, as of April 2015, the fastest supercomputing system in Germany and is positioned at rank 8 of the TOP500 list [2]. It is a 28-rack Blue Gene/Q system providing 458,752 PowerPC A2 processor cores, with each core capable of 4-way multithreading [14]. JUQUEEN uses a 5-dimensional torus network topology capable of achieving up to 40 GB/s, with latencies in the range of a few hundred nanoseconds. We compiled our code with the IBM XL compiler in version 12.1 with the compiler optimization level set to 5.

5. RESULTS AND DISCUSSION
In this section, we present single core performance results as well as scaling experiments run on these three supercomputing systems. The presented performance results are measured in MLUP/s, which stands for "million lattice cell updates per second". At the end of this section, we show simulation results of the Ag-Al-Cu system and compare them to experimental data.

5.1 Performance Results
Since the performance of the compute kernels depends on the composition of the simulation domain, as shown in Section 3.3, we executed our benchmarks separately for three representative parts of the domain. These different scenarios are labeled interface, solid, and liquid and correspond to the regions FΩ, SΩ, and LΩ defined in Section 2. The solid scenario consists purely of already solidified material, as is the case in a realistic simulation in the lower third of the domain. The interface scenario simulates only the solidification front, i.e. the middle third of the simulation domain. The upper part of the domain consists only of liquid phase and is represented by the liquid scenario.

5.1.1 Single Core Performance
The starting point of our HPC implementation was a general phase-field code written in C. One main design goal of this application code is flexibility, to allow rapid prototyping and testing of new models. A wide range of models is already implemented in this code: various phase-field models, a structural mechanics as well as a fluid mechanics solver, which can all be coupled with each other. As already mentioned above, for the specific model described here, simulations of large three dimensional domains are necessary to obtain physically meaningful results. This motivates the design and implementation of a new code that is highly optimized and parallelized to exploit the largest HPC systems available.

A straightforward re-implementation of the grand chemical potential based phase-field model in the waLBerla framework already yields significant performance improvements (Figure 6). Since this new implementation is targeted at one specific model, certain basic optimizations can already be applied easily to the compute kernels: while the original implementation makes heavy use of indirect function calls via function pointers at cell level, we either remove these indirections in the waLBerla version entirely, due to the specialization of the code to one specific model, or replace them with static polymorphism using C++ templates. Additionally, we replaced divisions where the denominator is guaranteed to be in a small set of values by a table lookup and a multiplication with the inverse. The number of division operations is further reduced by replacing inverse square root calculations, required for vector normalizations, with approximated values provided by a fast inverse square root algorithm [20].
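As an example of this class of optimization, the sketch below shows a fast inverse square root for double precision with two Newton refinement steps. The magic constant is the commonly cited double precision variant; the exact formulation used in the code may differ.

```cpp
// Sketch of a fast inverse square root: an integer bit trick gives an initial
// guess for 1/sqrt(x), which is refined by two Newton iterations.
#include <cstdint>
#include <cstring>

inline double fastInvSqrt(double x)
{
    double half = 0.5 * x;
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));              // reinterpret the double as an integer
    bits = 0x5FE6EB50C7B537A9ULL - (bits >> 1);        // initial approximation (double precision magic constant)
    double y;
    std::memcpy(&y, &bits, sizeof(y));
    y = y * (1.5 - half * y * y);                      // Newton step 1
    y = y * (1.5 - half * y * y);                      // Newton step 2 (improves accuracy)
    return y;                                          // approximately 1/sqrt(x), no division needed
}
```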

In a second step we implemented more advanced, aggressive optimizations. These include explicit SIMD vectorization in both computation kernels. Since the targeted architectures all have a vector width of four double precision values, the straightforward way of vectorizing the algorithm is to unroll the innermost loop, updating four cells in one iteration. While this technique is the only possible one for the µ-kernel, a more natural approach exists for the φ-kernel of our specific model: instead of handling four cells simultaneously, we can use a SIMD vector to represent the four phases of a cell. With this technique, the field is still updated cellwise, such that branching on a cell-by-cell basis becomes possible. This branching can significantly speed up the kernels, since some expensive terms have to be computed only for certain cell configurations. Vectorized kernels that handle four cells simultaneously can only take these shortcuts if the condition is true for all four cells. Since the single cell kernel version operates on less data per iteration compared to the four cell version, more intermediate values can be kept in vector registers, which would otherwise have to be spilled to memory. A drawback of the cellwise version is the need for various permute or rotate operations when computing terms that contain single components of the φ vector. This happens, for example, when expressions like $\phi_\alpha \sum_{\beta=1}^{4} \phi_\beta$ have to be evaluated.

Figure 5 shows the performance of three vectorized φ-kernel implementations. The first two implementations both use the cellwise vectorization approach: while the first version evaluates all terms for each cell, the second version contains tests that determine which terms have to be evaluated (cellwise with shortcuts). The third version uses four-cell vectorization and can only skip terms that are not required for all of the four cells. In all three parts of the domain, the single cell kernel with shortcuts performs best. The additional costs of the vector permute operations are alleviated by the ability to branch on a per-cell basis and the reduced number of required SIMD registers. For all further benchmarks, we use this fastest cellwise kernel with shortcuts to update the phase field.

Figure 5: Comparison of different vectorization strategies (cellwise, cellwise with shortcuts, four cells) for the φ-kernel in the interface, liquid, and solid scenarios on one SuperMUC core, block size chosen as 60³. Performance in MLUP/s for the φ-kernel only.

Since we apply two different vectorization strategies to the two compute kernels, the optimal data layout of the φ-field differs between them: the µ-kernel processes the data four cells at a time, so a structure-of-arrays (SoA) layout would be the best choice, while the fastest φ-kernel requires an array-of-structures (AoS) layout to be able to load a SIMD vector directly from contiguous memory. We choose a SoA layout since the µ-kernel has to load phase-field values of 38 cells (19 cells from φ(x, t) and 19 from φ(x, t + ∆t)), whereas the φ-kernel only has to load 7 cells. Due to the high computational intensity of the φ-kernel, no notable differences could be measured in the φ-kernel performance after a data layout change of the φ-field. Figure 6 shows the increase in performance of the new SIMD kernels compared to the basic implementation. By applying vectorization only, a maximal speedup of a factor of 4 can be expected. The higher speedup of a factor of 5 to 7 is due to the fact that, in addition to vectorization, further optimizations like common subexpression precomputation were applied at this stage. This comes at the cost of decreased flexibility: while in the basic implementation the code was structured according to the mathematical formulation of the model, the single terms of the model can hardly be recognized in the manually vectorized SIMD kernels. To decrease the maintenance effort for the various kernels, a regularly running test suite checks all kernel versions for equivalence.
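To illustrate the cellwise strategy and why permute operations appear, the sketch below keeps the four phase values of one cell in a single AVX register and evaluates φα Σβ φβ for all four phases at once. It is an isolated example, not the actual kernel code.

```cpp
// Cellwise vectorization sketch: the four phase-field values of one cell are
// held in one AVX register; horizontal adds and a lane permute broadcast the
// phase sum so that phi_alpha * sum_beta(phi_beta) is computed per lane.
#include <immintrin.h>

// input:  phi = {phi_1, phi_2, phi_3, phi_4} of one cell
// output: {phi_a * S} for each phase a, where S = phi_1 + phi_2 + phi_3 + phi_4
inline __m256d phiTimesPhiSum(__m256d phi)
{
    __m256d h    = _mm256_hadd_pd(phi, phi);                 // {p1+p2, p1+p2, p3+p4, p3+p4}
    __m256d swap = _mm256_permute2f128_pd(h, h, 0x01);       // exchange the 128-bit lanes
    __m256d sum  = _mm256_add_pd(h, swap);                   // S broadcast to all four lanes
    return _mm256_mul_pd(phi, sum);                          // phi_a * S per lane
}
```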

Figure 6 additionally shows the performance improvement due to optimizations we can make when the temperature is prescribed as a function of time and one space component z only. In this case, we precompute all temperature dependent terms once for each x-y-slice instead of computing them in each cell. Using this optimization increases the performance of the µ-kernel by approximately 20% and the performance of the φ-kernel by 80%.

While this technique predominantly improved the runtime of the φ-kernel, the staggered buffer optimization is targeted at the µ-kernel and the costly computation of the term ∇·(M∇µ − Jat), which is evaluated at staggered positions in the grid. Buffering and reusing half of these values increases the µ-kernel performance by almost a factor of two, which shows that the runtime of this kernel is dominated by the calculation of these staggered values. In the φ-kernel, the same technique is applied for buffering gradient values at staggered positions. In this case, the buffered values are not as expensive to compute as the buffered quantities in the µ-kernel; therefore this optimization leads only to slightly better performance for this kernel.

Up to now, all performance improvements affected all cells of the domain. While the kernel runtime for updating µ is, up to measurement error, equal in the complete domain, the φ-kernel runtimes vary slightly due to branches in a routine that projects the φ values back into the allowed simplex. The following optimization introduces additional branches by skipping the evaluation of terms depending on the configuration of the current cell. This implementation is labeled as the version "with shortcuts" in Figure 6 and was already included in the comparison of different vectorization strategies shown above. These branches lead to different runtimes of the kernels in different parts of the domain. The performance of the φ-kernel is increased predominantly in liquid parts of the domain, since there the computation of the coupling term ψ can be skipped entirely. The runtime of the µ-kernel is improved especially in solid cells due to a simpler calculation of the anti-trapping current in these cells.

All optimizations combined result in a total speedup of up to 80, depending on the architecture. This huge speedup compared to the original C implementation enables us to simulate sufficiently large domains in reasonable time to get physically meaningful results.

Since the relative speedup does not indicate to what extent the target architecture is utilized, we additionally investigated the absolute performance of our code. In the following, we focus on the single node performance of our optimized code, without the "shortcut" optimizations, since in this case the total number of executed floating point operations per cell can be determined exactly.

We show the performance analysis for the µ-kernel on a SuperMUC node in detail, and briefly present the results for the φ-kernel afterwards. First, we use a roofline performance model to determine if the code is memory or compute bound [34]. We measure the maximum attainable bandwidth using STREAM [22] on one node, resulting in a bandwidth of approximately 80 GiB/s. We assume that approximately half of the required data for one update can be held in cache, since in a symmetric D3C7 or D3C19 stencil half of the data can be reused if one x-y-slice of both fields completely fits into cache. For a typical block size of 40³, one x-y-slice of all four fields fits into the L2 cache. Under this assumption, half of the required values are held in the L2 cache and at most 680 bytes have to be loaded from main memory to update one cell. For one cell update, 1384 floating point operations are required, resulting in a lower bound for the arithmetic intensity of approximately two floating point operations per byte. Taking only the memory bottleneck into account, our code could achieve up to 126.3 MLUP/s on one SuperMUC node:

$$\frac{80\,\mathrm{GiB/s}}{680\,\mathrm{B/LUP}} \approx 126.3\;\mathrm{MLUP/s}.$$

The benchmark results presented in Figure 7 show that our code does not hit the bandwidth limit of 126.3 MLUP/s and therefore is not memory bound. Choosing a smaller block size of 20³, which fits entirely into the L3 cache, only changes the performance slightly. This is a further indication that the code is limited by in-core execution instead of memory bandwidth. To determine an upper bound for in-core execution, we compare the attained FLOP/s rate with the maximal possible rate. One SuperMUC core runs at a frequency of 2.7 GHz [3] and can perform 8 floating point operations per cycle, resulting in 21.6 GFLOP/s per core.

Figure 6: Performance of the individual optimization steps for the φ-kernel (left) and the µ-kernel (right): general purpose C code, basic waLBerla implementation, with SIMD intrinsics (single cell / four cells), with T(z) optimization, with staggered buffer, with shortcuts; run in interface blocks, block size 60³, for the liquid, solid, and interface scenarios (MLUP/s per kernel).

Figure 7: Intra-node scaling of the µ-kernel without shortcut optimization on one SuperMUC node, for block sizes 40×40×40 and 20×20×20 (MLUP/s of the µ-kernel vs. number of cores).

The achieved 4.2 MLUP/s per core correspond to 5.8 GFLOP/s and therefore to 27% of the peak performance of one core. The in-core execution time is therefore not limited by the total number of floating point operations. To understand why we cannot utilize the full computational capability of one core, we employ the Intel Architecture Code Analyzer (IACA) [1]. This tool inspects the assembly code of the kernel and statically predicts the estimated execution time on a given Intel architecture under the assumption that all data resides in the L1D cache. The IACA analysis shows that even though the code is fully vectorized, it can attain at most 43% of peak under ideal front-end, out-of-order engine, and memory hierarchy conditions. This is caused predominantly by an imbalance in the number of additions and multiplications as well as by latencies of division operations. The IACA report shows that further, highly architecture dependent optimizations are possible, for example manually reordering addition and multiplication instructions. These kinds of optimizations would however require significant development effort due to the high complexity of the kernel of several thousand lines of code. A similar analysis for the φ-kernel shows that this kernel is also compute bound and attains approximately 21% of peak performance on one SuperMUC core.
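For reference, the quoted utilization follows directly from the measured update rate and the operation count per cell:

$$P_{\mathrm{peak}} = 2.7\,\mathrm{GHz} \times 8\,\tfrac{\mathrm{FLOP}}{\mathrm{cycle}} = 21.6\,\mathrm{GFLOP/s}, \qquad 4.2\,\tfrac{\mathrm{MLUP}}{\mathrm{s}} \times 1384\,\tfrac{\mathrm{FLOP}}{\mathrm{LUP}} \approx 5.8\,\mathrm{GFLOP/s} \approx 0.27\,P_{\mathrm{peak}}.$$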

5.1.2 Scaling Results

Figure 8: Time spent in communication per time step for the φ- and µ-fields, each with and without communication overlap, on SuperMUC, block size 60³.

Before conducting scaling experiments, we evaluate the effect of the presented communication hiding schemes. All four possible combinations of hiding the communication of the φ- and µ-field were tested. Figure 8 shows the time spent in both communication routines. The amount of exchanged data is higher in the φ-communication, thus the overall communication times are higher in this case. As expected, the effective communication times decrease for both fields when communication hiding is enabled. The remaining time in the communication routines is spent on packing and unpacking messages, which cannot be overlapped. Overlapping the φ-communication introduces additional overhead since the µ-kernel has to be split up into two parts. As a consequence, the temperature dependent value computation, which is done once for each x-y-slice in the no-overlap case, has to be done twice when φ communication overlap is enabled. This overhead is much bigger than the benefit of communication hiding, thus the version with only µ communication hiding yields the best overall performance.

In order to test the scalability of our implementation, we execute weak scaling experiments on three of the largest German supercomputers. Weak scaling scenarios, where the domain size and the process count are increased by the same factor, are of practical relevance for this code, since the goal is to maximize the domain size while keeping the time to solution constant.

On SuperMUC, we scale three different scenarios up to the 32,768 cores which were available to us. The full machine could not be utilized at the time of writing due to the installation of a hardware upgrade. Figure 9 shows that, because of the "shortcut" optimization, we obtain slightly different runtimes for the three scenarios, the "interface" scenario being the slowest due to the higher workload in interface cells. In production runs, where all of the three block compositions occur in the domain, the runtime is dominated by the interface blocks. We experimented with various load balancing techniques offered by the waLBerla framework, which did, however, not decrease the total runtime significantly, because the moving window technique makes it possible to simulate only the interface region, such that, in production runs, most blocks have a composition similar to the "interface" benchmark. For the scaling experiments on Hornet and JUQUEEN, we only used the "interface" scenario, which is representative for the performance achieved in production runs.

On SuperMUC and Hornet we placed one MPI process on each physical core. Simultaneous multithreading (SMT) did not improve the overall performance on these machines. On JUQUEEN, however, 4-way SMT was employed to make full use of the in-order processing units. On this Blue Gene/Q machine we scaled up to 262,144 cores using 1,048,576 MPI processes.

5.2 Simulation Results
Due to the highly optimized code, we are able to study the evolution of microstructures in representative volume elements (RVE), which are required for a better understanding of the various pattern formations. The performed simulations allow us to study the evolution of microstructures, especially in three dimensions.

Common experimental analysis methods, like micrographs, only provide two dimensional information. These micrographs can be used for a visual validation, but do not provide information on the three dimensional structure. In comparison to experimental micrographs, we can show a good visual accordance with cross sections of our three dimensional simulations. A simulation with 2420 × 2420 × 1474 cells, simulated on the Hornet supercomputer, is depicted in Figure 10(a). In the experiment as well as in the simulation, the phases arrange in similar patterns as chained brick-like structures that are connected or form ring-like structures, as shown in Figure 10. Obtaining three dimensional information from experiments requires a large technical effort, using synchrotron tomography to resolve the different phases. A three dimensional reconstructed tomography result of directional solidification of the system Ag-Al-Cu from A. Dennstedt is depicted in Figure 10(b). Even with these methods, only the final sample can be reconstructed, whereas the time evolution cannot be observed. Using large-scale simulations, we found multiple microstructure characteristics which could not be observed up to now due to small domain sizes. Three dimensional simulations are crucial to capture all relevant physical phenomena: in two dimensional cross sections parallel to the growth front, the phases simply arrange in different lamellae as brick-like structures, while in three dimensions various splits and merges of these lamellae can be observed. Single lamellae of the simulation in Figure 10 are extracted in Figure 11. Figure 11(a) shows two connected lamellae of the phase Al2Cu, and Figure 11(b) shows two lamellae of the phase Ag2Al. The evolution of the microstructure, especially the splitting and merging of lamellae, is visible and allows us to study the stability of different phase arrangements. These phase arrangements influence the shape of the occurring phases and are of special importance for technical applications, since the microstructure strongly influences the mechanical properties, e.g. the elastic and plastic deformation behavior. Besides the good visual accordance of the large-scale simulations with two dimensional micrographs, good agreement with the experimental 3D data is also achieved. A more detailed comparison is shown in [16], and a quantitative comparison using Principal Component Analysis on two-point correlations is in preparation. In that publication, the quantitative necessity of these large domain sizes will also be presented. In addition to the good visual agreement with the experiments, the simulations allow us to conduct parameter variations under well-defined conditions, which provide excellent ways to study the underlying physical mechanisms of the resulting microstructure formation.

6. CONCLUSIONIn this work, we have developed and evaluated a massively

parallel simulation code for a thermodynamic consistentphase-field model, based on the grand potential approach,which runs efficiently on current supercomputing architec-tures. With this code, we simulated the directional solidifi-cation of ternary eutectics, including four phases and threechemical species. Since big domain sizes are required toobserve the formation of mircostructural patterns, a HPCimplementation is crucial to get physical results for the consid-ered scenario. Starting from a general purpose and validatedphase-field code of this complicated multiphase model forthe solidification of alloys, we developed a new, highly op-timized implementation that can effectively utilize modernsupercomputing architectures. We applied optimizations onvarious levels to the highly complex stencil code, rangingfrom scenario dependent model simplifications to architec-ture specific optimizations like explicit SIMD vectorization.Systematic node level performance engineering resulted in afactor 80 speedup compared to the original code, as well as25% of peak performance on node level. Additionally, com-munication hiding techniques were applied to optimize theperformance on system level, too. We have shown excellentscaling results of our code on three of the largest German su-percomputers: SuperMUC, Hornet, and Juqueen. Only withthis optimized implementation, it was possible to achieve thedomain sizes necessary to gain microstructure patterns thatare in good visual accordance with experimental data.

For future work, we plan to switch from the explicit Euler time stepping scheme to an implicit solver. With the presented optimized model, further parameter studies will be conducted. Furthermore, support for a wider range of architectures (Xeon Phi, GPU) is in preparation.
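As an illustrative argument (not an analysis from this work) for why an implicit solver is attractive: an explicit Euler discretization of a purely diffusive term with diffusivity D on a grid with spacing Δx is only stable for roughly Δt ≤ Δx²/(2dD), i.e. Δt ≤ Δx²/(6D) in three dimensions, so refining the grid by a factor of two shrinks the admissible time step by a factor of four. An implicit scheme lifts this restriction at the cost of solving a linear system per time step.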

7. ACKNOWLEDGMENTS
We are grateful to the High Performance Computing Center Stuttgart, the Leibniz Rechenzentrum in Garching, and the Jülich Supercomputing Center for providing computational resources. We also thank Dr. Anne Dennstedt from the German Aerospace Center (DLR) in Cologne for providing the three dimensional synchrotron tomography reconstruction and for many fruitful discussions.


Figure 9: Weak scaling on SuperMUC (left), Hornet (middle), and JUQUEEN (right), plotted as MLUP/s per core over the number of cores, with curves for the interface, liquid, and solid cases.

(a) Simulation result
(b) Tomography reconstruction of the experiment from A. Dennstedt

Figure 10: Three dimensional simulation and experimental results of directional solidification of the ternary eutectic system Ag-Al-Cu. Ring, connection, and chain structures are marked in both panels.

(a) Phase Al2Cu    (b) Phase Ag2Al

Figure 11: Extracted lamellae from the simulation depicted in Figure 10. The lamellae grew from left to right.


8. REFERENCES

[1] Intel Architecture Code Analyzer, version 2.1. https://software.intel.com/en-us/articles/intel-architecture-code-analyzer, 2015. [Online; accessed March 2015].
[2] TOP500 List. http://www.top500.org/lists/2014/11/, 2015. [Online; accessed April 2015].
[3] SuperMUC Petascale System. https://www.lrz.de/services/compute/supermuc/systemdescription/, 2015. [Online; accessed March 2015].
[4] Visualization and Computer Graphics Library. http://vcg.sourceforge.net, 2015. [Online; accessed January 2015].
[5] A. Choudhury, M. Kellner, and B. Nestler. A method for coupling the phase-field model based on a grand-potential. Current Opinion in Solid State and Materials Science, 2015. Accepted.
[6] A. Choudhury, M. Plapp, and B. Nestler. Theoretical and numerical study of lamellar eutectic three-phase growth in ternary alloys. Phys. Rev. E, 83:051608, May 2011.
[7] A. N. Choudhury. Quantitative phase-field model for phase transformations in multi-component alloys (Schriftenreihe des Instituts für Angewandte Materialien, Karlsruher Institut für Technologie). KIT Scientific Publishing, June 2013.
[8] A. Dennstedt and L. Ratke. Microstructures of directionally solidified Al-Ag-Cu ternary eutectics. Transactions of the Indian Institute of Metals, 65(6):777–782, 2012.
[9] A. Dennstedt, L. Ratke, A. Choudhury, and B. Nestler. New metallographic method for estimation of ordering and lattice parameter in ternary eutectic systems. Metallography, Microstructure, and Analysis, 2(3):140–147, 2013.
[10] J. Ettrich. Fluid Flow and Heat Transfer in Cellular Solids, volume 39. KIT Scientific Publishing, 2014.
[11] C. Feichtinger, S. Donath, H. Köstler, J. Götz, and U. Rüde. WaLBerla: HPC software design for computational engineering simulations. Journal of Computational Science, 2(2):105–112, 2011.
[12] M. Garland and P. S. Heckbert. Surface simplification using quadric error metrics. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pages 209–216. ACM Press/Addison-Wesley Publishing Co., 1997.
[13] A. Genau and L. Ratke. Morphological characterization of the Al-Ag-Cu ternary eutectic. International Journal of Materials Research, 103(4):469–475, April 2012.
[14] M. Gilge et al. IBM System Blue Gene Solution: Blue Gene/Q Application Development. IBM Redbooks, 2014.
[15] C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, and U. Rüde. A framework for hybrid parallel flow simulations with a trillion cells in complex geometries. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 35. ACM, 2013.
[16] J. Hötzer, M. Jainta, P. Steinmetz, B. Nestler, A. Dennstedt, A. Genau, M. Bauer, H. Köstler, and U. Rüde. Large scale phase-field simulations of directional ternary eutectic solidification. Acta Materialia, 2015.
[17] S. K. Kang, D.-Y. Shih, D. Leonard, D. W. Henderson, T. Gosselin, S.-i. Cho, J. Yu, and W. K. Choi. Controlling Ag3Sn plate formation in near-ternary-eutectic Sn-Ag-Cu solder by minor Zn alloying. JOM, 56(6):34–38, 2004.
[18] J. Kim, W. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. In Proceedings of the 35th International Symposium on Computer Architecture (ISCA '08), pages 77–88, June 2008.
[19] D. Lewis, S. Allen, M. Notis, and A. Scotch. Determination of the eutectic structure in the Ag-Cu-Sn system. Journal of Electronic Materials, 31(2):161–167, 2002.
[20] C. Lomont. Fast inverse square root. Technical report, 2003.
[21] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In ACM SIGGRAPH Computer Graphics, volume 21, pages 163–169. ACM, 1987.
[22] J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pages 19–25, 1995.
[23] N. Moelans. A quantitative and thermodynamically consistent phase-field interpolation function for multi-phase systems. Acta Materialia, 59(3):1077–1086, 2011.
[24] B. Nestler, H. Garcke, and B. Stinner. Multicomponent alloy solidification: Phase-field modeling and simulations. Physical Review E, 71:041609, 2005.
[25] M. A. Ruggiero and J. W. Rutter. Origin of microstructure in the 332 K eutectic of the Bi-In-Sn system. Materials Science and Technology, 13(1):5–11, 1997.
[26] D. Schneider, S. Schmid, M. Selzer, T. Böhlke, and B. Nestler. Small strain elasto-plastic multiphase-field model. Computational Mechanics, 55(1):27–35, 2015.
[27] D. Schneider, M. Selzer, J. Bette, I. Rementeria, A. Vondrous, M. J. Hoffmann, and B. Nestler. Phase-field modeling of diffusion coupled crack propagation processes. Advanced Engineering Materials, 16(2):142–146, 2014.
[28] D. Schneider, O. Tschukin, A. Choudhury, M. Selzer, T. Böhlke, and B. Nestler. Phase-field elasticity model based on mechanical jump conditions. Computational Mechanics, pages 1–15, 2015.
[29] T. Shimokawabe, T. Aoki, T. Takaki, A. Yamanaka, A. Nukada, T. Endo, N. Maruyama, and S. Matsuoka. Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–11. IEEE, 2011.
[30] T. Takaki, M. Ohno, T. Shimokawabe, and T. Aoki. Two-dimensional phase-field simulations of dendrite competitive growth during the directional solidification of a binary alloy bicrystal. Acta Materialia, 81:272–283, 2014.
[31] T. Takaki, T. Shimokawabe, M. Ohno, A. Yamanaka, and T. Aoki. Unexpected selection of growing dendrites by very-large-scale phase-field simulation. Journal of Crystal Growth, 382:21–25, 2013.
[32] J. Treibig, G. Hager, and G. Wellein. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures, San Diego, CA, 2010.
[33] A. Vondrous, M. Selzer, J. Hötzer, and B. Nestler. Parallel computing for phase-field models. International Journal of High Performance Computing Applications, 28(1):61–72, February 2014.
[34] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.
[35] V. Witusiewicz, U. Hecht, S. Fries, and S. Rex. The Ag-Al-Cu system. Part I: Reassessment of the constituent binaries on the basis of new experimental data. Journal of Alloys and Compounds, 385(1):133–143, 2004.
[36] V. Witusiewicz, U. Hecht, S. Fries, and S. Rex. The Ag-Al-Cu system. Part II: A thermodynamic evaluation of the ternary system. Journal of Alloys and Compounds, 387(1):217–227, 2005.
[37] A. Yamanaka, T. Aoki, S. Ogawa, and T. Takaki. GPU-accelerated phase-field simulation of dendritic solidification in a binary alloy. Journal of Crystal Growth, 318(1):40–45, 2011.
[38] A. Yamanaka, M. Okamoto, T. Shimokawabe, and T. Aoki. Multi-GPU computation of multi-phase field simulation of the evolution of metallic polycrystalline microstructure. TSUBAME ESJ: e-Science Journal, 13:18–22, 2014.