A Family of Assembly-free Physics-based Deflation Methods

for Large-scale Finite Element Analysis

By

Praveen Yadav

A dissertation submitted in partial fulfillment of

the requirements for the degree of

Doctor of Philosophy

(Mechanical Engineering)

at the

UNIVERSITY OF WISCONSIN-MADISON

2015

Date of final oral examination: 12/16/2015

The dissertation is approved by the following members of the Final Oral Committee:

Krishnan Suresh, Professor, Mechanical Engineering

Matthew S. Allen, Associate Professor, Engineering Physics

Dan Negrut, Associate Professor, Mechanical Engineering

Heidi-Lynn Ploeg, Associate Professor, Mechanical Engineering

Xiaoping Qian, Associate Professor, Mechanical Engineering

Abstract

Finite element analysis (FEA) is the most popular numerical method today for solving solid mechanics and other boundary value problems. FEA is used to solve a variety of problems, including static, modal, buckling, and transient problems. While the formulation for each of these problems may differ, they all necessitate solving a linearized system of equations.

Further, for large-scale problems, solving the linear system is often the computational bottleneck. Iterative solvers have been accepted as the preferred methods for such large systems. The two main challenges in iterative solvers are: 1) the cost of sparse matrix-vector multiplication (SpMV), and 2) the number of iterations required for convergence.

Several methods have been proposed by researchers to address both these challenges. In this thesis, a physics-motivated assembly-free deflated conjugate gradient (deflated-CG) method is presented. The physics-based deflation method presented here accelerates the convergence of conjugate gradient (CG); it is implemented through an efficient assembly-free SpMV method that exploits mesh congruence.

Thus, the main contribution of this thesis is to develop a family of assembly-free physics-based deflation techniques that can be applied to a variety of large-scale FEA problems in solid mechanics.

The deflation methods discussed in this thesis exploit the expected physical behavior of a structure. The proposed physics-based deflation allows one to use the specificity of known reduction techniques, such as plate and beam theory, to accelerate iterative methods in a robust and generalized 3D framework. Furthermore, the implementation of such physics-based deflation allows one to reduce the memory cost of assembling and storing global matrices.

The concept of ‘element-congruency’ is proposed to reduce the memory requirement for SpMV. Specifically, congruent elements exhibit similar stiffness, and can therefore share a single element stiffness matrix. This has not been explored in the scientific community. Element-congruency addresses the need for an efficient SpMV by limiting the memory required for assembly-free SpMV. For large-scale problems, numerical results show that exploiting element-congruency is very valuable for improving speed.

Numerical results show the efficiency and scalability of the proposed assembly-free deflated-CG for a variety of applications in FEA. The advantage of the proposed method is also illustrated through its application in topology optimization.

Acknowledgements

In memory of my mother, this thesis is dedicated to my loving parents, family and friends.

I am deeply grateful for the opportunity my advisor Prof Krishnan Suresh provided me with, by

allowing me to work in ERSL. Without his patience, guidance and constant support, this work

would not have been possible. He taught me the skills necessary to approach a research problem

systematically, and to be critical of any solutions that are applicable to those problems. His

expertise in finite elements and computational mechanics helped me grow as a researcher. I also

thank my committee members, Prof Matt Allen, Prof Dan Negrut, Prof Heidi-Lynn Ploeg and

Prof Xiaoping Qian for their valuable suggestions and comments over the course of this

program.

I extend many thanks to the ERSL members, old and new, for the enjoyable discussions related

to research and life in general. Thanks to all the friends at badminton and cricket for the

camaraderie outside the lab.

I thank my parents and my sister for their unquestionable belief in my ability, which I think they

over-estimate. I am especially thankful for my mother, Late Smt. Nisha Yadav and father, Shri

Parashu Ram Yadav, for providing me with every opportunity to succeed in life. Their

unwavering dedication to our family is something I can only hope to emulate. I would be remiss

if I didn’t thank my wife, Emily and my in-laws. They motivate me to improve in every aspect of

life.

Last but not least, I also extend my thanks to Mr. Stephen Colbert of ‘The Late Show with Stephen Colbert’.

Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1 Accelerating Large-scale Finite Element Analysis
1.1 Introduction to Finite Element Analysis
1.1.1 Finite Element Discretization
1.2 Linear Solvers
1.2.1 Direct
1.2.2 Iterative
1.3 Accelerating convergence
1.3.1 Preconditioning
1.3.2 Deflation
1.3.3 Physics-based deflation
1.4 Thesis overview
2 Physics-based Deflation
2.1 Convergence of conjugate gradient method for thin structure
2.2 Agglomeration
2.3 Planar-rigid-body
2.4 Rigid-body
2.5 Kirchhoff-Love plate
2.6 Euler-Bernoulli beam
2.7 Elastic Polynomial
2.8 Summary
3 Assembly-free Implementation
3.1 Congruence of elements
3.1.1 Geometric method
3.1.2 Stiffness method
3.2 FEA results with congruency
3.3 Special case of SpMV on Voxel-mesh
3.3.1 Element-connectivity based
3.3.2 Node-connectivity based
3.3.3 Single SpMV results
3.4 Assembly-free deflation
3.4.1 Prolongation
3.4.2 Restriction
3.4.3 Deflating stiffness
3.5 Summary
4 Assembly-free Finite Element Analysis
4.1 Assembly-free modal analysis
4.1.1 Rayleigh-Ritz conjugate gradient
4.1.2 Computing multiple modes
4.1.3 Subspace augmentation
4.1.4 Numerical results: modal analysis for Knuckle
4.1.5 Numerical results: modal analysis for Housing cover
4.2 Assembly-free static analysis
4.2.1 Numerical results: static analysis for Knuckle
4.2.2 Numerical results: static analysis of thin plate under pressure
4.2.3 Numerical results: static analysis of ‘Thomas’ engine
4.3 Assembly-free buckling analysis
4.3.1 Inverse iteration
4.3.2 Numerical results: buckling analysis of a square beam
4.3.3 Numerical results: buckling analysis of cylindrical column
4.3.4 Numerical results: buckling analysis of a rectangular column with a hole
4.4 Assembly-free large-deformation analysis
4.4.1 Total Lagrangian (TL) formulation
4.4.2 Numerical results: large-deformation analysis of beam
4.5 Summary
5 Application: topology optimization
5.1 Voxel-mesh in topology optimization
5.2 Buckling constrained optimization
5.2.1 Numerical results: optimizing a thin column
5.2.2 Numerical results: optimizing a thin plate
5.3 Impact of Assembly-free FEA in compliance optimization
5.4 Summary
6 Conclusion and future work
6.1 Conclusion
6.2 Future work
6.2.1 Effectiveness of Elastic-polynomials
6.2.2 Feature based deflation
6.2.3 Assembly-free non-linear FEA
6.2.4 Assembly-free SpMV for conforming mesh
References

List of Figures

Figure 1: Thin Plate example
Figure 2: Discretized domain
Figure 3: Displacement plot for thin plate
Figure 4: Modal plots for thin plate
Figure 5: Direct vs Iterative solve: as reported in [9]
Figure 6: Convergence of residual for thin plate.
Figure 7: Structures for large-scale FEA [12], [13]
Figure 8: A two-level geometric multi-grid.
Figure 9: a) Finite element mesh, b) agglomeration of nodes in 16 groups
Figure 10: Thesis overview
Figure 11: Thin Plate examples for convergence analysis
Figure 12: Convergence plot for thin plate
Figure 13: Convergence for Agglomeration
Figure 14: Agglomerated mode-shapes
Figure 15: Convergence for in-plane load: Planar rigid-body vs Agglomeration
Figure 16: Convergence for out-of-plane load: Planar-rigid-body
Figure 17: Rigid-body mode shapes
Figure 18: Convergence for out-of-plane load: Rigid body vs Agglomeration
Figure 19: Curvature effects in thin structures
Figure 20: Convergence: Kirchhoff-Love plate vs Rigid-body
Figure 21: Convergence for Euler-Bernoulli beam deflation
Figure 22: Convergence for out-of-plane loading: 1st order Elastic-polynomial vs Rigid-body
Figure 23: Element Congruency in Mesh
Figure 24: Distinct Element located around notch
Figure 25: Geometry and boundary conditions on L-bracket
Figure 26: Quad-mesh for L-bracket
Figure 27: Geometric congruence vs mesh size for various tolerances
Figure 28: Stiffness congruence vs mesh size for various tolerances
Figure 29: Reduced-stiffness congruence vs mesh size for various tolerances
Figure 30: Quad-mesh for L-bracket with 9600 elements
Figure 31: Stress and displacement error for 0.1% tolerance
Figure 32: Stress plot with maximum displacement (δ) and maximum stress (σ)
Figure 33: Knuckle with (a) Conforming Mesh, and (b) Non-conforming Mesh
Figure 34: Element connectivity based SpMV implementation
Figure 35: A beam geometry and its mesh
Figure 36: Assembly-free SpMV on the CPU with and without exploiting element-congruency [51]
Figure 37: GPU implementation of prolongation.
Figure 38: GPU implementation for restriction.
Figure 39: (a) Knuckle geometry and loading, and (b) Voxel mesh
Figure 40: Static (a) displacement, and (b) stress for knuckle
Figure 41: 100 and 1000 rigid-body groups
Figure 42: Convergence of DCG vs Jacobi-PCG
Figure 43: Loading on a thin plate
Figure 44: Convergence of DCG vs Jacobi-PCG for thin plate
Figure 45: CUDA Profile for Rigid-Body deflation
Figure 46: Structural problem over a Thomas engine.
Figure 47: Deflection from a 50 million DOF system.
Figure 48: (a) Knuckle geometry, (b) Conforming mesh, and (c) voxel-mesh
Figure 49: (a) Gear-housing: eigen-spectrum is desired, (b) Meshing can fail for complex structures
Figure 50: Brute-force voxelization of the structure
Figure 51: First Eigen-mode for Gear Housing
Figure 52: Buckling of a pinned-pinned beam
Figure 53: Predicted critical load using AFBA and SolidWorks [10]
Figure 54: Computing time vs #DOF for AFBA and SolidWorks [10]
Figure 55: Accuracy plot for Cylindrical Column: Buckling load vs #DOF
Figure 56: Computing time vs #DOF for cylindrical column
Figure 57: Rectangular column with a hole
Figure 58: Predicted critical load using AFBA and SolidWorks [10] for rectangular beam with hole
Figure 59: Computing time for rectangular beam with hole
Figure 60: Cantilever beam displacement for linear elastic vs large-deformation formulation
Figure 61: Mapping of current mesh through deformation gradient
Figure 62: Large-deformation analysis on beam
Figure 63: Displacement results for linear static FEA
Figure 64: Displacement results for large-deformation FEA
Figure 65: Convergence plot: CG w/o deflation vs Rigid-body
Figure 66: Convergence plot: Rigid-body vs 1st order Elastic-polynomial
Figure 67: Convergence plot: Rigid-body vs Euler-Bernoulli beam
Figure 68: Topology Optimization of 2D Cantilever
Figure 69: 3D Cantilever Beam
Figure 70: 3D Cantilever Beam Optimized
Figure 71: Geometry of 2D L-Bracket
Figure 72: (a) Conforming mesh for L-bracket, and (b) Optimized topology
Figure 73: (a) Grid mesh for L-bracket, and (b) Optimized topology
Figure 74: Optimized for minimum stress
Figure 75: Algorithm for buckling-constrained topology optimization
Figure 76: Thin column with compressive load
Figure 77: Stiff designs with different safety factors
Figure 78: Plate with compressive load
Figure 79: Optimized topologies for various safety factors

List of Tables

Table 1: Results for error in maximum displacement for different mesh sizes
Table 2: Results for error in maximum von Mises stress for different mesh sizes
Table 3: Assembly-Free SpMV Timing results
Table 4: Total iterations and time taken to solve knuckle with varying number of groups
Table 5: Time taken to solve the knuckle problem using SolidWorks [10] and proposed method.
Table 6: Comparison of Rigid-body deflation vs Kirchhoff-Love deflation
Table 7: Time taken to solve thin-plate with proposed method vs SolidWorks [10]
Table 8: First 5 eigen-modes for Knuckle: SolidWorks [10] vs SaRCG [54]
Table 9: Time for computing first-5 frequency of Knuckle
Table 10: Results for Computing Fundamental Frequency of Gear Housing
Table 11: Minimizing volume for Stiff structure
Table 12: Optimizing plate with buckling constraints
Table 13: Optimization speed for compliance minimization
Table 14: Comparison of optimization speed across various platforms

1 ACCELERATING LARGE-SCALE FINITE ELEMENT ANALYSIS

1.1 Introduction to Finite Element Analysis

Finite element analysis (FEA) is the most popular numerical method today for solving solid mechanics and other boundary value problems. FEA includes a family of analyses, such as static, modal, buckling, and transient analysis. Each of these analyses is modelled on governing principles such as equilibrium, conservation of energy, and conservation of momentum. FEA approximates the governing equations, which are often partial differential equations (PDEs), as a linearized system of algebraic equations [1].

For example, consider the static linear elasticity problem of a thin plate illustrated in Figure 1. The domain, Ω, is the entire volume enclosed by the boundary ∂Ω. The edges of the plate have a prescribed displacement, u, which is 0 for the given system. A traction, t, is applied on one of the free surfaces as a pressure in the normal direction.

Figure 1: Thin Plate example

The strong form of the equilibrium equation is a PDE that governs the stress tensor field over Ω [1], [2]. The weak form of the same equation can be expressed as a total potential energy functional [1], [2]:

$$\Pi = \frac{1}{2}\int_{\Omega} \varepsilon^{T} \sigma \, d\Omega - \int_{\partial\Omega} u^{T} t \, d\Gamma \qquad (1)$$

FEA solves for the displacement field by minimizing the functional described in Equation (1). The linear system of equations resulting from this minimization is described next.

1.1.1 Finite Element Discretization

The first step in FEA is breaking the domain into a finite element mesh [1], [2]. Figure 2 illustrates a typical finite element mesh, described in Equation (2). The subscript e represents an arbitrary element.

$$\Omega = \bigcup_{e} \Omega_{e} \qquad (2)$$

Displacement within an element is approximated through the nodal displacements, $u_e$, by using appropriate shape functions, N [2]. The next step is computing the element properties using those displacement approximations. For the given example, the element stiffness and the nodal force for each element are required. The detailed process of obtaining element properties using the shape functions can be found in [2], [3].
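As a minimal sketch of this step, the example below assumes a two-node 1D bar element with linear shape functions, which is a deliberate simplification of the 3D elements considered in this thesis. The function names and the parameters `E`, `A`, and `L` (Young's modulus, cross-sectional area, element length) are illustrative assumptions, not taken from the thesis.

```python
def shape_functions(xi):
    """Linear shape functions N1, N2 on the reference interval xi in [-1, 1]."""
    return [(1.0 - xi) / 2.0, (1.0 + xi) / 2.0]


def interpolate(u_e, xi):
    """Approximate the displacement inside the element from nodal values u_e."""
    return sum(n_i * u_i for n_i, u_i in zip(shape_functions(xi), u_e))


def bar_element_stiffness(E, A, L):
    """Element stiffness K_e = integral of B^T (E A) B dx over the element.

    For linear shape functions the strain-displacement row B = dN/dx is
    constant, so the integral reduces to E * A * L * B^T B.
    """
    B = [-1.0 / L, 1.0 / L]
    return [[E * A * L * B[i] * B[j] for j in range(2)] for i in range(2)]


# Hypothetical element data, chosen only for illustration.
k_e = bar_element_stiffness(E=2.0, A=3.0, L=4.0)
u_mid = interpolate([0.0, 2.0], xi=0.0)  # displacement at the element midpoint
```

The resulting matrix is the familiar (EA/L) [[1, -1], [-1, 1]] pattern; elements with the same E, A, and L produce identical matrices.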

The third step is assembling the element properties to obtain the linearized system of equations for the whole structure as:

$$K u = f \qquad (3)$$

Figure 2: Discretized domain

The coefficient matrix, also known as the global stiffness matrix K, is typically sparse, symmetric, and positive definite [2], [3].
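The assembly step can be sketched as follows, assuming a chain of identical two-node elements with one DOF per node and a dense list-of-lists matrix. Production codes store K in a sparse format; the dense storage, the helper names, and the load vector here are illustrative assumptions.

```python
def assemble_global_stiffness(connectivity, k_e, n_dofs):
    """Scatter-add one element matrix per element into the global matrix K."""
    K = [[0.0] * n_dofs for _ in range(n_dofs)]
    for dofs in connectivity:
        for a, row_dof in enumerate(dofs):
            for b, col_dof in enumerate(dofs):
                K[row_dof][col_dof] += k_e[a][b]
    return K


def fix_dof(K, f, dof):
    """Impose a zero prescribed displacement by removing one row and column."""
    keep = [i for i in range(len(f)) if i != dof]
    K_red = [[K[i][j] for j in keep] for i in keep]
    f_red = [f[i] for i in keep]
    return K_red, f_red


# Three elements in a row; element e connects global DOFs e and e + 1.
connectivity = [(0, 1), (1, 2), (2, 3)]
k_e = [[1.0, -1.0], [-1.0, 1.0]]
K = assemble_global_stiffness(connectivity, k_e, n_dofs=4)

# Unit load at the free end, left end fixed.
f = [0.0, 0.0, 0.0, 1.0]
K_red, f_red = fix_dof(K, f, dof=0)
```

Before the boundary condition is applied, K is only positive semi-definite because the rigid-body translation is unconstrained; the reduced matrix `K_red` is symmetric positive definite.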

The next step is solving the assembled system. Once solved, the solution vector, u, can be post-processed to compute stresses [3]. The displacement field obtained from u is shown in Figure 3.

Similar to the static linear problem, the finite element formulation can also solve a modal problem for the homogeneous solution of a spatial PDE in a dynamic system [2], [3]. The process of discretizing and computing element properties remains the same.

In discretized form, the first few natural modes of vibration can be approximated by solving the generalized eigenvalue problem [2], [3]:

$$K x = \lambda M x \qquad (4)$$

Here, K and M represent the stiffness and mass matrices respectively, and λ and x represent the eigenvalue and mode shape to be solved [2], [3]. The first few mode shapes are illustrated in Figure 4.
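For intuition, Equation (4) can be worked by hand for a two-DOF system. The sketch below assumes a 2x2 symmetric K with non-zero coupling and a diagonal (lumped) mass matrix, and solves the characteristic equation det(K - λM) = 0 directly. The matrices and function names are hypothetical, and this closed-form approach is not how large modal problems are solved.

```python
import math


def two_dof_eigenvalues(K, M_diag):
    """Roots of det(K - lam * M) = 0 for 2x2 symmetric K and diagonal M."""
    m1, m2 = M_diag
    a = m1 * m2
    b = -(K[0][0] * m2 + K[1][1] * m1)
    c = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    disc = math.sqrt(b * b - 4.0 * a * c)
    return (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)


def mode_shape(K, M_diag, lam):
    """Mode shape (up to scale) from the first row of (K - lam * M) x = 0.

    Assumes the two DOFs are coupled, that is K[0][1] != 0.
    """
    return [-K[0][1], K[0][0] - lam * M_diag[0]]


# Hypothetical two-spring, two-mass chain with unit stiffness and mass.
K = [[2.0, -1.0], [-1.0, 1.0]]
M_diag = (1.0, 1.0)
lam_1, lam_2 = two_dof_eigenvalues(K, M_diag)
x_1 = mode_shape(K, M_diag, lam_1)
```

The smaller eigenvalue corresponds to the fundamental mode; in a structural setting the natural frequency is the square root of the eigenvalue.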

Figure 3: Displacement plot for thin plate

Thus, FEA provides a valuable tool for analyzing several governing equations. The solution method for each governing equation may differ, but they all rely on a linear solver as an important step in the algorithm [3].

The degree of accuracy in a finite element analysis is governed by discretization parameters such as the number of elements, the types of elements, and the shape functions within the elements. For high accuracy, a large number of elements is often required during discretization [1], [2]. This brings us to the challenge of large-scale FEA, especially when the linear solve becomes the bottleneck. Existing methods for solving linear systems in FEA are discussed in the next section.

1.2 Linear Solvers

Linear solvers can be classified as direct or iterative; both are discussed in the following subsections.

Figure 4: Modal plots for thin plate: a) first mode, b) second mode

1.2.1 Direct

Direct solvers [4] are commonly favored for linear systems of moderate size. They are robust and well-understood, and rely on factoring the stiffness (coefficient) matrix. For a symmetric positive definite matrix, this is the Cholesky decomposition:

$$K = L L^{T} \qquad (5)$$

This is followed by a triangular solve:

$$u = L^{-T}\left(L^{-1} f\right) \qquad (6)$$
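Equations (5) and (6) can be illustrated with a small dense example. The sketch below is a textbook Cholesky factorization followed by forward and back substitution, written in plain Python for a hypothetical 3x3 system; sparse direct solvers in commercial packages are considerably more elaborate (reordering, supernodes, out-of-core storage), and none of that is shown here.

```python
import math


def cholesky(K):
    """Return lower-triangular L with K = L L^T (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L


def solve_cholesky(L, f):
    """Solve L L^T u = f by forward then back substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = f
        y[i] = (f[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    u = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T u = y
        u[i] = (y[i] - sum(L[k][i] * u[k] for k in range(i + 1, n))) / L[i][i]
    return u


# Hypothetical system: three unit springs in series, fixed at one end,
# unit load at the free end.
K = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
f = [0.0, 0.0, 1.0]
u = solve_cholesky(cholesky(K), f)
```

The factor L must be stored explicitly, and for sparse K it generally contains more non-zeros than K itself (fill-in), which is the source of the memory cost discussed next.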

The advantage of using direct solvers is that they terminate in a fixed number of steps. However, due to the explicit factorization, direct solvers are memory intensive [5]. To quote the ANSYS manual [6], “[sparse direct solver] is the most robust solver in ANSYS, but it is also compute- and I/O-intensive”. For large-scale problems with one million degrees of freedom (DOF) [6]:

- Approximately 1 GB of memory is needed for assembly.
- However, an additional 10 to 20 GB of memory is needed for factorization.

Since memory access is often the bottleneck, this translates to increased wall-clock time. A popular alternative is to use iterative solvers.

1.2.2 Iterative

Iterative solvers do not require decomposition of the stiffness matrix [7]. They start with an initial guess, $u_0$, that is used to compute the residual, $r_0$:

$$r_0 = f - K u_0 \qquad (7)$$

The residual is then used to update the solution [8]. The process is repeated until the residual is

less than a specified tolerance.
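One iterative solver of this kind is the conjugate gradient (CG) method, on which this thesis builds. The following is a minimal textbook sketch of unpreconditioned CG on a small dense matrix, assuming K is symmetric positive definite; the function names, the relative-residual stopping rule, and the test system are illustrative assumptions rather than the implementation used in the thesis.

```python
def matvec(K, x):
    """Dense matrix-vector product K x."""
    return [sum(k_ij * x_j for k_ij, x_j in zip(row, x)) for row in K]


def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def conjugate_gradient(K, f, tol=1e-10, max_iter=1000):
    """Solve K u = f; returns the solution and the number of iterations used."""
    u = [0.0] * len(f)                      # initial guess u_0 = 0
    r = [f_i - k_i for f_i, k_i in zip(f, matvec(K, u))]  # r_0 = f - K u_0
    p = list(r)
    rr = dot(r, r)
    f_norm = dot(f, f) ** 0.5 or 1.0
    for it in range(max_iter):
        if rr ** 0.5 <= tol * f_norm:       # residual below tolerance
            return u, it
        Kp = matvec(K, p)                   # the only product with K per iteration
        alpha = rr / dot(p, Kp)
        u = [u_i + alpha * p_i for u_i, p_i in zip(u, p)]
        r = [r_i - alpha * kp_i for r_i, kp_i in zip(r, Kp)]
        rr_new = dot(r, r)
        p = [r_i + (rr_new / rr) * p_i for r_i, p_i in zip(r, p)]
        rr = rr_new
    return u, max_iter


# Hypothetical 3-DOF system with exact solution [1, 2, 3].
K = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
f = [0.0, 0.0, 1.0]
u, iterations = conjugate_gradient(K, f)
```

Each iteration needs K only through the product K p, so the matrix never has to be factored, and in principle it never has to be stored in assembled form either.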

The scalability of direct and iterative methods, in terms of both the memory required and the time taken to solve a linear system, was compared by Mirzendehdal and Suresh in [9]. Their paper focuses on solving a structural dynamics problem via FEA. The algorithm presented in [9] requires a linear solver for:

$$K_{eff}\, u = f_{eff} \tag{8}$$

In FEA, linear systems of the form of Equation (8) are common [3]. The coefficient matrices are often labeled as effective stiffness (commonly for dynamics analysis [3]) or tangent stiffness (for non-linear FEA [3]).

The effective stiffness matrix in Equation (8) is a linear combination of K and M, as described in greater detail in [9]. A scaling analysis was performed to compare the direct and iterative methods available in SolidWorks [10] for solving Equation (8). Figure 5 illustrates their findings on the scalability of direct and iterative methods with respect to solution time and memory requirements.

It is evident that a direct solver scales poorly for large-scale FEA when compared to an iterative solver. Therefore, iterative solvers are the focus of this thesis.

The two main bottle-necks for iterative solvers are:

• Sparse matrix-vector multiplication (SpMV) for Ku, and

• Number of iterations required to converge to the solution.

Figure 5: Direct vs Iterative solve, as reported in [9]: a) Scalability analysis w.r.t. solution time, b) Scalability analysis w.r.t. memory required

The efficiency of SpMV is a very important aspect of an iterative solver, as it is the most expensive operation in any iterative algorithm. First, however, the convergence of an iterative solver is discussed.

For example, consider once again the thin plate elasticity problem described in Figure 1. When discretized using 550,000 hexahedral elements, it results in a linear system with 2 million unknown nodal displacements, also referred to as degrees of freedom (DOF). Using conjugate gradient (CG) as the iterative solver, without any preconditioner, the solution converges in approximately 6400 iterations to a specified tolerance of $10^{-8}$. Figure 6 illustrates the corresponding convergence plot.

An argument can be made that a thin plate does not require that many elements, or that a better element technology can be used to exploit the nature of the problem [11]. However, there are far more complex real-world problems that require a large number of elements and suffer from a similar issue of slow convergence. Figure 7 illustrates a few examples of large-scale FEA.

Figure 6: Convergence of residual for thin plate.

Accelerating convergence, therefore, remains an important challenge to be addressed in the

scientific community. In the next section, some of the existing methods to address the issue of

convergence are discussed.

1.3 Accelerating convergence

Faster convergence is usually achieved through a sequence of operations that reduce the condition number of the effective stiffness matrix [7], [14]. An overview of the types of operators used to this end is presented in the following sub-sections.

1.3.1 Preconditioning

Preconditioning is the process of applying a transformation so that instead of solving Equation (3), we solve [15]:

$$A^{-1} K u = A^{-1} f \tag{9}$$

where A is a characteristic preconditioner. The transformation can also be applied from the right side of the coefficient matrix [7], [14], [15], such that:

$$K A^{-1} \hat{u} = f; \quad u = A^{-1} \hat{u} \tag{10}$$

Figure 7: Structures for large-scale FEA [12], [13]

Preconditioning can also be split on either side of the coefficient matrix [7], [14], [15] through the following transformation:

$$A_1^{-1} K A_2^{-1} \hat{u} = A_1^{-1} f; \quad u = A_2^{-1} \hat{u} \tag{11}$$

A practical preconditioner should be inexpensive to compute, and the preconditioned system

should converge rapidly.

There are several general-purpose preconditioners available. The Jacobi preconditioner, which uses diagonal scaling, is one of the oldest methods [7]. It does not require assembly of the stiffness matrix, and is therefore scalable and easily parallelizable. However, it is not very effective for many ill-conditioned problems in solid mechanics [5]; this is confirmed later through numerical experiments. Methods such as Gauss-Seidel and Symmetric Successive Over-relaxation (SSOR) perform better than Jacobi, but have similar limitations [16].
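To make the effect of diagonal scaling concrete, the following sketch applies Jacobi scaling in the split form of Equation (11), with both factors taken as the square root of the diagonal of K (an assumption chosen here so that the scaled matrix stays symmetric). The 2-by-2 matrix is a hypothetical example in which poor diagonal scaling is the only source of ill-conditioning; it is not representative of the ill-conditioning that arises from the physics of thin structures.

```python
import math

def cond_2x2_spd(A):
    """Condition number (largest / smallest eigenvalue) of a 2x2 SPD matrix."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return (mean + radius) / (mean - radius)

def jacobi_scale(A):
    """Symmetric diagonal scaling D^{-1/2} A D^{-1/2} with D = diag(A)."""
    n = len(A)
    s = [1.0 / math.sqrt(A[i][i]) for i in range(n)]
    return [[s[i] * A[i][j] * s[j] for j in range(n)] for i in range(n)]

# Hypothetical badly scaled SPD matrix
K = [[100.0, 1.0], [1.0, 1.0]]
cond_before = cond_2x2_spd(K)               # roughly 101
cond_after = cond_2x2_spd(jacobi_scale(K))  # 1.1 / 0.9, roughly 1.22
```

Only the diagonal of K is needed, which is why this preconditioner requires no assembled matrix.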

The incomplete Cholesky (IC) preconditioner is perhaps the most robust and efficient preconditioner for symmetric matrices [15], [17]. It relies on an approximate Cholesky factorization where, for example, the lower-triangular matrix L from Equation (5) is forced to have the same sparsity pattern as K. However, its memory requirement becomes an issue for large-scale systems.

Preconditioning methods accelerate convergence through spectral transformation, i.e., they shift the eigen-values of the coefficient matrix closer together [14], [16], [18]. They do not affect the size of the problem. Improving convergence through dimensional reduction is considered under deflation-based methods [19]–[21], discussed next.

1.3.2 Deflation

Deflation relies on projection methods to reduce the size of the problem [21]. Solving this reduced system eliminates from the residual the eigen-modes that span the projected space [20], [21]. The process is described next.

Deflation starts with constructing a deflation-space or workspace, W, which is a rectangular matrix of size n-by-m, where m is far smaller than n. The workspace then deflates the linear system through the following projection operations [21]:

$$\bar{K} = W^T K W \tag{12}$$

$$r_W^{(i)} = W^T r^{(i)} \tag{13}$$

This results in a reduced m-by-m linear system in the projected deflation-space [21]:

$$\bar{K} u_W^{(i)} = r_W^{(i)} \tag{14}$$

The solution to this reduced system is then projected back to the solution space. The residual is then orthogonalized w.r.t. the projected vector through [21]:

$$r^{(i)} \leftarrow r^{(i)} - W u_W^{(i)} \tag{15}$$

The superscript, i, indicates that the operation has to be performed every iteration. The projection

into the deflation-space, represented by Equation (13), is referred to as restriction operation,

while the reverse is called prolongation [20], [21].

At this point, it is important to note that deflation-space, W, is effective when it spans low-

frequency modes of K [19]–[21].

Direct solvers can be used for Equation (14) since m << n. One can also use an iterative method to solve Equation (14), and further use another level of deflation nested inside the reduced-system solver. The method of using nested deflation is commonly referred to as multi-grid [22]–[24].

Geometric multi-grid uses finite element approximations to construct the deflation-space through a coarse mesh [22], [23]. Using a coarse mesh also allows geometric multi-grid to construct the reduced-system coefficient matrix through finite element methods rather than through the projection operation described in Equation (12). Figure 8 illustrates the basic concept behind a two-level geometric multi-grid method.

In the algebraic multi-grid method [20], the restriction and prolongation operators are

constructed in an algebraic fashion, rather than through a geometric mesh transfer. The

properties and performance are similar to that of the geometric multi-grid.

While multi-grid methods perform particularly well for scalar problems and for solid mechanics (vector) problems posed over ‘thick solids’, they are prone to Poisson locking and ill-conditioning for problems posed over ‘thin solids’ and composite materials [25]. Improvements over the multi-grid method for thin structures were proposed in [26], [27], where lower-dimensional models were used instead of coarse meshes, thus avoiding locking issues. The effectiveness of the method lay in exploiting the physical behavior of the structure.

Figure 8: A two-level geometric multi-grid.

This leads to physics-based deflation, discussed in the next sub-section.

1.3.3 Physics-based deflation

As mentioned earlier, an effective deflation-space is one that spans the low-frequency modes. Since computing the eigen-modes is typically expensive, Bulgakov et al. [28] suggested a simple agglomeration technique where finite element nodes are collected into a small number of groups. Then, to construct the W matrix, the nodes within each group are collectively treated as a rigid body. The motivation is that these agglomerated rigid-body modes mimic the low-frequency eigen-modes. The results were promising [28].

This opens up the possibility of constructing an entire family of deflation spaces that utilize an awareness of the physical behavior of the system. This thesis is an attempt to highlight the effectiveness and ease of implementation of such methods, and their application.

Figure 9: a) Finite element mesh, b) agglomeration of nodes in 16 groups

1.4 Thesis overview

In this thesis, a general-purpose iterative method called Assembly-Free Deflated-CG is proposed. This method can solve a large variety of problems in solid mechanics that require large-scale FEA. It is robust and easy to implement. It is also easy to integrate into existing applications, with very low additional memory cost. The layout of this thesis is shown in Figure 10.

Chapter 2 details a non-exhaustive list of physics-based deflation methods. Numerical experiments illustrate the improvement in convergence due to such methods. A comparison among a few of the implemented methods is presented.

Figure 10: Thesis overview

The ease of implementation of these deflation-spaces motivates us to find efficient ways of implementing the iterative solver in an assembly-free manner. In Chapter 3, the focus shifts towards the assembly-free implementation of SpMV and the corresponding deflation operations.

The main contribution of this thesis lies in identifying several physics-based deflations and then implementing them in an assembly-free FEA to solve large-scale problems with limited memory.

In Chapter 4, we discuss algorithms that can exploit assembly-free deflated-CG for different types of FEA. Numerical results presented in the chapter illustrate the performance of such algorithms when compared to the commercially available solvers in SolidWorks [10].

The application of assembly-free FEA to topology optimization is presented in Chapter 5. Concluding remarks and future work are discussed in the final chapter.

2 PHYSICS-BASED DEFLATION

In this chapter, an analysis of the physical aspects of some commonly used deflation spaces is presented. The focus is primarily on the convergence behavior of different deflation methods for two problems.

2.1 Convergence of conjugate gradient method for thin structure

The matrices considered in this thesis are symmetric positive definite. Conjugate gradient (CG) is the preferred iterative method for these types of systems [8]. As is well known, CG’s convergence can be poor if the stiffness matrix exhibits a high condition number, or if the eigen-values of the stiffness matrix are spread out [7], [16]. In solid mechanics, poor convergence of CG is fairly common, for example in thin structures, composite materials, multi-scale problems, etc. [25]–[27]. It is important to note that the physical nature of the boundary conditions plays a role in slow convergence.

To illustrate this, consider an example of static FEA with two types of loading on a thin plate. The dimensions of the plate are 100 mm × 100 mm × 2.5 mm. The load cases are in-plane and out-of-plane, as illustrated in Figure 11. Both cases have a uniformly distributed load of 100 N, applied on the same surface but in different directions.

The material used is alloy steel, with Young’s modulus $E = 2.1 \times 10^{11}$ and Poisson’s ratio $\nu = 0.28$. The plate is discretized using 600,000 hexahedral elements, resulting in 2 million degrees-of-freedom (DOF). Equation (3) is solved for the displacement vector. The stiffness matrix is the same for both cases. The plot in Figure 12 illustrates the convergence for both types of loading.

Figure 11: Thin plate examples for convergence analysis: a) In-plane load, b) Out-of-plane load

Figure 12: Convergence plot for thin plate

It is evident from the plot that the physical direction of the load changes the convergence behavior. Deflation techniques must therefore account for these physical effects of the problem.

The deflated-CG algorithm to solve such a system is described below (the derivation and theoretical analysis of the algorithm can be found in [21]):

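Algorithm 1: Deflated CG (DCG); solve $Kd = f$

1. Construct the deflated stiffness $\bar{K}$

2. Choose $d_0$ where $W^T r_0 = 0$ and $r_0 = f - K d_0$

3. Solve $\bar{K} u_W = W^T K r_0$ for $u_W$; $p_0 = r_0 - W u_W$

4. For $j = 1, 2, \ldots, m$, do:

5. $\alpha_{j-1} = \dfrac{r_{j-1}^T r_{j-1}}{p_{j-1}^T K p_{j-1}}$

6. $d_j = d_{j-1} + \alpha_{j-1} p_{j-1}$

7. $r_j = r_{j-1} - \alpha_{j-1} K p_{j-1}$

8. $\beta_{j-1} = \dfrac{r_j^T r_j}{r_{j-1}^T r_{j-1}}$

9. Solve $\bar{K} u_W = W^T K r_j$ for $u_W$

10. $p_j = \beta_{j-1} p_{j-1} + r_j - W u_W$

11. End-do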

Observe that when the deflation space W is far smaller than the size of the system:

• The SpMV operations in steps 5 and 9 are the primary computations.

• Additional computations include the restriction operation $W^T x$ in step 9, the prolongation $W u_W$ in step 10, and the solution of the reduced linear system in step 9.
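The steps of Algorithm 1 can be sketched in a few dozen lines of dependency-free Python. This is an illustrative sketch under stated assumptions rather than the implementation used in this thesis: the function names are hypothetical, dense lists stand in for sparse storage, the reduced system is solved by Gaussian elimination, and the starting vector is taken as $d_0 = W \bar{K}^{-1} W^T f$, which is one choice that satisfies $W^T r_0 = 0$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(K, x):
    return [dot(row, x) for row in K]

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting for the small m-by-m system."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            fac = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * m
    for r in reversed(range(m)):
        x[r] = (M[r][m] - sum(M[r][k] * x[k] for k in range(r + 1, m))) / M[r][r]
    return x

def deflated_cg(K, f, W_cols, tol=1e-8, max_iter=1000):
    """Deflated CG; W_cols holds the m columns of W, each of length n."""
    n = len(f)

    def restrict(x):                      # W^T x
        return [dot(w, x) for w in W_cols]

    def prolong(u_w):                     # W u_W
        return [sum(w[i] * c for w, c in zip(W_cols, u_w)) for i in range(n)]

    # Step 1: deflated stiffness K_bar = W^T K W
    KW = [matvec(K, w) for w in W_cols]
    K_bar = [[dot(wi, kwj) for kwj in KW] for wi in W_cols]
    # Step 2 (assumed choice): d_0 = W K_bar^{-1} W^T f, so that W^T r_0 = 0
    d = prolong(solve_dense(K_bar, restrict(f)))
    r = [fi - ki for fi, ki in zip(f, matvec(K, d))]
    # Step 3
    u_w = solve_dense(K_bar, restrict(matvec(K, r)))
    p = [ri - wi for ri, wi in zip(r, prolong(u_w))]
    rr = dot(r, r)
    f_norm = dot(f, f) ** 0.5 or 1.0
    for j in range(1, max_iter + 1):      # Step 4
        if rr ** 0.5 <= tol * f_norm:
            return d, j - 1
        Kp = matvec(K, p)
        alpha = rr / dot(p, Kp)                               # Step 5
        d = [di + alpha * pi for di, pi in zip(d, p)]         # Step 6
        r = [ri - alpha * ki for ri, ki in zip(r, Kp)]        # Step 7
        rr_new = dot(r, r)
        beta = rr_new / rr                                    # Step 8
        u_w = solve_dense(K_bar, restrict(matvec(K, r)))      # Step 9
        Wu = prolong(u_w)
        p = [beta * pi + ri - wi for pi, ri, wi in zip(p, r, Wu)]  # Step 10
        rr = rr_new
    return d, max_iter

# Hypothetical test: 1D tridiagonal (-1, 2, -1) system, 12 unknowns, 3 groups of 4
n = 12
K = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
f = [1.0] * n
W_cols = [[1.0 if i // 4 == g else 0.0 for i in range(n)] for g in range(3)]
d, iters = deflated_cg(K, f, W_cols)
```

The columns of W in this example are piecewise-constant indicator vectors, one per group of unknowns.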

The deflation-space accelerates the convergence of Algorithm 1 through steps 9 and 10. Several deflation methods and their physical effects on the system are discussed next.

2.2 Agglomeration

Agglomeration is the simplest deflation method used today. It was introduced by Nicolaides [19] as an ad-hoc method to approximate the eigen-modes of the system. The method relies on collecting a group of closely positioned nodes and treating the group as a point object with translational DOF. The mapping between the translational DOF of a group, $u_g$, and the nodal DOF, $u_n$, can be expressed as:

$$u_n = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} u_g \tag{16}$$

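To further understand the mapping given in Equation (16), consider a series of nodes, 1 through n, which lie in group g. The collective mapping of the group DOF of g to the nodal DOF of nodes 1 through n can be expressed as:

$$\begin{Bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{Bmatrix} = \begin{bmatrix} W_{1g} \\ W_{2g} \\ \vdots \\ W_{ng} \end{bmatrix} u_g; \quad \text{Here, } W_{1g} = W_{2g} = \cdots = W_{ng} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{17}$$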

With the mapping between any nodal and group DOF given in Equation (16), one can construct an assembled deflation matrix, W, that will project an assembled deflation-space vector, $u_W$, to the solution-space vector, u:

$$u = W u_W \tag{18}$$

The mapping operation in Equation (18) is referred to as prolongation and is repeated during the iterations every time the deflated system given in Equation (14) is solved.

One can use W to deflate the stiffness matrix as described in Equation (12). This is a one-time operation and does not have to be repeated during the iterations. Restricting the residual to the deflation space through Equation (13), on the other hand, takes place in every iteration.
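As a small illustration of Equations (16) to (18), the sketch below builds the agglomeration matrix W from a node-to-group assignment, with three translational DOF per node, and applies the prolongation and restriction operations. The function names, the dense storage of W, and the four-node example are assumptions for this sketch; in practice only the node-to-group map needs to be stored.

```python
def build_agglomeration_W(node_group, n_groups, dof_per_node=3):
    """Dense W of size (n_nodes*dof) x (n_groups*dof); every block W_ng is the identity."""
    n = len(node_group) * dof_per_node
    m = n_groups * dof_per_node
    W = [[0.0] * m for _ in range(n)]
    for node, g in enumerate(node_group):
        for k in range(dof_per_node):
            W[node * dof_per_node + k][g * dof_per_node + k] = 1.0
    return W

def prolong(W, u_W):
    """u = W u_W: copy each group's translation to all of its nodes."""
    return [sum(wij * uj for wij, uj in zip(row, u_W)) for row in W]

def restrict(W, r):
    """r_W = W^T r: sum the nodal residuals within each group."""
    m = len(W[0])
    return [sum(W[i][j] * r[i] for i in range(len(W))) for j in range(m)]

# Hypothetical example: 4 nodes, the first two in group 0 and the last two in group 1
node_group = [0, 0, 1, 1]
W = build_agglomeration_W(node_group, n_groups=2)
u = prolong(W, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
r_W = restrict(W, [1.0] * 12)
```

Here `u` repeats the translation (1, 2, 3) for the first two nodes and (4, 5, 6) for the last two, while `r_W` sums the two nodal contributions in each group.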

Using agglomeration as a deflation space, the thin-plate example from Figure 11 is solved. The

convergence plot for different number of agglomerated groups is shown in Figure 13.

It is clear from the plots that agglomeration accelerates CG much more effectively for in-plane load cases than for out-of-plane load cases. Physically, each group behaves like a block that is only allowed to move through translation. For in-plane load cases, the collection of block translations can approximate the mode-shape that dominates the residual. When out-of-plane modes are excited, a large number of groups is needed to approximate the mode-shape. Figure 14 illustrates how a mode-shape can be approximated by agglomerated groups.

Figure 13: Convergence for Agglomeration: a) Convergence for in-plane loading, b) Convergence for out-of-plane loading

The limitation in approximating out-of-plane mode shapes can be remedied by allowing the groups to rotate. This is precisely what the rigid-body method employs for deflation.

2.3 Planar-rigid-body

To overcome the restriction imposed by simple translation, the groups can be treated as rigid bodies. Rigid-body deflation was suggested by Bulgakov et al. [28]. It remains a popular deflation method for a variety of problems.

First consider rigid-body motion within a plane. If restricted to move only within the plane, any group’s translation and rotation DOF can be expressed as:

$$u_g = \begin{Bmatrix} u_0 \\ v_0 \\ \theta_0 \end{Bmatrix} \tag{19}$$

For the group’s DOF, one can construct a mapping to the nodal DOF of any node n that lies within the group through [28]:

$$u_n = W_{ng} u_g \quad \text{where} \quad W_{ng} = \begin{bmatrix} 1 & 0 & -y \\ 0 & 1 & x \\ 0 & 0 & 0 \end{bmatrix}; \quad x = x_n - x_g; \quad y = y_n - y_g \tag{20}$$

Figure 14: Agglomerated mode-shapes

The deflation may appear rank deficient, but if the forces dictate planar motion, the deflation is highly effective even for a 3D FEA. The convergence plot in Figure 15 shows the efficiency of planar-rigid-body deflation.

The rank deficiency does not become an issue for planar loads. This is primarily due to the modes excited by the in-plane load. If planar-rigid-body deflation is applied to the out-of-plane load, the deflation space will not correct the out-of-plane displacement, causing the residual to stagnate in a manner similar to regular CG, as shown in Figure 16.

Figure 15: Convergence for in-plane load: Planar rigid-body vs Agglomeration

The planar-rigid-body deflation has been exclusively used for 2D FEA in the past [5]. However,

it is evident that planar-rigid-body is equally effective for 3D FEA, if the physics of the problem

dictates planar displacements. For out-of-plane load, one can always use a more generalized

rigid-body deflation discussed next.

2.4 Rigid-body

The agglomerated rigid bodies for a general 3D system have 3 rotational DOF in addition to the 3 translational DOF. The group DOF vector, $u_g$, can be expressed as:

$$u_g = \{u_0, v_0, w_0, \theta_x, \theta_y, \theta_z\}^T \tag{21}$$

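These 6 DOF can be mapped into the nodal displacement of any node within the group through [28]:

$$u_n = W_{ng} u_g \quad \text{where} \quad W_{ng} = \begin{bmatrix} 1 & 0 & 0 & 0 & z & -y \\ 0 & 1 & 0 & -z & 0 & x \\ 0 & 0 & 1 & y & -x & 0 \end{bmatrix}; \quad x = x_n - x_g; \quad y = y_n - y_g; \quad z = z_n - z_g \tag{22}$$

Figure 16: Convergence for out-of-plane load: Planar-rigid-body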
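The rigid-body mapping can be sketched as a small function that returns the 3-by-6 block for one node. The sketch assumes the standard small-rotation kinematics, i.e. the nodal displacement is the group translation plus the cross product of the group rotation with the node position relative to the group center; the function names and the choice of group center are illustrative assumptions.

```python
def rigid_body_map(node_xyz, group_xyz):
    """3x6 block mapping (u0, v0, w0, theta_x, theta_y, theta_z) to nodal (u, v, w)."""
    x, y, z = (a - b for a, b in zip(node_xyz, group_xyz))
    return [
        [1.0, 0.0, 0.0, 0.0,   z,  -y],
        [0.0, 1.0, 0.0,  -z, 0.0,   x],
        [0.0, 0.0, 1.0,   y,  -x, 0.0],
    ]

def nodal_displacement(W_ng, u_g):
    return [sum(w * u for w, u in zip(row, u_g)) for row in W_ng]

# Hypothetical node one unit along x from the group center
W_ng = rigid_body_map((1.0, 0.0, 0.0), (0.0, 0.0, 0.0))
# A pure translation moves the node by the same amount
u_translate = nodal_displacement(W_ng, [1.0, 2.0, 3.0, 0.0, 0.0, 0.0])
# A small rotation about z moves the node along +y
u_rotate = nodal_displacement(W_ng, [0.0, 0.0, 0.0, 0.0, 0.0, 0.01])
```

Only the nodal coordinates and the group centers are needed to evaluate these blocks, so W never has to be stored explicitly.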

Figure 17 illustrates the mode shapes that can be approximated through a combination of rigid-body groups.

The addition of rotational DOF allows agglomerated groups to better approximate both the in-plane and out-of-plane mode shapes with fewer groups. This is evident from the convergence plot presented in Figure 18.

Figure 17: Rigid-body mode shapes

For the out-of-plane load, rigid-body deflation converges far faster than agglomeration for the same number of groups.

This also raises the question of why the deflation space for thin structures should be restricted to a rigid-body approximation. If significant bending is expected, one can expand the group DOF to include curvature sensitivity. Exploiting Kirchhoff-Love plate theory to include such curvature sensitivity becomes a clear choice.

2.5 Kirchhoff-Love plate

This was the first step towards expanding the deflation-space based on the physical nature of the problem, and was published in [29].

The low-order eigen-modes of thin solids involve curvature effects. This is illustrated schematically in Figure 19.

Figure 18: Convergence for out-of-plane load: Rigid body vs Agglomeration

To accommodate curvature, the rigid-body group DOF is extended with 2nd order derivative terms as follows:

$$u_g = \{u_0, v_0, w_0, \theta_x, \theta_y, \theta_z, w_{,xx}, w_{,yy}, w_{,xy}\}^T \tag{23}$$

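The 2nd order terms are mapped into nodal DOF using Kirchhoff-Love plate theory [30]. The resulting expression for nodal displacements through group DOF is then expressed as [29]:

$$u_n = W_{ng} u_g \quad \text{where} \quad W_{ng} = \begin{bmatrix} 1 & 0 & 0 & 0 & z & -y & -zx & 0 & -zy \\ 0 & 1 & 0 & -z & 0 & x & 0 & -zy & -zx \\ 0 & 0 & 1 & y & -x & 0 & \frac{x^2}{2} & \frac{y^2}{2} & xy \end{bmatrix}; \quad x = x_n - x_g; \quad y = y_n - y_g; \quad z = z_n - z_g \tag{24}$$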
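A sketch of the Kirchhoff-Love mapping for a single node is given below. It assumes the usual plate kinematics, in which the in-plane displacements pick up the terms $-z\,\partial w/\partial x$ and $-z\,\partial w/\partial y$ and the transverse displacement is expanded to second order about the group center; the signs follow from that assumption, and the function names are hypothetical.

```python
def kirchhoff_love_map(node_xyz, group_xyz):
    """3x9 block for group DOF (u0, v0, w0, th_x, th_y, th_z, w_xx, w_yy, w_xy)."""
    x, y, z = (a - b for a, b in zip(node_xyz, group_xyz))
    return [
        [1.0, 0.0, 0.0, 0.0,   z,  -y,      -z * x,         0.0, -z * y],
        [0.0, 1.0, 0.0,  -z, 0.0,   x,         0.0,      -z * y, -z * x],
        [0.0, 0.0, 1.0,   y,  -x, 0.0, 0.5 * x * x, 0.5 * y * y,  x * y],
    ]

def nodal_displacement(W_ng, u_g):
    return [sum(w * u for w, u in zip(row, u_g)) for row in W_ng]

# Hypothetical node at (2, 0, 0.5) relative to the group center
W_ng = kirchhoff_love_map((2.0, 0.0, 0.5), (0.0, 0.0, 0.0))
# Pure curvature w_xx = 0.1: w = 0.1 * x^2 / 2 and u = -z * x * 0.1
u_n = nodal_displacement(W_ng, [0.0] * 6 + [0.1, 0.0, 0.0])
```

The first six columns coincide with the rigid-body block, so the rigid-body deflation space is contained in this one.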

The convergence of Kirchhoff-Love plate deflation is compared with that of rigid-body deflation in Figure 20. The plot compares the convergence for an equal number of deflation-space DOF, and it is evident that Kirchhoff-Love plate deflation is an improvement over rigid-body deflation.

Figure 19: Curvature effects in thin structures

Kirchhoff-Love plate deflation can be compared to the dimensional reduction methods discussed in [26], [27]. However, instead of using a lower-dimensional FEA to construct the deflated stiffness matrix, algebraic reduction using W is sufficient. Furthermore, no meshing is required for any reduced model. Aggregating the nodes is sufficient to relate them to a deflation group.

2.6 Euler-Bernoulli beam

In the same research paper expanding on physics-based deflation [29], Euler-Bernoulli beam theory is also used for deriving a deflation space.

Unlike that of a thin plate, the curvature of a beam varies only along one major axis. Euler-Bernoulli beam theory can be used to extend deflation to beam-like problems [31]. For an Euler-Bernoulli beam, the group variables are:

$$u_g = \{u_0, v_0, w_0, \theta_x, \theta_y, \theta_z, w_{,xx}\}^T \tag{25}$$

Figure 20: Convergence: Kirchhoff-Love plate vs Rigid-body

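It is important to note that this is a constrained version of Kirchhoff-Love plate theory, in the same way that planar-rigid-body is a constrained version of rigid-body deflation. The nodal mapping expression therefore becomes [29]:

$$u_n = W_{ng} u_g \quad \text{where} \quad W_{ng} = \begin{bmatrix} 1 & 0 & 0 & 0 & z & -y & -zx \\ 0 & 1 & 0 & -z & 0 & x & 0 \\ 0 & 0 & 1 & y & -x & 0 & \frac{x^2}{2} \end{bmatrix}; \quad x = x_n - x_g; \quad y = y_n - y_g; \quad z = z_n - z_g \tag{26}$$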

The expression assumes that the displacement due to bending is predominantly in w. However, if bending is expected in an arbitrary direction, a simple transformation can be used to orient the problem such that the bending conforms to the nodal mapping given in Equation (26).

Euler-Bernoulli beam deflation can be used for the out-of-plane loading described in Figure 11(b). Figure 21 illustrates the effectiveness of Euler-Bernoulli beam deflation even for a plate problem.

Both Kirchhoff-Love plate and Euler-Bernoulli beam deflation stem from an understanding of the physical behavior of rigid-body deflation [29]. They exploit the expected displacement behavior based on the applied boundary conditions. Trial functions that satisfy the physics of the problem can also be used to construct a deflation space. One such family of trial functions is polynomial elastic fields [32].

2.7 Elastic Polynomial

The polynomial elastic fields for 3D problems were introduced by Wang et al. [32]. The work presented in [32] extends the general representation of the nth order homogeneous polynomial Airy stress function from 2D [33] to 3D elasticity problems. It is shown that for 3D elasticity problems, the polynomial stress field is obtained via “a 3D harmonic vector, p(x,y,z)” used as a trial function. The unknown coefficients (discussed later) in the trial function are determined by satisfying the linear elasticity PDE over prescribed control points.

Figure 21: Convergence for Euler-Bernoulli beam deflation

However, “A one-one relation between a 3D harmonic vector, p(x,y,z) of nth order and displacement field is shown as” [32]:

$$u = 4(1 - \nu)\, p - \nabla (x^T p); \quad \text{Here } x \text{ is a position vector} \tag{27}$$

For a 1st order polynomial, the harmonic vector can be expressed as:

$$p = \begin{Bmatrix} C_x^0 + C_x^1 x + C_x^2 y + C_x^3 z \\ C_y^0 + C_y^1 x + C_y^2 y + C_y^3 z \\ C_z^0 + C_z^1 x + C_z^2 y + C_z^3 z \end{Bmatrix} \tag{28}$$

The coefficients are all unknowns for the trial polynomial. Instead of solving for those coefficients through Airy stress functions [32] over control points, one can treat those unknowns as group DOF. These group DOF can be mapped into nodal DOF through:

$$u_n = W_{ng} u_g$$

Here,

$$W_{ng} = \begin{bmatrix} T_1 & 0 & 0 & T_2 x & -y & -z & T_1 y & 0 & 0 & T_1 z & 0 & 0 \\ 0 & T_1 & 0 & 0 & T_1 x & 0 & -x & T_2 y & -z & 0 & T_1 z & 0 \\ 0 & 0 & T_1 & 0 & 0 & T_1 x & 0 & 0 & T_1 y & -x & -y & T_2 z \end{bmatrix}$$

$$T_1 = 3 - 4\nu; \quad T_2 = 2 - 4\nu; \quad x = x_n - x_g; \quad y = y_n - y_g; \quad z = z_n - z_g$$

$$u_g = \{C_x^0, C_y^0, C_z^0, C_x^1, C_y^1, C_z^1, C_x^2, C_y^2, C_z^2, C_x^3, C_y^3, C_z^3\}^T \tag{29}$$
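As a worked check of the first row of the mapping, the x-component of the displacement can be expanded directly. The derivation below takes Equation (27) in the form $u = 4(1-\nu)p - \nabla(x^T p)$ and the first-order harmonic vector of Equation (28) with components $p_x = C_x^0 + C_x^1 x + C_x^2 y + C_x^3 z$ (and similarly for $p_y$, $p_z$); both forms are stated here as assumptions of the sketch.

$$
\begin{aligned}
u_x &= 4(1-\nu)\,p_x - \frac{\partial}{\partial x}\left(x\,p_x + y\,p_y + z\,p_z\right) \\
    &= (3 - 4\nu)\,p_x - x\,\frac{\partial p_x}{\partial x} - y\,\frac{\partial p_y}{\partial x} - z\,\frac{\partial p_z}{\partial x} \\
    &= T_1\left(C_x^0 + C_x^1 x + C_x^2 y + C_x^3 z\right) - x\,C_x^1 - y\,C_y^1 - z\,C_z^1 \\
    &= T_1 C_x^0 + T_2\,x\,C_x^1 - y\,C_y^1 - z\,C_z^1 + T_1\,y\,C_x^2 + T_1\,z\,C_x^3
\end{aligned}
$$

with $T_1 = 3 - 4\nu$ and $T_2 = T_1 - 1 = 2 - 4\nu$. The remaining two rows follow in the same way by differentiating with respect to y and z.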

A careful observation shows that the 1st order elastic polynomial captures all the DOF of a rigid-body group and adds the components of dilation to the group.

Rigid-body deflation is very efficient in capturing all 1st order displacement behavior because dilation does not play a significant role in linear elasticity problems. This is further illustrated by the convergence behavior of 1st order elastic-polynomial deflation when compared with rigid-body deflation. For the linear elastic out-of-plane load, the convergence of 1st order elastic-polynomial deflation is exactly the same as that of rigid-body deflation (shown in Figure 22).

While there is no significant advantage in 1st order elastic-polynomial deflation, the implementation opens up the possibility of using an elastic polynomial field of nth order as a deflation space. The number of DOF per group will increase rapidly with higher order polynomials, but the convergence may be faster, with fewer groups, than that of rigid-body deflation. However, further investigation is required to support this assertion.

2.8 Summary

Agglomeration [19] and rigid-body deflation [28] were implemented as general-purpose deflation methods. To this day, rigid-body deflation is one of the superior methods for large-scale problems [20]. The deflation process is algebraic and does not require information regarding the physics of the problem.

Figure 22: Convergence for out-of-plane loading: 1st order Elastic-polynomial vs Rigid-body

For complex structural analysis, there also exists a vast body of simplification methods based on dimensional reduction [27], [34]–[40] that utilize special element types suited to those models. The list is non-exhaustive. 3D shapes are simplified to shells, plates, beams, etc. to effectively reduce the complexity and improve convergence.

Then there are trial functions that exploit the closed-form solutions expressed by Airy stress functions in 2D and 3D to create shape functions for FEA, as described in [32], [33].

The first main contribution of this thesis lies in exploiting the physics of the problem to generate efficient deflation spaces for 3D FEA. The effectiveness of any of the methods listed lies in its ability to capture the physical behavior of the system. Physics-based deflation allows one to combine the specificity of various reduction techniques with the simplicity of general 3D FEA.

A handful of example deflation-spaces are introduced in this thesis, such as Kirchhoff-Love plate, Euler-Bernoulli beam [29], and Elastic polynomial (ongoing research). Deflation-spaces allow a problem to be defined in 3D and exploit dimensional reduction to solve the system efficiently through CG.

Most of the deflation-spaces discussed in this chapter rely only on nodal coordinates, which are readily available. They are easy to implement and require very little additional storage, as illustrated through the examples.

The second main contribution of this thesis is to exploit this fact in implementing a limited

memory deflated-CG for large-scale problems.

To achieve this, assembly-free implementation of SpMV and deflation operations is the focus of

discussion in the next chapter.

3 ASSEMBLY-FREE IMPLEMENTATION

There is a large volume of research on the efficient implementation of the sparse matrix-vector product (SpMV). Research involving efficient storage and access of sparse matrices is presented in [41]–[43]. There are also methods that exploit the computational capabilities provided by multi-core architectures [44]–[46]. One such technique is the assembly-free method.

The assembly-free method for iterative solvers was first proposed by Hughes et al. [47] in 1983. It was an attempt to parallelize the solution process. Since then there have been multiple attempts to implement the assembly-free method due to the growing interest in fine-grain parallelization [48]. The idea is to never assemble K, and instead to perform the SpMV at the element level. In other words, instead of the usual “assemble and then multiply”:

$$Ku = \left( \sum_{assemble} K_e \right) u \tag{30}$$

the strategy is to “multiply and then assemble”:

$$Ku = \sum_{assemble} \left( K_e u_e \right) \tag{31}$$

However, the assembly-free method is not particularly advantageous over the assembled approach unless: 1) the total memory consumption can be reduced, and 2) CG can be accelerated in an assembly-free way.

Accelerating CG was discussed in the previous chapter. The assembly-free implementation of those methods (deflation-spaces) is discussed later in this chapter. But first, we discuss ways to reduce memory access for SpMV.
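The difference between “assemble and then multiply” and “multiply and then assemble” can be sketched for a chain of two-node elements. The function names, the dense assembled matrix, and the 1D bar example are assumptions made for illustration; they are not the data structures used in this thesis.

```python
def assembled_matvec(elements, Ke_list, u):
    """Assemble the global K from the element matrices, then multiply."""
    n = len(u)
    K = [[0.0] * n for _ in range(n)]
    for dofs, Ke in zip(elements, Ke_list):
        for a, i in enumerate(dofs):
            for b, j in enumerate(dofs):
                K[i][j] += Ke[a][b]
    return [sum(K[i][j] * u[j] for j in range(n)) for i in range(n)]

def assembly_free_matvec(elements, Ke_list, u):
    """Multiply at the element level, then scatter-add into the result."""
    y = [0.0] * len(u)
    for dofs, Ke in zip(elements, Ke_list):
        ue = [u[i] for i in dofs]                       # gather element DOF
        for a, i in enumerate(dofs):
            y[i] += sum(Ke[a][b] * ue[b] for b in range(len(dofs)))
    return y

# Hypothetical chain of 4 identical bar elements on 5 nodes
Ke = [[1.0, -1.0], [-1.0, 1.0]]
elements = [(0, 1), (1, 2), (2, 3), (3, 4)]
Ke_list = [Ke] * len(elements)      # all elements share one stored matrix
u = [0.0, 1.0, 4.0, 9.0, 16.0]
y_assembled = assembled_matvec(elements, Ke_list, u)
y_free = assembly_free_matvec(elements, Ke_list, u)
```

Both routines return the same vector, but the second never forms K and, in this example, keeps only a single element matrix in memory.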

3.1 Congruence of elements

Storing and retrieving K is the primary memory access in CG. It can be reduced if element-congruency can be exploited. Element-congruency is an aspect that has not been explored in the scientific community.

The proposed element-congruency can be defined as:

Definition: For FEA problems, two elements e1 and e2 are said to be congruent within a specified tolerance ε if:

$$\frac{\left\| K_{e_2} - K_{e_1} \right\|}{\left\| K_{e_1} \right\|} \le \varepsilon; \quad \text{where } K_e \text{ is the element stiffness} \tag{32}$$

Note that both elements must have the same number of DOF for the above criterion to apply. It is also important to note that perfect congruency between any two elements in an unstructured mesh is improbable due to numerical errors in computation. A tolerance is therefore specified to limit the difference between the normalized values of the elements being compared. In the case of isotropic elements, element-congruency can also be decided by comparing geometry. A more formal approach to comparing elements is provided in the following sub-sections.
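The congruency criterion can be sketched as a relative-difference test on element stiffness matrices, followed by a greedy grouping of elements into distinct representative matrices. The Frobenius norm, the greedy first-match search, and the function names are assumptions of this sketch; the definition above does not prescribe them.

```python
def frobenius_norm(A):
    return sum(a * a for row in A for a in row) ** 0.5

def is_congruent(K1, K2, eps=1e-8):
    """True if ||K2 - K1|| <= eps * ||K1|| and both have the same number of DOF."""
    if len(K1) != len(K2):
        return False
    diff = [[b - a for a, b in zip(r1, r2)] for r1, r2 in zip(K1, K2)]
    return frobenius_norm(diff) <= eps * frobenius_norm(K1)

def find_distinct_elements(Ke_list, eps=1e-8):
    """Return the distinct element matrices and, per element, the index of its match."""
    distinct, match = [], []
    for Ke in Ke_list:
        for t, Kt in enumerate(distinct):
            if is_congruent(Kt, Ke, eps):
                match.append(t)
                break
        else:                               # no congruent matrix found so far
            distinct.append(Ke)
            match.append(len(distinct) - 1)
    return distinct, match

def bar_stiffness(length):
    """Hypothetical 2-DOF bar element with unit axial rigidity."""
    k = 1.0 / length
    return [[k, -k], [-k, k]]

# Five elements, of which only the fourth has a different length
Ke_list = [bar_stiffness(L) for L in (1.0, 1.0, 1.0, 0.5, 1.0)]
distinct, match = find_distinct_elements(Ke_list)
```

In this example only two matrices need to be stored for the five elements; each element carries just an index into the list of distinct matrices.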

Large-scale meshes have a significant number of congruent elements. For example, consider the

finite element mesh of a composite specimen [49] in Figure 23, consisting of 83000 elements;

the mesh was generated using ANSYS.

Through a simple congruency test [50], one can determine that the mesh contains only 322 distinct elements, i.e., less than 0.4% of the elements are geometrically and materially distinct within a specified tolerance of $\varepsilon = 10^{-8}$. The distinct elements are near a notch feature, as shown in Figure 24.

Congruent elements have ‘identical’ element stiffness matrices. For the assembly-free method, only the distinct element stiffness matrices need to be computed and stored, since all the elements are represented by the set of those distinct ‘template’ matrices.

While there are several methods to check for geometric congruency in different types of shapes [50], quadrilaterals and hexahedra are the only two shapes considered for congruency in this thesis.

Figure 23: Element Congruency in Mesh

Figure 24: Distinct Element located around notch

The difference between the elements can be quantified through one of the two methods as

discussed next.

3.1.1 Geometric method

In the geometric method, the first step is to compute local nodal coordinates. For each node n within the element e, they are defined about the centroid C of the element through:

$$\hat{X}_n = X_n - X_e^C \tag{33}$$

The local nodal coordinates are stacked into a single vector, $\hat{X}_e$. The normalized difference of this vector is used to compare the similarity between any two elements, say e1 and e2, through the following expression:

$$\frac{\left\| \hat{X}_{e_2} - \hat{X}_{e_1} \right\|}{\left\| \hat{X}_{e_1} \right\|} \le \varepsilon \tag{34}$$

This is a naïve implementation. The congruency check does not consider a scaling or rotation of

an element. However, with increase in mesh size, even this method yields a significant

percentage of congruent elements. For example, consider an L-bracket with a fillet. The top edge

is fixed with a tip load on the end of free hanging surface as shown in Figure 25.
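The geometric check can be sketched as follows: the nodal coordinates of each element are shifted to its centroid, stacked into one vector, and compared through a normalized difference. The centroid is taken here as the average of the nodal coordinates and the Euclidean norm is used; both are assumptions of this sketch, as are the function names.

```python
def local_coords(nodes):
    """Stack the centroid-relative coordinates of one element into a single vector."""
    dim = len(nodes[0])
    centroid = [sum(p[k] for p in nodes) / len(nodes) for k in range(dim)]
    return [p[k] - centroid[k] for p in nodes for k in range(dim)]

def norm(v):
    return sum(x * x for x in v) ** 0.5

def geometrically_congruent(nodes1, nodes2, eps=1e-8):
    X1, X2 = local_coords(nodes1), local_coords(nodes2)
    if len(X1) != len(X2):
        return False
    return norm([b - a for a, b in zip(X1, X2)]) <= eps * norm(X1)

# Hypothetical quadrilaterals, with nodes listed counter-clockwise
unit_quad = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted_quad = [(5.0, 3.0), (6.0, 3.0), (6.0, 4.0), (5.0, 4.0)]      # translated copy
renumbered_quad = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]   # same shape, rotated numbering
scaled_quad = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]

same_after_shift = geometrically_congruent(unit_quad, shifted_quad)
same_after_renumber = geometrically_congruent(unit_quad, renumbered_quad)
same_after_scale = geometrically_congruent(unit_quad, scaled_quad)
```

A translated copy is detected as congruent, whereas a rotated node numbering or a scaled copy is not, in line with the stated limitation of this naïve check.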

The geometry is discretized using different mesh sizes. An example of discretized geometry with

600 quadrilateral elements is shown in Figure 26.

Figure 25: Geometry and boundary conditions on L-bracket

Figure 26: Quad-mesh for L-bracket

The elements within the mesh are checked for congruency using different tolerances. For various acceptable tolerances, the congruence trend with increasing mesh size is illustrated in Figure 27. The default method is to compare element stiffness, as described in Equation (32).

3.1.2 Stiffness method

The stiffness matrix, $K_e$, of an element captures the sensitivity of a spatial element w.r.t. the reference element. Furthermore, the stiffness matrix accurately scales the effects of nodal displacements on the structure. Congruency through the stiffness method is therefore expected to be more accurate.

The congruency check for the different L-bracket meshes is repeated with the criterion defined in Equation (32). The congruency plots for various acceptable tolerances in Figure 28 illustrate that the stiffness method yields a slightly higher percentage of congruency than the geometric method.

Figure 27: Geometric congruence vs mesh size for various tolerances


While effective, computing the stiffness matrix for all the elements is still very expensive. One

only requires an effective sensitivity function to compare; an expression to map spatial element

to reference element without neglecting material properties. A careful expansion of element

stiffness can yield such an expression. For example, consider a typical element stiffness

expression, numerically integrated over multiple gauss points [3]:

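$$K_e = \sum_{\#GP} \left( N_{,\xi}^T \, (J^{-1})^T \, T^T E \, T \, J^{-1} \, N_{,\xi} \, |X_{,\xi}| \, wt \right)_{GP} \qquad (35)$$

Here,
$N_{,\xi}$ is the shape function gradient w.r.t. reference coordinates $\xi$;
$J^{-1} = \begin{bmatrix} X_{,\xi}^{-1} & 0 & 0 \\ 0 & X_{,\xi}^{-1} & 0 \\ 0 & 0 & X_{,\xi}^{-1} \end{bmatrix}$;
$X_{,\xi}$ is the Jacobian;
$T$ is a mapping of displacement gradient field to strains;
$E$ is material matrix;
$wt_{GP}$ is the weight of the Gauss point.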

Figure 28: Stiffness congruence vs mesh size for various tolerances


The shape function gradient in the above expression remains the same for all elements and

therefore can be omitted for reduced element stiffness computation. The reduced element

stiffness expression thus becomes:

$$\hat{K}_e = \sum_{\#GP} \left( (J^{-1})^T \, T^T E \, T \, J^{-1} \, |X_{,\xi}| \, wt \right)_{GP} \qquad (36)$$

Here, $\hat{K}_e$ is the reduced stiffness.

This reduced stiffness matrix does not require computing an entire element stiffness. It can be

used to compare elements similar to the stiffness comparison in Equation (34) as:

$$\frac{\lVert \hat{K}_{e2} - \hat{K}_{e1} \rVert}{\lVert \hat{K}_{e1} \rVert} \qquad (37)$$

The congruence results obtained through the above expression, plotted for increasing mesh sizes in Figure 29, resemble the plot shown in Figure 28.
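One possible reading of the reduced-stiffness comparison in Equations (36) and (37) is sketched below for a 4-node plane-stress quadrilateral. Everything specific here is an assumption made for illustration: the 2D setting, the 2x2 Gauss rule with unit weights, the ordering of the displacement-gradient vector, the Frobenius norm, and all function names.

```python
import numpy as np

# Corner signs of the bilinear reference element and the 2x2 Gauss points.
CORNERS = np.array([[-1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
GAUSS = CORNERS / np.sqrt(3.0)          # each point has unit weight

# T maps [u_x, u_y, v_x, v_y] to the strains [e_xx, e_yy, gamma_xy].
T = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])

def plane_stress_matrix(young=1.0, poisson=0.3):
    """Material matrix E for plane stress (illustrative values)."""
    c = young / (1.0 - poisson ** 2)
    return c * np.array([[1.0, poisson, 0.0],
                         [poisson, 1.0, 0.0],
                         [0.0, 0.0, (1.0 - poisson) / 2.0]])

def reduced_stiffness(nodes, E):
    """Sum over Gauss points of J^-T T^T E T J^-1 |X_xi| wt (Eq. 36)."""
    K_hat = np.zeros((4, 4))
    for xi, eta in GAUSS:
        # Shape function gradients w.r.t. (xi, eta): a 2 x 4 array.
        dN = 0.25 * np.array([CORNERS[:, 0] * (1.0 + eta * CORNERS[:, 1]),
                              CORNERS[:, 1] * (1.0 + xi * CORNERS[:, 0])])
        X_xi = dN @ nodes                                   # 2 x 2 Jacobian
        J_inv = np.kron(np.eye(2), np.linalg.inv(X_xi))     # block diagonal
        K_hat += (J_inv.T @ T.T @ E @ T @ J_inv) * abs(np.linalg.det(X_xi))
    return K_hat

def reduced_difference(nodes_e1, nodes_e2, E):
    """Normalized difference of the reduced stiffness (Eq. 37)."""
    K1 = reduced_stiffness(nodes_e1, E)
    K2 = reduced_stiffness(nodes_e2, E)
    return np.linalg.norm(K2 - K1) / np.linalg.norm(K1)

E = plane_stress_matrix()
square = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])
shifted = square + np.array([3.0, 1.0])
distorted = square.copy()
distorted[2] = [2.5, 2.2]

diff_shifted = reduced_difference(square, shifted, E)
diff_distorted = reduced_difference(square, distorted, E)
```

Note that the reduced matrix here is only 4x4, independent of the number of element DOF, which is the saving the text alludes to.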


It is interesting to note that irrespective of method used to determine congruency, the mesh

exhibits greater congruency with higher number of elements. This supports the earlier statement,

that large-scale meshes have significant number of congruent elements.

FEA results through the lens of congruency are discussed next.

3.2 FEA results with congruency

The results in this section are presented on a relative scale. For example, the maximum displacement for a mesh with identified congruency is reported as the % error relative to the same result obtained when no congruency is exploited.

The example problem is of the L-bracket illustrated in Figure 25. The L-bracket is discretized

using quad-mesh for different mesh-sizes. An example of 9600 element mesh of the discretized

L-bracket is shown in Figure 30.

Figure 29: Reduced-stiffness congruence vs mesh size for various tolerances


Table 1 and Table 2 list the percentage error results for maximum displacement and von Mises

stress respectively. Error values greater than 1% are highlighted.

Figure 30: Quad-mesh for L-bracket with 9600 elements

Table 1: Results for error in maximum displacement for different mesh sizes

#Elements   Congruency tolerance   % error in max. displacement
            (as error %)           Geometric   Stiffness   Reduced-stiffness
2400        0.1                    0.12        0.05        0.04
2400        0.2                    0.12        0.05        0.02
2400        0.5                    0.38        0.21        0.3
2400        1                      1.04*       0           0.3
4800        0.1                    0.16        0.07        0.07
4800        0.2                    0.36        0.16        0.2
4800        0.5                    0.73        0.24        0.31
4800        1                      2.18*       0.76        1.01*
9600        0.1                    0.36        0.32        0.33
9600        0.2                    0.02        0.38        0.17
9600        0.5                    0.82        1.09*       1.12*
9600        1                      1.68*       0.24        0.24

* error greater than 1%


Table 2: Results for error in maximum von Mises stress for different mesh sizes

#Elements   Congruency tolerance   % error in max. von Mises stress
            (as error %)           Geometric   Stiffness   Reduced-stiffness
2400        0.1                    0.03        0.02        0.01
2400        0.2                    0.11        0.04        0.07
2400        0.5                    0.86        0.75        0.74
2400        1                      1.2*        1.09*       0.63
4800        0.1                    0.12        0.04        0.05
4800        0.2                    0.24        0.2         0.21
4800        0.5                    1.33*       0.82        0.82
4800        1                      1           1.44*       1.41*
9600        0.1                    0.21        0.08        0.11
9600        0.2                    0.36        0.34        0.37
9600        0.5                    0.1         0.62        0.67
9600        1                      1.17*       0.1         0.11

* error greater than 1%

While there is no discernible trend, it is quite clear that tolerance for congruency should be less than 0.5% error in the norm, if errors in results are not to exceed 1%. For a tolerance of 0.1%, the errors for stress and displacements are well within 1% as shown in Figure 31.


The variability in the error plots can be attributed to the greedy nature of the congruency algorithm implemented, as it stops checking for congruency with other ‘templates’ as soon as the tolerance is satisfied. Figure 32 illustrates a 9600 quad-element mesh providing small variation in results for the same amount (87%) of congruency. This variability in results indicates a need for refinement in implementation.

Figure 31: Stress and displacement error for 0.1% tolerance: (a) stress error plot, (b) displacement error plot
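The greedy behaviour described above can be illustrated with a short sketch. It is not the implementation used for the results; the function name and the scalar "signatures" are hypothetical stand-ins for the stacked coordinate or stiffness vectors. Each element is assigned to the first stored template that satisfies the tolerance, so the outcome depends on the order in which elements are visited.

```python
import numpy as np

def greedy_congruency(signatures, tol):
    """Assign each element to the first template within the relative tolerance."""
    templates = []       # signatures retained as templates
    template_of = []     # index of the template chosen for each element
    for sig in signatures:
        for t, ref in enumerate(templates):
            if np.linalg.norm(sig - ref) <= tol * np.linalg.norm(ref):
                template_of.append(t)    # stop at the first acceptable match
                break
        else:
            templates.append(sig)        # no match: the element becomes a template
            template_of.append(len(templates) - 1)
    return template_of, templates

# Three nearly identical 'elements', visited in two different orders.
a, b, c = np.array([1.000]), np.array([1.004]), np.array([1.008])
map_abc, templates_abc = greedy_congruency([a, b, c], tol=0.005)
map_bac, templates_bac = greedy_congruency([b, a, c], tol=0.005)
```

In the first ordering two templates are retained, while in the second a single template covers all three elements; this order dependence is one plausible source of the variability noted above.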

Identifying congruent elements is not the only challenge. Once congruency is established, the SpMV operation should be optimized to exploit the congruent elements. Simply storing fewer elements is not sufficient if retrieving the templates is not scheduled efficiently during SpMV computations. This challenge concerns the memory-management aspect of the problem.

Figure 32: Stress plot with maximum displacement (δ) and maximum stress (σ)


To understand the advantage of proper memory management in detail, consider a special case of

voxel-mesh.

3.3 Special case of SpMV on Voxel-mesh

Voxel-mesh is a structured grid which is not constrained to conform to the boundary of a given

geometry. An example of voxel mesh is illustrated in Figure 33.

Computing and storing only one element property dramatically reduces the memory footprint,

and therefore accelerates SpMV. Using one template element, an efficient implementation of

SpMV on multi-core architecture is discussed next.

Direct implementation of Equation (31) suggests that one assign a thread to each element and

update the result element-by-element. However, this will create a race condition when a nodal

index connected to multiple elements is simultaneously accessed. Therefore, a thread is assigned

to a node instead of an element.

Figure 33: Knuckle with (a) Conforming Mesh, and (b) Non-conforming Mesh


Once a thread is assigned to a node, SpMV is implemented in two ways. The result of Ku for

nodal DOF is computed either based on element connectivity or node connectivity. Both

implementations are discussed in the following sub-sections.

3.3.1 Element-connectivity based

In the element-connectivity based method, the first step is gathering the indexes of the neighboring elements. For each element, the rows of stiffness values associated with the node are gathered. There are 8 such sets of row values, depending upon the location of the node within the element. These sets of row values are used for a dot product with the nodal DOFs of the nodes within the element; this is illustrated in Figure 34. This ensures that the product $K_e u_e$ is computed without race conditions. The implementation was first presented in [51].
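The node-wise gather can be illustrated on a small structured grid. The sketch below is a simplified serial analogue, not the GPU code of [51]: it assumes a 2D grid of identical 4-node elements with one DOF per node (so there are 4 row sets rather than 8), uses the template stiffness of the scalar Laplace operator on a square bilinear element, and all function names are hypothetical. Each loop iteration over a node plays the role of one thread.

```python
import numpy as np

def connectivity(i, j, nx):
    """Node indexes of element (i, j), counter-clockwise from the lower-left node."""
    nid = lambda a, b: b * (nx + 1) + a
    return [nid(i, j), nid(i + 1, j), nid(i + 1, j + 1), nid(i, j + 1)]

def spmv_element_scatter(Ke, u, nx, ny):
    """Reference version: element-by-element scatter (a race if run in parallel)."""
    out = np.zeros_like(u)
    for j in range(ny):
        for i in range(nx):
            conn = connectivity(i, j, nx)
            out[conn] += Ke @ u[conn]
    return out

def spmv_node_gather(Ke, u, nx, ny):
    """One 'thread' per node: gather the node's row of Ke from each adjacent element."""
    # (offset of the adjacent element, local index of the node in that element)
    adjacent = [((0, 0), 0), ((-1, 0), 1), ((-1, -1), 2), ((0, -1), 3)]
    out = np.zeros_like(u)
    for j in range(ny + 1):
        for i in range(nx + 1):
            acc = 0.0
            for (di, dj), local in adjacent:
                ei, ej = i + di, j + dj
                if 0 <= ei < nx and 0 <= ej < ny:
                    acc += Ke[local] @ u[connectivity(ei, ej, nx)]
            out[j * (nx + 1) + i] = acc      # each node writes only its own entry
    return out

# Single template stiffness shared by every element.
Ke = np.array([[4.0, -1.0, -2.0, -1.0],
               [-1.0, 4.0, -1.0, -2.0],
               [-2.0, -1.0, 4.0, -1.0],
               [-1.0, -2.0, -1.0, 4.0]]) / 6.0
nx, ny = 3, 2
u = np.random.default_rng(0).standard_normal((nx + 1) * (ny + 1))
ku_scatter = spmv_element_scatter(Ke, u, nx, ny)
ku_gather = spmv_node_gather(Ke, u, nx, ny)
```

Both versions give the same product; the gather version never has two nodes writing to the same entry, which is the property that removes the race condition.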

Figure 34: Element connectivity based SpMV implementation


3.3.2 Node-connectivity based

In the node-connectivity based method, all the neighboring node indexes are gathered. There are a total of 255 unique node arrangements possible for any given node, since each of the 8 voxels surrounding a node is either present or absent and at least one must be present (2^8 - 1 = 255). The nodal rows of K for these 255 possibilities are computed and stored as preprocessed information. The result for nodal DOFs is then computed through a dot product of the nodal row of K with the u vector of neighboring nodes.

The memory access for gathering nodal DOF is unfortunately not coalesced since the DOFs are

staggered based on element connectivity. However, once the result is computed the update in

device memory is coalesced.

3.3.3 Single SpMV results

A simple mesh shown in Figure 35 is used to illustrate the advantage of mesh congruency. The

mesh consists of all identical elements.

The experiment was conducted on a Windows 7 64-bit machine with following specification:

AMD Phenom™ II X4-955 processor running at 3.2GHz with 4GB of memory; OpenMP

commands were used to parallelize CPU code.

NVidia GeForce GTX 480 (448 cores) with 0.75GB of device memory; CUDA SDK 4.0 [52] and CuBLAS library [53] were used for implementation.

Figure 35: A beam geometry and its mesh


The computations were performed in double precision.

Timing results for a single SpMV, i.e., a single Ku, with assembly-free implementations are

summarized in Figure 36. The overhead of computing the global K has been neglected. All the

element stiffnesses are stored to have an equivalent effect as assembling a global K. With

congruence exploited, the memory requests are much faster as all of the elements are mapped to

single element stiffness. Once the element is fetched, the data remains in cache for quick access.

As the number of elements increases, a speed-up of 10x can be achieved in SpMV.

Furthermore, Table 3 lists the timing results for a single Ku (SpMV) computed using GPU and

CPU implementation for both element-connectivity and node-connectivity based methods. The

SpMV was performed for a 1 million DOF linear system.

Figure 36: Assembly-free SpMV on the CPU with and without exploiting element-congruency[51]


Table 3: Assembly-Free SpMV Timing results

Assembly-free implementation       Time (ms)
Element-connectivity based CPU     64
Node-connectivity based CPU        33
Element-connectivity based GPU     14
Node-connectivity based GPU        3.5

The implementation, specific to voxel-mesh, illustrates the advantage of reducing the memory footprint for SpMV. With congruence exploited in SpMV, it is time to revisit the deflation space.

3.4 Assembly-free deflation

The assembly-free implementation of the deflation operation on a Graphics Processing Unit (GPU) is discussed in this section. This implementation was also presented in [51]. The GPU is used as an example of a commonly available multi-core system. The general principle of implementation would remain the same on other architectures; however, some details such as block, warp, etc. are specific to GPU programming [52].

3.4.1 Prolongation

The prolongation operation presented in Equation (17) is straightforward on a GPU system. Since the vector is projected from deflation-space to solution-space, each thread can be assigned to a node without any concern for race conditions.

The group number is identified for each node, and $W_{ng} u_g$ is computed and stored for the nodal DOF, $u_n$. Memory access for prolongation is coalesced for the most part. The nodes can gather the nodal coordinates (x,y,z) in a lock-step method. Gathering the corresponding group DOF has the potential for latency due to sequential access. However, the length of the vector associated with a group is small, therefore this is not a serious issue. The results update is fully coalesced. Figure 37 illustrates the process for a single thread in a multi-core system.
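A serial sketch of node-wise prolongation is given below. The form of the nodal block $W_{ng}$ is an assumption made for illustration (a rigid-body mode with three translations and three small rotations about the group centroid); the definition actually used is the one in Equation (17). The function names are hypothetical, and each loop iteration stands in for one thread.

```python
import numpy as np

def skew(r):
    """Matrix [r]x such that skew(r) @ v equals np.cross(r, v)."""
    x, y, z = r
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def rigid_body_block(coord, centroid):
    """Assumed 3 x 6 block W_ng: u_n = translation + rotation x (x_n - centroid)."""
    r = np.asarray(coord, dtype=float) - np.asarray(centroid, dtype=float)
    return np.hstack([np.eye(3), -skew(r)])

def prolongate(coords, group_of_node, centroids, u_groups):
    """Compute u_n = W_ng u_g independently for every node."""
    u_nodes = np.empty((len(coords), 3))
    for n, (x, g) in enumerate(zip(coords, group_of_node)):
        u_nodes[n] = rigid_body_block(x, centroids[g]) @ u_groups[g]
    return u_nodes

# One group with two nodes; the group DOF are a translation (1, 2, 3)
# plus a small rotation of 0.1 about the z-axis.
coords = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
group_of_node = np.array([0, 0])
centroids = np.array([[0.0, 0.0, 0.0]])
u_groups = np.array([[1.0, 2.0, 3.0, 0.0, 0.0, 0.1]])
u_nodes = prolongate(coords, group_of_node, centroids, u_groups)
```

Because every node writes only its own three displacement components, no synchronization is required, which is consistent with the remark above about race conditions.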

3.4.2 Restriction

The restriction operation $W_{ng}^T r_n$ is much more challenging to parallelize on the GPU due to potential race conditions. Instead of assigning a thread to each node, a block of threads is assigned to a group. Each thread in the block is assigned to a node which lies within the group. Nodal projections are computed for each thread using disassembled Equation (14) expressed as:

$$r_g = \underset{\text{nodes in } g}{\text{assemble}} \; W_{ng}^T r_n \qquad (38)$$

These nodal projections are saved in shared memory within the block; this is illustrated in Figure 38. Threads are synchronized after the shared memory update. A reduce operation is performed on respective DOFs of the nodal projection to yield the resultant vector for the group. The allowable number of threads within the block is thus restricted by the shared memory.

The memory access for this part of the implementation is not coalesced either, as nodes that belong to a group may skip a large sequence of indexes. As shown in Figure 38, the warp may end up with coalesced memory access if a contiguous sequence of indexes is assigned for restriction.

Figure 37: GPU implementation of prolongation.
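The group-wise reduction of Equation (38) can be mimicked serially as follows. As before, this is only a sketch under an assumed rigid-body form of $W_{ng}$ (three translations and three small rotations about the group centroid); the function names are hypothetical. The outer loop corresponds to a thread block per group, the list of nodal projections to the shared-memory buffer, and the final sum to the block-level reduce.

```python
import numpy as np

def skew(r):
    """Matrix [r]x such that skew(r) @ v equals np.cross(r, v)."""
    x, y, z = r
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def rigid_body_block(coord, centroid):
    """Assumed 3 x 6 nodal block W_ng for a rigid-body group."""
    return np.hstack([np.eye(3), -skew(coord - centroid)])

def restrict(coords, group_of_node, centroids, r_nodes):
    """r_g = sum over nodes in g of W_ng^T r_n (Eq. 38), one group at a time."""
    r_groups = np.zeros((len(centroids), 6))
    for g, centroid in enumerate(centroids):
        members = np.flatnonzero(group_of_node == g)
        # Nodal projections: one per 'thread' in the block.
        partial = [rigid_body_block(coords[n], centroid).T @ r_nodes[n]
                   for n in members]
        # Reduction over the block (every group is assumed to be non-empty).
        r_groups[g] = np.sum(partial, axis=0)
    return r_groups

# Two nodes loaded by an equal and opposite force pair (a pure couple).
coords = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
group_of_node = np.array([0, 0])
centroids = np.array([[0.0, 0.0, 0.0]])
r_nodes = np.array([[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
r_groups = restrict(coords, group_of_node, centroids, r_nodes)
```

With this choice of $W_{ng}$, the restricted vector is simply the net force and net moment of the nodal residuals about the group centroid; for the couple above it is a zero force and a moment of 2 about the z-axis.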

3.4.3 Deflating stiffness

Deflating K remains a challenge. At this point, it is important to note that storing the deflated stiffness $W^T K W$, as described in Equation (12), has to take place only once every linear solve. Also, storing $W^T K W$ is less expensive because its size is far smaller than K. It is a dense matrix and therefore storing a factorized lower triangular matrix is also beneficial as it can be readily used for direct solve in deflated-space.

With the decision to store a factorized triangular matrix in view, one can assemble $W^T K W$ through assembly-free methods element-by-element using:

$$W^T K W = \text{assemble} \left( W_e^T K_e W_e \right) \qquad (39)$$
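The element-by-element accumulation of Equation (39) is sketched below on a deliberately tiny example. For brevity the deflation matrix W is stored explicitly and sliced per element, which an assembly-free code would avoid by forming $W_e$ on the fly; the function name and the spring-chain data are hypothetical.

```python
import numpy as np

def deflated_stiffness(element_dofs, element_stiffness, W):
    """Accumulate W_e^T K_e W_e over all elements without forming the global K."""
    n_defl = W.shape[1]
    K_defl = np.zeros((n_defl, n_defl))
    for dofs, Ke in zip(element_dofs, element_stiffness):
        We = W[dofs, :]                 # rows of W for this element's DOF
        K_defl += We.T @ Ke @ We
    return K_defl

# Chain of four identical springs (5 DOF) plus a grounding spring at DOF 0,
# so that the system is positive definite.
spring = np.array([[1.0, -1.0], [-1.0, 1.0]])
element_dofs = [[0], [0, 1], [1, 2], [2, 3], [3, 4]]
element_stiffness = [np.array([[1.0]])] + [spring] * 4

# Two groups with one (translational) deflation DOF each.
W = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

K_defl = deflated_stiffness(element_dofs, element_stiffness, W)
L = np.linalg.cholesky(K_defl)          # lower-triangular factor kept for reuse
```

Only the small dense factor L needs to be stored, and it can be reused for every solve in deflated-space within the linear solve.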

Figure 38: GPU implementation for restriction.


This concludes all the major aspects of assembly-free deflated-CG.

3.5 Summary

The main objective of this chapter was to outline an assembly-free implementation of all the

steps required for limited-memory deflated-CG. To limit memory requirement the concept of

‘element-congruency’ was defined.

Congruency is often considered a geometric term, and is hard to identify in an exact sense for an

unstructured conforming mesh. However, with some tolerance this can be an exceptional tool for

solving large-scale problems. The number of unique elements can also be reduced through

proposed reduced-stiffness congruency. It can produce a set of useable results in an efficient

manner without computing and storing individual stiffness for all the elements.

It is also important to recognize that simply identifying congruency is not sufficient. The challenge of memory management and scheduling of operations has to be considered. The implementation for a voxel-mesh is a special case highlighting the importance of such techniques. But a more general implementation of SpMV for all congruent elements in a conforming mesh is still an ongoing research problem at this time.

The chapter also illustrates the implementation for assembly-free deflation. Physics-based

deflation and its use in assembly-free deflated-CG for large-scale FEA define the core

contribution of this thesis.

In the next chapter, algorithms that can utilize assembly-free SpMV and assembly-free deflated-

CG for different types of FEA are discussed. The performance of assembly-free methods is

illustrated through examples.


4 ASSEMBLY-FREE FINITE ELEMENT ANALYSIS

In this chapter, the application of assembly-free deflated-CG for large-scale FEA is discussed.

The formulations and results for assembly-free FEA are presented for published work [29], [54],

[55] and ongoing research.

The numerical examples are presented to emphasize the following observations:

1) The ease of implementation of assembly-free deflated-CG for a variety of large-scale

FEA problems.

2) The limited memory characteristics of assembly-free deflated-CG.

3) The speed of assembly-free deflated-CG compared to commercial solvers, such as those

supported by SolidWorks [10].

4) For static analysis, a profile of assembly-free deflated-CG on a GPU is also presented to

illustrate the adaptability of the method on multi-core architecture.

The results for assembly-free deflated-CG in large-scale static FEA [29] are discussed first.

4.1 Assembly-free static analysis

In this section, the focus is on solving a linear static problem, expressed in Equation (3).

Assembly-free deflated-CG is used as a linear solver as per Algorithm 1. The experiments were

conducted on a Windows 7 64-bit machine with following specification (except when specified

otherwise):

AMD Phenom™ II X4-955 processor running at 3.2GHz with 4GB of memory; OpenMP

commands were used to parallelize CPU code.


NVidia GeForce GTX 480 (448 cores) with 0.75GB of device memory; standard function

calls within CUDA SDK 4.0 [52] and CuBLAS library [53] were used for GPU

implementation.

The computations were performed in double precision. Tolerance for relative residual was set to $10^{-8}$ for CG.

4.1.1 Numerical results: static analysis for Knuckle

Knuckle geometry is illustrated in Figure 39 (a). The knuckle is fixed at the two horizontal holes,

and a vertical force is applied on the third hole. Observe that the geometry is relatively ‘thick’,

i.e., there are no plate-like or beam-like features. A voxel mesh with 3.15 million DOF was

generated as shown in Figure 39 (b).

The Jacobi-preconditioned conjugate gradient (Jacobi-PCG) required 1741 iterations and 245 seconds on the CPU. The displacement and stress plots are illustrated in Figure 40.

Figure 39: (a) Knuckle geometry and loading, and (b) Voxel mesh


The same system was solved using different number of rigid-body groups for deflation space.

For example, Figure 41 illustrates collection of nodes into 100 and 1000 groups.

The results for varying number of groups are summarized in Table 4.

Figure 40: Static (a) displacement, and (b) stress for knuckle

Figure 41: 100 and 1000 rigid-body groups


Table 4: Total iterations and time taken to solve knuckle with varying number of groups

#Groups #Iteration CPU Time (s) GPU Time (s) Memory Needed (MB)

0 1741 245 36 174

200 145 54 29 213

400 114 48 28 224

600 95 48 31 252

800 73 46 32 263

The following observations are worth noting:

Increasing the groups from zero (pure Jacobi-PCG) to 200, the smallest nonzero group count in Table 4, reduces the number of CG-iterations by a factor of about 12 (1741 to 145), but the CPU time reduces only by a factor of about 4.5 (245 s to 54 s). The underlying reason is that SpMV is performed twice every iteration in deflated-CG.

Further, increasing the number of groups beyond a certain limit can lead to an increase in computation time. Finding an optimal number of groups is a topic of future research.

As the number of iterations reduces, the speed-up gained through the GPU also reduces as expected, since the bottlenecks are the SpMV required per iteration and the $W^T x$ operation, which is not amenable to fine-grain parallelism.

Finally, the memory requirements are fairly small even for a 3.15 million DOF system.

The timing results were also compared with solution methods available in SolidWorks [10]. However, the problem size was limited to 1.1 million DOF due to the memory constraint of the direct solver; for larger systems, the factorization failed for lack of system memory. The knuckle problem was solved using the built-in sparse-direct solver and the preconditioned-iterative solver in SolidWorks [10]. For assembly-free deflated-CG, 300 rigid-body groups were used for the same number of DOF. The comparison of memory required, along with solution time, is listed in Table 5.


Table 5: Time taken to solve the knuckle problem using SolidWorks [10] and proposed

method.

Solution Method CPU Time Memory Required

Sparse-direct in SolidWorks [10] 1 h 52 m 32 s 2.9 GB

Preconditioned-iterative in SolidWorks [10] 30 s 524 MB

Proposed assembly-free deflated-CG 13.5 s 77 MB

Observe that the direct solver performs poorly for large-scale FEA. The type of preconditioning used by the preconditioned-iterative solver in SolidWorks [10] is not known; however, its memory requirement is still high compared to the proposed assembly-free deflated-CG. The speed-up observed is a consequence of the limited memory requirement of assembly-free deflated-CG.

The convergence plot in Figure 42 illustrates that the Jacobi-PCG converges slowly but steadily

towards the solution without stagnation; this is typical of solid mechanics problems posed over

‘thick’ solids. The rigid-body deflation leads to a dramatic drop in number of iterations as

mentioned earlier.

Thin solids however behave differently.

Figure 42: Convergence of DCG vs Jacobi-PCG


4.1.2 Numerical results: static analysis of thin plate under pressure

Consider the thin plate illustrated in Figure 43. The dimension of the plate is 100x100x2.5 (mm); the four side faces are fixed, and a unit pressure is applied to the top face. The geometry is discretized using a voxel mesh of about 550000 elements with over 2 million DOF.

The solution obtained through Jacobi-PCG takes 6337 iterations in an average time of 72s. The convergence plot in Figure 44 highlights the effectiveness of deflated-CG with rigid-body deflation space in case of thin structures. The presence of numerous low-order eigen-modes leads to stagnation for Jacobi-PCG whereas deflated-CG ensures that the low-order eigen-modes are smoothed effectively.

Figure 43: Loading on a thin plate


Next, the above problem is solved using rigid-body deflation and Kirchhoff-Love deflation. The results are listed in Table 6.

Table 6: Comparison of Rigid-body deflation vs Kirchhoff-Love deflation

           Rigid-body deflation                       Kirchhoff-Love deflation
#Groups    #Iteration  GPU time (s)  GPU memory (MB)  #Iteration  GPU time (s)  GPU memory (MB)
100        736         36            138              256         21            144
200        382         23            146              130         18            161
300        260         23            161              96          21            192
400        199         22            178              76          29            233

Observe that, although the Kirchhoff-Love deflation uses 33% more DOF per group, the net gain is significant. In other words, for the same number of group-DOF, for thin structures, capturing the curvature leads to faster convergence. It is a better alternative for large-scale problems constrained by limited memory.

The thin-plate problem was also solved using SolidWorks [10]. The mesh generation in SolidWorks [10] was limited to a system of 1.2 million DOF as discretization failed for a higher number of elements. This is different from what was observed in the knuckle problem, where factorization for the direct solver failed for a higher number of DOF. However, investigating the failure of discretization is not important for this discussion. The thin-plate problem was solved using both the sparse-direct and preconditioned-iterative methods available in SolidWorks [10]. 300 rigid-body groups were used for assembly-free deflated-CG for the same number of DOF. The comparison for solution time and memory required is shown in Table 7.

Figure 44: Convergence of DCG vs Jacobi-PCG for thin plate

Table 7: Time taken to solve thin-plate with proposed method vs SolidWorks [10]

Solution Method CPU Time Memory Required

Sparse-direct in SolidWorks [10] 1 hour 11 min 5 sec 1.7 GB

Preconditioned-iterative in SolidWorks [10] 35 sec 547 MB

Proposed assembly-free deflated-CG 14.5 sec 82 MB

Again, it is important to observe that the limited memory requirement allows a significant speed-up in solution time. The implementation is easily portable to any multi-core architecture, including GPU, due to this limited memory requirement.


Once the number of iterations is reduced, the time taken for SpMV reduces to about 50% of the total solution time. The restriction operation, which reduces vectors to deflation-space, occupies about 20% of the time. The CUDA profile in Figure 45 shows the percentage of time spent in the various functions of the proposed assembly-free deflated-CG.

4.1.3 Numerical results: static analysis of ‘Thomas’ engine

For a more complicated large-scale FEA problem, consider the ‘Thomas’ engine in Figure 46

whose wheels are fixed and a load is applied as shown.

Figure 45: CUDA Profile for Rigid-Body deflation


Since the implementation relies on a robust voxelization scheme, the detailed features of the

model need not be suppressed. Here, the model was voxelized using 20 million elements,

resulting in a 50 million DOF system.

This experiment was performed on a typical high-performance desktop that used a GTX Titan

GPU card with 6 GB of device memory. The linear system was solved on this GPU using rigid-

body deflation with 900 groups in 24 minutes, requiring less than 3 GB of memory.

Next, the generalized eigen-value problem expressed in Equation (4) is revisited for modal and buckling analysis. The published work [54] focuses primarily on exploiting assembly-free SpMV for solving the generalized eigen-value problem for modal analysis posed in Equation (4).

Figure 46: Structural problem over a Thomas engine.

Figure 47: Deflection from a 50 million DOF system.


4.2 Assembly-free modal analysis

Most commercial methods today use the block-form of the shift-and-invert Lanczos algorithm, also known as the block-Lanczos method [18], [56]–[58]. The method inverts a matrix $K - \sigma M$ repeatedly to isolate the desired eigen-pairs, with $\sigma$ determined during iterations. This requires explicit LU factorization of $K - \sigma M$, which is not desirable in large-scale problems. Methods to eliminate the need for an explicit decomposition were developed in [58], [59]. Inversion was carried out in an approximate sense over Krylov sub-space. Arbenz et al. [58] use algebraic multi-grid as pre-conditioner for factorizing the shifted matrix. They also compare the implementation of alternate algorithms including ‘locally optimal block preconditioned CG’ [60], ‘Jacobi-Davidson’ [61], etc. It is shown that, for large-scale eigen-value problems, these alternate algorithms can be competitive, relative to block-Lanczos. Rayleigh-Ritz conjugate gradient is one such algorithm [60], [62], and it is discussed next.

4.2.1 Rayleigh-Ritz conjugate gradient

Rayleigh-Ritz conjugate gradient (RCG) algorithm [63], [64] requires only an efficient SpMV.

Therefore, it exhibits numerous advantages including simplicity, low memory requirements, and

significant scope of parallelism. The key concept is computing Rayleigh quotient of an arbitrary

vector x, given by the equation:

$$\lambda(x) = \frac{x^T K x}{x^T M x} \qquad (40)$$

If the vector x is an eigen-vector of (K, M), then the Rayleigh quotient is the corresponding

eigen-value. Thus, by minimizing the Rayleigh quotient, one can compute the lowest eigen-pair,

i.e., the eigen-value problem can be posed as a minimization problem:


$$\min_x \frac{x^T K x}{x^T M x} \qquad (41)$$

A nonlinear conjugate gradient [65] can be used to solve the minimization problem to find the

lowest eigen-mode. Gradient of the above equation can be computed through:

$$g(x) = \frac{2 \left( Kx - \lambda(x) \, Mx \right)}{\lVert x \rVert_M^2} \qquad (42)$$

where

$$\lVert x \rVert_M = \sqrt{x^T M x} \qquad (43)$$

Using the classic CG algorithm [65], RCG algorithm becomes:

Algorithm 2: Rayleigh-Ritz Conjugate Gradient (RCG)

1. Initialize $x^{(1)} \neq 0$ such that $\lVert x^{(1)} \rVert_M = 1$

2. Set $p^{(0)} = 0$, $\beta^{(0)} = 1$ and $k = 1$

3. Compute the gradient $g^{(k)}$ via Equation (42)

4. Update $\beta^{(k)} = \dfrac{g^{(k)T} g^{(k)}}{g^{(k-1)T} g^{(k-1)}}$

5. The conjugate search direction is given by: $p^{(k)} = -g^{(k)} + \beta^{(k)} p^{(k-1)}$

6. Find the step length $\alpha^{(k)}$ as described in [66]

7. Let $y^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$ and $x^{(k+1)} = y^{(k+1)} / \lVert y^{(k+1)} \rVert_M$

8. Compute $\lambda(x^{(k)})$ via Equation (40)

9. If $\lVert g^{(k)} \rVert < \epsilon$, terminate; else, increment $k$, go to step 3


SpMV being the most costly operation makes RCG a very simple algorithm to implement.

However, to compute higher modes, the process has to be repeated with some constraints.
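A compact sketch of the single-mode RCG iteration is given below. It follows the structure of Algorithm 2 but departs from it in one labelled respect: the step length of [66] is not reproduced, and is replaced here by an exact minimization of the Rayleigh quotient over the two-dimensional subspace spanned by the current iterate and the search direction. The function name, the initial value used for the previous gradient norm, and the small test matrices are likewise assumptions; dense matrix products stand in for the assembly-free SpMV.

```python
import numpy as np

def rcg_lowest_mode(K, M, x0, tol=1e-8, max_iter=1000):
    """Minimize the Rayleigh quotient with a Fletcher-Reeves conjugate gradient."""
    m_norm = lambda v: np.sqrt(v @ (M @ v))
    x = x0 / m_norm(x0)
    p = np.zeros_like(x)
    gg_old = 1.0                                   # assumed initial value
    for _ in range(max_iter):
        Kx, Mx = K @ x, M @ x
        lam = (x @ Kx) / (x @ Mx)                  # Rayleigh quotient, Eq. (40)
        g = 2.0 * (Kx - lam * Mx) / (x @ Mx)       # gradient, Eq. (42)
        gg = g @ g
        if np.sqrt(gg) < tol:
            break
        p = -g + (gg / gg_old) * p                 # conjugate search direction
        gg_old = gg
        # Step length (substitute for [66]): minimize over span{x, p}.
        q = p - (x @ (M @ p)) * x                  # M-orthogonalize p against x
        q /= m_norm(q)
        S = np.column_stack([x, q])                # M-orthonormal basis
        _, V = np.linalg.eigh(S.T @ (K @ S))       # 2 x 2 Ritz problem
        c = V[:, 0] if V[0, 0] >= 0 else -V[:, 0]  # keep the sign of x
        x = S @ c
        x /= m_norm(x)
    lam = (x @ (K @ x)) / (x @ (M @ x))
    return lam, x

# Small test problem: 1D Laplacian stiffness and a diagonal mass matrix.
n = 10
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M = np.diag(np.linspace(1.0, 2.0, n))
lam, x = rcg_lowest_mode(K, M, np.ones(n))

# Reference value from a dense symmetric eigen-solver.
M_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(M)))
lam_ref = np.linalg.eigvalsh(M_inv_sqrt @ K @ M_inv_sqrt)[0]
```

Apart from the 2x2 Ritz problem, every operation is a product with K or M or a vector update, which is what makes the method attractive for an assembly-free setting.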

4.2.2 Computing multiple modes

To compute higher modes, observe that if $(\lambda_1, x_1)$ and $(\lambda_2, x_2)$ are two distinct modes, and if $\lambda_1 \neq \lambda_2$, then it is easy to show that they must satisfy M-orthogonality:

$$x_1^T M x_2 = 0 \qquad (44)$$

Further, if $\lambda_1 = \lambda_2$, one can always find a pair of eigen-modes that satisfy the above equation. Thus, to find the second eigen-mode, we pose a constrained minimization problem:

$$\min_x \frac{x^T K x}{x^T M x} \quad \text{s.t.} \quad x^T M x_1 = 0 \qquad (45)$$

For an arbitrary vector x, the M-orthogonality is enforced through the following:

$$x \leftarrow x - (x^T M x_1) \, x_1 \qquad (46)$$

M-orthogonality is ensured during the iteration through: 1) initializing start vector as M-orthogonal to $x_1$, and 2) maintaining search directions $p^{(k)}$ M-orthogonal to $x_1$.

Given (m-1) lower modes computed as:

$$X = \{x_1, x_2, \ldots, x_{m-1}\} \qquad (47)$$

To compute the next mode, one must solve the constrained minimization problem:


$$\min_x \frac{x^T K x}{x^T M x} \quad \text{s.t.} \quad x^T M X = 0 \qquad (48)$$

where M-orthogonality can be enforced via:

$$x \leftarrow x - \sum_{i=1}^{m-1} (x^T M x_i) \, x_i \qquad (49)$$

With this update the RCG algorithm can be updated for computing the $m$th lowest eigen-pair [64].

Algorithm 3: RCG (multiple modes)

1. Suppose m-1 eigen-modes $X = \{x_1, x_2, \ldots, x_{m-1}\}$ have been computed.

2. Initialize $x^{(1)} \neq 0$ such that $\lVert x^{(1)} \rVert_M = 1$ and $x^{(1)T} M X = 0$

3. Set $p^{(0)} = 0$, $\beta^{(0)} = 1$ and $k = 1$

4. Compute the gradient $g^{(k)}$ via Equation (42)

5. Update $\beta^{(k)} = \dfrac{g^{(k)T} g^{(k)}}{g^{(k-1)T} g^{(k-1)}}$

6. Let $p^{(k)} = -g^{(k)} + \beta^{(k)} p^{(k-1)}$ (preliminary direction)

7. Construct an M-orthogonal direction $p^{(k)}$ via Equation (49)

8. Find the step length $\alpha^{(k)}$ as described in [66]

9. Let $y^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$ and $x^{(k+1)} = y^{(k+1)} / \lVert y^{(k+1)} \rVert_M$

10. Compute $\lambda(x^{(k)})$ via Equation (40)

11. If $\lVert g^{(k)} \rVert < \epsilon$, terminate; else, increment $k$, go to step 4.


The advantage of the above method as opposed to other block-oriented methods is limited

memory requirement, and simplicity.
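The M-orthogonalization of Equation (49) amounts to two products with the stored modes and one product with M. The sketch below assumes that the previously computed modes are stored as the columns of X and are M-orthonormal, as the normalization and orthogonality conditions of the algorithm imply; the function name and test data are hypothetical.

```python
import numpy as np

def m_orthogonalize(x, X, M):
    """Return x - sum_i (x^T M x_i) x_i for M-orthonormal columns x_i of X (Eq. 49)."""
    return x - X @ (X.T @ (M @ x))

# Small test problem with a diagonal mass matrix.
n = 8
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M = np.diag(np.linspace(1.0, 2.0, n))

# Two lowest modes from a dense solver, scaled so that X^T M X = I.
M_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(M)))
_, Y = np.linalg.eigh(M_inv_sqrt @ K @ M_inv_sqrt)
X = M_inv_sqrt @ Y[:, :2]

x = np.random.default_rng(1).standard_normal(n)
x_perp = m_orthogonalize(x, X, M)
```

The same routine can be applied to the starting vector and to each preliminary search direction, so that the iterates remain in the M-orthogonal complement of the modes already found.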

4.2.3 Subspace augmentation

There are some limitations to the RCG algorithm for multiple modes. The most important of

them are:

1. Missing modes: As we sweep eigen-spectrum, one or more eigen-modes may go

undetected, especially for repeated eigen-values.

2. Erroneous results: A large value for tolerance in step 11 of Algorithm 3 can lead to erroneous results; for that reason a low tolerance, typically $\epsilon \sim 10^{-10}$, is essential.

3. Slow convergence: For low tolerances, thousands of iterations are required for each

mode, especially when the matrix is ill-conditioned.

All three problems are addressed by subspace projection methods. Such projection methods are common in modern eigen-solvers [18], [58], but have not been considered for RCG.

The idea is to create a subspace through approximate eigen-vectors $x_i^{(k)}$ $(i = 1, 2, \ldots, m)$ computed through early termination of RCG. The sub-space is defined as:

$$S^{(k)} = \{x_1^{(k)}, x_2^{(k)}, \ldots, x_m^{(k)}\} \qquad (50)$$

Reduced stiffness and mass matrices are constructed using the subspace through the following transformation:

$$\bar{K}^{(k)} = S^{(k)T} K S^{(k)}, \qquad \bar{M}^{(k)} = S^{(k)T} M S^{(k)} \qquad (51)$$


Both of these matrices are constructed through a series of SpMV. The transformed matrices are used to solve a smaller eigen-value problem exactly:

$$\bar{K}^{(k)} V^{(k)} = \bar{M}^{(k)} V^{(k)} \Lambda^{(k)} \qquad (52)$$

A sharpened set of eigen-vectors is recovered by transforming the vectors back to the original space:

$$X^{(k+1)} = S^{(k)} V^{(k)} \qquad (53)$$

Using this set of vectors as starting vectors, one can restart the RCG method. This algorithm can be termed subspace augmented Rayleigh-Ritz conjugate gradient (SaRCG).

Algorithm 4: Subspace Augmented Rayleigh-Ritz Conjugate Gradient (SaRCG)

1. Initialize $X^{(1)} = \{x_1^{(1)}, x_2^{(1)}, \ldots, x_m^{(1)}\}$ (typically random vectors). Set $k = 1$

2. Compute $x_i^{(k)}$ with $x_i^{(k)}$ as the starting vectors until $\lVert g^{(k)} \rVert < \epsilon$ ($\epsilon \sim 0.1$), for $i = 1, 2, \ldots, m$.

3. Construct the reduced matrices via Equation (51), where $S^{(k)}$ is defined in Equation (50), and solve Equation (52) to find $\Lambda^{(k)}$ and $V^{(k)}$.

4. If convergence in $\Lambda^{(k)}$ has not been achieved, construct the updated eigen-vectors $X^{(k+1)}$ via Equation (53), increment $k$ and go to step 2.

The extension improves the robustness and convergence of the algorithm. The additional

computational cost of constructing and solving the reduced problem is negligible. SpMV remains

computationally the most expensive part of the algorithm.
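The projection step of Equations (51) to (53) can be sketched as follows. The reduced generalized problem is solved here through a Cholesky factor of the reduced mass matrix followed by a symmetric eigen-solve; this particular route, the function name, and the tiny test matrices are assumptions made for illustration.

```python
import numpy as np

def rayleigh_ritz(K, M, S):
    """Project (K, M) onto the columns of S and return Ritz values and vectors."""
    K_red = S.T @ (K @ S)                    # reduced stiffness, Eq. (51)
    M_red = S.T @ (M @ S)                    # reduced mass, Eq. (51)
    L = np.linalg.cholesky(M_red)
    L_inv = np.linalg.inv(L)
    lam, Y = np.linalg.eigh(L_inv @ K_red @ L_inv.T)
    V = L_inv.T @ Y                          # solves K_red V = M_red V diag(lam), Eq. (52)
    return lam, S @ V                        # sharpened vectors, Eq. (53)

# The subspace below mixes the two lowest eigen-vectors of a diagonal K,
# so a single projection recovers them exactly.
K = np.diag([1.0, 2.0, 3.0, 4.0])
M = np.eye(4)
S = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [0.0, 0.0],
              [0.0, 0.0]])
lam, X_new = rayleigh_ritz(K, M, S)
```

The reduced matrices are only m x m, so this step is cheap relative to the products with K and M needed to form them.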


4.2.4 Numerical results: modal analysis for Knuckle

In this section, the accuracy of SaRCG method is compared with results from a commercial

package. The first five eigen-modes for the ‘knuckle’ problem are considered. The problem is

illustrated in Figure 48(a), where two horizontal holes are clamped. Also illustrated in Figure

48(b) and Figure 48(c) are the conforming and non-conforming meshes used.

The first five eigen-modes are computed through SolidWorks [10] application using the

conforming tetrahedral mesh from Figure 48(b). Total computational time to obtain the first five

mode shapes was approximately 220s for 1 million DOF system. The corresponding five eigen-

modes computed via SaRCG [54] using a voxel-mesh are illustrated in Table 8 along with mode

shapes obtained from SolidWorks [10]. The eigen-values are within 1% accuracy, and the

computational time for SaRCG [54] was approximately 500 seconds for 1 million DOF system,

on the CPU.

An important observation is that the eigen-values are relatively accurate despite the use of a non-conforming mesh. When computing global properties, such as mode shapes, one can get away with a non-conforming mesh. Since accuracy is not compromised, the robustness of the method makes it very attractive.

Figure 48: (a) Knuckle geometry, (b) Conforming mesh, and (c) voxel-mesh


Table 8: First 5 eigen-modes for Knuckle: SolidWorks [10] vs SaRCG [54] (frequencies only; mode-shape images not reproduced)

Mode | SolidWorks [10] | SaRCG [54]
1st | 2457.5 Hz | 2437.6 Hz
2nd | 3795.8 Hz | 3806.4 Hz
3rd | 5262.1 Hz | 5279.1 Hz
4th | 9746.6 Hz | 9744.2 Hz
5th | 9957.6 Hz | 10031 Hz
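The agreement between the two columns of Table 8 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it transcribes the tabulated frequencies and reports the relative difference per mode. The names `TABLE_8` and `relative_difference_percent` are introduced here for the example, and the comparison is made on the tabulated frequencies rather than on the underlying eigen-values.

```python
# Frequencies (Hz) transcribed from Table 8: (SolidWorks, SaRCG) per mode.
TABLE_8 = [
    (2457.5, 2437.6),
    (3795.8, 3806.4),
    (5262.1, 5279.1),
    (9746.6, 9744.2),
    (9957.6, 10031.0),
]


def relative_difference_percent(reference, value):
    """Relative difference of `value` from `reference`, in percent."""
    return abs(value - reference) / reference * 100.0


# One entry per mode, using SolidWorks as the reference value.
differences = [relative_difference_percent(sw, sa) for sw, sa in TABLE_8]

for mode, diff in enumerate(differences, start=1):
    print(f"mode {mode}: {diff:.2f} % difference in frequency")
```

Under this measure every mode differs by less than 1%, with the largest gap in the first mode.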


The results presented above do not fully illustrate the advantage of assembly-free SpMV on a CPU. At the time of publication [54], the focus was strictly on exploiting assembly-free SpMV for large-scale modal analysis on a GPU. No effort was made towards using assembly-free deflated-CG, which was published later in [29].

However, once assembly-free deflated-CG was available, inverse iteration [67] was determined to be a more effective method to solve Equation (4). Inverse iteration allows one to compute eigen-modes by repeatedly solving a linear system of equations of the form expressed in Equation (3), which can exploit assembly-free deflated-CG. The advantage of using inverse iteration (Algorithm 5) [67] is discussed in the next section on assembly-free buckling analysis.

The experiment for computing the first five eigen-modes of the knuckle was repeated using Algorithm 5. The timing results for a 1 million DOF system using SolidWorks [10], SaRCG (Algorithm 4), and inverse iteration (Algorithm 5) [67] are listed in Table 9. The linear system within the inverse iteration [67] was solved using 300 rigid-body groups via assembly-free deflated-CG.

Table 9: Time for computing the first five frequencies of the knuckle

Solution Method | 1st Frequency | Discretization Time (s) | Solution Time (s)
SolidWorks [10] | 2457.5 Hz | 52 | 220
SaRCG [54] | 2437.6 Hz | 21 | 500
Inverse iteration w/ assembly-free deflated-CG [55] | 2437.6 Hz | 21 | 102


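Table 9 illustrates the advantage of using assembly-free deflated-CG for large-scale modal analysis: the solution time drops from 500 s for SaRCG and 220 s for SolidWorks to 102 s for inverse iteration with assembly-free deflated-CG. The next experiment highlights the robustness of using a voxel-mesh.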

4.2.5 Numerical results: modal analysis for Housing cover

The primary advantages of the proposed method are its robustness, simplicity and ability to

handle geometrically complex structures. For example, consider the gear housing illustrated in

Figure 49(a). The meshing for this structure failed (in SolidWorks [10]) as illustrated in Figure

49(b).

On the other hand, we can easily voxelize the geometry as shown in Figure 50. Brute force

voxelization of the structure produces a high density mesh. This is handled well with the

proposed SaRCG method.

Figure 49: (a) Gear-housing: eigen-spectrum is desired, (b) Meshing can fail for complex structures

Figure 50: Brute-force voxelization of the structure


The results are summarized below in Table 10. Figure 51 illustrates the mode shape computed

for the first eigen-mode of gear housing using SaRCG method [54].

Table 10: Results for Computing Fundamental Frequency of Gear Housing

DOF 1st-Frequency Voxelize Time (secs) Solution Time (secs)

150,000 70 Hz 33 9

300,000 76 Hz 52 18

425,000 74.2 Hz 81 35

2,000,000 74.6 Hz 220 191

There are more results available in [54] providing a broader perspective on the importance of

assembly-free SpMV for modal analysis on GPU.

The generalized eigen-value problem for buckling analysis is discussed next. The formulation

and results discussed are from published work for large-scale assembly-free buckling analysis

[55].

Figure 51: First Eigen-mode for Gear Housing


4.3 Assembly-free buckling analysis

Buckling is the sudden failure of a structural member carrying a compressive load. For example,

Figure 52 illustrates the classic buckling of a pinned-pinned beam. Structural elements, such as

those found in high-rise buildings are typically subjected to compressive loads, and must be

analyzed and designed to prevent such buckling failures.

Finite element analysis of linear buckling is typically carried out in two stages. In the first stage,

the structural member is subject to a unit load. The domain is discretized using a finite element

mesh, and the corresponding static linear-elasticity problem is posed and solved; this amounts to

solving a linear system expressed in Equation (3).

In the second stage, the linear displacement field u is post-processed to obtain the stress tensor

within each of the finite elements [3]:

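$\sigma_{elem} = \begin{bmatrix} \sigma_{xx} & \sigma_{xy} & \sigma_{xz} \\ \sigma_{xy} & \sigma_{yy} & \sigma_{yz} \\ \sigma_{xz} & \sigma_{yz} & \sigma_{zz} \end{bmatrix}_{elem}$ (54)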

Then the stress tensor is used to define an element-level stress stiffness matrix [3]:

Figure 52: Buckling of a pinned-pinned beam


$K_\sigma^{elem} = \int_{\Omega_{elem}} G_{elem}^T \, \sigma_{elem} \, G_{elem} \; d\Omega_{elem}$ (55)

where G is the shape function gradient matrix described in [3]. The element matrices are then assembled to construct the global stress stiffness matrix [3]:

$K_\sigma = \operatorname{assemble}\left(K_\sigma^{elem}\right)$ (56)

Finally, the following generalized eigen-value problem is posed and solved [3]:

$(K + \lambda K_\sigma)\, w = 0$ (57)

While there are multiple pairs of solutions to the above problem, only the lowest few are

typically important. In particular, the lowest eigen-value of Equation (57) determines the

buckling safety factor [3], i.e., the load at which buckling will occur (assuming a unit load has

been applied initially). The vector w in Equation (57) represents the associated buckling mode.

Observe that this equation is similar to the generalized eigen-value problem associated with modal analysis as described by Equation (4). However, there are two key differences between Equation (4) and Equation (57): 1) the mass matrix M is positive definite, but the stress stiffness matrix Kσ need not be positive definite, and 2) the mass matrix depends only on the material and the underlying mesh, while the stress stiffness matrix depends on the element stresses as well.

The second difference has implications for assembly-free analysis. It is important to note that Algorithm 4 relies only on efficient SpMV for solving eigen-value problems in modal analysis. This includes both Kx and Mx, each of which exploits congruency in the voxel-mesh. While SaRCG is efficient for large-scale modal analysis, the dependence of Kσ on element stresses makes it difficult to exploit congruency for the SpMV Kσw. Furthermore, storing the stress stiffness matrix of every element creates a large memory footprint. This was observed to significantly slow down the computation.
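The role of congruency can be illustrated with a minimal sketch of an element-by-element matrix-vector product. It is not the thesis implementation: it assumes a hypothetical 1-D chain of identical two-node bar elements, so that a single template matrix `KE` serves every element, and the function name `assembly_free_spmv` and its arguments are chosen here for illustration only.

```python
def assembly_free_spmv(template_ke, connectivity, scale, x):
    """Compute y = K x element by element, without assembling K.

    template_ke  : one dense element matrix shared by all congruent elements
    connectivity : list of global DOF index tuples, one per element
    scale        : per-element scalar multiplier (e.g. relative stiffness)
    x            : global input vector
    """
    y = [0.0] * len(x)
    for dofs, s in zip(connectivity, scale):
        xe = [x[d] for d in dofs]  # gather the element's entries of x
        for a, row in zip(dofs, template_ke):
            # scatter-add the element contribution into the global result
            y[a] += s * sum(k * v for k, v in zip(row, xe))
    return y


# Hypothetical 1-D chain of 4 identical two-node bar elements (5 DOF).
KE = [[1.0, -1.0], [-1.0, 1.0]]
CONN = [(0, 1), (1, 2), (2, 3), (3, 4)]
x = [0.0, 1.0, 4.0, 9.0, 16.0]
y = assembly_free_spmv(KE, CONN, [1.0] * len(CONN), x)
print(y)
```

Only one 2 × 2 template is stored regardless of the number of elements. For Kσ, the element matrices differ with the element stresses, so a single shared template is no longer sufficient, which is the difficulty described above.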

4.3.1 Inverse iteration

This draws attention to another method known as inverse iteration [67]. The basic principle is to carry out:

$y = K^{-1} K_\sigma w$ (58)

and recycle the solution. The number of $K_\sigma w$ operations is considerably reduced, and the computational burden falls on solving an equivalent static problem [29]. The algorithm to solve Equation (57) through inverse iteration is given below.

Algorithm 5: Inverse iteration for buckling

1. Initialize $w^{(1)} \neq 0$ such that $\|w^{(1)}\| = 1$

2. Set $i = 1$

3. Compute $z^{(i)} = K_\sigma w^{(i)}$

4. Solve $K y^{(i+1)} = z^{(i)}$ for $y^{(i+1)}$

5. Update $w^{(i+1)} = y^{(i+1)} / \|y^{(i+1)}\|$

6. Compute $g^{(i)} = K w^{(i+1)} + \lambda^{(i+1)} K_\sigma w^{(i+1)}$

7. If $\|g^{(i)}\| < \epsilon$, terminate; else, increment $i$, and go to step 3

For a mode shape w, the Rayleigh-quotient in step 6 is expressed through:

$\lambda = -\dfrac{w^T K w}{w^T K_\sigma w}$ (59)

The number of iterations required to converge to the mode shape is far smaller than for RCG, as the numerical error is primarily eliminated in the linear solution in step 4. The linear system in step 4 is solved through assembly-free deflated-CG (Algorithm 1).
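The steps of Algorithm 5 can be sketched on a tiny dense problem. The code below is an illustration under stated assumptions, not the thesis implementation: it assumes the sign convention $(K + \lambda K_\sigma) w = 0$ for Equation (57), uses a plain conjugate gradient routine (`cg_solve`) as a stand-in for assembly-free deflated-CG, and uses made-up 2 × 2 matrices `K` and `K_SIGMA`. All function and variable names are hypothetical.

```python
import math


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cg_solve(A, b, tol=1e-12, max_iter=200):
    """Plain conjugate gradient; a stand-in for assembly-free deflated-CG."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rr) < tol:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def inverse_iteration(K, K_sigma, w0, eps=1e-10, max_iter=200):
    """Lowest-magnitude eigen-pair of (K + lam * K_sigma) w = 0."""
    norm0 = math.sqrt(dot(w0, w0))
    w = [wi / norm0 for wi in w0]                    # step 1: unit start vector
    lam = 0.0
    for _ in range(max_iter):
        z = matvec(K_sigma, w)                       # step 3
        y = cg_solve(K, z)                           # step 4: K y = z
        norm_y = math.sqrt(dot(y, y))
        w = [yi / norm_y for yi in y]                # step 5
        Kw = matvec(K, w)
        Ksw = matvec(K_sigma, w)
        lam = -dot(w, Kw) / dot(w, Ksw)              # Rayleigh quotient
        g = [a + lam * b for a, b in zip(Kw, Ksw)]   # step 6: residual
        if math.sqrt(dot(g, g)) < eps:               # step 7
            break
    return lam, w


# Made-up data: K is symmetric positive definite, K_SIGMA = -I mimics compression.
K = [[2.0, -1.0], [-1.0, 2.0]]
K_SIGMA = [[-1.0, 0.0], [0.0, -1.0]]
lam, w = inverse_iteration(K, K_SIGMA, [1.0, 0.0])
print(lam, w)
```

With `K_SIGMA = -I` the problem reduces to the ordinary eigen-problem of `K`, whose eigen-values are 1 and 3, so the iteration settles on the lowest one. Each pass needs one product with `K_sigma` for the update plus one more for the residual check, while the bulk of the work sits in the linear solve.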

The numerical results illustrate the advantage of using inverse iteration with assembly-free deflated-CG for buckling analysis. The material for all buckling analysis examples is steel with Young’s modulus $E = 2.1 \times 10^{11}$ Pa and Poisson’s ratio $\nu = 0.33$. The results are compared against those obtained through SolidWorks [10].

4.3.2 Numerical results: buckling analysis of a square beam

The first example is that of a beam of 1 meter in length, and 10 mm by 10 mm cross-section. The

beam is fixed at one end, and a compressive unit load is applied at the other. The classic fixed-

free Euler-beam analysis yields a critical load of [31]:

$P_{cr} = \dfrac{\pi^2 E I}{(2L)^2} = 431.8$ (60)

The results obtained through the proposed Assembly-Free Buckling Analysis (AFBA) and those obtained from SolidWorks [10] using the same number of degrees of freedom (DOF) are illustrated in Figure 53. Both methods converge to a critical load of 430.03. Note that 3-D FEA results are not expected to converge to the exact Euler-buckling result in Equation (60); however, both methods should yield similar results.
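As a quick check of the Euler reference value, the fixed-free formula can be evaluated for the square beam. The sketch below assumes SI units throughout and the standard second moment of area $I = a^4/12$ for a square section of side $a$; the function name `euler_fixed_free_critical_load` is introduced here for illustration.

```python
import math


def euler_fixed_free_critical_load(E, I, L):
    """Euler critical load for a fixed-free column: pi^2 E I / (2 L)^2."""
    return math.pi ** 2 * E * I / (2.0 * L) ** 2


E = 2.1e11                 # Young's modulus of steel, Pa (from the text)
L = 1.0                    # beam length, m
a = 0.010                  # side of the square cross-section, m
I_square = a ** 4 / 12.0   # second moment of area of a square section, m^4

P_cr = euler_fixed_free_critical_load(E, I_square, L)

# Gap between the Euler value and the converged 3-D FEA value quoted in the text.
gap_percent = abs(P_cr - 430.03) / P_cr * 100.0
print(f"P_cr = {P_cr:.1f}, gap to FEA = {gap_percent:.2f} %")
```

The Euler value and the converged FEA value differ by well under 1%, which is consistent with the remark that the 3-D results should be close to, but not identical with, the beam-theory result.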


The real advantage of AFBA is in speed. Figure 54 illustrates the computing time for AFBA

versus SolidWorks [10]. The quadratic growth in computation time in SolidWorks [10] can be

attributed to the quadratic growth in memory requirements with increasing degrees of freedom.

Figure 53: Predicted critical load using AFBA and Solidworks [10]


4.3.3 Numerical results: buckling analysis of cylindrical column

To illustrate the potential deficiency of AFBA with voxel-mesh, consider an example of a

cylindrical column of 1 m in length, and a diameter of 10 mm. The classic fixed-free Euler-beam

analysis yields a critical load of [31]:

$P_{cr} = \dfrac{\pi^2 E I}{(2L)^2} = 254.3$ (61)
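For reference, the value above can be reproduced by hand. The worked step below assumes SI units and the standard second moment of area of a solid circular section, $I = \pi d^4 / 64$, with $d = 0.01$ m, $L = 1$ m and $E = 2.1 \times 10^{11}$ Pa as given in the text.

```latex
\begin{align*}
I &= \frac{\pi d^4}{64} = \frac{\pi\,(0.01)^4}{64} \approx 4.909 \times 10^{-10}\ \text{m}^4, \\
P_{cr} &= \frac{\pi^2 E I}{(2L)^2} = \frac{\pi^3 E\, d^4}{256\, L^2}
        = \frac{\pi^3 \,(2.1 \times 10^{11})\,(10^{-8})}{256} \approx 254.3 .
\end{align*}
```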

The predicted buckling loads are illustrated in Figure 55. The two results differ by 2.5%. The difference can be attributed to the voxelization in the current implementation of AFBA. The non-conformity of the mesh affects the accuracy of the predicted stress field.

Figure 54: Computing time vs #DOF for AFBA and SolidWorks [10]


The time taken to solve the problem follows a similar trend as illustrated in Figure 56. Thus, if

one can tolerate small errors, the voxelized AFBA method can be significantly faster.

Figure 55: Accuracy plot for Cylindrical Column: Buckling load vs #DOF

Figure 56: Computing time vs #DOF for cylindrical column


It is important to note that assembly-free analysis is not restricted to voxel-mesh, but the

experiments simply highlight the speed at which useable results can be produced.

4.3.4 Numerical results: buckling analysis of a rectangular column with a hole

For the last example, consider the structure shown in Figure 57. The dimensions of the column

are 5x30x100 (mm), and the hole is of diameter 10 mm.

The results for the load factor computed using different mesh sizes are plotted in Figure 58. Here

we observe a 0.3% error in the solution. The computation time is plotted in Figure 59.

Figure 57: Rectangular column with a hole


The next section details an ongoing research problem, assembly-free large-deformation analysis, which has not been published. A formulation that exploits assembly-free deflated-CG is selected for large-deformation analysis, and a preliminary convergence analysis is presented.

Figure 58: Predicted critical load using AFBA and SolidWorks[10] for rectangular beam with hole

Figure 59: Computing time for rectangular beam with hole


4.4 Assembly-free large-deformation analysis

Large-displacement problems are common in the real world. Examples include, but are not

limited to, analyzing slender members in compliant mechanisms, crash analysis of shell-like

structures, soft tissue analysis for bio-mechanical systems, etc. Even for large-displacement

problems, the formulations require solution to an effective linearized system expressed as:

$K_{eff}\, \Delta u = f_{eff}$ (62)

where $K_{eff}$ is the effective or tangent stiffness, $f_{eff}$ is the effective or unbalanced force, and $\Delta u$ is the solution for the incremental displacement.

Equation (62) can be solved via assembly-free deflated-CG. In this section, the formulation that

exploits the proposed assembly-free methods is discussed.

When displacements are large, their effect on stiffness properties can no longer be ignored [68], [69]. For such cases, large-deformation analysis is required. Figure 60 illustrates an example of a cantilever beam solved as a linear elastic problem and as a large-deformation problem.


It can be observed that the large-deformation formulation predicts a lower maximum displacement for the structure. This is due to non-negligible non-linear terms in the strain tensor that have a stiffening effect on the structure [68], [69].

Since the non-linear terms in the strain tensor depend on the displacement field, the solution method involves breaking down the external force into multiple intermediate steps [68], [69]. For each force step, the displacement field that satisfies the equilibrium condition is determined through Newton iteration [68], [69]. The equilibrium condition is expressed as:

$\int_{\Omega} \sigma^{t+\Delta t}\, e^{t+\Delta t}\, d\Omega = W_{ext}^{t+\Delta t}$ (63)

where $t+\Delta t$ is the step index, $\sigma$ is the stress tensor, $e$ is the strain tensor, and $W_{ext}$ is the work done by the external load.

Figure 60: Cantilever beam displacement for linear elastic vs large-deformation formulation: a) linear elastic formulation, b) large-displacement formulation

The term on the left-hand side represents the internal strain energy, which must equal the external work done for any intermediate step, t + Δt.

There are several formulations that pose the equilibrium condition as a discretized system. A thorough discussion of the derivation of these formulations is available in [2], [3], [68], [69]. One such formulation is the ‘total Lagrangian’ formulation.

4.4.1 Total Lagrangian (TL) formulation

The TL formulation integrates the internal strain energy in Equation (63) over the initial undeformed domain [68], [69]. In simple terms, the nodal coordinates are not updated for intermediate equilibrium conditions [68], [69]. The configuration of the mesh remains fixed, which allows the element-congruency of the voxels to remain unchanged. Therefore, the TL formulation was used for large-deformation analysis in this thesis.

In discretized form, the linearized system of equations solved during Newton iteration is expressed as [68]:

${}^{t}_{0}K^{(m-1)}\, \Delta u^{(m)} = {}^{t}f_{ext} - {}^{t}_{0}f_{in}^{(m-1)}$ (64)

where $t$ is the step index, ${}^{t}f_{ext}$ is the external force for the current step, ${}^{t}_{0}f_{in}^{(m-1)}$ is the internal force for displacement $u^{(m-1)}$, ${}^{t}_{0}K^{(m-1)}$ is the tangent stiffness for displacement $u^{(m-1)}$, and $\Delta u^{(m)}$ is the incremental displacement solved in the current iteration.

Internal force vector and tangent stiffness matrix depend on the displacement field determined

from previous iteration [68]. This makes assembly-free implementation desirable for large-

deformation analysis because stiffness matrix can be updated for any given step as and when

required. The algorithm for large-deformation elasticity problem is described below.

Algorithm 6: Newton method for large-deformation elasticity

1. Initialize $u^{0} = 0$; $f_{ext}^{0} = 0$; $n = 1$;

2. Compute $\Delta f_{ext}$ based on the total number of time steps $N$

3. For incremental time step $n = 1, 2, \ldots, N$

   i) Initialize ${}_{0}f_{in}^{n(0)} = f_{ext}^{n-1}$; $\Delta u^{(0)} = 0$;

   ii) Update $f_{ext}^{n} = f_{ext}^{n-1} + \Delta f_{ext}$;

   iii) For k = 1, 2, 3, … until convergence

      a) Check for convergence ($norm(R) < \epsilon$)

      b) Compute ${}_{0}f_{in}^{n(k-1)}$

      c) Compute $R = f_{ext}^{n} - {}_{0}f_{in}^{n(k-1)}$,

      d) Solve (assembly-free) ${}_{0}K^{n(k-1)}\, \Delta u^{(k)} = R$;

      e) Update $u^{n} = u^{n} + \Delta u^{(k)}$;

      f) Update $\varepsilon^{n}, S^{n}$;

      g) Go to step 3.iii).a)
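The structure of Algorithm 6 (load stepping, residual check, tangent solve, update) can be sketched on a single-degree-of-freedom problem. The example below is a deliberately reduced illustration, not the thesis implementation: it assumes a hypothetical 1-D bar of length `L` and axial rigidity `EA`, with Green-Lagrange strain and a linear stress-strain law, and it replaces the assembly-free deflated-CG solve with a scalar division. All function names are introduced here for the example.

```python
def bar_internal_force(u, EA, L):
    """Internal force of a 1-D bar in a total Lagrangian setting."""
    stretch = 1.0 + u / L                  # 1-D deformation gradient F
    strain = 0.5 * (stretch ** 2 - 1.0)    # Green-Lagrange strain
    return EA * strain * stretch           # (area * 2nd PK stress) * F


def bar_tangent_stiffness(u, EA, L):
    """Derivative of the internal force with respect to u."""
    stretch = 1.0 + u / L
    strain = 0.5 * (stretch ** 2 - 1.0)
    k_material = EA / L * stretch ** 2     # linear stiffness scaled by F twice
    k_geometric = EA * strain / L          # stress-dependent (geometric) part
    return k_material + k_geometric


def newton_load_stepping(f_total, EA, L, n_steps=5, tol=1e-10, max_newton=50):
    """Apply f_total in n_steps increments, iterating to equilibrium in each."""
    u = 0.0
    f_ext = 0.0
    delta_f = f_total / n_steps
    for _ in range(n_steps):
        f_ext += delta_f                                   # step 3.ii
        for _ in range(max_newton):
            residual = f_ext - bar_internal_force(u, EA, L)  # step 3.iii.c
            if abs(residual) < tol:                          # step 3.iii.a
                break
            # Scalar stand-in for the assembly-free linear solve (step 3.iii.d)
            u += residual / bar_tangent_stiffness(u, EA, L)  # step 3.iii.e
    return u


# Made-up, non-dimensional data: EA = 1, L = 1, tensile load of 0.5.
u_nonlinear = newton_load_stepping(0.5, EA=1.0, L=1.0)
u_linear = 0.5 * 1.0 / 1.0   # f L / (EA) from linear elasticity
print(u_nonlinear, u_linear)
```

For this tensile case the non-linear displacement is smaller than the linear prediction, which mirrors the stiffening effect described for the cantilever in Figure 60. The tangent stiffness also splits into a material part and a stress-dependent part, the same two ingredients that appear in the element-level tangent stiffness.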

Observe that the algorithm uses a modular implementation of assembly-free deflated-CG [29] to

solve the large-deformation problem. The process to obtain element tangent stiffness and internal

force to set up the assembly-free linear solve is laid out next.

The terms in step 3.iii).f) in Algorithm 6 are the Green-Lagrange strain and 2nd Piola-Kirchhoff (PK) stress tensors, respectively [68]. They are energy-conjugate terms used in the total Lagrangian formulation [68] that depend on the current displacement field. The linearized relation between the 2nd PK stress and the Green-Lagrange strain in Voigt form is similar to the stress-strain relationship in linear elasticity [68], [69]:

$S = D\, \varepsilon$ (65)

These terms are essential in updating internal force for the next iteration in Newton method [68],

[69]. The algorithm for updating these tensors for any given element is described next. Detailed

derivation of the algorithm is available in [68], [69].

Algorithm 7: Update deformation gradient and 2nd PK stress for an element

1. For all Gauss points

   i. Compute displacement gradient $H_{ij} = \sum_{I=1}^{8} (\partial N_I / \partial X_j)\, u_{iI}$;

   ii. Compute deformation gradient $F_{def} = I + H$;

   iii. Compute Green-Lagrange strain $\varepsilon = \frac{1}{2}(H + H^T + H^T H)$;

   iv. Compute 2nd PK stress in Voigt form $S = D\, \varepsilon$;
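Steps ii to iv of Algorithm 7 can be sketched for a single Gauss point once the displacement gradient H is known. The code below is an illustration under assumptions: it takes H as a given 3 × 3 list, assumes an isotropic material, and writes the constitutive law in tensor form ($S = \lambda\, \mathrm{tr}(\varepsilon) I + 2\mu\, \varepsilon$ with the Lamé constants), which is equivalent to the Voigt product with an isotropic D. The function name `green_lagrange_and_pk2` and the sample gradients are hypothetical.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def green_lagrange_and_pk2(H, young, poisson):
    """Return (F, strain, S) at one Gauss point from the displacement gradient H."""
    identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Step ii: deformation gradient F = I + H
    F = [[identity[i][j] + H[i][j] for j in range(3)] for i in range(3)]
    # Step iii: Green-Lagrange strain 0.5 * (H + H^T + H^T H)
    Ht = transpose(H)
    HtH = matmul(Ht, H)
    strain = [[0.5 * (H[i][j] + Ht[i][j] + HtH[i][j]) for j in range(3)]
              for i in range(3)]
    # Step iv: isotropic law in tensor form (assumed equivalent to S = D * strain)
    lame = young * poisson / ((1.0 + poisson) * (1.0 - 2.0 * poisson))
    mu = young / (2.0 * (1.0 + poisson))
    trace = strain[0][0] + strain[1][1] + strain[2][2]
    S = [[lame * trace * identity[i][j] + 2.0 * mu * strain[i][j]
          for j in range(3)] for i in range(3)]
    return F, strain, S


# Sample 1: 10 % uniaxial stretch along x.
H_stretch = [[0.1, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
F1, strain1, S1 = green_lagrange_and_pk2(H_stretch, 2.1e11, 0.28)

# Sample 2: rigid 90-degree rotation about z, H = R - I.
H_rotation = [[-1.0, -1.0, 0.0], [1.0, -1.0, 0.0], [0.0, 0.0, 0.0]]
F2, strain2, S2 = green_lagrange_and_pk2(H_rotation, 2.1e11, 0.28)
print(strain1[0][0], strain2)
```

The second sample shows why this strain measure suits large-deformation analysis: a rigid rotation produces a large displacement gradient but zero Green-Lagrange strain, and therefore zero 2nd PK stress.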

The deformation gradient expressed in step (1.ii.) of Algorithm 7 is a measure of spatial

deformation of current configuration of an element (at force step t) w.r.t the initial configuration

(at force step 0) [68], [69]. Mathematically, deformation gradient for any point in the domain is

expressed as:

$F_{def} = \dfrac{\partial X^{t}}{\partial X^{0}}$ (66)

where $X^{t}$ are the spatial coordinates at step t and $X^{0}$ are the initial spatial coordinates at step 0.
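The definition above and step (1.ii.) of Algorithm 7 are two forms of the same quantity. A short derivation, assuming the current coordinates are the initial coordinates plus the displacement field $u$, makes the link explicit:

```latex
\begin{align*}
X^{t} &= X^{0} + u, \\
F_{def} &= \frac{\partial X^{t}}{\partial X^{0}}
         = \frac{\partial (X^{0} + u)}{\partial X^{0}}
         = I + \frac{\partial u}{\partial X^{0}}
         = I + H .
\end{align*}
```

Here $H$ is the displacement gradient computed in step (1.i.) of Algorithm 7.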

The mapping relationship between two configurations through deformation gradient is illustrated

in Figure 61.

Figure 61: Mapping of current mesh through deformation gradient


Use of this term allows for the numerical integration to be performed over initial configuration in

the TL formulation [68], [69]. Once deformation gradient and 2nd

PK stress for elements are

computed, the internal force can be updated for the element via Algorithm 8 [69].

Algorithm 8: Updating internal force for an element

1. Initialize $f_{in} = 0$

2. For all Gauss points

   i. Compute shape function gradient for all nodes $[B_{Ij}] = [\partial N_I(X) / \partial X_j]$;

   ii. Get $S, F_{def}$ for the element,

   iii. Compute nominal stress $P = S F_{def}^T$;

   iv. Update $f_{in,I} = f_{in,I} + B_I^T P\, |J_0|\, w_{gp}$ for all nodes I in the element

The displacement solution of large-deformation relies heavily on the assembly-free linear solve

step in Algorithm 6. Assembly-free linear solve requires an assembly-free implementation of

SpMV involving tangential stiffness matrix and incremental displacement vector. The tangential

stiffness matrix has two components, geometric stiffness and linearized material stiffness [68],

[69]. The geometric stiffness is also known as stress stiffness matrix for buckling analysis

defined in Equation (55); 2nd

PK stress is used for stiffness computation in large-deformation

analysis [68], [69]. The material stiffness is similar to the stiffness matrix in linear elasticity

when the integration is performed over current configuration. However, to use the element

stiffness from initial reference configuration, the deformation gradients are required from

Algorithm 7. Algorithm 9 outlines the assembly-free implementation of SpMV for tangent

stiffness; where y is the resultant vector and u is the displacement vector.


Algorithm 9: Assembly-free SpMV for an element

1. Initialize $y = 0$

2. For all Gauss points

   i. Compute shape function gradient for all nodes $[B_{Ij}] = [\partial N_I(X) / \partial X_j]$;

   ii. Get $S, F_{def}$ for the element,

   iii. Get template linear stiffness $K_e$ for the element at the Gauss point,

   iv. For a given node pair I, J (3,3)

      a. Compute geometric stiffness $[K_{geo}]_{IJ} = [B_I^T S B_J]\, |J_0|\, w_{gp}$;

      b. Compute linearized material stiffness $[K_{mat}]_{IJ} = [F_{def}^T K_e F_{def}]_{IJ}$;

      c. Update $y_I = y_I + ([K_{geo}]_{IJ} + [K_{mat}]_{IJ})\, u_J$ for all nodes I

At this point, it is important to mention that the algorithm presented is not optimized in several respects. A few of the areas that require further improvement in implementation are 1) dynamic force stepping [68], 2) reduced integration with hour-glass control [3], [69], 3) parallel implementation, etc. Therefore, the speed of the assembly-free TL formulation remains a topic of ongoing research, and timing results are not presented in this thesis.

The results only show the convergence behavior of deflated-CG when solving the linear system in step 3.iii.d of Algorithm 6.

4.4.2 Numerical results: large-deformation analysis of beam

The beam problem used for large deformation analysis is illustrated in Figure 62. The problem is

set up for large deformation via planar stretch for in-plane load, and bending deformation for

out-of-plane load.


The longest dimension of the beam is 0.5 m, with a cross-section of 20 mm × 50 mm. The material is alloy steel with Young’s modulus $E = 2.1 \times 10^{11}$ Pa and Poisson’s ratio $\nu = 0.28$. The beam was discretized using 5000 voxel-elements, creating a 17,500 DOF system. The load values are set to yield a maximum displacement of 0.06 m for the linear elasticity problem in both cases, as illustrated in Figure 63.

Figure 62: Large-deformation analysis on beam (panels: in-plane large deformation, out-of-plane large deformation)

Figure 63: Displacement results for linear static FEA: a) linear-elastic displacement for in-plane load, b) linear-elastic displacement for out-of-plane load


The system was solved with the large-deformation formulation through Algorithm 6. The loading was divided into 5 steps, and each step was allowed a tolerance of $\epsilon = 10^{-3}$ for the equilibrium condition to be satisfied between internal and external forces. The linear system was solved through the assembly-free deflated-CG presented in [29]. The displacement results for the large-deformation formulation are illustrated in Figure 64.

The results were verified for accuracy against SolidWorks [10], and they were the same for a 15,000 DOF system. The convergence of CG without deflation is compared with that of rigid-body deflation for in-plane large deformation (shown in Figure 65). The number of peaks in the plot indicates the number of linear solves performed by Algorithm 6 to achieve equilibrium.

Figure 64: Displacement results for large-deformation FEA: a) displacement for in-plane load, b) displacement for out-of-plane load


The advantage of deflated-CG is overwhelming for the well-conditioned problem posed by the in-plane load case, even for a small system. Since the displacement is expected to have significant dilation, the experiment was repeated with 1st order elastic-polynomial deflation. For the same number of groups, the convergence of 1st order elastic-polynomial deflation was compared with that of rigid-body deflation. The convergence plot is shown in Figure 66.

Figure 65: Convergence plot: CG w/o deflation vs Rigid-body


The advantage is limited; however, as the size of the problem increases, the 1st order elastic-polynomial deflation may scale better than rigid-body deflation. This requires further investigation.

The linear solve for the out-of-plane load without deflation did not converge in 20,000 iterations, even for a small system of 17,500 DOF. Therefore, the convergence plot was generated for rigid-body deflation and Euler-Bernoulli beam deflation using the same number of groups. The plot is illustrated in Figure 67.

Figure 66: Convergence plot: Rigid-body vs 1st order Elastic-polynomial


The behavior of individual deflation methods for large-deformation problems is left for future research. Nevertheless, the naïve implementation presented for large deformation highlights the adaptability of physics-based deflation to all types of FEA problems in solid mechanics.

4.5 Summary

The formulations presented in this chapter illustrate the fact that FEA typically requires an efficient linear solver. Assembly-free deflated-CG is a general-purpose linear solver that meets this criterion for a large variety of problems in FEA. The formulations emphasize the ease with which assembly-free deflation methods can be integrated as the linear solver for various types of FEA.

The implementation of assembly-free transient FEA was not included in this chapter; however, Mirzendehdel and Suresh performed dynamic analysis in [9] using the assembly-free deflated-CG presented in this thesis.

Figure 67: Convergence plot: Rigid-body vs Euler-Bernoulli beam


Results presented in this chapter highlight the advantage of using assembly-free FEA for large-scale problems. The advantage and efficiency are a direct consequence of limiting the memory requirement for solving the system. This allows the proposed assembly-free deflated-CG to be competitive with commercial solvers.

While the algorithm for large-deformation analysis can be optimized to achieve equilibrium

condition faster, solving the linearized system remains the most expensive operation. Further

research in assembly-free SpMV for large-deformation is required.

There is also the issue of accuracy for non-conforming mesh, specifically voxelized geometry.

For this, assembly-free SpMV needs to be improved to better exploit congruency in conforming

mesh. On the other hand, voxelization offers a robust discretization method for complex

geometry without any need for de-featuring. It also provides quick solutions for large-scale

problems that can be used to optimize the design, as is the case in topology optimization. In the

next chapter, application of assembly-free FEA in topology optimization is discussed.


5 APPLICATION: TOPOLOGY OPTIMIZATION

In this chapter, the application of assembly-free FEA in large-scale topology optimization is

discussed.

In topology optimization, the goal is to find the optimum material distribution while minimizing

some objective function with given constraints [70], [71]. The ‘topology’ of the domain is treated

as design variable, i.e., introduction of ‘holes’ is permitted and expected during optimization.

Figure 68(a) outlines the steps involved in topology optimization. The steps are illustrated using

an example of a 2D cantilever minimizing compliance for a desired volume fraction of 0.5 in

Figure 68(b).

Step 1 of topology optimization, shown in Figure 68(a), is initializing the design space. The design space (D) is the allowable region within which the material can be distributed. Its boundary consists of free, Dirichlet (fixed) and Neumann (traction) portions. Dirichlet and Neumann boundaries are typically retained during the optimization.


Step 2 is performing Finite Element Analysis (FEA) over the current design. FEA is performed

by discretizing the design, typically via finite-elements, and solving, for example, the

equilibrium Equation (3).

Step 3 is performing the sensitivity analysis on the discretized space. Sensitivity is defined as the change of the objective function with respect to a change of the design variables. The objective function in this case is compliance ($f^T u$). Design variables vary depending on the optimization method. The design variable for optimization methods such as Solid Isotropic Material with Penalization (SIMP) [72] is the element density $\rho_e$, where $\rho_e \in [0,1]$. Level-set methods, on the other hand, typically use boundary variation as the design variable to compute sensitivity. Alternatively, the topological sensitivity field is used in [73].

Figure 68: Topology Optimization of 2D Cantilever: (a) flow chart of topology optimization, (b) an illustrative example


Step 4 is filtering/smoothing the sensitivity field. This is typically needed to avoid pathological conditions during optimization, such as checker-board patterns [74].

Step 5 is carrying out an optimization step, where the design variables are updated, constraints are verified, etc.
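To make Step 4 concrete, the sketch below applies one simple smoothing choice, a 3 × 3 neighbourhood average, to a checkerboard-like field on a small grid. This is an assumed filter chosen for illustration only; it is not necessarily the filter used in this work or in [74], and the function name `smooth_field` and the sample values are made up.

```python
def smooth_field(field):
    """Replace each cell by the mean over its 3 x 3 neighbourhood (clipped at edges)."""
    rows, cols = len(field), len(field[0])
    smoothed = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total, count = 0.0, 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        total += field[ni][nj]
                        count += 1
            smoothed[i][j] = total / count
    return smoothed


# Checkerboard-like sensitivity values on a 4 x 4 grid (made-up numbers).
checker = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
filtered = smooth_field(checker)

flat = [v for row in filtered for v in row]
contrast_before = 1.0                      # values alternate between 0 and 1
contrast_after = max(flat) - min(flat)
print(contrast_after)
```

The alternating pattern has a contrast of 1 before filtering and a much smaller contrast afterwards, which is the sense in which smoothing suppresses checker-board artifacts in the sensitivity field.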

Topology optimization can involve hundreds of finite element operations, and this can be

computationally demanding, especially in 3D. For example, Wang et al [75] published results for

a compliance optimization problem on a 3D cantilever beam shown in Figure 69.

The optimized topology for the 3D cantilever beam with a volume fraction of 0.5 is shown in

Figure 70. Total time taken to optimize the topology using a mesh with 100,000 DOF was about

2.4 hours, whereas with 1 million DOF, their implementation took 45.7 hours to complete.

Optimizing such systems is fairly common in the industry [76].

Figure 69: 3D Cantilever Beam

Figure 70: 3D Cantilever Beam Optimized


Strategies have therefore been developed to accelerate topology optimization along two fronts: 1) improved optimization techniques, and 2) faster FEA. Several papers have been published on improved optimization methods [77]–[81]; this list is not exhaustive. The level-set method described in [82] is used as the optimization algorithm to illustrate the application of assembly-free FEA.

5.1 Voxel-mesh in topology optimization

In this section, impact of using a non-conforming voxel-mesh for topology optimization is

discussed. For example, consider an L-bracket to be optimized for compliance as shown in

Figure 71.

Typically, the geometry would be discretized using a conforming mesh as shown in Figure 72(a).

The algorithm then eliminates the elements in the process of optimization. In this case, the

density field is used to determine optimum topology as shown in Figure 72(b).

Figure 71: Geometry of 2D L-Bracket


However, instead of using a conforming mesh, one can discretize the geometry using a grid mesh as illustrated in Figure 73(a), which is non-conforming to the boundary. Using the same optimization algorithm, observe that the final topology for the grid mesh (shown in Figure 73(b)) is very similar to the topology obtained with the conforming mesh (Figure 72(b)).

As a second example, the same geometry was optimized for minimum stress for a volume fraction of 0.5. The optimized topologies for both the conforming and grid meshes are illustrated in Figure 74. Observe that high-stress features (such as the fillet) are modified during the optimization, and thus the advantage of having a conforming mesh for better local stress results is lost.

Figure 72: (a) Conforming mesh for L-bracket, and (b) Optimized topology

Figure 73: (a) Grid mesh for L-bracket, and (b) Optimized topology

One can hypothesize that topology optimization is relatively insensitive to non-conformity of the mesh. The sensitivity computed for the given meshes would certainly be different. However, the noise introduced in the sensitivity field due to non-conformity is eliminated in the filtering/smoothing step of topology optimization.

Since topology optimization is used in the initial stages of conceptual design, structured grid-meshes are sufficient, as they lead to similar conceptual designs, and one can exploit grid-meshes to accelerate finite element analysis. Therefore, assembly-free FEA implemented for voxel-mesh becomes ideal for topology optimization.

As an example of optimization problem solved through assembly-free FEA, consider a buckling

constrained topology optimization presented in [55].

Figure 74: Optimized for minimum stress: (a) conforming mesh, (b) grid mesh


5.2 Buckling constrained optimization

The buckling-constrained topology optimization problem is posed as a constrained minimization

problem in [55]:

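$\underset{\Omega \subset D}{\text{Min}}\ |\Omega| \quad \text{subject to} \quad J \le J_{max}, \quad \sigma \le \sigma_{max}, \quad \lambda_c \ge \lambda_{min}$ (67)

where
$\Omega$: domain of objective topology
$D$: allowable design space
$J$: compliance of structure
$J_{max}$: maximum allowable compliance
$\sigma$: maximum von Mises stress for current design
$\sigma_{max}$: maximum allowable von Mises stress
$\lambda_c$: critical buckling load
$\lambda_{min}$: minimum buckling load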

The sensitivity expression for buckling is derived, for example, in [83], and is given by:

$\lambda' = -\dfrac{w^T (K' + \lambda K_\sigma')\, w}{w^T K_\sigma w}$ (68)

It is assumed that the eigenvectors have been $K_\sigma$-orthonormalized, such that:

$w^T K_\sigma w = -1$ (69)

Thus, the sensitivity expression can be rewritten as:

$\lambda' = w^T (K' + \lambda K_\sigma')\, w$ (70)

Unlike SIMP methods, where the sensitivity is obtained with respect to pseudo-density variables, here the sensitivity is with respect to the discrete addition or removal of an element; for example, the discrete sensitivity of the stiffness matrix to element deletion is given by:

$K' = \begin{bmatrix} 0 & 0 & 0 \\ 0 & k_e & 0 \\ 0 & 0 & 0 \end{bmatrix}_{N \times N}$ (71)

where $k_e$ is the elemental stiffness matrix. The second part of the derivative can be neglected for linear elastic problems, as per the derivation presented in [83]. While the derivative is computed in [83] with respect to element density, the same can be extended to a discrete element variable.

The element-by-element sensitivity can be projected to the nodes to obtain a continuous field of topological sensitivity.
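
As a minimal sketch of these two steps (not the implementation used in this thesis), assume that the eigenvector is normalized as in Equation (69), that the second part of the derivative is neglected, and that all elements share one stiffness matrix $k_e$, as in a voxel mesh. The sensitivity of element $e$ then reduces to the quadratic form $w_e^T k_e w_e$ over that element's DOFs, up to the sign convention chosen for deletion. The function names, the connectivity arrays and the simple averaging used for the nodal projection are illustrative assumptions.

```python
import numpy as np

def element_buckling_sensitivity(w, k_e, elem_dofs):
    """Return w_e^T k_e w_e for every element.

    w         : (N,) buckling mode, assumed already normalized
    k_e       : (m, m) element stiffness shared by all elements (assumption)
    elem_dofs : (n_elem, m) integer array of global DOF indices per element
    """
    w_e = w[elem_dofs]  # gather element DOFs, shape (n_elem, m)
    return np.einsum("ei,ij,ej->e", w_e, k_e, w_e)

def project_to_nodes(elem_values, elem_nodes, n_nodes):
    """Average element values onto nodes (one simple choice of projection)."""
    n_per_elem = elem_nodes.shape[1]
    flat = elem_nodes.ravel()
    sums = np.bincount(flat, weights=np.repeat(elem_values, n_per_elem),
                       minlength=n_nodes)
    counts = np.bincount(flat, minlength=n_nodes)
    return sums / np.maximum(counts, 1)

# Toy 1D example: two 2-node bar elements (DOF index == node index)
k_e = np.array([[1.0, -1.0], [-1.0, 1.0]])
conn = np.array([[0, 1], [1, 2]])
w = np.array([0.0, 1.0, 3.0])
s_elem = element_buckling_sensitivity(w, k_e, conn)
s_node = project_to_nodes(s_elem, conn, n_nodes=3)
```

Because only `k_e` and the connectivity are touched, the sensitivity field is obtained without assembling the global stiffness matrix.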

The sensitivity fields for stress and compliance were obtained using the implementation presented by Suresh and Takalloozadeh in [84]. Combining the topological sensitivity of buckling with the sensitivities of compliance [ref] and stress [ref], one can generate the topological level-set T for the constrained optimization problem described in Equation (67). Figure 75 illustrates the algorithm proposed in [55] for buckling-constrained optimization.

For the optimization, the static and buckling analyses are solved using the proposed assembly-free deflated-CG.

5.2.1 Numerical results: optimizing a thin column

Consider minimizing the volume of a thin column with a compressive load, illustrated in Figure

76.

Figure 75: Algorithm for buckling-constrained topology optimization

Specifically, the objective is to solve the topology optimization problem:

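$$
\begin{aligned}
\min_{\Omega \subset D} \quad & |\Omega| \\
\text{subject to} \quad & J \le 5 J_0 \\
& \sigma \le 5 \sigma_0 \\
& \lambda_c \ge (\mathrm{SF}) \, \lambda_0
\end{aligned}
\tag{72}
$$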

In other words, the maximum allowable von Mises stress and compliance are each 5 times their initial values. For the buckling constraint, a safety factor (SF) is prescribed with respect to the initial buckling load.
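
These three constraints can be checked for a candidate design with a few comparisons. The sketch below is only an illustration: the function name, the argument names and the sample numbers are hypothetical, and the quantities `J`, `sigma` and `lam_c` are assumed to come from the static and buckling analyses of the current design.

```python
def constraints_satisfied(J, sigma, lam_c, J0, sigma0, lam0, sf, factor=5.0):
    """Check compliance, stress and buckling constraints against the initial design."""
    compliance_ok = J <= factor * J0      # compliance at most 5x its initial value
    stress_ok = sigma <= factor * sigma0  # max von Mises stress at most 5x initial
    buckling_ok = lam_c >= sf * lam0      # buckling load at least SF x initial
    return compliance_ok and stress_ok and buckling_ok

# Hypothetical numbers, for illustration only
feasible = constraints_satisfied(J=3.0, sigma=4.0, lam_c=1.6,
                                 J0=1.0, sigma0=1.0, lam0=1.0, sf=1.5)
infeasible = constraints_satisfied(J=3.0, sigma=4.0, lam_c=1.2,
                                   J0=1.0, sigma0=1.0, lam0=1.0, sf=1.5)
```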

The structure was voxelized with 500,000 DOF, and the time taken for buckling analysis was 46

sec. As the safety factor (SF) is increased in Equation (72), the buckling constraint begins to

dominate, resulting in topologies illustrated in Figure 77.

Figure 76: Thin column with compressive load

Table 11 lists the timing results for the optimized designs shown in Figure 77.

Table 11: Minimizing volume for Stiff structure

| Prescribed S.F. | Final volume fraction | Time (min) | #FEA |
|---|---|---|---|
| No constraint | 0.3 | 15 | 64 |
| 1.1 | 0.3 | 38.3 | 86 |
| 1.5 | 0.42 | 41.5 | 98 |
| 2 | 0.52 | 24 | 74 |

5.2.2 Numerical results: optimizing a thin plate

The structural problem considered next is illustrated in Figure 78. The plate is of dimensions

100x100x10 (mm); the lower face is fixed while a uniform load is applied on the top. The

topology optimization problem is the same as defined in Equation (72).

a) w/o buckling; b) SF= 1.1; c) SF = 1.5; d) SF = 2

Figure 77: Stiff designs with different safety factors

The structure was again voxelized with 500,000 DOF. One FEA, including the buckling analysis, took 7.1 seconds. With various buckling safety factors imposed, the resulting topologies are illustrated in Figure 79. Observe that, as the safety factor is increased, additional ribs are introduced.

Figure 78: Plate with compressive load

The final volume fractions and time taken are summarized in Table 12. Once again, as the

buckling safety factor is increased, the optimization process converges to a higher volume

fraction, as expected.

Table 12: Optimizing plate with buckling constraints

| Buckling safety factor | Time (s) | Volume fraction | #FEA |
|---|---|---|---|
| No constraint | 300 | 0.3 | 61 |
| 4 | 432 | 0.35 | 98 |
| 7 | 332 | 0.42 | 85 |
| 10 | 230 | 0.61 | 62 |

a) No buckling constraint b) SF = 4 c) SF = 7 d) SF = 10

Figure 79: Optimized topologies for various safety factors.

5.3 Impact of Assembly-free FEA in compliance optimization

Consider the compliance optimization of the cantilever beam presented by Wang et al [75], illustrated in Figure 69 and Figure 70. The optimization is carried out using the proposed assembly-free FEA. The experiment was repeated with several sub-implementations developed over the course of this research.

The speed of optimization through all such experiments is compared against the published data

from [75]. Results for time taken to complete the optimization are listed in Table 13. The

computer architecture and the year the data was generated are also listed in the table.

Table 13: Optimization speed for compliance minimization

| Implementation (year) | Hardware | 107,184 DOF | 1,010,160 DOF |
|---|---|---|---|
| Data from Wang et al [ref] (2007) | AMD Opteron, 2 cores, 8 GB RAM | 2.4 hours | 45.7 hours |
| Element-connectivity-based assembly-free SpMV (2014) | AMD Phenom, 4 cores, 4 GB RAM | 4.6 mins | 6 hours |
| Element-connectivity-based assembly-free SpMV (2014) | GTX 460, 336 cores, 0.75 GB device memory | 2.2 mins | 2 hours |
| Node-connectivity-based assembly-free SpMV (2014) | AMD Phenom, 4 cores, 4 GB RAM | 2.7 mins | 4.2 hours |
| Node-connectivity-based assembly-free SpMV (2014) | GTX 460, 336 cores, 0.75 GB device memory | 1.4 mins | 1.1 hours |
| Assembly-free FEA with deflation (2015) | Intel i7, 8 cores, 16 GB RAM | 20 secs | 7.9 mins |

5.4 Summary

In their paper, Wang et al [75] illustrate findings that strongly recommend using iterative

methods for large-scale optimization. The assembly-free deflated-CG proposed in this thesis is

another step towards improving efficiency by limiting the memory requirements for such large-

scale optimization problems.

The numbers in Table 13 indicate a large improvement over the course of this research (from 2.4 hours to 20 seconds for 107,184 DOF, and from 45.7 hours to 7.9 minutes for 1,010,160 DOF), but part of that gain comes from evolving hardware. For a fairer comparison, Table 14 lists the timings of the earlier implementations on both the older and the current hardware.

Table 14: Comparison of optimization speed across various platforms

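| Implementation | Hardware | 107,184 DOF | 1,010,160 DOF |
|---|---|---|---|
| Element-connectivity-based assembly-free SpMV | AMD Phenom, 4 cores, 4 GB RAM | 4.6 mins | 6 hours |
| Element-connectivity-based assembly-free SpMV | GTX 460, 336 cores, 0.75 GB device memory | 2.2 mins | 2 hours |
| Element-connectivity-based assembly-free SpMV | Intel i7, 8 cores, 16 GB RAM | 2.3 mins | 34.4 mins |
| Node-connectivity-based assembly-free SpMV | AMD Phenom, 4 cores, 4 GB RAM | 2.7 mins | 4.2 hours |
| Node-connectivity-based assembly-free SpMV | GTX 460, 336 cores, 0.75 GB device memory | 1.4 mins | 1.1 hours |
| Node-connectivity-based assembly-free SpMV | Intel i7, 8 cores, 16 GB RAM | 1.25 mins | 22.4 mins |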

The assembly-free FEA exploits the compute capabilities provided by these technologies by minimizing the amount of data required to perform the analysis, so that more of the time is spent on the analysis itself.

6 CONCLUSION AND FUTURE WORK

6.1 Conclusion

The main contribution of this thesis is an efficient assembly-free deflated-CG for large-scale

FEA, specifically targeting solid mechanics.

To achieve efficiency in an assembly-free manner, physics-based deflation was explored. Exploratory research led to established methods such as agglomeration [19] and rigid-body deflation [28]. Using those as a starting point, a group of physics-based deflation methods was proposed:

1) Kirchhoff-Love plate [29]: exploits curvature-based terms in ‘thin plate-like’ structures.

2) Euler-Bernoulli beam [29]: exploits the bending behavior of ‘thin beam-like’ structures.

3) Planar-rigid-body: exploits the dimensional reduction of a 3D problem to a 2D planar system without adding the framework of 2D meshing.

4) Elastic-polynomial: exploits the displacement trial functions generated to satisfy Airy stress functions for a 3D system [32].

Deflation methods allow the solver to exploit a reduced formulation within a straightforward 3D FEA, without any need for an explicit FEA over the reduced system.

The main advantage of the proposed deflation methods lies in their assembly-free implementation. Through assembly-free deflation, one can accelerate convergence with minimal memory overhead. The assembly-free deflation must, however, be paired with an assembly-free SpMV to yield a truly effective solver for large-scale FEA.
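
To make the role of the deflation-space concrete, the following is a minimal sketch of a deflated conjugate gradient iteration in the style of [21], written so that the stiffness matrix is accessed only through a matrix-vector product. It is not the implementation developed in this thesis: the function name `deflated_cg`, the dense NumPy test matrix and the agglomeration-style deflation matrix `W` are assumptions made for illustration.

```python
import numpy as np

def deflated_cg(matvec, b, W, tol=1e-10, max_iter=1000):
    """Solve A x = b for SPD A, given only matvec(v) = A v and a deflation matrix W."""
    AW = np.column_stack([matvec(W[:, j]) for j in range(W.shape[1])])
    E = W.T @ AW  # small coarse matrix W^T A W

    def coarse_correction(r):
        # W (W^T A W)^{-1} W^T A r, using the symmetry of A: W^T A r = (A W)^T r
        return W @ np.linalg.solve(E, AW.T @ r)

    x = W @ np.linalg.solve(E, W.T @ b)  # initial guess chosen so that W^T r = 0
    r = b - matvec(x)
    p = r - coarse_correction(r)
    rs = r @ r
    b_norm = np.linalg.norm(b)
    n_iter = 0
    while np.sqrt(rs) > tol * b_norm and n_iter < max_iter:
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        # new search direction, kept A-orthogonal to the deflation-space
        p = (rs_new / rs) * p + r - coarse_correction(r)
        rs = rs_new
        n_iter += 1
    return x, n_iter

# Toy problem: 1D Laplacian with 20 DOF, deflated with 4 groups of 5 nodes
n = 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
W = np.kron(np.eye(4), np.ones((5, 1)))  # piecewise-constant (agglomeration) vectors
b = np.ones(n)
x, n_iter = deflated_cg(lambda v: A @ v, b, W)
```

The only additional storage is `W`, `AW` and the small matrix `E`, which is why the memory overhead of deflation stays small when the number of deflation vectors is much smaller than the number of DOF.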

For an effective SpMV, congruency of elements is proposed to limit the memory requirements. The benefits of congruency were clearly evident in a non-conforming voxel-based discretization.
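
The memory saving from congruency can be illustrated with a minimal element-by-element SpMV in which a single element stiffness matrix is stored and reused for every congruent element, so that the global matrix is never assembled. This is a sketch under simplifying assumptions (one congruency class, no boundary conditions, CPU-only NumPy), and the names `assembly_free_spmv` and `elem_dofs` are hypothetical rather than taken from the thesis code.

```python
import numpy as np

def assembly_free_spmv(k_e, elem_dofs, x):
    """Compute y = K x without assembling K.

    k_e       : (m, m) stiffness matrix shared by all congruent elements
    elem_dofs : (n_elem, m) integer array of global DOF indices per element
    x         : (N,) input vector
    """
    x_e = x[elem_dofs]            # gather element vectors, shape (n_elem, m)
    y_e = x_e @ k_e.T             # k_e @ x_e for all elements at once
    y = np.zeros_like(x)
    np.add.at(y, elem_dofs, y_e)  # scatter-add the element contributions
    return y

# Toy 1D example: three 2-node bar elements, four DOF
k_e = np.array([[1.0, -1.0], [-1.0, 1.0]])
elem_dofs = np.array([[0, 1], [1, 2], [2, 3]])
x = np.array([1.0, 2.0, 4.0, 8.0])
y = assembly_free_spmv(k_e, elem_dofs, x)
```

Storage here is one small `k_e` plus the connectivity, independent of the number of non-zeros an assembled matrix would have.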

The examples presented in Chapters 4 and 5 illustrate the applicability of assembly-free deflated-CG for large-scale FEA. Moreover, the examples for static FEA illustrate the limited memory required to handle very large systems.

6.2 Future work

This thesis raises many open research problems that require further investigation. A few of them are listed next.

6.2.1 Effectiveness of Elastic-polynomials

The entire family of generalized polynomials for elastic fields introduced in [32] appears to be

promising for deflation methods. The deflation-space developed from 1st order elastic-

polynomial appeared to be only marginally better than rigid-body deflation.

However, a careful observation of the mapping operator in Equation (29) reveals the nature of the additional DOF available in the 1st order elastic-polynomial. Its DOFs comprise the 6 rigid-body motions, 3 DOF for dilation in each of the directions X, Y, Z, and 3 DOF for linearized twisting about X, Y, Z.
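
The rigid-body part of such a mapping is standard and can be sketched as follows: for a node at position $r$ relative to the centre of its group, a group translation $t$ and a small rotation $\theta$ produce the displacement $u = t + \theta \times r$. The sketch covers only these 6 rigid-body columns; the dilation and twisting columns of Equation (29) are not reproduced here, and the function names and argument layout are assumptions for illustration.

```python
import numpy as np

def rigid_body_block(r):
    """3x6 block mapping (t_x, t_y, t_z, th_x, th_y, th_z) to a nodal displacement."""
    x, y, z = r
    # theta x r written as a matrix acting on theta
    rotation = np.array([[0.0, z, -y],
                         [-z, 0.0, x],
                         [y, -x, 0.0]])
    return np.hstack([np.eye(3), rotation])

def rigid_body_deflation_block(coords):
    """Stack the 3x6 blocks of all nodes in one group; coords is (n_nodes, 3)."""
    centre = coords.mean(axis=0)
    return np.vstack([rigid_body_block(c - centre) for c in coords])

# A unit rotation about Z moves a node on the X axis in the +Y direction
u = rigid_body_block(np.array([1.0, 0.0, 0.0])) @ np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])

# Four non-coplanar nodes give a full-rank 12x6 block
group = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
W_group = rigid_body_deflation_block(group)
```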

For small displacements, the dilation and twisting effects are small, which explains the only marginally better performance observed in deflated-CG. It will be interesting to observe its behavior for non-linear thermo-elastic problems with softer materials and high thermal coefficients that are susceptible to large dilations. This also opens up the entire family of elastic-polynomials for membrane problems.

6.2.2 Feature based deflation

The implementation of the deflation-space presented in this thesis applies a single type of deflation over the entire discretized domain. The domain is identified as thick, plate-like, beam-like, etc., and the appropriate deflation method is applied over the aggregated nodes.

Rather than identifying the entire domain as a specific type of problem, the aggregated nodes can be classified by the feature they are part of, such as thick, plate-like, beam-like, etc. A specific deflation can then be applied to each group depending on its feature. This is similar to ‘hybrid FEA’, with the advantage of a simple assembly-free deflation instead of reduced-dimension FEA over different features.

Identifying features based on geometry is a research problem that needs further investigation.

6.2.3 Assembly-free non-linear FEA

The assembly-free FEA for large-deformation problems presented in this thesis is a simplistic approach to geometric non-linearity. The implementation is basic and requires improvement in the convergence of the Newton method in Algorithm 6. While research is available on effective convergence of the equilibrium equation, the assembly-free implementation of several modules needs to be addressed. The current implementation relies on a full 8-point integration scheme, which is highly inefficient.

Furthermore, material non-linearity has to be explored for assembly-free non-linear FEA. The assembly-free deflated-CG proposed in this thesis has been successfully used for multi-material FEA in [85]. Therefore, one can readily use the proposed deflated-CG for material non-linearity.

6.2.4 Assembly-free SpMV for conforming mesh

This is perhaps the most important problem that needs to be addressed to make the proposed method fully versatile. The congruency criterion presented is the first step towards solving this issue.

The implementation of finding congruent elements has to be optimized. This research problem is perhaps better suited for computational sciences.
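
One naive way to find congruent elements, shown here only as a hedged starting point, is to hash each element's nodal coordinates relative to its first node. This detects congruency under translation only (rotated or reflected copies are treated as distinct), and the function name, the rounding tolerance and the data layout are assumptions rather than the criterion used in this thesis.

```python
import numpy as np

def group_translation_congruent(coords, elems, decimals=9):
    """Label elements that are translated copies of each other.

    coords : (n_nodes, dim) nodal coordinates
    elems  : (n_elem, m) node indices per element, in a consistent local ordering
    Returns (labels, n_groups); one element stiffness matrix is needed per group.
    """
    groups = {}
    labels = np.empty(len(elems), dtype=int)
    for e, nodes in enumerate(elems):
        rel = coords[nodes] - coords[nodes[0]]  # shape relative to the first node
        key = tuple(np.round(rel, decimals).ravel().tolist())
        labels[e] = groups.setdefault(key, len(groups))
    return labels, len(groups)

# Toy 2D example: two unit squares and one stretched quadrilateral
coords = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0],
                   [2.0, 0.0], [2.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
elems = np.array([[0, 1, 2, 3], [1, 4, 5, 2], [4, 6, 7, 5]])
labels, n_groups = group_translation_congruent(coords, elems)
```

The loop costs one pass over the elements; for a conforming mesh with many distinct shapes the number of groups, and hence the number of stored stiffness matrices, may approach the number of elements.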

REFERENCES

[1] O. C. Zienkiewicz, The Finite Element Method: Its Basis and Fundamentals. Elsevier

Butterworth Heinemann, 2005.

[2] O. C. Zienkiewicz and R. L. Taylor, The Finite Element Method for Solid and Structural

Mechanics. Elsevier, 2005.

[3] R. D. Cook, Concepts and Applications of Finite Element Analysis. John Wiley & Sons,

2002.

[4] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 1996.

[5] R. Aubry, F. Mut, S. Dey, and R. Lohner, “Deflated Preconditioned Conjugate Gradient

Solvers for Linear Elasticity,” International Journal for Numerical Methods in

Engineering, vol. 88, pp. 1112–1127.

[6] ANSYS 13. ANSYS; www.ansys.com, 2012.

[7] Y. Saad, Iterative Methods for Sparse Linear Systems. SIAM, 2003.

[8] J. R. Shewchuk, “An Introduction to the Conjugate Gradient Method Without the

Agonizing Pain.”

[9] A. M. Mirzendehdel and K. Suresh, “A Fast Time-Stepping Strategy for the Newmark-Beta

Method,” in ASME 2014 International Design Engineering Technical Conferences and

Computers and Information in Engineering Conference, 2014, pp. V01AT02A020–

V01AT02A020.

[10] SolidWorks; www.solidworks.com. 2005.

[11] K. Y. Sze, “Three-dimensional continuum finite element models for plate/shell analysis,”

Prog. Struct. Eng. Mater, vol. 4, pp. 400–407, 2002.

[12] “HOME - PATRIOT Engineering Company.” [Online]. Available:

http://www.patriotengineeringco.com/index.htm. [Accessed: 02-Dec-2015].

[13] “Finite Element FEA | FEA and Plastic Injection Solution.” [Online]. Available:

http://feasolution.com/finite-element-analysis-solution/. [Accessed: 02-Dec-2015].

[14] R. Barrett, M. W. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R.

Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems:

Building Blocks for Iterative Methods. Philadelphia: SIAM Press, 1993.

[15] M. Benzi, “Preconditioning Techniques for Large Linear Systems: A Survey,” Journal of

Computational Physics, vol. 182, no. 2, pp. 418–477, Nov. 2002.

[16] L. N. Trefethen and D. Bau III, Numerical Linear Algebra, vol. 50. SIAM, 1997.

[17] M. Benzi and M. Tuma, “A Robust Incomplete Factorization Preconditioner for Positive

Definite Matrices,” Numerical Linear Algebra With Applications, vol. 10, pp. 385–400,

2003.

[18] Y. Saad, Numerical methods for large eigenvalue problems. Manchester University Press.

[19] R. A. Nicolaides, “Deflation of Conjugate Gradients with Applications to Boundary Value

Problems,” SIAM Journal on Numerical Analysis, vol. 24, no. 2, pp. 355–365, 1987.

[20] M. Adams, “Evaluation of three unstructured multigrid methods on 3D finite element

problems in solid mechanics,” International Journal for Numerical Methods in

Engineering, vol. 55, no. 2, pp. 519–534, 2002.

[21] Y. Saad, M. Yeung, J. Erhel, and F. Guyomarc’h, “A Deflated Version of the Conjugate

Gradient Algorithm,” pp. 1909–1926, 2000.

[22] S. F. McCormick, Ed., Multigrid Methods. SIAM, 1987.

[23] W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid Tutorial. SIAM, 2000.

[24] J. A. Mitchell and J. N. Reddy, “A Multilevel Hierarchical Preconditioner for Thin Elastic

Solids,” International Journal of Numerical Methods in Engineering, vol. 43, pp. 1383–

1400, 1998.

[25] P. Arbenz et al., “A Scalable Multi-level Preconditioner for Matrix-Free μ-Finite

Element Analysis of Human Bone Structures,” International Journal for Numerical

Methods in Engineering, vol. 73, no. 7, pp. 927–947, 2008.

[26] V. Mishra and K. Suresh, “Efficient Analysis of 3-D Plates via Algebraic Reduction,”

San Diego, CA, 2009, vol. 2, pp. 75–82.

[27] V. Mishra and K. Suresh, “Dual Representation Methods for Efficient and Automatable

Analysis of 3D Plates,” Journal of Computing and Information Science in Engineering,

vol. 10, no. 4, Nov. 2010.

[28] V. E. Bulgakov and G. Kuhn, “High-performance multilevel iterative aggregation solver for

large finite-element structural analysis problems,” International Journal for Numerical

Methods in Engineering, vol. 38, no. 20, pp. 3529–3544, 1995.

[29] P. Yadav and K. Suresh, “Large Scale Finite Element Analysis Via Assembly-Free

Deflated Conjugate Gradient,” J. Comput. Inf. Sci. Eng., vol. 14, no. 4, pp. 041008–041008,

Oct. 2014.

[30] S. Timoshenko and S. W. Krieger, Theory of Plates and Shells. New York: McGraw-Hill

Book Company, 1959.

[31] W. Pilkey, Analysis and Design of Elastic Beams. New York, NY: John Wiley, 2002.

[32] M.-Z. Wang, B.-X. Xu, and Y.-T. Zhao, “General representations of polynomial elastic

fields,” Journal of Applied Mechanics, vol. 79, no. 2, p. 021017, 2012.

[33] X.-R. Fu, S. Cen, C. F. Li, and X.-M. Chen, “Analytical trial function method for

development of new 8-node plane element based on the variational principle containing

Airy stress function,” Engineering Computations, vol. 27, no. 4, pp. 442–463, 2010.

[34] M. Duan, “5-node hybrid/mixed finite element for Reissner-Mindlin plate,” Finite Elements

in Analysis and Design, vol. 33, pp. 167–185, 1999.

[35] K. Y. Sze, “A hybrid stress ANS solid-shell element and its generalization for smart

structure modelling. Part I - solid-shell element formulation,” International Journal for

Numerical Methods in Engineering, vol. 48, pp. 545–564, 2000.

[36] M. Duan, “Effective Hybrid/Mixed Finite Elements for Folded-Plate Structures,” Journal of

Engineering Mechanics, vol. 128, no. 2, pp. 202–208, 2002.

[37] K. Suresh, “Generalization of the mid-element based dimensional reduction,” Journal of

Computing and Information Science in Engineering, vol. 3, no. 4, pp. 308–314, 2003.

[38] K. Suresh, “Generalization of the Kantorovich Method of Dimensional Reduction,”

Albuquerque, 2003.

[39] K. Gallivan, “Model reduction via truncation: an interpolation point of view,” Linear

Algebra and its Applications, vol. 375, pp. 115–134, 2003.

[40] K. Jorabchi and K. Suresh, “Algebraic Dimensional Reduction,” San Francisco, CA, 2007.

[41] E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” in

Proceedings of the 1969 24th national conference, 1969, pp. 157–172.

[42] J. R. Gilbert, C. Moler, and R. Schreiber, “Sparse matrices in MATLAB: design and

implementation,” SIAM Journal on Matrix Analysis and Applications, vol. 13, no. 1, pp.

333–356, 1992.

[43] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of

sparse matrix–vector multiplication on emerging multicore platforms,” Parallel Computing,

vol. 35, no. 3, pp. 178–194, 2009.

[44] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, “Sparse Matrix Solvers on the GPU:

Conjugate Gradients and Multigrid,” in ACM SIGGRAPH 2003 Papers, New York, NY,

USA, 2003, pp. 917–924.

[45] N. Bell, “Efficient Sparse Matrix-Vector Multiplication on CUDA,” 2008.

[46] X. Yang, S. Parthasarathy, and P. Sadayappan, “Fast Sparse Matrix-vector Multiplication

on GPUs: Implications for Graph Mining,” Proc. VLDB Endow., vol. 4, no. 4, pp. 231–242,

Jan. 2011.

[47] T. J. R. Hughes, I. Levit, and J. Winget, “An element-by-element solution algorithm for

problems of structural and solid mechanics,” Computer Methods in Applied Mechanics and

Engineering, vol. 36, no. 2, pp. 241–254, Feb. 1983.

[48] A. Akbariyeh, “Large Scale Finite Element Analysis Using GPU Parallel Computing,”

hgpu.org, Aug. 2012.

[49] J. Michopoulos, J. C. Hermanson, A. P. Iliopoulos, S. G. Lambrakos, and T. Furukawa,

“Data-Driven Design Optimization for Composite Material Characterization,” J. Comput.

Inf. Sci. Eng, vol. 11, no. 2, 2011.

[50] A. Borisov, M. Dickinson, and S. Hastings, “A Congruence Problem for Polyhedra,” The

American Mathematical Monthly, vol. 117, no. 3, pp. 232–249, 2010.

[51] P. Yadav and K. Suresh, “Limited-Memory Deflated Conjugate Gradient for Solid

Mechanics,” presented at the IDETC/CIE 2014, Buffalo, NY, 2014.

[52] NVIDIA CUDA: Compute Unified Device Architecture, Programming Guide. Santa Clara.,

2008.

[53] “cuBLAS.” [Online]. Available: http://docs.nvidia.com/cuda/cublas/#axzz3BWEcDEag.

[Accessed: 26-Aug-2014].

[54] P. Yadav and K. Suresh, “Assembly-Free Large-Scale Modal Analysis on the Graphics-

Programmable Unit,” Journal of Computing and Information Science in Engineering, vol.

13, no. 1, p. 011003, 2013.

[55] X. Bian, P. Yadav, and K. Suresh, “Assembly-Free Buckling Analysis for Topology

Optimization,” DETC2015-46351, ASME-IDETC Conference, Boston, MA, Aug. 2015.

[56] R. G. Grimes, J. G. Lewis, and H. D. Simon, “A Shifted Block Lanczos Algorithm for

Solving Sparse Symmetric Generalized Eigenproblems,” p. 228, 1994.

[57] D. C. Sorensen, “Numerical methods for large eigenvalue problems,” pp. 519–584, 2002.

[58] P. Arbenz, U. L. Hetmaniuk, R. B. Lehoucq, and R. S. Tuminaro, “A Comparison of

Eigensolvers for Large-scale 3D Modal Analysis using AMG-Preconditioned Iterative

Methods,” pp. 204–236, 2005.

[59] G. H. Golub and Q. Ye, “An Inverse free Preconditioned Krylov Subspace method for

Symmetric Generalized Eigenvalue problems,” pp. 312–334, 2002.

[60] A. V. Knyazev, “Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block

Preconditioned Conjugate Gradient Method,” pp. 517–541, 2001.

[61] G. L. G. Sleijpen and H. A. Van der Vorst, “A Jacobi-Davidson Iteration Method for Linear

Eigenvalue Problems,” pp. 401–425, 1996.

[62] L. Bergamaschi, Á. Martínez, and G. Pini, “Parallel preconditioned conjugate gradient

optimization of the Rayleigh quotient for the solution of sparse eigenproblems,” p. 1964,

2006.

[63] Y. T. Feng and D. R. J. Owen, “Conjugate Gradient Methods for Solving the Smallest

Eigenpair of Large Symmetric Eigenvalue Problems,” pp. 2209–2230, 1996.

[64] H.-J. Jang, “Preconditioned Conjugate Gradient Method for Large Generalized

Eigenproblems,” Trends in Mathematics Information Center for Mathematical Sciences,

vol. 4, no. 2, pp. 103–109, 2001.

[65] J. Nocedal and S. J. Wright, Numerical Optimization. New York: Springer Science + Business

Media, 2006.

[66] H. Yang, “Conjugate Gradient Methods for the Rayleigh Quotient Minimization of

Generalized Eigenvalue Problems,” pp. 79–94, 1993.

[67] I. C. F. Ipsen, “Computing an Eigenvector with Inverse Iteration,” pp. 254–291, 1997.

[68] K. J. Bathe, Finite Element Procedures. Englewood Cliffs, NJ: Prentice-Hall, 1996.

[69] T. Belytschko, Nonlinear finite elements for continua and structures. John Wiley & Sons

Inc, 2000.

[70] M. P. Bendsøe, Topology optimization theory, methods and applications. Berlin

Heidelberg: Springer Verlag, 2003.

[71] T. Y. Chen, “Multiobjective optimal topology design of structures,” Computational

Mechanics, vol. 21, pp. 483–492, 1998.

[72] O. Sigmund, “A 99 line topology optimization code written in Matlab,” Structural and

Multidisciplinary Optimization, vol. 21, no. 2, pp. 120–127, 2001.

[73] S. Amstutz, “A new algorithm for topology optimization using a level-set method,” Journal

of Computational Physics, vol. 216, pp. 573–588, 2006.

[74] O. Sigmund, “Numerical instabilities in topology optimization: A survey on procedures

dealing with checkerboards, mesh-dependencies and local minima,” Structural and

Multidisciplinary Optimization, vol. 16, no. 1, pp. 68–75, 1998.

[75] S. Wang, E. de Sturler, and G. Paulino, “Large-scale topology optimization using

preconditioned Krylov subspace methods with recycling,” International Journal for

Numerical Methods in Engineering, vol. 69, no. 12, pp. 2441–2468, 2007.

[76] H. A. Eschenauer and N. Olhoff, “Topology optimization of continuum structures: A

review,” Applied Mechanics Review, vol. 54, no. 4, pp. 331–389, 2001.

[77] M. P. Bendsøe and N. Kikuchi, “Generating optimal topologies in structural design using a

homogenization method,” Computer Methods in Applied Mechanics and Engineering, vol.

71, pp. 197–224, 1988.

[78] G. I. N. Rozvany, “A critical review of established methods of structural topology

optimization,” Structural and Multidisciplinary Optimization, vol. 37, no. 3, pp. 217–237,

2009.

[79] Y. I. Kim and B. M. Kwak, “Design space optimization using a numerical design

continuation method,” International Journal for Numerical Methods in Engineering, vol.

53, pp. 1979–2002, 2002.

[80] J. F. Aguilar Madeira, “Multi-objective optimization of structures topology by genetic

algorithms,” Advances in Engineering Software, vol. 36, no. 1, pp. 21–28, 2005.

[81] T. Belytschko, “Topology Optimization with Implicit Functions and Regularization,”

International Journal for Numerical Methods in Engineering, vol. 57, no. 8, pp.

1177–1196, 2003.

[82] K. Suresh, “Efficient Generation of Large-Scale Pareto-Optimal Topologies,” Structural

and Multidisciplinary Optimization, vol. 47, no. 1, pp. 49–61, 2013.

[83] S. J. van den Boom, “Topology Optimisation Including Buckling Analysis,” Delft

University of Technology, Delft, 2014.

[84] K. Suresh and M. Takalloozadeh, “Stress-Constrained Topology Optimization: A

Topological Level-Set Approach,” Structural and Multidisciplinary Optimization, vol.

Submitted, 2012.

[85] A. M. Mirzendehdel and K. Suresh, “A Pareto-Optimal Approach to Multimaterial

Topology Optimization,” Journal of Mechanical Design, vol. 137, no. 10, p. 101701, 2015.