A Family of Assembly-free Physics-based Deflation Methods
for Large-scale Finite Element Analysis
By
Praveen Yadav
A dissertation submitted in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy
(Mechanical Engineering)
at the
UNIVERSITY OF WISCONSIN-MADISON
2015
Date of final oral examination: 12/16/2015
The dissertation is approved by the following members of the Final Oral Committee:
Krishnan Suresh, Professor, Mechanical Engineering
Matthew S. Allen, Associate Professor, Engineering Physics
Dan Negrut, Associate Professor, Mechanical Engineering
Heidi-Lynn Ploeg, Associate Professor, Mechanical Engineering
Xiaoping Qian, Associate Professor, Mechanical Engineering
Abstract
Finite element analysis (FEA) is the most popular numerical method today for solving solid
mechanics and other boundary value problems. FEA is used to solve a variety of problems,
including static, modal, buckling, and transient problems. While the formulation for each of these
problems may differ, they all necessitate solving a linearized system of equations.
Further, for large-scale problems, solving the linear system is often the computational bottleneck.
Iterative solvers have been accepted as the preferred methods for such large systems. The two main
challenges in iterative solvers are: 1) cost of Sparse Matrix-Vector multiplication (SpMV), and
2) number of iterations required for convergence.
Researchers have proposed several methods to address both of these challenges. In this
thesis, a physics-motivated assembly-free deflated conjugate gradient (deflated-CG) method is
presented. The physics-based deflation method presented here accelerates the convergence of
conjugate gradient (CG); this is then implemented through an efficient assembly-free SpMV
method that exploits mesh congruence.
Thus, the main contribution of this thesis is to develop a family of assembly-free physics-based
deflation techniques that can be applied to a variety of large-scale FEA problems in solid
mechanics.
The deflation methods discussed in this thesis exploit the expected physical behavior of a
structure. The proposed physics-based deflation allows one to use the specificity of known
reduction techniques, such as plate and beam theory, to accelerate iterative methods in a robust
and generalized 3D framework. Furthermore, the implementation of such physics-based deflation
allows one to reduce the memory-cost of assembling and storing global matrices.
The concept of ‘element-congruency’ is proposed to reduce the memory requirement for SpMV.
Specifically, congruent elements exhibit similar stiffness, and can therefore be represented by a
single element stiffness matrix. This has not been explored in the scientific community.
Element-congruency addresses the need for an efficient SpMV by limiting the memory required for
assembly-free SpMV. For large-scale problems, numerical results show that exploiting
element-congruency is very valuable for improving speed.
Numerical results demonstrate the efficiency and scalability of the proposed assembly-free
deflated-CG for a variety of applications in FEA. The advantage of the proposed method is also
illustrated through its application in topology optimization.
Acknowledgements
In memory of my mother, this thesis is dedicated to my loving parents, family and friends.
I am deeply grateful for the opportunity my advisor Prof Krishnan Suresh provided me with, by
allowing me to work in ERSL. Without his patience, guidance and constant support, this work
would not have been possible. He taught me the skills necessary to approach a research problem
systematically, and to be critical of any solutions that are applicable to those problems. His
expertise in finite elements and computational mechanics helped me grow as a researcher. I also
thank my committee members, Prof Matt Allen, Prof Dan Negrut, Prof Heidi-Lynn Ploeg and
Prof Xiaoping Qian for their valuable suggestions and comments over the course of this
program.
I extend many thanks to the ERSL members, old and new, for the enjoyable discussions related
to research and life in general. Thanks to all the friends at badminton and cricket for the
camaraderie outside the lab.
I thank my parents and my sister for their unquestionable belief in my ability, which I think they
over-estimate. I am especially thankful to my mother, the late Smt. Nisha Yadav, and my father, Shri
Parashu Ram Yadav, for providing me with every opportunity to succeed in life. Their
unwavering dedication to our family is something I can only hope to emulate. I would be remiss
if I didn’t thank my wife, Emily, and my in-laws. They motivate me to improve in every aspect of
life.
Last but not least, I also extend my thanks to Mr. Stephen Colbert of ‘The Late Show with
Stephen Colbert’.
Table of Contents
Abstract
Acknowledgements
List of Figures
List of Tables
1 Accelerating Large-scale Finite Element Analysis
1.1 Introduction to Finite Element Analysis
1.1.1 Finite Element Discretization
1.2 Linear Solvers
1.2.1 Direct
1.2.2 Iterative
1.3 Accelerating convergence
1.3.1 Preconditioning
1.3.2 Deflation
1.3.3 Physics-based deflation
1.4 Thesis overview
2 Physics-based Deflation
2.1 Convergence of conjugate gradient method for thin structure
2.2 Agglomeration
2.3 Planar-rigid-body
2.4 Rigid-body
2.5 Kirchhoff-Love plate
2.6 Euler-Bernoulli beam
2.7 Elastic Polynomial
2.8 Summary
3 Assembly-free Implementation
3.1 Congruence of elements
3.1.1 Geometric method
3.1.2 Stiffness method
3.2 FEA results with congruency
3.3 Special case of SpMV on Voxel-mesh
3.3.1 Element-connectivity based
3.3.2 Node-connectivity based
3.3.3 Single SpMV results
3.4 Assembly-free deflation
3.4.1 Prolongation
3.4.2 Restriction
3.4.3 Deflating stiffness
3.5 Summary
4 Assembly-free Finite Element Analysis
4.1 Assembly-free modal analysis
4.1.1 Rayleigh-Ritz conjugate gradient
4.1.2 Computing multiple modes
4.1.3 Subspace augmentation
4.1.4 Numerical results: modal analysis for Knuckle
4.1.5 Numerical results: modal analysis for Housing cover
4.2 Assembly-free static analysis
4.2.1 Numerical results: static analysis for Knuckle
4.2.2 Numerical results: static analysis of thin plate under pressure
4.2.3 Numerical results: static analysis of ‘Thomas’ engine
4.3 Assembly-free buckling analysis
4.3.1 Inverse iteration
4.3.2 Numerical results: buckling analysis of a square beam
4.3.3 Numerical results: buckling analysis of cylindrical column
4.3.4 Numerical results: buckling analysis of a rectangular column with a hole
4.4 Assembly-free large-deformation analysis
4.4.1 Total Lagrangian (TL) formulation
4.4.2 Numerical results: large-deformation analysis of beam
4.5 Summary
5 Application: topology optimization
5.1 Voxel-mesh in topology optimization
5.2 Buckling constrained optimization
5.2.1 Numerical results: optimizing a thin column
5.2.2 Numerical results: optimizing a thin plate
5.3 Impact of Assembly-free FEA in compliance optimization
5.4 Summary
6 Conclusion and future work
6.1 Conclusion
6.2 Future work
6.2.1 Effectiveness of Elastic-polynomials
6.2.2 Feature based deflation
6.2.3 Assembly-free non-linear FEA
6.2.4 Assembly-free SpMV for conforming mesh
References
List of Figures
Figure 1: Thin Plate example
Figure 2: Discretized domain
Figure 3: Displacement plot for thin plate
Figure 4: Modal plots for thin plate
Figure 5: Direct vs Iterative solve: as reported in [9]
Figure 6: Convergence of residual for thin plate
Figure 7: Structures for large-scale FEA [12], [13]
Figure 8: A two-level geometric multi-grid
Figure 9: a) Finite element mesh, b) agglomeration of nodes in 16 groups
Figure 10: Thesis overview
Figure 11: Thin Plate examples for convergence analysis
Figure 12: Convergence plot for thin plate
Figure 13: Convergence for Agglomeration
Figure 14: Agglomerated mode-shapes
Figure 15: Convergence for in-plane load: Planar rigid-body vs Agglomeration
Figure 16: Convergence for out-of-plane load: Planar-rigid-body
Figure 17: Rigid-body mode shapes
Figure 18: Convergence for out-of-plane load: Rigid body vs Agglomeration
Figure 19: Curvature effects in thin structures
Figure 20: Convergence: Kirchhoff-Love plate vs Rigid-body
Figure 21: Convergence for Euler-Bernoulli beam deflation
Figure 22: Convergence for out-of-plane loading: 1st order Elastic-polynomial vs Rigid-body
Figure 23: Element Congruency in Mesh
Figure 24: Distinct Element located around notch
Figure 25: Geometry and boundary conditions on L-bracket
Figure 26: Quad-mesh for L-bracket
Figure 27: Geometric congruence vs mesh size for various tolerances
Figure 28: Stiffness congruence vs mesh size for various tolerances
Figure 29: Reduced-stiffness congruence vs mesh size for various tolerances
Figure 30: Quad-mesh for L-bracket with 9600 elements
Figure 31: Stress and displacement error for 0.1% tolerance
Figure 32: Stress plot with maximum displacement (δ) and maximum stress (σ)
Figure 33: Knuckle with (a) Conforming Mesh, and (b) Non-conforming Mesh
Figure 34: Element connectivity based SpMV implementation
Figure 35: A beam geometry and its mesh
Figure 36: Assembly-free SpMV on the CPU with and without exploiting element-congruency [51]
Figure 37: GPU implementation of prolongation
Figure 38: GPU implementation for restriction
Figure 39: (a) Knuckle geometry and loading, and (b) Voxel mesh
Figure 40: Static (a) displacement, and (b) stress for knuckle
Figure 41: 100 and 1000 rigid-body groups
Figure 42: Convergence of DCG vs Jacobi-PCG
Figure 43: Loading on a thin plate
Figure 44: Convergence of DCG vs Jacobi-PCG for thin plate
Figure 45: CUDA Profile for Rigid-Body deflation
Figure 46: Structural problem over a Thomas engine
Figure 47: Deflection from a 50 million DOF system
Figure 48: (a) Knuckle geometry, (b) Conforming mesh, and (c) voxel-mesh
Figure 49: (a) Gear-housing: eigen-spectrum is desired, (b) Meshing can fail for complex structures
Figure 50: Brute-force voxelization of the structure
Figure 51: First Eigen-mode for Gear Housing
Figure 52: Buckling of a pinned-pinned beam
Figure 53: Predicted critical load using AFBA and SolidWorks [10]
Figure 54: Computing time vs #DOF for AFBA and SolidWorks [10]
Figure 55: Accuracy plot for Cylindrical Column: Buckling load vs #DOF
Figure 56: Computing time vs #DOF for cylindrical column
Figure 57: Rectangular column with a hole
Figure 58: Predicted critical load using AFBA and SolidWorks [10] for rectangular beam with hole
Figure 59: Computing time for rectangular beam with hole
Figure 60: Cantilever beam displacement for linear elastic vs large-deformation formulation
Figure 61: Mapping of current mesh through deformation gradient
Figure 62: Large-deformation analysis on beam
Figure 63: Displacement results for linear static FEA
Figure 64: Displacement results for large-deformation FEA
Figure 65: Convergence plot: CG w/o deflation vs Rigid-body
Figure 66: Convergence plot: Rigid-body vs 1st order Elastic-polynomial
Figure 67: Convergence plot: Rigid-body vs Euler-Bernoulli beam
Figure 68: Topology Optimization of 2D Cantilever
Figure 69: 3D Cantilever Beam
Figure 70: 3D Cantilever Beam Optimized
Figure 71: Geometry of 2D L-Bracket
Figure 72: (a) Conforming mesh for L-bracket, and (b) Optimized topology
Figure 73: (a) Grid mesh for L-bracket, and (b) Optimized topology
Figure 74: Optimized for minimum stress
Figure 75: Algorithm for buckling-constrained topology optimization
Figure 76: Thin column with compressive load
Figure 77: Stiff designs with different safety factors
Figure 78: Plate with compressive load
Figure 79: Optimized topologies for various safety factors
List of Tables
Table 1: Results for error in maximum displacement for different mesh sizes
Table 2: Results for error in maximum von Mises stress for different mesh sizes
Table 3: Assembly-Free SpMV Timing results
Table 4: Total iterations and time taken to solve knuckle with varying number of groups
Table 5: Time taken to solve the knuckle problem using SolidWorks [10] and proposed method
Table 6: Comparison of Rigid-body deflation vs Kirchhoff-Love deflation
Table 7: Time taken to solve thin-plate with proposed method vs SolidWorks [10]
Table 8: First 5 eigen-modes for Knuckle: SolidWorks [10] vs SaRCG [54]
Table 9: Time for computing first-5 frequency of Knuckle
Table 10: Results for Computing Fundamental Frequency of Gear Housing
Table 11: Minimizing volume for Stiff structure
Table 12: Optimizing plate with buckling constraints
Table 13: Optimization speed for compliance minimization
Table 14: Comparison of optimization speed across various platforms
1 ACCELERATING LARGE-SCALE FINITE ELEMENT ANALYSIS
1.1 Introduction to Finite Element Analysis
Finite element analysis (FEA) is the most popular numerical method today for solving solid
mechanics and other boundary value problems. FEA includes a family of analyses, such as
static, modal, buckling, and transient analysis. Each of these analyses is modelled on governing
principles such as equilibrium, conservation of energy, and conservation of momentum.
FEA approximates the governing equations, which are often partial differential equations (PDEs),
as a linearized system of algebraic equations [1].
For example, consider the static linear elasticity problem of a thin plate illustrated in Figure 1.
The domain, Ω, is the entire volume enclosed by the boundary ∂Ω. The edges of the plate have a
prescribed displacement, u, which is 0 for the given system. A traction, t, is applied on one of the
free surfaces as a pressure in the normal direction.
The strong form of the equilibrium equation is a PDE that solves for the stress tensor field over Ω
[1], [2]. A functional of total potential energy represents the weak form of the same equation as
[1], [2]:
Figure 1: Thin Plate example
$$\Pi = \frac{1}{2}\int_{\Omega} \varepsilon^{T}\sigma \, d\Omega - \int_{\partial\Omega} u^{T} t \, d\Gamma \tag{1}$$
FEA solves for the displacement field by minimizing the functional described in Equation (1). The resulting
linear system of equations obtained through the minimization process is described next.
1.1.1 Finite Element Discretization
The first step in FEA is breaking the domain into a finite element mesh [1], [2]. Figure 2 illustrates a typical
finite element mesh, described in Equation (2). The subscript e represents an arbitrary element.
$$\Omega = \bigcup_{e} \Omega_{e} \tag{2}$$
Displacement within an element is approximated through nodal displacements, $u_e$, by using appropriate
shape functions, N [2]. The next step is computing the element properties using those displacement
approximations. For the given example, the element stiffness and the nodal force for each element are required. The
detailed process of obtaining element properties using the shape functions can be found in [2], [3].
The third step is assembling the element properties to obtain the linearized system of equations for the
whole structure as:
$$K u = f \tag{3}$$
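The discretize, compute-element-properties, and assemble steps can be sketched on the simplest possible case. The example below is not the plate of Figure 1: it assumes a one-dimensional elastic bar, fixed at its left end and loaded by a point force at its right end, split into equal two-node elements with linear shape functions. The names `element_stiffness`, `assemble_bar`, `EA`, `length`, and `load` are illustrative assumptions, and the dense list-of-lists storage is chosen only for readability.

```python
def element_stiffness(EA, h):
    """Stiffness of a two-node bar element of length h (linear shape functions assumed)."""
    k = EA / h
    return [[k, -k], [-k, k]]


def assemble_bar(n_elements, EA=1.0, length=1.0, load=1.0):
    """Assemble K and f for a bar fixed at node 0 with a point load at the last node."""
    n_nodes = n_elements + 1
    h = length / n_elements
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    ke = element_stiffness(EA, h)  # identical for every element of this uniform mesh
    for e in range(n_elements):
        dofs = (e, e + 1)  # global node numbers of element e
        for a in range(2):
            for b in range(2):
                K[dofs[a]][dofs[b]] += ke[a][b]
    f = [0.0] * n_nodes
    f[-1] = load
    # Impose u = 0 at node 0 by dropping its row and column.
    K_free = [row[1:] for row in K[1:]]
    f_free = f[1:]
    return K_free, f_free


if __name__ == "__main__":
    K, f = assemble_bar(3, EA=1.0, length=3.0, load=1.0)
    for row in K:
        print(row)
```

Even in this toy case the assembled matrix is symmetric and tridiagonal, so most of its entries are zero; the same sparsity pattern, on a much larger scale, is what a three-dimensional mesh produces.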
Figure 2: Discretized domain
The coefficient matrix K, also known as the global stiffness matrix, is often sparse, symmetric, and
positive definite [2], [3].
The next step is solving the assembled system. Once solved, the solution vector, u, can be
post-processed to compute stresses [3]. The displacement field obtained from u is shown in Figure 3.
Similar to the static linear problem, the finite element formulation can also solve a modal problem
for the homogeneous solution of a spatial PDE in a dynamic system [2], [3]. The process of
discretizing and computing element properties remains the same.
In discretized form, the first few natural modes of vibration can be approximated by solving the
generalized eigenvalue problem [2], [3]:
$$K x = \lambda M x \tag{4}$$
Here, K and M represent the stiffness and mass matrices respectively, and λ and x represent the
eigenvalue and mode shape to be solved for [2], [3]. The first few mode shapes are illustrated in
Figure 4.
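As a small worked illustration of the generalized eigenvalue problem, consider a hypothetical two-degree-of-freedom system (not the plate of Figure 1) with an assumed stiffness matrix and a unit mass matrix:

$$K = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}, \qquad M = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

Non-trivial mode shapes exist only when $\det(K - \lambda M) = 0$:

$$(2 - \lambda)^{2} - 1 = 0 \quad \Rightarrow \quad \lambda_1 = 1, \qquad \lambda_2 = 3$$

Substituting each eigenvalue back into $(K - \lambda M)x = 0$ gives the mode shapes

$$x_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad x_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

In structural dynamics the eigenvalue is the square of the natural circular frequency, $\lambda = \omega^{2}$, so the smallest eigenvalue corresponds to the fundamental mode. For a large mesh the characteristic polynomial is never formed explicitly; only the first few eigenpairs are computed iteratively.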
Figure 3: Displacement plot for thin plate
Thus, FEA provides a valuable tool for analyzing several governing equations. The
solution method for each governing equation may differ, but they all rely, in some capacity,
on a linear solver as an important step in the algorithm [3].
The degree of accuracy in a finite element analysis is governed by discretization parameters
such as the number of elements, the types of elements, and the shape functions within the elements. For
high accuracy, a large number of elements is often required during discretization [1], [2]. This
brings us to the challenge of large-scale FEA, especially when the linear solve becomes the
bottleneck. Existing methods for solving linear systems in FEA are discussed in the next
section.
1.2 Linear Solvers
Linear solvers can be classified as direct or iterative; both classes are discussed in the
following sub-sections.
a) First mode b) Second mode
Figure 4: Modal plots for thin plate
1.2.1 Direct
Direct solvers [4] are commonly favored for linear systems of moderate size. They are robust and
well-understood, and rely on factoring the stiffness matrix (or coefficient matrix). For a
symmetric positive definite matrix, this is the Cholesky decomposition:
$$K = L L^{T} \tag{5}$$
This is followed by forward and backward triangular solves:
$$u = (L L^{T})^{-1} f \tag{6}$$
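The factor-then-solve sequence of a direct solver can be sketched in a few lines. This is a dense, textbook Cholesky routine written for illustration only; it assumes a small symmetric positive definite matrix stored as a list of lists, and the function names `cholesky` and `solve_cholesky` are hypothetical. Production sparse direct solvers additionally reorder the matrix to limit fill-in, which is not shown here.

```python
import math


def cholesky(K):
    """Return lower-triangular L with K = L L^T (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L


def solve_cholesky(K, f):
    """Solve K u = f by factoring once, then performing two triangular solves."""
    n = len(K)
    L = cholesky(K)
    # Forward substitution: L y = f
    y = [0.0] * n
    for i in range(n):
        y[i] = (f[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T u = y
    u = [0.0] * n
    for i in reversed(range(n)):
        u[i] = (y[i] - sum(L[k][i] * u[k] for k in range(i + 1, n))) / L[i][i]
    return u


if __name__ == "__main__":
    K = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
    print(solve_cholesky(K, [0.0, 0.0, 1.0]))
```

The number of operations is fixed by the matrix size, which is the sense in which a direct solver terminates in a known number of steps. The factor L must be stored in full, and for a sparse K it generally has more non-zero entries than K itself.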
The advantage of using direct solvers is that they terminate in a fixed number of steps. However,
due to the explicit factorization, direct solvers are memory intensive [5]. To quote the ANSYS
manual [6], “[sparse direct solver] is the most robust solver in ANSYS, but it is also compute- and
I/O-intensive”. For large-scale problems with one million degrees of freedom (DOF) [6]:
• Approximately 1 GB of memory is needed for assembly.
• However, 10 to 20 GB of additional memory is needed for factorization.
Since memory access is often the bottleneck, this translates to increased wall-clock time. A
popular alternative is to use iterative solvers.
1.2.2 Iterative
Iterative solvers do not require decomposition of the stiffness matrix [7]. They start with an initial
guess, $u_0$, that is used to compute the residual, $r_0$:
$$r_0 = f - K u_0 \tag{7}$$
The residual is then used to update the solution [8]. The process is repeated until the residual is
less than a specified tolerance.
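The residual-update loop described above can be made concrete with an unpreconditioned conjugate gradient iteration, the solver used later in this chapter. The sketch below is a textbook version, not the implementation developed in this thesis. It assumes a symmetric positive definite matrix stored densely as a list of lists, a zero initial guess, and a stopping test on the residual norm relative to the norm of f; the helper names `matvec`, `dot`, and `conjugate_gradient` are illustrative.

```python
def matvec(K, x):
    """Dense matrix-vector product K x."""
    return [sum(kij * xj for kij, xj in zip(row, x)) for row in K]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(K, f, tol=1e-8, max_iter=1000):
    """Solve K u = f for symmetric positive definite K; returns (u, iterations)."""
    u = [0.0] * len(f)                                  # initial guess u_0 = 0 (assumed)
    r = [fi - ki for fi, ki in zip(f, matvec(K, u))]    # r_0 = f - K u_0
    p = list(r)                                         # first search direction
    rr = dot(r, r)
    f_norm = dot(f, f) ** 0.5 or 1.0                    # avoid dividing by zero when f = 0
    for it in range(max_iter):
        if rr ** 0.5 <= tol * f_norm:                   # relative residual test (assumed)
            return u, it
        Kp = matvec(K, p)                               # the single matvec of this iteration
        alpha = rr / dot(p, Kp)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * kpi for ri, kpi in zip(r, Kp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return u, max_iter


if __name__ == "__main__":
    K = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
    u, iterations = conjugate_gradient(K, [0.0, 0.0, 1.0])
    print(u, iterations)
```

Each pass through the loop costs one matrix-vector product plus a handful of vector operations, and the matrix K is only ever used through `matvec`. No factorization is stored.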
The scalability of direct and iterative methods was compared, in terms of both the memory required and
the time taken to solve a linear system, by Mirzendehdal and Suresh in [9]. Their paper focuses on solving
a structural dynamics problem via FEA. For analysis, the algorithm presented in [9] requires a
linear solver for:
$$K_{\mathrm{eff}}\, u = f_{\mathrm{eff}} \tag{8}$$
In FEA, linear systems of the form expressed in Equation (8) are common [3]. The coefficient matrices are
often labeled as effective stiffness (commonly for dynamic analysis [3]) or tangent stiffness (for
non-linear FEA [3]).
The effective stiffness matrix presented in Equation (8) is a linear combination of K and M,
described in greater detail in [9]. A scaling analysis was performed to compare the direct and
iterative methods available in SolidWorks [10] for solving Equation (8). Figure 5 illustrates their
findings on the scalability of direct and iterative methods in terms of the time taken and the memory
required for the solution.
It is evident that a direct solver scales poorly for large-scale FEA when compared to an iterative
solver. Therefore, iterative solvers are the focus of this thesis.
The two main bottlenecks for iterative solvers are:
• Sparse matrix-vector multiplication (SpMV) for Ku, and
• Number of iterations required to converge to the solution.
a) Scalability analysis w.r.t solution time
b) Scalability analysis w.r.t memory required
Figure 5: Direct vs Iterative solve: as reported in [9]
The efficiency of SpMV is an important aspect of an iterative solver, as it is the most expensive
operation in any iterative algorithm. First, however, the convergence of iterative solvers is discussed.
For example, consider once again the thin plate elasticity problem described in Figure 1. When
discretized using 550,000 hexahedral elements, it results in a linear system with 2 million
unknown nodal displacements, also referred to as degrees of freedom (DOF). Using conjugate
gradient (CG) as the iterative solver, without any preconditioner, the solution converges in
approximately 6400 iterations to a specified tolerance of 10^-8. Figure 6 illustrates the
convergence plot.
One could argue that a thin plate does not require this many elements, or that a better
element technology could be used to exploit the nature of the problem [11]. However, there are far
more complex real-world problems that require a large number of elements
and suffer from similarly slow convergence. Figure 7 illustrates a few examples of
large-scale FEA.
Figure 6: Convergence of residual for thin plate.
Accelerating convergence, therefore, remains an important challenge to be addressed in the
scientific community. In the next section, some of the existing methods to address the issue of
convergence are discussed.
1.3 Accelerating convergence
Faster convergence is usually achieved through a sequence of operations that reduces the
condition number of the effective stiffness matrix [7], [14]. An overview of the types of operators
used to this end is presented in the following sub-sections.
1.3.1 Preconditioning
Preconditioning is the process of applying a transformation so that, instead of solving Equation (3),
we solve [15]:
$$A^{-1} K u = A^{-1} f \qquad (9)$$
where A is a characteristic preconditioner. The transformation can also be applied from the right side
of the coefficient matrix [7], [14], [15], such that:
$$K A^{-1} \hat{u} = f; \quad u = A^{-1} \hat{u} \qquad (10)$$
Figure 7: Structures for large-scale FEA [12], [13]
Preconditioning can also be split across both sides of the coefficient matrix [7], [14], [15] through
the following transformation:
$$A_1^{-1} K A_2^{-1} \hat{u} = A_1^{-1} f; \quad u = A_2^{-1} \hat{u} \qquad (11)$$
A practical preconditioner should be inexpensive to compute, and the preconditioned system
should converge rapidly.
There are several general-purpose preconditioners available. The Jacobi preconditioner, one of the
oldest, uses diagonal scaling [7]. It does not require assembly of the stiffness
matrix, and is therefore scalable and easily parallelizable. But it is not very effective for many ill-
conditioned problems in solid mechanics [5]; this is confirmed later through numerical
experiments. Methods such as Gauss-Seidel and Symmetric Successive Over-relaxation (SSOR)
perform better than Jacobi, but have similar limitations [16].
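To make diagonal scaling concrete, the sketch below applies the symmetric form of the Jacobi preconditioner, which corresponds to the split transformation with both factors taken as the square root of the diagonal, to a small badly scaled matrix and compares condition numbers. The matrix is made up for illustration and is not one of the stiffness matrices studied in this thesis.

```python
import numpy as np

def jacobi_scaled(K):
    """Return D^{-1/2} K D^{-1/2}, where D = diag(K)."""
    d_inv_sqrt = 1.0 / np.sqrt(np.diag(K))
    return K * np.outer(d_inv_sqrt, d_inv_sqrt)

# Hypothetical symmetric positive definite matrix with badly scaled rows.
K = np.array([[1000.0, 1.0,  0.0],
              [1.0,    10.0, 0.5],
              [0.0,    0.5,  0.1]])

K_scaled = jacobi_scaled(K)
print("condition number before scaling:", np.linalg.cond(K))
print("condition number after scaling: ", np.linalg.cond(K_scaled))
```

Only the diagonal of K is needed, which is why this preconditioner does not require an assembled matrix. When the poor conditioning is not caused by row scaling, as in the thin-structure problems considered here, the benefit is much smaller.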
Incomplete Cholesky (IC) factorization is perhaps the most robust and efficient preconditioner for
symmetric matrices [15], [17]. It relies on an approximate Cholesky factorization where, for
example, the lower-triangular matrix L from Equation (5) is forced to have the same sparsity
pattern as K. However, its memory requirement becomes an issue for large-scale systems.
Preconditioning methods accelerate convergence through spectral transformation, i.e., they shift
the eigen-values of the coefficient matrix closer together [14], [16], [18]. They do not affect the
size of the problem. Improving convergence through dimensional reduction is considered under
deflation-based methods [19]–[21] discussed next.
1.3.2 Deflation
Deflation relies on projection methods to reduce the size of the problem [21]. Solving the reduced
system eliminates from the residual the eigen-modes that span the projected space
[20], [21]. The process is described next.
Deflation starts with constructing a deflation-space or workspace, W, which is a rectangular
matrix of size n-by-m, where m is far smaller than n. The workspace then deflates the linear
system through the following projection operations [21]:
$$\bar{K} = W^T K W \qquad (12)$$
$$r_W^{(i)} = W^T r^{(i)} \qquad (13)$$
This results in a reduced m-by-m linear system in the projected deflation-space [21]:
$$\bar{K} u_W^{(i)} = r_W^{(i)} \qquad (14)$$
The solution to this reduced system is then projected back to the solution space. The residual is
then orthogonalized w.r.t. the projected vector through [21]:
$$r^{(i)} \leftarrow r^{(i)} - W u_W^{(i)} \qquad (15)$$
The superscript, i, indicates that the operation has to be performed every iteration. The projection
into the deflation-space, represented by Equation (13), is referred to as restriction operation,
while the reverse is called prolongation [20], [21].
At this point, it is important to note that deflation-space, W, is effective when it spans low-
frequency modes of K [19]–[21].
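The restriction, reduced solve and prolongation steps can be sketched in NumPy as follows. This is an illustrative example, not the thesis implementation: it assumes dense matrices, and it assumes that the prolonged coarse solution is added to the current solution and the residual is then recomputed, after which the residual has no component left in the deflation space.

```python
import numpy as np

def coarse_correction(K, W, u, f):
    """One restriction / reduced solve / prolongation step (illustrative sketch)."""
    K_bar = W.T @ K @ W                  # Equation (12): m-by-m deflated matrix
    r = f - K @ u                        # current residual
    r_w = W.T @ r                        # Equation (13): restriction
    u_w = np.linalg.solve(K_bar, r_w)    # Equation (14): reduced system, direct solve
    u_new = u + W @ u_w                  # prolongation back to the solution space
    r_new = f - K @ u_new                # recomputed residual (assumption, see lead-in)
    return u_new, r_new
```

For a symmetric positive definite K and a full-rank W, the returned residual satisfies W^T r = 0 up to round-off.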
Direct solvers can be used for Equation (14) since m << n. One can also use an iterative method to
solve Equation (14) and, further, use another level of deflation nested inside the reduced-system
solver. This nested deflation is commonly referred to as multi-grid [22]–[24].
Geometric multi-grid uses finite element approximations to construct deflation-space through a
coarse mesh [22], [23]. Using a coarse mesh also allows geometric multi-grid to construct
reduced system coefficient matrix, through finite element methods rather than projection
operation described in Equation (12). Figure 8 illustrates the basic concept behind a two-level
geometric multi-grid method.
In the algebraic multi-grid method [20], the restriction and prolongation operators are
constructed in an algebraic fashion, rather than through a geometric mesh transfer. The
properties and performance are similar to that of the geometric multi-grid.
While multi-grid methods perform particularly well for scalar problems and solid mechanics
(vectors) posed over ‘thick solids’, they are prone to Poisson locking and ill-conditioning for
problems posed over ‘thin solids’ and composite materials [25]. Improvements over the multi-
grid method for thin structures were proposed in [26], [27], where lower-dimensional models
were used instead of coarse meshes, thus avoiding locking issues. The effectiveness of the
method lay in exploiting the physical behavior of the structure.
Figure 8: A two-level geometric multi-grid.
This leads to physics-based deflation, discussed in the next sub-section.
1.3.3 Physics-based deflation
As mentioned earlier, an effective deflation-space is one that spans the low-frequency modes.
Since computing the eigen-modes is typically expensive, Bulgakov, et al [28] suggested a simple
agglomeration technique where finite element nodes are collected into small number of groups.
Then, to construct the W matrix, nodes within each group are collectively treated as a rigid body.
The motivation is that these agglomerated rigid body modes mimic the low-frequency eigen-
modes. The results were promising [28].
This opens up the possibility of constructing an entire family of deflation spaces that utilize
knowledge of the physical behavior of the system. This thesis highlights the
effectiveness and ease of implementation of such methods, and their application.
Figure 9: a) Finite element mesh, b) agglomeration of nodes in 16 groups
1.4 Thesis overview
In this thesis, a general-purpose iterative method called Assembly-Free Deflated-CG is proposed.
This method can solve a large variety of problems in solid mechanics that require large-scale
FEA. It is robust and easy to implement. It is also easy to integrate into existing applications with
very low memory cost added to the process. The layout of this thesis is shown in Figure 10.
Chapter 2 details a non-exhaustive list of physics-based deflation methods. Numerical experiments
illustrate the improvement in convergence due to such methods. A comparison among a few of the
implemented methods is presented.
Figure 10: Thesis overview
The ease of implementation of these deflation-spaces motivates us to find efficient ways of
implementing iterative solver in an assembly-free manner. In chapter 3, the focus shifts towards
assembly-free implementation of SpMV and the corresponding deflation operations.
The main contribution of this thesis lies in identifying several physics-based deflations and then
implementing them in an assembly-free FEA to solve large-scale problems with limited memory.
In chapter 4, we discuss algorithms that can exploit assembly-free deflated-CG for different types
of FEA. Numerical results presented in the chapter illustrate the performance of such
algorithms when compared to commercially available solvers in SolidWorks [10].
Application of assembly-free FEA for topology optimization is presented in chapter 5.
Concluding remarks and future work are discussed in the final chapter.
2 PHYSICS-BASED DEFLATION
In this chapter, an analysis of the physical aspects of some commonly used deflation spaces is
presented. The focus is primarily on the convergence behavior of different deflation methods for two
problems.
2.1 Convergence of conjugate gradient method for thin structure
The matrices considered in this thesis are symmetric positive definite. Conjugate gradient (CG)
is the preferred iterative method for these types of systems [8]. As is well known, CG’s convergence
can be poor if the stiffness matrix has a high condition number, or if the eigen-values of the
stiffness matrix are spread out [7], [16]. In solid mechanics, poor convergence of CG is fairly
common, for example in thin structures, composite materials, multi-scale problems, etc. [25]–
[27]. It is important to note that the physical nature of the boundary conditions plays a role in slow
convergence.
To illustrate this, consider an example of static FEA with two types of loading on a thin plate.
The dimensions of the plate are 100 mm × 100 mm × 2.5 mm. The load cases are in-plane and
out-of-plane, as illustrated in Figure 11. Both cases have a uniformly distributed load of 100 N,
applied on the same surface but in different directions.
The material used is alloy steel, with Young’s modulus, E = 2.1 × 10^11, and Poisson’s ratio, ν =
0.28. The plate is discretized using 600,000 hexahedral elements, resulting in 2 million degrees of
freedom (DOF). Equation (3) is solved for the displacement vector. The stiffness matrix is the same for
both cases. The plot in Figure 12 illustrates convergence for both types of loading.
a) In-plane load b) Out-of-plane load
Figure 11: Thin Plate examples for convergence analysis
Figure 12: Convergence plot for thin plate
It is evident from the plot that the physical direction of the load changes the convergence
behavior. Deflation techniques must therefore account for these physical effects of the
problem.
The deflated-CG algorithm to solve such a system is described below (derivation and theoretical
analysis of the algorithm can be found in [21]):
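Algorithm 1: Deflated CG (DCG); solve $K d = f$
1. Construct the deflated stiffness $\bar{K}$
2. Choose $d_0$ where $W^T r_0 = 0$ and $r_0 = f - K d_0$
3. Solve $\bar{K} u_W = W^T K r_0$ for $u_W$; $p_0 = r_0 - W u_W$
4. For $j = 1, 2, \ldots, m$, do:
5.    $\alpha_{j-1} = \dfrac{r_{j-1}^T r_{j-1}}{p_{j-1}^T K p_{j-1}}$
6.    $d_j = d_{j-1} + \alpha_{j-1} p_{j-1}$
7.    $r_j = r_{j-1} - \alpha_{j-1} K p_{j-1}$
8.    $\beta_{j-1} = \dfrac{r_j^T r_j}{r_{j-1}^T r_{j-1}}$
9.    Solve $\bar{K} u_W = W^T K r_j$ for $u_W$
10.   $p_j = \beta_{j-1} p_{j-1} + r_j - W u_W$
11. End-do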
Observe that when the deflation space W is far smaller than the size of the system:
SpMV in steps 5 and 9 is the primary computation.
Additional computations include the restriction operation $W^T x$ in step 9, the prolongation
$W u_W$ in step 10, and the solution of the reduced linear system in step 9.
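Algorithm 1 can be sketched in NumPy as shown below. This is a dense-matrix illustration under stated assumptions rather than the assembly-free implementation developed in this thesis: the function name and the relative-residual stopping rule are hypothetical, the initial guess in step 2 is taken as $d_0 = W \bar{K}^{-1} W^T f$ (one choice that satisfies $W^T r_0 = 0$), and the product $KW$ is precomputed so that $W^T K r$ can be evaluated as $(KW)^T r$ for symmetric K.

```python
import numpy as np

def deflated_cg(K, f, W, tol=1e-8, max_iter=1000):
    """Deflated CG following the step numbers of Algorithm 1 (dense sketch)."""
    KW = K @ W                                 # n-by-m, reused in steps 3 and 9
    K_bar = W.T @ KW                           # step 1: deflated stiffness
    d = W @ np.linalg.solve(K_bar, W.T @ f)    # step 2: d0 with W^T r0 = 0 (assumed choice)
    r = f - K @ d
    u_w = np.linalg.solve(K_bar, KW.T @ r)     # step 3: reduced solve
    p = r - W @ u_w
    rr = r @ r
    f_norm = np.linalg.norm(f)
    for it in range(max_iter):                 # step 4
        if np.sqrt(rr) <= tol * f_norm:
            return d, it
        Kp = K @ p                             # SpMV
        alpha = rr / (p @ Kp)                  # step 5
        d = d + alpha * p                      # step 6
        r = r - alpha * Kp                     # step 7
        rr_new = r @ r
        beta = rr_new / rr                     # step 8
        u_w = np.linalg.solve(K_bar, KW.T @ r) # step 9: restriction and reduced solve
        p = beta * p + r - W @ u_w             # step 10: prolongation and new direction
        rr = rr_new
    return d, max_iter
```

In this sketch the reduced system is solved with a dense direct solver in every iteration; for a small deflation space one would typically factorize the reduced matrix once and reuse the factors.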
The deflation space accelerates the convergence of Algorithm 1 through steps 9 and 10. Several deflation
methods and their physical effects on the system are discussed next.
2.2 Agglomeration
Agglomeration is the simplest deflation method used today. It was introduced by Nicolaides [19]
as an ad-hoc method to approximate the eigen-modes of the system. The method collects
a group of closely positioned nodes and treats the group as a point object with translational DOF.
The mapping between the translational DOF of a group, ug, and the nodal DOF, un, can be expressed as:
$$u_n = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} u_g \qquad (16)$$
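To further understand the mapping given in Equation (16), consider a series of nodes, 1 through n,
that lie in group g. The collective mapping of the group DOF of g to the nodal DOF of nodes 1 through n can be
expressed as:
$$\begin{Bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{Bmatrix} = \begin{bmatrix} W_{1g} \\ W_{2g} \\ \vdots \\ W_{ng} \end{bmatrix} u_g \qquad (17)$$
Here,
$$W_{1g} = W_{2g} = \cdots = W_{ng} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$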
With the mapping between any nodal and group DOF given in Equation (16), one can construct
an assembled deflation matrix, W, that projects an assembled deflation-space vector, uW, to the
solution-space vector, u:
$$u = W u_W \qquad (18)$$
The mapping operation in Equation (18) is referred to as prolongation and is repeated in every
iteration, each time the deflated system given in Equation (14) is solved.
One can use W to deflate the stiffness matrix as described in Equation (12). This is a one-time
operation and does not have to be repeated during the iterations. Restricting the residual to the deflation
space through Equation (13) also takes place in every iteration.
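A minimal sketch of how W could be built for translation-only agglomeration is given below. It assumes that DOF are numbered node by node (x, y, z) and that a list assigning each node to a group is already available; the function and argument names are hypothetical. A dense array is used only for clarity, since each row of W contains a single non-zero entry.

```python
import numpy as np

def agglomeration_matrix(group_of_node, n_groups, dof_per_node=3):
    """Assemble W for translation-only agglomeration (illustrative sketch)."""
    n_nodes = len(group_of_node)
    W = np.zeros((dof_per_node * n_nodes, dof_per_node * n_groups))
    identity = np.eye(dof_per_node)
    for n, g in enumerate(group_of_node):
        rows = slice(dof_per_node * n, dof_per_node * (n + 1))
        cols = slice(dof_per_node * g, dof_per_node * (g + 1))
        W[rows, cols] = identity    # W_ng = I: node n copies the translation of group g
    return W

# Four nodes split into two groups; prolongation copies each group's DOF to its nodes.
W = agglomeration_matrix([0, 0, 1, 1], n_groups=2)
u_W = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
u = W @ u_W
```

Because W only copies group values to nodes, both prolongation and restriction can be carried out from the node-to-group list alone, without storing W.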
Using agglomeration as a deflation space, the thin-plate example from Figure 11 is solved. The
convergence plot for different number of agglomerated groups is shown in Figure 13.
It is clear from the plots that agglomeration accelerates CG much more effectively for in-plane
load cases than for out-of-plane load cases. Physically, each group behaves like a block
that is only allowed to translate. For in-plane load cases, the collection of block
translations can approximate the mode-shape that dominates the residual. When out-of-plane
modes are excited, a large number of groups is needed to approximate the mode-shape. Figure 14
illustrates how a mode-shape can be approximated by agglomerated groups.
a) Convergence for in-plane loading
b) Convergence for out-of-plane loading
Figure 13: Convergence for Agglomeration
The limitation in approximating out-of-plane mode shapes can be remedied by allowing the
groups to rotate. This is precisely what the rigid-body method does.
2.3 Planar-rigid-body
To overcome the restriction imposed by simple translation, the groups can be treated as rigid
bodies. Rigid-body deflation was suggested by Bulgakov et al. [28]. It remains a popular
deflation method for a variety of problems.
First consider rigid-body motion within a plane. If restricted to move only within the plane,
any group’s translational and rotational DOF can be expressed as:
$$u_g = \begin{Bmatrix} u_0 \\ v_0 \\ \theta_0 \end{Bmatrix} \qquad (19)$$
For the group’s DOF, one can construct a mapping to nodal DOF of any node n, which lies
within the group through [28]:
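$$u_n = W_{ng} u_g, \quad \text{where} \quad W_{ng} = \begin{bmatrix} 1 & 0 & -\Delta y \\ 0 & 1 & \Delta x \\ 0 & 0 & 0 \end{bmatrix}; \quad \Delta x = x_g - x_n; \quad \Delta y = y_g - y_n \qquad (20)$$
Figure 14: Agglomerated mode-shapes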
The deflation space may appear rank deficient, but if the forces dictate planar motion, the deflation is
highly effective even for 3D FEA. The convergence plot in Figure 15 shows the efficiency of
planar-rigid-body deflation.
The rank deficiency does not become an issue for planar loads. This is primarily due to the modes
excited by the in-plane load. If planar-rigid-body deflation is applied to an out-of-plane load, the deflation space
will not correct the out-of-plane displacement, causing the residual to stagnate as in regular
CG, as shown in Figure 16.
Figure 15: Convergence for in-plane load: Planar rigid-body vs Agglomeration
The planar-rigid-body deflation has been exclusively used for 2D FEA in the past [5]. However,
it is evident that planar-rigid-body is equally effective for 3D FEA, if the physics of the problem
dictates planar displacements. For out-of-plane load, one can always use a more generalized
rigid-body deflation discussed next.
2.4 Rigid-body
The agglomerated rigid bodies for general 3D system will have 3 rotational DOF in addition to
the 3 translational DOF. The group DOF vector, ug can be expressed as:
$$u_g = \{u_0, v_0, w_0, \theta_x, \theta_y, \theta_z\}^T \qquad (21)$$
These 6 DOF can be mapped into nodal displacement for any node within the group through
[28]:
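$$u_n = W_{ng} u_g, \quad \text{where} \quad W_{ng} = \begin{bmatrix} 1 & 0 & 0 & 0 & \Delta z & -\Delta y \\ 0 & 1 & 0 & -\Delta z & 0 & \Delta x \\ 0 & 0 & 1 & \Delta y & -\Delta x & 0 \end{bmatrix};$$
$$\Delta x = x_g - x_n; \quad \Delta y = y_g - y_n; \quad \Delta z = z_g - z_n \qquad (22)$$
Figure 16: Convergence for out-of-plane load: Planar-rigid-body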
Figure 17 illustrates the mode shapes that can be approximated through a combination of rigid-
body groups.
The addition of rotational DOF allows agglomerated groups to better approximate both the in-plane
and out-of-plane mode shapes with fewer groups. This is evident from the convergence plot
presented in Figure 18.
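The 3-by-6 nodal block of Equation (22) can be sketched as follows. Two conventions are assumed here and should be checked against [28]: the offsets are taken as group reference point minus node, and the rotational block uses the standard small-rotation (skew-symmetric) sign pattern. The function name is hypothetical.

```python
import numpy as np

def rigid_body_map(x_node, x_group):
    """3-by-6 block W_ng: three translations followed by three rotations."""
    dx, dy, dz = np.asarray(x_group, float) - np.asarray(x_node, float)
    skew = np.array([[0.0,  dz, -dy],
                     [-dz, 0.0,  dx],
                     [ dy, -dx, 0.0]])   # maps (theta_x, theta_y, theta_z) to displacement
    return np.hstack([np.eye(3), skew])

# Nodal displacement produced by a group translation plus a small rotation.
W_ng = rigid_body_map(x_node=[1.0, 2.0, 3.0], x_group=[0.0, 0.0, 0.0])
u_g = np.array([0.5, 0.0, 0.0, 0.1, -0.2, 0.3])
u_n = W_ng @ u_g
```

Dropping the last three columns recovers the agglomeration mapping, and keeping only the in-plane translations and the rotation about the plane normal recovers the planar-rigid-body mapping.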
Figure 17: Rigid-body mode shapes
For the out-of-plane load, rigid-body deflation converges far faster than agglomeration
with the same number of groups.
This also raises the question of why the deflation space for thin structures should be restricted to a
rigid-body approximation. If significant bending is expected, one can expand the group
DOF to include curvature sensitivity. Exploiting Kirchhoff-Love plate theory to include such
curvature sensitivity becomes a clear choice.
2.5 Kirchhoff-Love plate
This was the first step towards expanding the deflation-space based on the physical nature of the
problem, and was published in [29].
The low-order eigen-modes of thin solids involve curvature effects. This is illustrated schematically
in Figure 19.
Figure 18: Convergence for out-of-plane load: Rigid body vs Agglomeration
To accommodate curvature, the rigid-body group DOF are extended with 2nd order derivative
terms as follows:
$$u_g = \{u_0, v_0, w_0, \theta_x, \theta_y, \theta_z, w_{,xx}, w_{,yy}, w_{,xy}\}^T \qquad (23)$$
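The 2nd order terms are mapped into nodal DOF using Kirchhoff-Love plate theory [30]. The
resulting expression for nodal displacements through group DOF is then expressed as [29]:
$$u_n = W_{ng} u_g, \quad \text{where}$$
$$W_{ng} = \begin{bmatrix} 1 & 0 & 0 & 0 & \Delta z & -\Delta y & -\Delta z \Delta x & 0 & -\Delta z \Delta y \\ 0 & 1 & 0 & -\Delta z & 0 & \Delta x & 0 & -\Delta z \Delta y & -\Delta z \Delta x \\ 0 & 0 & 1 & \Delta y & -\Delta x & 0 & \Delta x^2 / 2 & \Delta y^2 / 2 & \Delta x \Delta y \end{bmatrix};$$
$$\Delta x = x_g - x_n; \quad \Delta y = y_g - y_n; \quad \Delta z = z_g - z_n \qquad (24)$$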
The convergence of Kirchhoff-Love plate deflation is compared with rigid-body deflation in
Figure 20. The plot compares the convergence for an equal number of deflation-space DOF, and
it is evident that Kirchhoff-Love plate deflation is an improvement over rigid-body deflation.
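As an illustration of the curvature extension, the sketch below builds a 3-by-9 nodal block in which the three additional columns follow the Kirchhoff-Love kinematics u = -z w_,x and v = -z w_,y for a quadratic transverse displacement w. The sign pattern, the offset convention (group reference point minus node) and the function name are assumptions made for this example; the mapping actually used is the one given in [29].

```python
import numpy as np

def kirchhoff_love_map(x_node, x_group):
    """3-by-9 block: rigid-body columns plus w_xx, w_yy, w_xy columns (sketch)."""
    dx, dy, dz = np.asarray(x_group, float) - np.asarray(x_node, float)
    rigid = np.array([[1.0, 0.0, 0.0, 0.0,  dz, -dy],
                      [0.0, 1.0, 0.0, -dz, 0.0,  dx],
                      [0.0, 0.0, 1.0,  dy, -dx, 0.0]])
    # w = w_xx dx^2/2 + w_yy dy^2/2 + w_xy dx dy;  u = -dz * dw/dx;  v = -dz * dw/dy
    curvature = np.array([[-dz * dx,      0.0,          -dz * dy],
                          [0.0,           -dz * dy,     -dz * dx],
                          [0.5 * dx ** 2, 0.5 * dy ** 2, dx * dy]])
    return np.hstack([rigid, curvature])
```

Keeping only the first of the three curvature columns gives a beam-type mapping with a single curvature term.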
Figure 19: Curvature effects in thin structures
Kirchhoff-Love plate deflation can be compared to the dimensional reduction methods discussed in
[26], [27]. However, instead of using lower-dimensional FEA to construct the deflated stiffness
matrix, algebraic reduction using W is sufficient. Furthermore, no meshing is required for a
reduced model; aggregating the nodes is sufficient to relate each node to a deflated group.
2.6 Euler-Bernoulli beam
In the same research paper expanding on physics-based deflation [29], Euler-Bernoulli beam
theory is also used for deriving a deflation space.
Unlike a thin plate, the curvature of a beam varies only along one major axis. Euler-Bernoulli
beam theory can be used to extend deflation to beam-like problems [31]. For an Euler-Bernoulli
beam, the group variables are:
$$u_g = \{u_0, v_0, w_0, \theta_x, \theta_y, \theta_z, w_{,xx}\}^T \qquad (25)$$
Figure 20: Convergence: Kirchhoff-Love plate vs Rigid-body
It is important to note that this is a constrained version of Kirchhoff-Love plate deflation, in the same way that
planar-rigid-body is a constrained version of rigid-body deflation. The nodal mapping expression
therefore becomes [29]:
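$$u_n = W_{ng} u_g, \quad \text{where} \quad W_{ng} = \begin{bmatrix} 1 & 0 & 0 & 0 & \Delta z & -\Delta y & -\Delta z \Delta x \\ 0 & 1 & 0 & -\Delta z & 0 & \Delta x & 0 \\ 0 & 0 & 1 & \Delta y & -\Delta x & 0 & \Delta x^2 / 2 \end{bmatrix};$$
$$\Delta x = x_g - x_n; \quad \Delta y = y_g - y_n; \quad \Delta z = z_g - z_n \qquad (26)$$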
The expression assumes that the displacement due to bending is predominantly in w. However, if
bending is expected in an arbitrary direction, a simple transformation can be used to orient the
problem such that the bending conforms to the nodal mapping given in Equation (26).
Euler-Bernoulli beam deflation can be used for the out-of-plane loading described in Figure
11(b). Figure 21 illustrates the effectiveness of Euler-Bernoulli beam deflation even for a plate
problem.
Both Kirchhoff-Love plate and Euler-Bernoulli beam deflation stem from an understanding of the
physical behavior of rigid-body deflation [29]. They exploit expected displacement behavior
based on the applied boundary conditions. Trial functions that satisfy the physics of the problem
can also be used to construct deflation space. One such family of trial functions is polynomial
elastic fields [32].
2.7 Elastic Polynomial
The polynomial elastic fields for 3D problems were introduced by Wang et al [32]. The work
presented in [32] extends the general representation of nth order homogeneous polynomial Airy
stress function from 2D [33] to 3D elasticity problem. It is shown that for 3D elasticity problems,
the polynomial stress field is obtained via “a 3D harmonic vector, p(x,y,z)” used as a trial
function. The unknown coefficients (discussed later) in the trial function are determined by
satisfying the linear elasticity PDE over prescribed control points.
Figure 21: Convergence for Euler-Bernoulli beam deflation
However, “A one-one relation between a 3D harmonic vector, p(x,y,z) of nth order and
displacement field is shown as” [32]:
$$u = 4(1 - \nu)\, p - \nabla (x^T p) \qquad (27)$$
Here, x is a position vector.
For a 1st order polynomial, the harmonic vector can be expressed as:
$$p = \begin{Bmatrix} C_x^0 + C_x^1 x + C_x^2 y + C_x^3 z \\ C_y^0 + C_y^1 x + C_y^2 y + C_y^3 z \\ C_z^0 + C_z^1 x + C_z^2 y + C_z^3 z \end{Bmatrix} \qquad (28)$$
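The coefficients are all unknowns for the trial polynomial. Instead of solving for those
coefficients through Airy stress functions [32] over control points, one can treat those unknowns
as group DOF. These group DOF can be mapped into nodal DOF through:
$$u_n = W_{ng} u_g \qquad (29)$$
Here,
$$W_{ng} = \begin{bmatrix} T_1 & 0 & 0 & T_2 \Delta x & -\Delta y & -\Delta z & T_1 \Delta y & 0 & 0 & T_1 \Delta z & 0 & 0 \\ 0 & T_1 & 0 & 0 & T_1 \Delta x & 0 & -\Delta x & T_2 \Delta y & -\Delta z & 0 & T_1 \Delta z & 0 \\ 0 & 0 & T_1 & 0 & 0 & T_1 \Delta x & 0 & 0 & T_1 \Delta y & -\Delta x & -\Delta y & T_2 \Delta z \end{bmatrix}$$
$$T_1 = 3 - 4\nu; \quad T_2 = 2 - 4\nu;$$
$$\Delta x = x_g - x_n; \quad \Delta y = y_g - y_n; \quad \Delta z = z_g - z_n$$
$$u_g = \{C_x^0, C_y^0, C_z^0, C_x^1, C_y^1, C_z^1, C_x^2, C_y^2, C_z^2, C_x^3, C_y^3, C_z^3\}^T$$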
A careful observation shows that the 1st order elastic polynomial captures all the DOF of a rigid-body
group and adds the components of dilation to the group.
Rigid-body deflation is very efficient in capturing all 1st order displacement behavior because
dilation does not play a significant role in linear elasticity problems. This is further illustrated by the
convergence behavior of 1st order elastic-polynomial deflation when compared with rigid-body deflation.
For the linear elastic out-of-plane load, the convergence of 1st order elastic-polynomial deflation is
exactly the same as that of rigid-body deflation (shown in Figure 22).
While there is no significant advantage in 1st order elastic polynomial deflation, the
implementation opens up the possibility of using elastic polynomial field of nth order as a
deflation space. The number of DOF per group will increase rapidly with higher order
polynomials, but the convergence can be faster with fewer groups than that of rigid-body
deflation. However, further investigation is required to support this assertion.
2.8 Summary
Agglomeration [19] and rigid-body deflation [28] were implemented as general-purpose
deflation methods. To this day, rigid-body deflation is one of the superior methods for large-scale
problems [20]. The deflation process is algebraic and does not require information regarding
the physics of the problem.
Figure 22: Convergence for out-of-plane loading: 1st order Elastic-polynomial vs Rigid-body
For complex structural analysis, there also exists a vast body of simplification methods
based on dimensional reduction [27], [34]–[40] that utilize special element types suited to those
models. The list is non-exhaustive. 3D shapes are simplified to shells, plates, beams, etc. to
effectively reduce the complexity and improve convergence.
Then there are trial functions that exploit the closed-form solutions expressed by Airy stress functions
in 2D and 3D to create shape functions for FEA, as described in [32], [33].
The first main contribution of this thesis lies in exploiting the physics of the problem to generate
efficient deflation spaces for 3D FEA. The effectiveness of any of the methods listed lies in its
ability to capture the physical behavior of the system. Physics-based deflation allows one to combine
the specificity of various reduction techniques with the simplicity of general 3D FEA.
A handful of example deflation-spaces are introduced in this thesis, such as Kirchhoff-Love plate,
Euler-Bernoulli beam [29], and elastic polynomial (ongoing research). Deflation-spaces allow a
problem to be defined in 3D and exploit dimensional reduction to solve the system efficiently
through CG.
Most of the deflation-spaces discussed in this chapter rely only on nodal coordinates, which are
readily available. They are easy to implement and require very little additional storage,
as illustrated through the examples.
The second main contribution of this thesis is to exploit this fact in implementing a limited
memory deflated-CG for large-scale problems.
To achieve this, assembly-free implementation of SpMV and deflation operations is the focus of
discussion in the next chapter.
3 ASSEMBLY-FREE IMPLEMENTATION
There is a large volume of research for efficient implementation of sparse matrix-vector product
(SpMV). Research involving efficient storage and access of sparse matrices is presented in [41]–
[43]. There are also methods that exploit the computational capabilities provided by multi-core
architecture [44]–[46]. One such technique is assembly-free method.
The assembly-free method for iterative solvers was first proposed by Hughes et al. [47] in 1983. It was
an attempt to parallelize the solution process. Since then, there have been multiple attempts to
implement assembly-free methods due to growing interest in fine-grain parallelization [48]. The
idea is to never assemble K; instead, SpMV is performed at the element level. In other words, instead of
the usual “assemble and then multiply”:
$$K u = \left( \sum_{\text{assemble}} K_e \right) u \qquad (30)$$
the strategy is to “multiply and then assemble”:
$$K u = \sum_{\text{assemble}} \left( K_e u_e \right) \qquad (31)$$
However, assembly-free method is not particularly advantageous over assembled approach
unless: 1) the total memory consumption can be reduced, and 2) CG can be accelerated in
assembly-free way.
Accelerating CG was discussed in the previous chapter. Assembly-free implementation of those
methods (deflation-space) is discussed later in the chapter. But first, we discuss ways to reduce
memory access for SpMV.
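The difference between "assemble and then multiply" and "multiply and then assemble" can be sketched as follows. The data layout (a list of element matrices and a list of global DOF indices per element) and the function names are assumptions made for illustration; the dense assembly routine is included only so that the two products can be compared.

```python
import numpy as np

def assemble(element_matrices, element_dofs, n_dof):
    """Reference path: build the global K explicitly (dense, for comparison only)."""
    K = np.zeros((n_dof, n_dof))
    for Ke, dofs in zip(element_matrices, element_dofs):
        K[np.ix_(dofs, dofs)] += Ke
    return K

def assembly_free_matvec(element_matrices, element_dofs, u):
    """Compute K u element by element without ever forming K."""
    y = np.zeros_like(u, dtype=float)
    for Ke, dofs in zip(element_matrices, element_dofs):
        # Gather the element DOF, multiply at element level, scatter-add the result.
        # DOF indices within one element are unique, so in-place addition is safe.
        y[dofs] += Ke @ u[dofs]
    return y

# Three two-node bar elements in a row, each with the same element matrix.
Ke = np.array([[1.0, -1.0], [-1.0, 1.0]])
element_matrices = [Ke, Ke, Ke]
element_dofs = [[0, 1], [1, 2], [2, 3]]
u = np.array([0.0, 1.0, 3.0, 6.0])
y = assembly_free_matvec(element_matrices, element_dofs, u)
```

In this example all three elements share one element matrix, so a single copy would be enough; this is the observation that element congruence builds on.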
3.1 Congruence of elements
Storing and retrieving K is the primary memory access in CG. It can be reduced if element-
congruency can be exploited. Element-congruency is an aspect that is not explored in the
scientific community.
The proposed element-congruency can be defined as:
Definition: For FEA problems, two elements e1 and e2 are said to be congruent within a
specified tolerance ε if:
$$\frac{\lVert K_{e_2} - K_{e_1} \rVert}{\lVert K_{e_1} \rVert} \le \varepsilon \qquad (32)$$
where $K_e$ is the element stiffness.
Note that both elements must have the same number of DOF for the above criterion to apply. It is also
important to note that perfect congruency between any two elements in an unstructured mesh is
improbable due to numerical errors in computation. A tolerance is therefore specified to limit the
difference between normalized values of the elements being compared. In the case of isotropic
elements, element-congruency can also be decided by comparing geometry. A more formal
approach to comparing elements is provided in the following sub-sections.
Large-scale meshes have a significant number of congruent elements. For example, consider the
finite element mesh of a composite specimen [49] in Figure 23, consisting of 83000 elements;
the mesh was generated using ANSYS.
Through a simple congruency test [50], one can determine that the mesh contains only 322
distinct elements, i.e., less than 0.4% of the elements are geometrically and materially distinct within a specified
tolerance of ε = 10^-8. The distinct elements are near a notch feature, as shown in Figure 24.
Congruent elements have ‘identical’ element stiffness matrix. For assembly-free method, only
the distinct element stiffness matrices need to be computed and stored, since all the elements are
represented by the set of those distinct ‘template’ matrices.
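The idea of storing only distinct template matrices can be sketched as below. This is a brute-force illustration under stated assumptions: element stiffness matrices are already available, the Frobenius norm is used in the congruence test of Equation (32), and the function name and the linear search over templates are hypothetical choices, not the procedure of [50].

```python
import numpy as np

def build_templates(element_matrices, eps=1e-8):
    """Group elements by stiffness congruence and keep one template per group."""
    templates = []              # distinct element stiffness matrices
    template_of_element = []    # index into templates for every element
    for Ke in element_matrices:
        for t, Kt in enumerate(templates):
            same_size = Kt.shape == Ke.shape
            if same_size and np.linalg.norm(Ke - Kt) <= eps * np.linalg.norm(Kt):
                template_of_element.append(t)
                break
        else:                   # no congruent template found: store a new one
            templates.append(Ke)
            template_of_element.append(len(templates) - 1)
    return templates, template_of_element

# Four elements, of which three are congruent to the first within the tolerance.
Ka = np.array([[1.0, -1.0], [-1.0, 1.0]])
elements = [Ka, Ka * (1.0 + 1e-12), 2.0 * Ka, Ka]
templates, template_of_element = build_templates(elements)
```

An element-level product then looks up `templates[template_of_element[e]]` instead of a per-element matrix, so storage scales with the number of distinct elements rather than the number of elements.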
While there are several methods to check for geometric congruency in different types of shapes
[50], quadrilateral and hexahedral are the only two shapes considered for congruency in this
thesis.
Figure 23: Element Congruency in Mesh
Figure 24: Distinct Element located around notch
The difference between the elements can be quantified through one of the two methods as
discussed next.
3.1.1 Geometric method
In the geometric method, the first step is to compute local nodal coordinates. For each node, n, within
the element, e, they are defined about the centroid, C, of the element through:
$$\hat{X}_n = X_n - X_e^C \qquad (33)$$
The local nodal coordinates are stacked into a single vector, $\hat{X}_e$. The normalized difference of
this vector is used to compare the similarity between any two elements, say e1 and e2, through
the following expression:
$$\frac{\lVert \hat{X}_{e_2} - \hat{X}_{e_1} \rVert}{\lVert \hat{X}_{e_1} \rVert} \le \varepsilon \qquad (34)$$
This is a naïve implementation. The congruency check does not consider a scaling or rotation of
an element. However, with increase in mesh size, even this method yields a significant
percentage of congruent elements. For example, consider an L-bracket with a fillet. The top edge
is fixed with a tip load on the end of free hanging surface as shown in Figure 25.
The geometry is discretized using different mesh sizes. An example of discretized geometry with
600 quadrilateral elements is shown in Figure 26.
Figure 25: Geometry and boundary conditions on L-bracket
Figure 26: Quad-mesh for L-bracket
The elements within the mesh are checked for congruency using different tolerances. For various
acceptable tolerances, the congruence trend with increasing mesh size is illustrated in Figure 27.
The default method is to compare element stiffness as described in Equation (32).
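The geometric check can be sketched as follows. The example assumes that the centroid is the mean of the nodal coordinates, that both elements list their nodes in the same local order, and that the Euclidean norm is used; the function names are hypothetical. As noted above, this naive test treats scaled or rotated copies of an element as distinct.

```python
import numpy as np

def local_coordinates(node_coords):
    """Nodal coordinates relative to the element centroid, stacked into one vector."""
    X = np.asarray(node_coords, dtype=float)
    return (X - X.mean(axis=0)).ravel()

def geometrically_congruent(coords_e1, coords_e2, eps=1e-8):
    """Normalized difference of the stacked local coordinates against a tolerance."""
    x1 = local_coordinates(coords_e1)
    x2 = local_coordinates(coords_e2)
    if x1.shape != x2.shape:
        return False
    return bool(np.linalg.norm(x2 - x1) <= eps * np.linalg.norm(x1))

# A unit square, a translated copy, and a copy scaled by a factor of two.
square = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
shifted = [[x + 5.0, y - 2.0] for x, y in square]
scaled = [[2.0 * x, 2.0 * y] for x, y in square]
```

Here `geometrically_congruent(square, shifted)` holds because translation is removed by the centroid shift, while `geometrically_congruent(square, scaled)` does not.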
3.1.2 Stiffness method
The stiffness matrix, Ke, of an element captures the sensitivity of a spatial element w.r.t. the reference
element. Furthermore, the stiffness matrix accurately scales the effects of nodal displacements on the
structure. Congruency through the stiffness method is therefore expected to be more accurate.
The congruency check for the different L-bracket meshes is repeated with the criterion defined in
Equation (32). The congruency plots for various acceptable tolerances in Figure 28 illustrate that
the stiffness method yields a slightly higher percentage of congruency than the geometric method.
Figure 27: Geometric congruence vs mesh size for various tolerances
While effective, computing the stiffness matrix for all the elements is still very expensive. One
only requires an effective sensitivity function to compare; an expression to map spatial element
to reference element without neglecting material properties. A careful expansion of element
stiffness can yield such an expression. For example, consider a typical element stiffness
expression, numerically integrated over multiple gauss points [3]:
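$$K_e = \sum_{\#GP} \left( N_{,\xi}^T \, (J^{-1})^T \, T^T E \, T \, J^{-1} N_{,\xi} \, \lvert X_{,\xi} \rvert \, wt_{GP} \right) \qquad (35)$$
Here,
$N_{,\xi}$ is the shape function gradient w.r.t. reference coordinates $\xi$;
$J = \begin{bmatrix} X_{,\xi} & 0 & 0 \\ 0 & X_{,\xi} & 0 \\ 0 & 0 & X_{,\xi} \end{bmatrix}$;
$X_{,\xi}$ is the Jacobian;
$T$ is a mapping of displacement gradient field to strains;
$E$ is material matrix;
$wt_{GP}$ is the weight of the gauss point.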
Figure 28: Stiffness congruence vs mesh size for various tolerances
The shape function gradient in the above expression remains the same for all elements and
therefore can be omitted for reduced element stiffness computation. The reduced element
stiffness expression thus becomes:
$$\hat{K}_e = \sum_{\#GP} \left( (J^{-1})^T \, T^T E \, T \, J^{-1} \, \lvert X_{,\xi} \rvert \, wt_{GP} \right) \qquad (36)$$
Here, $\hat{K}_e$ is the reduced stiffness.
This reduced stiffness matrix does not require computing an entire element stiffness. It can be
used to compare elements, similar to the stiffness comparison in Equation (32), as:
$$\frac{\lVert \hat{K}_{e_2} - \hat{K}_{e_1} \rVert}{\lVert \hat{K}_{e_1} \rVert} \le \varepsilon \qquad (37)$$
The congruence results obtained through the above expression, plotted for increasing mesh sizes in
Figure 29, resemble the plot shown in Figure 28.
It is interesting to note that irrespective of method used to determine congruency, the mesh
exhibits greater congruency with higher number of elements. This supports the earlier statement,
that large-scale meshes have significant number of congruent elements.
FEA results through the lens of congruency are discussed next.
3.2 FEA results with congruency
The results in this section are presented in a relative scale. For example, a maximum
displacement result for a mesh with identified congruency is compared as % error for the same
result obtained when no congruency was exploited.
The example problem is of the L-bracket illustrated in Figure 25. The L-bracket is discretized
using quad-mesh for different mesh-sizes. An example of 9600 element mesh of the discretized
L-bracket is shown in Figure 30.
Figure 29: Reduced-stiffness congruence vs mesh size for various tolerances
Table 1 and Table 2 list the percentage error results for maximum displacement and von Mises
stress respectively. Error values greater than 1% are highlighted.
Table 1: Results for error in maximum displacement for different mesh sizes
Figure 30: Quad-mesh for L-bracket with 9600 elements
#Elements   Congruency tolerance   % error in max. displacement
            (as error %)           Geometric   Stiffness   Reduced-stiffness
2400        0.1                    0.12        0.05        0.04
            0.2                    0.12        0.05        0.02
            0.5                    0.38        0.21        0.3
            1                      1.04        0           0.3
4800        0.1                    0.16        0.07        0.07
            0.2                    0.36        0.16        0.2
            0.5                    0.73        0.24        0.31
            1                      2.18        0.76        1.01
9600        0.1                    0.36        0.32        0.33
            0.2                    0.02        0.38        0.17
            0.5                    0.82        1.09        1.12
            1                      1.68        0.24        0.24
Table 2: Results for error in maximum von Mises stress for different mesh sizes
While there is no discernible trend, it is quite clear that tolerance for congruency should be less
than 0.5% error in the norm, if errors in results are not to exceed 1%. For a tolerance of 0.1%, the
errors for stress and displacements are well within 1% as shown in Figure 31.
#Elements   Congruency tolerance   % error in max. von Mises stress
            (as error %)           Geometric   Stiffness   Reduced-stiffness
2400        0.1                    0.03        0.02        0.01
            0.2                    0.11        0.04        0.07
            0.5                    0.86        0.75        0.74
            1                      1.2         1.09        0.63
4800        0.1                    0.12        0.04        0.05
            0.2                    0.24        0.2         0.21
            0.5                    1.33        0.82        0.82
            1                      1           1.44        1.41
9600        0.1                    0.21        0.08        0.11
            0.2                    0.36        0.34        0.37
            0.5                    0.1         0.62        0.67
            1                      1.17        0.1         0.11
Figure 31: Stress and displacement error for 0.1% tolerance: (a) stress error plot, (b) displacement error plot
The variability in the error plots can be attributed to the greedy nature of the congruency algorithm
implemented, as it stops checking for congruency with other ‘templates’ as soon as the tolerance
is satisfied. Figure 32 illustrates a mesh of 9600 quad elements that yields a small variation in results
for the same amount (87%) of congruency. This variability in results indicates a need for refinement
in the implementation.
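The order dependence of such a greedy search can be illustrated with a minimal sketch. It assumes the relative-norm criterion of Equation (37) with a Frobenius norm; the function name `greedy_congruence`, the scaled identity matrices and the tolerance value are hypothetical and are not taken from the implementation described in this chapter.

```python
import numpy as np

def greedy_congruence(stiffness, tol):
    """Map each element to the first stored template within relative tolerance `tol`."""
    templates = []   # element indices whose matrices are stored as templates
    mapping = []     # mapping[e] = position in `templates` used by element e
    for e, K in enumerate(stiffness):
        for slot, t in enumerate(templates):
            Kt = stiffness[t]
            if np.linalg.norm(K - Kt) <= tol * np.linalg.norm(Kt):
                mapping.append(slot)      # greedy: accept the first match and stop
                break
        else:
            templates.append(e)           # no template matched: store a new one
            mapping.append(len(templates) - 1)
    return templates, mapping

# Three nearly identical 2x2 stand-in matrices, visited in two different orders.
order_a = [s * np.eye(2) for s in (1.000, 1.009, 1.018)]
order_b = [s * np.eye(2) for s in (1.009, 1.000, 1.018)]
templates_a, mapping_a = greedy_congruence(order_a, tol=0.01)
templates_b, mapping_b = greedy_congruence(order_b, tol=0.01)
print(len(templates_a), len(templates_b))
```

For the same three matrices and the same tolerance, the first ordering keeps two templates while the second keeps one, which is the kind of variability attributed above to the greedy search.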
Identifying congruent elements is not the only challenge. Once congruency is established, the
SpMV operation should be optimized to exploit the congruent elements. Simply storing fewer
elements is not sufficient if retrieval of the templates is not scheduled efficiently during SpMV
computations. This challenge concerns the memory-management aspect of the
problem.
Figure 32: Stress plot with maximum displacement (δ) and maximum stress (σ)
To understand the advantage of proper memory management in detail, consider a special case of
voxel-mesh.
3.3 Special case of SpMV on Voxel-mesh
Voxel-mesh is a structured grid which is not constrained to conform to the boundary of a given
geometry. An example of voxel mesh is illustrated in Figure 33.
Computing and storing only one element property dramatically reduces the memory footprint,
and therefore accelerates SpMV. Using one template element, an efficient implementation of
SpMV on multi-core architecture is discussed next.
Direct implementation of Equation (31) suggests that one assign a thread to each element and
update the result element-by-element. However, this will create a race condition when a nodal
index connected to multiple elements is simultaneously accessed. Therefore, a thread is assigned
to a node instead of an element.
Figure 33: Knuckle with (a) Conforming Mesh, and (b) Non-conforming Mesh
Once a thread is assigned to a node, SpMV is implemented in two ways. The result of Ku for
nodal DOF is computed either based on element connectivity or node connectivity. Both
implementations are discussed in the following sub-sections.
3.3.1 Element-connectivity based
In the element-connectivity based method, the first step is gathering the indexes of the neighboring
elements. For each element, the rows of stiffness values associated with the node are gathered.
There are 8 such sets of row values, depending upon the location of the node within the element.
These sets of row values are used for dot products with the nodal DOFs of the nodes within the element;
this is illustrated in Figure 34. This ensures that the product $K_e u_e$ is computed without race
conditions. The implementation was first presented in [51].
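The node-wise gather described above can be sketched serially as follows. This is only an illustration under simplifying assumptions: a small 2D grid of identical 4-node elements with one scalar DOF per node stands in for the 8-node voxel elements with three DOF per node, the template matrix `Ke` is hypothetical, and the outer loop over nodes plays the role of the threads. The result is checked against an explicitly assembled K.

```python
import numpy as np

def node_adjacency(conn, n_nodes):
    """For every node, list the (element, local position) pairs that touch it."""
    adj = [[] for _ in range(n_nodes)]
    for e, nodes in enumerate(conn):
        for a, n in enumerate(nodes):
            adj[n].append((e, a))
    return adj

def spmv_node_gather(Ke, conn, adj, u):
    """y = K u with one template Ke; each output entry is written by one 'thread'."""
    y = np.zeros_like(u)
    for n, touching in enumerate(adj):        # on a GPU: one thread per node
        acc = 0.0
        for e, a in touching:
            acc += Ke[a] @ u[conn[e]]         # row `a` of Ke times the element DOFs
        y[n] = acc                            # no other iteration writes y[n]
    return y

def spmv_assembled(Ke, conn, n_nodes, u):
    """Reference result from an explicitly assembled dense K (for checking only)."""
    K = np.zeros((n_nodes, n_nodes))
    for nodes in conn:
        K[np.ix_(nodes, nodes)] += Ke
    return K @ u

# 3 x 2 grid of identical 4-node elements, one scalar DOF per node (simplification).
nx, ny = 3, 2
conn = np.array([[j * (nx + 1) + i, j * (nx + 1) + i + 1,
                  (j + 1) * (nx + 1) + i + 1, (j + 1) * (nx + 1) + i]
                 for j in range(ny) for i in range(nx)])
n_nodes = (nx + 1) * (ny + 1)
Ke = 4.0 * np.eye(4) - np.ones((4, 4))        # hypothetical symmetric template
u = np.arange(n_nodes, dtype=float) ** 2
adj = node_adjacency(conn, n_nodes)
y_gather = spmv_node_gather(Ke, conn, adj, u)
y_ref = spmv_assembled(Ke, conn, n_nodes, u)
```

Because every entry of the result is accumulated locally and written once, no atomic update is needed; the price is that each element's DOFs are read once per node it touches.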
Figure 34: Element connectivity based SpMV implementation
3.3.2 Node-connectivity based
In the node-connectivity based method, all the neighboring node indexes are gathered. There are a
total of 255 unique node arrangements possible for any given node (each of the 8 voxels around a
node is either present or absent, and at least one must be present, giving 2^8 - 1 = 255). The nodal rows of K for these
255 possibilities are computed and stored as preprocessed information. The result for the nodal
DOFs is then computed through a dot product of the nodal row of K with the u vector of the neighboring
nodes.
The memory access for gathering nodal DOF is unfortunately not coalesced since the DOFs are
staggered based on element connectivity. However, once the result is computed the update in
device memory is coalesced.
3.3.3 Single SpMV results
A simple mesh shown in Figure 35 is used to illustrate the advantage of mesh congruency. The
mesh consists of all identical elements.
The experiment was conducted on a Windows 7 64-bit machine with the following specification:
- AMD Phenom™ II X4-955 processor running at 3.2GHz with 4GB of memory; OpenMP
  commands were used to parallelize CPU code.
- NVidia GeForce GTX 480 (448 cores) with 0.75GB of device memory; CUDA SDK 4.0
  [52] and CuBLAS library [53] were used for implementation.
Figure 35: A beam geometry and its mesh
The computations were performed in double precision.
Timing results for a single SpMV, i.e., a single Ku, with assembly-free implementations are
summarized in Figure 36. The overhead of computing the global K has been neglected. All the
element stiffnesses are stored to have an equivalent effect as assembling a global K. With
congruence exploited, the memory requests are much faster as all of the elements are mapped to
single element stiffness. Once the element is fetched, the data remains in cache for quick access.
As the number of elements increases, a speed-up of 10x can be achieved in SpMV.
Furthermore, Table 3 lists the timing results for a single Ku (SpMV) computed using GPU and
CPU implementation for both element-connectivity and node-connectivity based methods. The
SpMV was performed for a 1 million DOF linear system.
Figure 36: Assembly-free SpMV on the CPU with and without exploiting element-congruency[51]
Table 3: Assembly-Free SpMV Timing results
Assembly-Free Implementation       Time (ms)
Element-Connectivity based CPU     64
Node-Connectivity based CPU        33
Element-Connectivity based GPU     14
Node-Connectivity based GPU        3.5
The implementation, specific to voxel-mesh, illustrates the advantage of reducing the memory
footprint for SpMV. With congruence exploited in SpMV, it is time to revisit the deflation space.
3.4 Assembly-free deflation
The assembly-free implementation of the deflation operation on a Graphics Processing Unit
(GPU) is discussed in this section. This implementation was also presented in [51]. The GPU is used
as a representative example of commonly available multi-core systems. The general principle of the
implementation remains the same on other architectures; however, some details such as block, warp, etc. are
specific to GPU programming [52].
3.4.1 Prolongation
The prolongation operation presented in Equation (17) is straightforward on a GPU system.
Since the vector is projected from deflation-space to solution-space, each thread can be assigned
to a node without any concern for race conditions.
The group number is identified for each node and $W_{ng} u_g$ is computed and stored for the nodal DOF,
$u_n$. Memory access for prolongation is coalesced for the most part. The nodes can gather the
nodal coordinates (x,y,z) in a lock-step method. Gathering the corresponding group DOF has the
potential for latency due to sequential access. However, the length of the vector associated with a
group is small, therefore this is not a serious issue. The results update is fully coalesced. Figure
37 illustrates the process for a single thread in a multi-core system.
3.4.2 Restriction
The restriction operation $W_{ng}^T r_n$ is much more challenging to parallelize on the GPU due to
potential race conditions. Instead of assigning a thread to each node, a block of threads is
assigned to a group. Each thread in the block is assigned to a node which lies within the group.
Nodal projections are computed for each thread using the disassembled Equation (14), expressed as:
$$r_g = \underset{\text{nodes in } g}{\text{assemble}} \; W_{ng}^T r_n \qquad (38)$$
These nodal projections are saved in shared memory within the block; this is illustrated in Figure
38. Threads are synchronized after the shared memory update. A reduce operation is performed
on respective DOFs of the nodal projection to yield the resultant vector for the group. The
allowable number of threads within the block is thus restricted by the shared memory.
Figure 37: GPU implementation of prolongation.
The memory access for this part of the implementation is not coalesced either, as nodes that
belong to a group may skip a large sequence of indexes. As shown in Figure 38, the warp may
end up with coalesced memory access if a contiguous sequence of indexes is assigned for
restriction.
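The two transfer operations can be sketched serially as follows. The sketch assumes a rigid-body block of the form $W_{ng} = [\, I \;\; -[r]_\times \,]$, with $r$ the position of node n relative to a group center; this specific form, the helper names `w_block`, `prolong` and `restrict`, and the random test data are assumptions made for illustration and are not taken from Equations (14) and (17). The loop over nodes in `prolong` mirrors the one-thread-per-node assignment, and the per-group sum in `restrict` stands in for the shared-memory reduction.

```python
import numpy as np

def w_block(r):
    """Assumed 3x6 rigid-body block: u_n = t + theta x r for group DOF (t, theta)."""
    x, y, z = r
    skew = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])   # skew @ v = r x v
    return np.hstack([np.eye(3), -skew])

def prolong(coords, group, centers, ug):
    """u_n = W_ng u_g: each node writes only its own entry, so no race can occur."""
    un = np.zeros_like(coords)
    for n in range(len(coords)):                   # on a GPU: one thread per node
        g = group[n]
        un[n] = w_block(coords[n] - centers[g]) @ ug[g]
    return un

def restrict(coords, group, centers, rn):
    """r_g = sum over nodes in g of W_ng^T r_n, reduced group by group."""
    rg = np.zeros((len(centers), 6))
    for g in range(len(centers)):                  # on a GPU: one thread block per group
        members = np.flatnonzero(group == g)
        partial = np.zeros((members.size, 6))      # stands in for shared memory
        for k, n in enumerate(members):
            partial[k] = w_block(coords[n] - centers[g]).T @ rn[n]
        rg[g] = partial.sum(axis=0)                # reduction after "synchronization"
    return rg

rng = np.random.default_rng(0)
coords = rng.random((12, 3))
group = np.arange(12) % 3
centers = np.array([coords[group == g].mean(axis=0) for g in range(3)])
ug = rng.standard_normal((3, 6))
rn = rng.standard_normal((12, 3))
lhs = np.sum(prolong(coords, group, centers, ug) * rn)    # (W u_g) . r
rhs = np.sum(ug * restrict(coords, group, centers, rn))   # u_g . (W^T r)
```

The two inner products agree because restriction is the transpose of prolongation; this identity is a convenient sanity check for any parallel implementation of the pair.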
3.4.3 Deflating stiffness
Deflating K remains a challenge. At this point, it is important to note that storing the deflated K,
i.e., $W^T K W$ as described in Equation (12), has to take place only once every linear solve. Also, storing it is
less expensive because its size is far smaller than K. It is a dense matrix and therefore storing a
factorized lower triangular matrix is also beneficial, as it can be readily used for a direct solve in
deflated-space.
With the decision to store a factorized triangular matrix in view, one can assemble the deflated matrix through
assembly-free methods element-by-element using:
$$W^T K W = \underset{e}{\text{assemble}} \; W_e^T K_e W_e \qquad (39)$$
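Equation (39) can be illustrated with a small sketch. It assumes a 1D chain of identical two-node elements with a hypothetical positive-definite template `Ke` and a piecewise-constant deflation space (one column of `W` per group of three nodes); these choices are made only to keep the example short and differ from the rigid-body groups used in this work. The element-by-element result is compared against $W^T K W$ computed from an explicitly assembled K, and then factorized once for reuse.

```python
import numpy as np

def deflated_stiffness(Ke, conn, W):
    """Accumulate W_e^T K_e W_e over elements without forming the global K."""
    Kd = np.zeros((W.shape[1], W.shape[1]))
    for nodes in conn:
        We = W[nodes]                 # rows of W belonging to this element's nodes
        Kd += We.T @ Ke @ We
    return Kd

n_el = 8
n_nodes = n_el + 1
conn = np.array([[i, i + 1] for i in range(n_el)])
Ke = np.array([[2.0, -1.0], [-1.0, 2.0]])          # hypothetical SPD template
W = np.zeros((n_nodes, 3))
W[np.arange(n_nodes), np.arange(n_nodes) // 3] = 1.0   # three groups of three nodes

Kd = deflated_stiffness(Ke, conn, W)
L = np.linalg.cholesky(Kd)            # lower-triangular factor, stored once per solve

# Reference: assemble K explicitly (for checking only) and project it.
K = np.zeros((n_nodes, n_nodes))
for nodes in conn:
    K[np.ix_(nodes, nodes)] += Ke
Kd_ref = W.T @ K @ W
```

Only the small dense matrix and its factor are kept; the global K is formed here solely to verify the result.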
Figure 38: GPU implementation for restriction.
This concludes all the major aspects of assembly-free deflated-CG.
3.5 Summary
The main objective of this chapter was to outline an assembly-free implementation of all the
steps required for limited-memory deflated-CG. To limit memory requirement the concept of
‘element-congruency’ was defined.
Congruency is often considered a geometric term, and is hard to identify in an exact sense for an
unstructured conforming mesh. However, with some tolerance this can be an exceptional tool for
solving large-scale problems. The number of unique elements can also be reduced through
proposed reduced-stiffness congruency. It can produce a set of useable results in an efficient
manner without computing and storing individual stiffness for all the elements.
It is also important to recognize that simply identifying congruency is not
sufficient. The challenge of memory management and scheduling of operations has to be
considered. The implementation for a voxel-mesh is a special case highlighting the importance of
such techniques. A more general implementation of SpMV for all congruent elements in a
conforming mesh remains an open research problem.
The chapter also illustrates the implementation for assembly-free deflation. Physics-based
deflation and its use in assembly-free deflated-CG for large-scale FEA define the core
contribution of this thesis.
In the next chapter, algorithms that can utilize assembly-free SpMV and assembly-free deflated-
CG for different types of FEA are discussed. The performance of assembly-free methods is
illustrated through examples.
4 ASSEMBLY-FREE FINITE ELEMENT ANALYSIS
In this chapter, the application of assembly-free deflated-CG for large-scale FEA is discussed.
The formulations and results for assembly-free FEA are presented for published work [29], [54],
[55] and ongoing research.
The numerical examples are presented to emphasize the following observations:
1) The ease of implementation of assembly-free deflated-CG for a variety of large-scale
FEA problems.
2) The limited memory characteristics of assembly-free deflated-CG.
3) The speed of assembly-free deflated-CG compared to commercial solvers, such as those
supported by SolidWorks [10].
4) For static analysis, a profile of assembly-free deflated-CG on a GPU is also presented to
illustrate the adaptability of the method on multi-core architecture.
The results for assembly-free deflated-CG in large-scale static FEA [29] are discussed first.
4.1 Assembly-free static analysis
In this section, the focus is on solving a linear static problem, expressed in Equation (3).
Assembly-free deflated-CG is used as a linear solver as per Algorithm 1. The experiments were
conducted on a Windows 7 64-bit machine with the following specification (except when specified
otherwise):
- AMD Phenom™ II X4-955 processor running at 3.2GHz with 4GB of memory; OpenMP
  commands were used to parallelize CPU code.
- NVidia GeForce GTX 480 (448 cores) with 0.75GB of device memory; standard function
  calls within CUDA SDK 4.0 [52] and CuBLAS library [53] were used for GPU
  implementation.
The computations were performed in double precision. Tolerance for relative residual was set to
$10^{-8}$ for CG.
4.1.1 Numerical results: static analysis for Knuckle
Knuckle geometry is illustrated in Figure 39 (a). The knuckle is fixed at the two horizontal holes,
and a vertical force is applied on the third hole. Observe that the geometry is relatively ‘thick’,
i.e., there are no plate-like or beam-like features. A voxel mesh with 3.15 million DOF was
generated as shown in Figure 39 (b).
The Jacobi-preconditioned conjugate gradient (Jacobi-PCG) required 1741 iterations and 245
seconds on the CPU. The displacement and stress plots are illustrated in Figure 40.
Figure 39: (a) Knuckle geometry and loading, and (b) Voxel mesh
The same system was solved using different number of rigid-body groups for deflation space.
For example, Figure 41 illustrates collection of nodes into 100 and 1000 groups.
The results for varying number of groups are summarized in Table 4.
Figure 40: Static (a) displacement, and (b) stress for knuckle
Figure 41: 100 and 1000 rigid-body groups
Table 4: Total iterations and time taken to solve knuckle with varying number of groups
#Groups #Iteration CPU Time (s) GPU Time (s) Memory Needed (MB)
0 1741 245 36 174
200 145 54 29 213
400 114 48 28 224
600 95 48 31 252
800 73 46 32 263
The following observations are worth noting:
- Increasing the groups from zero (pure Jacobi-PCG) to 100 reduces the number of CG-iterations
  by a factor of 10, but the CPU time reduces only by a factor of 4. The
  underlying reason is that SpMV is performed twice every iteration in deflated-CG.
- Further, increasing the number of groups beyond a certain limit can lead to an increase in
  computation time. Finding an optimal number of groups is a topic of future research.
- As the number of iterations reduces, the speed-up gained through the GPU also reduces, as
  expected, since the bottlenecks are the SpMV required per iteration and the $W^T x$ operation,
  which is not amenable to fine-grain parallelism.
- Finally, the memory requirements are fairly small even for a 3.15 million DOF system.
The timing results were also compared with solution methods available in SolidWorks [10].
However, the problem size was limited to 1.1 million DOF due to the memory constraint of the
direct solver; for larger systems, its factorization failed for lack of system memory.
The knuckle problem was solved using the built-in sparse-direct solver and the preconditioned-iterative
solver in SolidWorks [10]. For assembly-free deflated-CG, 300 rigid-body groups were used for
the same number of DOF. The comparison of memory required, along with solution time, is
listed in Table 5.
Table 5: Time taken to solve the knuckle problem using SolidWorks [10] and proposed method.
Solution Method                                CPU Time        Memory Required
Sparse-direct in SolidWorks [10]               1 h 52 m 32 s   2.9 GB
Preconditioned-iterative in SolidWorks [10]    30 s            524 MB
Proposed assembly-free deflated-CG             13.5 s          77 MB
Observe that the direct solver performs poorly for large-scale FEA. The type of preconditioning
used by the preconditioned-iterative solver in SolidWorks [10] is not known; however, its memory requirement is
still high compared to the proposed assembly-free deflated-CG. The speed-up observed is a
consequence of the limited memory requirement of assembly-free deflated-CG.
The convergence plot in Figure 42 illustrates that the Jacobi-PCG converges slowly but steadily
towards the solution without stagnation; this is typical of solid mechanics problems posed over
‘thick’ solids. The rigid-body deflation leads to a dramatic drop in number of iterations as
mentioned earlier.
Thin solids however behave differently.
Figure 42: Convergence of DCG vs Jacobi-PCG
4.1.2 Numerical results: static analysis of thin plate under pressure
Consider the thin plate illustrated in Figure 43. The dimension of the plate is 100x100x2.5 (mm);
the four side faces are fixed, and a unit pressure is applied to the top face. The geometry is discretized
using a voxel mesh of about 550,000 elements with over 2 million DOF.
The solution obtained through Jacobi-PCG takes 6337 iterations in an average time of 72s. The
convergence plot in Figure 44 highlights the effectiveness of deflated-CG with rigid-body
deflation space in the case of thin structures. The presence of numerous low-order eigen-modes leads
to stagnation for Jacobi-PCG, whereas deflated-CG ensures that the low-order eigen-modes are
smoothed effectively.
Figure 43: Loading on a thin plate
Next, the above problem is solved using rigid-body deflation and Kirchhoff-Love deflation. The
results are listed in Table 6.
Table 6: Comparison of Rigid-body deflation vs Kirchhoff-Love deflation
           Rigid-body deflation                          Kirchhoff-Love deflation
#Groups    #Iteration  GPU time (s)  GPU memory (MB)     #Iteration  GPU time (s)  GPU memory (MB)
100        736         36            138                 256         21            144
200        382         23            146                 130         18            161
300        260         23            161                 96          21            192
400        199         22            178                 76          29            233
Observe that, although the Kirchhoff-Love deflation uses 33% more DOF per group, the net gain
is significant. In other words, for the same number of group-DOF, for thin structures, capturing
the curvature leads to faster convergence. It is a better alternative for large-scale problems
constrained by limited memory.
Figure 44: Convergence of DCG vs Jacobi-PCG for thin plate
The thin-plate problem was also solved using SolidWorks [10]. The mesh generation in
SolidWorks [10] was limited to a system of 1.2 million DOF, as discretization failed for a higher
number of elements. This is different from what was observed in the knuckle problem, where
factorization for the direct solver failed for a higher number of DOF. However, investigating the failure of
discretization is not important for this discussion. The thin-plate problem was solved using both the
sparse-direct and preconditioned-iterative methods available in SolidWorks [10]. 300 rigid-body
groups were used for assembly-free deflated-CG for the same number of DOF. The comparison of
solution time and memory required is shown in Table 7.
Table 7: Time taken to solve thin-plate with proposed method vs SolidWorks [10]
Solution Method CPU Time Memory Required
Sparse-direct in SolidWorks [10] 1 hour 11 min 5 sec 1.7 GB
Preconditioned-iterative in SolidWorks [10] 35 sec 547 MB
Proposed assembly-free deflated-CG 14.5 sec 82 MB
Again, it is important to observe that the limited memory requirement allows a significant speed-
up towards solution time. The implementation is easily portable to any multi-core architecture,
including GPU, due to this limited memory requirement.
Once the number of iterations is reduced, the time taken for SpMV reduces to about 50% of the total
solution time. The restriction operation, which reduces vectors to deflation-space, occupies about
20% of the time. The CUDA profile provides the percentage of time spent in the various functions
of the proposed assembly-free deflated-CG, as illustrated in Figure 45.
4.1.3 Numerical results: static analysis of ‘Thomas’ engine
For a more complicated large-scale FEA problem, consider the ‘Thomas’ engine in Figure 46
whose wheels are fixed and a load is applied as shown.
Figure 45: CUDA Profile for Rigid-Body deflation
Since the implementation relies on a robust voxelization scheme, the detailed features of the
model need not be suppressed. Here, the model was voxelized using 20 million elements,
resulting in a 50 million DOF system.
This experiment was performed on a typical high-performance desktop that used a GTX Titan
GPU card with 6 GB of device memory. The linear system was solved on this GPU using rigid-
body deflation with 900 groups in 24 minutes, requiring less than 3 GB of memory.
Next, the generalized eigen-value problem expressed in Equation (4) is revisited for modal and
buckling analysis. The published work [54] focuses primarily on exploiting assembly-free
SpMV for solving the generalized eigen-value problem for modal analysis posed in Equation (4).
Figure 46: Structural problem over a Thomas engine.
Figure 47: Deflection from a 50 million DOF system.
4.2 Assembly-free modal analysis
Most commercial methods today use the block form of the shift-and-invert Lanczos algorithm,
also known as the block-Lanczos method [18], [56]–[58]. The method inverts a matrix 𝐾 − 𝜎𝑀
repeatedly to isolate the desired eigen-pairs, with σ determined during iterations. This requires
explicit LU factorization of 𝐾 − 𝜎𝑀, which is not desirable in large-scale problems. Methods to
eliminate the need for an explicit decomposition were developed in [58], [59]. Inversion was
carried out in an approximate sense over a Krylov sub-space. Arbenz et al. [58] use algebraic
multi-grid as a pre-conditioner for factorizing the shifted matrix. They also compare the implementation
of alternate algorithms including ‘locally optimal block preconditioned CG’ [60],
‘Jacobi-Davidson’ [61], etc. It is shown that, for large-scale eigen-value problems, these alternate algorithms
can be competitive relative to block-Lanczos. Rayleigh-Ritz conjugate gradient is one such
algorithm [60], [62], and it is discussed next.
4.2.1 Rayleigh-Ritz conjugate gradient
Rayleigh-Ritz conjugate gradient (RCG) algorithm [63], [64] requires only an efficient SpMV.
Therefore, it exhibits numerous advantages including simplicity, low memory requirements, and
significant scope of parallelism. The key concept is computing Rayleigh quotient of an arbitrary
vector x, given by the equation:
$$\rho(x) = \frac{x^T K x}{x^T M x} \qquad (40)$$
If the vector x is an eigen-vector of (K, M), then the Rayleigh quotient is the corresponding
eigen-value. Thus, by minimizing the Rayleigh quotient, one can compute the lowest eigen-pair,
i.e., the eigen-value problem can be posed as a minimization problem:
$$\underset{x}{\text{Min}} \; \frac{x^T K x}{x^T M x} \qquad (41)$$
A nonlinear conjugate gradient [65] can be used to solve the minimization problem to find the
lowest eigen-mode. Gradient of the above equation can be computed through:
$$g(x) = \frac{2 \left( K x - \rho(x) M x \right)}{\|x\|_M^2} \qquad (42)$$
where
$$\|x\|_M = \sqrt{x^T M x} \qquad (43)$$
Using the classic CG algorithm [65], RCG algorithm becomes:
Algorithm 2: Rayleigh-Ritz Conjugate Gradient (RCG)
1. Initialize $x^{(1)} \neq 0$ such that $\|x^{(1)}\|_M = 1$
2. Set $p^{(0)} = 0$, $\beta^{(0)} = 1$ and $k = 1$
3. Compute the gradient $g^{(k)}$ via Equation (42)
4. Update $\beta^{(k)} = \dfrac{g^{(k)T} g^{(k)}}{g^{(k-1)T} g^{(k-1)}}$
5. The conjugate search direction is given by: $p^{(k)} = -g^{(k)} + \beta^{(k)} p^{(k-1)}$
6. Find the step length $\alpha^{(k)}$ as described in [66]
7. Let $y^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$ and $x^{(k+1)} = y^{(k+1)} / \|y^{(k+1)}\|_M$
8. Compute $\rho(x^{(k)})$ via Equation (40)
9. If $\|g^{(k)}\| < \epsilon$, terminate; else, increment $k$, go to step 3
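A compact dense-matrix sketch of Algorithm 2 is given below for illustration. Two points are assumptions rather than part of the algorithm as stated: the step length of step 6 is replaced by an exact minimization of the Rayleigh quotient over the two-dimensional span of the current iterate and search direction (the procedure of [66] is not reproduced here), and the squared gradient norm is initialized to one so that the first update is well defined. The function names, the tridiagonal test matrix and the diagonal mass matrix are hypothetical.

```python
import numpy as np

def smallest_pair(A, B):
    """Smallest eigen-pair of a small dense pencil (A, B) with B SPD."""
    Linv = np.linalg.inv(np.linalg.cholesky(B))
    w, Z = np.linalg.eigh(Linv @ A @ Linv.T)
    return w[0], Linv.T @ Z[:, 0]

def rcg(K, M, x0, tol=1e-8, max_iter=1000):
    x = x0 / np.sqrt(x0 @ (M @ x0))                # step 1: ||x||_M = 1
    p = np.zeros_like(x)                           # step 2
    gg_prev = 1.0                                  # assumed initial value
    for _ in range(max_iter):
        Kx, Mx = K @ x, M @ x
        rho = (x @ Kx) / (x @ Mx)                  # Equation (40)
        g = 2.0 * (Kx - rho * Mx) / (x @ Mx)       # step 3, Equation (42)
        gg = g @ g
        if np.sqrt(gg) < tol:                      # step 9
            break
        p = -g + (gg / gg_prev) * p                # steps 4 and 5
        q = p / np.sqrt(p @ (M @ p))               # rescaled p, for conditioning only
        S = np.column_stack([x, q])
        _, c = smallest_pair(S.T @ K @ S, S.T @ M @ S)   # step 6 (assumed line search)
        if c[0] < 0.0:
            c = -c                                 # keep the orientation of x
        y = c[0] * x + c[1] * q                    # step 7: parallel to x + alpha * p
        x = y / np.sqrt(y @ (M @ y))
        gg_prev = gg
    return (x @ (K @ x)) / (x @ (M @ x)), x        # step 8

n = 10
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # hypothetical stiffness
M = np.diag(1.0 + 0.1 * np.arange(n))                    # hypothetical mass
rho, x = rcg(K, M, np.arange(1.0, n + 1.0))
rho_ref, _ = smallest_pair(K, M)                          # dense reference value
```

Apart from the small 2x2 projected problem, each iteration needs only products of K and M with vectors, which is where an assembly-free SpMV would be substituted.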
SpMV being the most costly operation makes RCG a very simple algorithm to implement.
However, to compute higher modes, the process has to be repeated with some constraints.
4.2.2 Computing multiple modes
To compute higher modes, observe that if $(\lambda_1, x_1)$ and $(\lambda_2, x_2)$ are two distinct modes, and if
$\lambda_1 \neq \lambda_2$, then it is easy to show that they must satisfy M-orthogonality:
$$x_1^T M x_2 = 0 \qquad (44)$$
Further, if $\lambda_1 = \lambda_2$, one can always find a pair of eigen-modes that satisfy the above equation. Thus,
to find the second eigen-mode, we pose a constrained minimization problem:
$$\underset{x}{\text{Min}} \; \frac{x^T K x}{x^T M x} \quad \text{s.t.} \quad x^T M x_1 = 0 \qquad (45)$$
For an arbitrary vector x, the M-orthogonality is enforced through the following:
$$x \leftarrow x - (x^T M x_1) \, x_1 \qquad (46)$$
M-orthogonality is ensured during the iteration through: 1) initializing start vector as M-
orthogonal to 𝑥1, and 2) maintaining search directions 𝑝𝑘 M-orthogonal to 𝑥1.
Given (m-1) lower modes computed as:
$$X = \{x_1, x_2, \ldots, x_{m-1}\} \qquad (47)$$
To compute the next mode, one must solve the constrained minimization problem:
$$\underset{x}{\text{Min}} \; \frac{x^T K x}{x^T M x} \quad \text{s.t.} \quad x^T M X = 0 \qquad (48)$$
where M-orthogonality can be enforced via:
$$x \leftarrow x - \sum_{i=1}^{m-1} (x^T M x_i) \, x_i \qquad (49)$$
With this update, the RCG algorithm can be extended to compute the $m$-th
lowest eigen-pair [64].
Algorithm 3: RCG (multiple modes)
1. Suppose m-1 eigen-modes $X = \{x_1, x_2, \ldots, x_{m-1}\}$ have been computed.
2. Initialize $x^{(1)} \neq 0$ such that $\|x^{(1)}\|_M = 1$ and $x^{(1)T} M X = 0$
3. Set $p^{(0)} = 0$, $\beta^{(0)} = 1$ and $k = 1$
4. Compute the gradient $g^{(k)}$ via Equation (42)
5. Update $\beta^{(k)} = \dfrac{g^{(k)T} g^{(k)}}{g^{(k-1)T} g^{(k-1)}}$
6. Let $p^{(k)} = -g^{(k)} + \beta^{(k)} p^{(k-1)}$ (preliminary direction)
7. Construct an M-orthogonal direction $p^{(k)}$ via Equation (49)
8. Find the step length $\alpha^{(k)}$ as described in [66]
9. Let $y^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$ and $x^{(k+1)} = y^{(k+1)} / \|y^{(k+1)}\|_M$
10. Compute $\rho(x^{(k)})$ via Equation (40)
11. If $\|g^{(k)}\| < \epsilon$, terminate; else, increment $k$, go to step 4.
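The projection used in step 7 (Equation (49)) can be sketched in a few lines. The sketch assumes that the columns of `X` are M-orthonormal, as is the case for eigen-modes normalized according to step 9; here `X` is built from random vectors purely for illustration, and the names `m_orthogonalize`, `X`, `M` and `p` are hypothetical.

```python
import numpy as np

def m_orthogonalize(x, X, M):
    """Equation (49): subtract from x its M-projections on the columns of X."""
    return x - X @ (X.T @ (M @ x))

rng = np.random.default_rng(1)
n = 8
M = np.diag(1.0 + rng.random(n))            # hypothetical (lumped) mass matrix
A = rng.standard_normal((n, 3))
L = np.linalg.cholesky(A.T @ M @ A)
X = A @ np.linalg.inv(L).T                  # columns satisfy X^T M X = I
p = rng.standard_normal(n)                  # preliminary search direction
p_orth = m_orthogonalize(p, X, M)
```

The cost is one product with M plus two thin matrix-vector products, so the projection adds little to the SpMV cost of each iteration.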
The advantage of the above method as opposed to other block-oriented methods is limited
memory requirement, and simplicity.
4.2.3 Subspace augmentation
There are some limitations to the RCG algorithm for multiple modes. The most important of
them are:
1. Missing modes: As we sweep eigen-spectrum, one or more eigen-modes may go
undetected, especially for repeated eigen-values.
2. Erroneous results: A large value for tolerance in step 10 of RCG can lead to erroneous
results; for that reason a low tolerance, typically $\epsilon \sim 10^{-10}$, is essential.
3. Slow convergence: For low tolerances, thousands of iterations are required for each
mode, especially when the matrix is ill-conditioned.
All three problems are addressed by subspace projection methods. Such projection methods are
common in modern eigen-solvers [18], [58], but have not been considered for RCG.
The idea is to create a subspace through approximate eigen-vectors $x_i^{(k)}$ $(i = 1, 2, \ldots, m)$ computed
through early termination of RCG. The sub-space is defined as:
$$S^{(k)} = \{x_1^{(k)}, x_2^{(k)}, \ldots, x_m^{(k)}\} \qquad (50)$$
Reduced stiffness and mass matrices are constructed using the subspace through the following
transformation:
$$K^{(k)} = S^{(k)T} K S^{(k)}, \qquad M^{(k)} = S^{(k)T} M S^{(k)} \qquad (51)$$
Both of these matrices are constructed through a series of SpMV. The transformed matrices are
used to solve a smaller eigen-value problem exactly:
$$K^{(k)} V^{(k)} = M^{(k)} V^{(k)} \Lambda^{(k)} \qquad (52)$$
A sharpened set of eigen-vectors is recovered by transforming the vectors back to the original
space:
$$X^{(k+1)} = S^{(k)} V^{(k)} \qquad (53)$$
Using this set of vectors as starting vectors, one can restart the RCG method. This algorithm can
be termed subspace augmented Rayleigh-Ritz conjugate gradient (SaRCG).
Algorithm 4: Subspace Augmented Rayleigh-Ritz Conjugate Gradient (SaRCG)
1. Initialize $X^{(1)} = \{x_1^{(1)}, x_2^{(1)}, \ldots, x_m^{(1)}\}$ (typically random vectors). Set $k = 1$
2. Compute $x_i^{(k)}$ via RCG, with the columns of $X^{(k)}$ as the starting vectors, until $\|g^{(k)}\| < \epsilon$ ($\epsilon \sim 0.1$), for $i = 1, 2, \ldots, m$.
3. Construct the reduced matrices via Equation (51), where $S^{(k)}$ is defined in Equation (50),
and solve Equation (52) to find $\Lambda^{(k)}$ and $V^{(k)}$.
4. If convergence in $\Lambda^{(k)}$ has not been achieved, construct the updated eigen-vectors $X^{(k+1)}$
via Equation (53), increment $k$ and go to step 2.
The extension improves the robustness and convergence of the algorithm. The additional
computational cost of constructing and solving the reduced problem is negligible. SpMV remains
computationally the most expensive part of the algorithm.
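Step 3 of Algorithm 4 (Equations (51) to (53)) can be sketched as follows. The reduced generalized problem is solved here through a Cholesky factor of the reduced mass matrix followed by a symmetric eigen-solve; this is one possible choice and is not stated in the text. The test matrices and the mixing matrix `mix` are hypothetical, and the subspace is deliberately built from exact modes blended together so that the expected outcome is known.

```python
import numpy as np

def rayleigh_ritz(K, M, S):
    """Project (K, M) on the columns of S, solve the reduced problem, map back."""
    Kr = S.T @ (K @ S)                      # Equation (51)
    Mr = S.T @ (M @ S)
    Linv = np.linalg.inv(np.linalg.cholesky(Mr))
    lam, Z = np.linalg.eigh(Linv @ Kr @ Linv.T)
    V = Linv.T @ Z                          # Equation (52): Kr V = Mr V diag(lam)
    return lam, S @ V                       # Equation (53)

n = 8
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # hypothetical stiffness
M = np.diag(1.0 + 0.1 * np.arange(n))                    # hypothetical mass
lam_all, X_all = rayleigh_ritz(K, M, np.eye(n))          # full problem as reference
mix = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
S = X_all[:, :3] @ mix                                   # blended lowest three modes
lam, X_new = rayleigh_ritz(K, M, S)
```

When the subspace contains the lowest modes, the projection separates them again and returns an M-orthonormal set, which is why the step helps with missed or mixed modes.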
4.2.4 Numerical results: modal analysis for Knuckle
In this section, the accuracy of SaRCG method is compared with results from a commercial
package. The first five eigen-modes for the ‘knuckle’ problem are considered. The problem is
illustrated in Figure 48(a), where two horizontal holes are clamped. Also illustrated in Figure
48(b) and Figure 48(c) are the conforming and non-conforming meshes used.
The first five eigen-modes are computed through SolidWorks [10] application using the
conforming tetrahedral mesh from Figure 48(b). Total computational time to obtain the first five
mode shapes was approximately 220s for 1 million DOF system. The corresponding five eigen-
modes computed via SaRCG [54] using a voxel-mesh are illustrated in Table 8 along with mode
shapes obtained from SolidWorks [10]. The eigen-values are within 1% accuracy, and the
computational time for SaRCG [54] was approximately 500 seconds for 1 million DOF system,
on the CPU.
An important observation made here is that eigen-values are relatively accurate despite using a
non-conforming mesh. While computing global properties, such as mode shapes, one can get
away with the use of a non-conforming mesh. Without compromising accuracy, the method
becomes very attractive with its robustness.
Figure 48: (a) Knuckle geometry, (b) Conforming mesh, and (c) voxel-mesh
Table 8: First 5 eigen-modes for Knuckle: Solidworks [10] vs SaRCG [54]
Frequency and mode shape   SolidWorks [10]   SaRCG [54]
1st                        2457.5 Hz         2437.6 Hz
2nd                        3795.8 Hz         3806.4 Hz
3rd                        5262.1 Hz         5279.1 Hz
4th                        9746.6 Hz         9744.2 Hz
5th                        9957.6 Hz         10031 Hz
The results presented above do not illustrate the advantage of assembly-free SpMV thoroughly
on a CPU. At the time of publication [54], the focus was strictly on exploiting assembly-free
SpMV for large-scale modal analysis on a GPU. There was no effort made towards using the
assembly-free deflated-CG which was later published in [29].
However, once assembly-free deflated-CG was available, inverse iteration [67] (discussed later
in assembly-free buckling analysis) was determined to be a more effective method to solve
Equation (4). Inverse iteration allows one to compute eigen-modes by repeatedly solving a linear
system of equation of the form expressed in Equation (3), which can exploit assembly-free
deflated-CG. The advantage of using inverse iteration (Algorithm 5) [67] is discussed later for
assembly-free buckling analysis in the next section.
The experiment for computing first five eigen-modes for knuckle was repeated using Algorithm
5. The timing results for 1 million DOF system using SolidWorks [10], SaRCG (Algorithm 4),
and inverse iteration (Algorithm 5)[67] are listed in
Table 9. The linear system within the inverse iteration [67] was solved using 300 rigid-body
groups via assembly-free deflated-CG.
Table 9: Time for computing first-5 frequency of Knuckle
Solution Method                                       1st Frequency   Discretization Time (s)   Solution Time (s)
SolidWorks [10]                                       2457.5 Hz       52                        220
SaRCG [54]                                            2437.6 Hz       21                        500
Inverse Iteration w/ assembly-free deflated-CG [55]   2437.6 Hz       21                        102
Table 9 illustrates the advantage of using assembly-free deflated-CG for large-scale modal
analysis. The next experiment highlights the robustness of using voxel-mesh.
4.2.5 Numerical results: modal analysis for Housing cover
The primary advantages of the proposed method are its robustness, simplicity and ability to
handle geometrically complex structures. For example, consider the gear housing illustrated in
Figure 49(a). The meshing for this structure failed (in SolidWorks [10]) as illustrated in Figure
49(b).
On the other hand, we can easily voxelize the geometry as shown in Figure 50. Brute force
voxelization of the structure produces a high density mesh. This is handled well with the
proposed SaRCG method.
Figure 49: (a) Gear-housing: eigen-spectrum is desired, (b) Meshing can fail for complex structures
Figure 50: Brute-force voxelization of the structure
The results are summarized below in Table 10. Figure 51 illustrates the mode shape computed
for the first eigen-mode of gear housing using SaRCG method [54].
Table 10: Results for Computing Fundamental Frequency of Gear Housing
DOF 1st-Frequency Voxelize Time (secs) Solution Time (secs)
150,000 70 Hz 33 9
300,000 76 Hz 52 18
425,000 74.2 Hz 81 35
2,000,000 74.6 Hz 220 191
There are more results available in [54] providing a broader perspective on the importance of
assembly-free SpMV for modal analysis on GPU.
The generalized eigen-value problem for buckling analysis is discussed next. The formulation
and results discussed are from published work for large-scale assembly-free buckling analysis
[55].
Figure 51: First Eigen-mode for Gear Housing
4.3 Assembly-free buckling analysis
Buckling is the sudden failure of a structural member carrying a compressive load. For example,
Figure 52 illustrates the classic buckling of a pinned-pinned beam. Structural elements, such as
those found in high-rise buildings are typically subjected to compressive loads, and must be
analyzed and designed to prevent such buckling failures.
Finite element analysis of linear buckling is typically carried out in two stages. In the first stage,
the structural member is subject to a unit load. The domain is discretized using a finite element
mesh, and the corresponding static linear-elasticity problem is posed and solved; this amounts to
solving a linear system expressed in Equation (3).
In the second stage, the linear displacement field u is post-processed to obtain the stress tensor
within each of the finite elements [3]:
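\[
\sigma_{elem} =
\begin{bmatrix}
\sigma_{xx} & \sigma_{xy} & \sigma_{xz} \\
\sigma_{xy} & \sigma_{yy} & \sigma_{yz} \\
\sigma_{xz} & \sigma_{yz} & \sigma_{zz}
\end{bmatrix}_{elem}
\tag{54}
\]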
Figure 52: Buckling of a pinned-pinned beam
Then the stress tensor is used to define an element-level stress stiffness matrix [3]:
\[
K_{\sigma}^{elem} = \int_{\Omega_{elem}} G^{T} \sigma_{elem} \, G \, d\Omega
\tag{55}
\]
where G is shape function gradient matrix described in [3]. This is then assembled to construct
the global stress stiffness matrix [3]:
\[
K_{\sigma} = \operatorname{assemble}\left( K_{\sigma}^{elem} \right)
\tag{56}
\]
Finally, the following generalized eigen-value problem is posed and solved [3]:
\[
(K + \lambda K_{\sigma}) \, w = 0
\tag{57}
\]
While there are multiple pairs of solutions to the above problem, only the lowest few are
typically important. In particular, the lowest eigen-value of Equation (57) determines the
buckling safety factor [3], i.e., the load at which buckling will occur (assuming a unit load has
been applied initially). The vector w in Equation (57) represents the associated buckling mode.
Observe that this equation is similar to the generalized eigen-value problem associated with
modal analysis as described by Equation (4). However, two key differences between Equation
(4) and Equation (57) are 1) the mass matrix M is positive definite, but the stress stiffness matrix
Kσ need not be positive definite, and 2) mass matrix depends only on the material and the
underlying mesh, while the stress stiffness matrix depends on the element stresses as well.
The second difference has implications in assembly-free analysis. It is important to note that
Algorithm 4 relies only on efficient SpMV for solving eigen-value problems for modal analysis.
This includes both Kx and Mx exploiting congruency in voxel-mesh. While SaRCG is efficient
for large-scale modal analysis, the effects of element stresses in Kσ make it difficult to exploit
congruency for SpMV in Kσw. Furthermore, storing every element stress stiffness matrix Kσ will
create a large memory footprint. This was observed to significantly slow down the computation.
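As a rough illustration of this footprint, the sketch below estimates the storage needed if one dense stress stiffness matrix were kept per element. The assumptions are not from the thesis: 8-node hexahedral voxels with 3 DOF per node (a 24 x 24 matrix), double precision, no use of symmetry, and a hypothetical element count.

```python
# Back-of-the-envelope estimate of the memory needed to store one dense
# stress stiffness matrix per element.
# Assumptions (illustrative only): 8-node hexahedra, 3 DOF per node,
# double precision, symmetry not exploited.
DOF_PER_ELEMENT = 8 * 3
BYTES_PER_DOUBLE = 8


def stress_stiffness_bytes(num_elements: int) -> int:
    """Bytes required to keep a dense 24 x 24 matrix for every element."""
    return num_elements * DOF_PER_ELEMENT**2 * BYTES_PER_DOUBLE


if __name__ == "__main__":
    n = 1_000_000  # hypothetical element count
    print(f"{stress_stiffness_bytes(n) / 1e9:.2f} GB for {n:,} elements")
```

By contrast, a congruent voxel mesh needs only one template element stiffness matrix for K, which is why the stress-dependent matrix is the harder case.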
4.3.1 Inverse iteration
This motivates the use of another method, known as inverse iteration [67]. The basic principle
is to carry out:
\[
y = -K^{-1} K_{\sigma} w
\tag{58}
\]
and recycle the solution. The number of Kσw operations is considerably reduced, and the
computational burden falls on solving an equivalent static problem [29]. The algorithm to solve
Equation (57) through inverse iteration is given below.
Algorithm 5: Inverse iteration for buckling
1. Initialize $w^{(1)} \neq 0$ such that $\|w^{(1)}\| = 1$
2. Set $i = 1$
3. Compute $z^{(i)} = -K_{\sigma} w^{(i)}$
4. Solve $K y^{(i+1)} = z^{(i)}$ for $y^{(i+1)}$
5. Update $w^{(i+1)} = y^{(i+1)} / \|y^{(i+1)}\|$
6. Compute $g^{(i)} = K w^{(i+1)} + \lambda^{(i+1)} K_{\sigma} w^{(i+1)}$
7. If $\|g^{(i)}\| < \epsilon$, terminate; else, increment $i$, and go to step 3
For a mode shape w, the Rayleigh-quotient in step 6 is expressed through:
\[
\lambda = -\frac{w^{T} K w}{w^{T} K_{\sigma} w}
\tag{59}
\]
The number of iterations required to converge to the mode shape is far smaller than RCG as the
numerical error is primarily eliminated in the linear solution in step 4. The linear system in step 4
is solved through assembly-free deflated-CG in Algorithm 1.
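The steps of Algorithm 5 can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation: it assumes the sign convention (K + λKσ)w = 0, uses small dense matrices, and lets `numpy.linalg.solve` stand in for the assembly-free deflated-CG solve of step 4. The function name, the random starting vector, and the relative stopping test are choices made for this example.

```python
import numpy as np


def inverse_iteration(K, K_sigma, tol=1e-10, max_iter=500, solve=np.linalg.solve):
    """Smallest-magnitude eigenpair of (K + lam * K_sigma) w = 0.

    `solve(K, z)` is a stand-in for the linear solver of step 4
    (assembly-free deflated-CG in the thesis; a dense solve here).
    """
    rng = np.random.default_rng(0)
    w = rng.standard_normal(K.shape[0])
    w /= np.linalg.norm(w)                          # step 1: unit start vector
    lam = 0.0
    for _ in range(max_iter):
        z = -K_sigma @ w                            # step 3
        y = solve(K, z)                             # step 4: K y = z
        w = y / np.linalg.norm(y)                   # step 5
        lam = -(w @ K @ w) / (w @ K_sigma @ w)      # Rayleigh quotient
        g = K @ w + lam * (K_sigma @ w)             # step 6: residual
        if np.linalg.norm(g) < tol * np.linalg.norm(K @ w):  # step 7 (relative)
            break
    return lam, w
```

Note that the only operation involving K is the linear solve, so any solver for the static problem can be plugged in through the `solve` argument.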
The numerical results illustrate the advantage of using inverse iteration with assembly-free
deflated-CG for buckling analysis. The material for all buckling analysis examples is steel with
Young's modulus E = 2.1 × 10^11 Pa and Poisson's ratio ν = 0.33. The results are compared
against those obtained through SolidWorks [10].
4.3.2 Numerical results: buckling analysis of a square beam
The first example is that of a beam of 1 meter in length, and 10 mm by 10 mm cross-section. The
beam is fixed at one end, and a compressive unit load is applied at the other. The classic fixed-
free Euler-beam analysis yields a critical load of [31]:
\[
P_{cr} = \frac{\pi^{2} E I}{(2L)^{2}} = 431.8
\tag{60}
\]
The results obtained through the proposed Assembly-Free Buckling Analysis (AFBA) and those
obtained from SolidWorks [10] using the same number of degrees of freedom (DOF) are
illustrated in Figure 53. Both methods converge to a critical load of 430.03. Note that 3-D FEA
results are not expected to converge to the exact Euler-buckling result in Equation (60);
however, both methods should yield similar results.
The real advantage of AFBA is in speed. Figure 54 illustrates the computing time for AFBA
versus SolidWorks [10]. The quadratic growth in computation time in SolidWorks [10] can be
attributed to the quadratic growth in memory requirements with increasing degrees of freedom.
Figure 53: Predicted critical load using AFBA and SolidWorks [10]
4.3.3 Numerical results: buckling analysis of cylindrical column
To illustrate the potential deficiency of AFBA with voxel-mesh, consider an example of a
cylindrical column of 1 m in length, and a diameter of 10 mm. The classic fixed-free Euler-beam
analysis yields a critical load of [31]:
\[
P_{cr} = \frac{\pi^{2} E I}{(2L)^{2}} = 254.3
\tag{61}
\]
The predicted buckling loads are illustrated in Figure 55. The two results differ by 2.5%. The
difference can be attributed to the voxelization in current implementation of AFBA. The non-
conformity of mesh affects the accuracy of predicted stress field.
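The two Euler reference loads quoted in Equations (60) and (61) can be checked with a short calculation. The sketch below assumes SI units and the standard second moments of area for a solid square (a^4/12) and a solid circle (πd^4/64); the function and variable names are illustrative.

```python
import math


def euler_fixed_free_load(E: float, I: float, L: float) -> float:
    """Critical load P_cr = pi^2 * E * I / (2 L)^2 of a fixed-free column."""
    return math.pi**2 * E * I / (2.0 * L) ** 2


E = 2.1e11      # Young's modulus of steel, Pa
L = 1.0         # column length, m
a = d = 0.010   # side length / diameter, m

I_square = a**4 / 12.0             # second moment of area, square section
I_circle = math.pi * d**4 / 64.0   # second moment of area, circular section

P_square = euler_fixed_free_load(E, I_square, L)
P_circle = euler_fixed_free_load(E, I_circle, L)
print(f"square: {P_square:.1f}, circular: {P_circle:.1f}")
# prints: square: 431.8, circular: 254.3
```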
Figure 54: Computing time vs #DOF for AFBA and SolidWorks [10]
The time taken to solve the problem follows a similar trend as illustrated in Figure 56. Thus, if
one can tolerate small errors, the voxelized AFBA method can be significantly faster.
Figure 55: Accuracy plot for Cylindrical Column: Buckling load vs #DOF
Figure 56: Computing time vs #DOF for cylindrical column
It is important to note that assembly-free analysis is not restricted to voxel-mesh, but the
experiments simply highlight the speed at which useable results can be produced.
4.3.4 Numerical results: buckling analysis of a rectangular column with a hole
For the last example, consider the structure shown in Figure 57. The dimensions of the column
are 5x30x100 (mm), and the hole is of diameter 10 mm.
The results for the load factor computed using different mesh sizes are plotted in Figure 58. Here
we observe a 0.3% error in the solution. The computation time is plotted in Figure 59.
Figure 57: Rectangular column with a hole
The next section details an ongoing research problem for assembly-free large-deformation which
has not been published. The formulation that exploits assembly-free deflated-CG is selected for
large-deformation, and preliminary convergence analysis is presented.
Figure 58: Predicted critical load using AFBA and SolidWorks[10] for rectangular beam with hole
Figure 59: Computing time for rectangular beam with hole
4.4 Assembly-free large-deformation analysis
Large-displacement problems are common in the real world. Examples include, but are not
limited to, analyzing slender members in compliant mechanisms, crash analysis of shell-like
structures, soft tissue analysis for bio-mechanical systems, etc. Even for large-displacement
problems, the formulations require solution to an effective linearized system expressed as:
\[
K_{eff} \, \Delta u = f_{eff}
\tag{62}
\]
where
$K_{eff}$ is the effective or tangent stiffness,
$f_{eff}$ is the effective or unbalanced force, and
$\Delta u$ is the solution for the incremental displacement.
Equation (62) can be solved via assembly-free deflated-CG. In this section, the formulation that
exploits the proposed assembly-free methods is discussed.
When displacements are large, their effect on stiffness properties can no longer be ignored [68],
[69]. For such cases, large deformation analysis is required. Figure 60 illustrates an example of
cantilever beam solved as linear elastic and large deformation problem.
It can be observed that large-deformation formulation has lower maximum displacement
predicted for the structure. This is due to non-negligible non-linear terms in strain tensor that
have a stiffening effect on the structure [68], [69].
Since the non-linear terms in strain tensor depend on displacement field, the solution method
involves breaking down the external force in multiple intermediate steps [68], [69]. For each
force step, the displacement field that satisfies the equilibrium condition is determined through
Newton iteration [68], [69]. The equilibrium condition is expressed as:
a) Linear elastic formulation b) Large-displacement formulation
Figure 60: Cantilever beam displacement for linear elastic vs large-deformation formulation
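\[
\int_{\Omega} \sigma^{t+\Delta t} : e^{t+\Delta t} \, d\Omega = W_{ext}^{t+\Delta t}
\tag{63}
\]
where
$t+\Delta t$ is the step index,
$\sigma$ is the stress tensor,
$e$ is the strain tensor, and
$W_{ext}$ is the work done by the external load.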
The term on the left-hand side represents the internal strain energy, which must equal the
external work done at any intermediate step, t + Δt.
There are several formulations that pose the equilibrium condition as a discretized system. A
thorough discussion for derivation of these formulations is available in [2], [3], [68], [69]. One
such formulation is ‘total Lagrangian’.
4.4.1 Total Lagrangian (TL) formulation
The TL formulation integrates the internal strain energy in Equation (63) over the initial
undeformed domain [68], [69]. In simple terms, the nodal coordinates are not updated for
intermediate equilibrium conditions [68], [69]. The configuration of the mesh remains fixed, so
the element congruency of the voxels remains unchanged. Therefore, the TL formulation was used
for large-deformation analysis in this thesis.
In discretized form, the linearized system of equations solved during Newton iteration is
expressed as [68]:
\[
{}^{t}_{0}K^{(m-1)} \, \Delta u^{(m)} = {}^{t}f_{ext} - {}^{t}_{0}f_{in}^{(m-1)}
\tag{64}
\]
where
$t$ is the step index,
${}^{t}f_{ext}$ is the external force for the current step,
${}^{t}_{0}f_{in}^{(m-1)}$ is the internal force for displacement $u^{(m-1)}$,
${}^{t}_{0}K^{(m-1)}$ is the tangent stiffness for displacement $u^{(m-1)}$, and
$\Delta u^{(m)}$ is the incremental displacement solved in the current iteration.
Internal force vector and tangent stiffness matrix depend on the displacement field determined
from previous iteration [68]. This makes assembly-free implementation desirable for large-
deformation analysis because stiffness matrix can be updated for any given step as and when
required. The algorithm for large-deformation elasticity problem is described below.
Algorithm 6: Newton method for large-deformation elasticity
1. Initialize $u = 0$; $f_{ext}^{0} = 0$; $n = 1$
2. Compute $\Delta f_{ext}$ based on the total number of time steps $N$
3. For incremental time step $n = 1, 2, \ldots, N$
   i) Initialize ${}^{n}_{0}f_{in}^{(0)} = f_{ext}^{n-1}$; $\Delta u^{(0)} = 0$
   ii) Update $f_{ext}^{n} = f_{ext}^{n-1} + \Delta f_{ext}$
   iii) For $k = 1, 2, 3, \ldots$ until convergence
      a) Check for convergence ($\mathrm{norm}(R) < \epsilon$)
      b) Compute ${}^{n}_{0}f_{in}^{(k-1)}$
      c) Compute $R = f_{ext}^{n} - {}^{n}_{0}f_{in}^{(k-1)}$
      d) Solve (assembly-free) ${}^{n}_{0}K^{(k-1)} \, \Delta u^{(k)} = R$
      e) Update $u^{n} = u^{n} + \Delta u^{(k)}$
      f) Update $\varepsilon^{n}$, $S^{n}$
      g) Go to step 3.iii).a)
Observe that the algorithm uses a modular implementation of assembly-free deflated-CG [29] to
solve the large-deformation problem. The process to obtain element tangent stiffness and internal
force to set up the assembly-free linear solve is laid out next.
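The load-stepping structure of Algorithm 6 can be illustrated on a toy problem. The sketch below is not the thesis code: the internal force and tangent are supplied as callbacks, a dense `numpy.linalg.solve` stands in for the assembly-free deflated-CG solve, and the two-DOF stiffening spring model (`K0`, `c`) is a hypothetical example chosen only to exercise the loop.

```python
import numpy as np


def newton_load_stepping(f_int, tangent, f_total, n_steps=5, tol=1e-3, max_newton=50):
    """Apply f_total in n_steps increments; Newton-iterate to equilibrium in each."""
    u = np.zeros_like(f_total, dtype=float)      # step 1
    df = f_total / n_steps                       # step 2: load increment
    f_ext = np.zeros_like(u)
    for _ in range(n_steps):                     # step 3
        f_ext = f_ext + df                       # step 3.ii
        for _ in range(max_newton):              # step 3.iii
            R = f_ext - f_int(u)                 # residual (steps b, c)
            if np.linalg.norm(R) < tol:          # convergence check (step a)
                break
            du = np.linalg.solve(tangent(u), R)  # step d: dense stand-in solver
            u = u + du                           # step e
    return u


# Hypothetical 2-DOF model with cubic stiffening: f_int(u) = K0 u + c u^3
K0 = 1000.0 * np.array([[2.0, -1.0], [-1.0, 1.0]])
c = 1.0e4
f_int = lambda u: K0 @ u + c * u**3
tangent = lambda u: K0 + np.diag(3.0 * c * u**2)

u = newton_load_stepping(f_int, tangent, np.array([0.0, 100.0]))
u_linear = np.linalg.solve(K0, np.array([0.0, 100.0]))
print("nonlinear:", u, "linear:", u_linear)
```

In this toy model the stiffening term reduces the displacement relative to the linear solution, which mirrors the qualitative behavior described for the cantilever in Figure 60.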
The terms in step 3.iii).f) of Algorithm 6 are the Green-Lagrange strain and the 2nd
Piola-Kirchhoff (PK) stress tensors, respectively [68]. They are energy-conjugate terms used in
the total Lagrangian formulation [68] and depend on the current displacement field. The
linearized relation between the 2nd PK stress and the Green-Lagrange strain in Voigt form is
similar to the stress-strain relationship in linear elasticity [68], [69]:
\[
S = D \varepsilon
\tag{65}
\]
These terms are essential in updating internal force for the next iteration in Newton method [68],
[69]. The algorithm for updating these tensors for any given element is described next. Detailed
derivation of the algorithm is available in [68], [69].
Algorithm 7: Update deformation gradient and 2nd PK stress for an element
1. For all Gauss points
   i. Compute displacement gradient $H_{ij} = \sum_{I=1}^{8} (\partial N_I / \partial X_j) \, u_{iI}$
   ii. Compute deformation gradient $F_{def} = I + H$
   iii. Compute Green-Lagrange strain $\varepsilon = \frac{1}{2}(H + H^{T} + H^{T} H)$
   iv. Compute 2nd PK stress in Voigt form $S = D \varepsilon$
The deformation gradient computed in step 1.ii of Algorithm 7 is a measure of the spatial
deformation of the current configuration of an element (at force step t) with respect to the
initial configuration (at force step 0) [68], [69]. Mathematically, the deformation gradient at
any point in the domain is expressed as:
\[
F_{def} = \frac{\partial \, {}^{t}X}{\partial \, {}^{0}X}
\tag{66}
\]
where
${}^{t}X$ are the spatial coordinates at step $t$, and
${}^{0}X$ are the initial spatial coordinates at step 0.
The mapping relationship between two configurations through deformation gradient is illustrated
in Figure 61.
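The per-Gauss-point update of Algorithm 7 is sketched below in NumPy. It is an illustration under stated assumptions rather than the thesis code: an isotropic material, Voigt ordering (xx, yy, zz, xy, yz, xz) with engineering shear strains, and function names invented for this example. A useful sanity check is that a rigid rotation produces a non-trivial deformation gradient but zero Green-Lagrange strain.

```python
import numpy as np


def isotropic_D(E, nu):
    """6 x 6 isotropic elasticity matrix, Voigt order (xx, yy, zz, xy, yz, xz)."""
    lam = E * nu / ((1 + nu) * (1 - 2 * nu))
    mu = E / (2 * (1 + nu))
    D = np.zeros((6, 6))
    D[:3, :3] = lam + 2 * mu * np.eye(3)
    D[3:, 3:] = mu * np.eye(3)
    return D


def update_kinematics(dN_dX, u_nodes, D):
    """One Gauss point of Algorithm 7.

    dN_dX   : (8, 3) shape-function gradients w.r.t. initial coordinates
    u_nodes : (8, 3) nodal displacements
    """
    H = u_nodes.T @ dN_dX                     # H_ij = sum_I dN_I/dX_j * u_iI
    F = np.eye(3) + H                         # deformation gradient
    eps = 0.5 * (H + H.T + H.T @ H)           # Green-Lagrange strain
    eps_voigt = np.array([eps[0, 0], eps[1, 1], eps[2, 2],
                          2 * eps[0, 1], 2 * eps[1, 2], 2 * eps[0, 2]])
    S = D @ eps_voigt                         # 2nd PK stress in Voigt form
    return F, eps, S
```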
Figure 61: Mapping of current mesh through deformation gradient
Use of this term allows the numerical integration to be performed over the initial
configuration in the TL formulation [68], [69]. Once the deformation gradient and 2nd PK stress
for the elements are computed, the internal force can be updated for each element via
Algorithm 8 [69].
Algorithm 8: Updating internal force for an element
1. Initialize $f_{in} = 0$
2. For all Gauss points
   i. Compute shape function gradient for all nodes $[B]_{Ij} = [\partial N_I(X) / \partial X_j]$
   ii. Get $S$, $F_{def}$ for the element
   iii. Compute nominal stress $P = S F_{def}^{T}$
   iv. Update $\{f_{in}\}_I = \{f_{in}\}_I + B_I^{T} P \, |J_0| \, w_{gp}$ for all nodes $I$ in the element
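A compact NumPy version of the Gauss-point update in Algorithm 8 is sketched below. It assumes the 2nd PK stress is supplied in 3 x 3 tensor form (not Voigt form) and that the nominal stress is formed as P = S F^T; the function name and argument layout are illustrative.

```python
import numpy as np


def internal_force_gauss_point(dN_dX, S, F, detJ0, w_gp):
    """Contribution of one Gauss point to the (8, 3) element internal force.

    dN_dX : (8, 3) shape-function gradients w.r.t. initial coordinates
    S     : (3, 3) 2nd PK stress tensor
    F     : (3, 3) deformation gradient
    """
    P = S @ F.T                          # nominal stress (step iii)
    return dN_dX @ P * detJ0 * w_gp      # row I is B_I^T P |J0| w_gp (step iv)
```

Because the shape-function gradients sum to zero over the nodes of an element, the nodal forces returned here are self-equilibrated for any stress state, which gives a cheap consistency check.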
The displacement solution for large deformation relies heavily on the assembly-free linear solve
step in Algorithm 6. The assembly-free linear solve requires an assembly-free implementation of
SpMV involving the tangent stiffness matrix and the incremental displacement vector. The tangent
stiffness matrix has two components: geometric stiffness and linearized material stiffness [68],
[69]. The geometric stiffness is the counterpart of the stress stiffness matrix for buckling
analysis defined in Equation (55), except that the 2nd PK stress is used for stiffness
computation in large-deformation analysis [68], [69]. The material stiffness is similar to the
stiffness matrix in linear elasticity when the integration is performed over the current
configuration. However, to use the element stiffness from the initial reference configuration,
the deformation gradients from Algorithm 7 are required. Algorithm 9 outlines the assembly-free
implementation of SpMV for the tangent stiffness, where y is the resultant vector and u is the
displacement vector.
Algorithm 9: Assembly-free SpMV for an element
1. Initialize $y = 0$
2. For all Gauss points
   i. Compute shape function gradient for all nodes $[B]_{Ij} = [\partial N_I(X) / \partial X_j]$
   ii. Get $S$, $F_{def}$ for the element
   iii. Get template linear stiffness $K_e$ for the element at the Gauss point
   iv. For a given node pair I, J (3,3)
      a. Compute geometric stiffness $[K_{geo}]_{IJ} = [B_I^{T} S B_J] \, |J_0| \, w_{gp}$
      b. Compute linearized material stiffness $[K_{mat}]_{IJ} = [F_{def} K_e F_{def}^{T}]_{IJ}$
      c. Update $y_I = y_I + ([K_{geo}]_{IJ} + [K_{mat}]_{IJ}) \, u_J$ for all nodes $I$
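The node-pair loop of Algorithm 9 can be sketched as follows. Several details are assumptions made for this illustration and are not confirmed by the text: the geometric block is taken as the scalar B_I^T S B_J times the 3 x 3 identity (the usual form), the material block is taken as F K_e F^T, and the template stiffness is stored as an (8, 8, 3, 3) array of node-pair blocks that already includes its integration weight.

```python
import numpy as np


def tangent_spmv_gauss_point(dN_dX, S, F, K_e, u_nodes, detJ0, w_gp):
    """Matrix-free product y = K_tangent u for one Gauss point of one element.

    dN_dX   : (8, 3) shape-function gradients w.r.t. initial coordinates
    S       : (3, 3) 2nd PK stress tensor
    F       : (3, 3) deformation gradient
    K_e     : (8, 8, 3, 3) template linear stiffness blocks [K_e]_IJ (assumed layout)
    u_nodes : (8, 3) nodal values of the vector being multiplied
    """
    n = dN_dX.shape[0]
    y = np.zeros((n, 3))
    for I in range(n):
        for J in range(n):
            # geometric stiffness: scalar B_I^T S B_J on the block diagonal (assumed)
            k_geo = (dN_dX[I] @ S @ dN_dX[J]) * detJ0 * w_gp * np.eye(3)
            # material stiffness: template block mapped by the deformation gradient
            k_mat = F @ K_e[I, J] @ F.T
            y[I] += (k_geo + k_mat) @ u_nodes[J]
    return y
```

No element matrix is stored in this sketch: each 3 x 3 block is formed, applied to `u_nodes[J]`, and discarded, which is the point of an assembly-free product.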
At this point, it is important to mention that the algorithm presented has not been optimized in
several respects. Areas that require further improvement in implementation include 1) dynamic
force stepping [68], 2) reduced integration with hour-glass control [3], [69], and 3) parallel
implementation. Therefore, the speed of the assembly-free TL formulation remains a topic of
ongoing research, and timing results are not presented in this thesis.
The results only show the convergence behavior of deflated-CG when solving the linear system in
step 3.iii).d) of Algorithm 6.
4.4.2 Numerical results: large-deformation analysis of beam
The beam problem used for large deformation analysis is illustrated in Figure 62. The problem is
set up for large deformation via planar stretch for in-plane load, and bending deformation for
out-of-plane load.
The longest dimension of the beam is 0.5 m, with a cross-section of 20 mm × 50 mm. The material
is alloy steel with Young's modulus E = 2.1 × 10^11 Pa and Poisson's ratio ν = 0.28. The beam
was discretized using 5000 voxel elements, creating a 17,500 DOF system. The load values are
set to yield a maximum displacement of 0.06 m for the linear elasticity problem in both cases, as
illustrated in Figure 63.
Figure 62: Large-deformation analysis on beam (panels: in-plane large deformation; out-of-plane large deformation)
Figure 63: Displacement results for linear static FEA (a: linear-elastic displacement for in-plane load; b: linear-elastic displacement for out-of-plane load)
The system was solved for the large-deformation formulation through Algorithm 6. The loading was
divided into 5 steps, and each step was allowed a tolerance of ε = 10^-3 for the equilibrium
condition to be satisfied between internal and external forces. The linear system was solved
through the assembly-free deflated-CG presented in [29]. The displacement results for the
large-deformation formulation are illustrated in Figure 64.
The results were verified for accuracy against SolidWorks [10], and they agreed for a 15,000
DOF system. The convergence of CG without deflation is compared with rigid-body deflation
for in-plane large deformation (shown in Figure 65). The number of peaks in the plot indicates
the number of linear solves performed by Algorithm 6 to achieve equilibrium.
a) Displacement for in-plane load
b) Displacement for out-of-plane load
Figure 64: Displacement results for large-deformation FEA
The advantage of deflated-CG is overwhelming, even for a small system and for the
well-conditioned problem posed by the in-plane load case. Since the displacement is expected to
have significant dilation, the experiment was repeated with 1st order elastic-polynomial
deflation. For the same number of groups, the convergence of 1st order elastic-polynomial
deflation was compared with that of rigid-body deflation. The convergence plot is shown in
Figure 66.
Figure 65: Convergence plot: CG w/o deflation vs Rigid-body
The advantage is limited; however, as the size of the problem increases, 1st order
elastic-polynomial deflation may scale better than rigid-body deflation. This requires further
investigation.
The linear solve for the out-of-plane load without deflation did not converge in 20,000
iterations, even for a small system of 17,500 DOF. Therefore, the convergence plot was generated
for rigid-body deflation and Euler-Bernoulli beam deflation using the same number of groups. The
plot is illustrated in Figure 67.
Figure 66: Convergence plot: Rigid-body vs 1st order Elastic-polynomial
The behavior of individual deflation methods for large-deformation problems is left for future
research. But the naïve implementation presented for large-deformation highlights the
adaptability of physics-based deflation for all types of FEA problems in solid mechanics.
4.5 Summary
The formulations presented in this chapter illustrate that FEA typically requires an efficient
linear solver. Assembly-free deflated-CG is a general-purpose linear solver that meets this
requirement for a large variety of problems in FEA. The formulations emphasize the ease with
which assembly-free deflation methods can be integrated as the linear solver for various types
of FEA.
The implementation of assembly-free transient FEA was not included in this chapter; however,
Mirzendehdel and Suresh performed dynamic analysis in [9] using the assembly-free deflated-CG
presented in this thesis.
Figure 67: Convergence plot: Rigid-body vs Euler-Bernoulli beam
Results presented in this chapter highlight the advantage of using assembly-free FEA for
large-scale problems. The advantage and efficiency are a direct consequence of limiting the
memory requirement for solving the system. This allows the proposed assembly-free deflated-CG to
be competitive with commercial solvers.
While the algorithm for large-deformation analysis can be optimized to achieve equilibrium
condition faster, solving the linearized system remains the most expensive operation. Further
research in assembly-free SpMV for large-deformation is required.
There is also the issue of accuracy for non-conforming mesh, specifically voxelized geometry.
For this, assembly-free SpMV needs to be improved to better exploit congruency in conforming
mesh. On the other hand, voxelization offers a robust discretization method for complex
geometry without any need for de-featuring. It also provides quick solutions for large-scale
problems that can be used to optimize the design, as is the case in topology optimization. In the
next chapter, application of assembly-free FEA in topology optimization is discussed.
5 APPLICATION: TOPOLOGY OPTIMIZATION
In this chapter, the application of assembly-free FEA in large-scale topology optimization is
discussed.
In topology optimization, the goal is to find the optimum material distribution while minimizing
some objective function with given constraints [70], [71]. The ‘topology’ of the domain is treated
as design variable, i.e., introduction of ‘holes’ is permitted and expected during optimization.
Figure 68(a) outlines the steps involved in topology optimization. The steps are illustrated using
an example of a 2D cantilever minimizing compliance for a desired volume fraction of 0.5 in
Figure 68(b).
Step 1 in topology optimization, shown in Figure 68(a), is initializing the design space. The
design space (D) is the allowable region within which the material can be distributed. The
boundary consists of free, Dirichlet (fixed), and Neumann (traction) portions. Dirichlet and
Neumann boundaries are typically retained during the optimization.
Step 2 is performing Finite Element Analysis (FEA) over the current design. FEA is performed
by discretizing the design, typically via finite-elements, and solving, for example, the
equilibrium Equation (3).
Step 3 is performing the sensitivity analysis on the discretized space. Sensitivity is defined as
the change of the objective function with respect to a change in the design variables. The
objective function in this case is compliance (f^T u). Design variables vary depending on the
optimization method. The design variable for optimization methods such as Solid Isotropic
Material with Penalization (SIMP) [72] is the element density (ρ_e), where ρ_e ∈ [0, 1]. Level
set methods, on the other hand, typically use boundary variation as the design variable to
compute sensitivity. Alternatively, a topological sensitivity field is used in [73].
(a) Flow chart of topology optimization (b) An illustrative example
Figure 68: Topology Optimization of 2D Cantilever
Step 4 is filtering/smoothening the sensitivity field. This is typically needed to avoid pathological
conditions during optimization, such as checker-board patterns [74].
Step 5 is carrying out an optimization step where the design variables are updated and constraints
are verified, etc.
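To illustrate why the filtering in Step 4 suppresses checker-board patterns, the sketch below applies a simple neighbor average to a 2D sensitivity field. This is a generic smoothing stencil chosen for illustration; it is not the filter of [74] or the one used in this thesis, and the function name is hypothetical.

```python
import numpy as np


def smooth_sensitivity(field: np.ndarray) -> np.ndarray:
    """Average each cell with its four edge-adjacent neighbours on a 2D grid.

    Cells on the boundary reuse their own value in place of missing neighbours.
    """
    p = np.pad(field, 1, mode="edge")
    return (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
            + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0


# A pure checker-board field of +1 / -1 values is strongly damped,
# while a uniform field passes through unchanged.
checker = np.indices((6, 6)).sum(axis=0) % 2 * 2.0 - 1.0
print(np.abs(checker).max(), np.abs(smooth_sensitivity(checker)).max())
```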
Topology optimization can involve hundreds of finite element operations, and this can be
computationally demanding, especially in 3D. For example, Wang et al [75] published results for
a compliance optimization problem on a 3D cantilever beam shown in Figure 69.
The optimized topology for the 3D cantilever beam with a volume fraction of 0.5 is shown in
Figure 70. Total time taken to optimize the topology using a mesh with 100,000 DOF was about
2.4 hours, whereas with 1 million DOF, their implementation took 45.7 hours to complete.
Optimizing such systems is fairly common in the industry [76].
Figure 69: 3D Cantilever Beam
Figure 70: 3D Cantilever Beam Optimized
Strategies have therefore been developed to accelerate topology optimization along two fronts: 1)
improved optimization techniques, and 2) faster FEA. Several papers have been published on
improved optimization methods [77]–[81]; this list is not exhaustive. The level-set method
described in [82] is used as the optimization algorithm to illustrate the application of
assembly-free FEA.
5.1 Voxel-mesh in topology optimization
In this section, impact of using a non-conforming voxel-mesh for topology optimization is
discussed. For example, consider an L-bracket to be optimized for compliance as shown in
Figure 71.
Typically, the geometry would be discretized using a conforming mesh as shown in Figure 72(a).
The algorithm then eliminates the elements in the process of optimization. In this case, the
density field is used to determine optimum topology as shown in Figure 72(b).
Figure 71: Geometry of 2D L-Bracket
However, instead of using a conforming mesh, one can discretize the geometry using a grid mesh
as illustrated in Figure 73(a), which is non-conforming to the boundary. Using the same
optimization algorithm, observe that the final topology for the grid mesh (shown in Figure 73(b))
is very similar to the topology obtained with the conforming mesh (Figure 72(b)).
As a second example, the same geometry was optimized for minimum stress for a volume
fraction of 0.5. The optimized topologies for both the conforming and grid meshes are
illustrated in Figure 74. Observe that high-stress features (such as the fillet) are modified
during the optimization, and thus the advantage of having a conforming mesh for better local
stress results is lost.
Figure 72: (a) Conforming mesh for L-bracket, and (b) Optimized topology
Figure 73: (a) Grid mesh for L-bracket, and (b) Optimized topology
One can hypothesize that topology optimization is relatively insensitive to non-conformity of
mesh. The sensitivity computed for the given meshes would certainly be different. However, the
noise introduced in the sensitivity field due to non-conformity is eliminated in the
filtering/smoothening step of topology optimization.
Since topology optimization is used in the initial stages of conceptual design, structured grid
meshes are sufficient: they lead to similar conceptual designs and can be exploited to
accelerate finite element analysis. Therefore, assembly-free FEA implemented for voxel-mesh is
well suited to topology optimization.
As an example of an optimization problem solved through assembly-free FEA, consider the
buckling-constrained topology optimization presented in [55].
(a) Conforming mesh (b) Grid mesh
Figure 74: Optimized for minimum stress
5.2 Buckling constrained optimization
The buckling-constrained topology optimization problem is posed as a constrained minimization
problem in [55]:
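\[
\begin{aligned}
\min_{\Omega \subset D} \quad & |\Omega| \\
& J \le J_{max} \\
& \sigma \le \sigma_{max} \\
& \lambda_{c} \ge \lambda_{min}
\end{aligned}
\tag{67}
\]
where
$\Omega$: domain of objective topology
$D$: allowable design space
$J$: compliance of structure
$J_{max}$: maximum allowable compliance
$\sigma$: maximum von Mises stress for current design
$\sigma_{max}$: maximum allowable von Mises stress
$\lambda_{c}$: critical buckling load
$\lambda_{min}$: minimum buckling load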
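The sensitivity expression for buckling is derived, for example, in [83], and is given by:
\[
\lambda' = -\frac{w^{T} \left( K' + \lambda K_{\sigma}' \right) w}{w^{T} K_{\sigma} w}
\tag{68}
\]
It is assumed that the eigenvectors have been Kσ-orthonormalized, such that:
\[
w^{T} K_{\sigma} w = -1
\tag{69}
\]
Thus, the sensitivity expression can be rewritten as:
\[
\lambda' = w^{T} \left( K' + \lambda K_{\sigma}' \right) w
\tag{70}
\]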
Unlike SIMP methods, where the sensitivity is obtained with respect to pseudo-density variables,
here the sensitivity corresponds to the discrete addition or removal of an element; for example,
the discrete sensitivity of the stiffness matrix to element deletion is given by:
\[
K' =
\begin{bmatrix}
0 & 0 & 0 \\
0 & k_{e} & 0 \\
0 & 0 & 0
\end{bmatrix}_{N \times N}
\tag{71}
\]
where k_e is the elemental stiffness matrix. The second part of the derivative can be neglected for
linear elastic problem as per the derivation presented in [83]. While the derivative is computed in
[83] with respect to element density, the same can be extended for a discrete element variable.
The element-by-element sensitivity can be projected to the nodes to obtain a continuous field of
topological sensitivity.
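Because only the element block of the matrix in Equation (71) is non-zero, a quadratic form such as w^T K' w can be evaluated from the element's own degrees of freedom without ever forming an N x N matrix. The sketch below demonstrates this equivalence; the function names, the DOF indices and the sign and scaling of the block are illustrative assumptions.

```python
import numpy as np


def embedded_stiffness_change(k_e, dofs, n):
    """N x N matrix that is zero except for the element block k_e (explicit form)."""
    dK = np.zeros((n, n))
    dK[np.ix_(dofs, dofs)] = k_e
    return dK


def element_quadratic_form(w, k_e, dofs):
    """Evaluate w^T K' w using only the entries of w that belong to the element."""
    w_e = w[dofs]
    return w_e @ k_e @ w_e


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
k_e = A @ A.T                       # hypothetical symmetric element matrix
dofs = [2, 3, 6, 7]                 # hypothetical global DOF indices of the element
w = rng.standard_normal(10)

full = w @ embedded_stiffness_change(k_e, dofs, 10) @ w
local = element_quadratic_form(w, k_e, dofs)
print(np.isclose(full, local))
```

The element-level form is what makes an element-by-element sensitivity field inexpensive to compute for large meshes.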
The sensitivity fields for stress and compliance were obtained using the implementation
presented by Suresh and Takalloozadeh in [84]. Combining the topological sensitivity of
buckling with the sensitivity of compliance [ref], and stress [ref] one can generate topological
level-set T for constrained optimization problem described in Equation (67). Figure 75 illustrates
the algorithm proposed in [55] for buckling constrained optimization.
For the optimization, FEA for static and buckling analysis is solved using proposed assembly-
free deflated-CG.
5.2.1 Numerical results: optimizing a thin column
Consider minimizing the volume of a thin column with a compressive load, illustrated in Figure
76.
Figure 75: Algorithm for buckling-constrained topology optimization
Specifically, the objective is to solve the topology optimization problem:
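\[
\begin{aligned}
\min_{\Omega \subset D} \quad & |\Omega| \\
& J \le 5 J_{0} \\
& \sigma \le 5 \sigma_{0} \\
& \lambda_{c} \ge (SF) \, \lambda_{0}
\end{aligned}
\tag{72}
\]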
In other words, the maximum allowable von Mises stress and compliance are each 5 times their
initial values. For the buckling constraint, a safety factor (SF) was prescribed with respect
to the initial buckling load.
The structure was voxelized with 500,000 DOF, and the time taken for buckling analysis was 46
sec. As the safety factor (SF) is increased in Equation (72), the buckling constraint begins to
dominate, resulting in topologies illustrated in Figure 77.
Figure 76: Thin column with compressive load
Table 11 lists the timing results for optimization results shown in Figure 77.
Table 11: Minimizing volume for Stiff structure
Prescribed S.F. Final Volume Fraction Time (in min) #FEA
No constraint 0.3 15 64
1.1 0.3 38.3 86
1.5 0.42 41.5 98
2 0.52 24 74
5.2.2 Numerical results: optimizing a thin plate
The structural problem considered next is illustrated in Figure 78. The plate is of dimensions
100x100x10 (mm); the lower face is fixed while a uniform load is applied on the top. The
topology optimization problem is the same as defined in Equation (72).
a) w/o buckling; b) SF= 1.1; c) SF = 1.5; d) SF = 2
Figure 77: Stiff designs with different safety factors
The structure was again voxelized with 500,000 DOF. The time taken for one FEA run, including
buckling analysis, was 7.1 seconds. The topologies resulting from the various buckling safety
factors imposed are illustrated in Figure 79. Observe that, as the safety factor is increased,
additional ribs are introduced.
Figure 78: Plate with compressive load
The final volume fractions and time taken are summarized in Table 12. Once again, as the
buckling safety factor is increased, the optimization process converges to a higher volume
fraction, as expected.
Table 12: Optimizing plate with buckling constraints
Buckling Safety Factor Time (in seconds) Volume fraction #FEA
No constraint 300 0.3 61
4 432 0.35 98
7 332 0.42 85
10 230 0.61 62
5.3 Impact of Assembly-free FEA in compliance optimization
Consider the compliance optimization of the cantilever beam presented by Wang et al [75],
illustrated in Figure 69 and Figure 70. The optimization is carried out using the proposed
assembly-free FEA. The experiment was repeated with several sub-implementations over the course
of this research.
a) No buckling constraint b) SF = 4
c) SF = 7 d) SF = 10
Figure 79: Optimized topologies for various safety factors.
The speed of optimization through all such experiments is compared against the published data
from [75]. Results for time taken to complete the optimization are listed in Table 13. The
computer architecture and the year the data was generated are also listed in the table.
Table 13: Optimization speed for compliance minimization
Method (year)                                          Hardware                                   107,184 DOF   1,010,160 DOF
Data from Wang et al [ref] (2007)                      AMD Opteron 2 core, 8GB RAM                2.4 hours     45.7 hours
Element Connectivity based Assembly-free SpMV (2014)   AMD Phenom 4 core, 4GB RAM                 4.6 mins      6 hours
Element Connectivity based Assembly-free SpMV (2014)   GTX 460, 336 core, 0.75 GB device memory   2.2 mins      2 hours
Node Connectivity based Assembly-free SpMV (2014)      AMD Phenom 4 core, 4GB RAM                 2.7 mins      4.2 hours
Node Connectivity based Assembly-free SpMV (2014)      GTX 460, 336 core, 0.75 GB device memory   1.4 mins      1.1 hours
Assembly-free FEA with deflation (2015)                Intel i7 8 core, 16 GB RAM                 20 secs       7.9 mins
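To make the magnitude of the improvement in Table 13 concrete, the following short Python sketch converts the first and last entries of the table to seconds and computes their ratio. The variable and function names are chosen only for this illustration, and the ratio mixes algorithmic and hardware gains, since the two columns were measured on different machines.

```python
# Wall-clock times taken from Table 13, converted to seconds.
HOUR, MIN = 3600.0, 60.0

# Published data from [75] (AMD Opteron, 2007).
wang_2007 = {107_184: 2.4 * HOUR, 1_010_160: 45.7 * HOUR}
# Assembly-free FEA with deflation (Intel i7, 2015).
deflated_2015 = {107_184: 20.0, 1_010_160: 7.9 * MIN}


def speedup(baseline, improved):
    """Ratio of baseline time to improved time for each problem size."""
    return {dof: baseline[dof] / improved[dof] for dof in baseline}


ratios = speedup(wang_2007, deflated_2015)
for dof, ratio in ratios.items():
    print(f"{dof:>9,d} DOF: roughly {ratio:.0f}x shorter optimization time")
```

Both ratios are of the order of a few hundred, which is why the hardware comparison in the summary that follows is needed before attributing the gain to the solver alone.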
5.4 Summary
In their paper, Wang et al [75] illustrate findings that strongly recommend using iterative
methods for large-scale optimization. The assembly-free deflated-CG proposed in this thesis is
another step towards improving efficiency by limiting the memory requirements for such large-
scale optimization problems.
The numbers indicate a massive improvement over the course of this research, but part of that
improvement reflects evolving hardware rather than the algorithms alone. For a fairer
comparison, Table 14 lists the optimization times of the same implementations on last year's
platforms and on the current one.
Table 14: Comparison of optimization speed across various platforms
Method                                          Hardware                                   107,184 DOF   1,010,160 DOF
Element Connectivity based Assembly-free SpMV   AMD Phenom 4 core, 4GB RAM                 4.6 mins      6 hours
Element Connectivity based Assembly-free SpMV   GTX 460, 336 core, 0.75 GB device memory   2.2 mins      2 hours
Element Connectivity based Assembly-free SpMV   Intel i7 8 core, 16 GB RAM                 2.3 mins      34.4 mins
Node Connectivity based Assembly-free SpMV      AMD Phenom 4 core, 4GB RAM                 2.7 mins      4.2 hours
Node Connectivity based Assembly-free SpMV      GTX 460, 336 core, 0.75 GB device memory   1.4 mins      1.1 hours
Node Connectivity based Assembly-free SpMV      Intel i7 8 core, 16 GB RAM                 1.25 mins     22.4 mins
The assembly-free FEA exploits the compute capabilities provided by these technologies by
minimizing the amount of data required to perform the analysis, so that more of the time is
spent on the computation itself.
6 CONCLUSION AND FUTURE WORK
6.1 Conclusion
The main contribution of this thesis is an efficient assembly-free deflated-CG for large-scale
FEA, specifically targeting solid mechanics.
To achieve efficiency in an assembly-free manner, physics-based deflation was explored.
Exploratory research led to popular methods such as agglomeration [19] and rigid-body
deflation [28]. Using those as a starting point, a group of physics-based deflation methods was
proposed:
1) Kirchhoff-Love plate [29]: exploits curvature-based terms in ‘thin plate-like’
structures.
2) Euler-Bernoulli beam [29]: exploits the bending behavior of ‘thin beam-like’
structures.
3) Planar-rigid-body: exploits the dimensional reduction of a 3D problem to a 2D planar
system without adding the framework of 2D meshing.
4) Elastic-polynomial: exploits the displacement trial functions generated to satisfy Airy
stress functions for a 3D system [32].
Deflation methods allow the solver to exploit a reduced formulation within a straightforward
3D FEA, without any need for an explicit FEA over the reduced system.
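The role of the deflation space can be illustrated with a minimal Python sketch of deflated CG, written here in the standard form described in [21]. It is not the implementation developed in this thesis: the dense NumPy matrices, the 1D Laplacian test problem, the piecewise-constant aggregation used to build W, and all function and variable names are assumptions made only for this example.

```python
import numpy as np


def deflated_cg(A, b, W, tol=1e-10, max_iter=1000):
    """Deflated CG for a symmetric positive-definite A.

    The columns of W span the deflation space (assumed full column rank).
    Returns the solution and the number of iterations performed.
    """
    AW = A @ W                      # n x k, computed once
    E = W.T @ AW                    # k x k reduced (coarse) matrix

    def coarse_solve(v):
        return np.linalg.solve(E, v)

    # Initial guess chosen so that the residual is orthogonal to W.
    x = W @ coarse_solve(W.T @ b)
    r = b - A @ x
    # Initial search direction, made A-orthogonal to the deflation space.
    p = r - W @ coarse_solve(AW.T @ r)
    rr = r @ r
    b_norm = np.linalg.norm(b)

    for it in range(max_iter):
        if np.sqrt(rr) <= tol * b_norm:
            return x, it
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        beta = rr_new / rr
        # Standard CG update plus a correction that keeps p A-orthogonal to W.
        p = beta * p + r - W @ coarse_solve(AW.T @ r)
        rr = rr_new
    return x, max_iter


# Hypothetical test problem: 1D Laplacian with 200 unknowns.
n, group_size = 200, 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

# Aggregation-style deflation space: one constant vector per group of nodes.
W = np.zeros((n, n // group_size))
W[np.arange(n), np.arange(n) // group_size] = 1.0

x, iterations = deflated_cg(A, b, W)
print("iterations:", iterations)
print("max error :", np.max(np.abs(x - np.linalg.solve(A, b))))
```

Only the small k x k matrix E is factored, while the full system is touched solely through products with A. This is what allows the reduced formulation to be used without a separate FEA over the reduced system.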
The main advantage of the proposed deflation methods lies in their assembly-free
implementation. Through assembly-free deflation, one can accelerate convergence with minimal
memory overhead. The assembly-free deflation also requires an assembly-free SpMV to have a
truly effective solver for large-scale FEA.
For effective SpMV, congruency of elements is exploited to limit the memory requirements. The
benefits of congruency were clearly evident in a non-conforming voxel-based discretization.
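The memory argument behind congruency can be sketched in a few lines of Python. The example below is only an illustration under simplifying assumptions: it uses a hypothetical chain of identical 1D bar elements instead of voxel hexahedra, and the names `assembly_free_spmv`, `Ke` and `connectivity` are invented for this sketch. Because all elements are congruent, a single element stiffness matrix is stored and reused, and the global matrix is never formed during the product.

```python
import numpy as np


def assembly_free_spmv(Ke, connectivity, x):
    """Compute y = K x element by element without assembling K.

    Ke is the single element stiffness matrix shared by all congruent
    elements; connectivity lists the global DOF indices of each element.
    """
    y = np.zeros_like(x)
    for dofs in connectivity:
        # Gather element DOF, multiply by the shared Ke, scatter-add back.
        y[dofs] += Ke @ x[dofs]
    return y


# Hypothetical mesh: 5 identical two-node bar elements in a row.
n_elem = 5
Ke = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
connectivity = [np.array([e, e + 1]) for e in range(n_elem)]
x = np.arange(n_elem + 1, dtype=float) ** 2

y = assembly_free_spmv(Ke, connectivity, x)

# Reference: explicitly assembled global matrix, built here only for checking.
K = np.zeros((n_elem + 1, n_elem + 1))
for dofs in connectivity:
    K[np.ix_(dofs, dofs)] += Ke

print(np.allclose(y, K @ x))
```

The storage for the operator is one small matrix plus the connectivity, independent of the number of elements. In a mesh with several congruence classes, one matrix per class would be stored instead.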
The examples presented in Chapters 4 and 5 illustrate the applicability of assembly-free
deflated-CG for large-scale FEA. Moreover, the examples for static FEA illustrate the limited
memory requirement to handle very large systems.
6.2 Future work
There are many open research problems presented in this thesis that require further investigation.
A few of them are listed next.
6.2.1 Effectiveness of Elastic-polynomials
The entire family of generalized polynomials for elastic fields introduced in [32] appears to be
promising for deflation methods. The deflation-space developed from 1st order elastic-
polynomial appeared to be only marginally better than rigid-body deflation.
However, a careful observation of the mapping operator in Equation (29) reveals the nature
of the additional DOF available in the 1st order elastic-polynomial. The DOF in the elastic-polynomial
comprise the 6 rigid-body motions, 3 DOF for dilation in each of the directions X, Y, Z, and 3 DOF
for linearized twisting about X, Y, Z.
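As a rough illustration of the 12 DOF counted above, the following Python sketch builds a per-node mapping block for a complete first-order (affine) displacement field about a group centroid. It is not the operator of Equation (29): the function name `affine_deflation_block`, the column ordering, and in particular the representation of the three twisting modes as symmetric shear modes are assumptions made only for this example.

```python
import itertools
import numpy as np


def affine_deflation_block(r):
    """3 x 12 block mapping 12 group DOF to the displacement of one node.

    r is the node position relative to the group centroid. Columns are
    ordered as: 3 translations, 3 linearized rotations, 3 dilations,
    3 shear-type modes (assumed stand-in for the twisting modes).
    """
    x, y, z = r
    translation = np.eye(3)
    # Linearized rotation u = theta x r, one column per rotation axis.
    rotation = np.array([[0.0, z, -y],
                         [-z, 0.0, x],
                         [y, -x, 0.0]])
    # Uniform stretching along X, Y and Z.
    dilation = np.diag([x, y, z])
    # Symmetric shear modes (assumption, see lead-in).
    shear = np.array([[0.0, z, y],
                      [z, 0.0, x],
                      [y, x, 0.0]])
    return np.hstack([translation, rotation, dilation, shear])


# Hypothetical group: the 8 corners of a unit cube centred at the origin.
corners = np.array(list(itertools.product([-0.5, 0.5], repeat=3)))
W_group = np.vstack([affine_deflation_block(r) for r in corners])

print(W_group.shape)
print(np.linalg.matrix_rank(W_group))
```

Dropping the last six columns of each block leaves the six rigid-body modes, which is one way to see the elastic-polynomial space as an enrichment of rigid-body deflation.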
For small displacements, the dilation and twisting effects are small, which explains the
only marginally better performance observed in deflated-CG. It will be interesting to observe its
behavior for non-linear thermo-elastic problems with softer materials and high thermal
coefficients that are susceptible to large dilations. This also opens up the entire family of
elastic-polynomials for membrane problems.
6.2.2 Feature based deflation
The implementation of the deflation-space presented in this thesis applies a single type of deflation
over the entire discretized domain. The domain is identified as thick, plate-like, beam-like, etc.,
and the appropriate deflation method is applied over the aggregated nodes.
Rather than identifying the entire domain as a specific type of problem, the aggregated nodes can
be classified by the feature they are part of, such as thick, plate-like, or beam-like. A specific
deflation can then be applied to each group depending on its feature. This is similar to ‘hybrid FEA’,
with the advantage of simple assembly-free deflation instead of reduced-dimension FEA over
different features.
Identifying features based on geometry is a research problem that needs further investigation.
6.2.3 Assembly-free non-linear FEA
The assembly-free FEA for large-deformation problems presented in this thesis is a simplistic
approach to geometric non-linearity. The implementation is basic and requires improvement in the
convergence of the Newton method in Algorithm 6. While research is available on effective
convergence of the equilibrium equation, the assembly-free implementation of several modules needs
to be addressed. The current implementation relies on a full 8-point integration scheme, which is
highly inefficient.
Furthermore, material non-linearity has to be explored for assembly-free non-linear FEA. The
assembly-free deflated-CG proposed in this thesis has been successfully used for multi-material
FEA in [85]. Therefore, one can easily use the proposed deflated-CG for material non-linearity.
6.2.4 Assembly-free SpMV for conforming mesh
This is perhaps the most important problem that needs to be addressed to make the proposed
method fully versatile. The congruency criterion presented is the first step towards solving
this issue.
The implementation of finding congruent elements has to be optimized. This research problem is
perhaps better suited for computational sciences.
REFERENCES
[1] O. C. Zienkiewicz, The Finite Element Method: Its Basis and Fundamentals. Elsevier
Butterworth Heinemann, 2005.
[2] O. C. Zienkiewicz and R. L. Taylor, The Finite Element Method for Solid and Structural
Mechanics. Elsevier, 2005.
[3] R. D. Cook, Concepts and Applications of Finite Element Analysis. John Wiley & Sons,
2002.
[4] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 1996.
[5] R. Aubry, F. Mut, S. Dey, and R. Lohner, “Deflated Preconditioned Conjugate Gradient
Solvers for Linear Elasticity,” International Journal for Numerical Methods in
Engineering, vol. 88, pp. 1112–1127.
[6] ANSYS 13. ANSYS; www.ansys.com, 2012.
[7] Y. Saad, Iterative Methods for Sparse Linear Systems. SIAM, 2003.
[8] J. R. Shewchuk, “An Introduction to the Conjugate Gradient Method Without the
Agonizing Pain.”
[9] A. M. Mirzendehdel and K. Suresh, “A Fast Time-Stepping Strategy for the Newmark-Beta
Method,” in ASME 2014 International Design Engineering Technical Conferences and
Computers and Information in Engineering Conference, 2014, pp. V01AT02A020–
V01AT02A020.
[10] SolidWorks; www.solidworks.com. 2005.
[11] K. Y. Sze, “Three-dimensional continuum finite element models for plate/shell analysis,”
Prog. Struct. Eng. Mater, vol. 4, pp. 400–407, 2002.
[12] “HOME - PATRIOT Engineering Company.” [Online]. Available:
http://www.patriotengineeringco.com/index.htm. [Accessed: 02-Dec-2015].
[13] “Finite Element FEA | FEA and Plastic Injection Solution.” [Online]. Available:
http://feasolution.com/finite-element-analysis-solution/. [Accessed: 02-Dec-2015].
[14] R. Barrett, M. W. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R.
Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems:
Building Blocks for Iterative Methods. Philadelphia: SIAM Press, 1993.
[15] M. Benzi, “Preconditioning Techniques for Large Linear Systems: A Survey,” Journal of
Computational Physics, vol. 182, no. 2, pp. 418–477, Nov. 2002.
[16] L. N. Trefethen and D. Bau III, Numerical Linear Algebra, vol. 50. SIAM, 1997.
[17] M. Benzi and M. Tuma, “A Robust Incomplete Factorization Preconditioner for Positive
Definite Matrices,” Numerical Linear Algebra With Applications, vol. 10, pp. 385–400,
2003.
[18] Y. Saad, Numerical methods for large eigenvalue problems. Manchester University Press.
[19] R. A. Nicolaides, “Deflation of Conjugate Gradients with Applications to Boundary Value
Problems,” SIAM Journal on Numerical Analysis, vol. 24, no. 2, pp. 355–365, 1987.
[20] M. Adams, “Evaluation of three unstructured multigrid methods on 3D finite element
problems in solid mechanics,” International Journal for Numerical Methods in
Engineering, vol. 55, no. 2, pp. 519–534, 2002.
[21] Y. Saad, M. Yeung, J. Erhel, and F. Guyomarc’h, “A Deflated Version of the Conjugate
Gradient Algorithm,” pp. 1909–1926, 2000.
[22] S. F. McCormick, Ed., Multigrid Methods. SIAM, 1987.
[23] W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid Tutorial. SIAM, 2000.
[24] J. A. Mitchell and J. N. Reddy, “A Multilevel Hierarchical Preconditioner for Thin Elastic
Solids,” International Journal of Numerical Methods in Engineering, vol. 43, pp. 1383–
1400, 1998.
[25] P. Arbenz et al., “A Scalable Multi-level Preconditioner for Matrix-Free μ-Finite
Element Analysis of Human Bone Structures,” International Journal for Numerical
Methods in Engineering, vol. 73, no. 7, pp. 927–947, 2008.
[26] V. Mishra and K. Suresh, “Efficient Analysis of 3-D Plates via Algebraic Reduction,”
San Diego, CA, 2009, vol. 2, pp. 75–82.
[27] V. Mishra and K. Suresh, “Dual Representation Methods for Efficient and Automatable
Analysis of 3D Plates,” Journal of Computing and Information Science in Engineering,
vol. 10, no. 4, Nov. 2010.
[28] V. E. Bulgakov and G. Kuhn, “High-performance multilevel iterative aggregation solver for
large finite-element structural analysis problems,” International Journal for Numerical
Methods in Engineering, vol. 38, no. 20, pp. 3529–3544, 1995.
[29] P. Yadav and K. Suresh, “Large Scale Finite Element Analysis Via Assembly-Free
Deflated Conjugate Gradient,” J. Comput. Inf. Sci. Eng., vol. 14, no. 4, pp. 041008–041008,
Oct. 2014.
[30] S. Timoshenko and S. W. Krieger, Theory of Plates and Shells. New York: McGraw-Hill
Book Company, 1959.
[31] W. Pilkey, Analysis and Design of Elastic Beams. New York, NY: John Wiley, 2002.
[32] M.-Z. Wang, B.-X. Xu, and Y.-T. Zhao, “General representations of polynomial elastic
fields,” Journal of Applied Mechanics, vol. 79, no. 2, p. 021017, 2012.
[33] X.-R. Fu, S. Cen, C. F. Li, and X.-M. Chen, “Analytical trial function method for
development of new 8-node plane element based on the variational principle containing
Airy stress function,” Engineering Computations, vol. 27, no. 4, pp. 442–463, 2010.
[34] M. Duan, “5-node hybrid/mixed finite element for Reissner-Mindlin plate,” Finite Elements
in Analysis and Design, vol. 33, pp. 167–185, 1999.
[35] K. Y. Sze, “A hybrid stress ANS solid-shell element and its generalization for smart
structure modelling. Part I - solid-shell element formulation,” International Journal for
Numerical Methods in Engineering, vol. 48, pp. 545–564, 2000.
[36] M. Duan, “Effective Hybrid/Mixed Finite Elements for Folded-Plate Structures,” Journal of
Engineering Mechanics, vol. 128, no. 2, pp. 202–208, 2002.
[37] K. Suresh, “Generalization of the mid-element based dimensional reduction,” Journal of
Computing and Information Science in Engineering, vol. 3, no. 4, pp. 308–314, 2003.
[38] K. Suresh, “Generalization of the Kantorovich Method of Dimensional Reduction,”
Albuquerque, 2003.
[39] K. Gallivan, “Model reduction via truncation: an interpolation point of view,” Linear
Algebra and its Applications, vol. 375, pp. 115–134, 2003.
[40] K. Jorabchi and K. Suresh, “Algebraic Dimensional Reduction,” San Francisco, CA, 2007.
[41] E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” in
Proceedings of the 1969 24th national conference, 1969, pp. 157–172.
[42] J. R. Gilbert, C. Moler, and R. Schreiber, “Sparse matrices in MATLAB: design and
implementation,” SIAM Journal on Matrix Analysis and Applications, vol. 13, no. 1, pp.
333–356, 1992.
[43] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of
sparse matrix–vector multiplication on emerging multicore platforms,” Parallel Computing,
vol. 35, no. 3, pp. 178–194, 2009.
[44] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, “Sparse Matrix Solvers on the GPU:
Conjugate Gradients and Multigrid,” in ACM SIGGRAPH 2003 Papers, New York, NY,
USA, 2003, pp. 917–924.
[45] N. Bell, “Efficient Sparse Matrix-Vector Multiplication on CUDA,” 2008.
[46] X. Yang, S. Parthasarathy, and P. Sadayappan, “Fast Sparse Matrix-vector Multiplication
on GPUs: Implications for Graph Mining,” Proc. VLDB Endow., vol. 4, no. 4, pp. 231–242,
Jan. 2011.
[47] T. J. R. Hughes, I. Levit, and J. Winget, “An element-by-element solution algorithm for
problems of structural and solid mechanics,” Computer Methods in Applied Mechanics and
Engineering, vol. 36, no. 2, pp. 241–254, Feb. 1983.
[48] A. Akbariyeh, “Large Scale Finite Element Analysis Using GPU Parallel Computing,”
hgpu.org, Aug. 2012.
[49] J. Michopoulos, J. C. Hermanson, A. P. Iliopoulos, S. G. Lambrakos, and T. Furukawa,
“Data-Driven Design Optimization for Composite Material Characterization,” J. Comput.
Inf. Sci. Eng, vol. 11, no. 2, 2011.
[50] A. Borisov, M. Dickinson, and S. Hastings, “A Congruence Problem for Polyhedra,” The
American Mathematical Monthly, vol. 117, no. 3, pp. 232–249, 2010.
[51] P. Yadav and K. Suresh, “Limited-Memory Deflated Conjugate Gradient for Solid
Mechanics,” presented at the IDETC/CIE 2014, Buffalo, NY, 2014.
[52] NVIDIA CUDA: Compute Unified Device Architecture, Programming Guide. Santa Clara.,
2008.
[53] “cuBLAS.” [Online]. Available: http://docs.nvidia.com/cuda/cublas/#axzz3BWEcDEag.
[Accessed: 26-Aug-2014].
[54] P. Yadav and K. Suresh, “Assembly-Free Large-Scale Modal Analysis on the Graphics-
Programmable Unit,” Journal of Computing and Information Science in Engineering, vol.
13, no. 1, p. 011003, 2013.
[55] X. Bian, P. Yadav, and K. Suresh, “Assembly-Free Buckling Analysis for Topology
Optimization,” DETC2015-46351, ASME-IDETC Conference, Boston, MA, Aug. 2015.
[56] R. G. Grimes, J. G. Lewis, and H. D. Simon, “A Shifted Block Lanczos Algorithm for
Solving Sparse Symmetric Generalized Eigenproblems,” p. 228, 1994.
[57] D. C. Sorensen, “Numerical methods for large eigenvalue problems,” pp. 519–584, 2002.
[58] P. Arbenz, U. L. Hetmaniuk, R. B. Lehoucq, and R. S. Tuminaro, “A Comparison of
Eigensolvers for Large-scale 3D Modal Analysis using AMG-Preconditioned Iterative
Methods,” pp. 204–236, 2005.
[59] G. H. Golub and Q. Ye, “An Inverse free Preconditioned Krylov Subspace method for
Symmetric Generalized Eigenvalue problems,” pp. 312–334, 2002.
[60] A. V. Knyazev, “Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block
Preconditioned Conjugate Gradient Method,” pp. 517–541, 2001.
[61] G. L. G. Sleijpen and H. A. Van der Vorst, “A Jacobi-Davidson Iteration Method for Linear
Eigenvalue Problems,” pp. 401–425, 1996.
[62] L. Bergamaschi, Á. Martínez, and G. Pini, “Parallel preconditioned conjugate gradient
optimization of the Rayleigh quotient for the solution of sparse eigenproblems,” p. 1964,
2006.
[63] Y. T. Feng and D. R. J. Owen, “Conjugate Gradient Methods for Solving the Smallest
Eigenpair of Large Symmetric Eigenvalue Problems,” pp. 2209–2230, 1996.
[64] H.-J. Jang, “Preconditioned Conjugate Gradient Method for Large Generalized
Eigenproblems,” Trends in Mathematics Information Center for Mathematical Sciences,
vol. 4, no. 2, pp. 103–109, 2001.
[65] J. Nocedal and S. J. Wright, Numerical Optimization. New York: Springer Science + Business
Media, 2006.
[66] H. Yang, “Conjugate Gradient Methods for the Rayleigh Quotient Minimization of
Generalized Eigenvalue Problems,” pp. 79–94, 1993.
[67] I. C. F. Ipsen, “Computing an Eigenvector with Inverse Iteration,” pp. 254–291, 1997.
[68] K. J. Bathe, Finite Element Procedures. Englewood Cliffs, NJ: Prentice-Hall, 1996.
[69] T. Belytschko, Nonlinear finite elements for continua and structures. John Wiley & Sons
Inc, 2000.
[70] M. P. Bendsøe, Topology optimization theory, methods and applications. Berlin
Heidelberg: Springer Verlag, 2003.
[71] T. Y. Chen, “Multiobjective optimal topology design of structures,” Computational
Mechanics, vol. 21, pp. 483–492, 1998.
[72] O. Sigmund, “A 99 line topology optimization code written in Matlab,” Structural and
Multidisciplinary Optimization, vol. 21, no. 2, pp. 120–127, 2001.
[73] S. Amstutz, “A new algorithm for topology optimization using a level-set method,” Journal
of Computational Physics, vol. 216, pp. 573–588, 2006.
[74] O. Sigmund, “Numerical instabilities in topology optimization: A survey on procedures
dealing with checkerboards, mesh-dependencies and local minima,” Structural and
Multidisciplinary Optimization, vol. 16, no. 1, pp. 68–75, 1998.
[75] S. Wang, E. de Sturler, and G. Paulino, “Large-scale topology optimization using
preconditioned Krylov subspace methods with recycling,” International Journal for
Numerical Methods in Engineering, vol. 69, no. 12, pp. 2441–2468, 2007.
[76] H. A. Eschenauer and N. Olhoff, “Topology optimization of continuum structures: A
review,” Applied Mechanics Review, vol. 54, no. 4, pp. 331–389, 2001.
[77] M. P. Bendsøe and N. Kikuchi, “Generating optimal topologies in structural design using a
homogenization method,” Computer Methods in Applied Mechanics and Engineering, vol.
71, pp. 197–224, 1988.
[78] G. I. N. Rozvany, “A critical review of established methods of structural topology
optimization,” Structural and Multidisciplinary Optimization, vol. 37, no. 3, pp. 217–237,
2009.
[79] Y. I. Kim and B. M. Kwak, “Design space optimization using a numerical design
continuation method,” International Journal for Numerical Methods in Engineering, vol.
53, pp. 1979–2002, 2002.
[80] J. F. Aguilar Madeira, “Multi-objective optimization of structures topology by genetic
algorithms,” Advances in Engineering Software, vol. 36, no. 1, pp. 21–28, 2005.
[81] T. Belytschko, “Topology Optimization with Implicit Functions and Regularization,”
International Journal for Numerical Methods in Engineering, vol. 57, no. 8, pp.
1177–1196, 2003.
[82] K. Suresh, “Efficient Generation of Large-Scale Pareto-Optimal Topologies,” Structural
and Multidisciplinary Optimization, vol. 47, no. 1, pp. 49–61, 2013.
[83] S. J. van den Boom, “Topology Optimisation Including Buckling Analysis,” Delft
University of Technology, Delft, 2014.
[84] K. Suresh and M. Takalloozadeh, “Stress-Constrained Topology Optimization: A
Topological Level-Set Approach,” Structural and Multidisciplinary Optimization, vol.
Submitted, 2012.
[85] A. M. Mirzendehdel and K. Suresh, “A Pareto-Optimal Approach to Multimaterial
Topology Optimization,” Journal of Mechanical Design, vol. 137, no. 10, p. 101701, 2015.