
Symbolic Factorisation of Sparse Matrix Using Elimination Trees

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor-Master of Technology (Dual Degree)

by

Peeyush Jain

Department of Computer Science and Engineering
Indian Institute of Technology Kanpur

Kanpur


Dedicated to My Parents and Teachers


Acknowledgements

I would like to take this opportunity to express my deep sense of gratitude to the person who has taught me what dedication is, my thesis supervisor Dr. Phalguni Gupta. His benevolent guidance, apt suggestions, unstinted help and constructive criticism have inspired me towards the successful completion of the present work.

I also extend my sincere thanks to all the faculty members of the Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, for the invaluable knowledge they have imparted to me and for teaching its principles in the most exciting and enjoyable way. My stay at Indian Institute of Technology Kanpur has been exciting and enlightening. The time I spent with my friends Gaurav, Mohit, Rahul, Ashish, Ashvin and Saeed is unforgettable. I am grateful for their constant companionship, which strengthened me at difficult moments.

I take this opportunity to thank my parents for all that they have done for me. Without their love, support and encouragement, I would never have reached this stage in my life.

Peeyush Jain


Abstract

Many problems in science and engineering require the solution of linear systems of equations. As the problems get larger, it becomes increasingly important to exploit the sparsity inherent in many such linear systems. It is well recognized that finding a fill-reducing ordering is crucial to the success of the numerical solution of sparse linear systems. The use of a hybrid ordering partitioner is expected to significantly improve the fill-in of the factorized matrices, as well as the scalability of the elimination tree obtained by symbolic factorization. The most obvious way to get the required increase in performance is to use parallel algorithms.

For dense symmetric matrices, there are quite a few well-known parallel algorithms that scale well and can be implemented efficiently on parallel computers. On the other hand, there are not many efficient, scalable parallel formulations for sparse matrix factorization using elimination trees.

A well-known sparse matrix ordering scheme, PORD (Paderborn Ordering tool), uses the last element in the computation of the present element, which prevents parallelization of the whole algorithm globally. PORD spends most of its time splitting the graph into two parts and coloring it. This thesis therefore tries to parallelize the most frequently executed part of the factorization algorithm, in order to get a better result in the symbolic factorization step. The approach given in this thesis may be useful in parallel graph computations that currently use a sequential ordering step. By some estimates, more than 90% of eigenvalue problems are real symmetric or complex Hermitian; this gives us the flexibility to run the ordering step in parallel alongside many other parallel matrix computation algorithms.

Several simple modifications to the minimum local fill-in ordering strategy are also presented in this thesis; these strategies exploit readily available information about node adjacencies to improve the fill bounds used to select a node for elimination. In particular, this thesis describes two simple modifications to the well-known node selection strategy AMMF (approximate minimum mean local fill-in) that further improve ordering quality. It is demonstrated that the different node selection strategies produce a smaller number of fronts, which yields a better ordering and, in turn, lower subsequent factorization complexity.



Contents

Acknowledgements

Abstract

Contents

List of Figures

1 Introduction
1.1 Introduction
1.1.1 Application used in our thesis
1.1.2 Parallel implementation part of our application
1.1.3 Node selection strategies of the proposed algorithm
1.2 Organization of the thesis

2 Literature Review
2.1 Greedy ordering heuristics
2.2 Graph-partitioning based heuristics
2.3 Hybrid heuristics

3 Parallel Construction of Ordering Scheme
3.1 Different types of ordering methods
3.1.1 Bottom-up methods
3.1.2 Top-down methods
3.1.2.1 Multilevel approach
3.1.2.2 Domain decomposition approach
3.1.3 Hybrid methods
3.2 Ordering Scheme used by PORD
3.3 Parallel implementation of the defined scheme
3.4 Results and Discussion

4 Node Selection Strategies for the Construction of Vertex Separators
4.1 Node selection strategy of PORD
4.2 Proposed approach for the node selection
4.2.1 First Modification
4.2.2 Second Modification
4.3 Results and Discussion

5 Software, tools and the configuration
5.1 Some Information About the Software Used
5.1.1 BLAS
5.1.2 BLACS
5.1.3 ScaLAPACK
5.1.4 MPICH
5.1.5 MUMPS
5.2 How to use the system

6 Conclusion and Scope for Future Work
6.1 Conclusion
6.2 Scope of future work

A Symbolic Factorization and Elimination Tree
A.1 Cholesky Factorisation
A.2 Numerical Factorisation
A.3 Symbolic Factorisation
A.4 Algorithms for symbolic factorisation
A.4.1 A graph representation of symbolic matrices
A.4.2 A basic algorithm
A.4.3 Fast symbolic Cholesky factorisation
A.5 Elimination tree

B Elimination Graph and Quotient Graph
B.1 Elimination graphs
B.2 Quotient graphs

C Message Passing Interface
C.1 Message Passing Interface
C.1.1 Point-to-Point Communication Routines
C.1.2 Collective Communication Routines
C.1.3 Group and Communicator Management Routines

D Global Array Toolkit


List of Figures

1.1 Conversion from symmetric matrix A to Cholesky factor LL^T without any ordering algorithm
1.2 Conversion from symmetric matrix A to Cholesky factor LL^T after applying the minimum degree algorithm
3.1 Changes in runtime vs. number of processors
3.2 Number of fronts vs. number of processors
5.1 Dependencies of the software
A.1 Graph induced by the sparse matrix
A.2 Note that the forest happens to contain only one tree
B.1 Elimination graph, quotient graph, and matrix for the first three steps
D.1 Structure of the Global Array Toolkit



Chapter 1

Introduction

1.1 Introduction

When solving large sparse symmetric linear systems of the form Ax = b, it is common to precede the numerical factorization by a symmetric reordering. This reordering is chosen so that pivoting down the diagonal in order on the resulting permuted matrix PAP^T = LL^T produces much less fill-in and work than computing the factors of A by pivoting down the diagonal in the original order. The reordering is computed using only information on the matrix structure, without taking account of numerical values, and so may not be stable for general matrices. However, if the matrix A is positive definite, a Cholesky factorization (Appendix A) can safely be used. This technique of preceding the numerical factorization with a symbolic analysis can also be extended to unsymmetric systems, although the numerical factorization phase must then allow for subsequent numerical pivoting. The goal of the preordering is to find a permutation matrix P such that the subsequent factorization has the least fill-in. Unfortunately, this problem is NP-complete, so heuristics are used.

The main challenge in sparse matrix ordering algorithms is to find a fill-minimizing permutation without computing A^T A or even its nonzero structure. While computing the nonzero structure of A^T A would allow us to use existing symmetric ordering algorithms and codes, it may be grossly inefficient. For example, when an n × n matrix A has non-zeros only in the first row, the first column (because of symmetry) and along the main diagonal, computing A^T A takes Ω(n²) work, but factoring A takes only O(n) work (Figure 1.1).
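To make the example concrete, here is a 4 × 4 instance of the pattern in LaTeX (an illustration added here, not a figure from the thesis):

% A has nonzeros (marked x) only on the diagonal and in the first row
% and column; A^T A is completely dense.
\[
A = \begin{pmatrix}
\times & \times & \times & \times \\
\times & \times &        &        \\
\times &        & \times &        \\
\times &        &        & \times
\end{pmatrix},
\qquad
A^T A = \begin{pmatrix}
\times & \times & \times & \times \\
\times & \times & \times & \times \\
\times & \times & \times & \times \\
\times & \times & \times & \times
\end{pmatrix}
\]

Since A^T A is completely dense, merely forming it costs Ω(n²), whereas A itself, with the dense row and column kept last in the elimination order, factors with no fill at all.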



Figure 1.1: Conversion from symmetric matrix A to Cholesky factor LL^T without any ordering algorithm

Improving the run time and quality of ordering heuristics has been a subject of research for almost three decades. Two main classes of successful heuristics have evolved over the years: (1) minimum-degree (MD)-based heuristics, and (2) graph-partitioning (GP)-based heuristics. MD-based heuristics are local greedy heuristics that reorder the columns of a symmetric sparse matrix so that the column with the fewest non-zeros at a given stage of factorization is the next one to be eliminated at that stage. GP-based heuristics regard the symmetric sparse matrix as the adjacency matrix of a graph and follow a divide-and-conquer strategy to label the nodes of the graph by partitioning it into smaller subgraphs.
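As a point of reference for the discussion that follows, here is a minimal sketch of the greedy minimum-degree idea on a dense adjacency matrix. It is illustrative only: production codes such as MMD and AMD work on quotient graphs and never form the elimination graph explicitly.

#include <stdlib.h>

/* A is n*n with A[i*n+j] = 1 iff {i,j} is an edge; perm receives
 * the elimination order.  A is modified to model fill. */
void minimum_degree_order(int n, char *A, int *perm)
{
    char *dead = calloc(n, 1);
    for (int k = 0; k < n; k++) {
        int best = -1, bestdeg = n + 1;
        for (int v = 0; v < n; v++) {       /* pick a vertex of minimum degree */
            if (dead[v]) continue;
            int d = 0;
            for (int w = 0; w < n; w++)
                if (!dead[w] && A[v*n + w]) d++;
            if (d < bestdeg) { bestdeg = d; best = v; }
        }
        perm[k] = best;
        dead[best] = 1;
        /* eliminating 'best' turns its remaining neighbours into a
         * clique, which models the fill created by this step */
        for (int u = 0; u < n; u++)
            for (int w = u + 1; w < n; w++)
                if (!dead[u] && !dead[w] && A[best*n + u] && A[best*n + w])
                    A[u*n + w] = A[w*n + u] = 1;
    }
    free(dead);
}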

The striking success of MD-based heuristics prompted intense research to improve their run time and quality, and they have been the methods of choice among practitioners. The multiple minimum-degree (MMD) algorithm by George and Liu [3] and the approximate minimum-degree (AMD) algorithm by Davis, Amestoy, and Duff [34] represent the state of the art in MD-based heuristics. Other heuristics widely used in the sparse matrix ordering step include the minimum local fill-in algorithm (MMF), approximate minimum mean local fill-in (AMMF), nested dissection (ND), and externally computed hybrid orderings such as Scotch, Metis, Pspases, Spooles, and PORD.

The minimum degree ordering algorithm is one of the most widely used heuristics, since it produces factors with relatively low fill-in on a wide range of matrices. Because of this, the algorithm has received much attention over the past three decades. Since the algorithm performs its pivot selection by choosing from the graph a node of minimum degree, improvements have been made to reduce the memory complexity, so that the algorithm can operate within the storage of the original matrix, and to reduce the amount of work needed to keep track of the degrees of the nodes in the graph (which is the most computationally intensive part of the algorithm). More recently, several researchers have relaxed this heuristic by computing upper bounds on the degrees, rather than the exact degrees, and selecting a node of minimum upper bound on the degree.

Figure 1.2: Conversion from symmetric matrix A to Cholesky factor LL^T after applying the minimum degree algorithm

Nested dissection is another effective method of finding an elimination ordering. The algorithm uses a divide-and-conquer strategy on the graph: removal of a set of vertices results in two new graphs, on which the dissection may be performed separately, and the results for the two parts are then combined to give an ordering of the entire graph. The algorithm is based on finding separators, and its recursion suggests a natural decomposition of the graph in terms of its separators. At the highest level is a separator that divides the graph into components; these components themselves have separators, and so on. At the lowest levels are components that may not be divided any further (possibly singleton vertex sets). This method has been shown to produce good elimination orderings for certain classes of graphs.
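A sketch of the recursive numbering, assuming a hypothetical helper find_separator() (not PORD's actual routine) that splits a vertex list into two parts plus a separator:

#define ND_SMALL 64

extern int order[];     /* order[v] = position of v in the elimination order */

/* hypothetical helper: splits verts[0..n-1] into two parts and a
 * separator, returning pointers into scratch storage and their sizes */
void find_separator(int *verts, int n,
                    int **p1, int *n1, int **p2, int *n2,
                    int **sep, int *ns);

int nd_label(int *verts, int n, int next_label)
{
    if (n <= ND_SMALL) {                 /* small subgraph: order directly */
        for (int i = 0; i < n; i++)      /* (a real code would use MMD)    */
            order[verts[i]] = next_label++;
        return next_label;
    }
    int *p1, *p2, *sep, n1, n2, ns;
    find_separator(verts, n, &p1, &n1, &p2, &n2, &sep, &ns);
    next_label = nd_label(p1, n1, next_label);   /* number subgraphs first */
    next_label = nd_label(p2, n2, next_label);
    for (int i = 0; i < ns; i++)                 /* separator numbered last */
        order[sep[i]] = next_label++;
    return next_label;
}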

In general, the above two algorithms can be combined to produce a better ordering. This hybrid ordering is sometimes called incomplete nested dissection, and it is used to produce more robust orderings. In this scheme the recursive process of constructing vertex separators is terminated after a few levels, and the vertices in the remaining subgraphs are ordered using either multiple minimum degree (MMD) or constrained minimum degree (CMD). Independently, Ashcraft and Liu [1, 2] and Rothberg [14] have shown that the quality of an incomplete nested dissection ordering can be further improved by using minimum degree to order the separator vertices instead of following the given nested dissection ordering. To summarize, two levels of hybridization can be found in the literature: incomplete nested dissection, and minimum degree post-processing on an incomplete nested dissection. The latter has been applied successfully in state-of-the-art ordering codes such as BEND, SPOOLES, and WGPP. Ashcraft and Liu [26] present a more general classification of hybrid schemes known as multi-section ordering.



PORD (Paderborn Ordering tool), which also uses the multi-section ordering defined by Ashcraft and Liu, presents an ordering algorithm that achieves a tighter coupling of bottom-up and top-down methods by interpreting vertex separators as the boundaries of the last elements in a bottom-up ordering. The fundamental idea of the algorithm is to generate a sequence of quotient graphs using a bottom-up node selection strategy. The quality of a bottom-up scheme such as minimum degree is quite sensitive to the way ties are broken when there is more than one vertex eligible for elimination.

1.1.1 Application used in our thesis

Sparse matrix ordering is an important problem with extensive applications in many areas, including scientific computing, VLSI design, task scheduling, geographical information systems and operations research. MUMPS ('MUltifrontal Massively Parallel Solver') also uses sparse matrix ordering schemes when solving systems of linear equations of the form Ax = b, where the matrix A is sparse and can be unsymmetric, symmetric positive definite, or general symmetric. MUMPS uses a multifrontal technique, a direct method based on either the LU or the LDL^T factorization of the matrix [18].

MUMPS distributes the work tasks among the processors, but an identified processor (the host) is required to perform most of the analysis phase, distribute the incoming matrix to the other processors (slaves) in the case where the matrix is centralized, and collect the solution. The system Ax = b is solved in three main steps:

1. Analysis. A range of orderings to preserve sparsity is available in the analysis phase. The host performs an ordering based on the symmetrized pattern A + A^T and carries out the symbolic factorization (Appendix A). A mapping of the multifrontal computational graph is then computed, and symbolic information is transferred from the host to the other processors. Using this information, the processors estimate the memory necessary for factorization and solution.

2. Factorization. The original matrix is first distributed to the processors that will participate in the numerical factorization. The numerical factorization of each frontal matrix is conducted by a master processor (determined by the analysis phase) and one or more slave processors (determined dynamically). Each processor allocates an array for contribution blocks and factors; the factors must be kept for the solution phase.

3. Solution. The right-hand side b is broadcast from the host to the other processors. These processors compute the solution x using the (distributed) factors computed during the previous step, and the solution is either assembled on the host or kept distributed on the processors.
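As an illustration of this three-step flow, here is a minimal centralized-input driver along the lines of the standard MUMPS C example. The MUMPS and MPI calls are the real interfaces; the tiny 2 × 2 system and the choice sym = 0 are our own assumptions for the sketch.

#include <stdio.h>
#include "mpi.h"
#include "dmumps_c.h"
#define JOB_INIT       -1
#define JOB_END        -2
#define USE_COMM_WORLD -987654

int main(int argc, char **argv)
{
    DMUMPS_STRUC_C id;
    int    irn[] = {1, 2};              /* 1-based row indices (Fortran style) */
    int    jcn[] = {1, 2};              /* 1-based column indices              */
    double a[]   = {2.0, 3.0};          /* A = diag(2, 3)                      */
    double rhs[] = {4.0, 9.0};          /* overwritten with the solution       */
    int myid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    id.job = JOB_INIT; id.par = 1; id.sym = 0;   /* unsymmetric solver */
    id.comm_fortran = USE_COMM_WORLD;
    dmumps_c(&id);                      /* initialize the MUMPS instance */

    if (myid == 0) {                    /* host holds the centralized matrix */
        id.n = 2; id.nz = 2;
        id.irn = irn; id.jcn = jcn; id.a = a; id.rhs = rhs;
    }
    id.icntl[6] = 4;                    /* ICNTL(7) = 4: request PORD ordering */
    id.job = 6;                         /* analysis + factorization + solution */
    dmumps_c(&id);

    if (myid == 0)
        printf("solution: %g %g\n", rhs[0], rhs[1]);

    id.job = JOB_END; dmumps_c(&id);    /* release the instance */
    MPI_Finalize();
    return 0;
}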

This thesis tries to reduce the complexity of the ordering step in order to reduce the fill-in produced in the analysis phase. As noted above, a single host processor performs the analysis phase sequentially; in this thesis we try to carry out the analysis phase in parallel instead of assigning it to one processor. MUMPS can use several ordering schemes in the analysis phase; this thesis uses PORD as the main ordering scheme.

1.1.2 Parallel implementation part of our application

The most obvious way to get the required increase in performance would be to use parallel algorithms. However, the parallelism apparent in the above algorithms is not so easy to exploit. The key observation is that nodes that are far apart can be eliminated (i.e., deleted, with cliques created among their neighbors) in the same step. Codes that use multi-section ordering can therefore eliminate several nodes in one step, and on a parallel machine different processors could be responsible for eliminating each node. Unfortunately, this parallelism is very fine grained, which makes it difficult to avoid high communication costs.

Because of the tighter coupling in PORD, it is quite difficult to parallelize the whole algorithm globally, since it performs many node eliminations in a single step. This thesis therefore tries to parallelize the most frequently executed part, in order to get a better result in the symbolic factorization step.

For the parallelization of the most frequently executed phase of the two-level method, our implementation first attempts a construction with MPI (Message Passing Interface) (Appendix C), which is used by much software to do sparse matrix computations in parallel. But MPI works only by passing messages between processors; it has no notion of global shared memory. Our process checks the multisector and changes the representative of each multisector vertex according to its vertex type, checksum and indistinguishability. Whenever one process changes a representative or vertex type, every other process has to learn about it, and gathering all this information correctly takes so much time that the sequential code is far better than this parallel version.

To increase the performance of the parallel version, we have to exploit the possibility of providing some globally shared memory space for certain structures. For this, the Global Array toolkit (GA) (Appendix D) has been used, which provides a portable Non-Uniform Memory Access (NUMA) shared memory programming environment in the context of distributed array data structures (called 'global arrays').

This approach is explained in a later section. The approach given in this thesis may be useful in parallel graph computations that currently use a sequential ordering algorithm.

1.1.3 Node selection strategies of the proposed algorithm

In multi-section ordering, selecting nodes based on their degree, checksum and vertex weight is an important issue for obtaining a good ordering. PORD uses different node selection strategies to get a good ordering in a small amount of time. A very effective but less popular strategy for node selection is the minimum local fill (or minimum deficiency) algorithm, which always chooses a node whose elimination creates the least amount of fill. However, it takes a huge amount of time to compute the ordering, as the sketch below suggests.
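A minimal sketch of the exact deficiency count, reusing the dense adjacency representation of the earlier minimum-degree sketch (illustrative only), shows why the metric is expensive: the cost is quadratic in the degree of the candidate vertex.

/* deficiency (local fill) of vertex v: the number of pairs of live
 * neighbours of v that are not themselves adjacent */
int deficiency(int n, const char *A, const char *dead, int v)
{
    int fill = 0;
    for (int u = 0; u < n; u++) {
        if (dead[u] || !A[v*n + u]) continue;
        for (int w = u + 1; w < n; w++)
            if (!dead[w] && A[v*n + w] && !A[u*n + w])
                fill++;   /* u, w are neighbours of v but not of each other */
    }
    return fill;
}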

There are two reasons why the minimum deficiency algorithm has not become as popular as the minimum degree algorithm. First, the minimum deficiency algorithm is typically much more expensive than the minimum degree algorithm. Second, it was long believed that the quality of minimum deficiency orderings is not much better than that of minimum degree orderings. Contrary to this belief, minimum local fill produces significantly better orderings than minimum degree, albeit at a greatly increased runtime.

PORD uses the AMMF node selection strategy to obtain a better ordering; because of the better ordering, PORD spends less time in numerical factorization (Appendix A). This thesis describes two simple modifications to this heuristic that produce even better orderings. Our proposed approach thus explores simple approximations to these fill metrics, with the goal of reducing the runtime of the known minimum local fill-in algorithm. The best of these approximations is no more difficult to compute than the degree, yet the orderings produced require less factorization work than those produced by the minimum degree strategy. This modification also gives a better result in the number of fronts, which is quite a striking factor in the subsequent ordering.



1.2 Organization of the thesis

The present thesis is divided into six chapters; a brief outline of each follows:

• The first chapter discusses general aspects of sparse matrix computation, mainly the pros and cons of different types of ordering algorithms. It also involves some background on symbolic factorization and elimination trees, which is explained in the appendices at the end of the thesis.

• In the second chapter, a complete literature review is presented to update the reader on ongoing work in the area of ordering algorithms for solving large sparse symmetric linear systems.

• In the third chapter, a parallel implementation of our proposed approach is presented. The chapter also discusses the different types of methods used for ordering a sparse matrix, defines the methodology used by PORD, and then presents our parallel approach. Results and a discussion of this implementation are also presented.

• In the fourth chapter, different types of node selection strategies are presented. The chapter separately presents PORD's node selection strategy and, based on it, defines two new node selection strategies for obtaining a better ordering. Results and a brief discussion of the effect of node selection strategies are also presented.

• In the fifth chapter, the set of software utilities is discussed, together with the environment in which our software runs and the dependencies between these utilities.

• The last chapter presents the conclusions drawn from the present work and its contribution to the literature. The scope for future work and the applications are also discussed there.



Chapter 2

Literature Review

It is well known that ordering the rows and columns of a matrix is a crucial step in the solution of sparse linear systems using elimination graphs [12, 22]. The ordering can drastically affect the amount of fill introduced during factorization and hence the cost of computing the factorization. When the matrix is symmetric and positive definite, the ordering step is independent of the numerical values and can be performed prior to numerical factorization. An ideal choice, for example, is an ordering that introduces the least fill.

Ordering thus has an important place in solving sparse linear systems [29, 35]. But because the underlying problem is NP-complete, almost all ordering algorithms are heuristic in nature. Examples include reverse Cuthill-McKee, automatic nested dissection, and minimum degree.

2.1 Greedy ordering heuristics

A greedy ordering heuristic numbers columns successively by selecting at each step a column with the optimal value of a metric. In the minimum degree algorithm of Tinney and Walker [41], the metric is the number of operations in the rank-1 update associated with a column in a right-looking sparse Cholesky factorization. The algorithm can be stated in terms of vertex eliminations in a graph representing the matrix. In this framework, the number of operations in the rank-1 update is proportional to the square of the degree of a vertex; consequently, implementations use the degree as the metric. Efficient implementations of minimum degree are due to George and Liu [1, 2, 3], and the minimum degree algorithm with multiple eliminations (MMD), due to Liu [28], has become very popular in the last decade. In MMD, multiple independent vertices are eliminated in a single step to reduce the ordering time. More recently, Amestoy, Davis, and Duff [34] have developed the approximate minimum degree (AMD) algorithm. AMD uses an approximation to the degree to further reduce the ordering time without degrading the quality of the orderings produced. Berman and Schnitger [37] have shown analytically that the minimum degree algorithm can, in some rare cases, produce a poor ordering. However, experiments have shown that the minimum degree algorithm and its variants are effective heuristics for generating fill-reducing orderings. In fact, only some very recent separator-based schemes have outperformed MMD for certain classes of sparse matrices; two of these new schemes are hybrids of a separator-based scheme and a greedy ordering strategy such as the minimum degree algorithm.

One more greedy ordering heuristic that was also proposed by Tinney and Walker, but has largely been ignored, is the minimum deficiency (or minimum fill) algorithm. It minimizes the number of fill entries introduced at each step of sparse Cholesky factorization (the deficiency, in graph terminology). Although the metrics look similar, the minimum deficiency and minimum degree algorithms are different; for example, the deficiency can be zero even when the degree is not. Results by Rothberg demonstrate that minimum deficiency leads to significantly better orderings than minimum degree. However, current implementations of the minimum deficiency algorithm are slower than MMD by more than an order of magnitude. Rothberg has investigated metrics for greedy ordering schemes based on approximations to the deficiency.

A more recent greedy ordering algorithm is proposed by Ng and Raghavan [16]. They establish that many of the techniques used in efficient implementations of the minimum degree algorithm (namely, indistinguishable vertices, mass elimination and outmatching) also apply to the minimum deficiency algorithm. They also corroborate Rothberg's [15] empirical results establishing the superior performance of the minimum deficiency metric. They describe two heuristics based on approximations to the deficiency and the degree; both metrics can be implemented using either the update mechanism in MMD or the faster scheme in AMD. In their paper, the correction term that approximates the shared edges is dropped, because they restrict their work to partial cliques that are disjoint; however, the heuristic performs poorly if the correction term is absent.



2.2 Graph-partitioning based heuristics

Graph-partitioning based heuristics are capable of producing better-quality orderings than minimum-degree based heuristics for finite-element problems, while staying within a small constant factor of the run time of minimum-degree based heuristics.

An important area where sparse-matrix orderings are used is linear programming. Until now, with the exception of Rothberg and Hendrickson [5], most researchers have focused on ordering sparse matrices arising in finite-element applications, and these applications have guided the development of the ordering heuristics. The use of the interior-point method for solving linear programming problems is relatively recent; as a result, the linear programming community has been using well-established heuristics that were not originally developed for its applications. A graph-partitioning based sparse matrix ordering algorithm is capable of generating robust orderings of sparse matrices arising in linear programming problems, in addition to finite-element and finite-difference matrices.

Graph-partitioning based ordering methods are more suitable than MD-based methods for solving sparse systems with direct methods on distributed-memory parallel computers, in two respects. First, there is strong theoretical and experimental evidence that graph partitioning, and sparse-matrix ordering based on it, can be parallelized effectively; on the other hand, the only attempt we are aware of to perform a minimum-degree ordering in parallel was not successful in reducing the ordering time over a serial implementation. Second, in addition to being parallelizable itself, a graph-partitioning based ordering also aids the parallelization of the factorization and triangular solution phases of a direct solver.

Gupta, Karypis, and Kumar [17, 31] have proposed a highly scalable parallel formulation of sparse Cholesky factorization. This algorithm derives a significant part of its parallelism from the underlying partitioning of the graph of the sparse matrix. Gupta and Kumar present efficient parallel algorithms for solving the lower- and upper-triangular systems resulting from sparse factorization. In both parallel factorization and triangular solution, part of the parallelism would be lost if an MD-based heuristic were used to preorder the sparse matrix.

Recent research has shown multilevel algorithms to be fast and effective in computing graph partitions. A typical multilevel graph-partitioning algorithm has four components: coarsening, initial partitioning, uncoarsening, and refining. Recently, Anshul Gupta [4] has presented a fast and effective graph-partitioning scheme, which he calls WGPP. With a graph-partitioning based ordering, the matrix columns corresponding to the nodes of a separator usually tend to become denser during factorization than the columns corresponding to the nodes of the subgraphs that the separator separates. This is because the separator columns receive fill-in from the columns of both of the subgraphs that they separate, whereas the columns of each subgraph receive fill-in only from the nodes of that subgraph. In addition, separators typically have fewer nodes than the separated subgraphs; a large number of columns contributing fill-in to a small number of separator columns results in the separator columns becoming relatively dense during factorization. This is quite a large drawback for input matrices of low density.

2.3 Hybrid heuristics

The quality of a greedy ordering scheme such as minimum degree is quite sensitive to the way ties are broken when there is more than one vertex eligible for elimination. Berman and Schnitger [37] describe a minimum degree elimination sequence for the k × k grid for which the number of factor entries and the number of factorization operations are an order of magnitude higher than optimal. On the other hand, a graph-partitioning scheme such as nested dissection produces asymptotically optimal orderings for these grids. The situation changes completely for h × k grids with large aspect ratio: there, minimum degree outperforms nested dissection.

Hybridizing the two kinds of heuristics, through incomplete nested dissection and minimum degree post-processing on an incomplete nested dissection, is an important way to improve the ordering. Such schemes have been used successfully in state-of-the-art ordering codes such as BEND, SPOOLES, TAUCS and PSPASES. Ashcraft and Liu [6, 7, 8, 9, 10] also present a more general classification of hybrid schemes known as multi-section ordering.

Jurgen Schulze [26] presented a hybrid heuristic ordering scheme which he calls PORD. PORD uses a two-level method to construct a vertex separator with the help of domain decomposition, but it works as a sequential algorithm. Using PORD as a parallel algorithm is not an easy task, because of its tight coupling of bottom-up and top-down methods via the interpretation of vertex separators as the boundaries of the last elements in a bottom-up ordering. Multiple node elimination therefore becomes the bottleneck for the parallelization of PORD.



PORD also uses different node selection strategies for the construction of vertex separators, which improve the ordering further. In this methodology, vertex separators are interpreted as the boundaries of the last elements in a bottom-up ordering; they are considered a tool for guiding the elimination process. The shortcomings of a bottom-up method such as minimum degree are largely due to the local nature of the algorithm, whereas vertex separators afford an insight into the global structure of the graph. The observation is that the quality of an ordering will be improved if the elements created in the elimination process have smooth boundaries.

In PORD, since the removal of a vertex separator S partitions a graph into two subgraphs, the variables corresponding to S constitute a large boundary segment that is shared by two well-aligned elements. When eliminating the vertex separators according to the given nested dissection order, well-aligned elements are merged to form new well-aligned elements; this is achieved by the recursive structure of the nested dissection algorithm. However, the nested dissection order represents only one possibility for creating new well-aligned elements. PORD considers all orderings that can be created by the rule: (a) eliminate all vertex separators in levels l + 1, ..., lev using the given nested dissection order, and (b) eliminate all vertex separators in levels 0, ..., l using a bottom-up algorithm, where lev denotes the depth of the elimination tree and l ∈ {0, ..., lev}.

Two enhancements of this node selection strategy are presented in this thesis; they simply decrease the number of fronts of the elimination tree obtained after the ordering step.



Chapter 3

Parallel Construction of Ordering Scheme

In this chapter, the different types of sparse matrix ordering methods that are most commonly used are first defined. The chapter then demonstrates the ordering method used by PORD and the parallel implementation of that scheme, together with results and discussion.

3.1 Different types of ordering methods

Over the years, many heuristics have been proposed in order to obtain good ordering methods. The most common methods used for ordering are presented here; the matrix-graph relation is also defined in this section.

3.1.1 Bottom-up Methods

There are many bottom-up ordering methods known in the sparse matrix world. The minimum degree algorithm is one of the most popular bottom-up ordering schemes [3, 6, 39]. Over the years many enhancements have been proposed to the basic algorithm that have greatly improved its efficiency.

Perhaps one of the most important enhancements is the concept of supernodes. Two vertices x, y of an elimination graph Gk belong to the same supernode if adjGk(x) ∪ {x} = adjGk(y) ∪ {y}; in this context the vertices x, y are called indistinguishable. Indistinguishable vertices possess two important properties: (a) they can be eliminated consecutively in a minimum degree ordering, and (b) they remain indistinguishable in all subsequent elimination graphs. As a consequence, all vertices that belong to a supernode I can be replaced by a single logical node of weight |I|. This significantly reduces the runtime of the minimum degree algorithm.

3.1.2 Top-down methods

The most efficient top-down ordering scheme is George's nested dissection algorithm [1]. Nested dissection is a divide-and-conquer strategy for ordering sparse matrices. Let Vs be a set of vertices (called a separator) whose removal, along with all edges incident on vertices in Vs, disconnects the graph into two remaining subgraphs G1 = (V1, E1) and G2 = (V2, E2). If the matrix is reordered so that the vertices within each subgraph are numbered contiguously and the vertices in the separator are numbered last, then the matrix will have a bordered block diagonal form, as sketched below. This idea can be applied recursively, breaking each subgraph into smaller and smaller pieces with successive separators, giving a nested sequence of dissections of the graph that inhibits fill and promotes concurrency at each level.
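A sketch of that structure in LaTeX (the block names are ours):

% After one dissection step with separator Vs numbered last, the
% permuted matrix takes bordered block diagonal form:
\[
P A P^T =
\begin{pmatrix}
  A_{11} & 0      & S_1^T \\
  0      & A_{22} & S_2^T \\
  S_1    & S_2    & A_{ss}
\end{pmatrix}
\]

Fill generated while factoring the block A_{11} cannot spread into A_{22} (and vice versa); only the separator rows and columns couple the two subproblems.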

The effectiveness of nested dissection in limiting fill depends on the size of the separators that split the graph, with smaller separators obviously being better [32, 33]. The relative sizes of the resulting subgraphs are also important: maximum benefit from the divide-and-conquer approach is obtained when the remaining subgraphs are of about the same size, so an effective nested dissection algorithm should not permit an arbitrarily skewed ratio between the sizes of the pieces.

In contrast to the bottom-up methods introduced above, the nested dissection algorithm is quite ill-specified; the determination of the separator is an important issue for a good implementation. Some approaches are discussed below.

3.1.2.1 Multilevel approach

Multilevel algorithms have been applied successfully to the construction of edge separators. Roughly speaking, a multilevel algorithm consists of three phases. In the first phase the original graph G is approximated by a sequence of smaller graphs that maintain the essential properties of G (coarsening phase). Then, an initial edge separator is constructed for the last graph in the sequence (partitioning phase). Finally, the edge separator is projected back to the next larger graph in the sequence until G is reached (uncoarsening phase). A local improvement heuristic such as Kernighan-Lin or Fiduccia-Mattheyses is used to refine the edge separator after each uncoarsening step. This approach is used prominently in graph-partitioning algorithms; a skeleton is sketched below.
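A skeleton of this three-phase scheme (the Graph type and all helper routines are hypothetical placeholders, not PORD's code):

#include <stdlib.h>

typedef struct Graph Graph;                     /* opaque graph type (hypothetical) */

Graph *coarsen(Graph *G);                       /* coarsening phase      */
void   initial_partition(Graph *G, int *part);  /* partitioning phase    */
void   project(Graph *Gc, const int *partc,
               Graph *G, int *part);            /* uncoarsening phase    */
void   fm_refine(Graph *G, int *part);          /* Fiduccia-Mattheyses refinement */
int    nvtxs(Graph *G);
void   free_graph(Graph *G);

#define COARSE_ENOUGH 100

void multilevel_partition(Graph *G, int *part)
{
    if (nvtxs(G) <= COARSE_ENOUGH) {    /* coarse enough: partition directly */
        initial_partition(G, part);
        return;
    }
    Graph *Gc = coarsen(G);             /* approximate G by a smaller graph */
    int *partc = malloc(nvtxs(Gc) * sizeof(int));
    multilevel_partition(Gc, partc);    /* recurse on the coarse graph      */
    project(Gc, partc, G, part);        /* carry the separator back to G    */
    fm_refine(G, part);                 /* local refinement after uncoarsening */
    free(partc);
    free_graph(Gc);
}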

3.1.2.2 Domain decomposition approach

Domain decomposition is a widely known top-down approach used by several ordering schemes. In contrast to the multilevel method, Ashcraft and Liu propose a two-level approach to construct a vertex separator. Analogous to the domain decomposition methods for solving PDEs (partial differential equations), the vertex set X of G is partitioned into

X = Φ ∪ Ω1 ∪ ... ∪ Ωr with adjG(Ωi) ⊂ Φ for all 1 ≤ i ≤ r,

where the Ωi are the domains. The set Φ is called the multisector. The removal of Φ splits G into the connected subgraphs G(Ω1), ..., G(Ωr). Once Φ has been found, a color from {WHITE, BLACK} is assigned to each Ωi. This induces a coloring of the vertices u ∈ Φ:

color(u) = WHITE, if all Ωi with u ∈ adjG(Ωi) are colored WHITE;
color(u) = BLACK, if all Ωi with u ∈ adjG(Ωi) are colored BLACK;
color(u) = GRAY, otherwise.

According to Ashcraft and Liu, the set S = {u ∈ Φ : color(u) = GRAY} constitutes a vertex separator of G for every coloring of Ω1, ..., Ωr if and only if

∀ u, v ∈ Φ : {u, v} ∈ E ⇒ ∃ Ωi with u, v ∈ adjG(Ωi).

In general, not all vertices u, v ∈ Φ satisfy this condition. Such vertices are then blocked together into segments V ⊂ Φ. As a result, one obtains a partitioning P = {V1, ..., Vs}, Φ = V1 ∪ ... ∪ Vs, of the multisector. The segments in P satisfy the following condition:

∀ V, V′ ∈ P : adjG(V) ∩ V′ ≠ ∅ ⇒ ∃ Ωi with V ∩ adjG(Ωi) ≠ ∅ and V′ ∩ adjG(Ωi) ≠ ∅.

The coloring can now be defined on segments:

color(V) = WHITE, if all Ωi with V ∩ adjG(Ωi) ≠ ∅ are colored WHITE;
color(V) = BLACK, if all Ωi with V ∩ adjG(Ωi) ≠ ∅ are colored BLACK;
color(V) = GRAY, otherwise.

This guarantees that S = {u ∈ Φ : ∃ V ∈ P with u ∈ V and color(V) = GRAY} constitutes a vertex separator of G for every coloring of Ω1, ..., Ωr, as the sketch below illustrates.
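A sketch of the induced coloring rule in C (the CSR-like segment-to-domain layout, listing for each segment the domains it touches, is a hypothetical choice made only for illustration):

enum { WHITE, BLACK, GRAY };

void color_segments(int nseg, const int *segdom_ptr, const int *segdom_adj,
                    const int *domcolor, int *segcolor)
{
    for (int s = 0; s < nseg; s++) {
        int white = 0, black = 0;
        for (int i = segdom_ptr[s]; i < segdom_ptr[s+1]; i++) {
            if (domcolor[segdom_adj[i]] == WHITE) white = 1;
            else                                  black = 1;
        }
        /* the GRAY segments together form the vertex separator S */
        segcolor[s] = (white && black) ? GRAY : (white ? WHITE : BLACK);
    }
}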

Ashcraft and Liu use a block Fiduccia-Mattheyses scheme to determine a coloring of the sets Ω1, ..., Ωr that minimizes the size of the induced vertex separator S. Once S has been found, a sophisticated network-flow algorithm is used to smooth S. (This part has been taken from Jurgen Schulze [26].)

3.1.3 Hybrid methods

Many hybrid methods exist that combine bottom-up and top-down methods. In order to improve both run times and ordering quality, this thesis in fact uses a hybrid of minimum degree and nested dissection. PORD hybridizes the methods in two ways. The first is the standard incomplete nested dissection method: starting with the original graph, several levels of nested dissection are performed, and once the subgraphs are smaller than a certain size they are ordered using minimum degree. This allows the ordering to reap the benefits of nested dissection at the top levels, where most of the factorization work is performed, while obtaining the runtime advantages of minimum degree on the smaller problems.

The second hybridization used by PORD is minimum degree post-processing on an incomplete nested dissection ordering: the separator vertices are reordered using minimum degree. The intuition behind this hybrid is that nested dissection makes an implicit assumption that recursive division of the problem is the best approach to ordering; allowing minimum degree to reorder the separator vertices removes this assumption.



3.2 Ordering Scheme used by PORD

In PORD, vertex separators are interpreted as the boundaries of the last elements in a bottom-up ordering. As a consequence, quotient graphs (Appendix B) and special node selection strategies are used for the construction of separators; this achieves a tighter coupling of bottom-up and top-down methods.

Most ordering schemes use a matching technique to coarsen a graph G = (X, E). In PORD, however, the coarsening process relies on quotient graphs. It starts by constructing an initial quotient graph G0 from G. Based on G0, a sequence of quotient graphs G1, ..., Gt is produced, where Gi is obtained from Gi−1, 1 ≤ i ≤ t, by the elimination of certain variables.

PORD also applies the coloring scheme after finding the separator, and it tries to minimize the total weight of all the variables in the quotient graph, which increases the probability of finding a small (i.e., lightly weighted) separator of Gi and G. The process consists of four steps:

1. Construction of the initial quotient graph. This starts by computing a maximal independent set of the graph. This set is called a multisector, and PORD tries to remove this vertex set from the initial graph as a separator. The separator follows the rule of the domain decomposition approach.

2. Construction of further quotient graphs. Once the initial separator has been found, the implementation constructs the further quotient graph Gi from Gi−1. Each merging operation corresponds to an elimination step in a bottom-up algorithm. Next, all remaining variables that are adjacent to exactly one element are merged with that element. Once a new quotient graph has been constructed, all variables that are adjacent to the same set of elements can be replaced by a single supervariable; this further reduces the number of nodes in the quotient graph.

3. Coloring of quotient graphs. After finding the separator of each quotient graph, the scheme colors the vertices using the coloring scheme defined by the domain decomposition approach, assigning one of three colors (white, black and gray) to each supervariable of the quotient graph.

4. Smoothing the final separator. After coloring the vertices, the method tries to balance the weight of the white and black vertices to obtain a good separator from the quotient graph. Often a separator can be improved by exchanging a boundary segment with the vertices of a domain. The whole process is repeated until neither of the two minimum weighted vertex covers (black and white) improves the actual separator.

The performance of this quotient graph method depends crucially on the performance of the separator function. This function is the entry point of our optimization.

3.3 Parallel implementation of the defined scheme

PORD uses a two-level method to construct a vertex separator with the help of domain decomposition. As shown in the previous section, the separator function is the basic building block of the whole ordering algorithm. Parallelizing the whole algorithm globally is quite difficult because of the tight coupling of the ordering scheme and the multi-vertex elimination performed in a single step, so we choose the most frequently executed function and parallelize it.

For the computation of a large sparse matrix, a large fraction of the time is spent in Shrink Domain Decomposition, which is part of the separator function. This thesis tries to parallelize this function so that the performance increases and a better result is obtained. The function mainly deals with the multisector properties: it first eliminates all multisectors that are adjacent to only one domain, and then merges all indistinguishable multisectors according to their checksums, which are calculated from the degree of the multisector vertex under one of several score types, such as QMRDV (maximal relative decrease of variables in the quotient graph), QMD (minimum degree in the quotient graph) or QRAND (randomly generated degree); one simple possible form of such a checksum is sketched below.
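A sketch of such a checksum (illustrative only; PORD's actual checksum computation and the score types above differ in detail):

/* sum the representatives of u's neighbours so that indistinguishable
 * multisector vertices hash to the same bin */
int msec_checksum(int u, const int *xadj, const int *adjncy,
                  const int *rep, int nbins)
{
    long sum = 0;
    for (int i = xadj[u]; i < xadj[u+1]; i++)
        sum += rep[adjncy[i]];
    return (int)(sum % nbins);
}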

This shrinking of the domain decomposition proceeds until the number of domains in the remaining graph falls below the given minimum, or the number of edges falls below the number of vertices of the remaining graph. The function plays the same role as the coarsening phase of the multilevel method described above, which is used in many sparse matrix ordering codes. Parallelizing this shrinking method therefore increases the performance of the symbolic factorization of the matrix.



This gives us some possibility of parallelization. For the parallelization of the shrinking method of the defined approach, we first tried to construct it with MPI (Message Passing Interface) [13, 30, 36, 38, 40], which is used by much software to do sparse matrix computations in parallel. But MPI works only by passing messages between processors; it has no notion of global shared memory. The shrinking process checks the multisector and changes the representative of each multisector vertex according to its vertex type, checksum and indistinguishability. Whenever one process changes a representative or vertex type, every other process has to learn about it, and gathering all this information correctly takes so much time that the sequential code is far better than this parallel version.

To improve the performance of the parallel version, we have to exploit the possibility of providing some globally shared memory space for certain structures. For this, the Global Array toolkit (GA) [24, 25] has been used, which provides a portable Non-Uniform Memory Access (NUMA) shared memory programming environment in the context of distributed array data structures (called 'global arrays'). From the user's perspective, a global array can be used as if it were stored in shared memory. All details of the data distribution, addressing, and data access are encapsulated in the global array objects; information about the actual data distribution and locality can easily be obtained and taken advantage of whenever data locality is important. The primary target architectures for which GA was developed are massively parallel distributed-memory and scalable shared-memory systems.

GA divides logically shared data structures into 'local' and 'remote' portions, and recognizes the variable data transfer costs required to access the data depending on proximity: a local portion of the shared memory is assumed to be faster to access, and the remainder (the remote portion) is considered slower to access. In addition, any process can access its local portion of the shared data directly, in place, like any other data in process-local memory; access to other portions of the shared data must go through GA library calls. The Global Arrays library supports two programming styles: task-parallel and data-parallel. The GA task-parallel model of computation is based on explicit remote memory copies: a remote portion of shared data has to be copied into the local memory area of a process before that process can use it in computations, while the local portion of shared data can always be accessed directly, avoiding the memory copy. Data distribution and locality control are provided to the programmer; a minimal usage sketch follows.
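A minimal GA usage sketch (the GA calls themselves are the library's real API; the array name, size and the two helper functions are our illustrative choices, and GA_Initialize() must already have been called after MPI_Init()):

#include "ga.h"
#include "macdecls.h"

/* create a 1-D integer global array of nvtx entries */
int make_global_rep(int nvtx)
{
    int dims[1]  = {nvtx};
    int chunk[1] = {-1};                 /* -1: let GA choose the blocking */
    int g_rep = NGA_Create(C_INT, 1, dims, "rep", chunk);
    GA_Zero(g_rep);                      /* initialize every entry to zero */
    return g_rep;
}

/* copy the block [lo, hi] of the global array into a local buffer */
void fetch_block(int g_rep, int lo, int hi, int *buf)
{
    int ld = 1;                          /* leading dimension, unused for 1-D */
    NGA_Get(g_rep, &lo, &hi, buf, &ld);
}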



In this implementation, several functions have been reworked with the help of global arrays to improve the run time of the given ordering scheme. The sequential algorithm for merging the indistinguishable multisectors is given below.

/* ------------- merge indistinguishable multisecs ------------ */
for (k = 0; k < nlist; k++) {
  u = msvtxlist[k];
  if (vtype[u] == 2) {
    chk = checksum[u];
    v = bin[chk]; bin[chk] = -1;        /* examine all multisecs in bin[hash] */
    while (v != -1) {
      istart = xadj[v]; istop = xadj[v+1];
      for (i = istart; i < istop; i++)
        tmp[rep[adjncy[i]]] = flag;
      ulast = v; u = next[v];           /* v is principal and u is a potential */
      while (u != -1) {
        keepon = TRUE;
        if (key[u] != key[v])
          keepon = FALSE;
        if (keepon) {
          istart = xadj[u]; istop = xadj[u+1];
          for (i = istart; i < istop; i++)
            if (tmp[rep[adjncy[i]]] != flag) {
              keepon = FALSE; break;
            }
        }
        if (keepon) {                   /* found it! mark u as nonprincipal */
          rep[u] = v; vtype[u] = 4;
          u = next[u]; next[ulast] = u; /* remove u from bin */
        } else {                        /* failed */
          ulast = u; u = next[u];
        }
      }
      v = next[v];                      /* no more variables can be absorbed by v */
      flag++;                           /* clear tmp vector for next round */
    }
  }
}


Now GA function calls have simply been inserted after distributing the vertices among the processors according to their checksums.

/* ----- merge indistinguishable multisecs with global array functionality ----- */
for (k = 0; k < nlist; k++) {
    u = msvtxlist[k];
    if (vt[u] == 2) {
        chk = checksum[u];
        if ((chk >= lo) && (chk < hi+1)) {
            v = bin[chk]; bin[chk] = -1;    /* examine all multisecs in bin[hash] */
            while (v != -1) {
                istart = xadj[v]; istop = xadj[v+1];
                for (i = istart; i < istop; i++)
                    tmp[rp[adjncy[i]]] = flag;
                ulast = v; u = next[v];     /* v is principal and u is a potential */
                while (u != -1) {
                    keepon = TRUE;
                    if (key[u] != key[v])
                        keepon = FALSE;
                    if (keepon) {
                        istart = xadj[u]; istop = xadj[u+1];
                        for (i = istart; i < istop; i++)
                            if (tmp[rp[adjncy[i]]] != flag) {
                                keepon = FALSE; break;
                            }
                    }
                    if (keepon) {           /* found it! mark u as nonprincipal */
                        NGA_Put(rep, &u, &u, &v, &nnew);      /* publish new representative in the global array */
                        NGA_Put(vtype, &u, &u, &newv, &nnew); /* publish new vertex type */
                        rp[u] = v; vt[u] = 4;
                        u = next[u]; next[ulast] = u;         /* remove u from bin */
                    } else {                /* failed */
                        ulast = u; u = next[u];
                    }
                }
                v = next[v];                /* no more variables can be absorbed by v */
                flag++;                     /* clear tmp vector for next round */
            }
        }
    }
}


In the above function, 'lo' and 'hi' are the lower and upper indexes of the checksum values assigned to this processor within the given processor group. Two global arrays, 'vtype' and 'rep', have been defined; they give us the flexibility to restrict the function to fewer computations according to the checksum values. Every time a processor enters this piece of code, it fetches the values of these two global arrays and stores them in the two local arrays 'vt' and 'rp'.

So whenever a processor writes a value into one of these global arrays, the value of that array is updated immediately, independently of the other processors. Some data structures have also been made global to obtain better results. (Making a data structure global means that some of its data arrays have been incorporated into global arrays.)
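As a hedged sketch of this setup, the two global arrays might be created and fetched into the local copies 'vt' and 'rp' as follows (standard Global Arrays C API; the creation would in practice happen only once, and nvtx is a hypothetical vertex count):

#include "ga.h"
#include "macdecls.h"

/* Illustrative only: create the two global arrays and copy their
   current contents into process-local arrays vt and rp. */
void fetch_local_copies(int nvtx, int *vt, int *rp)
{
    int dims[1] = { nvtx };
    int lo = 0, hi = nvtx - 1, ld = nvtx;

    /* Create two one-dimensional integer global arrays. */
    int g_vtype = NGA_Create(C_INT, 1, dims, "vtype", NULL);
    int g_rep   = NGA_Create(C_INT, 1, dims, "rep",   NULL);

    GA_Sync();                       /* make all outstanding updates visible */

    /* Copy the current global contents into the local arrays. */
    NGA_Get(g_vtype, &lo, &hi, vt, &ld);
    NGA_Get(g_rep,   &lo, &hi, rp, &ld);
}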

3.4 Results and Discussion

The results of the proposed parallel implementation of the algorithm are given below. The proposed approach looks at the methods for computing elimination trees (in short, at the symbolic factorization phase) [19, 20, 27], at the runtime of the symbolic factorization phase, and at the effect of different choices of real symmetric sample matrices from the Harwell-Boeing matrix collection [23].

A complete analysis of the proposed parallel multilevel algorithm has to account for the communication overhead in each shrinking step and for the idling overhead that results from waiting for the different processors to deliver their results. This performance evaluation has been carried out on at most 8 processors.

The experiments also show that, by using more processors so that each one carries a smaller load, the runtime decreases as the number of processors increases, as can be seen in Figure 3.1. From this trend it can be inferred that the runtime can be improved further by increasing the number of processors.

Using more processors thus yields a noticeable reduction in runtime. From the parallel implementation it can also be inferred that, in some methods, the randomized computation performed by each processor sometimes yields an elimination tree with fewer fronts.


Figure 3.1: Changes in runtime vs number of processors

This provides a better ordering at the end of the parallel implementation. For one matrix (bcsstk17.rsa) the number of fronts is shown in Figure 3.2.

Figure 3.2 shows that increasing the number of processors reduces the number of fronts obtained at the end of the ordering step. If the number of processors is increased further, we can expect a better result with a lower runtime in the ordering phase. For globally defined structures, the synchronization of the global data across the whole parallel implementation is an important bottleneck. This synchronization takes relatively less time when each processor has a large amount of data to work on, which is only the case for large sparse matrices. In other words, our parallel implementation performs well in terms of runtime when the sparse matrix is large.

So, this part of the thesis shows that, with parallel computing, the proposed implementation obtains fewer fronts and a lower runtime when a high number of processors is used. Because of the better ordering, this scheme also benefits the computation steps that follow matrix ordering. The remaining reduction of the number of fronts is discussed in the next chapter of the thesis.


Figure 3.2: Number of fronts vs number of processors

If we obtain a better ordering in the ordering phase, then less time has to be invested in solving the resulting system, which reduces the total runtime of applications such as linear programming.


Chapter 4

Node Selection Strategies for the

Construction of Vertex Separators

PORD uses a node selection strategy in the construction of quotient graphs, which is the most time-consuming part of the algorithm. Generally, PORD uses the AMMF (approximate minimum mean fill) strategy to select the node to be eliminated.

Basically, the minimum local fill or minimum deficiency heuristic uses the exact amount of fill, rather than the upper bound used by minimum degree, to select a node for elimination [11, 21]. This approach is generally thought to provide limited quality advantages over minimum degree while requiring significantly higher runtime. The minimum local fill heuristic has received less attention in the literature than minimum degree, primarily because its runtime is prohibitive. To compute the fill that would result from the elimination of a node k, one has to determine which pairs of nodes in Adj(k) are already adjacent, and this is much more expensive than simply computing |Adj(k)|. To compound the problem, while the elimination of a node k can only affect the degrees of nodes in Adj(k), it can affect fill counts for both the nodes in Adj(k) and their neighbors. While many of the enhancements described for minimum degree are applicable to minimum local fill (particularly supernodes), run-times are still prohibitive. In recent years, however, some algorithms for minimum local fill have been developed that produce good orderings in a limited amount of runtime.


4.1 Node selection strategy of PORD

In the construction of the quotient graphs of the multilevel algorithm, $G^{i+1}$ is obtained from a quotient graph $G^i$ by eliminating a set of independent variables $U \subset V^i$. This coarsening scheme leads to the following interesting questions: What node selection strategy should be used to find $U$, and how does the strategy influence the construction of the separators?

Each separator $S^i = \{V_{p_1}, \ldots, V_{p_t}\}$ of $G^i$ induces a separator $S = V_{p_1} \cup \cdots \cup V_{p_t}$ of $G$. Thus, $S$ is composed of variables that belong to the boundaries of certain elements. Since our primary goal is to find a small (i.e. lightly weighted) separator $S$, the elements of a quotient graph should be merged so that the number and the weight of the variables that are adjacent to the newly formed elements is minimized. This corresponds exactly to the strategy of the minimum degree algorithm. Therefore, a suitable node selection strategy can be as follows: for each variable $V \in V^i$ compute its degree

$\deg(V) = \sum_{U \in M_V} \mathrm{weight}(U)$   (4.1)

with $M_V = \{V' \in V^i \setminus \{V\} : \exists D \in \mathcal{D}^i \text{ with } V, V' \in \mathrm{adj}_{G^i}(D)\}$. Sort the variables according to their degrees in ascending order and fill the independent set $U$ starting with the first one in that order.

In order to accelerate the degree computations, the above equation (4.1) is approximated by setting, for each element $D \in \mathcal{D}^i$,

$\deg(D) = \sum_{V' \in \mathrm{adj}_{G^i}(D)} \mathrm{weight}(V')$   (4.2)

A score function is also defined for the approximate minimum degree approach for the vertices in the quotient graph:

$\mathrm{score}_{QAMD}(V) = \sum_{D \in \mathrm{adj}_{G^i}(V)} (\deg(D) - \mathrm{weight}(V))$   (4.3)

This score function is called approximate-minimum-degree-in-quotient-graph (QAMD).

A more direct way to produce elements with light boundaries is to eliminate an independent set of heavily weighted variables. Unfortunately, this node selection strategy can lead to strong growth of only a few elements. Typically, a heavily weighted variable $V$ represents a large boundary segment shared by two large elements/domains. When removing $V$, the two elements are merged together with $V$ to form an even larger element/domain. This unbalanced growth of elements cripples our optimization algorithm. Therefore, we penalize the growth of large elements by relating the weight of $V$ to the weight of the newly formed element. This motivates a node selection strategy based on

$\mathrm{score}_{QMRDV}(V) = \frac{1}{\mathrm{weight}(V)} \cdot \sum_{D \in \mathrm{adj}_{G^i}(V)} \mathrm{weight}(D)$   (4.4)

This score function is called maximal-relative-decrease-of-variables-in-quotient-graph (QMRDV). It has been demonstrated that the QMRDV strategy is very effective in absorbing the vertices of a graph. A random elimination strategy also enables a fast coarsening of $G$.

4.2 Proposed approach for the node selection

This section describes several modifications to the minimum local fill algorithm that improve the quality of the computed orderings. Furthermore, an intuitive explanation of their effectiveness is given, together with an attempt at a more formal explanation.

To describe these new heuristics conveniently, ordering algorithms introduce a function $\mathrm{score}(K)$ that captures the cost of eliminating an uneliminated supernode $K$. The ordering algorithm always chooses a node with minimum score to eliminate next. In the case of minimum degree, $\mathrm{score}(K) = |Adj(K)|$; for minimum local fill, $\mathrm{score}(K) = |Fill(K)|$, where $Fill(K)$ is the set of edges that would be added if $K$ were eliminated.

In the approximate minimum mean fill (AMMF) algorithm the score function differs from that of standard minimum fill-in. Eliminating a supernode $K$ corresponds to $|K|$ single-node eliminations, so the average fill associated with each elimination is

$\mathrm{score}(K) = |Fill(K)| / |K|$   (4.5)

Since this node selection strategy is widely used in sparse matrix ordering algorithms, two enhancements of it are presented in this thesis. Both modify the minimum fill-in node selection strategy; the goal is to introduce some of the flavor of minimum local fill without also introducing its prohibitive cost.

4.2.1 First Modification

The first modification is motivated by the observation that eliminating a supernode $K$ corresponds to $|K|$ single-node eliminations. Instead of the average fill associated with each elimination, however, the score is used in a different manner: the fill is divided by a more slowly growing function of the supernode size. The first modified score function is

$\mathrm{score}(K) = |Fill(K)| / \log_2(|K| + 1)$   (4.6)

4.2.2 Second Modification

The second modification stems from the observation that an increasing number of nonzero entries makes the previous modification lose its advantage. For large inputs this thesis therefore introduces one more modification of AMMF. The previous and other modifications of MMF change the score function by a decreasing function of $|K|$; in the proposed approach, the second modified function is tailored to large numbers of non-zero entries. The second modification is defined below:

$\mathrm{score}(K) = |Fill(K)| / \exp(|K|)$   (4.7)

These two modifications decrease the number of fronts of the elimination tree obtained after the ordering step. The experiments in this thesis show that, as the number of non-zero entries in the input matrices increases, a better ordering is obtained by multiplying the MMF score function by a factor that decreases more rapidly with $|K|$.
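For concreteness, the three score variants can be written as small helper functions. The following is a minimal C sketch; fill stands for |Fill(K)| and k for |K|, both assumed to be supplied by the surrounding ordering code:

#include <math.h>

/* Hedged sketch of the score functions discussed above. */
double score_ammf(double fill, double k)   /* equation (4.5) */
{
    return fill / k;
}

double score_mod1(double fill, double k)   /* equation (4.6) */
{
    return fill / log2(k + 1.0);
}

double score_mod2(double fill, double k)   /* equation (4.7); exp(k) grows
                                              very fast, so this variant
                                              strongly favors large supernodes */
{
    return fill / exp(k);
}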

4.3 Results and Discussion

To evaluate the effectiveness of these scoring functions, this thesis looks at ordering quality over a set of more than 40 sparse symmetric matrices from the Harwell-Boeing sparse matrix test set [23]. To reduce the effect of tie-breaking strategies, all nonzero and operation counts are obtained by ordering each matrix several times (randomly permuting the rows and columns before each ordering) and taking the median. For the minimum fill variants, the algorithm takes the median over three permutations, while for the less costly approximate minimum fill variants it takes the median over eleven permutations.

The effect of the first modification of the node selection strategy is shown in Table 4.1. The table lists the number of nonzero entries in the input matrix A and the number of fronts in the lower triangle of L required to perform the subsequent factorization, before and after applying the first modification of the node selection strategy.

Table 4.1: nfronts after first modification

Matrix     NZ in A   nfronts in AMMF   nfronts in first modification
bcsstk13   42945     612               604
bcsstk15   60882     1303              1251
bcsstk16   147631    728               690
bcsstk17   219812    2595              2567
bcsstk21   15100     2661              2643
bcsstk24   81736     425               415
bcsstk25   133840    7820              7801
bcsstk33   300       1270              1197
bcsstm27   28675     140               135
dwt 992    8868      284               280
bcsstk08   7017      830               828
bcsstk09   9760      459               457
bcsstk11   17857     408               405

This modification gives a better result in the number of fronts, which is quite a striking factor for the subsequent ordering. For some matrices with more than 100000 non-zero entries, however, this work has found that the second modification gives a far better result than the first. The effect of the second modification is shown in Table 4.2. The table lists the number of nonzero entries in the input matrix A and the number of fronts in the lower triangle of L required to perform the subsequent factorization after applying the second modification of the node selection strategy.

These tables demonstrate that a better ordering is obtained as the multiplying factor is strengthened, but the factor should be a decreasing function of $|K|$. Recall that maintaining exact fill information requires updating the scores of the neighbors of the eliminated nodes as well.


Table 4.2: nfronts after second modification

Matrix     NZ in A   nfronts in AMMF   nfronts in first mod.   nfronts in second mod.
bcsstk18   80519     7843              7845                    7681
bcsstk19   3835      489               489                     482
bcsstk23   24156     1579              1588                    1530
bcsstk25   133840    7820              7801                    7488
bcsstk29   316740    3139              3171                    2980
eris1176   9864      903               903                     883
bcspwr03   297       115               115                     110
bcspwr04   943       246               246                     240
bcspwr05   1033      433               433                     425
bcspwr06   3377      1431              1431                    1412
bcspwr07   3718      1584              1584                    1566
bcspwr08   3837      1594              1594                    1577
bcspwr09   4117      1686              1686                    1671
bcspwr10   13571     4979              4979                    4939

Since the approximate fill variants only update the scores of nodes adjacent to eliminated nodes, computing exact fill on these nodes gives an upper bound on the improvement that can be obtained by refining our approximations.

This thesis holds that the main problem with minimum fill-in is that the cliques created in the elimination process often have non-smooth boundaries, and it has been shown that non-smooth boundaries can lead to asymptotically suboptimal orderings. Note that AMMF generates significantly smaller cliques than AMD for a given number of interior nodes, which means that its cliques have smoother boundaries.

The alternative scoring functions produce smoother clique boundaries because of the way they form large cliques. Recall that the approximate fill scoring functions are more willing to select nodes that already belong to large cliques. As a result, these variants tend to grow large cliques into larger ones. In contrast, AMD forms large cliques by merging smaller ones. The AMMF approach exhibits significant local growth in clique sizes; in contrast, clique sizes in AMD grow more smoothly.

We find that our modified strategy generally grows a clique further than AMF does. This is understandable, since growing a clique often creates supernodes within the current clique. Since these supernodes have reduced scores in our strategy, the clique continues to grow. Apparently, growing larger cliques than those grown by AMF is beneficial. We also experimented with scoring functions that encouraged cliques to continue growing beyond the point where they would stop with our strategies.

Looking at the clique growth patterns of the exact fill variants, we have found that they actually grow cliques less than the approximate fill variants do. Clearly, they use a different mechanism to compute good orderings. We believe that one important property captured by the exact fill scores is clique alignment.

To summarize, this thesis conjectures that AMF is more effective than AMD because the process of growing cliques creates smoother clique boundaries than the process of merging smaller cliques. AMMF is even more effective because it allows the clique-growing process to continue longer. MF is more effective still because cliques must eventually be merged and exact fill scores capture some notion of clique alignment, which leads to smoother clique boundaries. This reasoning lends further support to our modifications of the node selection strategy.


Chapter 5

Software, tools and the configuration

5.1 Some Information About the Software Used

In the implementation, some libraries first have to be built for the configuration of the linear solver MUMPS. Some information is provided here to give a better feel for these libraries.

5.1.1 Blas

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard build-

ing blocks for performing basic vector and matrix operations. The Level 1 BLAS performs

scalar, vector and vector-vector operations, where the Level 2 BLAS performs matrix-

vector operations, and the Level 3 BLAS performs matrix-matrix operations. Because

the BLAS are efficient, portable, and widely available, they are commonly used in the

development of high quality linear algebra software. LAPACK is an example.
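As a brief illustration of the three levels, a minimal sketch using the C interface to the BLAS (CBLAS); the small vectors and matrices here are hypothetical examples:

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    double A[9] = {1, 0, 0,  0, 1, 0,  0, 0, 1};   /* 3x3 identity, row-major */
    double B[9] = {1, 2, 3,  4, 5, 6,  7, 8, 9};
    double C[9] = {0};

    /* Level 1: vector-vector (dot product) */
    double d = cblas_ddot(3, x, 1, y, 1);

    /* Level 2: matrix-vector (y := 1.0*A*x + 0.0*y) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 3, 3, 1.0, A, 3, x, 1, 0.0, y, 1);

    /* Level 3: matrix-matrix (C := 1.0*A*B + 0.0*C) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3);

    printf("dot = %g, C[0] = %g\n", d, C[0]);
    return 0;
}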

5.1.2 Blacs

The BLACS (Basic Linear Algebra Communication Subprograms) project is an effort to create a linear-algebra-oriented message passing interface that may be implemented efficiently and uniformly across a large range of distributed memory platforms. The length of time required to implement efficient distributed memory algorithms makes it impractical to rewrite programs for every new parallel machine. The BLACS exist in order to make linear algebra applications both easier to program and more portable.

5.1.3 Scalapack

The ScaLAPACK (or Scalable LAPACK) library includes a subset of LAPACK routines

redesigned for distributed memory MIMD parallel computers. It is currently written in

a Single-Program-Multiple-Data style using explicit message passing for interprocessor

communication. It assumes matrices are laid out in a two-dimensional block cyclic

decomposition. ScaLAPACK is designed for heterogeneous computing and is portable to any computer that supports MPI or PVM. Like LAPACK, the ScaLAPACK routines

are based on block-partitioned algorithms in order to minimize the frequency of data

movement between different levels of the memory hierarchy. For such machines, the

memory hierarchy includes the off-processor memory of other processors, in addition to

the hierarchy of registers, cache, and local memory on each processor. The fundamental

building blocks of the ScaLAPACK library are distributed memory versions (PBLAS)

of the Level 1, 2 and 3 BLAS, and a set of BLACS for communication tasks that arise

frequently in parallel linear algebra computations. In the ScaLAPACK routines, all

interprocessor communication occurs within the PBLAS and the BLACS. One of the

design goals of ScaLAPACK is to have the ScaLAPACK routines resemble their LAPACK

equivalents as much as possible.

5.1.4 MPICH

MPICH is a freely available implementation of the MPI standard that runs on a wide variety of systems. The MPICH implementation provides tools that simplify the creation of MPI executables. Because MPICH programs may require special libraries and compile options, the commands that MPICH provides for compiling and linking programs have to be used. When MPICH is configured, the installation process normally looks for a Fortran 90 compiler and, if it finds one, builds two different versions of the MPI module: one includes only the MPI routines that do not take 'choice' arguments, while the other includes all MPI routines.

The relevant information about MPI and the Global Array toolkit is given in the appendices of the thesis.


5.1.5 MUMPS

The solution of large sparse linear systems lies at the heart of most calculations in

computational science and engineering and is of increasing importance in computations

in the financial and business sectors. Today, systems of equations with more than one

million unknowns need to be solved. To solve such large systems in a reasonable time

requires the use of powerful parallel computers. To date, only limited software for such

systems has been generally available. The MUMPS software addresses this issue.

The original MUMPS package was only designed for real matrices but, in the new

version, complex symmetric and complex unsymmetric systems are permitted. If there

is sufficient demand, a version for complex Hermitian systems might be developed in

the future. The MUMPS software is written in Fortran 90. It requires MPI for message passing and makes use of BLAS, LAPACK, BLACS, and ScaLAPACK subroutines.

However, in recognition that some users prefer the C programming environment, a C

interface has been developed for the new release, and a version has been written that

avoids the use of MPI, BLACS, and ScaLAPACK. This would be suitable for running in

a single processor environment, perhaps for testing and development purposes.

5.2 How to use the system

We have used 8 distributed parallel processors to run our implemented code. First, a Fortran compiler was installed on all the machines. Then MPICH was built with the help of the Fortran compiler and the gcc compiler. The ssh protocol is used for parallel communication with MPI, and we have checked that the sample MPI programs work on these machines.

Next, some libraries are built on top of the MPI interface; these are then used in building MUMPS.

• BLAS is built with the help of the Fortran compiler.

• BLACS is built with the MPICH installed on every computer.

• ScaLAPACK is built with MPICH and the BLAS and BLACS libraries.

After this we installed MUMPS (which is used to solve the systems of linear equations) with the help of the above three libraries, the Fortran compiler and MPICH.


Figure 5.1: Dependencies of the software

In the multifrontal method, the last step of the factorization phase consists of the factorization of a dense matrix. MUMPS uses ScaLAPACK for this final node. Unfortunately, ScaLAPACK does not offer the possibility of computing the inertia of a dense matrix (in fact, it does not offer the possibility of performing an $LDL^T$ factorization of a dense matrix either, so ScaLAPACK LU is used on that final node).

We need to decide at configuration time which system (sequential or parallel) we want to use, because the sequential and parallel libraries cannot coexist in the same application: they have the same interface. We must therefore decide at compilation time which library to install. If the parallel version is installed after the sequential version (or vice versa), be sure to run 'make clean' in between. If we plan to run MUMPS sequentially from a parallel MPI application, we need to install the parallel version of MUMPS and pass a communicator containing a single processor to the MUMPS library. The reason for this behavior is that the sequential MUMPS uses a special library, libmpiseq.a, instead of the true MPI, BLACS and ScaLAPACK. As this library implements all the symbols needed by MUMPS in a sequential environment, it cannot coexist with MPI/BLACS/ScaLAPACK.


Chapter 6

Conclusion and Scope for Future

Work

6.1 Conclusion

The ordering phase in solving systems of linear equations has become increasingly interesting and prominent over the last three decades, and many algorithms have been proposed to obtain a better ordering in a reasonable amount of time. PORD has become popular because of its hybrid scheme and tighter coupling, so we have tried to improve its ordering with some modifications.

In the first part of the present work, a new parallel two-way scheme for the construction of vertex separators has been presented. The fundamental idea of our algorithm is to generate, in parallel, a sequence of quotient graphs using a bottom-up node selection strategy. Our computational experiments indicate that small vertex separators can be obtained by this approach. Once all vertex separators have been obtained, we use them as a skeleton for the computation of several bottom-up orderings. The motivation is that the recursion in the nested dissection algorithm offers many possibilities for merging well-aligned elements into new well-aligned elements. We feel that the exploration of other merging strategies can lead to further improvements. By performing the recursion in parallel, this thesis tries to obtain better results.

In the second part of the present work, several simple modifications to the minimum local fill ordering heuristic have been presented; they exploit readily available information about node adjacencies to improve the fill bounds used to select a node for elimination. Perhaps the most practical of these modifications, called AMMF, reduces floating-point operation counts very significantly. This thesis improves the ordering step by reducing the number of fronts through modifications of the AMMF node selection strategy. This part of the thesis has contributed two modifications to the current implementation of PORD, which give better results in terms of the number of fronts. Our computational experiments indicate that, as the number of non-zero entries in the input matrix for symbolic factorization increases, the function with the stronger multiplying factor works better than the usual AMMF strategy.

These two approaches give a better ordering together with a parallel implementation of the ordering scheme. We have parallelized some of the most frequently executed functions of the scheme, which keeps the communication overhead low. We have defined global arrays only for certain important variables because of synchronization issues; this gives a partial parallelization of the ordering scheme. We have tested this implementation with different choices of global array variables and kept whichever choice gave the better result. Many candidate functions for the modification of the node selection strategy have been tried in this implementation; out of these, two good modifications have been selected, which improve the ordering to a great extent in a limited amount of time.

The capabilities of the present methodologies are demonstrated using various ex-

amples in the respective chapters.

6.2 Scope of future work

Further research should focus on improving the speedup of the parallel elimination tree

computation, since this part of the algorithm limits the speedup of the ordering scheme.

We give some suggestions for further research in the parallel implementation of the

ordering step.

In this thesis, all processors have full knowledge of the elimination tree after computing it in parallel. However, not all processors use all of that information when performing the symbolic factorization. An improvement of the proposed algorithm could be to distribute the elimination tree over the processors on a need-to-know basis. A reduction of the communication volume could be obtained, and performance could hence be improved.

Another improvement could be reached by improving the move of the first nonzero of each column in a row block. These are now moved based on parent information, which leads to suboptimal moves. Better moves would require more information on larger ancestors and would thus require more communication, but an improvement may well be possible.

We could use an interface other than MPI that has a built-in notion of global shared memory, but this is only useful when the rest of the computation is also carried out in that environment. Creating and destroying any global array has to be done by all processors at the same time, which slows down the algorithm; an alternative method should be devised to solve this problem.

One more improvement can be made in the node selection strategy. We saw that, by strengthening the multiplying factor, better results are obtained for matrices with a higher number of non-zero elements. With some further computation, we could predict at what number of non-zero elements each strategy gives the better result, and what the dependencies of that prediction are.


References

[1] A. George. Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal., 10(2):345–363, 1973.

[2] A. George and J. W. H. Liu. An automatic nested dissection algorithm for irregular finite element problems. SIAM J. Numer. Anal., 15(5):1053–1069, 1978.

[3] A. George and J. W. H. Liu. The evolution of the minimum degree ordering algorithm. SIAM J. Numer. Anal., 31(1):1–19, 1989.

[4] A. Gupta. Fast and effective algorithms for graph partitioning and sparse-matrix ordering. IBM Journal of Research and Development, 41(1/2), 1997.

[5] B. Hendrickson and E. Rothberg. Effective sparse matrix ordering: just around the bend. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing.

[6] C. Ashcraft. Compressed graphs and the minimum degree algorithm. SIAM J. Matrix Anal. Appl., 16:1404–1411, 1995.

[7] C. Ashcraft and J. W. H. Liu. Robust ordering of sparse matrices using multisection. SIAM J. Matrix Anal. Appl., 19:816–832, 1998.

[8] C. Ashcraft and J. W. H. Liu. A partition improvement algorithm for generalized nested dissection. Techn. Rep. BCSTECH-94-020, Boeing Computer Services, Seattle, 1994.

[9] C. Ashcraft and J. W. H. Liu. Generalized nested dissection: some recent progress. Minisymposium, 5th SIAM Conference on Applied Linear Algebra, Snowbird, Utah, 1994.

[10] C. Ashcraft and J. W. H. Liu. Using domain decomposition to find graph bisectors. BIT, 37:506–534, 1997.

[11] C. Meszaros. The inexact minimum local fill-in ordering algorithm. Techn. Report WP 95-7, Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, 1995.

[12] D. J. Rose. A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations. In Graph Theory and Computing, R. Read, ed., Academic Press, New York, pages 183–217, 1972.

[13] D. Walker. The design of a standard message-passing interface for distributed memory concurrent computers. Parallel Computing, 20(4):657–673, 1994.

[14] E. Rothberg. Robust ordering of sparse matrices: a minimum degree, nested dissection hybrid. Silicon Graphics manuscript, 1995.

[15] E. Rothberg and S. C. Eisenstat. Node selection strategies for bottom-up sparse matrix ordering. SIAM J. Matrix Anal. Appl., 19(3):682–695, 1998.

[16] E. G. Ng and P. Raghavan. Performance of greedy ordering heuristics for sparse Cholesky factorization. SIAM J. Matrix Anal. Appl., 20(4):902–914, 1999.

[17] G. Karypis and V. Kumar. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. Technical Report 95-036, University of Minnesota, Department of Computer Science / Army HPC Research Center, Minneapolis, MN 55455, 1998.

[18] G. Richard. Coupling MUMPS and ordering software. CERFACS report WN/PA/02/24, 2002.

[19] H. M. Markowitz. The elimination form of the inverse and its application to linear programming. Management Sci., 3:255–269, 1957.

[20] H. L. Bodlaender, J. R. Gilbert, H. Hafsteinsson, and T. Kloks. Approximating treewidth, pathwidth and minimum elimination tree height. Technical Report RUU-CS-91-1, 1991.

[21] I. A. Cavers. Using deficiency measure for tie-breaking the minimum degree algorithm. Tech. Report 89-2, Department of Computer Science, University of British Columbia, Vancouver, B.C., 1989.

[22] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, Oxford, 1987.

[23] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection. Technical Report TR/PA/92/86, Res. and Techn. Division, Boeing Computer Services, Seattle, 1992.

[24] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global Arrays: a portable shared-memory programming model for distributed memory computers. Pacific Northwest Laboratory, Richland, WA 99352, 1994.

[25] J. Nieplocha, J. Ju, M. K. Krishnan, B. Palmer, and V. Tipparaju. Global Array toolkit. Technical Report PNNL-13130, Pacific Northwest National Laboratory, 2002.

[26] J. Schulze. Towards a tighter coupling of bottom-up and top-down sparse matrix ordering methods. BIT Numerical Mathematics, 41(4):800–841, 2001.

[27] J. van Grondelle. Symbolic sparse Cholesky factorisation using elimination trees. Master's Thesis, Department of Mathematics, Utrecht University, 1999.

[28] J. W. H. Liu. Modification of the minimum-degree algorithm by multiple elimination. ACM Trans. Math. Software, 11(2):141–153, 1985.

[29] N. I. M. Gould, Y. Hu, and J. A. Scott. A numerical evaluation of sparse direct solvers for the solution of large sparse, symmetric linear systems of equations. Council for the Central Laboratory of the Research Councils, 2005.

[30] N. MacDonald, E. Minty, T. Harding, and S. Brown. Writing message-passing parallel programs with MPI. Edinburgh Parallel Computing Centre, The University of Edinburgh.

[31] M. Joshi, G. Karypis, V. Kumar, A. Gupta, and F. Gustavson. PSPASES: building a high performance scalable parallel direct solver for sparse linear systems. 2003.

[32] M. S. Khaira, G. L. Miller, and T. J. Sheffler. Nested dissection: a survey and comparison of various nested dissection algorithms. Technical Report CMU-CS-92-106R, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1992.

[33] M. T. Heath and P. Raghavan. A Cartesian parallel nested dissection algorithm. Department of Computer Science and National Center for Supercomputing Applications, University of Illinois.

[34] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl., 17:886–905, 1996.

[35] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet. Hybrid scheduling for the parallel solution of linear systems. Technical Report TR/PA/04/140, 2004.

[36] P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster. Multifrontal massively parallel solver, users' guide. 2003.

[37] P. Berman and G. Schnitger. On the performance of the minimum degree algorithm for Gaussian elimination. SIAM J. Matrix Anal. Appl., 11:83–88, 1990.

[38] P. S. Pacheco. A User's Guide to MPI. Department of Mathematics, University of San Francisco, 1998.

[39] T.-Y. Chen, J. R. Gilbert, and S. Toledo. Toward an efficient column minimum degree code for symmetric multiprocessors. 2003.

[40] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.

[41] W. F. Tinney and J. W. Walker. Direct solutions of sparse network equations by optimally ordered triangular factorization. Proc. of the IEEE, 55:1801–1809, 1967.


Appendix A

Symbolic Factorization and

Elimination Tree

A.1 Cholesky Factorisation

Cholesky factorisation is a technique for solving linear systems $Ax = b$ where $A$ is a positive definite symmetric matrix. These computations appear frequently in, for instance, the interior point method; this iterative alternative to the simplex method is used for linear programming, a widely used optimization technique.

Definition 1.1.1 (Cholesky Factorisation) Given a symmetric positive definite matrix $A$, the Cholesky factor $L$ is the lower triangular matrix that satisfies

$LL^T = A$   (A.1)

After factoring $A$, we can first solve $Ly = b$ and then $L^T x = y$. Because the upper and lower triangular systems are easy to solve, Cholesky factorisation is a convenient way of solving a symmetric linear system $Ax = b$.
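As a minimal illustration of these two triangular solves, the following C sketch performs forward substitution for $Ly = b$ followed by backward substitution for $L^T x = y$; it assumes a dense, row-major $L$ and is meant only to show the idea:

/* Solve A x = b given the Cholesky factor L (dense, n-by-n, row-major). */
void cholesky_solve(int n, const double L[n][n], const double *b, double *x)
{
    double y[n];

    /* Forward substitution: L y = b */
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= L[i][j] * y[j];
        y[i] = s / L[i][i];
    }

    /* Backward substitution: L^T x = y */
    for (int i = n - 1; i >= 0; i--) {
        double s = y[i];
        for (int j = i + 1; j < n; j++)
            s -= L[j][i] * x[j];   /* (L^T)_{ij} = L_{ji} */
        x[i] = s / L[i][i];
    }
}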

A.2 Numerical Factorisation

We can calculate such a matrix $L$ as follows. If we assume that equation (A.1) holds, then

$a_{ij} = \sum_{k=0}^{n-1} L_{ik} L^T_{kj} = \sum_{k=0}^{j} l_{ik} l_{jk} = \sum_{k=0}^{j-1} l_{ik} l_{jk} + l_{ij} l_{jj}$   (A.2)

where $0 \le j \le i < n$. For $i = j$ this leads to

$l_{jj} = \left( a_{jj} - \sum_{k=0}^{j-1} l_{jk}^2 \right)^{1/2}$   (A.3)

and for all $0 \le j < i < n$ to

$l_{ij} = \frac{1}{l_{jj}} \left( a_{ij} - \sum_{k=0}^{j-1} l_{ik} l_{jk} \right)$   (A.4)

Algorithm 1.1 is a right-looking algorithm, based on equations (A.3) and (A.4). It is formulated in terms of dense matrices and is called right-looking because it adds a column to all the columns it should be added to, which are all on its right. A left-looking algorithm takes a column and adds to it all the columns that should be added to it; these columns are all on its left.

Algorithm 1.1 A dense numerical Cholesky factorisation algorithm

Input: $A = \mathrm{lower}(A_0)$
Output: $A$, $A = L$ such that $LL^T = A_0$

for k := 0 to n−1 do
    a_kk := sqrt(a_kk)
    for i := k+1 to n−1 do
        a_ik := a_ik / a_kk
    for j := k+1 to n−1 do
        for i := j to n−1 do
            a_ij := a_ij − a_ik · a_jk
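A direct C rendering of Algorithm 1.1 might look as follows; it is a sketch for dense matrices only, with the lower triangle of $A_0$ stored row-major and overwritten by $L$:

#include <math.h>

/* Dense right-looking Cholesky: on entry the lower triangle of a
   holds A0, on exit it holds L such that L L^T = A0. */
void dense_cholesky(int n, double a[n][n])
{
    for (int k = 0; k < n; k++) {
        a[k][k] = sqrt(a[k][k]);          /* scale the pivot */
        for (int i = k + 1; i < n; i++)
            a[i][k] /= a[k][k];           /* form column k of L */
        for (int j = k + 1; j < n; j++)   /* right-looking update of columns to the right */
            for (int i = j; i < n; i++)
                a[i][j] -= a[i][k] * a[j][k];
    }
}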

A.3 Symbolic Factorisation

Throughout this thesis, we will be factoring large sparse matrices. Sparse matrices have many zero coefficients; in general, at most twenty percent of the entries of a sparse matrix will be non-zeros. Taking advantage of the sparsity of matrices allows us to compute Cholesky factors using far fewer floating point operations (flops) than we would need when factoring dense matrices of the same size. We can also store sparse matrices using less memory than their dense counterparts would use.

When factoring these matrices, we will see that Cholesky factors are often much denser than the original matrices. These new non-zeros, or fill-in, are generated, for instance, by adding two columns with different nonzero positions. Because we use data structures that only store the non-zeros, it is useful to know the structure of the Cholesky factor before factoring it; then we can reserve space in our data structure for the fill-in.

Symbolic factorisation determines the structure of the Cholesky factor. Because we are not interested in the numerical values of the entries in the factor, this can be done much faster than a full numerical factorisation.

A.4 Algorithms for symbolic factorisation

In the previous section we mentioned symbolic factorisation. In this section we deduce a fast symbolic algorithm from the numerical algorithm given there.

A.4.1 A graph representation of symbolic matrices

When dealing with symbolic factorisation, the algorithm can be formulated

conveniently in the language of graph theory.

Definition 1.4.1 An $n \times n$ matrix $A$ induces the graph $G_A = (V_A, E_A)$ where $V_A = \{0, \ldots, n-1\}$ and $E_A = \{(i, j) : 0 \le i, j < n \wedge a_{ij} \neq 0\}$.

$E_A$ is called the set of edges and $V_A$ the set of vertices. Because $A$ is assumed symmetric, the graph need not be directed: if $(i, j) \in E_A$, then $(j, i) \in E_A$. And because $A$ is positive definite, all the diagonal elements of $A$ are positive. As each such entry corresponds to a vertex pointing to itself, these elements are generally omitted. Therefore we rewrite the definition of $G_A$ for symmetric positive definite matrices $A$.

Definition 1.4.2 A symmetric positive definite $n \times n$ matrix $A$ induces the graph $G_A = (V_A, E_A)$ where $V_A = \{0, \ldots, n-1\}$ and $E_A = \{(i, j) : 0 \le i, j < n \wedge a_{ij} \neq 0 \wedge i \neq j\}$.

Figure A.1: Graph induced by the sparse matrix

A.4.2 A basic algorithm

Now we will transform Algorithm 1.1 into a symbolic factorisation algorithm. To do this, we simply remove from the algorithm all operations that do not introduce new non-zeros or destroy existing non-zeros.

There are basically three operations in Algorithm 1.1: a square root computation, a division by $a_{kk}$ and the actual column addition. The first two operations clearly do not introduce new non-zeros, nor do they destroy them, so the only operation we have to implement in symbolic factorisation is the column addition. Algorithm 1.2 implements this operation in graph notation.

Algorithm 1.2 A basic symbolic Cholesky factorisation algorithm

Input: $G_A = (V_A, E_A)$
Output: $G = (V, E)$ where $G = G_{L+L^T}$ with $LL^T = A$

for k := 0 to n−1 do
    for all j : k < j < n ∧ (j, k) ∈ E do
        for all i : j < i < n ∧ (i, k) ∈ E do
            E := E ∪ {(i, j)}

This algorithm has approximately the same complexity as the numerical factorisation, $O(nc^2)$, where $c$ is the average number of non-zeros in each row. We can reduce this runtime significantly, as we will show in the next section.


A.4.3 Fast symbolic Cholesky factorisation

In this section we will introduce a symbolic factorisation algorithm that is a factor $c$ faster than the basic symbolic Cholesky factorisation algorithm. We will follow the treatment above, but first we need a definition:

Definition 1.4.3 (Parent) Every column is said to have a parent column. The parent of column $k$ is defined as

$\mathrm{parent}(k) = \min\{i : k < i < n \wedge l_{ik} \neq 0\} = \min\{i : (i, k) \in E\}$

Furthermore, $\mathrm{parent}(k) = \infty$ if this minimum does not exist. This implies that $\forall i \in \{0, \ldots, n-1\} : i < \mathrm{parent}(i)$.

The proof is trivial, since the parent of column $i$ was defined as the row index of the first non-zero below the diagonal in column $i$; clearly this index is greater than $i$. Applying this fact repeatedly at each stage of the calculation reduces the runtime considerably. Algorithm 1.3 implements this technique.

Algorithm 1.3 The fast symbolic Cholesky factorisation algorithm

Input: $G_A = (V_A, E_A)$
Output: $G = (V, E)$ where $G = G_{L+L^T}$ with $LL^T = A$

for k := 0 to n−1 do
    parent(k) := min {i : k < i < n ∧ (i, k) ∈ E}
    for all i : k < i < n ∧ (i, k) ∈ E do
        E := E ∪ {(i, parent(k))}

In each of the $n$ steps, a column of $c$ elements is added to its parent; therefore this algorithm has a runtime complexity of $O(nc)$. This is a factor $c$ faster than the basic algorithm from the previous subsection. From now on, when we refer to symbolic Cholesky factorisation, we mean factorisation by Algorithm 1.3.
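A hedged C sketch of Algorithm 1.3 follows; it uses a dense boolean adjacency structure for readability (a real code would use compressed sparse columns), where e[i][k] is true iff $l_{ik}$ may be nonzero for $i > k$:

#include <stdbool.h>

/* Fast symbolic Cholesky factorisation on a dense boolean pattern.
   parent[] receives the elimination-tree parent of each column,
   with n used as a sentinel for "no parent". Illustration only. */
void fast_symbolic_cholesky(int n, bool e[n][n], int parent[n])
{
    for (int k = 0; k < n; k++) {
        parent[k] = n;                       /* "infinity" sentinel */
        for (int i = k + 1; i < n; i++)
            if (e[i][k]) { parent[k] = i; break; }
        if (parent[k] == n) continue;        /* column k has no parent */
        for (int i = k + 1; i < n; i++)
            if (e[i][k] && i != parent[k])
                e[i][parent[k]] = true;      /* E := E ∪ {(i, parent(k))} */
    }
}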


A.5 Elimination tree

In the previous section, we saw that there exists a parent-child relation between columns.

At the end of the sequential factorisation algorithm, the parent of each column is known.

In this section, we will take a closer look at these relations and try to compute them in

advance.

Definition 1.4.4 (Elimination forest) The elimination forest associated with the Cholesky factor $L$ is the directed graph $G = (V, E)$ with $V = \{0, \ldots, n-1\}$ containing all column numbers of $L$, and

$E = \{(i, j) \in V \times V : i = \mathrm{parent}(j)\}$

Figure A.2: Note that the forest happens to contain only one tree.

We will assume that the elimination forest contains only one tree and refer to that tree as the elimination tree. This assumption is reasonable since, if necessary, a simple preprocessing step divides the matrix into submatrices that each have a unique elimination tree. Apart from the parent relation, the elimination tree contains a more general relation between columns.


Appendix B

Elimination Graph and Quotient

Graph

B.1 Elimination graphs

The nonzero pattern of a symmetric $n \times n$ matrix $A$ can be represented by a graph $G^0 = (V^0, E^0)$ with nodes $V^0 = \{1, \ldots, n\}$ and edges $E^0$. An edge $(i, j)$ is in $E^0$ if and only if $a_{ij} \neq 0$ and $i \neq j$. Since $A$ is symmetric, $G^0$ is undirected.

The elimination graph $G^k = (V^k, E^k)$ describes the nonzero pattern of the submatrix still to be factorized after the first $k$ pivots have been chosen and eliminated. It is undirected, since the matrix remains symmetric as it is factorized. At step $k$, the graph $G^k$ depends on $G^{k-1}$ and the selection of the $k$th pivot. To find $G^k$, the $k$th pivot node $p$ is selected from $V^{k-1}$. Edges are added to $E^{k-1}$ to make the nodes adjacent to $p$ in $G^{k-1}$ a clique (a fully connected subgraph). This addition of edges (fill-in) means that we cannot know the storage requirements in advance. The edges added correspond to fill-in caused by the $k$th step of factorization: a fill-in is a nonzero entry $L_{ij}$ where $(PAP^T)_{ij}$ is zero. The pivot node $p$ and its incident edges are then removed from the graph $G^{k-1}$ to yield the graph $G^k$. Let $Adj_{G^k}(i)$ denote the set of nodes adjacent to $i$ in the graph $G^k$. When the $k$th pivot is eliminated, the graph $G^k$ is given by

$V^k = V^{k-1} \setminus \{p\}$

and

$E^k = \left( E^{k-1} \cup (Adj_{G^{k-1}}(p) \times Adj_{G^{k-1}}(p)) \right) \cap (V^k \times V^k)$


The minimum degree algorithm selects node $p$ as the $k$th pivot such that the degree of $p$, $t_p \equiv |Adj_{G^{k-1}}(p)|$, is minimum (where $|\cdot|$ denotes the size of a set or the number of nonzeros in a matrix, depending on the context). The minimum degree algorithm is a non-optimal greedy heuristic for reducing the number of new edges (fill-ins) introduced during the factorization. We have already noted that finding the optimal solution is NP-complete. By minimizing the degree, the algorithm minimizes the upper bound on the fill-in caused by the $k$th pivot: selecting $p$ as pivot creates at most $(t_p^2 - t_p)/2$ new edges in $G$.

B.2 Quotient graphs

In contrast to the elimination graph, the quotient graph models the factorization of $A$ using an amount of storage that never exceeds the storage for the original graph $G^0$. The quotient graph is also referred to as the generalized element model. An important component of a quotient graph is a clique. It is a particularly economical structure, since a clique is represented by a list of its members rather than by a list of all the edges in the clique. Following the generalized element model, we refer to nodes removed from the elimination graph as elements (George and Liu refer to them as eliminated nodes). We use the term variable to refer to un-eliminated nodes.

The quotient graph $\mathcal{G}^k = (V^k, \overline{V}^k, E^k, \overline{E}^k)$ implicitly represents the elimination graph $G^k$, where $\mathcal{G}^0 = G^0$, $V^0 = V$, $\overline{V}^0 = \emptyset$, $E^0 = E$ and $\overline{E}^0 = \emptyset$. For clarity, we drop the superscript $k$ in the following. The nodes in $\mathcal{G}$ consist of variables (the set $V$) and elements (the set $\overline{V}$). The edges are divided into two sets: edges between variables, $E \subseteq V \times V$, and edges between variables and elements, $\overline{E} \subseteq V \times \overline{V}$. Edges between elements are not required, since we could generate the elimination graph from the quotient graph without them. The sets $\overline{V}^0$ and $\overline{E}^0$ are empty.

We use the following set notation ($A$, $\varepsilon$ and $£$) to describe the quotient graph model and our approximate degree bounds. Let $A_i$ be the set of variables adjacent to variable $i$ in $\mathcal{G}$, and let $\varepsilon_i$ be the set of elements adjacent to variable $i$ in $\mathcal{G}$ (we refer to $\varepsilon_i$ as element list $i$). That is, if $i$ is a variable in $V$, then

$A_i \equiv \{j : (i, j) \in E\} \subseteq V,$

$\varepsilon_i \equiv \{e : (i, e) \in \overline{E}\} \subseteq \overline{V},$

and

$Adj_{\mathcal{G}}(i) \equiv A_i \cup \varepsilon_i \subseteq V \cup \overline{V}$

The set $A_i$ refers to a subset of the nonzero entries in row $i$ of the original matrix $A$ (thus the notation $A$). That is, $A_i^0 \equiv \{j : a_{ij} \neq 0\}$, and $A_i^k \subseteq A_i^{k-1}$ for $1 \le k \le n$. Let $£_e$ denote the set of variables adjacent to element $e$ in $\mathcal{G}$. That is, if $e$ is an element in $\overline{V}$, then we define

$£_e \equiv Adj_{\mathcal{G}}(e) = \{i : (i, e) \in \overline{E}\} \subseteq V.$

The edges $E$ and $\overline{E}$ in the quotient graph are represented using the sets $A_i$ and $\varepsilon_i$ for each variable in $\mathcal{G}$, and the sets $£_e$ for each element in $\mathcal{G}$. We will use $A$, $\varepsilon$, and $£$ to denote three sets containing all $A_i$, $\varepsilon_i$, and $£_e$, respectively, for all variables $i$ and all elements $e$. George and Liu show that the quotient graph takes no more storage than the original graph ($|A^k| + |\varepsilon^k| + |£^k| \le |A'|$ for all $k$).

The quotient graph $\mathcal{G}$ and the elimination graph $G$ are closely related. If $i$ is a variable in $\mathcal{G}$, it is also a variable in $G$, and

$Adj_G(i) = \left( A_i \cup \bigcup_{e \in \varepsilon_i} £_e \right) \setminus \{i\},$

where $\setminus$ is the standard set subtraction operator. When variable $p$ is selected as the $k$th pivot, element $p$ is formed (variable $p$ is removed from $V$ and added to $\overline{V}$). The set $£_p = Adj_G(p)$ is found using the above equation. The set $£_p$ represents a permuted nonzero pattern of the $k$th column of $L$ (thus the notation $£$). If $i \in £_p$, where $p$ is the $k$th pivot, and variable $i$ will become the $m$th pivot (for some $m > k$), then the entry $L_{mk}$ will be nonzero.

The above equation implies that $£_e \setminus \{p\} \subseteq £_p$ for all elements $e$ adjacent to variable $p$. This means that all variables adjacent to an element $e \in \varepsilon_p$ are adjacent to the element $p$, and these elements $e \in \varepsilon_p$ are no longer needed. They are absorbed into the new element $p$ and deleted, and reference to them is replaced by reference to the new element $p$. The new element $p$ is added to the element lists $\varepsilon_i$ for all variables $i$ adjacent to element $p$. Absorbed elements, $e \in \varepsilon_p$, are removed from all element lists.

The sets $A_p$ and $\varepsilon_p$, and $£_e$ for all $e$ in $\varepsilon_p$, are deleted. Finally, any entry $j$ in $A_i$, where both $i$ and $j$ are in $£_p$, is redundant and is deleted. The set $A_i$ is thus disjoint from any set $£_e$ for $e \in \varepsilon_i$. In other words, $A_i^k$ is the pattern of those entries in row $i$ of $A$ that are not modified by steps 1 through $k$ of the Cholesky factorization of $PAP^T$. The net result is that the new graph $\mathcal{G}$ takes the same, or less, storage than before the $k$th pivot was selected.
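As a hedged sketch of how $Adj_G(i)$ can be recovered from the quotient graph using the equation above, the following C function performs the set union with a mark array; the array-of-lists representation and all parameter names are illustrative, not taken from an actual quotient-graph code:

/* Compute Adj_G(i) = (A_i ∪ ⋃_{e ∈ ε_i} £_e) \ {i} and return its size.
   mark[] is scratch: mark[v] != tag means v has not been seen yet. */
int quotient_adjacency(int i,
                       const int *A_i,   int nA,    /* A_i and its size   */
                       const int *eps_i, int nE,    /* ε_i and its size   */
                       int **L, const int *nL,      /* £_e lists and sizes */
                       int *mark, int tag,
                       int *out)                    /* receives Adj_G(i)  */
{
    int deg = 0;
    mark[i] = tag;                        /* exclude i itself */
    for (int k = 0; k < nA; k++)          /* take variables from A_i */
        if (mark[A_i[k]] != tag) {
            mark[A_i[k]] = tag;
            out[deg++] = A_i[k];
        }
    for (int k = 0; k < nE; k++) {        /* union in £_e for each e in ε_i */
        int e = eps_i[k];
        for (int j = 0; j < nL[e]; j++)
            if (mark[L[e][j]] != tag) {
                mark[L[e][j]] = tag;
                out[deg++] = L[e][j];
            }
    }
    return deg;                           /* |Adj_G(i)| */
}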

The following equations summarize how the sets ℒ, ε, and 𝒜 change when pivot p is chosen and eliminated. The new element p is added, old elements are absorbed, and redundant entries are deleted:

    ℒ^k = ( ℒ^{k−1} \ ⋃_{e ∈ ε_p} {ℒ_e} ) ∪ {ℒ_p}

    ε^k = ( ε^{k−1} \ ⋃_{e ∈ ε_p} {e} ) ∪ {p}

    𝒜^k = ( 𝒜^{k−1} \ (ℒ_p × ℒ_p) ) ∩ (V^k × V^k)

In the last equation, intersecting with V^k × V^k discards the entries incident to the eliminated pivot p, since p has left the variable set.
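To make the update concrete, consider a small made-up instance (not one of the matrices in Figure B.1): a fully dense 3 × 3 symmetric matrix with V = {1, 2, 3}, 𝒜_1 = {2, 3}, 𝒜_2 = {1, 3}, 𝒜_3 = {1, 2}, and all element lists empty. Choosing pivot p = 1 gives

    ℒ_1 = Adj_G(1) = 𝒜_1 \ {1} = {2, 3},

so L_21 and L_31 are nonzero. Element 1 is appended to the element lists ε_2 and ε_3, the pivot's sets 𝒜_1 and ε_1 are deleted, and the entries 3 ∈ 𝒜_2 and 2 ∈ 𝒜_3 become redundant (both 2 and 3 lie in ℒ_1) and are removed. After the step, 𝒜_2 = 𝒜_3 = ∅, ε_2 = ε_3 = {1}, and ℒ^1 = {ℒ_1}: the entire structure of the reduced graph is carried by the single element 1.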

Figure B.1: Elimination graph, quotient graph, and matrix for first three steps


Appendix C

Message Passing Interface

C.1 Message Passing Interface

The Message-Passing Interface, or MPI, is a library of functions and macros that can be used in C, Fortran, and C++ programs. As its name implies, MPI is intended for programs that exploit the existence of multiple processors through message passing. Message passing is a programming paradigm used widely on parallel computers, especially Scalable Parallel Computers (SPCs) with distributed memory, and on Networks of Workstations (NOWs). Although there are many variations, the basic concept of processes communicating through messages is well understood.

The major goal of MPI, as with most standards, is a degree of portability across

different machines. The expectation is for a degree of portability comparable to that

given by programming languages such as Fortran. This means that the same message-

passing source code can be executed on a variety of machines as long as the MPI library

is available, while some tuning might be needed to take best advantage of the features of

each system. Though message passing is often thought of in the context of distributed-memory parallel computers, the same code can run well on a shared-memory parallel computer. Knowing that efficient MPI implementations exist across a wide variety of computers gives a high degree of flexibility in code development, debugging, and in choosing a platform for production runs.

The goal of the Message Passing Interface, simply stated, is to develop a widely used

standard for writing message-passing programs. As such the interface should establish a

practical, portable, efficient and flexible standard for message passing.


MPI provides three types of communication routines:

C.1.1 Point to Point Communication Routines

MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task performs a send operation and the other performs a matching receive operation. Different types of send and receive routines are used for different purposes. For example:

• Synchronous send

• Blocking send / blocking receive

• Non-blocking send / non-blocking receive

• Buffered send

• Combined send/receive

• Ready send

Any type of send routine can be paired with any type of receive routine. MPI also

provides several routines associated with send-receive operations, such as those used to

wait for a message’s arrival or probe to find out if a message has arrived.
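As a minimal illustration of the blocking send/receive pairing, the sketch below has rank 0 send a single integer to rank 1 with MPI_Send/MPI_Recv; the tag value 0 and the payload are arbitrary choices for this example, and the program must be run with at least two processes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                       /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Blocking receive: matches the sender's rank and tag. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }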

C.1.2 Collective Communication Routines

Collective communication must involve all processes in the scope of a communicator. All processes are, by default, members of the communicator MPI_COMM_WORLD. It is the programmer's responsibility to ensure that all processes within a communicator participate in any collective operation. Three types of collective operations exist in the Message Passing Interface:

• Synchronization - processes wait until all members of the group have reached the

synchronization point.

• Data Movement - broadcast, scatter/gather, all to all.


• Collective Computation (reductions) - one member of the group collects data from

the other members and performs an operation (min, max, add, multiply, etc.) on

that data.

Collective operations are blocking, and collective communication routines do not take message tag arguments. Collective operations within subsets of processes are accomplished by first partitioning the processes into new groups and then attaching the new groups to new communicators.
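The sketch below combines two of the categories above: a broadcast (data movement) followed by a sum reduction (collective computation). The seed value 10 and the per-rank contribution are arbitrary illustrative choices.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, nprocs, seed, sum;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Data movement: rank 0 broadcasts a value to every process. */
        seed = (rank == 0) ? 10 : 0;
        MPI_Bcast(&seed, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Collective computation: sum (seed + rank) over all ranks,
           depositing the result on rank 0. */
        int local = seed + rank;
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %d\n", nprocs, sum);

        MPI_Finalize();
        return 0;
    }

Note that every process in MPI_COMM_WORLD calls both collectives, as required; there is no tag argument.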

C.1.3 Group and Communicator Management Routines

A group is an ordered set of processes. Each process in a group is associated with a

unique integer rank. Rank values start at zero and go to N-1, where N is the number

of processes in the group. In MPI, a group is represented within system memory as an

object. It is accessible to the programmer only by a handle. A group is always associated

with a communicator object.

A communicator, in turn, encompasses a group of processes that may communicate with each other. All MPI messages must specify a communicator; in the simplest sense, the communicator is an extra tag that must be included with MPI calls. Like groups, communicators are represented within system memory as objects and are accessible to the programmer only by handles. For example, the handle for the communicator that comprises all tasks is MPI_COMM_WORLD. From the programmer's perspective, a group and a communicator are effectively one; the group routines are primarily used to specify which processes should be used to construct a communicator.
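The typical sequence is sketched below: extract the group of MPI_COMM_WORLD, select a subset of ranks, and build a new communicator from the resulting group. The choice of even-numbered ranks is arbitrary and only serves the illustration.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        int rank, nprocs;
        MPI_Group world_group, even_group;
        MPI_Comm  even_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Collect the even-numbered ranks (an arbitrary subset). */
        int neven = (nprocs + 1) / 2;
        int *ranks = malloc(neven * sizeof(int));
        for (int i = 0; i < neven; i++)
            ranks[i] = 2 * i;

        /* Group routines select the processes; the communicator is
           then constructed from the group. */
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);
        MPI_Group_incl(world_group, neven, ranks, &even_group);
        MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

        if (even_comm != MPI_COMM_NULL) {
            int newrank;
            MPI_Comm_rank(even_comm, &newrank);
            printf("world rank %d has rank %d in even_comm\n", rank, newrank);
            MPI_Comm_free(&even_comm);
        }
        MPI_Group_free(&even_group);
        MPI_Group_free(&world_group);
        free(ranks);
        MPI_Finalize();
        return 0;
    }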


Appendix D

Global Array Toolkit

Portability, efficiency and ease of coding are all important considerations in choosing

the programming model for a scalable parallel application. The message-passing pro-

gramming model is widely used because of its portability, but some applications are

too complex to code in it while also trying to maintain a balanced computation load

and avoid redundant computations. The shared-memory programming model simplifies

coding, but it is not portable and often provides little control over interprocessor data

transfer costs.

Global Arrays (GA) combines the better features of both models, leading to both simple coding and efficient execution. The key concept of GA is that it provides a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes.

The Global Arrays (GA) toolkit provides a shared memory style programming

environment in the context of distributed array data structures (called global arrays).

From the user perspective, a global array can be used as if it were stored in shared memory.

All details of the data distribution, addressing, and data access are encapsulated in

the global array objects. Information about the actual data distribution and locality

can be easily obtained and taken advantage of whenever data locality is important.

The primary target architectures for which GA was developed are massively-parallel

distributed-memory and scalable shared-memory systems.

GA divides logically shared data structures into local and remote portions. It recog-

nizes variable data transfer costs required to access the data depending on the proximity

attributes. A local portion of the shared memory is assumed to be faster to access, and the remainder (the remote portion) is considered slower to access. These differences do not hinder ease of use, since the library provides uniform access mechanisms for all shared data regardless of where the referenced data is located. In addition, any process can access the local portion of the shared data directly, in place, like any other data in process-local memory. Access to other portions of the shared data must go through GA library calls.

Figure D.1: Structure of Global Array Toolkit

GA was designed to complement rather than replace the message-passing model, and it allows the user to combine shared-memory and message-passing styles of programming in the same program. GA inherits its execution environment (processes, file descriptors, etc.) from the message-passing library that started the parallel program.

The basic shared memory operations supported include get, put, scatter and gather.

They are complemented by atomic read-and-increment, accumulate (reduction operation

that combines data in local memory with data in the shared memory location), and lock

operations. However, these operations can only be used to access data in global arrays

rather than arbitrary memory locations. At least one global array has to be created before

data transfer operations can be used. These GA operations are truly one-sided/unilateral

and will complete regardless of actions taken by the remote process(es) that own(s) the

referenced data. In particular, GA does not offer or rely on a polling operation or

require inserting any other GA library calls to assure communication progress on the

remote side.
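A minimal sketch of this create/put/get pattern is shown below, assuming GA's C bindings (ga.h, macdecls.h) layered on MPI. The array name, dimensions, and the memory sizes passed to MA_init are arbitrary illustrative values, and error checking is omitted; initialization details vary between GA versions.

    #include <mpi.h>
    #include <stdio.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);           /* GA runs on top of MPI          */
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000); /* MA stack/heap sizes: arbitrary */

        /* Create a 100 x 100 double-precision global array; a NULL
           chunk argument lets GA choose the data distribution. */
        int dims[2] = {100, 100};
        int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL);

        /* One-sided put: process 0 writes a 2 x 2 block; no action is
           required from the processes that own the patch. */
        if (GA_Nodeid() == 0) {
            double buf[4] = {1.0, 2.0, 3.0, 4.0};
            int lo[2] = {0, 0}, hi[2] = {1, 1}, ld[1] = {2};
            NGA_Put(g_a, lo, hi, buf, ld);
        }
        GA_Sync();                        /* make the block visible to all */

        /* One-sided get: every process reads the same block back. */
        double out[4];
        int lo[2] = {0, 0}, hi[2] = {1, 1}, ld[1] = {2};
        NGA_Get(g_a, lo, hi, out, ld);
        printf("process %d read %g\n", GA_Nodeid(), out[3]);

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }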
