


A Low Complexity Scaling Method for the Lanczos Kernel in Fixed-Point Arithmetic

Juan Luis Jerez, Student Member, IEEE, George A. Constantinides, Senior Member, IEEE, and Eric C. Kerrigan, Member, IEEE

Abstract—We consider the problem of enabling fixed-point implementation of linear algebra kernels on low-cost embedded systems, as well as motivating more efficient computational architectures for scientific applications. Fixed-point arithmetic presents additional design challenges compared to floating-point arithmetic, such as having to bound peak values of variables and control their dynamic ranges. Algorithms for solving linear equations or finding eigenvalues are typically nonlinear and iterative, making solving these design challenges a nontrivial task. For these types of algorithms, the bounding problem cannot be automated by current tools. We focus on the Lanczos iteration, the heart of well-known methods such as conjugate gradient and minimum residual. We show how one can modify the algorithm with a low-complexity scaling procedure to allow us to apply standard linear algebra to derive tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behavior of fixed-point implementations of the modified problem can be chosen to be at least as good as a floating-point implementation, if necessary. The approach is evaluated on field-programmable gate array (FPGA) platforms, highlighting orders of magnitude potential performance and efficiency improvements by moving from floating-point to fixed-point computation.

Index Terms—Computer arithmetic, computations on matrices, numerical algorithms, design aids

1 INTRODUCTION

THE Lanczos iteration [1] is the key building block in modern iterative numerical methods for computing eigenvalues or solving systems of linear equations involving symmetric matrices. These methods are typically used in scientific computing applications, for example when solving large sparse linear systems of equations arising from the discretization of partial differential equations (PDEs) [2]. In this context, iterative methods are preferred over direct methods, because they can be easily parallelized and they can better exploit the sparsity in the problem to reduce computation and, perhaps more importantly, memory requirements [3]. However, these methods are also interesting for small- and medium-scale problems arising in real-time embedded applications. In the embedded domain, on top of the advantages previously mentioned, iterative methods allow one to trade off computation time for accuracy in the solution and enable the possibility of terminating the method early to meet real-time deadlines. An example application is real-time optimal decision making with applications in, for instance, advanced control systems [4].

In both cases more efficient forms of computation, in the form of new computational architectures and algorithms that allow for more efficient architectures, could enable new applications in many areas of science and engineering. In high-performance computing (HPC), power consumption is quickly becoming the key limiting factor for building the next generation of computing machines [5]. In embedded computing, cost, power consumption, computation time and size constraints often limit the complexity of the algorithms that can be implemented, thereby limiting the capabilities of the embedded solution.

Porting floating-point algorithm implementations to fixed-point arithmetic is an effective way to address these limitations. Because fixed-point numbers do not require mantissa alignment, the circuitry is significantly simpler and faster. The smaller delay in arithmetic operations leads to lower latency computation and shorter pipelines. The smaller resource requirements either result in more performance through parallelism for a given silicon budget, or a reduction in silicon area and lower power consumption and cost. This latter observation is especially important, since the cost of chip manufacturing increases at least quadratically with silicon area [6]. It is for this reason that fixed-point architectures are ubiquitous in high-volume low-cost embedded platforms, hence any new solution based on increasingly sophisticated algorithms must be able to run on fixed-point architectures to achieve high-volume adoption. In the HPC domain, heterogeneous architectures that incorporate fixed-point processing could help to lessen the effects of the power wall, which is the major hurdle on the road to exascale computing [7].


• J.L. Jerez and G.A. Constantinides are with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. E-mail: [email protected].

• E.C. Kerrigan is with the Department of Aeronautics and the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K.

Manuscript received 15 Jan. 2013; revised 23 July 2013; accepted 29 July 2013. Date of publication 14 Aug. 2013; date of current version 16 Jan. 2015. Recommended for acceptance by P. Eles. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TC.2013.162


However, while fixed-point arithmetic is widespread for simple digital signal processing (DSP) operations, it is typically assumed that floating-point arithmetic is necessary for solving general linear systems or general eigenvalue problems, due to the potentially large dynamic range in the data and consequently in the algorithm variables. Furthermore, the Lanczos iteration is known to be sensitive to numerical errors [8], hence moving to fixed-point arithmetic could potentially worsen the problem and lead to unreliable numerical behaviour.

To be able to take advantage of the simplicity of fixed-point circuitry and reduce cost, power and computation time, the complexity burden shifts to the algorithm design process [9]. In order to have a reliable fixed-point implementation one has to be able to establish bounds on all variables of the algorithm to avoid online shifting, which would negate any speed advantages, and avoid overflow errors. In addition, the bounds should be of the same order to minimize loss of precision when using constant wordlengths. There are several tools in the design automation community for handling this task [10]. However, because the Lanczos iteration is a nonlinear iterative algorithm, all state-of-the-art bounding tools fail to provide practical bounds. Unfortunately, most linear algebra kernels (except extremely simple operations) are of this type and they suffer from the same problem.

1.1 Summary of Contribution

We propose a novel scaling procedure first introduced by us in [11] to tackle the fixed-point bounding problem for the nonlinear and recursive Lanczos kernel. The procedure gives tight bounds for all variables of the Lanczos process regardless of the properties of the original matrix, while minimizing the computational overhead. The proof, based on linear algebra, makes use of the fact that the scaled matrix has all eigenvalues inside the unit circle. This kind of analysis is currently well beyond the capabilities of state-of-the-art automatic methods [12]. We then discuss the validity of the bounds under finite precision arithmetic and give simple guidelines to be used with existing error analysis [8] to ensure that the absence of overflow is maintained under inexact computation. We also show how the same scaling approach can be used for bounding variables in other nonlinear recursive linear algebra kernels based on matrix-vector multiplication, such as the Arnoldi iteration [13]—a generalization of the Lanczos kernel for non-symmetric matrices.

The potential efficiency improvements of the proposed approach are evaluated on an FPGA platform. While Moore's law has continued to promote FPGAs to a level where it has become possible to provide substantial acceleration over microprocessors by directly implementing floating-point linear algebra kernels [14]-[17], floating-point operations remain expensive to implement, mainly because there is no hard support in the FPGA fabric to facilitate the normalisation and denormalisation operations required before and after every floating-point addition or subtraction. This observation has led to the development of tools aimed towards fusing entire floating-point datapaths, reducing this overhead [18], [19]. However, the hard integer support built into the FPGA fabric means that the FPGA is a good example platform where the efficiency gap between fixed-point and floating-point computation is very large, which is why they are used for evaluation in this paper, even though the presented results are equally applicable for implementing these algorithms on embedded microcontrollers and fixed-point DSPs.

To exploit the architecture flexibility in an FPGA we present a parameterizable architecture generator where the user can tune the level of parallelization and the data type of each signal. This generator is embedded in a design automation tool that selects the best architecture parameters to minimize latency, while satisfying the accuracy specifications of the application and the FPGA resources available. Using this tool we show that it is possible to get sustained FPGA performance very close to the peak theoretical general-purpose graphics processing unit (GPGPU) performance when solving a single Lanczos problem to equivalent accuracy. If there are multiple independent problems to solve simultaneously it is possible to exceed the peak floating-point performance of a GPGPU. If one considers the power consumption of both devices, the fixed-point Lanczos solver on the FPGA is more than an order of magnitude more efficient than the peak GPGPU efficiency. The test data are obtained from a benchmark set of equations coming from a Boeing 747 optimal controller.

1.2 Outline

The remainder of the paper is organised as follows: Section 2 describes the Lanczos algorithm; Section 3 presents our scaling procedure and contains the analysis to guarantee the absence of overflow in the Lanczos process; Section 4 presents numerical results showing that the numerical quality of the linear equation solution does not suffer by moving to fixed-point arithmetic; in Section 5 we introduce an FPGA design automation tool that generates minimum latency architectures given accuracy specifications and resource constraints; in Section 6 this tool is used to evaluate the potential relative performance improvement between fixed-point and floating-point FPGA implementations and perform an absolute performance comparison against the peak performance of a high-end GPGPU; Section 7 discusses the possibility of extending this methodology to other nonlinear recursive kernels based on matrix-vector multiplication. Section 8 concludes the paper.

Algorithm 1. Lanczos Algorithm.

Require: Initial iterate v_0 such that ||v_0||_2 = 1, q_0 := 0, and β_0 := 1.

1: for i := 1 to i_max do

2:   q_i := v_{i-1} / β_{i-1}

3:   z_i := A q_i

4:   α_i := q_i^T z_i

5:   v_i := z_i − α_i q_i − β_{i-1} q_{i-1}

6:   β_i := √(v_i^T v_i)

7: end for

8: return α_i, β_i and q_i
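To make the recurrence concrete, the following is a minimal floating-point NumPy sketch of Algorithm 1 as reconstructed above. The function and variable names are ours; in Section 3 the same routine is applied to the scaled matrix.

```python
import numpy as np

def lanczos(A, v0, num_iters):
    """Floating-point sketch of Algorithm 1 (three-term Lanczos recurrence).

    A  -- symmetric N x N matrix (in Section 3 this is the scaled matrix).
    v0 -- initial vector with unit 2-norm.
    Returns the tridiagonal coefficients (alpha, beta) and the Lanczos vectors.
    """
    N = A.shape[0]
    q_prev = np.zeros(N)                        # q_0 := 0
    v = v0.copy()                               # v_0, with ||v_0||_2 = 1
    beta_prev = 1.0                             # beta_0 := 1
    alphas, betas, Q = [], [], []
    for _ in range(num_iters):
        q = v / beta_prev                       # Line 2: normalisation
        z = A @ q                               # Line 3: matrix-vector product
        alpha = q @ z                           # Line 4: diagonal coefficient
        v = z - alpha * q - beta_prev * q_prev  # Line 5: three-term recurrence
        beta = np.sqrt(v @ v)                   # Line 6: off-diagonal coefficient
        alphas.append(alpha)
        betas.append(beta)
        Q.append(q)
        q_prev, beta_prev = q, beta
        if beta == 0.0:                         # exact invariant subspace found
            break
    return np.array(alphas), np.array(betas), np.column_stack(Q)
```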


2 THE LANCZOS KERNEL

The Lanczos algorithm [1] transforms a symmetric matrix A into a tridiagonal matrix T (only the diagonal and off-diagonals are non-zero) with similar spectral properties as A using an orthogonal transformation matrix Q. The method is described in Algorithm 1, where q_i is the i-th column of matrix Q. At every iteration the approximation is refined such that

A Q_i = Q_i T_i + β_i q_{i+1} e_i^T,    (1)

where Q_i := [q_1 ⋯ q_i] and T_i is the i × i tridiagonal matrix with diagonal entries α_1, …, α_i and off-diagonal entries β_1, …, β_{i-1}. The tridiagonal matrix is easier to operate on than the original matrix. It can be used to extract the eigenvalues and singular values of A [20], or to solve systems of linear equations of the form Ax = b using the conjugate gradient (CG) [21] method when A is positive definite or the minimum residual (MINRES) [22] method when A is indefinite. The Arnoldi iteration, a generalisation of Lanczos for non-symmetric matrices, is used in the generalized minimum residual (GMRES) method for general matrices and is discussed in Section 7. The Lanczos (and Arnoldi) algorithms account for the majority of the computation in these methods—they are the key building blocks in modern iterative algorithms for solving formulations of linear systems appearing in all kinds of applications.

Methods involving the Lanczos iteration are typically used for large sparse problems arising in scientific computing where direct methods, such as LU and Cholesky factorization, cannot be used due to prohibitive memory requirements [23]. However, iterative methods have additional properties that also make them good candidates for small problems arising in real-time applications, since they allow one to trade off accuracy for computation time [24].

3 FIXED-POINT ANALYSIS

A floating-point number consists of a sign bit, an exponent and a mantissa. The exponent value moves the binary point with respect to the mantissa. The dynamic range—the ratio of the largest to smallest representable number—grows doubly exponentially with the number of exponent bits, therefore it is possible to represent a wide range of numbers with a relatively small number of bits. However, because two operands can have different exponents it is necessary to perform denormalisation and normalisation operations before and after every addition or subtraction, leading to greater resource usage and longer delays. In contrast, fixed-point numbers have a fixed number of bits for the integer and fraction fields, i.e. the exponent does not change with time and it does not have to be stored. Fixed-point hardware is the same as for integer arithmetic, hence the circuitry is simple and fast, but the representable dynamic range is limited. However, if the dynamic range in the data is also limited and fixed, a 32-bit fixed-point processor can provide more precision than a 32-bit floating-point processor because there are no bits wasted for the exponent.
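As a small illustration of this precision argument, the snippet below compares the representation error of a value with known small magnitude under single-precision floating point and under a 32-bit fixed-point format with 2 integer and 30 fraction bits. The quantize helper is a hypothetical model using round-to-nearest; the hardware analysed later in the paper uses truncation instead.

```python
import numpy as np

def quantize(x, frac_bits):
    # Hypothetical model of a fixed-point grid with frac_bits fraction bits
    # (round-to-nearest; Section 3.2 analyses truncation instead).
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

x = 0.123456789
err_float32 = abs(float(np.float32(x)) - x)   # ~24 significand bits, wide range
err_fixed30 = abs(quantize(x, 30) - x)        # 30 fraction bits, range [-2, 2)
print(err_float32, err_fixed30)               # the fixed-point error is smaller
```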

There are several challenges that need to be addressed before implementing an application in fixed-point. Firstly, one should determine the worst-case peak values for every variable in order to avoid overflow errors. For linear time-invariant (LTI) algorithms it is possible to use discrete-time system theory to put tight analytical bounds on worst-case peak values [25]. A linear algebra operation that meets such requirements is matrix-vector multiplication, where the input is a vector within a given range and the matrix does not change over time. For some nonlinear non-recursive algorithms interval arithmetic [26] can be used to propagate data ranges forward through the computation graph [27]. Often this approach can be overly pessimistic for non-trivial graphs because it cannot take into account the correlation between variables.

Linear algebra kernels for solving systems of equations, finding eigenvalues or performing singular value decomposition are nonlinear and recursive. The Lanczos iteration belongs to this class. For this type of computation the bounds given by interval arithmetic quickly blow up, rendering the information useless. Table 1 highlights the limitations of the state-of-the-art bounding tool Gappa [12]—a tool based on interval arithmetic—for handling the bounding problem for one iteration of Algorithm 1. Even when only one iteration is considered, the bounds quickly become impractical as the problem size grows, because the tool cannot use any extra information in the matrix beyond bounds on the individual coefficients. Other recent tools [28] that can take into account the correlation between input variables can help to tighten the single iteration bounds, but there is still a significant amount of conservatism. More complex tools [29] that can take into account additional prior information on the input variables can further improve the tightness of the bounds. However, as shown in Table 1, the complexity of the procedure limits its usefulness to very small problems. In addition, the bounds given by all these tools will grow further for more than one iteration. As a consequence, these types of algorithms are typically implemented using floating-point arithmetic because the absence of overflow errors cannot be guaranteed, in general, with a practical number of fixed-point bits for practical problems.

Despite the acknowledged difficulties there have been several fixed-point implementations of nonlinear recursive linear algebra algorithms. CG-like algorithms were implemented in

TABLE 1: Bounds on z Computed by State-of-the-Art Bounding Tools [12], [28], Given Bounds on q and A

The tool described in [29] can also use additional prior information on the inputs. Note that q has unit norm, hence its elements lie in [−1, 1], and A can be trivially scaled such that all its coefficients are in the given range. '-' indicates that the tool failed to prove any competitive bound. Our analysis will show that when all the eigenvalues of the matrix have magnitude smaller than one, the bound of one holds independently of the problem size for all iterations. *No other smaller bound could be proved in less than two days.


[30], [31], whereas the Lanczos algorithm was implemented in [32]. Bounds on variables were established through simulation-based studies and adding a heuristic safety factor. In the targeted DSP applications, the types of problems that have to be processed do not change significantly over time, hence this approach might be satisfactory, especially if the application is not safety critical. In other applications, such as in optimization solvers, the range of linear algebra problems that need to be solved on the same hardware is so varied that it is not possible to assign wordlengths based on simulation in a practical manner. Importantly, in safety-critical applications analytical guarantees are desirable, since overflow errors can lead to catastrophic behaviour of the system [33].

3.1 A Scaling Procedure for Bounding Variables

We propose the use of a diagonal scaling matrix D to redefine the problem in a new co-ordinate system to allow us to control the bounds in all variables, such that the same fixed precision arithmetic can efficiently handle problems with a wide range of matrices. For example, if we want to solve the symmetric system of linear equations Ax = b, where A is a non-singular symmetric matrix, we propose instead to solve the problem

Â x̂ = b̂,

where Â := D A D and b̂ := D b, and the elements of the diagonal matrix D are chosen as

d_ii := (Σ_j |a_ij|)^(−1/2),    (2)

to ensure the absence of overflow in a fixed-point implementation. The solution to the original problem can be recovered easily through the transformation x = D x̂.
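A minimal sketch of the transformation is given below. The explicit formula for the diagonal entries is our reading of the row-sum normalisation used in the proof of Lemma 1 (the scaled matrix D²A has unit absolute row sums); the function and variable names are ours.

```python
import numpy as np

def scale_problem(A, b):
    """Diagonal scaling of Section 3.1: A_hat = D A D, b_hat = D b,
    with d_ii = (sum_j |a_ij|)^(-1/2), so that |D^2 A| has unit row sums."""
    d = 1.0 / np.sqrt(np.abs(A).sum(axis=1))
    A_hat = d[:, None] * A * d[None, :]   # D A D without forming D explicitly
    b_hat = d * b                         # D b
    return A_hat, b_hat, d

# After solving A_hat x_hat = b_hat, the original solution is recovered as
#   x = D x_hat, i.e. x = d * x_hat.
```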

An important point is that the scaling procedure and the recovery of the solution still have to be computed using floating-point arithmetic, due to the potentially large dynamic range and unboundedness of the problem data. However, since the scaling matrix is diagonal, the cost of these operations is comparable to the cost of one iteration of the Lanczos algorithm. Since many iterations are typically required, most of the computation is still carried out in fixed-point arithmetic.

In order to illustrate the need for the scaling procedure, Fig. 1 shows the evolution of the range of values of the variable computed in Line 4 of Algorithm 1 throughout the solution of one optimization problem from the benchmark set described in Section 4. Notice that a different Lanczos problem has to be solved at each iteration of the optimization solver. Since the range of Lanczos problems that have to be solved on the same hardware is so diverse, without using the scaling matrix (2) it is not possible to decide on a fixed data format that can represent numbers efficiently for all problems. Based on the simulation results, with no scaling one would need to allocate 22 bits for the integer part to be able to represent the largest value occurring in this benchmark set. Furthermore, using this number of bits would not guarantee that overflow will not occur on a different set of problems. The situation is similar for all other variables in the algorithm.

Instead, when using the scaling matrix (2) we have the following results:

Lemma 1. The scaled matrix Â := DAD, with D as in (2), has, for any non-singular symmetric matrix A, spectral radius ρ(Â) ≤ 1.

Proof. Let R_i be the absolute sum of the off-diagonal elements in row i of a matrix, and let G_i be the associated Gershgorin disc with the diagonal element as centre and radius R_i. Consider an alternative non-symmetric preconditioned matrix D²A. The absolute row sum is equal to 1 for every row of D²A, hence the Gershgorin discs associated with this matrix have centres d_ii² a_ii and radii 1 − |d_ii² a_ii|. It is straightforward to show that these discs always lie inside the interval between 1 and −1 when |d_ii² a_ii| ≤ 1, which is the case here. Hence, ρ(D²A) ≤ 1 according to Gershgorin's circle theorem [23, Theorem 7.2.1]. Now, for an arbitrary eigenvalue-eigenvector pair (λ, v) of D²A we have D²A v = λ v; substituting v = D w gives

D A D w = λ w,    (5)

where (5) is obtained by substituting D w for v. This shows that the eigenvalues of the non-symmetric preconditioned matrix D²A and the symmetric preconditioned matrix DAD are the same. The eigenvectors are different but this does not affect the bounds, which we derive next. ◽
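A quick empirical check of Lemma 1 (not part of the paper) is easy to run: for random symmetric matrices the scaled matrix DAD never has an eigenvalue of magnitude larger than one.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    M = rng.standard_normal((8, 8))
    A = M + M.T                               # random symmetric (non-singular w.h.p.)
    d = 1.0 / np.sqrt(np.abs(A).sum(axis=1))
    A_hat = d[:, None] * A * d[None, :]       # D A D
    rho = np.max(np.abs(np.linalg.eigvalsh(A_hat)))
    assert rho <= 1.0 + 1e-12                 # spectral radius at most 1 (Lemma 1)
```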

Fig. 1. Evolution of the range of values that the variable in Line 4 of Algorithm 1 takes for different Lanczos problems arising during the solution of one optimization problem from the benchmark set of problems described in Section 4. The red solid and grey shaded curves represent the scaled and unscaled algorithms, respectively.


Theorem 1. Given the scaling matrix (2), the symmetric Lanczos algorithm applied to Â, for any non-singular symmetric matrix A, has intermediate variables with the following bounds for all iterations i and all components j and k:

q_i(j) ∈ [−1, 1],  Â(j, k) ∈ [−1, 1],  z_i(j) ∈ [−1, 1],

α_i ∈ [−1, 1],  β_i ∈ [0, 1],

(α_i q_i)(j) ∈ [−1, 1],  (β_{i−1} q_{i−1})(j) ∈ [−1, 1],

(z_i − α_i q_i)(j) ∈ [−2, 2],  v_i(j) ∈ [−1, 1],  v_i^T v_i ∈ [0, 1],

where i denotes the iteration number and j and k denote the jth component of a vector and the (j, k) component of a matrix, respectively.

Corollary 1. For the integer part of a fixed-point 2's complement representation we require, including the sign bit, two bits for q_i, Â, z_i, α_i, β_i, α_i q_i, β_{i−1} q_{i−1}, v_i and v_i^T v_i, and three bits for z_i − α_i q_i. Observe that the elements of D can be reduced by an arbitrarily small amount to turn the closed intervals of Theorem 1 into open intervals, saving one bit for all variables except one.

Proof of Theorem 1. The normalisation step in Line 2 of Algorithm 1 ensures that the Lanczos vectors have unit norm for all iterations, hence all the elements of q_i are in [−1, 1]. We follow by bounding the elements of the coefficient matrix:

|â_jk| = d_jj |a_jk| d_kk ≤ |a_jk| / (√|a_jk| · √|a_jk|) = 1,    (6)

where (6) follows from the definition of D. Using Lemma 1 we can put bounds on the rest of the intermediate computations in the Lanczos iteration. We start with z_i, which is used in Lines 3, 4 and 5 in Algorithm 1:

||z_i||_2 = ||Â q_i||_2 ≤ ||Â||_2 ||q_i||_2 ≤ 1,    (7)

where (7) follows from the properties of matrix norms and the fact that ||q_i||_2 = 1. The equality ||Â||_2 = ρ(Â) follows from the 2-norm of a real symmetric matrix being equal to its largest absolute eigenvalue [23, Theorem 2.3.1].

We continue by bounding α_i and β_i, which are used in Lines 2, 4, 5 and 6 of Algorithm 1 and represent the coefficients of the tridiagonal matrix described in (1). The eigenvalues of the tridiagonal approximation matrix (1) are contained within the interval spanned by the eigenvalues of Â, even throughout the intermediate iterations [23, §9.1]. Hence, one can use the following relationship [23, §2.3.2]

max_{j,k} |(T_i)_{jk}| ≤ ||T_i||_2 ≤ 1,    (8)

to bound the coefficients of T_i in (1), i.e. |α_i| ≤ 1 and |β_i| ≤ 1, for all iterations.

Interval arithmetic can be used to show that the elements of α_i q_i and β_{i−1} q_{i−1} are also between 1 and −1 and the elements of the intermediate result z_i − α_i q_i are in [−2, 2].

The following equality

Â q_i = β_{i−1} q_{i−1} + α_i q_i + β_i q_{i+1},    (9)

which always holds in the Lanczos process [23, §9], can be used to bound the elements of the auxiliary vector v_i in [−1, 1] via interval arithmetic on the expression β_i q_{i+1}. We also know that β_i is non-negative so its bound derived from (8) can be refined.

Finally, we can also bound the intermediate computation in Line 6 of Algorithm 1 using β_i ≤ 1, hence v_i^T v_i lies in [0,1]. ◽
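The bounds can also be checked empirically by running the scaled iteration. The snippet below assumes the lanczos and scale_problem sketches given earlier are in scope; it prints the largest magnitudes observed, which stay at or below one as predicted.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((16, 16))
A = M + M.T                                   # random non-singular symmetric matrix
b = rng.standard_normal(16)

A_hat, b_hat, d = scale_problem(A, b)         # scaling sketch from Section 3.1
v0 = b_hat / np.linalg.norm(b_hat)
alphas, betas, Q = lanczos(A_hat, v0, 16)     # Lanczos sketch from Section 2

# In exact arithmetic all three quantities are bounded by one (Theorem 1);
# in double precision they stay at or below one up to round-off.
print(np.max(np.abs(alphas)), np.max(betas), np.max(np.abs(Q)))
```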

The following points should also be considered for a reliable fixed-point implementation of the Lanczos process:

• Division and square root operations are implemented as iterative procedures in digital architectures. The data types for the intermediate variables can be designed to prevent overflow errors. In this case, the fact that β_i ∈ [0, 1] and v_i^T v_i ∈ [0, 1] can be used to establish tight bounds on the intermediate results for any implementation. For instance, all the intermediate variables in a CORDIC square root implementation [34] can be upper bounded by one if the input is known to be smaller than one.

• A possible source of problems both for fixed-point and floating-point implementations is encountering β_i = 0. However, this would mean that we have already computed a perfect tridiagonal approximation to Â, i.e. the roots of the characteristic polynomial of T_i are the same as those of the characteristic polynomial of Â, signalling completion of the Lanczos process.

• If an upper bound estimate for the spectral radius of the matrix is available, it is possible to bound all variables analytically without using the scaling matrix (2). However, the bounds will lose uniformity, i.e. the elements of some variables would still be in [−1, 1] but the elements of others would lie in wider intervals that depend on that estimate.

The scaling operation that has been suggested in this section is also known as diagonal preconditioning. However, the primary objective of the scaling procedure is not to accelerate the convergence of the iterative algorithm, which is the objective of standard preconditioning. Sophisticated preconditioners attempt to increase the clustering of eigenvalues. Our scaling procedure, which has the effect of normalising the 1-norm of the rows of the matrix, can be applied after a traditional accelerating preconditioner. However, since this will move the eigenvalues, it cannot be guaranteed that the scaling procedure will not have a negative effect on the convergence rate. In cases where the convergence rate is not satisfactory, a better strategy could be to include the objective of normalising the 1-norm of the rows of the matrix in the design of the accelerating preconditioner to trade off the goals of fast convergence and implementation cost. Since the design of sophisticated preconditioners is very application-specific it is not possible to describe a general procedure to achieve this goal.


3.2 Validity under Inexact Computations

We now use Paige's error analysis of the Lanczos process [8] to adapt the previously derived bounds in the presence of finite precision computations. We are interested in the worst-case error in any component. In the following, we will denote the deviation of each variable from its value under exact arithmetic by a corresponding error term.

Unlike with floating-point arithmetic, fixed-point addition and subtraction operations involve no round-off error, provided there is no overflow and the result has the same number of fraction bits as the operands [35], which will be assumed in this section. For multiplication, the exact product of two numbers with k fraction bits can be represented using 2k fraction bits, hence a k-bit truncation of a 2's complement number incurs a round-off error bounded from below by −2^−k. Recall that in 2's complement arithmetic, truncation incurs a negative error both for positive and negative numbers.
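The sign of the truncation error can be seen directly in a short simulation. The helper below models truncation toward minus infinity on the fixed-point grid, which is our assumption of the 2's complement behaviour described above.

```python
import math

def truncate(x, frac_bits):
    # Model of 2's complement truncation: floor on the scaled integer grid.
    scale = 2 ** frac_bits
    return math.floor(x * scale) / scale

k = 8
for x in (0.7071067, -0.7071067):
    err = truncate(x, k) - x
    # The error is negative for both signs and bounded from below by -2**-k.
    print(x, err, -2.0 ** -k <= err <= 0.0)
```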

The maximum absolute component-wise error in the variables involved in Algorithm 1 is summarised in the following proposition:

Proposition 1. When using fixed-point arithmetic with k fraction bits and assuming no overflow errors, the maximum difference in the variables involved in the Lanczos process described by Algorithm 1 with respect to their exact arithmetic values can be bounded by (10)-(14) for all iterations i.

Proof. In the original analysis presented in [8], higher order error terms are ignored since every term is assumed to be significantly smaller than one for the analysis to be valid, hence, higher order terms have a negligible impact on the final results. We do the same here as it significantly clarifies the presentation.

According to [8], the departure from unit norm in the Lanczos vectors can be bounded by (15) for all iterations. In the worst case, all the error can be attributed to the same element of q_i, hence, neglecting higher-order terms we have

leading to (10). The error in Line 3 of Algorithm 1 can be written, using the properties of matrix and vector norms, as an expression in which the last term represents the maximum component-wise error in matrix-vector multiplication. Using (15) one can infer that the bound on the error in z_i is the same as the bound on the error in q_i given by (10), leading to (11).

Neglecting higher order terms, the error in α_i in Line 4 of Algorithm 1 can be obtained by forward error analysis, where the last term arises from the maximum round-off error in the dot-product computation. Using the bounds given by Theorem 1 one arrives at (12).

Going back to the original analysis in [8] one can use the fact that the 2-norm of the error in the relationship (9) can be bounded from below by (16) for all iterations. One can infer that the bound on the error in v_i is the same as (16). Using the properties of vector norms leads to (13).

The error in Line 6 of Algorithm 1 is given by

Using the bounds given by Theorem 1 yields (14). ◽

The error bounds given by Proposition 1 enlarge the bounds given in Theorem 1. In order to prevent overflow in the presence of round-off errors the integer bitwidth for the affected variables has to increase by a small number of bits, which will be one bit in all practical cases. For the remaining variables, which have bounds that depend on the spectral radius, one has two possibilities—either use extra bits to represent the integer part according to the new larger bounds, or adjust the spectral radius through the scaling matrix (2) such that the original bounds still apply under finite precision effects.

The latter approach is likely to provide effectively tighter bounds. We now outline a procedure, described in the following lemma, for controlling the spectral radius and give an example showing how to make use of it.

Lemma 2. If each element of the scaling matrix (2) is multiplied by (1 + ε)^(−1/2), where ε is a small positive number, the scaled matrix Â := DAD has, for any non-singular symmetric matrix A, spectral radius ρ(Â) ≤ 1/(1 + ε).

Proof. We now have absolute row sums equal to 1/(1 + ε) for the matrix D²A. The new Gershgorin discs are given by centres d_ii² a_ii and radii 1/(1 + ε) − |d_ii² a_ii|, which can be easily proved to lie inside the interval between −1/(1 + ε) and 1/(1 + ε). ◽

For instance, if one decides to use a given number of fraction bits on the benchmark problems described in Section 4 with a given dimension, the worst error bounds are given by the corresponding instances of (10)-(14). The value of ε in Lemma 2 is then chosen such that the bounds of Theorem 1 are still satisfied in the presence of these errors, which, for the given values, is satisfied by a small value of ε.


4 NUMERICAL RESULTS

In this section we show that even though these algorithms are known to be vulnerable to round-off errors [36], they can still be executed reliably in fixed-point arithmetic by using our proposed approach.

In order to evaluate the numerical behaviour of the Lanczos method, we examine the convergence of the Lanczos-based MINRES algorithm, described in Algorithm 2, which solves systems of linear equations by minimising the 2-norm of the residual b − Ax. Notice that in order to bound the solution vector x, one would need an upper bound on the spectral radius of Â⁻¹, which depends on the minimum absolute eigenvalue of Â. In general it is not possible to obtain a lower bound on this quantity, hence the solution update cannot be bounded and has to be computed using floating-point arithmetic. Since our primary objective is to evaluate the numerical behaviour of the computationally-intensive Lanczos kernel, the operations outside Lanczos are carried out in double precision floating-point arithmetic.
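As a floating-point reference for this end-to-end pipeline (scale, solve with MINRES, recover the solution), one can use SciPy's MINRES routine. This is only a double-precision sanity check with random data, not the paper's fixed-point implementation.

```python
import numpy as np
from scipy.sparse.linalg import minres

rng = np.random.default_rng(2)
N = 100
M = rng.standard_normal((N, N))
A = M + M.T                              # symmetric, generally indefinite
b = rng.standard_normal(N)

d = 1.0 / np.sqrt(np.abs(A).sum(axis=1)) # diagonal scaling of Section 3.1
A_hat = d[:, None] * A * d[None, :]      # D A D
b_hat = d * b                            # D b

x_hat, info = minres(A_hat, b_hat, maxiter=10 * N)  # default tolerance
x = d * x_hat                            # recover x = D x_hat
print(info, np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```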

Algorithm 2. MINRES Algorithm.

Require: Initialised internal MINRES variables. Given α_i, β_i and q_i from Algorithm 1:

1: for to do

2:

3:

4:

5:

6:

7:

8:

9:

10:

11: end for

12: return

Fig. 2 shows the convergence behaviour of a single precision floating-point implementation and several fixed-point implementations for a symmetric matrix from the University of Florida Sparse Matrix Collection [37]. All implementations exhibit the same convergence rate. There is a difference in the final attainable accuracy due to the accumulation of round-off errors, which is dependent on the precision used. The figure also shows that the fixed-point implementations have a stable numerical behaviour, i.e. the accumulated round-off error converges to a finite value. The numerical behaviour is similar for all linear systems for which the MINRES algorithm converges to the solution in double precision floating-point arithmetic.

In order to investigate the variation in attainable accuracy for a larger set of problems we created a benchmark set of approximately 1000 linear systems, coming from an optimal controller for a Boeing 747 aircraft model [38], [39] under many different operating conditions. The linear systems are all the same size but the condition numbers vary widely, starting from 50. The problems were solved using a 32-bit fixed-point implementation, and single and double precision floating-point implementations, and the attainable accuracy was recorded in each case. The results are shown in Fig. 3. As expected, double precision floating-point with 64 bits achieves better accuracy than the fixed-point implementations. However, single precision floating-point with 32 bits consistently achieves less accurate solutions than the 32-bit fixed-point implementation. Since single precision only has 23 mantissa bits, a fixed-point implementation with 32 bits can provide better accuracy than a floating-point implementation with 32 bits if the problems are formulated such that the full dynamic range offered by a fixed representation can be efficiently utilized across different problems. Fig. 3 also highlights that these problems are numerically challenging—if the scaling matrix (2) is not used, even the floating-point implementations fail to converge. This suggests that the proposed scaling procedure can also improve the numerical behaviour of floating-point iterative algorithms along with achieving its main goal of bounding variables. An example application supporting this claim is described in [39].

The final attainable accuracy is determined by the machine unit round-off u. When using floor rounding, u = 2^−k, where k denotes the number of bits used to represent the fractional part of numbers. It is well known [40] that when solving systems of linear equations these quantities are related by

attainable relative error = O(κ(A) u),    (17)

where κ(A) is the condition number of the coefficient matrix. We have observed that, for sufficiently large k, this relationship holds with approximate equality for all tested problems, including the problem described in Fig. 2. For smaller bitwidths,

Fig. 2. Convergence results when solving a linear system using MINRES for benchmark problem sherman1 from [37]. The solid line represents the single precision floating-point implementation (32 bits including 23 mantissa bits), whereas the dotted lines represent, from top to bottom, fixed-point implementations with increasing numbers of fraction bits, including 32, 41 and 50 bits for the fractional part of signals.


excessive round-off error leads to unpredictable behaviour of the algorithm for the described application.

The constant of proportionality, which captures the numerical difficulty of the problem, is different for each problem. This constant cannot be computed a priori, but relationship (17) allows one to calculate it after a single simulation run and then determine the attainable accuracy when using a different number of bits, i.e. we can predict the numerical behaviour when using k bits by shifting the histogram obtained for 32 fixed-point fraction bits.
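In code, the prediction rule implied by (17) is a one-line rescaling. The function below is our own illustration, taking the accuracy measured with 32 fraction bits and predicting the accuracy for a different fraction width.

```python
def predicted_accuracy(measured_acc_32bits, frac_bits):
    """Predict attainable accuracy for frac_bits fraction bits from a single
    measurement at 32 fraction bits, using relationship (17): the attainable
    accuracy scales with the unit round-off u = 2**-k."""
    return measured_acc_32bits * 2.0 ** (32 - frac_bits)

# Example: a problem reaching 1e-7 relative error with 32 fraction bits is
# predicted to reach roughly 2e-10 with 41 fraction bits.
print(predicted_accuracy(1e-7, 41))
```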

5 PARAMETERIZABLE FPGA ARCHITECTURE

The results derived in Section 3 can be used to implement Lanczos-based algorithms reliably in low cost and low power fixed-point architectures, such as fixed-point DSPs and embedded microcontrollers. In this section, we will evaluate the potential efficiency improvements in FPGAs. In these platforms, for addition and subtraction operations, fixed-point units consume one order of magnitude fewer resources and incur one order of magnitude less arithmetic delay than floating-point units providing the same number of mantissa bits [41]. These platforms also provide flexibility for synthesizing different computing architectures. We now describe our architecture generating tool, which takes as inputs the data type, the number of fraction bits k, the level of parallelization P and the latencies of an adder/subtracter, multiplier, square root and divider, and automatically generates an architecture described in VHDL.

The proposed compute architecture for implementing Algorithm 1 is shown in Fig. 4. The most computationally intensive operation is the matrix-vector multiplication in Line 3 of Algorithm 1. This is implemented in a dedicated dot-product block, which is a parallel pipelined unit consisting of a parallel array of multipliers followed by an adder reduction tree. The remaining operations of Algorithm 1 are all vector operations and scalar operations that use dedicated components.

The degree of parallelism in the circuit is parameterized by the parameter P. For instance, if P = 2, there will be two dot-product blocks operating in parallel, one operating on the odd rows of the matrix, the other on the even. All links carrying vector signals will branch into two links carrying the even and odd components, respectively, and all arithmetic units operating on vector links will be replicated. Note that the square root and divider, which consume most resources, only operate on scalar values, hence there will only be one of each of these units regardless of the value of P. For the memory subsystem, instead of having independent memories each storing one column of the matrix, the memories are split so that half of them store the even rows and the other half store the odd rows.

The latency for one Lanczos iteration in terms of clock cycles is given by (18), where (19) is the number of cycles it takes the reduction circuit, illustrated in Fig. 5, to reduce the incoming streams to a single scalar value. This operation is necessary when computing the dot products for α_i and v_i^T v_i. Note in particular that for a fixed-point implementation with a single-cycle adder and no parallelization, the reduction circuit is a single adder, as expected. Table 2 shows the latency of the arithmetic units under different number representations.

For cases where the matrix dimension is large and the available resources cannot provide one multiplier per column, it is possible to reduce the size of the dot-product unit, which consumes most resources. When the unit is halved, for example, it consists of half as many parallel multipliers followed by an adder tree of correspondingly smaller depth. An extra adder is needed at the output to add up the intermediate results, hence the overall latency of

Fig. 4. Lanczos compute architecture. Dotted lines denote links carrying vectors, whereas solid lines denote links carrying scalars. The two thick dotted lines going into the dot-product block denote parallel vector links. The input to the circuit is the initial vector going into the multiplexer and the matrix being written into on-chip RAM. The output is α_i, β_i and q_i.

Fig. 3. Histogram showing the final log relative error at termination for different linear solver implementations. From top to bottom: preconditioned 32-bit fixed-point, double precision floating-point and single precision floating-point implementations, and unpreconditioned single precision floating-point implementation.


the dot-product computation does not change. The dotted links in Fig. 4 would carry one value every two cycles.
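The row interleaving used by the parallel dot-product blocks can be emulated in software as follows. This only illustrates the data layout (bank p holds rows p, p+P, p+2P, ...), not the hardware timing, and the function name is ours.

```python
import numpy as np

def banked_matvec(A, q, P=2):
    """Emulate P dot-product blocks working on interleaved rows of A:
    block p produces the entries of z = A q for rows p, p+P, p+2P, ..."""
    N = A.shape[0]
    z = np.empty(N)
    for p in range(P):              # the P blocks run concurrently in hardware
        rows = np.arange(p, N, P)
        z[rows] = A[rows, :] @ q    # each block: multiplier array + adder tree
    return z

# Sanity check against a plain matrix-vector product.
A = np.arange(16.0).reshape(4, 4)
q = np.ones(4)
assert np.allclose(banked_matvec(A, q, P=2), A @ q)
```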

5.1 Design Automation Tool

In order to evaluate the performance of our designs for a given target FPGA chip we created a design automation tool that generates optimum designs with respect to the following rule:

minimize the iteration latency (18), subject to the accuracy constraint (20) and the resource constraint (21), where the latency is defined in (18) with its explicit dependence on the number of fraction bits noted. The accuracy constraint (20) requires that the probability that any problem chosen at random from the benchmark set meets the user-specified accuracy constraint exceeds a given threshold, and is used to model the fact that for any finite precision representation—fixed point or double precision floating point—there will be problem instances that fail to converge to the desired accuracy for numerical reasons. The user can specify the tolerance on the error and the proportion of problems allowed to fail to converge to the desired accuracy. In the remainder of the paper, we set the allowed failure proportion to a small value, which is reasonable for the application domain for the data used. The resource constraint (21) involves a vector representing the utilization of the different FPGA resources: flip-flops (FFs), look-up tables (LUTs), embedded multipliers and embedded RAM, for the Lanczos architecture illustrated in Fig. 4 with parallelism degree P and a k-bit fixed-point datapath.

This problem can be easily solved. First, determine the minimum number of fraction bits k necessary to satisfy the accuracy requirements (20) by making use of the information in Fig. 3 and the relationship (17). Once k is fixed, find the maximum P such that (21) remains satisfied using the information in Table 3 and a model for the number of LUTs, flip-flops and embedded multipliers necessary for implementing each arithmetic unit for different numbers of bits and data representations [41]. Note that the actual resource utilization of the generated designs can differ slightly from the model predictions. However, the modeling error has been verified to be insignificant compared to the efficiency improvements that will be presented in Section 6.
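A high-level sketch of this two-step selection is shown below. The accuracy and resource models are placeholders to be supplied by the user (for example from Fig. 3 together with (17), and from the resource models of [41]); the function names and parameters are ours, and the real tool operates on its own internal models.

```python
def choose_parameters(accuracy_spec, resource_budget,
                      predicted_accuracy, resource_usage,
                      k_max=53, p_max=64):
    """Two-step heuristic used by the design automation tool (sketch).

    predicted_accuracy(k) -> accuracy achievable with k fraction bits.
    resource_usage(k, P)  -> dict of FF/LUT/DSP/RAM usage for the Fig. 4
                             architecture with parallelism P and k fraction bits.
    """
    # Step 1: smallest fraction width meeting the accuracy specification (20).
    k = next(kk for kk in range(1, k_max + 1)
             if predicted_accuracy(kk) <= accuracy_spec)
    # Step 2: largest parallelism still fitting the device (21).
    P = 1
    for cand in range(2, p_max + 1):
        usage = resource_usage(k, cand)
        if all(usage[r] <= resource_budget[r] for r in resource_budget):
            P = cand
        else:
            break
    return k, P
```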

Since the entire matrix has to be stored on-chip, memory is typically the limiting factor for implementations with a small number of bits. For larger numbers of bits embedded multipliers limit the degree of parallelization. In the former case, storage of some of the columns of the matrix is implemented using banks of registers so FFs become the limiting resource. In the latter case, some multipliers are implemented using LUTs so these become the limiting resource. Fig. 6 shows the trade-off between latency and FFs offered by the floating-point Lanczos implementations and two fixed-point implementations that, when embedded inside a MINRES solver, meet the same accuracy requirements as the single and double precision floating-point implementations. The trade-off is similar for other resources. We can see that the fixed-point implementations make better utilization of the available resources to reduce latency while providing the same solution quality.

6 PERFORMANCE EVALUATION

In this section we will evaluate the relative performance of the fixed-point and floating-point implementations under the resource constraint framework of Section 5 for a Virtex 7 XT 1140 FPGA [42]. Then we will evaluate the absolute performance and efficiency of the fixed-point implementations against a high-end GPGPU with a peak floating-point performance of 1 TFLOP/s.

The trade-off between latency (18) and accuracy requirements for our FPGA implementations is investigated in Fig. 7. For high accuracy requirements a large number of bits is needed, reducing the extractable parallelism and increasing the latency. As the accuracy requirements are relaxed it becomes possible to reduce latency by increasing parallelism. The figure also shows that the fixed-point implementations provide a better trade-off even when the accuracy of the calculation is considered.

The simple control structures in our design and the pipelined arithmetic units allow the circuits to be clocked at

TABLE 2: Delays for Arithmetic Cores

The delay of the fixed-point divider varies nonlinearly between 21 and 36 cycles over the range of bitwidths considered.

TABLE 3: Resource Usage

Fig. 5. Reduction circuit, consisting of adders and a serial-to-parallel shift register.


frequencies up to 400 MHz. Noting that each Lanczos iteration requires a known number of operations, we plot the number of operations per second (OP/s) against accuracy requirements in Fig. 8. For extremely high accuracy requirements, not attainable by double precision floating-point, a fixed-point implementation can still achieve approximately 100 GOP/s. Since double precision floating-point only has 52 mantissa bits, a 53-bit fixed-point arithmetic can provide more accuracy if the dynamic range is controlled. For two representative lower accuracy requirements the fixed-point implementations achieve approximately 200 and 300 GOP/s, respectively. Larger problems would benefit more from incremental parallelization leading to greater performance improvements, especially for lower accuracy requirements.

The GPGPU curves are based on the NVIDIA C2050 [43], which has a peak single precision performance of 1.03 TFLOP/s and a peak double precision performance of 515 GFLOP/s. It should be emphasized that while the solid lines represent the peak GPGPU performance, the actual sustained performance can differ significantly [44]. In fact, [45] reported sustained performance well below 10% of the peak performance when implementing the Lanczos kernel on this GPGPU.

The trade-off between performance and accuracy requirements is important for the range of applications that we consider. For some HPC applications, high accuracy requirements, even beyond double precision, can be a high priority. On the other hand, for some embedded applications that require the repeated solution of similar problems, accuracy can be sacrificed for the ability to apply actions fast and respond quickly to new events. In some of these applications, modest solution accuracy requirements can be perfectly reasonable.

The results presented so far have assumed that we are processing a single Lanczos problem at a time. Using this approach the arithmetic units in our circuit are always idle for some fraction of the iteration time. In addition, because the constant term in (18) is relatively large, the effect of incremental parallelization on latency reduction becomes small very quickly. In the situation when there are many independent problems available [46], it is possible to use the idle computational power by time-multiplexing multiple problems into the same circuit to hide the pipeline latency and keep arithmetic units busy [47]. The number of problems needed to fill the pipeline is determined by the iteration latency and the pipeline depth of the arithmetic units. If the extra storage needed does not hinder the achievable parallelism, it is possible to achieve much higher performance, exceeding the peak GPGPU performance for most accuracy requirements even for small problems, as shown in Fig. 8b. Using this approach there is a more direct transfer between parallelization and sustained performance. The sharp improvement in performance for low accuracy requirements is a consequence of a significant reduction in the number of embedded multiplier blocks necessary for implementing multiplier units, allowing for a significant increase in the available resources for parallelization.

For the Virtex 7 XT 1140 [42] FPGA from the performance-optimized Xilinx device family, the Xilinx power estimator [48] was used to estimate the maximum power consumption at approximately 22 Watts. For the C2050 GPGPU [43], the power consumption is in the region of 100 Watts, while a host processor consuming extra power would still be needed

Fig. 7. Latency against accuracy requirements tradeoff on a Virtex 7 XT 1140 [42]. The dotted line, the cross and the circle represent fixed-point and double and single precision floating-point implementations, respectively. The jumps in the fixed-point curve mark the points at which the level of parallelization changes.

Fig. 6. Latency tradeoff against FF utilization (from model) on a Virtex 7 XT 1140 [42]. Double precision and single precision floating-point implementations are represented by solid lines with crosses and circles, respectively. Fixed-point implementations are represented by dotted lines with crosses and circles, respectively. These Lanczos implementations, when embedded inside a MINRES solver, match the accuracy requirements of the floating-point implementations.


for controlling the data transfer to and from the GPGPU.Hence, for problems with modest accuracy requirements,there will be more than one order of magnitude differencein power efficiencywhenmeasured in terms of operations perwatt between the sustained fixed-point FPGA performanceand the peak GPGPU floating-point performance.

7 OTHER LINEAR ALGEBRA KERNELS

It is expected that the same scaling procedure presented in Section 3 will also be useful for bounding variables in other iterative linear algebra algorithms based on matrix-vector multiplication.

Algorithm 3. Arnoldi Algorithm.

Require: Initial iterate such that and .

1: for to do

2:

3:

4:

5: for to do

6:

7:

8: end for

9:

10: end for

11: return

The standard Arnoldi iteration [13], described in Algorithm 3, transforms a non-symmetric matrix A into an upper Hessenberg matrix H (upper triangle and first lower diagonal are non-zero) with similar spectral properties as A using an orthogonal transformation matrix Q. At every iteration the approximation is refined such that

A Q_i = Q_i H_i + h_{i+1,i} q_{i+1} e_i^T,

where Q_i := [q_1 ⋯ q_i] and H_i is the i × i upper Hessenberg approximation. Since the matrix is not symmetric it is not necessary to apply a symmetric scaling procedure; hence, instead of solving Ax = b, we solve

(D²A) x = D²b,

and the computed solution remains the same as the solution to the original problem. The following proposition summarises the variable bounds for the Arnoldi process:

Proposition 2. Given the scaling matrix (2), the Arnoldi iteration applied to the left-scaled matrix, for any non-singular matrix A, has intermediate variables with the following bounds for all i, j and k:

where i denotes the iteration number and j and k denote the component of a vector and the component of a matrix, respectively.

Fig. 8. Sustained computing performance for fixed-point implementations on a Virtex 7 XT 1140 [42] for different accuracy requirements. The solid line represents the peak performance of a 1 TFLOP/s GPGPU. The curves are parameterized by the degree of parallelization and the number of fraction bits.



Proof. According to the proof of Lemma 1, the spectral radius of the non-symmetric scaled matrix is still bounded as in the symmetric case. As with the Lanczos iteration, the eigenvalues of the approximate matrix are contained within the eigenvalues of the scaled matrix even throughout the intermediate iterations. One can use the relationship (8) to show that the coefficients of the Hessenberg matrix satisfy the same bound. The bounds for the remaining expressions in the Arnoldi iteration are obtained in the same way as in Theorem 1. ◽

It is expected that the same techniques could be applied to other related kernels such as the unsymmetric Lanczos process or the power iteration for computing maximal eigenvalues.
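To illustrate why such kernels are amenable to the same treatment (this sketch is an illustration, not part of the development above), a power iteration that renormalises its iterate to unit infinity-norm at every step keeps all intermediate values within a known range once the matrix has been scaled to have a bounded norm.

```python
import numpy as np

def power_iteration(A_scaled, v0, iters=100):
    # Renormalising to unit infinity-norm every step keeps the iterate in [-1, 1];
    # if the scaled matrix has a known infinity-norm bound, the product A_scaled @ v
    # is bounded too, which is the property a fixed-point implementation needs.
    v = v0 / np.abs(v0).max()
    for _ in range(iters):
        w = A_scaled @ v
        v = w / np.abs(w).max()
    lam = v @ (A_scaled @ v) / (v @ v)   # Rayleigh-quotient estimate of the dominant eigenvalue
    return lam, v
```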

8 CONCLUSION

Fixed-point computation is more efficient than floating-point from the digital circuit point of view. We have shown that fixed-point computation can also be suitable for problems that have traditionally been considered floating-point problems if enough care is taken to formulate these problems in a numerically favourable way. Even for algorithms known to be vulnerable to numerical round-off errors, accuracy does not necessarily have to be compromised by moving to fixed-point arithmetic if the dynamic range can be controlled such that a fixed-point representation can represent numbers efficiently.

Implementing an algorithm using fixed-point arithmetic gives more responsibility to the designer, since all variables need to be bounded in order to avoid overflow errors that can lead to unpredictable behaviour. We have proposed a scaling procedure that allows us to bound and control the dynamic range of all variables in the Lanczos method—the building block in iterative methods for solving the most important linear algebra problems, which are ubiquitous in engineering and science. The proposed methodology is simple to implement but uses linear algebra theorems to establish bounds, which is currently well beyond the capabilities of state-of-the-art automatic tools for solving the bounding problem.

The capability for implementing these algorithms using fixed-point arithmetic could have an impact both in the high-performance and embedded computing domains. In the embedded domain, it has the potential to open up opportunities for implementing sophisticated functionality in low-cost systems with limited computational capabilities. For high-performance scientific applications it could help in the effort to reach exascale levels of performance while keeping the power consumption costs at an affordable level. For other applications there are substantial processing performance and efficiency gains to be realized.

ACKNOWLEDGMENTS

The authors would like to acknowledge the support of the EPSRC (Grant EP/G031576/1 and Grant EP/I012036/1) and the EU FP7 Project EMBOCON, as well as industrial support from Xilinx, the Mathworks, and the European Space Agency. The authors would also like to acknowledge fruitful discussions with Mr. Theo Drane and Dr. Ammar Hasan, and would like to thank Dr. Ed Hartley for providing the benchmark set of initial conditions for the Boeing 747 optimal controller.

REFERENCES

[1] C. Lanczos, "An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators," J. Research Nat'l Bureau of Standards, vol. 45, no. 4, pp. 255-282, Oct. 1950.

[2] J. Demmel, Applied Numerical Linear Algebra, 1st ed. SIAM, 1997.

[3] A. Greenbaum, Iterative Methods for Solving Linear Systems, 1st ed. SIAM, 1987.

[4] J.M. Maciejowski, Predictive Control with Constraints. Pearson Education, 2001.

[5] S. Hemmert, "Green HPC: From Nice to Necessity," Computing in Science and Eng., vol. 12, no. 6, pp. 8-10, Nov. 2010.

[6] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed. Morgan Kaufmann Publishers, 2011.

[7] D. Jensen and A. Rodrigues, "Embedded Systems and Exascale Computing," Computing in Science & Eng., vol. 12, no. 6, pp. 20-29, 2010.

[8] C.C. Paige, "Error Analysis of the Lanczos Algorithm for Tridiagonalizing a Symmetric Matrix," J. Inst. of Math. Applications, vol. 18, pp. 341-349, 1976.

[9] C. Inacio, "The DSP Decision: Fixed Point or Floating?" IEEE Spectrum, vol. 33, no. 9, pp. 72-74, Sept. 1996.

[10] G.A. Constantinides, N. Nicolici, and A.B. Kinsman, "Numerical Data Representations for FPGA-Based Scientific Computing," IEEE Design & Test of Computers, vol. 28, no. 4, pp. 8-17, Aug./Sept. 2011.

[11] J.L. Jerez, G.A. Constantinides, and E.C. Kerrigan, "Fixed-Point Lanczos: Sustaining TFLOP-Equivalent Performance in FPGAs for Scientific Computing," Proc. 20th IEEE Symp. Field-Programmable Custom Computing Machines, pp. 53-60, Apr. 2012.

[12] M. Daumas and G. Melquiond, "Certification of Bounds on Expressions Involving Rounded Operators," Trans. Math. Software, vol. 37, no. 1, pp. 2:1-2:20, 2010.

[13] W.E. Arnoldi, "The Principle of Minimized Iterations in the Solution of the Matrix Eigenvalue Problem," Quarterly of Applied Math., vol. 9, no. 1, pp. 17-29, 1951.

[14] L. Zhuo and V.K. Prasanna, "High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware," IEEE Trans. Computers, vol. 57, no. 8, pp. 1057-1071, Aug. 2008.

[15] K.D. Underwood and K.S. Hemmert, "Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance," Proc. 12th IEEE Symp. Field-Programmable Custom Computing Machines, pp. 219-228, Apr. 2004.

[16] W. Zhang, V. Betz, and J. Rose, "Portable and Scalable FPGA-Based Acceleration of a Direct Linear System Solver," Proc. Int'l Conf. Field Programmable Technology, pp. 17-24, Dec. 2008.

[17] Y. Dou, Y. Lei, G. Wu, S. Guo, J. Zhou, and L. Shen, "FPGA Accelerating Double/Quad-Double High Precision Floating-Point Applications for Exascale Computing," Proc. 24th ACM Int'l Conf. Supercomputing, pp. 325-335, June 2010.

[18] F. de Dinechin and B. Pasca, "Designing Custom Arithmetic Data Paths with FloPoCo," IEEE Design & Test of Computers, vol. 28, no. 4, pp. 18-27, Aug. 2011.

[19] M. Langhammer and T. VanCourt, "FPGA Floating Point Datapath Compiler," Proc. 17th IEEE Symp. Field-Programmable Custom Computing Machines, pp. 259-262, Apr. 2007.

[20] G. Golub and W. Kahan, "Calculating the Singular Values and Pseudo-Inverse of a Matrix," SIAM J. Numerical Analysis, vol. 2, no. 2, pp. 205-224, 1965.

[21] M.R. Hestenes and E. Stiefel, "Methods of Conjugate Gradients for Solving Linear Systems," J. Research Nat'l Bureau of Standards, vol. 49, no. 6, pp. 409-436, Dec. 1952.

[22] C.C. Paige and M.A. Saunders, "Solution of Sparse Indefinite Systems of Linear Equations," SIAM J. Numerical Analysis, vol. 12, no. 4, pp. 617-629, Sept. 1975.

[23] G.H. Golub and C.F. Van Loan, Matrix Computations, 3rd ed. The Johns Hopkins Univ. Press, 1996.

[24] G.A. Constantinides, "Tutorial Paper: Parallel Architectures for Model Predictive Control," Proc. European Control Conf., pp. 138-143, Aug. 2009.

[25] S.K. Mitra, Digital Signal Processing, 3rd ed. McGraw-Hill, 2005.

[26] R.E. Moore, Interval Analysis. Prentice-Hall, 1966.

[27] A. Benedetti and P. Perona, "Bit-Width Optimization for Configurable DSP's by Multi-Interval Analysis," Proc. 34th Asilomar Conf. Signals, Systems and Computers, pp. 355-359, Nov. 2000.

[28] D. Boland and G.A. Constantinides, "A Scalable Approach for Automated Precision Analysis," Proc. ACM Symp. Field Programmable Gate Arrays, pp. 185-194, Mar. 2012.



[29] L.D. Moura and N.B. Bjørner, "Z3: An Efficient SMT Solver," Proc. Int'l Conf. Tools and Algorithms for the Construction and Analysis of Systems, pp. 337-340, Mar. 2008.

[30] P.S. Chang and A.N. Willson, "Analysis of Conjugate Gradient Algorithms for Adaptive Filtering," IEEE Trans. Signal Processing, vol. 48, no. 2, pp. 409-418, Feb. 2000.

[31] R.I. Hernández, R. Baghaie, and K. Kettunen, "Implementation of Gram-Schmidt Conjugate Direction and Conjugate Gradient Algorithms," Proc. IEEE Finnish Signal Processing Symp., pp. 165-169, May 1999.

[32] P. Jänis, M. Melvasalo, and V. Koivunen, "Fast Reduced Rank Equalizer for HSDPA Systems Based on Lanczos Algorithm," Proc. IEEE 7th Workshop on Signal Processing Advances in Wireless Comm., pp. 1-5, July 2006.

[33] J.L. Lions, "ARIANE 5 Flight 501 Failure," Report by the Inquiry Board, http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html, July 1996.

[34] J.-M. Muller, Elementary Functions: Algorithms and Implementation, 2nd ed. Birkhaeuser, 2006.

[35] J.H. Wilkinson, Rounding Errors in Algebraic Processes, 1st ed. Her Majesty's Stationery Office, 1963.

[36] C.C. Paige, "Accuracy and Effectiveness of the Lanczos Algorithm for the Symmetric Eigenproblem," Linear Algebra and Its Applications, vol. 34, pp. 235-258, Dec. 1980.

[37] T.A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection," ACM Trans. Math. Software, vol. 38, no. 1, pp. 1:1-1:25, 2011, http://www.cise.ufl.edu/research/sparse/matrices.

[38] C.V.D. Linden, H. Smaili, A. Marcos, G. Balas, D. Breeds, S. Runhan, C. Edwards, H. Alwi, T. Lombaerts, J. Groeneweg, R. Verhoeven, and J. Breeman, GARTEUR RECOVER Benchmark, http://www.faulttolerantcontrol.nl/, 2011.

[39] E. Hartley, J.L. Jerez, A. Suardi, J.M. Maciejowski, E.C. Kerrigan, and G.A. Constantinides, "Predictive Control Using an FPGA with Application to Aircraft Control," IEEE Trans. Control Systems Technology, vol. 22, no. 3, pp. 1006-1017, Jul. 2013.

[40] N.J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed. SIAM, 2002.

[41] Xilinx, "LogiCORE IP Floating-Point Operator v5.0," http://www.xilinx.com/support/documentation/ip_documentation/floating_point_ds335.pdf, 2011.

[42] Xilinx, "Virtex-7 Family Overview," http://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf, 2011.

[43] NVIDIA, “Tesla C2050 GPU Computing Processor,” http://www.nvidia.com/docs/IO/43395/NV_DS_Tesla_C2050_C2070_jul10_lores.pdf, May 2013.

[44] K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication," Proc. ACM Conf. Graphics Hardware, pp. 133-137, Aug. 2004.

[45] A. Rafique, N. Kapre, and G.A. Constantinides, "A High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem," Proc. Int'l Symp. Applied Reconfigurable Computing, pp. 239-250, 2012.

[46] J.L. Jerez, K.-V. Ling, G.A. Constantinides, and E.C. Kerrigan, "Model Predictive Control for Deeply Pipelined Field-Programmable Gate Array Implementation: Algorithms and Circuitry," IET Control Theory and Applications, vol. 6, no. 8, pp. 1029-1041, 2012.

[47] C.E. Leiserson and J.B. Saxe, “Retiming Synchronous Circuitry,”Algorithmica, vol. 6, no. 1-6, pp. 5-35, 1991.

[48] Xilinx, “Xilinx Power Estimator,” http://www.xilinx.com/ise/power_tools/license_7series.htm, Dec. 2012.

Juan L. Jerez (S'11) received the MEng degree in electrical engineering in 2009 and the PhD degree from the Circuits and Systems Research Group in 2013, both from Imperial College London, London, U.K. He is currently a postdoctoral researcher at ETH Zürich, Zürich, Switzerland, and founder of Embotech GmbH. The focus of his research is on developing tailored linear algebra and optimization algorithms for efficient implementation on custom parallel computing platforms and embedded systems. Dr. Jerez has been awarded the IET Doctoral Dissertation Prize in Control and Automation and a Pioneer Fellowship from ETH Zürich.

George A. Constantinides (S'96-M'01-SM'08) received the PhD degree from Imperial College London, U.K., in 2001. Since 2002, he has been with the faculty at Imperial College London, where he is currently professor of digital computation and head of the Circuits and Systems Research Group. He will be program (general) chair of the ACM International Symposium on Field-Programmable Gate Arrays in 2014 (2015). He serves on several programme committees and has published over 150 research papers in peer-refereed journals and international conferences. He is a Fellow of the British Computer Society.

Eric C. Kerrigan (S'94-M'02) received the PhD degree from the University of Cambridge in 2001 and has been with Imperial College London since 2006. His research is focused on the development of efficient numerical methods and embedded computing architectures for solving advanced optimization, control and estimation problems in real-time. His work is applied to a variety of problems in aerospace and renewable energy applications. He is on the IEEE Control Systems Society Conference Editorial Board and is an associate editor of the IEEE Transactions on Control Systems Technology, Control Engineering Practice, and Optimal Control Applications and Methods.

