
Preface to the Third Edition

A new edition of a text presents not only an opportunity for corrections and minor changes but also for adding new material. Thus we strived to improve the presentation of Hermite interpolation and B-splines in Chapter 2, and we added a new Section 2.4.6 on multi-resolution methods and B-splines, using, in particular, low order B-splines for purposes of illustration. The intent is to draw attention to the role of B-splines in this area, and to familiarize the reader with, at least, the principles of multi-resolution methods, which are fundamental to modern applications in signal and image processing.

The chapter on differential equations was enlarged, too: A new Section 7.2.18 describes solving differential equations in the presence of discontinuities whose locations are not known at the outset. Such discontinuities occur, for instance, in optimal control problems where the character of a differential equation is affected by control function changes in response to switching events.

Many applications, such as parameter identification, lead to differential equations which depend on additional parameters. Users then would like to know how sensitively the solution reacts to small changes in these parameters. Techniques for such sensitivity analyses are the subject of the new Section 7.2.19.

Multiple shooting methods are among the most powerful for solving boundary value problems for ordinary differential equations. We dedicated, therefore, a new Section 7.3.8 to new advanced techniques in multiple shooting, which especially enhance the efficiency of these methods when applied to solve boundary value problems with discontinuities, which are typical for optimal control problems.

Among the many iterative methods for solving large sparse linear equations, Krylov space methods keep growing in importance. We therefore treated these methods in Section 8.7 more systematically by adding new subsections dealing with the GMRES method (Section 8.7.2), the biorthogonalization method of Lanczos and the (principles of the) QMR method (Section 8.7.3), and the Bi-CG and Bi-CGSTAB algorithms (Section 8.7.4). Correspondingly, the final Section 8.10 on the comparison of iterative methods was updated in order to incorporate the findings for all Krylov space methods described before.

    The authors are greatly indebted to the many who have contributed



to the new edition. We thank R. Grigorieff for many critical remarks on earlier editions, M. v. Golitschek for his recommendations concerning B-splines and their application in multi-resolution methods, and Ch. Pflaum for his comments on the chapter dealing with the iterative solution of linear equations. T. Kronseder and R. Callies helped substantially to establish the new Sections 7.2.18, 7.2.19, and 7.3.8. Suggestions by Ch. Witzgall, who had helped translate a previous edition, were highly appreciated and went beyond issues of language. Our co-workers M. Preiss and M. Wenzel helped us read and correct the original German version. In particular, we appreciate the excellent work done by J. Launer and Mrs. W. Wrschka, who were in charge of transcribing the full text of the new edition in TeX.

Finally we thank the Springer-Verlag for the smooth cooperation and expertise that led to a quick realization of the new edition.

Würzburg, München                        J. Stoer
January 2002                             R. Bulirsch

Preface to the Second Edition

On the occasion of the new edition, the text was enlarged by several new sections. Two sections on B-splines and their computation were added to the chapter on spline functions: due to their special properties, their flexibility, and the availability of well tested programs for their computation, B-splines play an important role in many applications.

Also, the authors followed suggestions by many readers to supplement the chapter on elimination methods by a section dealing with the solution of large sparse systems of linear equations. Even though such systems are usually solved by iterative methods, the realm of elimination methods has been widely extended due to powerful techniques for handling sparse matrices. We will explain some of these techniques in connection with the Cholesky algorithm for solving positive definite linear systems.

The chapter on eigenvalue problems was enlarged by a section on the Lanczos algorithm; the sections on the LR and QR algorithms were rewritten and now also contain a description of implicit shift techniques.

In order to take account of the progress in the area of ordinary differential equations to some extent, a new section on implicit differential equations and differential-algebraic systems was added, and the section on stiff differential equations was updated by describing further methods to solve such equations.

Also the last chapter on the iterative solution of linear equations was improved. The modern view of the conjugate gradient algorithm as an iterative method was stressed by adding an analysis of its convergence rate and a description of some preconditioning techniques. Finally, a new section on multigrid methods was incorporated: It contains a description of their basic ideas in the context of a simple boundary value problem for ordinary differential equations.



Many of the changes were suggested by several colleagues and readers. In particular, we would like to thank R. Seydel, P. Rentrop and A. Neumaier for detailed proposals, and our translators R. Bartels, W. Gautschi and C. Witzgall for their valuable work and critical commentaries. The original German version was handled by F. Jarre, and I. Brugger was responsible for the expert typing of the many versions of the manuscript.

Finally we thank the Springer-Verlag for the encouragement, patience and close cooperation leading to this new edition.

Würzburg, München                        J. Stoer
May 1991                                 R. Bulirsch

Contents

Preface to the Third Edition
Preface to the Second Edition

1 Error Analysis
1.1 Representation of Numbers
1.2 Roundoff Errors and Floating-Point Arithmetic
1.3 Error Propagation
1.4 Examples
1.5 Interval Arithmetic; Statistical Roundoff Estimation

Exercises for Chapter 1
References for Chapter 1

2 Interpolation
2.1 Interpolation by Polynomials
2.1.1 Theoretical Foundation: The Interpolation Formula of Lagrange
2.1.2 Neville's Algorithm
2.1.3 Newton's Interpolation Formula: Divided Differences
2.1.4 The Error in Polynomial Interpolation
2.1.5 Hermite Interpolation
2.2 Interpolation by Rational Functions
2.2.1 General Properties of Rational Interpolation
2.2.2 Inverse and Reciprocal Differences. Thiele's Continued Fraction
2.2.3 Algorithms of the Neville Type
2.2.4 Comparing Rational and Polynomial Interpolation
2.3 Trigonometric Interpolation
2.3.1 Basic Facts
2.3.2 Fast Fourier Transforms
2.3.3 The Algorithms of Goertzel and Reinsch
2.3.4 The Calculation of Fourier Coefficients. Attenuation Factors




2.4 Interpolation by Spline Functions
2.4.1 Theoretical Foundations
2.4.2 Determining Interpolating Cubic Spline Functions
2.4.3 Convergence Properties of Cubic Spline Functions
2.4.4 B-Splines
2.4.5 The Computation of B-Splines
2.4.6 Multi-Resolution Methods and B-Splines

Exercises for Chapter 2
References for Chapter 2

3 Topics in Integration

3.1 The Integration Formulas of Newton and Cotes
3.2 Peano's Error Representation
3.3 The Euler-Maclaurin Summation Formula
3.4 Integration by Extrapolation
3.5 About Extrapolation Methods
3.6 Gaussian Integration Methods
3.7 Integrals with Singularities

Exercises for Chapter 3
References for Chapter 3

4 Systems of Linear Equations

4.1 Gaussian Elimination. The Triangular Decomposition of a Matrix
4.2 The Gauss-Jordan Algorithm
4.3 The Choleski Decomposition
4.4 Error Bounds
4.5 Roundoff-Error Analysis for Gaussian Elimination
4.6 Roundoff Errors in Solving Triangular Systems
4.7 Orthogonalization Techniques of Householder and Gram-Schmidt
4.8 Data Fitting
4.8.1 Linear Least Squares. The Normal Equations
4.8.2 The Use of Orthogonalization in Solving Linear Least-Squares Problems
4.8.3 The Condition of the Linear Least-Squares Problem
4.8.4 Nonlinear Least-Squares Problems
4.8.5 The Pseudoinverse of a Matrix
4.9 Modification Techniques for Matrix Decompositions
4.10 The Simplex Method
4.11 Phase One of the Simplex Method
4.12 Appendix: Elimination Methods for Sparse Matrices

Exercises for Chapter 4
References for Chapter 4


5 Finding Zeros and Minimum Points by Iterative Methods

5.1 The Development of Iterative Methods
5.2 General Convergence Theorems
5.3 The Convergence of Newton's Method in Several Variables
5.4 A Modified Newton Method
5.4.1 On the Convergence of Minimization Methods
5.4.2 Application of the Convergence Criteria to the Modified Newton Method
5.4.3 Suggestions for a Practical Implementation of the Modified Newton Method. A Rank-One Method Due to Broyden
5.5 Roots of Polynomials. Application of Newton's Method
5.6 Sturm Sequences and Bisection Methods
5.7 Bairstow's Method
5.8 The Sensitivity of Polynomial Roots
5.9 Interpolation Methods for Determining Roots
5.10 The Δ²-Method of Aitken
5.11 Minimization Problems without Constraints

Exercises for Chapter 5
References for Chapter 5

6 Eigenvalue Problems
6.0 Introduction
6.1 Basic Facts on Eigenvalues
6.2 The Jordan Normal Form of a Matrix
6.3 The Frobenius Normal Form of a Matrix
6.4 The Schur Normal Form of a Matrix; Hermitian and Normal Matrices; Singular Values of Matrices
6.5 Reduction of Matrices to Simpler Form
6.5.1 Reduction of a Hermitian Matrix to Tridiagonal Form: The Method of Householder
6.5.2 Reduction of a Hermitian Matrix to Tridiagonal or Diagonal Form: The Methods of Givens and Jacobi
6.5.3 Reduction of a Hermitian Matrix to Tridiagonal Form: The Method of Lanczos
6.5.4 Reduction to Hessenberg Form
6.6 Methods for Determining the Eigenvalues and Eigenvectors
6.6.1 Computation of the Eigenvalues of a Hermitian Tridiagonal Matrix
6.6.2 Computation of the Eigenvalues of a Hessenberg Matrix. The Method of Hyman
6.6.3 Simple Vector Iteration and Inverse Iteration of Wielandt
6.6.4 The LR and QR Methods
6.6.5 The Practical Implementation of the QR Method


6.7 Computation of the Singular Values of a Matrix
6.8 Generalized Eigenvalue Problems
6.9 Estimation of Eigenvalues

Exercises for Chapter 6
References for Chapter 6

7 Ordinary Differential Equations
7.0 Introduction
7.1 Some Theorems from the Theory of Ordinary Differential Equations
7.2 Initial-Value Problems
7.2.1 One-Step Methods: Basic Concepts
7.2.2 Convergence of One-Step Methods
7.2.3 Asymptotic Expansions for the Global Discretization Error of One-Step Methods
7.2.4 The Influence of Rounding Errors in One-Step Methods
7.2.5 Practical Implementation of One-Step Methods
7.2.6 Multistep Methods: Examples
7.2.7 General Multistep Methods
7.2.8 An Example of Divergence
7.2.9 Linear Difference Equations
7.2.10 Convergence of Multistep Methods
7.2.11 Linear Multistep Methods
7.2.12 Asymptotic Expansions of the Global Discretization Error for Linear Multistep Methods
7.2.13 Practical Implementation of Multistep Methods
7.2.14 Extrapolation Methods for the Solution of the Initial-Value Problem
7.2.15 Comparison of Methods for Solving Initial-Value Problems
7.2.16 Stiff Differential Equations
7.2.17 Implicit Differential Equations. Differential-Algebraic Equations
7.2.18 Handling Discontinuities in Differential Equations
7.2.19 Sensitivity Analysis of Initial-Value Problems
7.3 Boundary-Value Problems
7.3.0 Introduction
7.3.1 The Simple Shooting Method
7.3.2 The Simple Shooting Method for Linear Boundary-Value Problems
7.3.3 An Existence and Uniqueness Theorem for the Solution of Boundary-Value Problems
7.3.4 Difficulties in the Execution of the Simple Shooting Method
7.3.5 The Multiple Shooting Method


7.3.6 Hints for the Practical Implementation of the Multiple Shooting Method
7.3.7 An Example: Optimal Control Program for a Lifting Reentry Space Vehicle
7.3.8 Advanced Techniques in Multiple Shooting
7.3.9 The Limiting Case m → ∞ of the Multiple Shooting Method (General Newton's Method, Quasilinearization)
7.4 Difference Methods
7.5 Variational Methods
7.6 Comparison of the Methods for Solving Boundary-Value Problems for Ordinary Differential Equations
7.7 Variational Methods for Partial Differential Equations. The Finite-Element Method
Exercises for Chapter 7
References for Chapter 7

8 Iterative Methods for the Solution of Large Systems of Linear Equations. Additional Methods
8.0 Introduction
8.1 General Procedures for the Construction of Iterative Methods
8.2 Convergence Theorems
8.3 Relaxation Methods
8.4 Applications to Difference Methods. An Example
8.5 Block Iterative Methods
8.6 The ADI-Method of Peaceman and Rachford
8.7 Krylov Space Methods for Solving Linear Equations
8.7.1 The Conjugate-Gradient Method of Hestenes and Stiefel
8.7.2 The GMRES Algorithm
8.7.3 The Biorthogonalization Method of Lanczos and the QMR Algorithm
8.7.4 The Bi-CG and Bi-CGSTAB Algorithms
8.8 Buneman's Algorithm and Fourier Methods for Solving the Discretized Poisson Equation
8.9 Multigrid Methods
8.10 Comparison of Iterative Methods
Exercises for Chapter 8
References for Chapter 8
General Literature on Numerical Methods
Index

1 Error Analysis

Assessing the accuracy of the results of calculations is a paramount goal in numerical analysis. One distinguishes several kinds of errors which may limit this accuracy:

(1) errors in the input data,
(2) roundoff errors,
(3) approximation errors.

Input or data errors are beyond the control of the calculation. They may be due, for instance, to the inherent imperfections of physical measurements. Roundoff errors arise if one calculates with numbers whose representation is restricted to a finite number of digits, as is usually the case.

As for the third kind of error, many methods will not yield the exact solution of the given problem P, even if the calculations are carried out without rounding, but rather the solution of another simpler problem P̃ which approximates P. For instance, the problem P of summing an infinite series, e.g.,

    e = 1 + 1/1! + 1/2! + 1/3! + ...,

may be replaced by the simpler problem P̃ of summing only up to a finite number of terms of the series. The resulting approximation error is commonly called a truncation error (however, this term is also used for the roundoff-related error committed by deleting any last digit of a number representation). Many approximating problems P̃ are obtained by discretizing the original problem P: definite integrals are approximated by finite sums, differential quotients by difference quotients, etc. In such cases, the approximation error is often referred to as discretization error. Some authors extend the term truncation error to cover discretization errors.

In this chapter, we will examine the general effect of input and roundoff errors on the result of a calculation. Approximation errors will be discussed in later chapters as we deal with individual methods. For a comprehensive treatment of roundoff errors in floating-point computation see Sterbenz (1974).


    1.1 Representation of Numbers

Based on their fundamentally different ways of representing numbers, two categories of computing machinery can be distinguished:

(1) analog computers,
(2) digital computers.

Examples of analog computers are slide rules and mechanical integrators as well as electronic analog computers. When using these devices one replaces numbers by physical quantities, e.g., the length of a bar or the intensity of a voltage, and simulates the mathematical problem by a physical one, which is solved through measurement, yielding a solution for the original mathematical problem as well. The scales of a slide rule, for instance, represent numbers x by line segments of length k·ln x. Multiplication is simulated by positioning line segments contiguously and measuring the combined length for the result.

It is clear that the accuracy of analog devices is directly limited by the physical measurements they employ.

Digital computers express the digits of a number representation by a sequence of discrete physical quantities. Typical instances are desk calculators and electronic digital computers.

Example: 123101

Each digit is represented by a specific physical quantity. Since only a small finite number of different digits have to be encoded (in the decimal number system, for instance, there are only 10 digits), the representation of digits in digital computers need not be quite as precise as the representation of numbers in analog computers. Thus one might tolerate voltages between, say, 7.8 and 8.2 when aiming at a representation of the digit 8 by 8 volts.

Consequently, the accuracy of digital computers is not directly limited by the precision of physical measurements.

For technical reasons, most modern electronic digital computers represent numbers internally in binary rather than decimal form. Here the coefficients or bits α_i of a decomposition by powers of 2 play the role of digits in the representation of a number x:

    x = ±(α_n 2^n + α_{n-1} 2^{n-1} + ... + α_0 2^0 + α_{-1} 2^{-1} + α_{-2} 2^{-2} + ...),
    α_i = 0 or α_i = 1.


In order not to confuse decimal and binary representations of numbers, we denote the bits of a binary number representation by O and L, respectively.

Example. The number x = 18.5 admits the decomposition

    18.5 = 1·2^4 + 0·2^3 + 0·2^2 + 1·2^1 + 0·2^0 + 1·2^{-1}

and has therefore the binary representation

    L00L0.L
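This decomposition is easy to check on a computer; the following one-line Python check (our own illustration, not part of the text) uses float.hex, which displays the normalized binary mantissa and exponent of a double-precision number:

    print((18.5).hex())   # 0x1.2800000000000p+4, i.e. (1 + 2/16 + 8/256) * 2**4 = 18.5 = L00L0.L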

We will use mainly the decimal system, pointing out differences between the two systems whenever it is pertinent to the examination at hand.

As the example 3.999... = 4 shows, the decimal representation of a number may not be unique. The same holds for binary representations. To exclude such ambiguities, we will always refer to the finite representation unless otherwise stated.

In general, digital computers must make do with a fixed finite number of places, the word length, when internally representing a number. This number n is determined by the make of the machine, although some machines have built-in extensions to integer multiples 2n, 3n, ... (double word length, triple word length, ...) of n to offer greater precision if needed. A word length of n places can be used in several different fashions to represent a number. Fixed-point representation specifies a fixed number n1 of places before and a fixed number n2 after the decimal (binary) point, so that n = n1 + n2 (usually n1 = 0 or n1 = n).

Example. For n = 10, n1 = 4, n2 = 6:

    30.421  ->  0030 421000
    0.0437  ->  0000 043700

(the first n1 = 4 places preceding and the last n2 = 6 places following the decimal point).

In this representation, the position of the decimal (binary) point is fixed. A few simple digital devices, mainly for accounting purposes, are still restricted to fixed-point representation. Much more important, in particular for scientific calculations, are digital computers featuring floating-point representation of numbers. Here the decimal (binary) point is not fixed at the outset; rather its position with respect to the first digit is indicated for each number separately. This is done by specifying a so-called exponent. In other words, each real number can be represented in the form

    (1.1.1)    x = a·10^b   (x = a·2^b)   with |a| < 1, b an integer

(say, 30.421 by 0.30421·10^2), where the exponent b indicates the position of the decimal point with respect to the mantissa a. Rutishauser proposed the following semilogarithmic notation, which displays the basis of the


number system at the subscript level and moves the exponent up to the level of the mantissa:

    0.30421_10 2.

Analogously,

    0.L00L0L_2 L0L

denotes the number 18.5 in the binary system. On any digital computer there are, of course, only fixed finite numbers t and e, n = t + e, of places available for the representation of mantissa and exponent, respectively.

Example. For t = 4, e = 2 one would have the floating-point representation

    0.5420_10 04,   or more concisely   5420 04,

for the number 5420 in the decimal system.

The floating-point representation of a number need not be unique. Since 5420 = 0.542·10^4 = 0.0542·10^5, one could also have the floating-point representation

    0.0542_10 05   or   0542 05

instead of the one given in the above example.

A floating-point representation is normalized if the first digit (bit) of the mantissa is different from 0 (O). Then |a| ≥ 10^{-1} (|a| ≥ 2^{-1}) holds in (1.1.1). The significant digits (bits) of a number are the digits of the mantissa not counting leading zeros.

In what follows, we will only consider normalized floating-point representations and the corresponding floating-point arithmetic. The numbers t and e determine, together with the basis B = 10 or B = 2 of the number representation, the set A ⊆ ℝ of real numbers which can be represented exactly within a given machine. The elements of A are called machine numbers.

While normalized floating-point arithmetic is prevalent on current electronic digital computers, unnormalized arithmetic has been proposed to ensure that only truly significant digits are carried [Ashenhurst and Metropolis (1959)].

    1.2 Roundoff Errors and Floating-Point Arithmetic

The set A of numbers which are representable in a given machine is only finite. The question therefore arises of how to approximate a number x ∉ A which is not a machine number by a number g ∈ A which is. This problem is encountered not only when reading data into a computer, but also when representing intermediate results within the computer during the course


of a calculation. Indeed, straightforward examples show that the results of the elementary arithmetic operations x ± y, x × y, x/y need not belong to A, even if both operands x, y ∈ A are machine numbers.

It is natural to postulate that the approximation of any number x ∉ A by a machine number rd(x) ∈ A should satisfy

    (1.2.1)    |x − rd(x)| ≤ |x − g|   for all g ∈ A.

Such a machine-number approximation rd(x) can be obtained in most cases by rounding.

Example 1 (t = 4).

    rd(0.14285·10^0) = 0.1429·10^0,
    rd(3.14159·10^0) = 0.3142·10^1,
    rd(0.142842·10^2) = 0.1428·10^2.

In general, one can proceed as follows in order to find rd(x) for a t-digit computer: x ∉ A is first represented in normalized form x = a·10^b, so that |a| ≥ 10^{-1}. Suppose the decimal representation of |a| is given by

    |a| = 0.α1 α2 ... αi αi+1 ...,   0 ≤ αi ≤ 9,  α1 ≠ 0.

Then one forms

    a' := 0.α1 α2 ... αt              if 0 ≤ αt+1 ≤ 4,
    a' := 0.α1 α2 ... αt + 10^{-t}    if αt+1 ≥ 5,

that is, one increases αt by 1 if the (t+1)st digit αt+1 ≥ 5, and deletes all digits after the tth one. Finally one puts

    rd(x) := sign(x)·a'·10^b.

Since |a| ≥ 10^{-1}, the relative error of rd(x) admits the following bound (Scarborough, 1950):

    |rd(x) − x| / |x| ≤ 5·10^{-(t+1)} / |a| ≤ 5·10^{-t}.

With the abbreviation eps := 5·10^{-t}, this can be written as

    (1.2.2)    rd(x) = x(1 + ε),   where |ε| ≤ eps.
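The rounding rule just described is easy to express in code. The following Python sketch is our own illustration (the function name rd and the default t = 4 are ours): it normalizes x, rounds the mantissa to t decimal digits, and reassembles the result. It ignores exponent limits (e = ∞) and works in binary double precision underneath, so inputs lying exactly on a rounding boundary may behave slightly differently than in exact decimal arithmetic.

    import math

    def rd(x, t=4):
        # round x to t significant decimal digits (a sketch of the map rd above)
        if x == 0.0:
            return 0.0
        b = math.floor(math.log10(abs(x))) + 1        # exponent with 10**(-1) <= |a| < 1
        a = abs(x) / 10.0**b                          # normalized mantissa
        a = math.floor(a * 10.0**t + 0.5) / 10.0**t   # round the mantissa to t digits
        return math.copysign(a * 10.0**b, x)

    print(rd(3.14159), rd(14.2842))   # approximately 0.3142*10**1 and 0.1428*10**2, cf. Example 1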

The quantity eps = 5·10^{-t} is called the machine precision. In the binary system, rd(x) is defined analogously: starting with a decomposition x = a·2^b satisfying 2^{-1} ≤ |a| < 1 and the binary representation of |a|,

    |a| = 0.α1 ... αt αt+1 ...,   αi = O or L,  α1 = L,


one forms

    a' := 0.α1 ... αt             if αt+1 = O,
    a' := 0.α1 ... αt + 2^{-t}    if αt+1 = L,

    rd(x) := sign(x)·a'·2^b.

Again (1.2.2) holds, provided one defines the machine precision by eps := 2^{-t}.

Whenever rd(x) ∈ A is a machine number, then rd has the property (1.2.1) of a correct rounding process, and we may define

    rd̃(x) := rd(x)   for all x with rd(x) ∈ A.

Because only a finite number e of places are available to express the exponent in a floating-point representation, there are unfortunately always numbers x ∉ A with rd(x) ∉ A.

Example 2 (t = 4, e = 2).

    a) rd(0.31794·10^110)  = 0.3179·10^110  ∉ A.
    b) rd(0.99997·10^99)   = 0.1000·10^100  ∉ A.
    c) rd(0.012345·10^-99) = 0.1235·10^-100 ∉ A.
    d) rd(0.54321·10^-110) = 0.5432·10^-110 ∉ A.

In cases (a) and (b) the exponent is too greatly positive to fit the allotted space: these are instances of exponent overflow. Case (b) is particularly pathological: exponent overflow happens only after rounding. Cases (c) and (d) are instances of exponent underflow, i.e., the exponent of the number represented is too greatly negative. In cases (c) and (d) exponent underflow may be prevented by defining

    (1.2.3)    rd̃(0.012345·10^-99)  := 0.0123·10^-99 ∈ A,
               rd̃(0.54321·10^-110) := 0 ∈ A.

But then rd̃ does not satisfy (1.2.2), that is, the relative error of rd̃(x) may exceed eps. Digital computers treat occurrences of exponent overflow and underflow as irregularities of the calculation. In the case of exponent underflow, rd̃(x) may be formed as indicated in (1.2.3). Exponent overflow may cause a halt in calculations. In the remaining regular cases (but not for all makes of computers), rounding is defined by

    rd̃(x) := rd(x).

Exponent overflow and underflow can be avoided to some extent by suitable scaling of the input data and by incorporating special checks and rescalings during computations. Since each different numerical method will require its own special protection techniques, and since overflow and underflow do not


happen very frequently, we will make the idealized assumption that e = ∞ in our subsequent discussions, so that rd̃ := rd does indeed provide a rule for rounding which ensures

    (1.2.4)    rd: ℝ → A,   rd(x) = x(1 + ε)  with |ε| ≤ eps  for all x ∈ ℝ.

In further examples we will, correspondingly, give the length t of the mantissa only. The reader must bear in mind, however, that subsequent statements regarding roundoff errors may be invalid if overflows or underflows are allowed to happen.

We have seen that the results of the arithmetic operations x ± y, x × y, x/y need not be machine numbers, even if the operands x and y are. Thus one cannot expect to reproduce the arithmetic operations exactly on a digital computer. One will have to be content with substitute operations +*, −*, ×*, /*, so-called floating-point operations, which approximate the arithmetic operations as well as possible [v. Neumann and Goldstein (1947)]. Such operations may be defined, for instance, with the help of the rounding map rd as follows:

    (1.2.5)    x +* y := rd(x + y),
               x −* y := rd(x − y),
               x ×* y := rd(x × y),
               x /* y := rd(x / y),        for x, y ∈ A,

so that (1.2.4) implies

    (1.2.6)    x +* y = (x + y)(1 + ε1),
               x −* y = (x − y)(1 + ε2),
               x ×* y = (x × y)(1 + ε3),
               x /* y = (x / y)(1 + ε4),        |εi| ≤ eps.

On many modern computer installations, the floating-point operations +*, −*, ... are not defined by (1.2.5), but instead in such a way that (1.2.6) holds with only a somewhat weaker bound, say |εi| ≤ k·eps, k ≥ 1 being a small integer. Since small deviations from (1.2.6) are not significant for our examinations, we will assume for simplicity that the floating-point operations are in fact defined by (1.2.5) and hence satisfy (1.2.6).

It should be pointed out that the floating-point operations do not satisfy the well-known laws for arithmetic operations. For instance,

    x +* y = x,   if |y| < (eps/B)·|x|,   x, y ∈ A,

where B is the basis of the number system. The machine precision eps could indeed be defined as the smallest positive machine number g for which 1 +* g > 1:


    eps = min{ g ∈ A | 1 +* g > 1 and g > 0 }.
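For the binary double-precision arithmetic of most current machines, this smallest g can be found by a tiny loop; the following Python snippet is our own illustration:

    g = 1.0
    while 1.0 + g / 2.0 > 1.0:   # stop as soon as fl(1 + g/2) = 1
        g /= 2.0
    print(g)                     # 2.220446049250313e-16, i.e. g = 2**(-52) for this arithmetic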

Furthermore, floating-point operations need not be associative or distributive.

Example 3 (t = 8). With

    a := 0.23371258·10^-4,
    b := 0.33678429·10^2,
    c := −0.33677811·10^2,

one has

    a +* (b +* c) = 0.23371258·10^-4 +* 0.61800000·10^-3 = 0.64137126·10^-3,
    (a +* b) +* c = 0.33678452·10^2 − 0.33677811·10^2 = 0.64100000·10^-3.

The exact result is

    a + b + c = 0.641371258·10^-3.
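Example 3 can be replayed with Python's decimal module by restricting the precision to eight significant digits, which mimics a machine with t = 8; this is our own illustration of the effect, not part of the original example:

    from decimal import Decimal, getcontext

    getcontext().prec = 8                  # simulate floating-point arithmetic with t = 8

    a = Decimal('0.23371258E-4')
    b = Decimal('0.33678429E2')
    c = Decimal('-0.33677811E2')

    print(a + (b + c))                     # 0.00064137126  = 0.64137126 * 10**(-3)
    print((a + b) + c)                     # 0.000641       = 0.64100000 * 10**(-3)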

When subtracting two numbers x, y ∈ A of the same sign, one has to watch out for cancellation. This occurs if x and y agree in one or more leading digits with respect to the same exponent, e.g.,

    x = 0.315876·10^1,
    y = 0.314289·10^1.

The subtraction causes the common leading digits to disappear. The exact result x − y is consequently a machine number, so that no new roundoff error arises, x −* y = x − y. In this sense, subtraction in the case of cancellation is a quite harmless operation. We will see in the next section, however, that cancellation is extremely dangerous concerning the propagation of old errors, which stem from the calculation of x and y prior to carrying out the subtraction x − y.

For expressing the result of floating-point calculations, a convenient but slightly imprecise notation has been widely accepted, and we will use it frequently ourselves: If it is clear from the context how to evaluate an arithmetic expression E (if need be, this can be specified by inserting suitable parentheses), then fl(E) denotes the value of the expression as obtained by floating-point arithmetic.

Example 4.

    fl(x × y) := x ×* y,
    fl(a + (b + c)) := a +* (b +* c),
    fl((a + b) + c) := (a +* b) +* c.

We will also use the notation fl(√x), fl(cos(x)), etc., whenever the digital computer approximates the functions √, cos, etc., by substitutes √*, cos*, etc. Thus fl(√x) := √*(x), and so on.


The arithmetic operations +, −, ×, /, together with those basic functions like √, cos, for which floating-point substitutes √*, cos*, etc., have been specified, will be called elementary operations.

    1.3 Error Propagation

We have seen in the previous section (Example 3) that two different but mathematically equivalent methods (a + b) + c, a + (b + c) for evaluating the same expression a + b + c may lead to different results if floating-point arithmetic is used. For numerical purposes it is therefore important to distinguish between different evaluation schemes even if they are mathematically equivalent. Thus we call a finite sequence of elementary operations (as given for instance by consecutive computer instructions) which prescribes how to calculate the solution of a problem from given input data, an algorithm.

We will formalize the notion of an algorithm somewhat. Suppose a problem consists of calculating desired result numbers y1, ..., ym from input numbers x1, ..., xn. If we introduce the vectors

    x = (x1, ..., xn)^T,   y = (y1, ..., ym)^T,

then solving the above problem means determining the value y = φ(x) of a certain multivariate vector function φ: D → ℝ^m, D ⊆ ℝ^n, where φ is given by m real functions φi,

    yi = φi(x1, ..., xn),   i = 1, ..., m.

At each stage of a calculation there is an operand set of numbers, which either are original input numbers xi or have resulted from previous operations. A single operation calculates a new number from one or more elements of the operand set. The new number is either an intermediate or a final result. In any case, it is adjoined to the operand set, which then is purged of all entries that will not be needed as operands during the remainder of the calculation. The final operand set will consist of the desired results y1, ..., ym.

Therefore, an operation corresponds to a transformation of the operand set. Writing consecutive operand sets as vectors,

    x^(i) = (x1^(i), ..., x_{ni}^(i))^T ∈ ℝ^{ni},


we can associate with an elementary operation an elementary map

    φ^(i): Di → ℝ^{ni+1},   Di ⊆ ℝ^{ni},

so that

    φ^(i)(x^(i)) = x^(i+1),

where x^(i+1) is a vector of the transformed operand set. The elementary map φ^(i) is uniquely defined except for inconsequential permutations of x^(i) and x^(i+1) which stem from the arbitrariness involved in arranging the corresponding operand sets in the form of vectors.

Given an algorithm, its sequence of elementary operations then gives rise to a decomposition of φ into a sequence of elementary maps

    (1.3.1)    φ^(i): Di → Di+1,   Dj ⊆ ℝ^{nj},
               φ = φ^(r) ∘ φ^(r−1) ∘ ··· ∘ φ^(0),   D0 = D,   Dr+1 ⊆ ℝ^{nr+1} = ℝ^m,

    which characterize the algorithm.

Example 1. For φ(a, b, c) = a + b + c consider the two algorithms η := a + b, y := c + η, and η := b + c, y := a + η. The decompositions (1.3.1) are

    φ^(0)(a, b, c) := (a + b, c)^T ∈ ℝ²,   φ^(1)(u, v) := u + v ∈ ℝ,

and

    φ^(0)(a, b, c) := (a, b + c)^T ∈ ℝ²,   φ^(1)(u, v) := u + v ∈ ℝ.

Example 2. Since a² − b² = (a + b)(a − b), one has for the calculation of φ(a, b) := a² − b² the two algorithms

    Algorithm 1:  η1 := a × a,        Algorithm 2:  η1 := a + b,
                  η2 := b × b,                      η2 := a − b,
                  y := η1 − η2,                     y := η1 × η2.

Corresponding decompositions (1.3.1) are

    Algorithm 1:
    φ^(0)(a, b) := (a², b)^T,   φ^(1)(u, v) := (u, v²)^T,   φ^(2)(u, v) := u − v.

    Algorithm 2:
    φ^(0)(a, b) := (a, b, a + b)^T,   φ^(1)(a, b, η) := (η, a − b)^T,   φ^(2)(u, v) := u × v.

Note that the decomposition of φ(a, b) := a² − b² corresponding to Algorithm 1 above can be telescoped into a simpler decomposition:


    φ^(0)(a, b) := (a², b²)^T,   φ^(1)(u, v) := u − v.

Strictly speaking, however, the map φ^(0) is not elementary. Moreover, the decomposition does not determine the algorithm uniquely, since there is still a choice, however numerically insignificant, of what to compute first, a² or b².

Hoping to find criteria for judging the quality of algorithms, we will now examine the reasons why different algorithms for solving the same problem generally yield different results. Error propagation, for one, plays a decisive role, as the example of the sum y = a + b + c shows (see Example 3 in Section 1.2). Here floating-point arithmetic yields an approximation ỹ = fl((a + b) + c) to y which, according to (1.2.6), satisfies

    η := fl(a + b) = (a + b)(1 + ε1),
    ỹ := fl(η + c) = (η + c)(1 + ε2)
       = [(a + b)(1 + ε1) + c](1 + ε2)
       = (a + b + c)·[1 + (a + b)/(a + b + c)·ε1·(1 + ε2) + ε2].

For the relative error εy := (ỹ − y)/y of ỹ,

    εy = (a + b)/(a + b + c)·ε1·(1 + ε2) + ε2,

or, disregarding terms of order higher than 1 in the ε's, such as ε1ε2,

    εy ≐ (a + b)/(a + b + c)·ε1 + 1·ε2.

The amplification factors (a + b)/(a + b + c) and 1, respectively, measure the effect of the roundoff errors ε1, ε2 on the error εy of the result. The factor (a + b)/(a + b + c) is critical: depending on whether |a + b| or |b + c| is the smaller of the two, it is better to proceed via (a + b) + c rather than a + (b + c) for computing a + b + c.

In Example 3 of the previous section,

    (a + b)/(a + b + c) = 0.33...·10^2 / 0.64...·10^-3 ≈ 0.5·10^5,
    (b + c)/(a + b + c) = 0.618...·10^-3 / 0.64...·10^-3 ≈ 0.97,

which explains the higher accuracy of fl(a + (b + c)).

The above method of examining the propagation of particular errors while disregarding higher-order terms can be extended systematically to provide a differential error analysis of an algorithm for computing φ(x) if this function is given by a decomposition (1.3.1):


    φ = φ^(r) ∘ φ^(r−1) ∘ ··· ∘ φ^(0).

To this end we must investigate how the input errors Δx of x as well as the roundoff errors accumulated during the course of the algorithm affect the final result y = φ(x). We start this investigation by considering the input errors Δx alone, and we will apply any insights we gain to the analysis of the propagation of roundoff errors. We suppose that the function

    φ: D → ℝ^m,   φ(x) = (φ1(x1, ..., xn), ..., φm(x1, ..., xn))^T,

is defined on an open subset D of ℝ^n, and that its component functions φi, i = 1, ..., m, have continuous first derivatives on D. Let x̃ be an approximate value for x. Then we denote by

    Δxj := x̃j − xj,   Δx := x̃ − x

the absolute error of x̃j and x̃, respectively. The relative error of x̃j is defined as the quantity

    εxj := (x̃j − xj)/xj   if xj ≠ 0.

Replacing the input data x by x̃ leads to the result ỹ := φ(x̃) instead of y = φ(x). Expanding in a Taylor series and disregarding higher-order terms gives

    (1.3.2)    Δyi := ỹi − yi = φi(x̃) − φi(x) ≐ Σ_{j=1..n} (x̃j − xj)·∂φi(x)/∂xj
                    = Σ_{j=1..n} ∂φi(x)/∂xj·Δxj,   i = 1, ..., m,

or in matrix notation,

    (1.3.3)    Δy = (Δy1, ..., Δym)^T ≐ [∂φi(x)/∂xj]·(Δx1, ..., Δxn)^T = Dφ(x)·Δx,

with the Jacobian matrix Dφ(x) := (∂φi(x)/∂xj), i = 1, ..., m, j = 1, ..., n.

The notation ≐ instead of =, which has been used occasionally before, is meant to indicate that the corresponding equations are only a first-order approximation, i.e., they do not take quantities of higher order (in the Δ's or ε's) into account.

The quantity ∂φi(x)/∂xj in (1.3.3) represents the sensitivity with which yi reacts to absolute perturbations Δxj of xj. If yi ≠ 0 for i = 1, ..., m


and xj ≠ 0 for j = 1, ..., n, then a similar error propagation formula holds for relative errors:

    (1.3.4)    εyi ≐ Σ_{j=1..n} (xj/φi(x))·(∂φi(x)/∂xj)·εxj.

Again the quantity (xj/φi)·∂φi/∂xj indicates how strongly a relative error εxj affects the relative error in yi. The amplification factors (xj/φi)·∂φi/∂xj for the relative error have the advantage of not depending on the scales of yi and xj. The amplification factors for relative errors are generally called condition numbers. If any condition numbers are present which have large absolute values, then one speaks of an ill-conditioned problem; otherwise, of a well-conditioned problem. For ill-conditioned problems, small relative errors in the input data x can cause large relative errors in the results y = φ(x).

The above concept of condition numbers suffers from the fact that it is meaningful only for nonzero yi, xj. Moreover, it is impractical for many purposes, since the condition of φ is described by m·n numbers. For these reasons, the conditions of special classes of problems are frequently defined in a more convenient fashion. In linear algebra, for example, it is customary to call numbers c condition numbers if, in conjunction with a suitable norm ‖·‖,

    ‖φ(x̃) − φ(x)‖ / ‖φ(x)‖ ≤ c·‖x̃ − x‖ / ‖x‖

(see Section 4.4).

Example 3. For y = φ(a, b, c) := a + b + c, (1.3.4) gives

    εy ≐ a/(a + b + c)·εa + b/(a + b + c)·εb + c/(a + b + c)·εc.

The problem is well conditioned if every summand a, b, c is small compared to a + b + c.

Example 4. Let y = φ(p, q) := −p + √(p² + q). Then

    ∂φ/∂p = −1 + p/√(p² + q) = −y/√(p² + q),
    ∂φ/∂q = 1/(2√(p² + q)),

so that

    εy ≐ −p/√(p² + q)·εp + q/(2y√(p² + q))·εq
       = −p/√(p² + q)·εp + (p + √(p² + q))/(2√(p² + q))·εq.

Since

    |−p/√(p² + q)| ≤ 1,   |(p + √(p² + q))/(2√(p² + q))| ≤ 1   for q > 0,


φ is well conditioned if q > 0, and badly conditioned if q ≈ −p².
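Both amplification factors can be checked numerically. The following Python sketch (our own illustration; the values of p and q are chosen arbitrarily) compares the analytic condition numbers of Example 4 with difference quotients of the relative errors:

    import math

    def phi(p, q):
        return -p + math.sqrt(p * p + q)

    p, q = 2.0, 3.0
    y = phi(p, q)
    h = 1.0e-7   # size of the relative perturbation

    # analytic amplification factors (x_j / phi) * d(phi)/d(x_j), cf. (1.3.4)
    cond_p = -p / math.sqrt(p * p + q)
    cond_q = (p + math.sqrt(p * p + q)) / (2.0 * math.sqrt(p * p + q))

    # difference quotients: relative change of y per relative change of p resp. q
    cond_p_num = (phi(p * (1 + h), q) - y) / y / h
    cond_q_num = (phi(p, q * (1 + h)) - y) / y / h

    print(cond_p, cond_p_num)
    print(cond_q, cond_q_num)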

For the arithmetic operations, (1.3.4) specializes to (x ≠ 0, y ≠ 0):

    (1.3.5)    φ(x, y) := x·y:    εx·y ≐ εx + εy,
               φ(x, y) := x/y:    εx/y ≐ εx − εy,
               φ(x, y) := x ± y:  εx±y = x/(x ± y)·εx ± y/(x ± y)·εy,   if x ± y ≠ 0,
               φ(x) := √x:        ε√x ≐ (1/2)·εx.

It follows that multiplication, division, and square root are not dangerous: the relative errors of the operands don't propagate strongly into the result. This is also the case for addition, provided the operands x and y have the same sign. Indeed, the condition numbers x/(x + y), y/(x + y) then lie between 0 and 1, and they add up to 1, whence

    |εx+y| ≤ max{ |εx|, |εy| }.

If one operand is small compared to the other, but carries a large relative error, the result x + y will still have a small relative error so long as the other operand has only a small relative error: error damping results. If, however, two operands of different sign are to be added, then at least one of the factors

    |x/(x + y)|,   |y/(x + y)|

is bigger than 1, and at least one of the relative errors εx, εy will be amplified. This amplification is drastic if x ≈ −y holds and therefore cancellation occurs.

We will now employ the formula (1.3.3) to describe the propagation of roundoff errors for a given algorithm. An algorithm for computing the function φ: D → ℝ^m, D ⊆ ℝ^n, for a given x = (x1, ..., xn)^T ∈ D corresponds to a decomposition of the map φ into elementary maps φ^(i) [see (1.3.1)], and leads from x^(0) := x via a chain of intermediate results

    (1.3.6)    x = x^(0) → φ^(0)(x^(0)) = x^(1) → ··· → φ^(r)(x^(r)) = x^(r+1) = y

to the result y. Again we assume that every φ^(i) is continuously differentiable on Di.

Now let us denote by ψ^(i) the remainder map

    ψ^(i) := φ^(r) ∘ φ^(r−1) ∘ ··· ∘ φ^(i): Di → ℝ^m,   i = 0, 1, 2, ..., r.

Then ψ^(0) = φ. Dφ^(i) and Dψ^(i) are the Jacobian matrices of the maps φ^(i) and ψ^(i). Since Jacobian matrices are multiplicative with respect to function composition,


    D(f ∘ g)(x) = Df(g(x))·Dg(x),

we note for further reference that for i = 0, 1, ..., r

    (1.3.7)    Dφ(x) = Dφ^(r)(x^(r))·Dφ^(r−1)(x^(r−1)) ··· Dφ^(0)(x),
               Dψ^(i)(x^(i)) = Dφ^(r)(x^(r))·Dφ^(r−1)(x^(r−1)) ··· Dφ^(i)(x^(i)).

With floating-point arithmetic, input and roundoff errors will perturb the intermediate (exact) results x^(i), so that approximate values x̃^(i) with x̃^(i+1) = fl(φ^(i)(x̃^(i))) will be obtained instead. For the absolute errors

    Δx^(i) := x̃^(i) − x^(i),

    (1.3.8)    Δx^(i+1) = [fl(φ^(i)(x̃^(i))) − φ^(i)(x̃^(i))] + [φ^(i)(x̃^(i)) − φ^(i)(x^(i))].

By (1.3.3) (disregarding higher-order error terms),

    (1.3.9)    φ^(i)(x̃^(i)) − φ^(i)(x^(i)) ≐ Dφ^(i)(x^(i))·Δx^(i).

If φ^(i) is an elementary map, or if it involves only independent elementary operations, the floating-point evaluation of φ^(i) will yield the rounding of the exact value:

    (1.3.10)    fl(φ^(i)(u)) = rd(φ^(i)(u)).

Note, in this context, that the map φ^(i): Di → Di+1 ⊆ ℝ^{ni+1} is actually a vector of component functions φj^(i): Di → ℝ,

    φ^(i)(u) = (φ1^(i)(u), ..., φ_{ni+1}^(i)(u))^T.

Thus (1.3.10) must be interpreted componentwise:

    (1.3.11)    fl(φj^(i)(u)) = rd(φj^(i)(u)) = (1 + εj)·φj^(i)(u),   |εj| ≤ eps,  j = 1, 2, ..., ni+1.

Here εj is the new relative roundoff error generated during the calculation of the jth component of φ^(i) in floating-point arithmetic. Plainly, (1.3.10) can also be written in the form

    fl(φ^(i)(u)) = (I + Ei+1)·φ^(i)(u)

with the identity matrix I and the diagonal error matrix


    Ei+1 := diag(ε1, ε2, ..., ε_{ni+1}),   |εj| ≤ eps.

This yields the following expression for the first bracket in (1.3.8):

    fl(φ^(i)(x̃^(i))) − φ^(i)(x̃^(i)) = Ei+1·φ^(i)(x̃^(i)).

Furthermore Ei+1·φ^(i)(x̃^(i)) ≐ Ei+1·φ^(i)(x^(i)), since the error terms by which φ^(i)(x̃^(i)) and φ^(i)(x^(i)) differ are multiplied by the error terms on the diagonal of Ei+1, giving rise to higher-order error terms. Therefore

    (1.3.12)    fl(φ^(i)(x̃^(i))) − φ^(i)(x̃^(i)) ≐ Ei+1·φ^(i)(x^(i)) = Ei+1·x^(i+1) =: αi+1.

The quantity αi+1 can be interpreted as the absolute roundoff error newly created when φ^(i) is evaluated in floating-point arithmetic, and the diagonal elements of Ei+1 can be similarly interpreted as the corresponding relative roundoff errors. Thus by (1.3.8), (1.3.9), and (1.3.12), Δx^(i+1) can be expressed in first-order approximation as follows:

    Δx^(i+1) ≐ αi+1 + Dφ^(i)(x^(i))·Δx^(i) = Ei+1·x^(i+1) + Dφ^(i)(x^(i))·Δx^(i),
    i ≥ 0,   Δx^(0) := Δx.

Consequently

    Δx^(1) ≐ Dφ^(0)·Δx + α1,
    Δx^(2) ≐ Dφ^(1)·[Dφ^(0)·Δx + α1] + α2,
    ...
    Δy = Δx^(r+1) ≐ Dφ^(r) ··· Dφ^(0)·Δx + Dφ^(r) ··· Dφ^(1)·α1 + ··· + αr+1.

In view of (1.3.7), we finally arrive at the following formulas which describe the effect of the input errors Δx and the roundoff errors αi on the result y = x^(r+1) = φ(x):

    (1.3.13)    Δy ≐ Dφ(x)·Δx + Dψ^(1)(x^(1))·α1 + ··· + Dψ^(r)(x^(r))·αr + αr+1
                   = Dφ(x)·Δx + Dψ^(1)(x^(1))·E1·x^(1) + ··· + Dψ^(r)(x^(r))·Er·x^(r) + Er+1·y.

It is therefore the size of the Jacobian matrix Dψ^(i) of the remainder map ψ^(i) which is critical for the effect of the intermediate roundoff errors αi or Ei on the final result.


    Example 5. For the two algorithms for computing y = (a, b) = a2 b2 givenin Example 2 we have for Algorithm 1:

    x = x(0) =

    [a

    b

    ], x(1) =

    [a2

    b

    ], x(2) =

    [a2

    b2

    ], x(3) = y = a2 b2,

    (1)(u, v) = u v2, (2)(u, v) = u v,D(x) = (2a,2b), D(1)(x(1)) = (1,2b), D(2)(x(2)) = (1,1).

    Moreover

    1 =

    [1a

    2

    0

    ], E1 =

    [1 00 0

    ], |1| eps,

    because offl((0)(x(0))) (0)(x(0)) =

    [a ab

    ][a2

    b

    ],

    and likewise

    2 =

    [02b2

    ], E2 =

    [0 00 2

    ], 3 = 3(a2 b2), |i| eps for i = 2, 3.

From (1.3.13) with Δx = (Δa, Δb)^T,

    (1.3.14)    Δy ≐ 2a·Δa − 2b·Δb + a²·ε1 − b²·ε2 + (a² − b²)·ε3.

    Analogously for Algorithm 2:

    x = x(0) =

    [a

    b

    ], x(1) =

    [a+ ba b

    ], x(2) = y = a2 b2,

    (1)(u, v) = u v,

    D(x) = (2a,2b), D(1)(x(1)) = (a b, a+ b),

    1 =[1(a+ b)2(a b)

    ], 2 = 3(a2 b3), E1 =

    [1 00 2

    ], |i| eps,

    and therefore (1.3.13) again yields

    (1.3.15)    Δy ≐ 2a·Δa − 2b·Δb + (a² − b²)·(ε1 + ε2 + ε3).

If one selects a different algorithm for calculating the same result φ(x) (in other words, a different decomposition of φ into elementary maps), then Dφ remains unchanged; the Jacobian matrices Dψ^(i), which measure the propagation of roundoff, will be different, however, and so will be the total effect of rounding,

    (1.3.16)    Dψ^(1)·α1 + ··· + Dψ^(r)·αr + αr+1.


An algorithm is called numerically more trustworthy than another algorithm for calculating φ(x) if, for a given set of data x, the total effect of rounding, (1.3.16), is less for the first algorithm than for the second one.

Example 6. The total effect of rounding using Algorithm 1 in Example 2 is, by (1.3.14),

    (1.3.17)    |a²·ε1 − b²·ε2 + (a² − b²)·ε3| ≤ (a² + b² + |a² − b²|)·eps,

and that of Algorithm 2, by (1.3.15),

    (1.3.18)    |(a² − b²)·(ε1 + ε2 + ε3)| ≤ 3·|a² − b²|·eps.

Algorithm 2 is numerically more trustworthy than Algorithm 1 whenever 1/3 ≤ a²/b² ≤ 3, since then 3·|a² − b²| ≤ a² + b² + |a² − b²|.

1.4 Examples

Example 1. For given p > 0, q > 0, p ≫ q, determine the root

    y = −p + √(p² + q)

with smallest absolute value of the quadratic equation

    y² + 2py − q = 0.

Input data: p, q. Result: y = φ(p, q) = −p + √(p² + q).

The problem was seen to be well conditioned for p > 0, q > 0. It was also shown that the relative input errors εp, εq make the following contribution to the relative error of the result y = φ(p, q):

    −p/√(p² + q)·εp + q/(2y√(p² + q))·εq = −p/√(p² + q)·εp + (p + √(p² + q))/(2√(p² + q))·εq.

Since

    |−p/√(p² + q)| ≤ 1,   |(p + √(p² + q))/(2√(p² + q))| ≤ 1,

the inherent error Δ^(0)y satisfies

    eps ≤ |ε_y^(0)| := |Δ^(0)y / y| ≤ 3·eps.

We will now consider two algorithms for computing y = φ(p, q).

    Algorithm 1:   s := p²,
                   t := s + q,
                   u := √t,
                   y := −p + u.

Obviously, p ≫ q causes cancellation when y := −p + u is evaluated, and it must therefore be expected that the roundoff error

    Δu := ε·√t = ε·√(p² + q),

generated during the floating-point calculation of the square root,

    fl(√t) = √t·(1 + ε),   |ε| ≤ eps,

will be greatly amplified. Indeed, the above error contributes the following term to the relative error of y:

    (1/y)·Δu = ε·√(p² + q)/(−p + √(p² + q)) = ε·(1/q)·(p·√(p² + q) + p² + q) = k·ε.


Since p, q > 0, the amplification factor k admits the following lower bound:

    k > 2p²/q > 0,

which is large, since p ≫ q by hypothesis. Therefore, the proposed algorithm is not numerically stable, because the influence of the rounding of √(p² + q) alone exceeds that of the inherent error ε_y^(0) by an order of magnitude.

    Algorithm 2:   s := p²,
                   t := s + q,
                   u := √t,
                   v := p + u,
                   y := q/v.

This algorithm does not cause cancellation when calculating v := p + u. The roundoff error Δu = ε·√(p² + q), which stems from rounding √(p² + q), will be amplified according to the remainder map ψ(u):

    u → p + u → q/(p + u) =: ψ(u).

Thus it contributes the following term to the relative error of y:

    (1/y)·(∂ψ/∂u)·Δu = −(q/(y·(p + u)²))·Δu
                     = −q·ε·√(p² + q) / ((−p + √(p² + q))·(p + √(p² + q))²)
                     = −ε·√(p² + q)/(p + √(p² + q)) = k·ε.

The amplification factor k remains small; indeed, |k| < 1, and Algorithm 2 is therefore numerically stable.

The following numerical results illustrate the difference between Algorithms 1 and 2. They were obtained using floating-point arithmetic of 40 binary mantissa places (about 13 decimal places), as will be the case in subsequent numerical examples.

    p = 1000, q = 0.018 000 000 081

Result y according to Algorithm 1:  0.900 030 136 108 · 10^-5
Result y according to Algorithm 2:  0.899 999 999 999 · 10^-5
Exact value of y:                   0.900 000 000 000 · 10^-5
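The same contrast shows up in ordinary double precision; the following Python sketch (our own illustration of the two algorithms above) uses the data of the numerical example:

    import math

    p, q = 1000.0, 0.018000000081

    u = math.sqrt(p * p + q)
    y1 = -p + u        # Algorithm 1: cancellation in -p + u
    y2 = q / (p + u)   # Algorithm 2: no cancellation

    print(y1)          # only roughly half of the digits agree with the exact value 0.9 * 10**(-5)
    print(y2)          # agrees with the exact value to nearly full machine precision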

Example 2. For given fixed x, the value of cos kx may be computed recursively using, for m = 1, 2, ..., k − 1, the formula

    cos(m + 1)x = 2·cos x·cos mx − cos(m − 1)x.


In this case, a trigonometric-function evaluation has to be carried out only once, to find c = cos x. Now let |x| ≠ 0 be a small number. The calculation of c causes a small roundoff error:

    c̃ = (1 + ε)·cos x,   |ε| ≤ eps.

How does this roundoff error affect the calculation of cos kx? Now cos kx can be expressed in terms of c: cos kx = cos(k arccos c) =: f(c). Since

    df/dc = (k·sin kx)/(sin x),

the error ε·cos x of c̃ causes, to first approximation, an absolute error

    (1.4.1)    Δcos kx ≐ (ε·cos x/sin x)·k·sin kx = ε·k·cot x·sin kx

in cos kx.

On the other hand, the inherent error Δ^(0)c_k (1.3.19) of the result c_k := cos kx is

    Δ^(0)c_k = [k·|x sin kx| + |cos kx|]·eps.

Comparing this with (1.4.1) shows that Δcos kx may be considerably larger than Δ^(0)c_k for small |x|; hence the algorithm is not numerically stable.

Example 3. For given x and a large positive integer k, the numbers cos kx and sin kx are to be computed recursively using

    cos mx = cos x·cos(m − 1)x − sin x·sin(m − 1)x,
    sin mx = sin x·cos(m − 1)x + cos x·sin(m − 1)x,   m = 1, 2, ..., k.

    How do small errors c cosx, s sinx in the calculation of cosx, sinx affect thefinal results cos kx, sin kx ? Abbreviating cm := cosmx, sm := sinmx, c := cosx,s := sinx, and putting

    U :=

    [c ss c

    ],

    we have [cmsm

    ]= U

    [cm1sm1

    ], m = 1, . . . , k.

    Here U is a unitary matrix, which corresponds to a rotation by the angle x.Repeated application of the formula above gives[

    cksk

    ]= Uk

    [c0s0

    ]= Uk

    [10

    ].

    NowU

    c=

    [1 00 1

    ],

    U

    s=

    [0 11 0

    ]=: A,

    and therefore

    cUk = k Uk1,

    sUk = AUk1 + UAUk2 + + Uk1A

    = kAUk1,


    because A commutes with U . Since U describes a rotation in IR2 by the angle x,

    cUk = k

    [cos(k 1)x sin(k 1)xsin(k 1)x cos(k 1)x

    ],

    sUk = k

    [ sin(k 1)x cos(k 1)xcos(k 1)x sin(k 1)x

    ].

    The relative errors c, s of c = cosx, s = sinx effect the following absolute errorsof cos kx, sin kx:

    (1.4.2)

    [cksk

    ].=[

    cUk] [1

    0

    ] c cosx+

    [

    sUk] [1

    0

    ] s sinx

    = ck cosx

    [cos(k 1)xsin(k 1)x

    ]+ sk sinx

    [ sin(k 1)x

    cos(k 1)x

    ].

    The inherent errors (0)ck and (0)sk of ck = cos kx and sk = sin kx, respec-tively, are given by

    (1.4.3)(0)ck = [k|x sin kx| + | cos kx|] eps,(0)sk = [k|x cos kx| + | sin kx|] eps .

    Comparison of (1.4.2) and (1.4.3) reveals that for big k and |kx| 1 the influenceof the roundoff error c is considerably bigger than the inherent errors, whilethe roundoff error s is harmless.The algorithm is not numerically stable, albeitnumerically more trustworthy than the algorithm of Example 2 as far as thecomputation of ck alone is concerned.Example 4. For small |x|, the recursive calculation of

    cm = cosmx, sm = sinmx, m = 1, 2, . . . ,

    based oncos(m+ 1)x = cosx cosmx sinx sinmx,sin(m+ 1)x = sinx cosmx+ cosx sinmx,

    as in Example 3, may be further improved numerically. To this end, we expressthe differences dsm+1 and dcm+1 of subsequent sine and cosine values as follows:

    dcm+1 : = cos(m+ 1)x cosmx= 2(cosx 1) cosmx sinx sinmx cosx cosmx+ cosmx

    = 4(

    sin2x

    2

    )cosmx+ [cosmx cos(m 1)x],

    dsm+1 : = sin(m+ 1)x sinmx= 2(cosx 1) sinmx+ sinx cosmx cosx sinmx+ sinmx

    = 4(

    sin2x

    2

    )sinmx+ [sinmx sin(m 1)x].

    This leads to a more elaborate recursive algorithm for computing ck, sk in thecase x > 0:


    dc1 := −2·sin²(x/2),   t := 2·dc1,
    ds1 := √(−dc1·(2 + dc1)),
    s0 := 0,   c0 := 1,

and for m := 1, 2, ..., k:

    cm := cm−1 + dcm,   dcm+1 := t·cm + dcm,
    sm := sm−1 + dsm,   dsm+1 := t·sm + dsm.

    For the error analysis, note that ck and sk are functions of s = sin(x/2):

    ck = cos(2k arcsin s) =: 1(s),sk = sin(2k arcsin s) =: 2(s).

    An error s = s sin(x/2) in the calculation of s therefore causes to a first-orderapproximation the following errors in ck:

    1s

    s sinx

    2= s

    2k sin kxcos(x/2)

    sinx

    2

    = 2k tan x2

    sin kx s,

    and in sk:2s

    s sinx

    2= 2k tan

    x

    2cos kx s.

    Comparison with the inherent errors (1.4.3) shows these errors to be harmless forsmall |x|. The algorithm is then numerically stable, at least as far as the influenceof the roundoff error s is concerned.

Again we illustrate our analytical considerations with some numerical results. Let x = 0.001, k = 1000.

    Algorithm     Result for cos kx          Relative error
    Example 2     0.540 302 121 124          0.34 · 10^-6
    Example 3     0.540 302 305 776          0.17 · 10^-9
    Example 4     0.540 302 305 865          0.58 · 10^-11
    Exact value   0.540 302 305 868 140 ...
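The three recursions of Examples 2 to 4 can also be compared in ordinary double precision; the following Python sketch (our own illustration) runs them side by side for x = 0.001, k = 1000 as in the table, and the loss of accuracy of the first recursion is still clearly visible:

    import math

    x, k = 0.001, 1000
    c, s = math.cos(x), math.sin(x)

    # Example 2: cos(m+1)x = 2 cos x cos mx - cos(m-1)x
    cm1, cm = 1.0, c
    for m in range(1, k):
        cm1, cm = cm, 2.0 * c * cm - cm1

    # Example 3: rotation (c_m, s_m) = U (c_{m-1}, s_{m-1})
    cr, sr = 1.0, 0.0
    for m in range(k):
        cr, sr = c * cr - s * sr, s * cr + c * sr

    # Example 4: accumulate the differences dc_m, ds_m driven by sin(x/2)
    dc = -2.0 * math.sin(x / 2.0) ** 2
    t = 2.0 * dc
    ds = math.sqrt(-dc * (2.0 + dc))
    cd, sd = 1.0, 0.0
    for m in range(k):
        cd, sd = cd + dc, sd + ds
        dc, ds = t * cd + dc, t * sd + ds

    print(cm, cr, cd, math.cos(k * x))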

Example 5. We will derive some results which will be useful for the analysis of algorithms for solving linear equations in Section 4.5. Given the quantities c, a1, ..., an, b1, ..., bn−1 with an ≠ 0, we want to find the solution βn of the linear equation

    (1.4.4)    c − a1·b1 − ··· − an−1·bn−1 − an·βn = 0.

Floating-point arithmetic yields the approximate solution

    (1.4.5)    bn = fl((c − a1·b1 − ··· − an−1·bn−1)/an)

as follows:

    s0 := c;
    for j := 1, 2, ..., n − 1:


    (1.4.6)    sj := fl(sj−1 − aj·bj) = (sj−1 − aj·bj·(1 + μj))·(1 + αj),
               bn := fl(sn−1/an) = (1 + δ)·sn−1/an,

with |μj|, |αj|, |δ| ≤ eps. If an = 1, as is frequently the case in applications, then δ = 0, since bn := sn−1.
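A direct transcription of this loop into Python might look as follows (a sketch; the function name and the example data are ours). It also forms the residual r to which the estimates below refer:

    def solve_last_unknown(c, a, b):
        # s_0 := c; s_j := s_{j-1} - a_j * b_j, j = 1, ..., n-1; b_n := s_{n-1} / a_n,
        # i.e. the floating-point approximation to beta_n in (1.4.4)
        s = c
        for aj, bj in zip(a[:-1], b):
            s -= aj * bj
        return s / a[-1]

    a = [2.0, 3.0, 4.0]            # a_1, ..., a_n with a_n != 0
    b = [1.0, 1.0]                 # b_1, ..., b_{n-1}
    c = 10.0
    bn = solve_last_unknown(c, a, b)
    r = c - sum(ai * bi for ai, bi in zip(a, b + [bn]))   # residual c - a_1 b_1 - ... - a_n b_n
    print(bn, r)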

    We will now describe two useful estimates for the residual

    r := c a1b1 . . . anbn

From (1.4.6) follow the equations

    s0 − c = 0,

    sj − (s_{j−1} − ajbj) = sj − ( sj/(1 + αj) + ajbj·μj ) = sj·αj/(1 + αj) − ajbj·μj,   j = 1, 2, . . . , n − 1,

    anbn − s_{n−1} = δ·s_{n−1}.

Summing these equations yields

    r = c − Σ_{i=1}^{n} aibi = −Σ_{j=1}^{n−1} ( sj·αj/(1 + αj) − ajbj·μj ) − δ·s_{n−1}

and thereby the first one of the promised estimates

(1.4.7)    |r| ≤ eps/(1 − eps) · [ σ·|s_{n−1}| + Σ_{j=1}^{n−1} ( |sj| + |ajbj| ) ],

           σ := { 0 if an = 1, 1 otherwise }.
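As an illustration of (1.4.6) and (1.4.7) (our own sketch, not part of the original text; the function name and the test data are ours), the following Python function performs the elimination step and accumulates the bound; the exact residual is formed with rational arithmetic so that the comparison is not itself subject to roundoff:

    import sys
    from fractions import Fraction

    def solve_and_bound(c, a, b, eps=sys.float_info.epsilon / 2):
        """b_n = fl((c - a_1 b_1 - ... - a_{n-1} b_{n-1}) / a_n) as in (1.4.6),
        together with the residual bound (1.4.7); a has length n, b length n - 1."""
        s = c
        terms = 0.0
        for aj, bj in zip(a[:-1], b):
            s = s - aj * bj                     # s_j, computed in floating point
            terms += abs(s) + abs(aj * bj)      # |s_j| + |a_j b_j|
        sigma = 0.0 if a[-1] == 1.0 else 1.0
        bn = s / a[-1]
        bound = eps / (1.0 - eps) * (sigma * abs(s) + terms)
        return bn, bound

    a, b, c = [2.0, -1.0, 3.0], [0.3, 0.7], 1.0
    bn, bound = solve_and_bound(c, a, b)
    # exact residual r = c - a_1 b_1 - a_2 b_2 - a_3 b_n, evaluated without roundoff
    r = Fraction(c) - sum(Fraction(ai) * Fraction(bi) for ai, bi in zip(a, b + [bn]))
    print(bn, float(r), bound)

For these data the exact residual indeed stays below the bound, as (1.4.7) predicts.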

The second estimate is cruder than (1.4.7). (1.4.6) gives

(1.4.8)    bn = [ c·Π_{k=1}^{n−1}(1 + αk) − Σ_{j=1}^{n−1} ajbj(1 + μj)·Π_{k=j}^{n−1}(1 + αk) ] · (1 + δ)/an,

which can be solved for c:

(1.4.9)    c = Σ_{j=1}^{n−1} ajbj(1 + μj)·Π_{k=1}^{j−1}(1 + αk)^{−1} + anbn(1 + δ)^{−1}·Π_{k=1}^{n−1}(1 + αk)^{−1}.

A simple induction argument over m shows that

    (1 + η) = Π_{k=1}^{m} (1 + ηk)^{±1},   |ηk| ≤ eps,   m·eps < 1,

implies

    |η| ≤ m·eps / (1 − m·eps).


In view of (1.4.9) this ensures the existence of quantities εj with

(1.4.10)    c = Σ_{j=1}^{n−1} ajbj(1 + j·εj) + anbn(1 + (n − 1 + σ)·εn),

            |εj| ≤ eps/(1 − n·eps),    σ := { 0 if an = 1, 1 otherwise }.

For r = c − a1b1 − a2b2 − ⋯ − anbn we have consequently

(1.4.11)    |r| ≤ eps/(1 − n·eps) · [ Σ_{j=1}^{n−1} j·|ajbj| + (n − 1 + σ)·|anbn| ].

In particular, (1.4.8) reveals the numerical stability of our algorithm for computing βn. The roundoff error αm contributes the amount

    ( (c − a1b1 − a2b2 − ⋯ − ambm) / an ) · αm

to the absolute error in βn. This, however, is at most equal to

    ( |c| + Σ_{i=1}^{m} |aibi| ) · eps / |an|,

which is no more than the maximal influence

    ( c·εc − a1b1·ε_{a1} − ⋯ − ambm·ε_{am} ) / an,   |εc|, |ε_{ai}| ≤ eps,

of the input errors εc and ε_{ai} of c and ai, i = 1, . . . , m, respectively. The remaining roundoff errors μk and δ are similarly shown to be harmless.

The numerical stability of the above algorithm is often shown by interpreting (1.4.10) in the sense of backward analysis: the computed approximate solution bn is the exact solution of the equation

    c − ā1·b1 − ⋯ − ān·bn = 0,

whose coefficients

    āj := aj(1 + j·εj),   1 ≤ j ≤ n − 1,
    ān := an(1 + (n − 1 + σ)·εn),

have changed only slightly from their original values aj. This kind of analysis, however, involves the difficulty of having to define how large n can be so that errors of the form n·ε, |ε| ≤ eps, can still be considered as being of the same order of magnitude as the machine precision eps.

1.5 Interval Arithmetic; Statistical Roundoff Estimation

The effect of a few roundoff errors can be quite readily estimated, to a first-order approximation, by the methods of Section 1.3. For a typical


numerical method, however, the number of the arithmetic operations, and consequently the number of individual roundoff errors, is very large, and the corresponding algorithm is too complicated to permit the estimation of the total effect of all roundoff errors in this fashion.

A technique known as interval arithmetic offers an approach to determining exact upper bounds for the absolute error of an algorithm, taking into account all roundoff and data errors. Interval arithmetic is based on the realization that the exact values of all real numbers a ∈ ℝ which either enter an algorithm or are computed as intermediate or final results are usually not known. At best one knows small intervals which contain a. For this reason, the interval-arithmetic approach is to calculate systematically in terms of such intervals

    ã = [a′, a″],

bounded by machine numbers a′, a″ ∈ A, rather than in terms of single real numbers a. Each unknown number a is represented by an interval ã = [a′, a″] with a′ ≤ a ≤ a″. The arithmetic operations ⊡ ∈ {⊕, ⊖, ⊗, ⊘} between intervals are defined so as to be compatible with the above interpretation. That is, c̃ := ã ⊡ b̃ is defined as an interval (as small as possible) satisfying

    c̃ ⊇ { a ⊡ b | a ∈ ã and b ∈ b̃ }

and having machine number endpoints. In the case of addition, for instance, this holds if ⊕ is defined as follows:

    [c′, c″] := [a′, a″] ⊕ [b′, b″]

where

    c′ := max{ γ ∈ A | γ ≤ a′ + b′ },
    c″ := min{ γ ∈ A | γ ≥ a″ + b″ },

with A denoting again the set of machine numbers. In the case of multiplication ⊗, assuming, say, a′ > 0, b′ > 0,

    [c′, c″] := [a′, a″] ⊗ [b′, b″]

can be defined by letting

    c′ := max{ γ ∈ A | γ ≤ a′·b′ },
    c″ := min{ γ ∈ A | γ ≥ a″·b″ }.

Replacing, in these and similar fashions, every quantity by an interval and every arithmetic operation by its corresponding interval operation (this is readily implemented on computers), we obtain interval algorithms which produce intervals guaranteed to contain the desired exact solutions. The data for these interval algorithms will again be intervals, chosen to allow for data errors.
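A minimal sketch of such machine-interval operations in Python (our own illustration, not part of the text) is shown below. Instead of the exact max/min over the machine numbers A, each endpoint is computed in ordinary floating point and then pushed outward by one unit in the last place with math.nextafter (available from Python 3.9 on); this is slightly more pessimistic than directed rounding but still yields guaranteed enclosures:

    import math

    def iv_add(a, b):
        # [a', a''] (+) [b', b'']: add endpoints, then round outward
        return (math.nextafter(a[0] + b[0], -math.inf),
                math.nextafter(a[1] + b[1], math.inf))

    def iv_sub(a, b):
        return (math.nextafter(a[0] - b[1], -math.inf),
                math.nextafter(a[1] - b[0], math.inf))

    def iv_mul(a, b):
        # general case: take min/max over all endpoint products, then round outward
        p = [a[i] * b[j] for i in (0, 1) for j in (0, 1)]
        return (math.nextafter(min(p), -math.inf),
                math.nextafter(max(p), math.inf))

    x = (0.9, 1.1)
    print(iv_mul(iv_add(x, (3.0, 3.0)), x))   # encloses (x + 3)*x for all x in [0.9, 1.1]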


It has been found, however, that an uncritical utilization of interval arithmetic techniques leads to error bounds which, while certainly reliable, are in most cases much too pessimistic. It is not enough to simply substitute interval operations for arithmetic operations without taking into account how the particular roundoff or data errors enter into the respective results. For example, it happens quite frequently that a certain roundoff error ε impairs some intermediate results u1, . . . , un of an algorithm considerably,

    |∂ui/∂ε| ≫ 1   for i = 1, . . . , n,

while the final result y = f(u1, . . . , un) is not strongly affected,

    |∂y/∂ε| ≪ 1,

even though it is calculated from the highly inaccurate intermediate values u1, . . . , un: the algorithm shows error damping.

Example 1. Evaluate y = φ(x) = x³ − 3x² + 3x = ((x − 3)·x + 3)·x using Horner's scheme:

    u := x − 3,
    v := u·x,
    w := v + 3,
    y := w·x.

The value x is known to lie in the interval

    x ∈ x̃ := [0.9, 1.1].

Starting with this interval and using straight interval arithmetic, we find

    ũ = x̃ ⊖ [3, 3] = [−2.1, −1.9],
    ṽ = ũ ⊗ x̃ = [−2.31, −1.71],
    w̃ = ṽ ⊕ [3, 3] = [0.69, 1.29],
    ỹ = w̃ ⊗ x̃ = [0.621, 1.419].

The interval ỹ is much too large compared to the interval

    { φ(x) | x ∈ x̃ } = [0.999, 1.001],

which describes the actual effect of an error in x on φ(x).

Example 2. Using just ordinary 2-digit arithmetic gives considerably more accurate results than the interval arithmetic suggests:

          x = 0.9    x = 1.1
    u      −2.1       −1.9
    v      −1.9       −2.1
    w       1.1        0.9
    y       0.99       0.99


For the successful application of interval arithmetic, therefore, it is not sufficient merely to replace the arithmetic operations of commonly used algorithms by interval operations: It is necessary to develop new algorithms producing the same final results but having an improved error-dependence pattern for the intermediate results.

Example 3. In Example 1 a simple transformation of φ(x) suffices:

    y = φ(x) = 1 + (x − 1)³.

When applied to the corresponding evaluation algorithm and the same starting interval x̃ = [0.9, 1.1], interval arithmetic now produces the optimal result:

    ũ1 := x̃ ⊖ [1, 1] = [−0.1, 0.1],
    ũ2 := ũ1 ⊗ ũ1 = [−0.01, 0.01],
    ũ3 := ũ2 ⊗ ũ1 = [−0.001, 0.001],
    ỹ := ũ3 ⊕ [1, 1] = [0.999, 1.001].

As far as ordinary arithmetic is concerned, there is not much difference between the two evaluation algorithms of Example 1 and Example 3. Using two digits again, the results are practically identical to those in Example 2:

          x = 0.9    x = 1.1
    u1     −0.1        0.1
    u2      0.01       0.01
    u3     −0.001      0.001
    y       1.0        1.0
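The widening seen in Example 1 and its disappearance after the transformation of Example 3 can be reproduced with a few lines of Python (our own illustration, not part of the text). Exact rational endpoints are used here, so no outward rounding is required:

    from fractions import Fraction as F

    def add(a, b): return (a[0] + b[0], a[1] + b[1])
    def sub(a, b): return (a[0] - b[1], a[1] - b[0])
    def mul(a, b):
        p = [a[i] * b[j] for i in (0, 1) for j in (0, 1)]
        return (min(p), max(p))

    x = (F(9, 10), F(11, 10))                     # the interval [0.9, 1.1]

    # Example 1:  y = ((x - 3)*x + 3)*x
    u = sub(x, (F(3), F(3)))
    v = mul(u, x)
    w = add(v, (F(3), F(3)))
    y1 = mul(w, x)

    # Example 3:  y = (x - 1)^3 + 1
    u1 = sub(x, (F(1), F(1)))
    y2 = add(mul(mul(u1, u1), u1), (F(1), F(1)))

    print([float(t) for t in y1])    # [0.621, 1.419]  -- far too wide
    print([float(t) for t in y2])    # [0.999, 1.001]  -- the optimal enclosure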

For an in-depth treatment of interval arithmetic the reader should consult, for instance, Moore (1966) or Kulisch (1969).

In order to obtain statistical roundoff estimates [Rademacher (1948)], we assume that the relative roundoff error ε [see (1.2.6)] which is caused by an elementary operation is a random variable with values in the interval [−eps, eps]. Furthermore we assume that the roundoff errors ε attributable to different operations are independent random variables. By με we denote the expected value and by σε² the variance of the above roundoff distribution. They satisfy the general relationship

    με = E(ε),   σε² = E((ε − E(ε))²) = E(ε²) − (E(ε))² = μ_{ε²} − με².

Assuming a uniform distribution in the interval [−eps, eps], we get

(1.5.1)    με = E(ε) = 0,   σε² = E(ε²) = (1/(2·eps)) ∫_{−eps}^{eps} t² dt = eps²/3 =: ε̄².

Closer examinations show the roundoff distribution to be not quite uniform [see Sterbenz (1974), Exercise 22, p. 122]. It should also be kept in mind that the ideal roundoff pattern is only an approximation to the roundoff


patterns observed in actual computing machinery, so that the quantities με and σε² may have to be determined empirically.

The results x of algorithms subjected to random roundoff errors become random variables themselves, with expected values μx and variances σx² connected again by the basic relation

    σx² = E((x − E(x))²) = E(x²) − (E(x))² = μ_{x²} − μx².

The propagation of previous roundoff effects through elementary operations is described by the following formulas for arbitrary independent random variables x, y and constants α, β ∈ ℝ:

(1.5.2)    μ_{αx±βy} = E(αx ± βy) = α·E(x) ± β·E(y) = α·μx ± β·μy,

           σ²_{αx±βy} = E((αx ± βy)²) − (E(αx ± βy))²
                      = α²·E((x − E(x))²) + β²·E((y − E(y))²) = α²σx² + β²σy².

The first of the above formulas follows by the linearity of the expected-value operator. It holds for arbitrary random variables x, y. The second formula is based on the relation E(x·y) = E(x)·E(y), which holds whenever x and y are independent. Similarly, we obtain for independent x and y

(1.5.3)    μ_{x·y} = E(x·y) = E(x)·E(y) = μx·μy,

           σ²_{x·y} = E[(x·y − E(x)·E(y))²] = μ_{x²}·μ_{y²} − μx²·μy²
                    = σx²σy² + μx²σy² + μy²σx².

Example. For calculating y = a² − b² (see Example 2 in Section 1.3) we find, under the assumptions (1.5.1), E(a) = a, σa² = 0, E(b) = b, σb² = 0, and using (1.5.2) and (1.5.3), that

    η1 = a²(1 + ε1),   E(η1) = a²,   σ²_{η1} = a⁴·ε̄²,

    η2 = b²(1 + ε2),   E(η2) = b²,   σ²_{η2} = b⁴·ε̄²,

    y = (η1 − η2)(1 + ε3),   E(y) = E(η1 − η2)·E(1 + ε3) = a² − b²

(ε1, ε2, ε3 are assumed to be independent),

    σy² = σ²_{η1−η2}·σ²_{1+ε3} + μ²_{η1−η2}·σ²_{1+ε3} + σ²_{η1−η2}·μ²_{1+ε3}

        = (σ²_{η1} + σ²_{η2})·ε̄² + (a² − b²)²·ε̄² + 1·(σ²_{η1} + σ²_{η2})

        = (a⁴ + b⁴)·ε̄⁴ + [(a² − b²)² + a⁴ + b⁴]·ε̄².

Neglecting ε̄⁴ compared to ε̄² yields

    σy² ≐ ((a² − b²)² + a⁴ + b⁴)·ε̄².

For a := 0.3237, b := 0.3134, eps = 5·10^{−4} (see Example 5 in Section 1.3), we find

    σy ≐ 0.144·ε̄ = 0.000 0415,


which is close in magnitude to the true error Δy = 0.000 01787 for 4-digit arithmetic. Compare this with the error bound 0.000 10478 furnished by (1.3.17).
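The statistical estimate can also be checked by simulation (our own sketch, not from the text): the three relative roundoff errors ε1, ε2, ε3 are drawn uniformly from [−eps, eps] as in (1.5.1), and the empirical standard deviation of y is compared with σy ≈ 0.0000415:

    import random, statistics

    a, b, eps = 0.3237, 0.3134, 5e-4
    exact = a * a - b * b

    samples = []
    for _ in range(200_000):
        e1, e2, e3 = (random.uniform(-eps, eps) for _ in range(3))
        eta1 = a * a * (1 + e1)                    # rounded square a^2
        eta2 = b * b * (1 + e2)                    # rounded square b^2
        samples.append((eta1 - eta2) * (1 + e3))   # rounded subtraction

    print(statistics.pstdev(samples))          # about 4.1e-5, close to sigma_y
    print(statistics.mean(samples) - exact)    # about 0: unbiased to first order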

We denote by M(x) the set of all quantities which, directly or indirectly, have entered the calculation of the quantity x. If M(x) ∩ M(y) ≠ ∅ for the algorithm in question, then the random variables x and y are in general dependent.

The statistical roundoff error analysis of an algorithm becomes extremely complicated if dependent random variables are present. It becomes quite easy, however, under the following simplifying assumptions:

(1.5.4)

    (a) The operands of each arithmetic operation are independent random variables.

    (b) In calculating variances all terms of an order higher than the smallest one are neglected.

    (c) All variances are so small that for elementary operations ⊡, in first-order approximation, E(x ⊡ y) ≐ E(x) ⊡ E(y) = μx ⊡ μy.

If in addition the expected values μx are replaced by the estimated values x, and relative variances εx² := σx²/μx² ≈ σx²/x² are introduced, then from (1.5.2) and (1.5.3) [compare (1.2.6), (1.3.5)],

(1.5.5)    z = fl(x ± y):   εz² ≐ (x/z)²·εx² + (y/z)²·εy² + ε̄²,

           z = fl(x·y):     εz² ≐ εx² + εy² + ε̄²,

           z = fl(x/y):     εz² ≐ εx² + εy² + ε̄².

It should be kept in mind, however, that these results are valid only if the hypotheses (1.5.4), in particular (1.5.4a), are met.

It is possible to evaluate the above formulas in the course of a numerical computation and thereby to obtain an estimate of the error of the final results. As in the case of interval arithmetic, this leads to an arithmetic of paired quantities (x, εx²) for which elementary operations are defined with the help of the above or similar formulas. Error bounds for the final results r are then obtained from the relative variance εr², assuming that the final error distribution is normal. This assumption is justified inasmuch as the distributions of propagated errors alone tend to become normal if subjected to many elementary operations. At each such operation the nonnormal roundoff error distribution is superimposed on the distribution of previous errors. However, after many operations, the propagated errors are large compared to the newly created roundoff errors, so that the latter do not appreciably affect the normality of the total error distribution. Assuming the final error distribution to be normal, the actual relative error of the final result r is bounded with probability 0.9 by 2εr.
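A sketch of such an arithmetic of pairs (x, εx²) in Python (our own illustration; the class name is hypothetical) applies the rules (1.5.5) at every operation and reports 2·εr·|r| as a rough 90% error bound:

    import math

    EPS2 = (5e-4) ** 2 / 3        # epsilon-bar squared for 4-digit arithmetic, cf. (1.5.1)

    class Stat:
        """A value x together with an estimate var of its relative variance eps_x^2."""
        def __init__(self, x, var=0.0):
            self.x, self.var = x, var
        def __add__(self, o):
            z = self.x + o.x
            return Stat(z, (self.x / z) ** 2 * self.var + (o.x / z) ** 2 * o.var + EPS2)
        def __sub__(self, o):
            z = self.x - o.x
            return Stat(z, (self.x / z) ** 2 * self.var + (o.x / z) ** 2 * o.var + EPS2)
        def __mul__(self, o):
            return Stat(self.x * o.x, self.var + o.var + EPS2)
        def bound(self):
            return 2.0 * math.sqrt(self.var) * abs(self.x)   # ~90% absolute error bound

    a, b = Stat(0.3237), Stat(0.3134)
    y = a * a - b * b
    print(y.x, y.bound())         # bound roughly 2 * 0.0000415, cf. the example above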


    Exercises for Chapter 1

1. Show that with floating-point arithmetic of t decimal places

       rd(a) = a/(1 + ε)   with |ε| ≤ 5·10^{−t}

   holds in analogy to (1.2.2). [In parallel with (1.2.6), as a consequence, fl(a ⊡ b) = (a ⊡ b)/(1 + ε) with |ε| ≤ 5·10^{−t} for all arithmetic operations ⊡ = +, −, ×, /.]

2. Let a, b, c be fixed-point numbers with N decimal places after the decimal point, and suppose 0 < a, b, c < 1. The substitute product a ∗ b is defined as follows: add 10^{−N}/2 to the exact product a·b, and delete the (N+1)-st and subsequent digits.

   (a) Give a bound for |(a ∗ b) ∗ c − abc|.

   (b) By how many units of the N-th place can (a ∗ b) ∗ c and a ∗ (b ∗ c) differ?

3. Evaluating the sum Σ_{i=1}^{n} ai in floating-point arithmetic may lead to an arbitrarily large relative error. If, however, all summands ai are of the same sign, then this relative error is bounded. Derive a crude bound for this error, disregarding terms of higher order.

4. Show how to evaluate the following expressions in a numerically stable fashion:

       1/(1 + 2x) − (1 − x)/(1 + x)          for |x| ≪ 1,

       √(x + 1/x) − √(x − 1/x)               for x ≫ 1,

       (1 − cos x)/x                         for x ≠ 0, |x| ≪ 1.

5. Suppose a computer program is available which yields values for arcsin y in floating-point representation with t decimal mantissa places and for |y| ≤ 1, subject to a relative error ε with |ε| ≤ 5·10^{−t}. In view of the relation

       arctan x = arcsin( x / √(1 + x²) ),

   this program could also be used to evaluate arctan x. Determine for which values x this procedure is numerically stable by estimating the relative error.

6. For given z, the function tan(z/2) can be computed according to the formula

       tan(z/2) = ( (1 − cos z)/(1 + cos z) )^{1/2}.

   Is this method of evaluation numerically stable for z ≈ 0, z ≈ π/2? If necessary, give numerically stable alternatives.

7. The function

       f(φ, kc) := 1 / √( cos²φ + kc²·sin²φ )


   is to be evaluated for 0 ≤ φ ≤ π/2, 0 < kc ≤ 1. The method

       k² := 1 − kc²,
       f(φ, kc) := 1 / √(1 − k²·sin²φ)

   avoids the calculation of cos φ and is faster. Compare this with the direct evaluation of the original expression for f(φ, kc) with respect to numerical stability.

8. For the linear function f(x) := a + b·x, where a ≠ 0, b ≠ 0, compute the first derivative Dhf(0) = f′(0) = b by the formula

       Dhf(0) = ( f(h) − f(−h) ) / (2h)

   in binary floating-point arithmetic. Suppose that a and b are binary machine numbers, and h a power of 2. Multiplication by h and division by 2h can therefore be carried out exactly. Give a bound for the relative error of Dhf(0). What is the behavior of this bound as h → 0?

9. The square root ±(u + iv) of a complex number x + iy with y ≠ 0 may be calculated from the formulas

       u = ±√( (x + √(x² + y²)) / 2 ),
       v = y / (2u).

   Compare the cases x ≥ 0 and x < 0 with respect to their numerical stability. Modify the formulas if necessary to ensure numerical stability.

10. The variance S² of a set of observations x1, . . . , xn is to be determined. Which of the formulas

        S² = (1/(n − 1)) ( Σ_{i=1}^{n} xi² − n·x̄² ),

        S² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²   with   x̄ := (1/n) Σ_{i=1}^{n} xi,

    is numerically more trustworthy?

11. The coefficients ar, br (r = 0, . . . , n) are, for fixed x, connected recursively:

        bn := an;
        (∗)  for r = n − 1, n − 2, . . . , 0:   br := x·b_{r+1} + ar.

    (a) Show that the polynomials

            A(z) := Σ_{r=0}^{n} ar·z^r,   B(z) := Σ_{r=1}^{n} br·z^{r−1}


        satisfy

            A(z) ≡ (z − x)·B(z) + b0.

    (b) Suppose A(x) = b0 is to be calculated by the recursion (∗) for fixed x in floating-point arithmetic, the result being b̄0. Show, using the formulas (compare Exercise 1)

            fl(u + v) = (u + v)/(1 + σ),   |σ| ≤ eps,
            fl(u·v) = u·v/(1 + π),   |π| ≤ eps,

        the inequality

            |A(x) − b̄0| ≤ eps/(1 − eps) · (2·e0 − |b̄0|),

        where e0 is defined by the following recursion:

            en := |an|/2;
            for r = n − 1, n − 2, . . . , 0:   er := |x|·e_{r+1} + |b̄r|.

        Hint: From

            b̄n := an,
            pr := fl(x·b̄_{r+1}) = x·b̄_{r+1}/(1 + π_{r+1}),
            b̄r := fl(pr + ar) = (pr + ar)/(1 + σr) = x·b̄_{r+1} + ar + δr,   r = n − 1, . . . , 0,

        derive

            δr = −x·b̄_{r+1}·π_{r+1}/(1 + π_{r+1}) − σr·b̄r   (r = n − 1, . . . , 0);

        then show b̄0 = Σ_{k=0}^{n} (ak + δk)·x^k, δn := 0, and estimate Σ_{k=0}^{n} |δk|·|x|^k.

12. Assuming the earth to be spherical, two points on its surface can be expressed in Cartesian coordinates

        pi = [xi, yi, zi] = [r·cos αi·cos βi,  r·sin αi·cos βi,  r·sin βi],   i = 1, 2,

    where r is the earth's radius and αi, βi are the longitudes and latitudes of the two points pi, respectively. If

        cos σ = p1^T·p2 / r² = cos(α1 − α2)·cos β1·cos β2 + sin β1·sin β2,

    then rσ is the great-circle distance between the two points.

    (a) Show that using the arccos function to determine σ from the above expression is not numerically stable.

    (b) Derive a numerically stable expression for σ.


    References for Chapter 1

Ashenhurst, R. L., Metropolis, N. (1959): Unnormalized floating-point arithmetic. J. Assoc. Comput. Mach. 6, 415–428.

Bauer, F. L. (1974): Computational graphs and rounding error. SIAM J. Numer. Anal. 11, 87–96.

Bauer, F. L., Heinhold, J., Samelson, K., Sauer, R. (1965): Moderne Rechenanlagen. Stuttgart: Teubner.

Henrici, P. (1963): Error Propagation for Difference Methods. New York: Wiley.

Knuth, D. E. (1969): The Art of Computer Programming. Vol. 2. Seminumerical Algorithms. Reading, Mass.: Addison-Wesley.

Kulisch, U. (1969): Grundzüge der Intervallrechnung. In: D. Laugwitz: Überblicke Mathematik 2, 51–98. Mannheim: Bibliographisches Institut.

Moore, R. E. (1966): Interval Analysis. Englewood Cliffs, N.J.: Prentice-Hall.

Neumann, J. von, Goldstine, H. H. (1947): Numerical inverting of matrices of high order. Bull. Amer. Math. Soc. 53, 1021–1099.

Rademacher, H. A. (1948): On the accumulation of errors in processes of integration on high-speed calculating machines. Proceedings of a Symposium on Large-Scale Digital Calculating Machinery. Annals Comput. Labor. Harvard Univ. 16, 176–185.

Scarborough, J. B. (1950): Numerical Mathematical Analysis, 2nd edition. Baltimore: Johns Hopkins Press.

Sterbenz, P. H. (1974): Floating Point Computation. Englewood Cliffs, N.J.: Prentice-Hall.

Wilkinson, J. H. (1960): Error analysis of floating-point computation. Numer. Math. 2, 319–340.

Wilkinson, J. H. (1963): Rounding Errors in Algebraic Processes. New York: Wiley.

Wilkinson, J. H. (1965): The Algebraic Eigenvalue Problem. Oxford: Clarendon Press.

2 Interpolation

Consider a family of functions of a single variable x,

    Φ(x; a0, . . . , an),

having n + 1 parameters a0, . . . , an whose values characterize the individual functions in this family. The interpolation problem for Φ consists of determining these parameters ai so that for n + 1 given real or complex pairs of numbers (xi, fi), i = 0, . . . , n, with xi ≠ xk for i ≠ k,

    Φ(xi; a0, . . . , an) = fi,   i = 0, . . . , n,

holds. We will call the pairs (xi, fi) support points, the locations xi support abscissas, and the values fi support ordinates. Occasionally, the values of derivatives of Φ are also prescribed.

The above is a linear interpolation problem if Φ depends linearly on the parameters ai:

    Φ(x; a0, . . . , an) ≡ a0Φ0(x) + a1Φ1(x) + ⋯ + anΦn(x).

This class of problems includes the classical one of polynomial interpolation [Section 2.1],

    Φ(x; a0, . . . , an) ≡ a0 + a1x + a2x² + ⋯ + anx^n,

as well as trigonometric interpolation [Section 2.3],

    Φ(x; a0, . . . , an) ≡ a0 + a1·e^{ix} + a2·e^{i·2x} + ⋯ + an·e^{i·nx}   (i² = −1).

In the past, polynomial interpolation was frequently used to interpolate function values gathered from tables. The availability of modern computing machinery has almost eliminated the need for extensive table lookups. However, polynomial interpolation is also important as the basis of several types of numerical integration formulas in general use. In a more modern development, polynomial and rational interpolation (see below) are employed in the construction of extrapolation methods for integration, differential equations, and related problems [see for instance Sections 3.3 and 3.4].


Trigonometric interpolation is used extensively for the numerical Fourier analysis of time series and cyclic phenomena in general. In this context, the so-called fast Fourier transforms are particularly important and successful [Section 2.3.2].

The class of linear interpolation problems also contains spline interpolation [Section 2.4]. In the special case of cubic splines, the functions Φ are assumed to be twice continuously differentiable for x ∈ [x0, xn] and to coincide with some cubic polynomial on every subinterval [xi, xi+1] of a given partition x0 < x1 < ⋯ < xn.

Spline interpolation is a fairly new development of growing importance. It provides a valuable tool for representing empirical curves and for approximating complicated mathematical functions. It is increasingly used when dealing with ordinary or partial differential equations.

Two nonlinear interpolation schemes are of importance: rational interpolation,

    Φ(x; a0, . . . , an, b0, . . . , bm) ≡ (a0 + a1x + ⋯ + anx^n) / (b0 + b1x + ⋯ + bmx^m),

and exponential interpolation,

    Φ(x; a0, . . . , an, λ0, . . . , λn) ≡ a0·e^{λ0 x} + a1·e^{λ1 x} + ⋯ + an·e^{λn x}.

Rational interpolation (Section 2.2) plays a role in the process of best approximating a given function by one which is readily evaluated on a digital computer. Exponential interpolation is used, for instance, in the analysis of radioactive decay.

Interpolation is a basic tool for the approximation of given functions. For a comprehensive discussion of these and related topics consult Davis (1965).

    2.1 Interpolation by Polynomials

2.1.1 Theoretical Foundation: The Interpolation Formula of Lagrange

In what follows, we denote by Πn the set of all real or complex polynomials P whose degrees do not exceed n:

    P(x) = a0 + a1x + ⋯ + anx^n.

(2.1.1.1) Theorem. For n + 1 arbitrary support points

    (xi, fi),   i = 0, . . . , n,   xi ≠ xk for i ≠ k,

there exists a unique polynomial P ∈ Πn with


    P (xi) = fi, i = 0, 1, . . . , n.

Proof. Uniqueness: For any two polynomials P1, P2 ∈ Πn with

    P1(xi) = P2(xi) = fi, i = 0, 1, . . . , n,

the polynomial P := P1 − P2 ∈ Πn has degree at most n, and it has at least n + 1 different zeros, namely xi, i = 0, . . . , n. P must therefore vanish identically, and P1 = P2.

Existence: We will construct the interpolating polynomial P explicitly with the help of polynomials Li ∈ Πn, i = 0, . . . , n, for which

(2.1.1.2)    Li(xk) = δik = { 1 if i = k,
                              0 if i ≠ k.

The following Lagrange polynomials satisfy the above conditions:

(2.1.1.3)    Li(x) ≡ ( (x − x0) ⋯ (x − x_{i−1})(x − x_{i+1}) ⋯ (x − xn) ) / ( (xi − x0) ⋯ (xi − x_{i−1})(xi − x_{i+1}) ⋯ (xi − xn) )

                   ≡ ω(x) / ( (x − xi)·ω′(xi) )   with   ω(x) := Π_{i=0}^{n} (x − xi).

Note that our proof so far shows that the Lagrange polynomials are uniquely determined by (2.1.1.2).

The solution P of the interpolation problem can now be expressed directly in terms of the polynomials Li, leading to the Lagrange interpolation formula:

(2.1.1.4)    P(x) ≡ Σ_{i=0}^{n} fi·Li(x) = Σ_{i=0}^{n} fi · Π_{k=0, k≠i}^{n} (x − xk)/(xi − xk).

The above interpolation formula shows that the coefficients of P depend linearly on the support ordinates fi. While theoretically important, Lagrange's formula is, in general, not as suitable for actual calculations as some other methods to be described below, particularly for large numbers n of support points. Lagrange's formula may, however, be useful in some situations in which many interpolation problems are to be solved for the same support abscissas xi, i = 0, . . . , n, but different sets of support ordinates fi, i = 0, . . . , n.

Example. Given for n = 2:

    xi | 0  1  3
    fi | 1  3  2

Wanted: P(2), where P ∈ Π2, P(xi) = fi for i = 0, 1, 2.


Solution:

    L0(x) ≡ (x − 1)(x − 3) / ((0 − 1)(0 − 3)),   L1(x) ≡ (x − 0)(x − 3) / ((1 − 0)(1 − 3)),   L2(x) ≡ (x − 0)(x − 1) / ((3 − 0)(3 − 1)),

    P(2) = 1·L0(2) + 3·L1(2) + 2·L2(2) = 1·(−1/3) + 3·1 + 2·(1/3) = 10/3.

2.1.2 Neville's Algorithm

Instead of solving the interpolation problem all at once, one might consider solving the problem for smaller sets of support points first and then updating these solutions to obtain the solution to the full interpolation problem. This idea will be explored in the following two sections.

For a given set of support points (xi, fi), i = 0, 1, . . . , n, we denote by

    P_{i0 i1 ... ik} ∈ Πk

that polynomial in Πk for which

    P_{i0 i1 ... ik}(x_{ij}) = f_{ij},   j = 0, 1, . . . , k.

These polynomials are linked by the following recursion:

(2.1.2.1a)    Pi(x) ≡ fi,

(2.1.2.1b)    P_{i0 i1 ... ik}(x) ≡ [ (x − x_{i0})·P_{i1 i2 ... ik}(x) − (x − x_{ik})·P_{i0 i1 ... i_{k−1}}(x) ] / ( x_{ik} − x_{i0} ).

Proof: (2.1.2.1a) is trivial. To prove (2.1.2.1b), we denote its right-hand side by R(x), and go on to show that R has the characteristic properties of P_{i0 i1 ... ik}. The degree of R is clearly not greater than k. By the definitions of P_{i1 ... ik} and P_{i0 ... i_{k−1}},

    R(x_{i0}) = P_{i0 ... i_{k−1}}(x_{i0}) = f_{i0},
    R(x_{ik}) = P_{i1 ... ik}(x_{ik}) = f_{ik},

and

    R(x_{ij}) = [ (x_{ij} − x_{i0})·f_{ij} − (x_{ij} − x_{ik})·f_{ij} ] / ( x_{ik} − x_{i0} ) = f_{ij}

for j = 1, 2, . . . , k − 1. Thus R = P_{i0 i1 ... ik}, in view of the uniqueness of polynomial interpolation [Theorem (2.1.1.1)].

Neville's algorithm aims at determining the value of the interpolating polynomial P for a single value of x. It is less suited for determining the interpolating polynomial itself. Algorithms that are more efficient for the


latter task, and also more efficient if values of P are sought for several arguments x simultaneously, will be described in Section 2.1.3.

Based on the recursion (2.1.2.1), Neville's algorithm constructs a symmetric tableau of the values of some of the partially interpolating polynomials P_{i0 i1 ... ik} for fixed x:

(2.1.2.2)
              k = 0           1          2            3

    x0   f0 = P0(x)
                          P01(x)
    x1   f1 = P1(x)                  P012(x)
                          P12(x)                  P0123(x)
    x2   f2 = P2(x)                  P123(x)
                          P23(x)
    x3   f3 = P3(x)

The first column of the tableau contains the prescribed support ordinates fi. Subsequent columns are filled by calculating each entry recursively from its two neighbors in the previous column according to (2.1.2.1b). The entry P123(x), for instance, is given by

    P123(x) = [ (x − x1)·P23(x) − (x − x3)·P12(x) ] / ( x3 − x1 ).

Example. Determine P012(2) for the same support points as in Section 2.1.1.

              k = 0               1               2

    x0 = 0   f0 = P0(2) = 1
                                P01(2) = 5
    x1 = 1   f1 = P1(2) = 3                   P012(2) = 10/3
                                P12(2) = 5/2
    x2 = 3   f2 = P2(2) = 2

    P01(2) = [ (2 − 0)·3 − (2 − 1)·1 ] / (1 − 0) = 5,

    P12(2) = [ (2 − 1)·2 − (2 − 3)·3 ] / (3 − 1) = 5/2,

    P012(2) = [ (2 − 0)·5/2 − (2 − 3)·5 ] / (3 − 0) = 10/3.

We will now discuss slight variants of Neville's algorithm, employing a frequently used abbreviation,

(2.1.2.3)    T_{i+k,k} := P_{i,i+1,...,i+k}.


The tableau (2.1.2.2) becomes

(2.1.2.4)
    x0   f0 = T00
                    T11
    x1   f1 = T10          T22
                    T21            T33
    x2   f2 = T20          T32
                    T31
    x3   f3 = T30

The arrows indicate how the additional upward diagonal Ti,0, Ti,1, . . . , Ti,i can be constructed if one more support point (xi, fi) is added.

The recursion (2.1.2.1) may be modified for more efficient evaluation:

(2.1.2.5a)    Ti,0 := fi,

(2.1.2.5b)    Ti,k := [ (x − x_{i−k})·T_{i,k−1} − (x − xi)·T_{i−1,k−1} ] / ( xi − x_{i−k} )

                    = T_{i,k−1} + ( T_{i,k−1} − T_{i−1,k−1} ) / ( (x − x_{i−k})/(x − xi) − 1 ),   1 ≤ k ≤ i,  i ≥ 0.

The following ALGOL algorithm is based on this modified recursion:

    for i := 0 step 1 until n do
    begin t[i] := f[i];
        for j := i − 1 step −1 until 0 do
            t[j] := t[j + 1] + (t[j + 1] − t[j]) × (x − x[i])/(x[i] − x[j])
    end;

After the inner loop has terminated, t[j] = T_{i,i−j}, 0 ≤ j ≤ i. The desired value Tnn = P_{01...n}(x) of the interpolating polynomial can be found in t[0].
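A direct transcription of this ALGOL fragment into Python (our own sketch) reads as follows; for the example above it again returns 10/3:

    def neville(xs, fs, x):
        """Neville's algorithm in the form (2.1.2.5); returns T_nn = P_{01...n}(x)."""
        t = list(fs)
        for i in range(len(xs)):
            for j in range(i - 1, -1, -1):
                t[j] = t[j + 1] + (t[j + 1] - t[j]) * (x - xs[i]) / (xs[i] - xs[j])
        return t[0]

    print(neville([0, 1, 3], [1, 3, 2], 2))   # 10/3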

Still another modification of Neville's algorithm serves to improve somewhat the accuracy of the interpolated polynomial value. For i = 0, 1, . . . , n, let the quantities Qik, Dik be defined by

    Qi0 := Di0 := fi,

    Qik := Tik − T_{i,k−1},
    Dik := Tik − T_{i−1,k−1},     1 ≤ k ≤ i.


The recursion (2.1.2.5) then translates into

(2.1.2.6)    Qik := ( D_{i,k−1} − Q_{i−1,k−1} ) · (xi − x)/(x_{i−k} − xi),

             Dik := ( D_{i,k−1} − Q_{i−1,k−1} ) · (x_{i−k} − x)/(x_{i−k} − xi),     1 ≤ k ≤ i,  i = 0, 1, . . . .

Starting with Qi0 := Di0 := fi, one calculates Qik, Dik from the above recursion. Finally

    Tnn = fn + Σ_{k=1}^{n} Qnk.

If the values f0, . . . , fn are close to each other, the quantities Qik will be small compared to fi. This suggests forming the sum of the corrections Qn1, . . . , Qnn first [contrary to (2.1.2.5)] and then adding it to fn, thereby avoiding unnecessary roundoff errors.
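A sketch of this correction-summing variant (our own illustration, not from the text): the columns Qik, Dik of (2.1.2.6) are built up in place, and the corrections Qn1, . . . , Qnn are accumulated separately before being added to fn:

    def neville_corrections(xs, fs, x):
        """Variant (2.1.2.6): returns f_n plus the accumulated corrections Q_nk."""
        n = len(xs) - 1
        q = list(fs)              # q[i] holds Q_{i,k-1} of the previous column
        d = list(fs)              # d[i] holds D_{i,k-1} of the previous column
        correction = 0.0
        for k in range(1, n + 1):
            for i in range(n, k - 1, -1):     # downwards, so q[i-1] is still column k-1
                num = d[i] - q[i - 1]
                q[i] = num * (xs[i] - x) / (xs[i - k] - xs[i])
                d[i] = num * (xs[i - k] - x) / (xs[i - k] - xs[i])
            correction += q[n]
        return fs[n] + correction

    print(neville_corrections([0, 1, 3], [1, 3, 2], 2))   # 10/3 again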

Note finally that for x = 0 the recursion (2.1.2.5) takes a particularly simple form,

(2.1.2.7a)    Ti0 := fi,

(2.1.2.7b)    Tik := T_{i,k−1} + ( T_{i,k−1} − T_{i−1,k−1} ) / ( x_{i−k}/xi − 1 ),   1 ≤ k ≤ i,

as does its analog (2.1.2.6). These forms are encountered when applying extrapolation methods.

For historical reasons mainly, we mention Aitken's algorithm. It is also based on (2.1.2.1), but uses different intermediate polynomials. Its tableau is of the form

    x0   f0 = P0(x)

    x1   f1 = P1(x)   P01(x)

    x2   f2 = P2(x)   P02(x)   P012(x)

    x3   f3 = P3(x)   P03(x)   P013(x)   P0123(x)

The first column again contains the prescribed values fi. Each subsequent entry derives from the previous entry in the same row and the top entry in the previous column according to (2.1.2.1b).

2.1.3 Newton's Interpolation Formula: Divided Differences

Neville's algorithm is geared towards determining interpolating values rather than polynomials. If the interpolating polynomial itself is needed,


or if one wants to find interpolating values for several arguments ξj simultaneously, then Newton's interpolation formula is to be preferred. Here we write the interpolating polynomial P ∈ Πn, P(xi) = fi, i = 0, 1, . . . , n, in the form

(2.1.3.1)    P(x) ≡ P_{01...n}(x) = a0 + a1(x − x0) + a2(x − x0)(x − x1) + ⋯ + an(x − x0) ⋯ (x − x_{n−1}).

Note that the evaluation of (2.1.3.1) for x = ξ may be done recursively as indicated by the following expression:

    P(ξ) = ( ⋯ ( an(ξ − x_{n−1}) + a_{n−1} )(ξ − x_{n−2}) + ⋯ + a1 )(ξ − x0) + a0.

This requires fewer operations than evaluating (2.1.3.1) term by term. It corresponds to the so-called Horner scheme for evaluating polynomials which are given in the usual form, i.e. in terms of powers of x, and it shows that the representation (2.1.3.1) is well suited for evaluation.

It remains to determine the coefficients ai in (2.1.3.1). In principle, they can be calculated successively from

    f0 = P(x0) = a0,
    f1 = P(x1) = a0 + a1(x1 − x0),
    f2 = P(x2) = a0 + a1(x2 − x0) + a2(x2 − x0)(x2 − x1),
    . . .

This can be done with n divisions and n(n − 1) multiplications. There is, however, a better way, which requires only n(n + 1)/2 divisions and which produces useful intermediate results.

Observe that the two polynomials P_{i0 i1 ... i_{k−1}}(x) and P_{i0 i1 ... ik}(x) differ by a polynomial of degree k with the k zeros x_{i0}, x_{i1}, . . . , x_{i_{k−1}}, since both polynomials interpolate the corresponding support points. Therefore there exists a unique coefficient

(2.1.3.2)    f_{i0 i1 ... ik}

such that

(2.1.3.3)    P_{i0 i1 ... ik}(x) = P_{i0 i1 ... i_{k−1}}(x) + f_{i0 i1 ... ik}·(x − x_{i0})(x − x_{i1}) ⋯ (x − x_{i_{k−1}}).

From this and from the identity P_{i0}(x) ≡ f_{i0} it follows immediately that

(2.1.3.4)    P_{i0 i1 ... ik}(x) = f_{i0} + f_{i0 i1}(x − x_{i0}) + ⋯ + f_{i0 i1 ... ik}(x − x_{i0})(x − x_{i1}) ⋯ (x − x_{i_{k−1}})


is a Newton representation of the interpolating polynomial P_{i0,...,ik}(x). The coefficients (2.1.3.2) are called k-th divided differences.

The recursion (2.1.2.1) for the partially interpolating polynomials translates into the recursion

(2.1.3.5)    f_{i0 i1 ... ik} = ( f_{i1 ... ik} − f_{i0 ... i_{k−1}} ) / ( x_{ik} − x_{i0} )

for the divided differences, since by (2.1.3.3), f_{i1 ... ik} and f_{i0 ... i_{k−1}} are the coefficients of the highest terms of the polynomials P_{i1 i2 ... ik} and P_{i0 i1 ... i_{k−1}}, respectively. The above recursion starts for k = 0 with the given support ordinates fi, i = 0, . . . , n. It can be used in various ways for calculating the divided differences f_{i0}, f_{i0 i1}, . . . , f_{i0 i1 ... in}, which then characterize the desired interpolating polynomial P = P_{i0 i1 ... in}.

    Because the polynomial Pi0i1...ik is uniquely determined by the supportpoints it interpolates [Theorem (2.1.1.1)], the polynomial is invariant to anypermutation of the indices i0, i1, . . . , ik, and so is its coefficient fi0i1...ik ofxk. Thus:(2.1.3.6). The divided differences fi0i1...ik are invariant to permutations ofthe in