Introduction to numerical algorithms
Given an algebraic equation or formula, we may want to approximate a value. In calculus, we deal with equations and formulas that are well defined at each point and that have properties such as continuity and differentiability; in engineering applications, such information is not always available. Instead, we may understand that the underlying behavior is continuous and differentiable, but we may only have samples of the values of the equation or formula. Our goal will be to find algorithms that give us, under many circumstances, good approximations of solutions to an equation or of the value of a formula.
In this introductory chapter, we will look at:
1. techniques used in numerical algorithms,
2. sources of error, and
3. the representation of floating-point numbers.
We will begin with the techniques used in numerical algorithms.

Techniques in numerical algorithms

Algorithms for finding numerical approximations to solutions of algebraic equations and formulas generally use at least one of six techniques, including:
1. iteration,
2. linear algebra,
3. interpolation,
4. Taylor series,
5. bracketing, and
6. weighted averages.
We will look at each of these six techniques, and while each is relatively straightforward on its own, we will see that together they allow us to compute solutions to some of the most complex algebraic equations and formulas.
Iteration

This section will first introduce the concept of iteration, then look at a straightforward example, the fixed-point theorem, and it will conclude with a discussion of initial points for such iterations.
Iteration

Many numerical algorithms involve taking a poor approximation xk and from it finding a better approximation xk+1; this process can be repeated so that, under certain conditions and usually only in theory, the approximations get closer and closer to the correct answer. Problems with iterative approaches include:
1. the sequence converges, but very slowly,
2. the sequence converges to a solution that is not the one we are looking for,
3. the sequence diverges (approaches plus or minus infinity), or
4. the sequence does not converge.
When we discuss the fixed-point theorem, we will see examples of each of these.
Fixed-point theorem

The easiest example of an iterative means of approximating a solution to an equation is finding a solution to

    x = f(x)

for a function f. In this case, if we start with any initial approximation x0 and let

    xk+1 = f(xk),

then under specific circumstances, the fixed-point theorem says that the sequence will converge to a solution of the equation x = f(x).
Example 1

As an example, suppose we want to approximate a solution to the equation

    x = cos(x).

The only solution to this equation is approximately 0.73908513321516064166.
Now, we know that π/4 ≈ 0.7853981634 and cos(π/4) = √2/2 ≈ 0.7071067810, so let's start with x0 = √2/2; but as computers cannot store √2/2 exactly, we will start out with a 20-decimal-digit approximation, x0 = 0.70710678118654752440, and so we begin:

    x1 = cos(x0) = cos(0.70710678118654752440) = 0.76024459707563015125.
If we repeat this, we have the sequence of values presented here:
x0 0.70710678118654752440
x1 0.76024459707563015125
x2 0.72466748088912622790
x3 0.74871988578948429789
x4 0.73256084459224179590
x5 0.74346421131529366888
x6 0.73612825650085194340
x7 0.74107368708371021764
x8 0.73774415899257467163
x9 0.73998776479587092315
x10 0.73847680872455379036
x11 0.73949477113197436584
x12 0.73880913418406974579
x13 0.73927102133010927466
x14 0.73895990397625177601
x15 0.73916948334137422989
x16 0.73902831132627283515
x17 0.73912340792986369997
x18 0.73905935036556661907
x19 0.73910250060713648065
x20 0.73908514737681824432
The easiest way to observe this is to take any older-generation calculator set to radians, punch in any number, and then start hitting the cos key.
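The same experiment is easy to reproduce in C. The following is a minimal sketch (the iteration count of 20 is an arbitrary choice, and since the double type carries only about 16 significant decimal digits, the trailing digits will differ slightly from the 20-digit values tabulated above):

    #include <stdio.h>
    #include <math.h>

    int main( void ) {
        double x = 0.70710678118654752440;   // x0, our approximation of sqrt(2)/2

        // Repeatedly apply x <- cos(x); under the conditions of the
        // fixed-point theorem, this converges to the solution of x = cos(x)
        for ( int k = 1; k <= 20; ++k ) {
            x = cos( x );
            printf( "x%d = %.16f\n", k, x );
        }

        return 0;
    }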
Example 2

If we try instead to approximate a solution to x = sin(x), we know the only solution to this equation is x = 0. If, however, we start with x0 = 1.0, the first step looks hopeful, x1 = sin(1.0) = 0.84147098480789650665, but we might rightfully have reason for concern if it takes nine iterations to get a value less than 0.5:
x0 1.00000000000000000000
x1 0.84147098480789650665
x2 0.74562414166555788889
x3 0.67843047736074022898
x4 0.62757183204915913888
x5 0.58718099657343098933
x6 0.55401639075562963033
x7 0.52610707550284170222
x8 0.50217067626855534868
x9 0.48132935526234634496
In this case, the convergence is very slow; after 10000 iterations, our approximation is still
x10000 = 0.017313621122353677159—far less accurate than twenty iterations with the equation x = cos(x).
Example 3

The equation x = e^x − 1 − cos(x) has a solution at x = 0.92713713388001711273, but if we start with x0 = 0.9271, we find that we actually converge to the other solution at x = −1.1132554312382240490.
x0 0.9271000000000000000
x1 0.9270135853516848682
x2 0.9267260909475873311
x3 0.9257697888633520838
x4 0.9225906690922362275
x5 0.9120425734814909414
x6 0.8772702797682118268
x7 0.7650749295271742882
x8 0.4278249398787326488
x9 –0.3759527882387797054
x10 –1.2435234699343957626
x11 –1.0330954581334626660
x12 –1.1562590805182166271
x13 –1.0881053100268105210
x14 –1.1273102952777548147
x15 –1.1051875733572243861
x16 –1.1178180735432224195
x17 –1.1106528702068170966
x18 –1.1147327749625783986
x19 –1.1124144944595464182
x20 –1.1137333605438615669
Example 4

If we take the same equation, x = e^x − 1 − cos(x), but instead we start with x0 = 0.9272, we find a different result:
x0 0.9272
x1 0.9273463062488297850
x2 0.9278331540644906092
x3 0.9294536680953979168
x4 0.9348530289767798115
x5 0.9529024459698862581
x6 1.0139056955772054701
x7 1.2277962596009892473
x8 2.0773844215152758665
x9 7.4687566436661236317
x10 1751.0506722832620736
x11 2.9624055016857770800 × 10760
Example 5

If we consider the equation x = 1 + cos(x) − e^x, we note that this equation has only one solution, approximately at x = 0.41010429603233999790; however, after 10000 iterations, the sequence neither converges to our solution, nor does it converge to any other solution, nor does it diverge to infinity. Instead, the values always remain bounded.
x0 1.0000000000000000000
x1 –1.1779795225909055180
x2 1.0748919751463956454
x3 –1.4538491475114394034
x4 0.8830116594692874759
x5 –0.7833444047171225754
x6 1.2516820387258261270
x7 –2.1824931015815898224
x8 0.3129825494937142901
x9 0.5839218156734147546
x10 0.0412502764636847386
x11 0.9570364387415620172
x12 –1.0280228200916764844
x13 1.1587993460955800111
x14 –1.7856655640731183991
x15 0.6190949071537133250
x16 –0.0428422865585101293
x17 1.0410199321213100996
x18 –1.3267636954142391473
x19 0.9762831542458109225
x20 –1.0944657282613892553
To demonstrate this more clearly, one can plot the first 2000 iterations together with the actual solution x = 0.41010429603233999790: the approximations jump both above and below this value, but never converge to it.
Example 6

Finally, if we consider the equation x = 3.5x(1 − x), we note that this equation has two solutions, at x = 0 and approximately at x = 0.71428571428571428571; however, after twenty iterations, we note that the points bounce between four values, none of which is either solution.
x0 0.50000000000000000000
x1 0.87500000000000000000
x2 0.38281250000000000000
x3 0.82693481445312500000
x4 0.50089769484475255013
x5 0.87499717950387996645
x6 0.38281990377447189380
x7 0.82694088767001590788
x8 0.50088379589339714035
x9 0.87499726616686585021
x10 0.38281967628581869058
x11 0.82694070106983887107
x12 0.50088422294386791022
x13 0.87499726352424938150
x14 0.38281968322263632519
x15 0.82694070675984845448
x16 0.50088420992179774086
x17 0.87499726360484968051
x18 0.38281968301106208420
x19 0.82694070658630209823
x20 0.50088421031897331932
Convergence criteria

If a sequence of points converges to a point x, it is necessary that

    lim (k→∞) |xk − x| = 0,

but for numerical solutions, we don't require the exact answer, only an approximation, and thus we may desire that

    |xk − x| < εabs,

but even then we can't always guarantee the sequence will converge. Of course, we don't know when we're sufficiently close, because we don't know the actual value x; thus, we will instead look at

    |xk+1 − xk| < εabs.

Unfortunately, even this does not guarantee convergence, as we saw with the example with x = sin(x): there,

    |x10000 − x9999| ≈ 0.0000008651,

but we are still quite far away from the solution:

    |x10000 − 0| ≈ 0.01731.

Where possible, we may require other convergence criteria.
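In code, these criteria translate into a stopping condition. The following C sketch (the function name fixed_point and its parameters are illustrative, not prescribed by the text) halts when successive approximations differ by less than eps_abs, and gives up after max_iterations steps:

    #include <math.h>

    // Iterate x <- f(x) until successive approximations differ by less
    // than eps_abs, or until max_iterations steps have been taken
    double fixed_point( double f( double ), double x0,
                        double eps_abs, int max_iterations ) {
        for ( int k = 0; k < max_iterations; ++k ) {
            double x1 = f( x0 );

            if ( fabs( x1 - x0 ) < eps_abs ) {
                return x1;
            }

            x0 = x1;
        }

        return NAN;   // failed to satisfy |x(k+1) - x(k)| < eps_abs
    }

Returning NaN is one way of signalling failure; as the examples above show, a small step between successive iterates does not by itself guarantee that the returned value is close to a true solution.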
Initial values

As two of the examples demonstrated, using slightly different initial values can result in drastically different behaviours: one sequence converged to the other solution, while the other diverged to infinity. In other cases, it can be shown that iterative methods will converge, but only if the initial approximation is sufficiently close to the actual solution being sought. In general, given an arbitrary iterative method, there are no conditions that tell you where to start. However, as an engineering student, when you use such techniques, you should already know approximately what the solution should be, and from that information you should be able to choose reasonable initial values and reasonable tests as to whether or not the approximation is the desired one.
Summary of iteration

As you may have noticed, iteration can be very useful in finding approximations to solutions of equations, but its use allows for many possible failures. Consequently, any iterative algorithm should be accompanied by checks for slow convergence, convergence to an unwanted solution, divergence and failure to converge.
Linear algebra

The next tool for solving algebraic equations is finding approximations to solutions of linear equations. Given a system of n linear equations in n unknowns, the objective is to find a solution that satisfies all n equations. In general, these are the only systems of equations that we can reliably solve, and therefore in many cases we will linearize: we replace a non-linear equation by one that is linear, or a system of non-linear equations by a system of linear equations. In solving the linear system, we hope that it will give us information about the solution of the non-linear equations.
In your course on linear algebra, you have already been exposed to Gaussian elimination. While this technique can be used to find numeric approximations of solutions to a system of linear equations, it is slow (Θ(n^3) for a system of n linear equations in n unknowns) and it is subject to round-off error; if certain precautions are not taken, the approximation can have a significant error associated with it. There are also iterative techniques for approximating solutions to systems of linear equations that are particularly effective for large sparse systems.
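As an illustration of one such iterative technique (the text does not prescribe a specific method here), the following C sketch implements the classical Jacobi iteration for a small system Ax = b, assuming the diagonal entries of A are non-zero and dominant enough for the iteration to converge:

    #define N 3

    // One classical iterative scheme: repeatedly solve the ith equation
    // for the ith unknown, using the previous iteration's values
    void jacobi( const double A[N][N], const double b[N],
                 double x[N], int iterations ) {
        double x_new[N];

        for ( int it = 0; it < iterations; ++it ) {
            for ( int i = 0; i < N; ++i ) {
                double sum = b[i];

                for ( int j = 0; j < N; ++j ) {
                    if ( j != i ) {
                        sum -= A[i][j]*x[j];
                    }
                }

                x_new[i] = sum/A[i][i];
            }

            for ( int i = 0; i < N; ++i ) {
                x[i] = x_new[i];   // accept the updated approximation
            }
        }
    }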
Interpolation

Given a set of n points (x1, y1), …, (xn, yn), if all the x values are different, there exists a unique polynomial of degree at most n − 1 that passes through all n points. This technique will often be used to convert a set of n observations into a continuous function.
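As a sketch of how such an interpolating polynomial might be evaluated, the following C function uses the Lagrange form, one of several equivalent constructions (the function name and signature are illustrative):

    // Evaluate at xp the polynomial of degree at most n - 1 passing
    // through the n points (x[0], y[0]), ..., (x[n-1], y[n-1]);
    // all x values are assumed to be distinct
    double interpolate( int n, const double x[], const double y[], double xp ) {
        double result = 0.0;

        for ( int i = 0; i < n; ++i ) {
            double term = y[i];

            // Multiply y[i] by the Lagrange basis polynomial evaluated at xp
            for ( int j = 0; j < n; ++j ) {
                if ( j != i ) {
                    term *= (xp - x[j])/(x[i] - x[j]);
                }
            }

            result += term;
        }

        return result;
    }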
Taylor series

A Taylor series describes the behavior of a function near a point a by

    f(x) = f(a) + f'(a)(x − a) + f''(a)/2! (x − a)^2 + f'''(a)/3! (x − a)^3 + ···.

Taylor series will be used primarily for error analysis, although with techniques such as automatic differentiation (where the derivative of a Matlab or C function can be deduced algorithmically), it is possible to use Taylor series in numerical computations. As an example of automatic differentiation, from the C function
    #include <math.h>

    double f( double x, double y ) {
        return 1.0 + x + x*(x*x - x*y*sin(x));
    }
it could be deduced that the partial derivatives are
    double f_x( double x, double y ) {
        return 1.0 + x*x - x*y*sin(x) + x*(2*x - y*sin(x) - x*y*cos(x));
    }

    double f_y( double x, double y ) {
        return -x*x*sin(x);
    }
These could then be compiled and called directly.
Also, given a set of n points (x1, y1), …, (xn, yn), if we allow the x values to converge on a single point (again, without any repetition except in the limit), the limit of the interpolating polynomials will be the (n − 1)th-order Taylor series approximation of the function at that limit point.
Bracketing

In some cases, it is simply not possible to use interpolation or Taylor series to find approximations to equations. In such cases, it may be necessary to revert to the intermediate-value theorem. For example, if we are attempting to approximate a root of a function f(x) and we know that f(x1) < 0 and f(x2) > 0, then if the function is continuous, there must be a root on the interval [x1, x2]. If we let

    x3 = (x1 + x2)/2,

then the sign of f(x3) will let us know whether a root is in [x1, x3] or in [x3, x2].
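A minimal C sketch of this bisection procedure follows, assuming f is continuous on [x1, x2] with x1 < x2 and that f(x1) and f(x2) have opposite signs (the signature and the tolerance parameter are illustrative):

    // Repeatedly halve [x1, x2] while maintaining a sign change,
    // so the interval always brackets a root
    double bisect( double f( double ), double x1, double x2, double eps_abs ) {
        while ( x2 - x1 > eps_abs ) {
            double x3 = (x1 + x2)/2.0;

            if ( f( x1 )*f( x3 ) <= 0.0 ) {
                x2 = x3;   // the root lies in [x1, x3]
            } else {
                x1 = x3;   // the root lies in [x3, x2]
            }
        }

        return (x1 + x2)/2.0;
    }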
Weighted averages

Finally, another approach to finding numerical approximations is to use weighted averages. A simple average of n values is the sum of those values divided by n,

    (x1 + x2 + ··· + xn)/n,

but a simple average may not always be the best approximation of a value in question. In some cases, we may have a number of weights c1, …, cn where

    c1 + c2 + ··· + cn = 1;

then

    c1 x1 + c2 x2 + ··· + cn xn

is a weighted average of the n x values. When c1 = c2 = ··· = cn = 1/n, the weighted average is the simple average.
As an example, suppose we wanted to approximate the average value of the sine function on [1.0, 1.2] with three function evaluations. One solution may be to calculate

    (sin(1.0) + sin(1.1) + sin(1.2))/3 = 0.88823914361218606542;

however, the weighted average

    (sin(1.0) + 2 sin(1.1) + sin(1.2))/4 = 0.88898119772449838406

(here c1 = 0.25, c2 = 0.5 and c3 = 0.25) is closer to the actual average value

    (1/0.2) ∫ from 1.0 to 1.2 of sin(x) dx = 0.88972275695733069880.
You may notice that the second error is almost exactly half the first (0.00148361… versus 0.00074156…). We will see later why there are good theoretical reasons for what may appear to be a coincidence.
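This computation is easily checked with a quick C sketch (the printed values agree with those above only to the roughly 16 significant digits a double carries):

    #include <stdio.h>
    #include <math.h>

    int main( void ) {
        // Simple average of three samples of sin(x) on [1.0, 1.2]
        double simple   = ( sin( 1.0 ) + sin( 1.1 ) + sin( 1.2 ) )/3.0;

        // Weighted average with weights 0.25, 0.5 and 0.25
        double weighted = ( sin( 1.0 ) + 2.0*sin( 1.1 ) + sin( 1.2 ) )/4.0;

        printf( "simple:   %.16f\n", simple );
        printf( "weighted: %.16f\n", weighted );

        return 0;
    }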
Summary of numerical techniques

In summary, there are six techniques that we will use to find numeric approximations to solutions of algebraic equations and formulas. Every algorithm will use at least one of these techniques, and often more. Next, we will look at the sources of error.
Sources of error

One source of error in numerical computations is rounding error, and this manifests itself in two ways:
1. Certain numbers, such as π, have non-terminating, non-repeating decimal representations. Such numbers cannot be stored exactly.
2. The results of many arithmetic operations, including most divisions, most non-integer multiplications, the sum of numbers with very different magnitudes, and the subtraction of very similar numbers, will either introduce additional rounding errors or amplify the effect of previous rounding errors.
As an example, suppose we want to calculate the average of two values that are approximately equal. The most obvious solution is to calculate

    c = (a + b)/2,

but what happens if a + b results in a numeric overflow? If we assume b > a, then while

    c = a + (b − a)/2

is algebraically equivalent to the straightforward calculation, this formula is not subject to numeric overflow.
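As a sketch, here is that overflow-safe average written in C for double-precision values (the function name is illustrative; the same idea applies to integer types):

    // Average two approximately equal values without forming a + b,
    // which may overflow when a and b are large and of the same sign;
    // a + (b - a)/2 is algebraically equivalent but stays in range
    double average( double a, double b ) {
        if ( a > b ) {
            double t = a;   // swap so that a <= b
            a = b;
            b = t;
        }

        return a + (b - a)/2.0;
    }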
There are other sources of error:
1. the values used may themselves be subject to error: a sensor may only be so precise, or the sensor itself could be subject to a bias (for example, always reading slightly higher or slightly lower than the true value), and
2. the mathematical model being used may itself be incorrect or incomplete.
Representation of numbers

This section will briefly describe the various means of storing numbers, including:
1. representations of integers,
2. floating-point representations of real numbers,
3. fixed-point representations of real numbers, and
4. the representation of complex numbers.
This course will focus on the second, the floating-point representation of real numbers, but we will at least introduce the other three.
Base 2 in favour of base 10

From early childhood, we have learned to count to 9, and having maxed out the number of digits available, we proceed to writing 10. This is referred to as base 10, as there are ten digits: 0, 1, 2, …, 8 and 9. It would be possible to have a computer store a base-10 number using, for example, ten different voltage levels, but it is easier to use just two voltages, thereby allowing only two digits: 0 and 1.

Thus, 0 and 1 represent the first two numbers, but the next must be 10, after which we have 11, and then 100. Thus, 10 represents "two", 11 "three", 100 "four", and so on. The first seventeen numbers are shown in this table.
Decimal Binary
0 0
1 1
2 10
3 11
4 100
5 101
6 110
7 111
8 1000
9 1001
10 1010
11 1011
12 1100
13 1101
14 1110
15 1111
16 10000
To differentiate between base-10 numbers ("decimal numbers") and base-2 numbers ("binary numbers"), if the possibility of ambiguity exists, the base is appended as a subscript, so 11₁₀ = 1011₂.
You may wonder whether this is efficient, as it takes five digits to represent "sixteen", whereas base 10 requires only two digits. The additional memory, however, is only a constant multiple: it requires approximately log₂(10) ≈ 3.3 times as many binary digits ("bits") as decimal digits to represent the same number. Thus, while one million may be represented with seven decimal digits, it requires 20 bits (1000000₁₀ = 11110100001001000000₂).
Examples of binary operations are presented here; always remember that binary arithmetic is just like decimal arithmetic, only 1 + 1 = 10 and 11 + 1 = 100, etc. For example, taking 1000111001₂ (569₁₀) and 10010101₂ (149₁₀):

    1000111001 + 10010101 = 1011001110            (569 + 149 = 718)
    1000111001 − 10010101 = 110100100             (569 − 149 = 420)
    1000111001 × 10010101 = 10100101100101101     (569 × 149 = 84781)
    1000111001 ÷ 10010101 ≈ 11.11010001           (569 ÷ 149 = 3.8187…, quotient truncated after eight fractional bits)
Converting a decimal number into a binary number is tedious, and the reader is welcome to look this topic up on his or her own; however, we will make one comment: just because a number has a finite representation in base 10, such as 0.3, this does not mean that its binary representation will also be finite.
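This is easy to demonstrate: 0.3 has the non-terminating binary expansion 0.010011001100···₂, so a computer stores only the nearest representable double-precision value, as this C snippet shows:

    #include <stdio.h>

    int main( void ) {
        // 0.3 has no finite binary representation, so the nearest
        // representable double is stored instead
        printf( "%.20f\n", 0.3 );   // prints 0.29999999999999998890

        return 0;
    }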
The conversion of a number from binary to decimal is quite straightforward. Recall that the decimal integer dn···d1d0 represents the number

    dn·10^n + ··· + d1·10 + d0,

as 5402 is 5000 + 400 + 0 + 2. Similarly, each bit corresponds with a power of two, so if the bits of an integer are numbered as bn···b1b0, the integer represents the number

    bn·2^n + ··· + b1·2 + b0,

so 1101₂ is 8 + 4 + 0 + 1 = 13. This also works for real numbers, where 0.1₂ represents 2^(−1) or 0.5, 0.01₂ represents 0.25, and so on. Thus, 101010.0010111₂ represents

    32 + 8 + 2 + 0.125 + 0.03125 + 0.015625 + 0.0078125 = 42.1796875.
Representations of integers

Integers are generally stored in computers as n-bit unsigned integers capable of storing values from 0 to 2^n − 1, or as signed integers using 2's complement, capable of storing values from −2^(n−1) to 2^(n−1) − 1. In general, positive numbers are always stored using a base-2 representation, where the kth bit represents the coefficient of 2^k in the binary expansion of the number.
For example, with the bits numbered from 15 down to 0, the 16-bit representation 0000000000101010 represents 2^1 + 2^3 + 2^5 = 42. Note, however, that if a system is little-endian (as discussed elsewhere), such a 16-bit binary representation would be stored in main memory with its bytes reversed, as 0010101000000000.
The 2's-complement representation, storing both positive and negative integers, is as follows: given n bits,
1. if the first bit is 0, the remaining n − 1 bits represent integers from 0 to 2^(n−1) − 1 using a base-2 representation, while
2. if the first bit is 1, the remaining n − 1 bits represent −(2^(n−1) − b), where b is the positive integer stored in the remaining n − 1 bits (a number from 0 to 2^(n−1) − 1), so negative numbers range from −2^(n−1) to −1.
The easiest way to calculate the representation of a negative integer is to take the positive number from 1 to 2^(n−1) (from 000…001 to 100…000), take the bit-wise complement (from 111…110 to 011…111), and add 1 to the result (from 111…111 to 100…000). Note this forces the first bit to be 1.
For example, given the 16-bit representation of 42₁₀ = 101010₂, the 16-bit 2's-complement representation of −42 is found as follows:

     x     = 0000000000101010
    ~x     = 1111111111010101
    ~x + 1 = 1111111111010110

All positive integers have a leading 0 and all negative numbers have a leading 1. Incidentally, the largest negative number is 1000000000000000, while the representation of −1 is 1111111111111111. If you ask most libraries for the absolute value of the largest negative number, it comes back unchanged: a negative number.
The most significant benefit of the 2's-complement representation is that addition does not require additional checks. For example, we can find −42 + 10 by calculating:

      1111111111010110
    + 0000000000001010
      1111111111100000

This result is negative (the first bit is 1), and thus to read it we calculate the additive inverse of the result:

     y     = 1111111111100000
    ~y     = 0000000000011111
    ~y + 1 = 0000000000100000

That is, the sum is −32.
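These manipulations can be observed directly in C using the fixed-width integer types (a small sketch; the printf formats are incidental):

    #include <stdio.h>
    #include <stdint.h>

    int main( void ) {
        int16_t x = 42;
        int16_t y = ~x + 1;                        // two's complement: -42

        printf( "%d\n", y );                       // prints -42
        printf( "%04x\n", (unsigned)(uint16_t)y ); // prints ffd6 = 1111111111010110
        printf( "%d\n", (int16_t)(y + 10) );       // prints -32

        return 0;
    }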
While there have previously been other digital formats (for example, binary-coded decimal), these representations
for positive integers and signed integers are almost universal today.
One issue with integer representations is what happens if the result of an operation cannot be represented. For example, suppose we add 1 to the largest signed integer (say, 1 + 0111111111111111). There are two approaches:
1. The most common is to wrap and signal an overflow, so the result is 1000000000000000, which is the largest negative integer. Most high-level programming languages do not allow the programmer to determine whether an overflow has occurred, and therefore it is necessary to check before an operation is performed whether an overflow will occur.
2. The second is referred to as saturation arithmetic, where, for example, adding one to the largest integer returns the largest integer. This was discussed previously with the QADD operation; a sketch of saturating addition in software follows below.
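The following C sketch shows what 16-bit saturating addition might look like in software (the function name sat_add is illustrative; hardware instructions such as QADD perform this in a single operation):

    #include <stdint.h>

    // On overflow, clamp to the largest or smallest representable
    // 16-bit integer instead of wrapping
    int16_t sat_add( int16_t a, int16_t b ) {
        int32_t sum = (int32_t)a + (int32_t)b;   // cannot overflow in 32 bits

        if ( sum > INT16_MAX ) {
            return INT16_MAX;
        } else if ( sum < INT16_MIN ) {
            return INT16_MIN;
        }

        return (int16_t)sum;
    }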
One operation that must, however, be avoided at all costs is a division-by-zero or modulo-zero operation. Such operations will throw an interrupt that halts the currently executing task. The Clementine lunar mission, which failed in part due to the absence of a watchdog timer, had a second peculiarity: prior to the exception that caused the processor to hang, there had previously been almost 3000 similar exceptions. See Jack Ganssle's 2002 article "Born to Fail" for further details.
In summary, fixed-length base-2 representations of positive integers and the 2's-complement representation of negative numbers are nearly universal. Most applications use the usual wrapping arithmetic while checking for overflow; however, saturation arithmetic may be more appropriate in critical systems where an accidental overflow may result in a disaster (as in the Ariane 5 rocket). Allowing exceptions to result from invalid integer operations has also caused numerous issues.
Floating-point representations

Real numbers are generally approximated using floating- or fixed-point representations. We say approximated because almost every real number cannot be represented exactly using any finite-length representation.

Floating-point approximations usually use one of two representations specified by IEEE 754: single- and double-precision floating-point numbers, or float and double, respectively. For general applications, double-precision floating-point numbers, which occupy eight bytes, have sufficient precision for most engineering and scientific computation. Single-precision floating-point numbers occupy only four bytes and have significantly less precision, and therefore should only be used when coarse approximations are acceptable, such as in the generation of graphics. In embedded systems, however, if it can be determined that the higher precision of the double format is not necessary, use of the float format can result in significant savings in memory and run time.
Most larger microcontrollers have floating-point units (FPUs) which perform floating-point operations.
Issues like those associated with invalid integer operations are avoided in floating-point arithmetic by the introduction of three special floating-point values representing infinity, negative infinity and not-a-number. These values result from operations such as 1.0/0.0, -1e300*1e300 and 0.0/0.0, respectively. Consequently, no floating-point operation will ever raise an exception. Note that even zero is signed: +0 and -0 represent, respectively, positive and negative real numbers too small to be represented by any other floating-point number. Therefore, 1.0/(-0.0) results in negative infinity.
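A short C sketch demonstrating these special values (the exact text printed for not-a-number varies by platform):

    #include <stdio.h>

    int main( void ) {
        double zero = 0.0;

        printf( "%f\n",  1.0/zero );       // prints inf
        printf( "%f\n", -1e300*1e300 );    // prints -inf (overflow)
        printf( "%f\n",  0.0/zero );       // prints nan (or -nan)
        printf( "%f\n",  1.0/(-zero) );    // prints -inf, as zero is signed

        return 0;
    }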
For further information on floating-point numbers, see any good text on numerical analysis.
Fixed-point representations

Fixed-point representation of real numbers is usually restricted to smaller microcontrollers that lack an FPU, often with only 24- or 16-bit registers or smaller. In a fixed-point representation, the first bit is usually the sign bit, and the radix point is arbitrarily fixed at some location within the number. Thus, if a 16-bit number has a sign bit, 7 bits for the integer component, and 8 bits for the fractional component, the value of π would be represented by

    0000001100100100,

which is the approximation 11.001001₂ = 3.140625₁₀, with a 0.0308 % relative error. This format can represent real numbers in the range (−128, 128). Adding two fixed-point representations can, for the most part, be done with integer addition, but multiplication requires a little more effort: the 16-bit numbers are multiplied as 32-bit numbers, and the last 8 bits of the product are then truncated:

    11.00100100 × 11.00100100 = 1001.1101110100010000.

Thus, π² is approximately equal to 1001.11011101₂ = 9.86328125₁₀, whereas π² = 9.86960440···.
Whether or not numbers like 0111111111111111 and 1111111111111111 represent plus or minus infinity is a
question that must be addressed during the design phase.
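A sketch of this fixed-point multiplication in C, using the format described above (one sign bit, 7 integer bits and 8 fractional bits, stored in a 16-bit signed integer scaled by 2^8; the type and function names are illustrative):

    #include <stdint.h>

    typedef int16_t fixed_t;   // the value represented is the stored integer divided by 2^8

    fixed_t fixed_multiply( fixed_t a, fixed_t b ) {
        // Multiply as 32-bit integers, giving 16 fractional bits,
        // then truncate the last 8 bits; note that right-shifting
        // a negative value is implementation-defined in C, so
        // production code would need more care
        int32_t product = (int32_t)a*(int32_t)b;

        return (fixed_t)(product >> 8);
    }

For example, with a = 0x0324 (that is, 3.140625), fixed_multiply( a, a ) returns 0x09DD, which represents 9.86328125, matching the computation above.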
Representation of complex numbers

The most usual means of representing a complex number is to store a pair of real numbers representing the real and imaginary components of the complex number.

Fortunately, Matlab allows you to work seamlessly with complex numbers. By default, the variables i and j both represent the imaginary unit, √−1, but even if you assign to these variables, you may always enter a complex number by juxtaposing the imaginary unit with the imaginary component:

    >> j = 4;
    >> 3.25 + 4.79j
    ans = 3.2500 + 4.7900i

Indeed, Matlab recommends using 1j instead of j for entering the imaginary unit (to avoid the possibility that your code may at some point fail if j is accidentally assigned a value earlier in your scripts).
While it is possible to store a complex number as a pair of real numbers representing its magnitude and argument, this is seldom done in practice except in special circumstances.
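In C, the rectangular representation is simply a pair of doubles; a minimal sketch follows (C99 also provides a built-in double _Complex type in <complex.h>):

    // A complex number stored as its real and imaginary components
    typedef struct {
        double re;
        double im;
    } complex_t;

    // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    complex_t complex_multiply( complex_t z, complex_t w ) {
        complex_t product = {
            z.re*w.re - z.im*w.im,
            z.re*w.im + z.im*w.re
        };

        return product;
    }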
Summary of the representation of numbers

In this section, we have reviewed or introduced various binary representations of integers and real numbers. Each representation has its limitations, and developers of real-time systems must be aware of those limitations. We will continue with the introduction of definitions related to real-time systems.
Summary of our introduction to numerical algorithms

This first chapter discussed the techniques that will be used in numerical algorithms (iteration, linear algebra, interpolation, Taylor series, bracketing and weighted averages), together with a brief discussion of the sources of error and the representation of numbers.