Introduction to numerical algorithms
Given an algebraic equation or formula, we may want to approximate a value. In calculus, we deal with equations and formulas that are well defined at each point and that have properties such as continuity and differentiability; in engineering applications, such information is not always available. Instead, we may understand that the underlying behavior is continuous and differentiable, but we may only have samples of the values of the equation or formula. Our goal will be to find algorithms that give us, under many circumstances, good approximations of solutions to an equation or of the value of a formula.
In this introductory chapter, we will look at:
1. techniques used in numerical algorithms,
2. sources of error, and
3. the representation of floating-point numbers.
We will begin with the techniques used in numerical algorithms.

Techniques in numerical algorithms

Algorithms for finding numerical approximations to solutions of algebraic equations and formulas generally use at least one of six techniques, including:
1. iteration,
2. linear algebra,
3. interpolation,
4. Taylor series,
5. bracketing, and
6. weighted averages.
We will look at each of these six techniques, and while each is relatively straightforward on its own, we will see that together they allow us to compute solutions to some of the most complex algebraic equations and formulas.
Iteration

This section will first introduce the concept of iteration, then look at a straightforward example, the fixed-point theorem, and it will conclude with a discussion of initial points for such iterations.
Iteration

Many numerical algorithms involve taking a poor approximation xk and from it finding a better approximation xk+1; this process can be repeated so that, under certain conditions and usually only in theory, the approximations get closer and closer to the correct answer. Problems with iterative approaches include:
1. the sequence converges, but very slowly,
2. the sequence converges to a solution that is not the one we are looking for,
3. the sequence diverges (approaches plus or minus infinity), or
4. the sequence does not converge.
When we discuss the fixed-point theorem, we will see examples of each of these.
Fixed-point theorem

The easiest example of an iterative means of approximating a solution to an equation is finding a solution to

    x = f(x)

for a function f. In this case, if we start with any initial approximation x0 and let

    xk+1 = f(xk),

then under specific circumstances, the fixed-point theorem says that the sequence will converge to a solution of the equation x = f(x).
Example 1

As an example, suppose we want to approximate a solution to the equation

    x = cos(x).

The only solution to this equation is approximately 0.73908513321516064166.
Now, we know that π/4 ≈ 0.7853981634 and cos(π/4) = √2/2 ≈ 0.7071067810, so let's start with x0 = √2/2; but as computers cannot store √2/2 exactly, we will start out with a 20-decimal-digit approximation, x0 = 0.70710678118654752440, and so we begin:

    x1 = cos(x0) = cos(0.70710678118654752440) = 0.76024459707563015125.
If we repeat this, we have the sequence of values presented here:
x0 0.70710678118654752440
x1 0.76024459707563015125
x2 0.72466748088912622790
x3 0.74871988578948429789
x4 0.73256084459224179590
x5 0.74346421131529366888
x6 0.73612825650085194340
x7 0.74107368708371021764
x8 0.73774415899257467163
x9 0.73998776479587092315
x10 0.73847680872455379036
x11 0.73949477113197436584
x12 0.73880913418406974579
x13 0.73927102133010927466
x14 0.73895990397625177601
x15 0.73916948334137422989
x16 0.73902831132627283515
x17 0.73912340792986369997
x18 0.73905935036556661907
x19 0.73910250060713648065
x20 0.73908514737681824432
The easiest way to observe this is to take any older-generation calculator set to radians, punch in any number, and then start hitting the cos key.
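The same experiment is easy to reproduce in C. The following is a minimal sketch (the iteration count of 20 is an arbitrary choice, and since the double type carries only about 16 significant decimal digits, the trailing digits will differ slightly from the 20-digit values tabulated above):

    #include <stdio.h>
    #include <math.h>

    int main( void ) {
        double x = 0.70710678118654752440;   // x0, our approximation of sqrt(2)/2

        // Repeatedly apply x <- cos(x); under the conditions of the
        // fixed-point theorem, this converges to the solution of x = cos(x)
        for ( int k = 1; k <= 20; ++k ) {
            x = cos( x );
            printf( "x%d = %.16f\n", k, x );
        }

        return 0;
    }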
Example 2

If we try instead to approximate a solution to x = sin(x), we know the only solution to this equation is x = 0. If, however, we start with x0 = 1.0, the first step looks hopeful, x1 = sin(1.0) = 0.84147098480789650665, but we might rightfully have reason for concern if it takes nine iterations to get a value less than 0.5:
x0 1.00000000000000000000
x1 0.84147098480789650665
x2 0.74562414166555788889
x3 0.67843047736074022898
x4 0.62757183204915913888
x5 0.58718099657343098933
x6 0.55401639075562963033
x7 0.52610707550284170222
x8 0.50217067626855534868
x9 0.48132935526234634496
In this case, the convergence is very slow; after 10000 iterations, our approximation is still
x10000 = 0.017313621122353677159—far less accurate than twenty iterations with the equation x = cos(x).
Example 3

The equation x = e^x − 1 − cos(x) has a solution at x = 0.92713713388001711273, but if we start with x0 = 0.9271, we find that we actually converge to the other solution at x = −1.1132554312382240490.
x0 0.9271000000000000000
x1 0.9270135853516848682
x2 0.9267260909475873311
x3 0.9257697888633520838
x4 0.9225906690922362275
x5 0.9120425734814909414
x6 0.8772702797682118268
x7 0.7650749295271742882
x8 0.4278249398787326488
x9 –0.3759527882387797054
x10 –1.2435234699343957626
x11 –1.0330954581334626660
x12 –1.1562590805182166271
x13 –1.0881053100268105210
x14 –1.1273102952777548147
x15 –1.1051875733572243861
x16 –1.1178180735432224195
x17 –1.1106528702068170966
x18 –1.1147327749625783986
x19 –1.1124144944595464182
x20 –1.1137333605438615669
Example 4

If we take the same equation, x = e^x − 1 − cos(x), but instead we start with x0 = 0.9272, we find a different result:
x0 0.9272
x1 0.9273463062488297850
x2 0.9278331540644906092
x3 0.9294536680953979168
x4 0.9348530289767798115
x5 0.9529024459698862581
x6 1.0139056955772054701
x7 1.2277962596009892473
x8 2.0773844215152758665
x9 7.4687566436661236317
x10 1751.0506722832620736
x11 2.9624055016857770800 × 10760
Example 5

If we consider the equation x = 1 + cos(x) − e^x, we note that this equation has only one solution, approximately at x = 0.41010429603233999790; however, after 10000 iterations, the sequence neither converges to our solution, nor does it converge to any other solution, nor does it diverge to infinity. Instead, the values always remain bounded.
x0 1.0000000000000000000
x1 –1.1779795225909055180
x2 1.0748919751463956454
x3 –1.4538491475114394034
x4 0.8830116594692874759
x5 –0.7833444047171225754
x6 1.2516820387258261270
x7 –2.1824931015815898224
x8 0.3129825494937142901
x9 0.5839218156734147546
x10 0.0412502764636847386
x11 0.9570364387415620172
x12 –1.0280228200916764844
x13 1.1587993460955800111
x14 –1.7856655640731183991
x15 0.6190949071537133250
x16 –0.0428422865585101293
x17 1.0410199321213100996
x18 –1.3267636954142391473
x19 0.9762831542458109225
x20 –1.0944657282613892553
To demonstrate this more clearly, one can plot the first 2000 iterations together with the actual solution x = 0.41010429603233999790: the approximations jump both above and below this value, but never converge to it.
Example 6

Finally, if we consider the equation x = 3.5x(1 − x), we note that this equation has two solutions, at x = 0 and approximately at x = 0.71428571428571428571; however, after twenty iterations, we note that the points bounce between four values, none of which is either solution.
x0 0.50000000000000000000
x1 0.87500000000000000000
x2 0.38281250000000000000
x3 0.82693481445312500000
x4 0.50089769484475255013
x5 0.87499717950387996645
x6 0.38281990377447189380
x7 0.82694088767001590788
x8 0.50088379589339714035
x9 0.87499726616686585021
x10 0.38281967628581869058
x11 0.82694070106983887107
x12 0.50088422294386791022
x13 0.87499726352424938150
x14 0.38281968322263632519
x15 0.82694070675984845448
x16 0.50088420992179774086
x17 0.87499726360484968051
x18 0.38281968301106208420
x19 0.82694070658630209823
x20 0.50088421031897331932
Convergence criteria

If a sequence of points converges to a point x, it is necessary that

    lim (k→∞) |xk − x| = 0,

but for numerical solutions, we don't require the exact answer, only an approximation, and thus we may desire that

    |xk − x| < εabs,

but even then we can't always guarantee the sequence will converge. Of course, we don't know when we're sufficiently close, because we don't know the actual value x; thus, we will instead look at

    |xk+1 − xk| < εabs.

Unfortunately, even this does not guarantee convergence, as we saw with the example with x = sin(x): there,

    |x10000 − x9999| ≈ 0.0000008651,

but we are still quite far away from the solution:

    |x10000 − 0| ≈ 0.01731.

Where possible, we may require other convergence criteria.
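In code, these criteria translate into a stopping condition. The following C sketch (the function name fixed_point and its parameters are illustrative, not prescribed by the text) halts when successive approximations differ by less than eps_abs, and gives up after max_iterations steps:

    #include <math.h>

    // Iterate x <- f(x) until successive approximations differ by less
    // than eps_abs, or until max_iterations steps have been taken
    double fixed_point( double f( double ), double x0,
                        double eps_abs, int max_iterations ) {
        for ( int k = 0; k < max_iterations; ++k ) {
            double x1 = f( x0 );

            if ( fabs( x1 - x0 ) < eps_abs ) {
                return x1;
            }

            x0 = x1;
        }

        return NAN;   // failed to satisfy |x(k+1) - x(k)| < eps_abs
    }

Returning NaN is one way of signalling failure; as the examples above show, a small step between successive iterates does not by itself guarantee that the returned value is close to a true solution.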
Initial values

As two of the examples demonstrated, using slightly different initial values can result in drastically different behaviours: one sequence converged to the other solution, while the other diverged to infinity. In other cases, it can be shown that iterative methods will converge, but only if the initial approximation is sufficiently close to the actual solution being sought. In general, given an arbitrary iterative method, there are no conditions that tell you where to start. However, as an engineering student, when you use such techniques, you should already know approximately what the solution should be, and from that information you should be able to choose reasonable initial values and reasonable tests as to whether or not the approximation is the desired one.
Summary of iteration

As you may have noticed, iteration can be very useful in finding approximations to solutions of equations, but its use allows for many possible failures. Consequently, any iterative algorithm should be accompanied by checks for slow convergence, convergence to an unwanted solution, divergence and failure to converge.
Linear algebra

The next tool for solving algebraic equations is finding approximations to solutions of linear equations. Given a system of n linear equations in n unknowns, the objective is to find a solution that satisfies all n equations. In general, these are the only systems of equations that we can reliably solve, and therefore in many cases we will linearize: we replace a non-linear equation by one that is linear, or a system of non-linear equations by a system of linear equations. In solving the linear system, we hope that it will give us information about the solution of the non-linear equations.
In your course on linear algebra, you have already been exposed to Gaussian elimination. While this technique can be used to find numeric approximations of solutions to a system of linear equations, it is slow (Θ(n^3) for a system of n linear equations in n unknowns) and it is subject to round-off error; if certain precautions are not taken, the approximation can have a significant error associated with it. There are also iterative techniques for approximating solutions to systems of linear equations that are particularly effective for large sparse systems.
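As an illustration of one such iterative technique (the text does not prescribe a specific method here), the following C sketch implements the classical Jacobi iteration for a small system Ax = b, assuming the diagonal entries of A are non-zero and dominant enough for the iteration to converge:

    #define N 3

    // One classical iterative scheme: repeatedly solve the ith equation
    // for the ith unknown, using the previous iteration's values
    void jacobi( const double A[N][N], const double b[N],
                 double x[N], int iterations ) {
        double x_new[N];

        for ( int it = 0; it < iterations; ++it ) {
            for ( int i = 0; i < N; ++i ) {
                double sum = b[i];

                for ( int j = 0; j < N; ++j ) {
                    if ( j != i ) {
                        sum -= A[i][j]*x[j];
                    }
                }

                x_new[i] = sum/A[i][i];
            }

            for ( int i = 0; i < N; ++i ) {
                x[i] = x_new[i];   // accept the updated approximation
            }
        }
    }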
Interpolation

Given a set of n points (x1, y1), …, (xn, yn), if all the x values are different, there exists a unique polynomial of degree at most n − 1 that passes through all n points. This technique will often be used to convert a set of n observations into a continuous function.
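As a sketch of how such an interpolating polynomial might be evaluated, the following C function uses the Lagrange form, one of several equivalent constructions (the function name and signature are illustrative):

    // Evaluate at xp the polynomial of degree at most n - 1 passing
    // through the n points (x[0], y[0]), ..., (x[n-1], y[n-1]);
    // all x values are assumed to be distinct
    double interpolate( int n, const double x[], const double y[], double xp ) {
        double result = 0.0;

        for ( int i = 0; i < n; ++i ) {
            double term = y[i];

            // Multiply y[i] by the Lagrange basis polynomial evaluated at xp
            for ( int j = 0; j < n; ++j ) {
                if ( j != i ) {
                    term *= (xp - x[j])/(x[i] - x[j]);
                }
            }

            result += term;
        }

        return result;
    }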
Taylor series

A Taylor series describes the behavior of a function near a point a by

    f(x) = f(a) + f'(a)(x − a) + f''(a)/2! (x − a)^2 + f'''(a)/3! (x − a)^3 + ···.

Taylor series will be used primarily for error analysis, although with techniques such as automatic differentiation (where the derivative of a Matlab or C function can be deduced algorithmically), it is possible to use Taylor series in numerical computations. As an example of automatic differentiation, from the C function
    #include <math.h>

    double f( double x, double y ) {
        return 1.0 + x + x*(x*x - x*y*sin(x));
    }
it could be deduced that the partial derivatives are
    double f_x( double x, double y ) {
        return 1.0 + x*x - x*y*sin(x) + x*(2*x - y*sin(x) - x*y*cos(x));
    }

    double f_y( double x, double y ) {
        return -x*x*sin(x);
    }
These could then be compiled and called directly.
Also, given a set of n points (x1, y1), …, (xn, yn), if we allow the x values to converge on a single point (again, without any repetition except in the limit), the limit of the interpolating polynomials will be the (n − 1)th-order Taylor series approximation of the function at that limit point.
Bracketing

In some cases, it is simply not possible to use interpolation or Taylor series to find approximations to equations. In such cases, it may be necessary to revert to the intermediate-value theorem. For example, if we are attempting to approximate a root of a function f(x) and we know that f(x1) < 0 and f(x2) > 0, then if the function is continuous, there must be a root on the interval [x1, x2]. If we let

    x3 = (x1 + x2)/2,

then the sign of f(x3) will let us know whether a root is in [x1, x3] or in [x3, x2].
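A minimal C sketch of this bisection procedure follows, assuming f is continuous on [x1, x2] with x1 < x2 and that f(x1) and f(x2) have opposite signs (the signature and the tolerance parameter are illustrative):

    // Repeatedly halve [x1, x2] while maintaining a sign change,
    // so the interval always brackets a root
    double bisect( double f( double ), double x1, double x2, double eps_abs ) {
        while ( x2 - x1 > eps_abs ) {
            double x3 = (x1 + x2)/2.0;

            if ( f( x1 )*f( x3 ) <= 0.0 ) {
                x2 = x3;   // the root lies in [x1, x3]
            } else {
                x1 = x3;   // the root lies in [x3, x2]
            }
        }

        return (x1 + x2)/2.0;
    }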
Weighted averages

Finally, another approach to finding numerical approximations is to use weighted averages. A simple average of n values is the sum of those values divided by n,

    (x1 + x2 + ··· + xn)/n,

but a simple average may not always be the best approximation of a value in question. In some cases, we may have a number of weights c1, …, cn where

    c1 + c2 + ··· + cn = 1;

then

    c1 x1 + c2 x2 + ··· + cn xn

is a weighted average of the n x values. When c1 = c2 = ··· = cn = 1/n, the weighted average is the simple average.
As an example, suppose we wanted to approximate the average value of the sine function on [1.0, 1.2] with three function evaluations. One solution may be to calculate

    (sin(1.0) + sin(1.1) + sin(1.2))/3 = 0.88823914361218606542;

however, the weighted average

    (sin(1.0) + 2 sin(1.1) + sin(1.2))/4 = 0.88898119772449838406

(here c1 = 0.25, c2 = 0.5 and c3 = 0.25) is closer to the actual average value

    (1/0.2) ∫ from 1.0 to 1.2 of sin(x) dx = 0.88972275695733069880.
You may notice that the second error is almost exactly half the first (0.00148361… versus 0.00074156…). We will see later why there are good theoretical reasons for what may appear to be a coincidence.
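This computation is easily checked with a quick C sketch (the printed values agree with those above only to the roughly 16 significant digits a double carries):

    #include <stdio.h>
    #include <math.h>

    int main( void ) {
        // Simple average of three samples of sin(x) on [1.0, 1.2]
        double simple   = ( sin( 1.0 ) + sin( 1.1 ) + sin( 1.2 ) )/3.0;

        // Weighted average with weights 0.25, 0.5 and 0.25
        double weighted = ( sin( 1.0 ) + 2.0*sin( 1.1 ) + sin( 1.2 ) )/4.0;

        printf( "simple:   %.16f\n", simple );
        printf( "weighted: %.16f\n", weighted );

        return 0;
    }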
Summary of numerical techniques

In summary, there are six techniques that we will use to find numeric approximations to solutions of algebraic equations and formulas. Every algorithm will use at least one of these techniques, and often more. Next, we will look at the sources of error.
Sources of error

One source of error in numerical computations is rounding error, and this manifests itself in two ways:
1. Certain numbers, such as π, have non-terminating, non-repeating decimal representations. Such numbers cannot be stored exactly.
2. The results of many arithmetic operations, including most divisions, most non-integer multiplications, the sum of numbers with very different magnitudes, and the subtraction of very similar numbers, will either introduce additional rounding errors or amplify the effect of previous rounding errors.
As an example, suppose we want to calculate the average of two values that are approximately equal. The most obvious solution is to calculate

    c = (a + b)/2,

but what happens if a + b results in a numeric overflow? If we assume b > a, then while

    c = a + (b − a)/2

is algebraically equivalent to the straightforward calculation, this formula is not subject to numeric overflow.
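As a sketch, here is that overflow-safe average written in C for double-precision values (the function name is illustrative; the same idea applies to integer types):

    // Average two approximately equal values without forming a + b,
    // which may overflow when a and b are large and of the same sign;
    // a + (b - a)/2 is algebraically equivalent but stays in range
    double average( double a, double b ) {
        if ( a > b ) {
            double t = a;   // swap so that a <= b
            a = b;
            b = t;
        }

        return a + (b - a)/2.0;
    }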
There are other sources of error:
1. the values used may themselves be subject to error: a sensor may only be so precise, or the sensor itself could be subject to a bias (for example, always reading slightly higher or slightly lower than the true value), and
2. the mathematical model being used may itself be incorrect or incomplete.
Representation of numbers

This section will briefly describe the various means of storing numbers, including:
1. representations of integers,
2. floating-point representations of real numbers,
3. fixed-point representations of real numbers, and
4. the representation of complex numbers.
This course will focus on the second, the floating-point representation of real numbers, but we will at least introduce the other three.
Base 2 in favour of base 10

From early childhood, we have learned to count to 9, and having maxed out the number of digits available, we proceed to writing 10. This is referred to as base 10, as there are ten digits: 0, 1, 2, …, 8 and 9. It would be possible to have a computer store a base-10 number using, for example, ten different voltage levels, but it is easier to use just two voltages, thereby allowing only two digits: 0 and 1.

Thus, 0 and 1 represent the first two numbers, but the next must be 10, after which we have 11, and then 100. Thus, 10 represents "two", 11 "three", 100 "four", and so on. The first seventeen numbers are shown in this table.
Decimal Binary
0 0
1 1
2 10
3 11
4 100
5 101
6 110
7 111
8 1000
9 1001
10 1010
11 1011
12 1100
13 1101
14 1110
15 1111
16 10000
To differentiate between base-10 numbers ("decimal numbers") and base-2 numbers ("binary numbers"), if the possibility of ambiguity exists, the base is appended as a subscript, so 11₁₀ = 1011₂.
You may wonder whether this is efficient, as it takes five digits to represent "sixteen", whereas base 10 requires only two digits. The additional memory, however, is only a constant multiple: it requires approximately log₂(10) ≈ 3.3 times as many binary digits ("bits") as decimal digits to represent the same number. Thus, while one million may be represented with seven decimal digits, it requires 20 bits (1000000₁₀ = 11110100001001000000₂).
Examples of binary operations are presented here; always remember that binary arithmetic is just like decimal arithmetic, only 1 + 1 = 10 and 11 + 1 = 100, etc. For example, taking 1000111001₂ (569₁₀) and 10010101₂ (149₁₀):

    1000111001 + 10010101 = 1011001110            (569 + 149 = 718)
    1000111001 − 10010101 = 110100100             (569 − 149 = 420)
    1000111001 × 10010101 = 10100101100101101     (569 × 149 = 84781)
    1000111001 ÷ 10010101 ≈ 11.11010001           (569 ÷ 149 = 3.8187…, quotient truncated after eight fractional bits)
Converting a decimal number into a binary number is tedious, and the reader is welcome to look this topic up on his or her own; however, we will make one comment: just because a number has a finite representation in base 10, such as 0.3, this does not mean that its binary representation will also be finite.
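This is easy to demonstrate: 0.3 has the non-terminating binary expansion 0.010011001100···₂, so a computer stores only the nearest representable double-precision value, as this C snippet shows:

    #include <stdio.h>

    int main( void ) {
        // 0.3 has no finite binary representation, so the nearest
        // representable double is stored instead
        printf( "%.20f\n", 0.3 );   // prints 0.29999999999999998890

        return 0;
    }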
The conversion of a number from binary to decimal is quite straightforward. Recall that the decimal integer dn···d1d0 represents the number

    dn·10^n + ··· + d1·10 + d0,

as 5402 is 5000 + 400 + 0 + 2. Similarly, each bit corresponds with a power of two, so if the bits of an integer are numbered as bn···b1b0, the integer represents the number

    bn·2^n + ··· + b1·2 + b0,

so 1101₂ is 8 + 4 + 0 + 1 = 13. This also works for real numbers, where 0.1₂ represents 2^(−1) or 0.5, 0.01₂ represents 0.25, and so on. Thus, 101010.0010111₂ represents

    32 + 8 + 2 + 0.125 + 0.03125 + 0.015625 + 0.0078125 = 42.1796875.
Representations of integers

Integers are generally stored in computers as n-bit unsigned integers capable of storing values from 0 to 2^n − 1, or as signed integers using 2's complement, capable of storing values from −2^(n−1) to 2^(n−1) − 1. In general, positive numbers are always stored using a base-2 representation, where the kth bit represents the coefficient of 2^k in the binary expansion of the number.
For example, with the bits numbered from 15 down to 0, the 16-bit representation 0000000000101010 represents 2^1 + 2^3 + 2^5 = 42. Note, however, that if a system is little-endian (as discussed elsewhere), such a 16-bit binary representation would be stored in main memory with its bytes reversed, as 0010101000000000.
The 2's-complement representation, storing both positive and negative integers, is as follows: given n bits,
1. if the first bit is 0, the remaining n − 1 bits represent integers from 0 to 2^(n−1) − 1 using a base-2 representation, while
2. if the first bit is 1, the remaining n − 1 bits represent −(2^(n−1) − b), where b is the positive integer stored in the remaining n − 1 bits (a number from 0 to 2^(n−1) − 1), so negative numbers range from −2^(n−1) to −1.
The easiest way to calculate the representation of a negative integer is to take the positive number from 1 to 2^(n−1) (from 000…001 to 100…000), take the bit-wise complement (from 111…110 to 011…111), and add 1 to the result (from 111…111 to 100…000). Note this forces the first bit to be 1.
For example, given the 16-bit representation of 42₁₀ = 101010₂, the 16-bit 2's-complement representation of −42 is found as follows:

     x     = 0000000000101010
    ~x     = 1111111111010101
    ~x + 1 = 1111111111010110

All positive integers have a leading 0 and all negative numbers have a leading 1. Incidentally, the largest negative number is 1000000000000000, while the representation of −1 is 1111111111111111. If you ask most libraries for the absolute value of the largest negative number, it comes back unchanged: a negative number.
The most significant benefit of the 2's-complement representation is that addition does not require additional checks. For example, we can find −42 + 10 by calculating:

      1111111111010110
    + 0000000000001010
      1111111111100000

This result is negative (the first bit is 1), and thus to read it we calculate the additive inverse of the result:

     y     = 1111111111100000
    ~y     = 0000000000011111
    ~y + 1 = 0000000000100000

That is, the sum is −32.
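These manipulations can be observed directly in C using the fixed-width integer types (a small sketch; the printf formats are incidental):

    #include <stdio.h>
    #include <stdint.h>

    int main( void ) {
        int16_t x = 42;
        int16_t y = ~x + 1;                        // two's complement: -42

        printf( "%d\n", y );                       // prints -42
        printf( "%04x\n", (unsigned)(uint16_t)y ); // prints ffd6 = 1111111111010110
        printf( "%d\n", (int16_t)(y + 10) );       // prints -32

        return 0;
    }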
While there have previously been other digital formats (for example, binary-coded decimal), these representations
for positive integers and signed integers are almost universal today.
One issue with integer representations is what happens if the result of an operation cannot be represented. For example, suppose we add 1 to the largest signed integer (say, 1 + 0111111111111111). There are two approaches:
1. The most common is to wrap and signal an overflow, so the result is 1000000000000000, which is the largest negative integer. Most high-level programming languages do not allow the programmer to determine whether an overflow has occurred, and therefore it is necessary to check before an operation is performed whether an overflow will occur.
2. The second is referred to as saturation arithmetic, where, for example, adding one to the largest integer returns the largest integer. This was discussed previously with the QADD operation; a sketch of saturating addition in software follows below.
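The following C sketch shows what 16-bit saturating addition might look like in software (the function name sat_add is illustrative; hardware instructions such as QADD perform this in a single operation):

    #include <stdint.h>

    // On overflow, clamp to the largest or smallest representable
    // 16-bit integer instead of wrapping
    int16_t sat_add( int16_t a, int16_t b ) {
        int32_t sum = (int32_t)a + (int32_t)b;   // cannot overflow in 32 bits

        if ( sum > INT16_MAX ) {
            return INT16_MAX;
        } else if ( sum < INT16_MIN ) {
            return INT16_MIN;
        }

        return (int16_t)sum;
    }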
One operation that must, however, be avoided at all costs is a division-by-zero or modulo-zero operation. Such operations will throw an interrupt that halts the currently executing task. The Clementine lunar mission, which failed in part due to the absence of a watchdog timer, had a second peculiarity: prior to the exception that caused the processor to hang, there had previously been almost 3000 similar exceptions. See Jack Ganssle's 2002 article "Born to Fail" for further details.
In summary, fixed-length base-2 representations of positive integers and the 2's-complement representation of negative numbers are nearly universal. Most applications use the usual wrapping arithmetic while checking for overflow; however, saturation arithmetic may be more appropriate in critical systems where an accidental overflow may result in a disaster (as in the Ariane 5 rocket). Allowing exceptions to result from invalid integer operations has also caused numerous issues.
Floating-point representations

Real numbers are generally approximated using floating- or fixed-point representations. We say approximated because almost every real number cannot be represented exactly using any finite-length representation.

Floating-point approximations usually use one of two representations specified by IEEE 754: single- and double-precision floating-point numbers, or float and double, respectively. For general applications, double-precision floating-point numbers, which occupy eight bytes, have sufficient precision for most engineering and scientific computation. Single-precision floating-point numbers occupy only four bytes and have significantly less precision, and therefore should only be used when coarse approximations are acceptable, such as in the generation of graphics. In embedded systems, however, if it can be determined that the higher precision of the double format is not necessary, use of the float format can result in significant savings in memory and run time.
Most larger microcontrollers have floating-point units (FPUs) which perform floating-point operations.
Issues like those associated with invalid integer operations are avoided in floating-point arithmetic by the introduction of three special floating-point values representing infinity, negative infinity and not-a-number. These values result from operations such as 1.0/0.0, -1e300*1e300 and 0.0/0.0, respectively. Consequently, no floating-point operation will ever raise an exception. Note that even zero is signed: +0 and -0 represent, respectively, positive and negative real numbers too small to be represented by any other floating-point number. Therefore, 1.0/(-0.0) results in negative infinity.
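A short C sketch demonstrating these special values (the exact text printed for not-a-number varies by platform):

    #include <stdio.h>

    int main( void ) {
        double zero = 0.0;

        printf( "%f\n",  1.0/zero );       // prints inf
        printf( "%f\n", -1e300*1e300 );    // prints -inf (overflow)
        printf( "%f\n",  0.0/zero );       // prints nan (or -nan)
        printf( "%f\n",  1.0/(-zero) );    // prints -inf, as zero is signed

        return 0;
    }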
For further information on floating-point numbers, see any good text on numerical analysis.
Fixed-point representations

Fixed-point representation of real numbers is usually restricted to smaller microcontrollers that lack an FPU, often with only 24- or 16-bit registers or smaller. In a fixed-point representation, the first bit is usually the sign bit, and the radix point is arbitrarily fixed at some location within the number. Thus, if a 16-bit number has a sign bit, 7 bits for the integer component, and 8 bits for the fractional component, the value of π would be represented by

    0000001100100100,

which is the approximation 11.001001₂ = 3.140625₁₀, with a 0.0308 % relative error. This format can represent real numbers in the range (−128, 128). Adding two fixed-point representations can, for the most part, be done with integer addition, but multiplication requires a little more effort: the 16-bit numbers are multiplied as 32-bit numbers, and the last 8 bits of the product are then truncated:

    11.00100100 × 11.00100100 = 1001.1101110100010000.

Thus, π² is approximately equal to 1001.11011101₂ = 9.86328125₁₀, whereas π² = 9.86960440···.
Whether or not numbers like 0111111111111111 and 1111111111111111 represent plus or minus infinity is a
question that must be addressed during the design phase.
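A sketch of this fixed-point multiplication in C, using the format described above (one sign bit, 7 integer bits and 8 fractional bits, stored in a 16-bit signed integer scaled by 2^8; the type and function names are illustrative):

    #include <stdint.h>

    typedef int16_t fixed_t;   // the value represented is the stored integer divided by 2^8

    fixed_t fixed_multiply( fixed_t a, fixed_t b ) {
        // Multiply as 32-bit integers, giving 16 fractional bits,
        // then truncate the last 8 bits; note that right-shifting
        // a negative value is implementation-defined in C, so
        // production code would need more care
        int32_t product = (int32_t)a*(int32_t)b;

        return (fixed_t)(product >> 8);
    }

For example, with a = 0x0324 (that is, 3.140625), fixed_multiply( a, a ) returns 0x09DD, which represents 9.86328125, matching the computation above.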
Representation of complex numbers

The most usual means of representing a complex number is to store a pair of real numbers representing the real and imaginary components of the complex number.

Fortunately, Matlab allows you to work seamlessly with complex numbers. By default, the variables i and j both represent the imaginary unit, √−1, but even if you assign to these variables, you may always enter a complex number by juxtaposing the imaginary unit with the imaginary component:

    >> j = 4;
    >> 3.25 + 4.79j
    ans = 3.2500 + 4.7900i

Indeed, Matlab recommends using 1j instead of j for entering the imaginary unit (to avoid the possibility that your code may at some point fail if j is accidentally assigned a value earlier in your scripts).
While it is possible to store a complex number as a pair of real numbers representing its magnitude and argument, this is seldom done in practice except in special circumstances.
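In C, the rectangular representation is simply a pair of doubles; a minimal sketch follows (C99 also provides a built-in double _Complex type in <complex.h>):

    // A complex number stored as its real and imaginary components
    typedef struct {
        double re;
        double im;
    } complex_t;

    // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    complex_t complex_multiply( complex_t z, complex_t w ) {
        complex_t product = {
            z.re*w.re - z.im*w.im,
            z.re*w.im + z.im*w.re
        };

        return product;
    }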
Summary of the representation of numbers

In this section, we have reviewed or introduced various binary representations of integers and real numbers. Each representation has its limitations, and developers of real-time systems must be aware of those limitations. We will continue with the introduction of definitions related to real-time systems.
Summary of our introduction to numerical algorithms

This first chapter discussed the techniques that will be used in numerical algorithms (iteration, linear algebra, interpolation, Taylor series, bracketing and weighted averages), together with a brief discussion of the sources of error and the representation of numbers.