
Computer Arithmetic

Numbers in scientific computing

- Integers: ..., −2, −1, 0, 1, 2, ...

- Rational numbers: 1/3, 22/7: not often encountered

- Real numbers: 0, 1, −1.5, 2/3, √2, log 10, ...

- Complex numbers: 1 + 2i, √3 − √5 i, ...

Computers use a finite number of bits to represent numbers, so only finitely many numbers can be represented: no irrational numbers, and not even all rational numbers.

Integers

Scientific computation mostly uses real numbers. Integers are mostly used for array indexing.

16/32/64 bit: short, int, long, long long in C; sizes are not standardized, use sizeof(long) et cetera. (Also unsigned int et cetera.)

INTEGER*2/4/8 in Fortran
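As a quick check of these implementation-defined sizes, a minimal C sketch (output will vary by platform and compiler):

    #include <stdio.h>

    int main(void) {
        /* Sizes are implementation-defined; print what this platform uses. */
        printf("short:        %zu bytes\n", sizeof(short));
        printf("int:          %zu bytes\n", sizeof(int));
        printf("long:         %zu bytes\n", sizeof(long));
        printf("long long:    %zu bytes\n", sizeof(long long));
        printf("unsigned int: %zu bytes\n", sizeof(unsigned int));
        return 0;
    }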

Negative integers

Use of sign bit: typically first bit

s i_1 ... i_n

Simplest solution: for n > 0, fl(n) = (0, i_1, ..., i_31); then fl(−n) = (1, i_1, ..., i_31), i.e. flip the sign bit.

Problem: +0 and −0; also impractical in other ways.

Sign bit

bitstring           00···0 ... 01···1    10···0 ... 11···1
as unsigned int     0 ... 2^31 − 1       2^31 ... 2^32 − 1
as naive signed     0 ... 2^31 − 1       −0 ... −(2^31 − 1)

2’s complement

Better solution: if 0 ≤ n ≤ 2^31 − 1, then fl(n) = 0, i_1, ..., i_31; if 1 ≤ n ≤ 2^31, then fl(−n) = fl(2^32 − n).

bitstring              00···0 ... 01···1    10···0 ... 11···1
as unsigned int        0 ... 2^31 − 1       2^31 ... 2^32 − 1
as 2's comp. integer   0 ... 2^31 − 1       −2^31 ... −1

Subtraction in 2’s complement

Subtraction m − n is easy.

- Case m < n. Observe that −n has the bit pattern of 2^32 − n. Also, m + (2^32 − n) = 2^32 − (n − m) where 0 < n − m < 2^31 − 1, so 2^32 − (n − m) is the 2's complement bit pattern of m − n.

- Case m > n. The bit pattern for −n is 2^32 − n, so m − n as unsigned is m + 2^32 − n = 2^32 + (m − n). Here m − n > 0. The 2^32 is an overflow bit; ignore it.
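A small C illustration of this (assuming a 32-bit unsigned int; unsigned arithmetic is defined to wrap modulo 2^32, which is exactly the 2's complement behaviour described above):

    #include <stdio.h>

    int main(void) {
        unsigned int n = 5;
        unsigned int neg_n = -n;  /* wraps to 2^32 - 5 = 4294967291: bit pattern of -5 */
        unsigned int m = 12;
        printf("bit pattern of -5 as unsigned:  %u\n", neg_n);
        printf("m - n computed as m + (2^32-n): %u\n", m + neg_n);  /* 2^32 + 7 wraps to 7 */
        return 0;
    }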

Floating point numbers

Analogous to scientific notation x = 6.022 · 10^23:

x = ± Σ_{i=0}^{t−1} d_i β^{−i} · β^e

- sign bit

- β is the base of the number system

- 0 ≤ d_i ≤ β − 1 are the digits of the mantissa; with the radix point after the first digit, mantissa < β

- e ∈ [L, U] is the exponent

Some examples

                        β    t    L       U
IEEE single (32 bit)    2   24   −126     127
IEEE double (64 bit)    2   53   −1022    1023
Old Cray 64 bit         2   48   −16383   16384
IBM mainframe 32 bit   16    6   −64      63
packed decimal         10   50   −999     999

BCD is tricky: 3 decimal digits in 10 bits

(we will often use β = 10 in the examples, because it's easier to read for humans, but all practical computers use β = 2)

Normalized numbers

Require the first digit in the mantissa to be nonzero. Equivalent: mantissa part 1 ≤ x_m < β

Unique representation for each number; also: in binary this makes the first digit 1, so we don't need to store it.

With normalized numbers, the underflow threshold is 1 · β^L; 'gradual underflow' is possible, but usually not efficient.
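In C, the single precision underflow threshold 1 · 2^{−126} is available as FLT_MIN; a minimal sketch (assuming the compiler does not flush denormals to zero):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* Smallest normalized single precision number: 1.0 * 2^(-126). */
        printf("FLT_MIN (underflow threshold) = %e\n", FLT_MIN);
        /* Dividing by 2 gives a denormal ("gradual underflow"), not zero. */
        float denormal = FLT_MIN / 2.0f;
        printf("FLT_MIN / 2 (denormal)        = %e\n", denormal);
        return 0;
    }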

IEEE 754

sign   exponent       mantissa
s      e_1 ··· e_8    s_1 ... s_23
31     30 ··· 23      22 ··· 0

(e_1 ··· e_8)        numerical value
(0···0) = 0          ±0.s_1···s_23 × 2^{−126}
(0···01) = 1         ±1.s_1···s_23 × 2^{−126}
(0···010) = 2        ±1.s_1···s_23 × 2^{−125}
···
(01111111) = 127     ±1.s_1···s_23 × 2^0
(10000000) = 128     ±1.s_1···s_23 × 2^1
···
(11111110) = 254     ±1.s_1···s_23 × 2^127
(11111111) = 255     ±∞ if s_1···s_23 = 0, otherwise NaN
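A minimal C sketch of this layout (assuming float is IEEE 754 single precision), pulling the three fields out of a value:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float x = -6.25f;                 /* = -1.5625 * 2^2 */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* reinterpret the 32 bits */

        unsigned int sign     = bits >> 31;            /* bit 31 */
        unsigned int exponent = (bits >> 23) & 0xFFu;  /* bits 30..23, biased by 127 */
        unsigned int mantissa = bits & 0x7FFFFFu;      /* bits 22..0 */

        /* Expected: sign = 1, biased exponent = 129 (i.e. 2^2), mantissa = 0x480000. */
        printf("sign = %u, exponent = %u (2^%d), mantissa = 0x%06X\n",
               sign, exponent, (int)exponent - 127, mantissa);
        return 0;
    }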

Representation error

Error between a number x and its representation x̃:
absolute: x − x̃ or |x − x̃|
relative: (x − x̃)/x or |(x − x̃)/x|

Equivalent: x̃ = x ± ε  ⇔  |x − x̃| ≤ ε  ⇔  x̃ ∈ [x − ε, x + ε].

Also: x̃ = x(1 + ε) is often shorthand for |(x̃ − x)/x| ≤ ε

Machine precision

Any real number can be represented to a certain precision: x̃ = x(1 + ε) where
truncation: ε = β^{−t+1}
rounding: ε = ½ β^{−t+1}

This is called machine precision: the maximum relative error.

32-bit single precision: mp ≈ 10^{−7}
64-bit double precision: mp ≈ 10^{−16}

Maximum attainable accuracy.

Another definition of machine precision: the smallest number ε such that 1 + ε > 1.
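That last definition can be probed directly; a sketch in C (the volatile stores force rounding to storage precision, since extended intermediate precision could otherwise skew the result):

    #include <stdio.h>

    int main(void) {
        /* Find the smallest power of two eps with fl(1 + eps) > 1. */
        volatile float onef = 1.0f, sumf;
        float eps_f = 1.0f;
        do { eps_f /= 2.0f; sumf = onef + eps_f; } while (sumf > onef);

        volatile double oned = 1.0, sumd;
        double eps_d = 1.0;
        do { eps_d /= 2.0; sumd = oned + eps_d; } while (sumd > oned);

        /* The loops halve once too often, so multiply back by 2. */
        printf("single precision: %e\n", eps_f * 2.0f);  /* ~1.19e-07 */
        printf("double precision: %e\n", eps_d * 2.0);   /* ~2.22e-16 */
        return 0;
    }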

Addition

1. align exponents

2. add mantissas

3. adjust exponent to normalize

Example: 1.00 + 2.00 × 10^{−2} = 1.00 + .02 = 1.02. This is exact, but what happens with 1.00 + 2.55 × 10^{−2}?

Example: 5.00 × 10^1 + 5.04 = (5.00 + 0.504) × 10^1 → 5.50 × 10^1

Any error comes from truncating the mantissa: if x is the true sum and x̃ the computed sum, then x̃ = x(1 + ε) with |ε| < 10^{−2}
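The same effect in hardware single precision (a minimal sketch; the constant 1.0e-8 is just an example of a term smaller than machine precision relative to 1):

    #include <stdio.h>

    int main(void) {
        /* After exponent alignment, a term below machine precision
           (relative to the other operand) is truncated away entirely. */
        float big = 1.0f, small = 1.0e-8f;
        printf("1.0f + 1.0e-8f == 1.0f ? %s\n",
               (big + small == big) ? "yes" : "no");      /* yes */

        /* Double precision (about 16 digits) still keeps the small term. */
        printf("1.0 + 1.0e-8 in double = %.17g\n", 1.0 + 1.0e-8);
        return 0;
    }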

The ‘correctly rounded arithmetic’ model

Assumption (enforced by IEEE 754):

The numerical result of an operation is the rounding

of the exactly computed result.

fl(x_1 ⊙ x_2) = (x_1 ⊙ x_2)(1 + ε)

where ⊙ = +, −, ×, /

Note: this holds only for a single operation!

Guard digits

Correctly rounding is not trivial, especially for subtraction.

Example: t = 2, β = 10: 1.0 − 9.5 × 10^{−1}, exact result 0.05 = 5.0 × 10^{−2}.

- Simple approach: 1.0 − 9.5 × 10^{−1} = 1.0 − 0.9 = 0.1 = 1.0 × 10^{−1}

- Using a 'guard digit': 1.0 − 9.5 × 10^{−1} = 1.0 − 0.95 = 0.05 = 5.0 × 10^{−2}, exact.

In general 3 extra bits needed.

Error propagation under addition

Let us consider s = x_1 + x_2, and assume that x_i is represented by x̃_i = x_i(1 + ε_i)

s̃ = (x̃_1 + x̃_2)(1 + ε_3)
  = x_1(1 + ε_1)(1 + ε_3) + x_2(1 + ε_2)(1 + ε_3)
  ≈ x_1(1 + ε_1 + ε_3) + x_2(1 + ε_2 + ε_3)
  ≈ s(1 + 2ε)

⇒ errors are added

Assumptions: all ε_i of approximately equal size and small; x_i > 0

Multiplication

1. add exponents

2. multiply mantissas

3. adjust exponent

Example: .123 × .567 × 10^1 = .069741 × 10^1 → .69741 × 10^0 → .697 × 10^0.

What happens with relative errors?
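A sketch of the answer, following the same reasoning as for addition (assuming all ε_i are small and of comparable size; writing p = x_1 x_2 for the exact product and p̃ for the computed one):

p̃ = x̃_1 · x̃_2 · (1 + ε_3)
  = x_1(1 + ε_1) · x_2(1 + ε_2) · (1 + ε_3)
  ≈ x_1 x_2 (1 + ε_1 + ε_2 + ε_3)
  ≈ p(1 + 3ε)

so, to first order, relative errors are added under multiplication as well.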

Subtraction

Correct rounding only applies to a single operation.

Example: 1.24 − 1.23 = .01 → 1. × 10^{−2}: the result is exact, but has only one significant digit.

What if 1.24 = fl(1.244) and 1.23 = fl(1.225)? Correct result 1.9 × 10^{−2}; almost 100% error.

- Cancellation leads to loss of precision

- subsequent operations with this result are inaccurate

- this cannot be fixed with guard digits and the like

- ⇒ avoid subtracting numbers that are likely to be close.

Example: ax^2 + bx + c = 0 → x = (−b ± √(b^2 − 4ac)) / 2a

Suppose b > 0 and b^2 ≫ 4ac; then the '+' solution will be inaccurate.
Better: compute x_− = (−b − √(b^2 − 4ac)) / 2a and use x_+ · x_− = c/a.
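A hedged C sketch of this trick (the function name quadratic_roots and the test coefficients are made up for illustration; assumes b > 0 and real roots):

    #include <stdio.h>
    #include <math.h>

    /* Compute both roots of a*x^2 + b*x + c = 0 without cancellation:
       the '-' root directly (both numerator terms have the same sign),
       the '+' root from the product of the roots, x+ * x- = c/a. */
    void quadratic_roots(double a, double b, double c,
                         double *x_plus, double *x_minus) {
        double disc = sqrt(b * b - 4.0 * a * c);
        *x_minus = (-b - disc) / (2.0 * a);
        *x_plus  = (c / a) / (*x_minus);
    }

    int main(void) {
        double xp, xm;
        quadratic_roots(1.0, 1.0e8, 1.0, &xp, &xm);  /* roots near -1e-8 and -1e8 */
        printf("x+ = %.15e\nx- = %.15e\n", xp, xm);
        return 0;
    }

Computing the '+' root directly as (−b + √(b^2 − 4ac)) / 2a would instead lose almost all significant digits for these coefficients.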

Serious example

Evaluate Σ_{n=1}^{10000} 1/n^2 = 1.644834 in 6 digits: machine precision is 10^{−6} in single precision.

The first term is 1, so partial sums are ≥ 1, so 1/n^2 < 10^{−6} gets ignored, ⇒ the last 7000 terms (or more) are ignored, ⇒ the sum is 1.644725: 4 correct digits.

Solution: sum in reverse order; exact result in single precision.
Why? Consider the ratio of two consecutive terms:

n^2 / (n − 1)^2 = n^2 / (n^2 − 2n + 1) = 1 / (1 − 2/n + 1/n^2) ≈ 1 + 2/n

with aligned exponents:
n − 1:  .00···0 10···00
n:      .00···0 10···01 0···0
(k = log(n/2) positions)

The last digit in the smaller number is not lost if n < 2/ε
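A small single precision experiment along these lines (a sketch; the exact printed values depend on the platform's rounding):

    #include <stdio.h>

    int main(void) {
        /* Sum 1/n^2 for n = 1..10000 forward and backward in single precision.
           Forward, the partial sum is >= 1, so terms below roughly 1e-7
           relative no longer contribute. */
        float forward = 0.0f, backward = 0.0f;
        for (int n = 1; n <= 10000; n++)
            forward += 1.0f / ((float)n * (float)n);
        for (int n = 10000; n >= 1; n--)
            backward += 1.0f / ((float)n * (float)n);
        printf("forward : %.7f\n", forward);   /* loses the last digits */
        printf("backward: %.7f\n", backward);  /* close to 1.6448341   */
        return 0;
    }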

Stability of linear system solving

Problem: solve Ax = b, where b is inexact.

A(x + Δx) = b + Δb.

Since Ax = b, we get AΔx = Δb. From this,

{ Δx = A^{−1} Δb        { ‖A‖ ‖x‖ ≥ ‖b‖
{ Ax = b            ⇒   { ‖Δx‖ ≤ ‖A^{−1}‖ ‖Δb‖

⇒  ‖Δx‖/‖x‖ ≤ ‖A‖ ‖A^{−1}‖ · ‖Δb‖/‖b‖

The quantity ‖A‖ ‖A^{−1}‖ is the 'condition number'. Attainable accuracy depends on matrix properties.

Consequences of roundoff

Multiplication and addition are not associative: this causes problems for parallel computations.

Operations with the “same” outcome are not equally stable: matrix inversion is unstable, elimination is stable.
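A small demonstration of the non-associativity in C (double precision; the particular constants are chosen to make the difference obvious):

    #include <stdio.h>

    int main(void) {
        /* The grouping decides which rounding errors are committed,
           which is why a parallel reduction can change the result. */
        double a = 1.0e16, b = -1.0e16, c = 1.0;
        printf("(a + b) + c = %.1f\n", (a + b) + c);  /* 1.0 */
        printf("a + (b + c) = %.1f\n", a + (b + c));  /* 0.0: c is lost against b */
        return 0;
    }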

Complex numbers

Two real numbers: real and imaginary part.

Storage:

- Store real/imaginary parts adjacent: easy to pass the address of one number

- Store an array of real parts, then an array of imaginary parts: better for stride-1 access if only the real parts are needed. Other considerations apply.
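A toy C sketch of the two layouts (the names complex_t, aos, re, im are made up for illustration):

    #include <stdio.h>
    #define N 4

    /* Layout 1: real and imaginary parts adjacent (array of structs). */
    typedef struct { double re, im; } complex_t;

    int main(void) {
        complex_t aos[N] = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};

        /* Layout 2: separate arrays (struct of arrays): stride-1 access
           when only the real parts are needed. */
        double re[N] = {1, 3, 5, 7};
        double im[N] = {2, 4, 6, 8};

        double sum_aos = 0.0, sum_soa = 0.0;
        for (int i = 0; i < N; i++) {
            sum_aos += aos[i].re;  /* stride of 2 doubles */
            sum_soa += re[i];      /* stride of 1 double  */
        }
        printf("sum of real parts: %g (adjacent), %g (separate)\n",
               sum_aos, sum_soa);
        (void)im;  /* imaginary parts unused in this example */
        return 0;
    }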

Other arithmetic systems

Some compilers support higher precisions.

Arbitrary precision: GMPlib

Interval arithmetic