Lecture Topic: What is a Number?

Overview

In the study of numerical methods, it helps to clarify what a number is.

We’ll start with some fundamental mathematics.

Next, we’ll consider integers and operations with integers.

We’ll also consider real numbers and some computer approximations thereof.

Finally, we’ll examine how operations are performed on these approximations, and why certain errors arise.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 2 / 1

Rings

In mathematics, there are many approaches to numbers. In algebra, there is a well-developed theory of particular sets, called rings. The elements of a ring R can be added and multiplied, and these operations obey a number of rules:

(a + b) + c = a + (b + c) for all a, b, c ∈ R (“associativity of +”)

a + b = b + a for all a, b ∈ R (“commutativity of +”)

There is 0 ∈ R such that a + 0 = a for all a ∈ R (“additive identity element”)

For each a ∈ R, there is −a ∈ R such that a + (−a) = 0 (−a is the “additive inverse” of a)

(a · b) · c = a · (b · c) for all a, b, c ∈ R (“associativity of ·”)

There is 1 ∈ R such that a · 1 = a and 1 · a = a for all a ∈ R (“multiplicative identity element”)

a · (b + c) = (a · b) + (a · c) for all a, b, c ∈ R (“left distributivity”)

(b + c) · a = (b · a) + (c · a) for all a, b, c ∈ R (“right distributivity”).

Common examples of rings are integers, real numbers, and complex numbers.

One could hence see “numbers” as elements of a ring.

Unfortunately, the most common representation of numbers on computers does not follow these rules. For example, addition of floating-point numbers is not associative:

    x = 1.0 + (0.001 - 1.0)
    y = (1.0 + 0.001) - 1.0
    print(x, y, x == y)
    print("%.16f %.16f" % (x, y))
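The loss of associativity is not specific to the values above. The following is a minimal sketch, assuming Python 3 and its standard `fractions` module; the helper `grouped_sums` is a hypothetical name introduced only for this illustration. It groups the same sum in two ways, once with binary floating-point numbers and once with exact rationals, which do satisfy the ring axioms.

```python
from fractions import Fraction


def grouped_sums(a, b, c):
    """Return (a + b) + c and a + (b + c) as a pair."""
    return (a + b) + c, a + (b + c)


# Binary floating point: the two groupings are rounded differently.
float_left, float_right = grouped_sums(0.1, 0.2, 0.3)
print(float_left == float_right)

# Exact rationals: associativity of + holds, as in any ring.
frac_left, frac_right = grouped_sums(
    Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)
)
print(frac_left == frac_right)
```

Exact rational arithmetic avoids the rounding, but at the cost of numerators and denominators that can grow without bound, which is one reason it is not the default on computers.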

Positional Number Systems

From grade school, we encounter positional number systems. For example,

$422 = 4 \times 10^2 + 2 \times 10^1 + 2 \times 10^0$

The number system is called positional because a digit’s meaning depends on its position: e.g., the digit 2 above stands once for $20 = 2 \times 10^1$ and once for 2.

In general, we will represent a number by a sequence of digits $(d_n d_{n-1} \dots d_1 d_0)$, where $d_i \in S$.

S is a finite set of symbols: its size |S| = b is called the base of the system. The decimal number system uses 10 symbols: S = {0, 1, . . . , 9}.

An integer N is represented as follows:

$N = (d_n d_{n-1} \dots d_0)_{10} = \sum_{i=0}^{n} d_i b^i = d_n \times 10^n + d_{n-1} \times 10^{n-1} + \cdots + d_0 \times 10^0.$
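The positional sum can be sketched in code for an arbitrary base b. This is an illustration under stated assumptions, not part of the original slides: the function names `from_digits` and `to_digits` are hypothetical, digits are given as a list of integers with the most significant digit first, and the sum is evaluated by Horner's rule rather than by computing each power of b separately.

```python
def from_digits(digits, base):
    """Evaluate (d_n ... d_0) in the given base using Horner's rule."""
    value = 0
    for d in digits:  # most significant digit first
        if not 0 <= d < base:
            raise ValueError("digit out of range for this base")
        value = value * base + d
    return value


def to_digits(value, base):
    """Return the digits of a non-negative integer, most significant first."""
    if value < 0:
        raise ValueError("only non-negative integers are handled")
    digits = []
    while True:
        value, d = divmod(value, base)  # peel off the least significant digit
        digits.append(d)
        if value == 0:
            break
    return digits[::-1]


print(from_digits([4, 2, 2], 10))
print(to_digits(422, 10))
```

Horner's rule rewrites the sum as ((d_n · b + d_{n−1}) · b + · · ·) · b + d_0, so it needs only one multiplication and one addition per digit.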

Binary Numbers

Computers store information in two-state devices, i.e., devices that are either on or off. Such a device is said to contain 1 binary digit (bit) of information.

For this reason, computers use the binary number system to represent numbers. The binary system uses the symbols S = {0, 1} and so has b = 2.

Example (Binary Numbers)

The binary number $N = 1001101_2$ in expanded form is

$N = 1 \times 2^6 + 0 \times 2^5 + 0 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0$

$= 64 + 0 + 0 + 8 + 4 + 0 + 1$

$= 77_{10}.$
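The expansion in this example can be checked mechanically. The sketch below assumes Python 3; it first weights each bit by its power of two, as in the example, and then compares the result with Python's built-in base conversion (`int` with a base argument, and `bin`).

```python
bits = "1001101"

# Expand term by term: each bit is weighted by a power of two.
terms = [int(bit) * 2 ** power
         for power, bit in enumerate(reversed(bits))]
total = sum(terms)

print(terms[::-1])   # terms listed from the most significant bit
print(total)
print(int(bits, 2))  # built-in conversion from a base-2 string
print(bin(total))    # and back to a binary literal
```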

Computer Number Systems

Since each bit can have two states, a string of n bits can have $2 \cdot 2 \cdots 2 = 2^n$ distinct bit patterns. For example,

a byte of 8 bits can contain one of $2^8$ different bit patterns and so can represent $2^8 = 256$ different things;

in older computers (IA-32), a word of 4 bytes or 32 bits could represent $2^{32}$, or about 4 billion, different things;

in most computers since 2003 (x86-64) and many mobile phones since 2013 (ARMv8-A), a word of 8 bytes or 64 bits could represent $2^{64}$, or about 18,446,744,073 billion, different things;

in some computers, a word of 16 bytes or 128 bits could represent $2^{128}$ different things.

There are many ways to interpret a string of bits: computers do not use a plain positional system, especially beyond integers.
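To make the last point concrete, here is a minimal sketch, assuming Python 3 and its standard `struct` module. The particular byte values are chosen only for illustration. It prints the number of bit patterns for the word sizes listed above, then reads the same bytes under two different interpretations.

```python
import struct

# Number of distinct bit patterns for the word sizes mentioned above.
for n_bits in (8, 32, 64, 128):
    print(n_bits, 2 ** n_bits)

# One byte, two readings: unsigned char ("B") versus signed char ("b").
one_byte = bytes([0b11111111])
as_unsigned = struct.unpack("B", one_byte)[0]
as_signed = struct.unpack("b", one_byte)[0]
print(as_unsigned, as_signed)

# Four bytes, two readings: unsigned integer ("I") versus
# IEEE 754 single-precision float ("f"), both big-endian (">").
four_bytes = struct.pack(">I", 0x3F800000)
as_integer = struct.unpack(">I", four_bytes)[0]
as_float = struct.unpack(">f", four_bytes)[0]
print(as_integer, as_float)
```

The bits never change between the two readings; only the rule used to map them to a value does.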

Integers

We view computer memory as a set of words, each containing w bits.

To store an integer, we allow the first bit to represent the sign of the number: 0 = + and 1 = −.

The remaining w − 1 bits are used to represent the magnitude of the number.

This means we can represent $2^{w-1}$ different magnitudes along with their signs in a w-bit word.

We choose an encoding or representation that maps the $2^w$ bit patterns to a subset of the integers.

We prefer encodings that:

are one-to-one correspondences (exactly one encoding for each integer represented);

make it easy to find the encoding of −x, given the encoding of x.
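The sign-and-magnitude scheme described above can be sketched as follows, and checked against the two preferences. This is an illustration only: the function names and the `negative_zero` flag are hypothetical, bit patterns are held as strings of 0s and 1s, and w ≥ 2 is assumed.

```python
def encode_sign_magnitude(x, w, negative_zero=False):
    """Return the w-bit sign-and-magnitude pattern of x as a string."""
    magnitude = abs(x)
    if magnitude >= 2 ** (w - 1):
        raise OverflowError("magnitude does not fit in w - 1 bits")
    sign = "1" if (x < 0 or (x == 0 and negative_zero)) else "0"
    return sign + format(magnitude, "0{}b".format(w - 1))


def decode_sign_magnitude(pattern):
    """Return the integer denoted by a sign-and-magnitude bit string."""
    magnitude = int(pattern[1:], 2)
    return -magnitude if pattern[0] == "1" else magnitude


def negate(pattern):
    """Finding the encoding of -x only requires flipping the sign bit."""
    return ("0" if pattern[0] == "1" else "1") + pattern[1:]


print(encode_sign_magnitude(3, 3), encode_sign_magnitude(-3, 3))
print(negate("011"))
# Two different patterns decode to the same integer, zero.
print(decode_sign_magnitude("000"), decode_sign_magnitude("100"))
```

Negation is easy here, but the encoding is not a one-to-one correspondence, because both 000 and 100 stand for zero.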

Example encodings for w = 3

Bit pattern    000   001   010   011   100   101   110   111
Unsigned         0     1     2     3     4     5     6     7
Signed          +0    +1    +2    +3    −0    −1    −2    −3
Fixed-point      0   1/8   1/4   3/8   1/2   5/8   3/4   7/8
1’s compl.      +0    +1    +2    +3    −3    −2    −1    −0
2’s compl.       0    +1    +2    +3    −4    −3    −2    −1

In what follows, let the notation [x] stand for the encoding of x.

For example, in E1, we have [3] = 011.
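The rows of the table can be regenerated by decoding every 3-bit pattern under each encoding. The following is a sketch under stated assumptions: the decoder names are hypothetical, patterns are strings of 0s and 1s, and since Python integers have no negative zero, the entries shown as −0 in the table appear as 0 here.

```python
from fractions import Fraction

W = 3  # word length used in the table


def unsigned(bits):
    return int(bits, 2)


def sign_magnitude(bits):
    magnitude = int(bits[1:], 2)
    return -magnitude if bits[0] == "1" else magnitude


def fixed_point(bits):
    # All bits lie after the binary point.
    return Fraction(int(bits, 2), 2 ** len(bits))


def ones_complement(bits):
    value = int(bits, 2)
    # A leading 1 means the pattern is the bitwise complement of |x|.
    return value - (2 ** len(bits) - 1) if bits[0] == "1" else value


def twos_complement(bits):
    value = int(bits, 2)
    # A leading 1 means the pattern stands for value - 2**w.
    return value - 2 ** len(bits) if bits[0] == "1" else value


patterns = [format(i, "0{}b".format(W)) for i in range(2 ** W)]
decoders = [unsigned, sign_magnitude, fixed_point,
            ones_complement, twos_complement]
table = {d.__name__: [d(p) for p in patterns] for d in decoders}

for name, row in table.items():
    print("{:16s}".format(name), " ".join(str(v) for v in row))
```

Of the three signed encodings, only two's complement assigns a distinct integer to each of the eight patterns.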

Example encodings

In unsigned, we represent only non-negative integers x, without the sign.

In signed, we represent the absolute value of x by the $w - 1$ bits $d_{w-2} d_{w-3} \dots d_0$. The sign is represented by $d_{w-1}$.

If the integer x is non-negative we let $d_{w-1} = 0$; if x is non-positive we let $d_{w-1} = 1$.

The representation of minus 0 is questionable: zero then has two encodings, so the correspondence is not one-to-one.

In fixed point encoding, we again represent only non-negative numbers, but this time fractions in the range 0 to 7/8 in steps of 1/8 (regard the three bits as being after the “decimal” point, e.g., $0.101_2 = 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} = 5/8$).

Notice that in the fixed point encoding, unlike in the floating point encoding introduced in the next section, the decimal point is always in the same place.
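As a rough illustration of the fixed point encoding, the sketch below converts between fractions and bit patterns. The helper names `to_fixed` and `from_fixed` are made up for this example, and values that are not multiples of $2^{-w}$ are rejected rather than rounded.

```python
from fractions import Fraction

def to_fixed(x, w=3):
    """Encode x in [0, 1) as w bits after the binary point; reject inexact values."""
    scaled = Fraction(x) * 2 ** w
    if scaled.denominator != 1 or not 0 <= scaled < 2 ** w:
        raise ValueError(f"{x} is not a multiple of 2**-{w} in [0, 1)")
    return format(int(scaled), f"0{w}b")

def from_fixed(bits):
    """Decode a bit pattern whose bits all sit after the binary point."""
    return Fraction(int(bits, 2), 2 ** len(bits))

print(to_fixed(Fraction(5, 8)))   # 101
print(from_fixed("101"))          # 5/8
```

With w = 3 the value 1/10 cannot be encoded at all, since it is not a multiple of 1/8; `to_fixed(Fraction(1, 10))` raises `ValueError`.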

Example encodings: 1’s complement

In ones complement, or representation modulo $2^w - 1$, given an integer x such that

$$-2^{w-1} + 1 \le x \le 2^{w-1} - 1$$

we let

$$[x] = \begin{cases} x & \text{if } x > 0 \\ x + 2^w - 1 & \text{if } x < 0 \\ 0 \text{ or } 2^w - 1 & \text{if } x = 0. \end{cases}$$

The existence of two zeros (incl. minus 0 encoded by $2^w - 1$) is questionable.

Given a representation of a non-negative integer x, one obtains a representation of −x as follows: replace all zeros by ones and all ones by zeros.

This is called the ones complement representation of negative integers.
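A minimal sketch of this definition follows, with hypothetical helper names `ones_encode` and `flip_bits`. It assumes that +0 is the encoding chosen for x = 0, and checks that flipping every bit of [x] gives [−x].

```python
def ones_encode(x, w=3):
    """Ones complement code of x as an unsigned integer (+0 is used for x == 0)."""
    if not -(2 ** (w - 1)) + 1 <= x <= 2 ** (w - 1) - 1:
        raise OverflowError(f"{x} does not fit in {w}-bit ones complement")
    return x if x >= 0 else x + 2 ** w - 1

def flip_bits(code, w=3):
    """Replace all zeros by ones and all ones by zeros in a w-bit code."""
    return code ^ (2 ** w - 1)

for x in range(1, 4):
    code = ones_encode(x)
    print(f"[{x}] = {code:03b}, flipped = {flip_bits(code):03b}, "
          f"[{-x}] = {ones_encode(-x):03b}")

# Flipping +0 (000) yields the second zero, 111.
print(format(flip_bits(ones_encode(0)), "03b"))
```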

Example encodings

In twos complement, or representation modulo $2^w$, given an integer x such that

$$-2^{w-1} \le x \le 2^{w-1} - 1$$

we let

$$[x] = \begin{cases} x & \text{if } x \ge 0 \\ x + 2^w & \text{if } x < 0. \end{cases}$$

Given a representation of a non-negative integer x, one obtains a representation of −x as follows: replace all zeros by ones and all ones by zeros (i.e., apply the logic operation not to each bit), then add 1.

That is, do a ones complement and add 1, carrying to the left if necessary.
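The definition translates almost directly into code. The sketch below uses illustrative names `twos_encode` and `twos_decode`, and treats a code as an unsigned integer in the range 0 to $2^w - 1$.

```python
def twos_encode(x, w=3):
    """Twos complement code of x as an unsigned integer in [0, 2**w)."""
    if not -(2 ** (w - 1)) <= x <= 2 ** (w - 1) - 1:
        raise OverflowError(f"{x} does not fit in {w} bits")
    return x if x >= 0 else x + 2 ** w

def twos_decode(code, w=3):
    """Inverse of twos_encode: codes with the top bit set are negative."""
    return code - 2 ** w if code >= 2 ** (w - 1) else code

for x in range(-4, 4):
    code = twos_encode(x)
    print(f"{x:+d} -> {code:03b} -> {twos_decode(code):+d}")
```

Because the code is just x reduced modulo $2^w$, `twos_encode(x, w)` agrees with Python's `x % 2**w` for every x in range.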

Example encodings

The name twos complement for this representation of negative integers is not perfect: only the rightmost bits, up to and including the least significant 1, are complemented with respect to 2 (i.e., left unchanged), and all the other bits are complemented with respect to 1.

Note that there is one extra negative number, −4 in this case (it would be −128 if w = 8, or $-2^{31}$ if w = 32), for which there is no corresponding positive number.
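Both remarks can be checked numerically. In the sketch below (helper names are illustrative), `changed_bits` marks the positions in which a code and its negation differ: everything strictly to the left of the least significant 1 changes, and the rest stays put.

```python
def negate(code, w):
    """Twos complement negation on a w-bit code: flip every bit, then add 1."""
    mask = 2 ** w - 1
    return ((code ^ mask) + 1) & mask

def changed_bits(code, w):
    """Mask of the bit positions in which code and its negation differ."""
    return code ^ negate(code, w)

w = 8
twelve = 0b00001100
print(format(negate(twelve, w), "08b"))        # 11110100
print(format(changed_bits(twelve, w), "08b"))  # 11111000

# The extra negative number is its own negation: 100 -> 011 -> 100 for w = 3.
print(format(negate(0b100, 3), "03b"))         # 100
```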

Twos complement encoding: 32 bit example

Here are the numbers −4, −3, . . . , 1, 2, 3 with their representations on a 32-bit PC, encoded as a (signed) int.

This is a 32-bit twos complement encoding.

-4 : 11111111111111111111111111111100
-3 : 11111111111111111111111111111101
-2 : 11111111111111111111111111111110
-1 : 11111111111111111111111111111111
 0 : 00000000000000000000000000000000
 1 : 00000000000000000000000000000001
 2 : 00000000000000000000000000000010
 3 : 00000000000000000000000000000011
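A listing of this kind can be generated in Python. Python integers are not 32-bit words, so this sketch masks with `0xFFFFFFFF` to obtain the 32-bit twos complement pattern; the function name `bits32` is an illustrative choice.

```python
def bits32(x):
    """32-bit twos complement pattern of x, assuming -2**31 <= x < 2**31."""
    return format(x & 0xFFFFFFFF, "032b")

for x in range(-4, 4):
    print(f"{x:2d} : {bits32(x)}")
```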

Twos complement encoding: representing −12

For example, to get the representation of −12, start with that of 12:

    000···0001100   (Start with 12)
→   111···1110011   (Bitwise not)
→   111···1110100   (Add 1, carrying bit to left, giving −12)

It should be clear from this example of 5 encodings that we may store any type of number in a w-bit word. All we need is the appropriate encoding.

The Range and Distribution of Integers

Most computers use encoding E5 above (twos complement) with the word-length w = 32, by default.

The $2^{32}$ integers are in the range

$$-2^{31} \le i \le 2^{31} - 1, \quad \text{or, approximately,} \quad -2.1 \times 10^{9} \le i \le +2.1 \times 10^{9}.$$

These integers are distributed uniformly across this range.

If the result of an arithmetic operation is an integer outside this range, then an integer overflow occurs.
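What happens on overflow depends on the language and the machine. The sketch below simulates one common behaviour, wrap-around modulo $2^{32}$. This is an assumption made for illustration: Python's own integers never overflow, and in C signed overflow is undefined behaviour. The name `wrap_int32` is hypothetical.

```python
INT_MIN = -2 ** 31
INT_MAX = 2 ** 31 - 1

def wrap_int32(n):
    """Reduce n modulo 2**32 into the range [-2**31, 2**31 - 1]."""
    return (n + 2 ** 31) % 2 ** 32 - 2 ** 31

print(wrap_int32(INT_MAX + 1))   # -2147483648
print(wrap_int32(INT_MIN - 1))   # 2147483647
```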

Real Numbers

Beyond integers, one often needs to approximate numbers of potentially infinite binary and decimal representation (e.g., $\frac{1}{3} = 0.333\cdots$, $\sqrt{2}$) on a machine with a finite amount of memory.

This often requires very non-trivial algorithms or some loss of precision.

Rational Numbers

For rational numbers, which can be expressed as a fraction of two integers, such as $\frac{1}{3}$, it is easy to store the two integers.

Quite surprisingly, few computer programs actually do this, because arithmetic operations on such a representation do not come with hardware support.

Let us illustrate this in Python:

    from fractions import Fraction
    3 * Fraction(1, 3)  # evaluates to Fraction(1, 1), i.e. exactly 1
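One plausible reason why the lack of hardware support matters is that numerators and denominators tend to grow as operations accumulate, so they soon exceed a machine word. The sketch below, with an illustrative function `harmonic`, prints how many bits the exact partial sums of 1 + 1/2 + ... + 1/n need.

```python
from fractions import Fraction

def harmonic(n):
    """Exact value of 1 + 1/2 + ... + 1/n as a Fraction."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))

for n in (5, 10, 20, 40):
    h = harmonic(n)
    # Bit lengths of numerator and denominator grow with n.
    print(n, h.numerator.bit_length(), h.denominator.bit_length())
```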

Fractions in Positional Number Systems

Alternatively, one can represent some rational numbers by extending positional number systems.

There, we represent numbers with fractional parts by allowing the index i of the digits $d_i$ to be negative.

Thus N with a fractional part is represented as follows:

$$\begin{aligned} N &= (d_n d_{n-1} \dots d_0 . d_{-1} \dots d_{-m})_{10} = \sum_{i=-m}^{n} d_i b^i \\ &= d_n \times 10^{n} + \cdots + d_0 \times 10^{0} + d_{-1} \times 10^{-1} + \cdots + d_{-m} \times 10^{-m}, \end{aligned}$$

where the base is $b = 10$.
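The sum can be evaluated exactly with rational arithmetic. In this sketch the function `positional_value` and its argument layout (digits before and after the point as two lists) are assumptions made for the example; the `base` parameter allows bases other than 10.

```python
from fractions import Fraction

def positional_value(int_digits, frac_digits, base=10):
    """Sum d_i * base**i for i = n..0 (int_digits) and i = -1..-m (frac_digits)."""
    n = len(int_digits) - 1
    total = Fraction(0)
    for k, d in enumerate(int_digits):            # exponents n, n-1, ..., 0
        total += d * Fraction(base) ** (n - k)
    for k, d in enumerate(frac_digits, start=1):  # exponents -1, -2, ..., -m
        total += d * Fraction(base) ** (-k)
    return total

print(positional_value([4, 2], [7, 5]))                  # 171/4, i.e. 42.75
print(positional_value([1, 0, 0, 1], [1, 0, 1], base=2)) # 77/8, i.e. 9.625
```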

Binary Numbers

Example (Binary Numbers II)

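The fractional binary number $N = 1001.101_2$ in expanded form is

$$\begin{aligned}
N &= 1 \times 2^3 + 0 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} \\
&= 8 + 0 + 0 + 1 + 0.5 + 0 + 0.125 \\
&= 9.625_{10}.
\end{aligned}$$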


Binary Numbers

Example (Binary Numbers III)

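The fractional decimal number $N = 1.1_{10}$ has a non-terminating binary expansion:

$$\begin{aligned}
N &= 1 \times 2^0 + 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + 0 \cdot 2^{-3} + 2^{-4} + 2^{-5} + 0 \cdot 2^{-6} + 0 \cdot 2^{-7} + 2^{-8} + \cdots \\
&\approx 1.00011001100110011001100110011001100110011001100110011001100110_2 \\
&\approx 1.0999999999999999999132638262011596452794037759304046630859375_{10}.
\end{aligned}$$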
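The binary digits in this example can be generated by repeated doubling: multiply the fractional part by 2, take the integer part as the next digit, and continue with the remainder. The sketch below is an illustration only; the helper name `binary_digits` is hypothetical, and `Fraction` is assumed so that the input 1/10 is held exactly rather than as a float.

```python
from decimal import Decimal
from fractions import Fraction


def binary_digits(x, count):
    """Return the first `count` binary digits after the point of 0 <= x < 1."""
    digits = []
    for _ in range(count):
        x *= 2
        bit = int(x)      # the integer part is the next binary digit
        digits.append(bit)
        x -= bit          # keep only the fractional remainder
    return digits


fractional_part = Fraction(11, 10) - 1   # the ".1" of 1.1, held exactly
print("".join(map(str, binary_digits(fractional_part, 12))))  # 000110011001

# Decimal(float) converts exactly, so this prints the value actually stored for 1.1.
print(Decimal(1.1))
```

The remainder returns to 1/5 after every four steps, so the block 0011 repeats forever and no finite binary string equals 1.1.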


Two Questions

print(1.1 + 2.2, (1.1 + 2.2) == 3.3)
print("%.16f %.16f %.16f" % (1.1, 2.2, 1.1 + 2.2))

Exercise

What positional number system would it take to encode $\frac{1}{3}$ precisely?


Irrational Numbers

For irrational numbers, such as $\sqrt{2}$, the situation is very different.

No matter which positional number system we pick, its digits form a finite alphabet.

Even if you allow arbitrarily long finite strings over a finite alphabet, you still end up with only countably many numbers.

There are, however, uncountably many irrational numbers.

Hence, there is no hope of encoding all real numbers exactly, at the same time, on a digital computer.


Irrational Numbers

One option is to use symbolic representations thereof, but these may slow the computation down considerably:

from sympy import sqrt, symbols, solve

sqrt(8)               # stays exact: 2*sqrt(2)
x = symbols('x')
solve(x**2 - 2, x)    # exact roots: [-sqrt(2), sqrt(2)]


Approximations

Alternatively, one has to be content with an approximation.

In theory, one can use an approximation with arbitrary precision. This can be implemented in Python using the decimal module.

In practice, there is a widespread encoding, the floating-point number system, with well-defined error characteristics.

This encoding has been standardised by the Institute of Electrical and Electronics Engineers (IEEE) as IEEE 754 and is built into most modern computers.
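As an illustration of the arbitrary-precision route, the sketch below approximates $\sqrt{2}$ to a chosen number of significant digits with Python's decimal module. The helper name `sqrt2` is hypothetical; `localcontext` is used so that the precision change does not leak into other code.

```python
from decimal import Decimal, localcontext


def sqrt2(digits):
    """Approximate the square root of 2 to `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits          # precision applies only inside this block
        return Decimal(2).sqrt()


print(sqrt2(20))
print(sqrt2(60))
```

Each result is still an approximation: raising the precision moves the error further out but never removes it.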


Floating Point Numbers

A floating point number system tries to represent (part of) the real numbers $\mathbb{R}$ in a way that a computer can handle, and which is useful over as wide a range of numbers as possible.

All floating point number systems represent a real number $x$ as

$$x = \pm .m \times b^e = \pm\left(\frac{m_1}{b^1} + \frac{m_2}{b^2} + \cdots + \frac{m_t}{b^t}\right) \times b^e = \pm .(m_1 m_2 \ldots m_t)_b \times b^e,$$

where $m$ is called the mantissa, $e$ is called the exponent, $t$ is the precision or number of digits in the mantissa, and $b$ is the base.

For example, the decimal number 321.456 has the floating point representation $+.321456 \times 10^{+3}$.
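The decomposition into sign, mantissa digits and exponent can be sketched in a few lines. This is an illustrative helper, not the authors' code: the name `to_floating_point` is hypothetical, the input is converted to an exact `Fraction`, and digits beyond the first `t` are simply chopped.

```python
from fractions import Fraction


def to_floating_point(x, b=10, t=6):
    """Return (sign, [m_1, ..., m_t], e) such that x is about sign .m_1...m_t * b**e."""
    x = Fraction(x)
    sign = "-" if x < 0 else "+"
    x = abs(x)
    e = 0
    if x == 0:
        return sign, [0] * t, e
    # Shift the point until the value lies in [1/b, 1), so that m_1 != 0.
    while x >= 1:
        x /= b
        e += 1
    while x < Fraction(1, b):
        x *= b
        e -= 1
    # Peel off t mantissa digits; whatever remains afterwards is chopped.
    digits = []
    for _ in range(t):
        x *= b
        d = int(x)
        digits.append(d)
        x -= d
    return sign, digits, e


print(to_floating_point("321.456"))  # ('+', [3, 2, 1, 4, 5, 6], 3)
```

Passing the number as a string keeps the input exact; passing a Python float would decompose the binary approximation of that float instead.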


Normalised Floating Point Numbers

Each digit $m_i$ of the mantissa is in the range $0 \le m_i \le b - 1$.

If $m_1 \neq 0$ then the number is called normalised; otherwise, it is called subnormal.

The exponent must lie in the range $e_{\min} \le e \le e_{\max}$.

Both the mantissa and the exponent are represented as integers.

The number 0.0 is represented as

$$0.0 = .(00 \ldots 0)_b \times b^0.$$

This is a subnormal number because $m_1 = 0$.
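For base $b = 2$, Python's standard `math.frexp` presents a float in this $.m \times 2^e$ form: it returns a mantissa in $[0.5, 1)$, whose first binary digit is therefore 1, together with the exponent. The sketch below only illustrates the notation; it assumes nothing about how the hardware stores the number.

```python
import math

for x in (321.456, 0.15625, 1.0, 0.0):
    m, e = math.frexp(x)   # x == m * 2**e, with 0.5 <= m < 1 unless x == 0
    print(x, m, e)
```

For nonzero input the leading mantissa digit is always 1, while zero comes back as mantissa 0.0 with exponent 0, matching the representation of 0.0 given above.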


Storage of Floating Point Numbers

A floating point number has a sign, a mantissa, and an exponent (whose sign is not explicitly written).

Thus, the number $x = (\pm, m, e)$ has three parts, and these are stored in a $w$-bit word. The sign occupies 1 bit, the mantissa $t$ bits, and the exponent $w - t - 1$ bits.
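The three fields can be inspected directly. The sketch below assumes that Python floats are IEEE 754 double-precision values, i.e. $w = 64$ and $t = 52$, which holds on mainstream platforms but is an assumption here; the name `split_fields` is hypothetical. Note that IEEE 754 stores normalised mantissas in the form $1.m$ with an implicit leading bit, rather than the $.m$ convention used in these slides.

```python
import struct

W = 64   # word length in bits (assumed: IEEE 754 double precision)
T = 52   # stored mantissa bits


def split_fields(x):
    """Split a float into its (sign, stored exponent, stored mantissa) bit fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))   # reinterpret the 8 bytes
    sign = bits >> (W - 1)                                # top bit
    exponent = (bits >> T) & ((1 << (W - T - 1)) - 1)     # next w - t - 1 bits
    mantissa = bits & ((1 << T) - 1)                      # lowest t bits
    return sign, exponent, mantissa


print(split_fields(1.0))    # (0, 1023, 0)
print(split_fields(-2.0))   # (1, 1024, 0)
```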


Biasing the exponent

So that the exponent field can represent both positive and negative exponents, we add a bias to the actual exponent, to ensure that the stored exponent is always non-negative.

The actual (unbiased) exponent is needed for input and output only.

We allow for half the possible exponents to be positive, half to be non-positive.

Thus for an exponent of $w - t - 1$ bits, the bias is the largest integer representable with $w - t - 2$ bits, that is, $2^{w-t-2} - 1$.
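The bias formula is easy to check numerically. In the sketch below the function names are hypothetical, and the two parameter pairs are assumptions chosen to match the field widths of the common 32-bit and 64-bit IEEE 754 formats.

```python
def exponent_bias(w, t):
    """Bias for a w-bit word with a 1-bit sign and a t-bit mantissa."""
    return 2 ** (w - t - 2) - 1


def stored_exponent(e, w, t):
    """Value written into the exponent field for the actual exponent e."""
    return e + exponent_bias(w, t)


print(exponent_bias(32, 23))        # 127
print(exponent_bias(64, 52))        # 1023
print(stored_exponent(-3, 64, 52))  # 1020
```

Subtracting the bias from the stored field recovers the actual exponent, which is the only step needed on input and output.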


Representable Numbers

A number that can be represented exactly in a floating point number system is called representable.

Because there is a finite number of these representable numbers, there are infinitely many numbers that cannot be represented exactly.

For example, the number $x = 221.4922 = 0.2214922 \times 10^3$ cannot be represented in the base-10 system with 4-digit mantissa and $e_{\min} = -8$, $e_{\max} = 8$, as $x$ has a 7-digit mantissa. To store this number the mantissa must be reduced to 4 digits.

The simplest way to do this is to chop off the last 3 digits and store the number $x = 0.2214 \times 10^3$, which is representable.

The (more commonly used) alternative to chopping is to round the number up or down to the nearest representable number, thus storing $x$ as $x = 0.2215 \times 10^3$.
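Chopping and rounding to a 4-digit decimal mantissa can be reproduced with Python's decimal module, whose contexts carry both a precision and a rounding mode. This is a sketch under the assumption that `ROUND_DOWN` stands in for chopping and `ROUND_HALF_EVEN` for rounding to nearest; the exponent limits of the example system are not modelled.

```python
from decimal import Context, ROUND_DOWN, ROUND_HALF_EVEN

chop = Context(prec=4, rounding=ROUND_DOWN)          # discard the extra digits
nearest = Context(prec=4, rounding=ROUND_HALF_EVEN)  # round to the nearest value

x = "221.4922"
print(chop.create_decimal(x))     # 221.4
print(nearest.create_decimal(x))  # 221.5
```

The two results differ by one unit in the last retained digit, which is the largest gap chopping and rounding can produce for the same input.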


The Range of FP Numbers

The magnitudes of the floating point numbers have the range

$$.\min(m) \times b^{e_{\min}} \le |x| \le .\max(m) \times b^{e_{\max}}.$$

Now, if $x$ is normalised, then

$$.\min(m) = .(10 \ldots 0)_b = b^{-1}$$

and

$$.\max(m) = .((b-1)(b-1) \ldots (b-1))_b = 1 - b^{-t}.$$

Hence the range of the floating point numbers is

$$b^{e_{\min}-1} \le |x| \le (1 - b^{-t})\, b^{e_{\max}},$$

or approximately (since $1 - b^{-t} \approx 1$)

$$b^{e_{\min}-1} < |x| < b^{e_{\max}}.$$
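As a hedged check of these bounds, the sketch below evaluates them for the floats of the running Python interpreter. It assumes that these are IEEE doubles and that sys.float_info follows the same normalisation as the slides (mantissa below 1), in which case b = 2, t = 53, e_min = -1021 and e_max = 1024. Exact rational arithmetic is used so that the comparison itself introduces no rounding.

```python
import sys
from fractions import Fraction

info = sys.float_info
b, t = info.radix, info.mant_dig            # base and number of mantissa digits
e_min, e_max = info.min_exp, info.max_exp   # exponent limits in the slides' convention

smallest = Fraction(b) ** (e_min - 1)                       # b^(e_min - 1)
largest = (1 - Fraction(1, b ** t)) * Fraction(b) ** e_max  # (1 - b^-t) * b^e_max

# Compare with the smallest normalised and largest finite float, exactly.
print(Fraction(info.min) == smallest)
print(Fraction(info.max) == largest)
```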

Range of FP Numbers: Underflow and Overflow

Of course, it is possible for the true result of an operation to be so large or so small in magnitude that it falls outside the available floating-point range.

underflow: magnitude too small to be represented

overflow: magnitude too large to be represented

Underflow/overflow occurs, roughly speaking, when the result of an arithmetic operation is so small/large that it cannot be stored in its intended destination format without suffering a rounding error that is larger than usual.
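A small illustration of both effects, assuming that Python floats are IEEE doubles with the default rounding mode. For multiplication, CPython returns the special values silently; some library functions (for example math.exp) raise OverflowError instead, so this sketch should not be read as describing every operation.

```python
import math
import sys

largest = sys.float_info.max    # largest finite double
smallest = sys.float_info.min   # smallest positive normalised double

overflowed = largest * 2.0          # true result lies above the representable range
underflowed = smallest * smallest   # true result lies far below the representable range

print(overflowed, math.isinf(overflowed))
print(underflowed)
```

In both cases the stored value differs from the true result by far more than an ordinary rounding error.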

The Distribution of FP Numbers

The distribution of the floating point numbers in this range is not uniform.

To see this, consider a simple base-10 floating point system with a 3-digit significand, $e_{\min} = 0$ and $e_{\max} = 4$.

This has a range $10^{-1} < |x| < 10^4$, i.e., $0.1 < |x| < 10000$.

For a given exponent $e$, the mantissa represents $10^t - 10^{t-1} = 900$ normalised numbers uniformly distributed between $.(10 \ldots 0)_b = b^{-1} = .1$ and $.((b-1)(b-1) \ldots (b-1))_b = 1 - b^{-t} = .999$.

Hence, for $e = 0$ we have the 900 numbers uniformly distributed from $.100 \times 10^0$ to $.999 \times 10^0$. The spacing between these numbers is 1/1000.

For $e = 1$ we have the 900 numbers uniformly distributed from $.100 \times 10^1$ to $.999 \times 10^1$, i.e., from 1 to just under 10. The spacing between these numbers is 1/100.

For $e = 2$ we have the 900 numbers uniformly distributed from $.100 \times 10^2$ to $.999 \times 10^2$, i.e., from 10 to just under 100. The spacing between these numbers is 1/10.
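The counts and spacings above can be checked by enumerating the toy system directly. The sketch below is illustrative only; the function name normalised_numbers is hypothetical, and exact fractions are used so that the toy decimal system is not distorted by binary floating point.

```python
from fractions import Fraction

def normalised_numbers(b, t, e):
    """Positive normalised numbers .d1 d2 ... dt x b^e (d1 != 0) as exact fractions."""
    scale = Fraction(b) ** (e - t)
    # The integer m runs over all t-digit mantissas with a nonzero leading digit.
    return [m * scale for m in range(b ** (t - 1), b ** t)]

for e in range(3):
    xs = normalised_numbers(10, 3, e)
    gaps = {hi - lo for lo, hi in zip(xs, xs[1:])}  # set of distinct spacings
    print(f"e={e}: {len(xs)} numbers from {float(xs[0])} to {float(xs[-1])}, spacing {gaps}")
```

For each exponent the set of gaps contains a single value, which confirms that the numbers are uniformly spaced within one exponent but not across exponents.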

FP Numbers are not uniformly spaced

We can see that, in general, when the exponent is small the spacing between floating point numbers is small, and when it is large the spacing is large.

[Figure: floating point numbers marked on the real line, with ticks at 0.1, 1, 10 and 100.]

Machine Epsilon

The spacing between any pair of adjacent normalised numbers can be defined in terms of the spacing about 1.0.

This spacing is called machine epsilon, $\varepsilon_m$, defined as

1. $\varepsilon_m$ = distance between 1.0 and the next higher floating point number, or

2. $\varepsilon_m$ = the smallest number such that $1.0 + \varepsilon_m$, when chopped, is distinct from 1.0.

Using the first definition, we have

$$\varepsilon_m = .(00 \ldots 1)_b \times b^1 = b^{-t} \times b^1 = b^{1-t}.$$
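Machine epsilon can also be found experimentally. The loop below is a common sketch rather than a definitive procedure: it assumes IEEE double arithmetic with round-to-nearest (not chopping), under which 1.0 plus half of epsilon rounds back to 1.0, so the value found agrees with the first definition.

```python
import sys

eps = 1.0
# Halve eps for as long as adding half of it to 1.0 still changes the result.
while 1.0 + eps / 2.0 != 1.0:
    eps /= 2.0

b, t = sys.float_info.radix, sys.float_info.mant_dig
print(eps == float(b) ** (1 - t))      # agrees with the formula b^(1-t)
print(eps == sys.float_info.epsilon)   # agrees with the value reported by Python
```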

Spacing about a floating point number

Machine epsilon can be used in the following lemma of [Higham, 2002, p. 41]:

Lemma (Floating Point Spacing)

The spacing between a normalised floating point number $x$ and an adjacent normalised number is at least $b^{-1}\varepsilon_m|x|$ and at most $\varepsilon_m|x|$, unless $x$ or its neighbour is 0.

This lemma clearly shows that the spacing between adjacent floating point numbers varies with the magnitude of $x$.

Thus the minimum spacing is $b^{e_{\min}-t}$ and the maximum spacing is $b^{e_{\max}-t}$.
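The bounds of the lemma can be spot-checked on a few doubles. This sketch assumes Python 3.9 or later (for math.nextafter) and IEEE doubles; the sample values are arbitrary, and the helper gap_above is a hypothetical name. It checks only the neighbour above each positive sample.

```python
import math
import sys

eps = sys.float_info.epsilon   # machine epsilon, b^(1-t)
b = sys.float_info.radix       # base of the floating point system

def gap_above(x):
    """Distance from x to the next larger float (hypothetical helper)."""
    return math.nextafter(x, math.inf) - x

samples = [0.1, 1.0, 1.5, 3.0, 123456.789, 1e10]
# Lemma: eps * |x| / b <= spacing <= eps * |x| for normalised x.
bounds_hold = all(eps * x / b <= gap_above(x) <= eps * x for x in samples)

print(bounds_hold)
print([gap_above(x) for x in samples])   # the spacing grows with the magnitude of x
```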

IEEE Arithmetic

The Institute of Electrical and Electronics Engineers (IEEE) standard 754 was published in 1985.

It specifies how hardware and software combine to perform well-defined operations on standard floating point numbers.

Many processors have built-in floating point operations that include $+$, $-$, $\times$, $/$ and $\sqrt{\cdot}$, along with comparison and conversion operations, on one, two, or three “formats” of numbers.
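One way to see what "well-defined" means in practice: under IEEE 754 with the default rounding mode, each basic operation returns the exact mathematical result rounded to the nearest representable number. The sketch below checks this for one pair of inputs by comparing against exact rational arithmetic. The helper correctly_rounded is a hypothetical name, and the check assumes that Python floats are IEEE doubles.

```python
import operator
from fractions import Fraction

def correctly_rounded(op, x, y):
    """True if op(x, y) on floats equals the exact result rounded to a float."""
    exact = op(Fraction(x), Fraction(y))   # exact rational arithmetic on the stored values
    return op(x, y) == float(exact)        # float() rounds the exact result to nearest

ops = (operator.add, operator.sub, operator.mul, operator.truediv)
checks = [correctly_rounded(op, 0.1, 0.3) for op in ops]

print(checks)
```

Note that the inputs 0.1 and 0.3 are themselves already rounded when stored; the check concerns only the rounding performed by the operation.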

IEEE Formats

Double, 8 bytes
- 53 = 52 + 1 significant bits in the mantissa, 11-bit exponent
- Default in Python; C/C++/Java double; Fortran REAL*8

Single, 4 bytes
- 24 = 23 + 1 significant bits in the mantissa, 8-bit exponent
- e.g., Python numpy.float32, C/C++/Java float, Fortran REAL*4

Quad, 16 bytes
- 113 = 112 + 1 significant bits in the mantissa, 15-bit exponent
- The implementation varies. Python may have numpy.float128, and C/C++ may have long double or not, depending on the platform. Where these names exist, they do not necessarily denote the IEEE quad format; on x86, for example, long double is usually an 80-bit extended format.
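The sizes quoted for the double and single formats can be checked from the Python standard library alone. This is a minimal sketch, assuming the platform's C double and float are the IEEE 8-byte and 4-byte formats; the variable names are illustrative.

```python
import struct
import sys

# Storage size, in bytes, of the C types behind each format.
double_bytes = struct.calcsize("d")
single_bytes = struct.calcsize("f")

# Number of significant bits of a Python float, counting the hidden bit.
double_mantissa_bits = sys.float_info.mant_dig

# Exponent width = total bits - 1 sign bit - stored fraction bits.
double_exponent_bits = 8 * double_bytes - 1 - (double_mantissa_bits - 1)

print(double_bytes, single_bytes, double_mantissa_bits, double_exponent_bits)
```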


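Let us illustrate this in Python:

    import numpy

    x = 1.0                    # Python float: IEEE double
    y = numpy.float32(1.0)     # IEEE single
    z = numpy.longdouble(1.0)  # widest available format; numpy.float128 on some platforms
    print(x, y, z)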

The focus will be on 8-byte floating-point numbers.


Precisions and ranges for IEEE formats

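| Type | Unit roundoff | Smallest (normalised, positive) | Largest |
|---|---|---|---|
| Single | $u_r = 2^{-24} \approx 5.96 \times 10^{-8}$ | $2^{-126} \approx 1.18 \times 10^{-38}$ | $(2 - 2^{-23}) \times 2^{127} \approx 3.40 \times 10^{38}$ |
| Double | $u_r = 2^{-53} \approx 1.11 \times 10^{-16}$ | $2^{-1022} \approx 2.23 \times 10^{-308}$ | $(2 - 2^{-52}) \times 2^{1023} \approx 1.80 \times 10^{308}$ |
| Quad | $u_r = 2^{-113} \approx 9.63 \times 10^{-35}$ | $2^{-16382} \approx 3.36 \times 10^{-4932}$ | $2^{16384} - 2^{16271} \approx 1.19 \times 10^{4932}$ |

Since the sign of a floating-point number is given by the leading bit, the range for negative numbers is just the negation of the above values.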
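The double precision row can be checked against the constants Python exposes in sys.float_info. This is a sketch that assumes Python's float is an IEEE double. The single precision values are computed from the formulas using double arithmetic, and the variable names are illustrative.

```python
import sys

info = sys.float_info

# info.epsilon is the gap between 1.0 and the next larger double (2**-52);
# the unit roundoff is half of that gap.
unit_roundoff = info.epsilon / 2
smallest_normal = info.min
largest = info.max

print(f"{unit_roundoff:.3e} {smallest_normal:.3e} {largest:.3e}")

# Single precision values from the table; all are exactly representable
# as doubles, so double arithmetic evaluates the formulas without error.
single_unit_roundoff = 2.0**-24
single_smallest = 2.0**-126
single_largest = (2 - 2.0**-23) * 2.0**127

print(f"{single_unit_roundoff:.3e} {single_smallest:.3e} {single_largest:.3e}")
```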


Ranges IEEE formats cannot represent

Although IEEE single precision floating-point numbers are becoming less common (outside GPGPUs), let’s use them to demonstrate the limits of floating-point numbers. Normalised single precision numbers cannot represent:

1 Negative numbers less than $-(2 - 2^{-23}) \times 2^{127}$ (negative overflow)

2 Negative numbers greater than $-2^{-126}$ (negative underflow)

3 Zero (treated as a special case, with all bits of both exponent and mantissa set to 0; as the sign bit is separate, you can have both +0 and −0)

4 Positive numbers less than $2^{-126}$ (positive underflow)

5 Positive numbers greater than $(2 - 2^{-23}) \times 2^{127}$ (positive overflow)
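These limits can be observed without NumPy by round-tripping a Python float through the 4-byte format of the struct module. In this sketch, to_single is a hypothetical helper, not a library function. Mapping overflow to an infinity is an assumption about how to present the result, since CPython may raise OverflowError when packing instead. IEEE 754 also defines subnormal numbers below $2^{-126}$, stored with reduced precision, which the list above does not cover.

```python
import math
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single precision value."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # Packing may refuse finite values beyond the single precision range;
        # report a signed infinity, as IEEE arithmetic would on overflow.
        return math.copysign(math.inf, x)

largest_single = (2 - 2.0**-23) * 2.0**127
smallest_normal_single = 2.0**-126

# Values at the edges of the normalised range survive the round trip.
print(to_single(largest_single) == largest_single)
print(to_single(smallest_normal_single) == smallest_normal_single)

# Overflow and underflow, for both signs.
print(to_single(1e39), to_single(-1e39))
print(to_single(1e-50), to_single(-1e-50))

# +0 and -0 compare equal but carry different sign bits.
print(0.0 == -0.0, math.copysign(1.0, -0.0))
```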


Biases in IEEE formats

Let us explain the reasoning behind these ranges: the exponent is stored as an unsigned integer, from which a fixed bias is subtracted. For IEEE single precision with an 8-bit exponent, the bias is $127 = 2^{8-1} - 1$.

For IEEE double precision with an 11-bit exponent, the bias is $1023 = 2^{11-1} - 1$.

For example, in single precision, a stored value of 100 means a true exponent of $100 - 127 = -27$; a stored value of 150 means a true exponent of $150 - 127 = 23$.
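The two stored values in the example can be read directly from the bit pattern of a single precision number. A minimal sketch follows; stored_exponent and BIAS are illustrative names, and the field layout assumed is 1 sign bit, 8 exponent bits, 23 fraction bits.

```python
import struct

BIAS = 2**(8 - 1) - 1  # 127 for the 8-bit exponent of single precision

def stored_exponent(x):
    """Return the 8-bit exponent field of x when stored as an IEEE single."""
    # Pack as a big-endian single, then reinterpret the 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # Drop the 23 fraction bits and mask off the sign bit.
    return (bits >> 23) & 0xFF

for x in (2.0**-27, 1.0, 2.0**23):
    e = stored_exponent(x)
    print(x, "stored:", e, "true exponent:", e - BIAS)
```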


Hidden bit normalisation

The mantissa is composed of an implicit leading bit and the fraction bits.

As the only non-zero digit in binary is 1, it follows that a normalised binary number must look like $1.d_1 d_2 \ldots d_n$.

Thus we do not need to store the leading 1; the 23 bits (or 52 in double precision) of the mantissa can all be used to store the part of the number after the “decimal” (binary) point.

We call this hidden bit normalisation: thus single and double precision actually have 24 and 53 bits for the mantissa, respectively.
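To see the hidden bit at work, one can split a single precision number into its three fields and rebuild the value, adding the leading 1 back by hand. This is a sketch for normalised numbers only (stored exponent between 1 and 254); decode_single and rebuild are hypothetical helper names.

```python
import struct

def decode_single(x):
    """Split x, stored as an IEEE single, into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 stored bits, without the leading 1
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Rebuild a normalised value, restoring the hidden leading 1."""
    mantissa = 1 + fraction / 2**23
    return (-1)**sign * mantissa * 2.0**(exponent - 127)

# 0.15625 = 1.25 * 2**-3, so the stored fraction encodes only the .25 part.
fields = decode_single(0.15625)
print(fields, rebuild(*fields))
```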


Exercise

Try to prove that $\sqrt{2}$ is an irrational number, i.e., there exist no integers $p, q$ such that the fraction $p/q = \sqrt{2}$. How many different proofs can you find?


Jean-Michel Muller Example

Exercise

Implement the sequence $x_0 = 4$, $x_1 = 4.25$, $x_n = 108 - (815 - 1500/x_{n-2})/x_{n-1}$, which has been suggested by Muller et al. [2010], first using the default floating-point numbers, and second using Decimal with increasing precision. What is the correct value?
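One possible starting point for the exercise is sketched below. The function names muller and muller_decimal, the choice of n = 50 and the 200-digit precision are assumptions made for illustration. The same recurrence is run on whatever number type the starting values have, so floats, Decimal and exact rationals (Fraction) can be compared. In exact arithmetic the terms are $x_n = (3^{n+1} + 5^{n+1})/(3^n + 5^n)$, which tends to 5, whereas rounding errors excite a third solution of the recurrence and pull the double precision iteration towards 100.

```python
from decimal import Decimal, localcontext
from fractions import Fraction

def muller(x0, x1, n):
    """Return x_n (n >= 1) of the recurrence, in the arithmetic of x0 and x1."""
    prev, curr = x0, x1
    for _ in range(n - 1):
        prev, curr = curr, 108 - (815 - 1500 / prev) / curr
    return curr

def muller_decimal(n, precision):
    """Run the recurrence in Decimal with the given number of digits."""
    with localcontext() as ctx:
        ctx.prec = precision
        return muller(Decimal(4), Decimal("4.25"), n)

n = 50
float_result = muller(4.0, 4.25, n)                       # double precision
decimal_result = muller_decimal(n, 200)                   # 200 decimal digits
exact_result = muller(Fraction(4), Fraction(17, 4), n)    # exact rationals

print("float:   ", float_result)
print("decimal: ", decimal_result)
print("fraction:", float(exact_result))
```

Repeating the Decimal run with smaller values of precision shows how many digits are needed before the computed term stays near the exact one for a given n.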
