data representation - mcgill...

Data Representation

● Real-world information

● Analog vs. Digital (binary) representation

[Figure panels: Digital Signal (bit pattern 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0), Analog Signal, Digital Signal Degradation, Analog Signal Degradation]

Analog vs. Digital data/signals

● Storage and processing units for digital (binary) data are
   ● Reliable
   ● Cheap
   ● Simple

● Beware: analog data has infinite precision and a wide range; digital data is a finite approximation with a limited range (Real number vs. real/float/double/...)

How to digitally (binary) represent/encode data?

● Numbers (12, -134, 12.34, ...)

● Characters/Strings ('a', '©', 'ش', 'א')

● Images

● Sound

...

Binary representation

● in computing and telecommunications
● a bit is a basic unit of information storage
● “binary digit”
● the maximum amount of information that can be stored in only two distinct states: 0 or 1

Bit sequences: (bit) “string” (vs. char string)

bit pattern         size      name
0 or 1              1 bit     bit
0111                4 bits    nibble
01100001            8 bits    byte
0101010110110111    16 bits   half-word
...                 32 bits   word
...                 64 bits   word

A word is a natural unit of data used by a particular processor design (8, 16, 24, 32, or 64 bits)

Binary representation ++

With 1 bit, can represent 2 distinct entities

With 2 bits, can represent 4 distinct entities

...

With N bits, can represent 2^N distinct entities

Example: {Red, Green, Blue} encoded as {00, 01, 10}
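As a rough illustration of the counting argument above, the following Python sketch computes how many bits are needed for a given number of entities and hands out consecutive bit patterns. The function names `bits_needed` and `encode_entities` are assumptions made for this example only.

```python
def bits_needed(count):
    """Smallest N such that 2**N >= count (assumes count >= 1)."""
    return max(1, (count - 1).bit_length())


def encode_entities(entities):
    """Assign the bit patterns 00..., 00...1, ... to the entities in order."""
    width = bits_needed(len(entities))
    return {entity: format(index, f"0{width}b")
            for index, entity in enumerate(entities)}


print(encode_entities(["Red", "Green", "Blue"]))
# {'Red': '00', 'Green': '01', 'Blue': '10'}
```

With 2 bits, one of the 4 patterns (11) stays unused for three entities.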

Binary representation of numbers

● Binary Coded Decimal (BCD) representation of Unsigned Integer/Real numbers
● Binary representation of Unsigned Integers
● Binary representation of Signed Integer Numbers
   ● Signed magnitude
   ● One's complement
   ● Two's complement
● Fixed-Point binary approximation/representation of Real numbers
● Floating Point binary approximation/representation of Real numbers

Binary representation/encoding of Unsigned Integers

n-bit string “x_{n-1} x_{n-2} ... x_1 x_0” has/encodes value

x = x_{n-1}·2^{n-1} + x_{n-2}·2^{n-2} + ... + x_1·2^1 + x_0·2^0

where x_0 is the Least Significant Bit (LSB) and x_{n-1} is the Most Significant Bit (MSB).

Range: 0 to +2^n - 1

Example:

0000 0000 0000 0000 0000 0000 0000 1011_2
= 0 + … + 1×2^3 + 0×2^2 + 1×2^1 + 1×2^0
= 0 + … + 8 + 0 + 2 + 1 = 11_10
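The weighted sum for unsigned integers can be sketched in Python as follows. The function name `unsigned_value` is an assumption for illustration; Python's built-in `int(bits, 2)` does the same job and is used here only as a cross-check.

```python
def unsigned_value(bits):
    """Value of a bit string read as an unsigned integer (spaces are ignored)."""
    bits = bits.replace(" ", "")
    value = 0
    # position 0 is the least significant bit, so walk the string from the right
    for position, bit in enumerate(reversed(bits)):
        value += int(bit) * 2 ** position
    return value


example = "0000 0000 0000 0000 0000 0000 0000 1011"
print(unsigned_value(example))                                   # 11
print(unsigned_value(example) == int(example.replace(" ", ""), 2))  # True
```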

Powers of 2

 n   2^n
 0   1
 1   2
 2   4
 3   8
 4   16
 5   32
 6   64
 7   128
 8   256
 9   512
10   1,024
11   2,048
12   4,096
13   8,192
14   16,384
15   32,768
16   65,536
...
32   4,294,967,296

Binary Coded Decimal (BCD) representation of unsigned int/real

Decimal:  0    1    2    3    4    5    6    7    8    9
BCD:      0000 0001 0010 0011 0100 0101 0110 0111 1000 1001

Every decimal digit gets represented by 4 bits (binary digits)

Decimal:  127          : 1    2    7
BCD:      000100100111 : 0001 0010 0111

Used mostly in

● Mainframes (financial applications)
● Embedded microcontrollers/small processors (very little computation)
● Only digital logic (no processor), display (7-segment)

Word size often 10 or 12 decimal digits (e.g., in calculators)

Binary Coded Decimal (BCD)

Advantages:
● Easy to print, display, ... thanks to per-digit conversion
● Numbers such as decimal 0.2 have an infinite place-value representation in binary (0.001100110011...) but have a finite place-value representation in binary-coded decimal (0000.0010)
● Scaling by factors of 10 by shifting (clever compiler ...)
● 10-based rounding is easy

Disadvantages:
● More complex to implement +, -, *, / (+: 15-20% more circuitry)
● Slower
● More storage space required (15: 1111 vs. 00010101)
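The per-digit conversion behind BCD can be sketched as below. This is a minimal illustration that assumes non-negative integers only; the names `to_bcd` and `from_bcd` are hypothetical helpers, not an existing library API.

```python
def to_bcd(number):
    """Encode a non-negative integer as BCD, one 4-bit group per decimal digit."""
    return " ".join(format(int(digit), "04b") for digit in str(number))


def from_bcd(bcd):
    """Decode space-separated 4-bit groups back into an integer."""
    return int("".join(str(int(group, 2)) for group in bcd.split()))


print(to_bcd(127))               # 0001 0010 0111
print(to_bcd(15))                # 0001 0101
print(format(15, "b"))           # 1111 (plain binary needs only 4 bits)
print(from_bcd("0001 0010 0111"))  # 127
```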


(intermezzo) Binary addition

Weighted Position Code

The base, or radix, of a number system defines the range of possible values that a digit may have: 0 - 9 for decimal; 0 - 1 for binary; 0 - (N-1) for base N. The general form for determining the value of a number with digits d_{n-1} ... d_1 d_0 . d_{-1} ... d_{-m} in base N is given by:

value = d_{n-1}·N^{n-1} + ... + d_1·N^1 + d_0·N^0 + d_{-1}·N^{-1} + ... + d_{-m}·N^{-m}

Example:

541.25_10 = 5×10^2 + 4×10^1 + 1×10^0 + 2×10^-1 + 5×10^-2
= (500)_10 + (40)_10 + (1)_10 + (2/10)_10 + (5/100)_10
= (541.25)_10

More later: “fixed point” encoding of a Rational number as an approximation of a Real number

Base 2,8,10,16 Number Systems

Simple base N operations ...

Multiply by powers of N by ...

Divide by powers of N by ...

Base conversion (binary “encoding”): remainder method

Example: Convert 23.375_10 to base 2.

Start by converting the integer portion:

23 = 2*11 + 1
   = 2*(2*5 + 1) + 1
   = 2*(2*(2*2 + 1) + 1) + 1
   = 2*(2*(2*(2*1 + 0) + 1) + 1) + 1
   = 2*(2*(2*(2*(2*0 + 1) + 0) + 1) + 1) + 1

Reading the remainders from the innermost to the outermost gives 23_10 = 10111_2.
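The remainder method can be sketched in Python as repeated division by the base. The function name `int_to_base` is an assumption for this example, and the sketch only handles non-negative integers and bases up to 16.

```python
def int_to_base(number, base=2):
    """Convert a non-negative integer to a digit string using the remainder method."""
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        # each division yields the next digit, least significant first
        number, remainder = divmod(number, base)
        digits.append("0123456789ABCDEF"[remainder])
    return "".join(reversed(digits))


print(int_to_base(23))      # 10111
print(int_to_base(254, 8))  # 376
```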

Base conversion: multiplication method

• Now, convert the fraction:

0.375 = 1/2*(0 + 0.75)
      = 1/2*(0 + 1/2*(1 + 0.5))
      = 1/2*(0 + 1/2*(1 + 1/2*(1 + 0)))

• Putting it all together, 23.375_10 = 10111.011_2
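The multiplication method for the fractional part can be sketched as follows. Exact rational arithmetic (`fractions.Fraction`) is used so that the sketch itself does not suffer from binary rounding; the function name `fraction_to_base` and the `max_digits` cut-off are assumptions for illustration, and the sketch only supports bases up to 10.

```python
from fractions import Fraction


def fraction_to_base(fraction, base=2, max_digits=16):
    """Convert a fraction in [0, 1) to a digit string by repeated multiplication."""
    fraction = Fraction(fraction)
    digits = []
    while fraction != 0 and len(digits) < max_digits:
        fraction *= base
        digit = int(fraction)      # the integer part is the next digit
        digits.append(str(digit))
        fraction -= digit          # continue with the remaining fraction
    return "0." + ("".join(digits) or "0")


print(fraction_to_base("0.375"))               # 0.011
print(fraction_to_base("0.2", max_digits=12))  # 0.001100110011 (cut off, never terminates)
```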

Non-terminating Base 2 Fraction

We can’t always convert a terminating base 10 fraction into an equivalent terminating base 2 fraction:

0.2_10 = 0.001100110011..._2

Binary representation of Signed Integer numbers

● Signed magnitude
● One's complement
● Two's complement
● Biased (Excess)

Signed Magnitude

• Also known as “sign and magnitude”: the leftmost bit is the sign (0 = positive, 1 = negative) and the remaining bits are the magnitude.

• Example:

+25_10 = 00011001_2
-25_10 = 10011001_2

• Two representations for zero: +0 = 00000000_2, -0 = 10000000_2.

• Largest number is +127_10, smallest number is -127_10, using an 8-bit representation.

One's complement

• The leftmost bit is the sign (0 = positive, 1 = negative). The negative of a number is obtained by complementing each bit (0 becomes 1, 1 becomes 0). This goes both ways: converting positive numbers to negative numbers, and converting negative numbers to positive numbers.

• Example:

+25_10 = 00011001_2
-25_10 = 11100110_2

• Two representations for zero: +0 = 00000000_2, -0 = 11111111_2.

• Largest number is +127_10, smallest number is -127_10, using an 8-bit representation.
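The two encodings described so far can be sketched in Python as below. The function names `signed_magnitude` and `ones_complement` are assumptions for this example, and the sketch does no range checking (it assumes the value fits in the given width).

```python
def signed_magnitude(value, bits=8):
    """Sign bit followed by the magnitude in the remaining bits."""
    magnitude = format(abs(value), f"0{bits - 1}b")
    return ("1" if value < 0 else "0") + magnitude


def ones_complement(value, bits=8):
    """Positive values as plain binary; negative values with every bit flipped."""
    pattern = format(abs(value), f"0{bits}b")
    if value < 0:
        pattern = "".join("1" if bit == "0" else "0" for bit in pattern)
    return pattern


print(signed_magnitude(25), signed_magnitude(-25))  # 00011001 10011001
print(ones_complement(25), ones_complement(-25))    # 00011001 11100110
```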

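2's complement

Read as an unsigned integer, the n-bit string 'x_{n-1} x_{n-2} ... x_1 x_0' has value

x = x_{n-1}·2^{n-1} + x_{n-2}·2^{n-2} + ... + x_1·2^1 + x_0·2^0

3-bit patterns with their unsigned and 2's complement values:

bits   unsigned   2's complement
000    0           0
001    1           1
010    2           2
011    3           3
100    4          -4
101    5          -3
110    6          -2
111    7          -1

Sorted by 2's complement value:

bits   value
100    -4
101    -3
110    -2
111    -1
000     0
001     1
010     2
011     3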

2's complement arithmetic

101: -4 + 0 + 1 = -3

[Figure: number wheel of the 3-bit patterns 000, 001, 010, 011, 100, 101, 110, 111 with their values 0, 1, 2, 3, -4, -3, -2, -1; each step around the wheel is +1]

● Given an n-bit number

x = -x_{n-1}·2^{n-1} + x_{n-2}·2^{n-2} + ... + x_1·2^1 + x_0·2^0

Range: -2^{n-1} to +2^{n-1} - 1

Example:

1111 1111 1111 1111 1111 1111 1111 1100_2
= -1×2^31 + 1×2^30 + … + 1×2^2 + 0×2^1 + 0×2^0
= -2,147,483,648 + 2,147,483,644 = -4_10

Using 32 bits: -2,147,483,648 to +2,147,483,647
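The 2's complement formula differs from the unsigned one only in the negative weight of the most significant bit. A minimal Python sketch, with the assumed helper name `twos_complement_value`:

```python
def twos_complement_value(bits):
    """Value of a bit string read as a 2's complement number (spaces are ignored)."""
    bits = bits.replace(" ", "")
    n = len(bits)
    # the most significant bit carries the negative weight -2**(n-1)
    value = -int(bits[0]) * 2 ** (n - 1)
    for position, bit in enumerate(reversed(bits[1:])):
        value += int(bit) * 2 ** position
    return value


print(twos_complement_value("101"))  # -3
print(twos_complement_value("1111 1111 1111 1111 1111 1111 1111 1100"))  # -4
```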

2's complement

● Bit 31 (the MSB of a 32-bit number) is the sign bit (easy to test for sign)
   ● 1 for negative numbers
   ● 0 for non-negative numbers
● +2^{n-1} cannot be represented (-2^{n-1} can)
● Non-negative numbers have the same unsigned and 2's-complement representation
● One representation for zero: +0 = 00000000_2, -0 = 00000000_2
● Some specific numbers
   ● 0: 0000 0000 … 0000
   ● -1: 1111 1111 … 1111
   ● Most-negative: 1000 0000 … 0000
   ● Most-positive: 0111 1111 … 1111
   ● 8-bit: largest number is +127_10, smallest number is -128_10

Negation for 2's complement

● Complement and add 1
● Complement means 1 → 0, 0 → 1

With ~x denoting the complement of x:

x + ~x = 1111...111_2 = -1
~x + 1 = -x

Example: negate +2

+2 = 0000 0000 … 0010_2
-2 = 1111 1111 … 1101_2 + 1 = 1111 1111 … 1110_2
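"Complement and add 1" can be sketched on fixed-width bit patterns as follows. Python integers are unbounded, so the sketch uses an explicit mask to emulate an n-bit register; the function name `negate` and the default width of 32 bits are assumptions for this example.

```python
def negate(pattern, bits=32):
    """2's complement negation of an n-bit pattern: flip all bits, then add 1."""
    mask = (1 << bits) - 1           # n ones: 111...1
    return ((pattern ^ mask) + 1) & mask


print(format(negate(2), "032b"))     # 11111111111111111111111111111110
print(negate(negate(2)))             # 2
print(format(negate(0b10000000, bits=8), "08b"))  # 10000000
```

The last line shows the most-negative 8-bit pattern mapping to itself, which matches the observation that +2^{n-1} cannot be represented.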

Biased (Excess B) representation

● Fast comparison through non-negative number comparison (of exponents of numbers in Floating Point representation): in 2's complement, a negative number looks larger than a positive one when the bit patterns are compared as unsigned integers
● add bias B to make >= 0

B = 4:

value   2's compl   value + B   unsigned   binary
-4      100         -4+4        0          000
-3      101         -3+4        1          001
-2      110         -2+4        2          010
-1      111         -1+4        3          011
 0      000          0+4        4          100
 1      001          1+4        5          101
 2      010          2+4        6          110
 3      011          3+4        7          111

Excess B representation: maps “0^N” (N zero bits) to -B, and “1^N” (N one bits) to -B + 2^N - 1.
To cover the full range, -B + 2^N - 1 should be = B - 1

⇒ B should be 2^{N-1}

(e.g., N=3 B=4; N=8 B=128)

Biased (Excess B) representation

Can we use unsigned arithmetic (e.g., -2+1)?

● add bias B to make > 0
● 3 bit: bias 3
  8 bit: bias 127
  11 bit: bias 1023

In a normalized Floating Point exponent, “0^N” and “1^N” are reserved

3-bit representations, B = 3:

value   value + B   unsigned   binary
-3      -3+3        0          000
-2      -2+3        1          001
-1      -1+3        2          010
 0       0+3        3          011
 1       1+3        4          100
 2       2+3        5          101
 3       3+3        6          110
 4       4+3        7          111

FP exponent bias: B is 2^{N-1} - 1 (e.g., N=3 B=3)
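A biased representation can be sketched in Python as a simple add/subtract of the bias. The names `to_excess` and `from_excess` are assumptions for illustration. The last lines check the property that motivates the bias: the patterns sort in the same order as the values they encode.

```python
def to_excess(value, bits, bias):
    """Encode value as the unsigned integer value + bias, written in `bits` bits."""
    biased = value + bias
    if not 0 <= biased < 2 ** bits:
        raise ValueError("value out of range for this bias and width")
    return format(biased, f"0{bits}b")


def from_excess(pattern, bias):
    """Decode a biased bit pattern back to the signed value."""
    return int(pattern, 2) - bias


print(to_excess(-4, bits=3, bias=4))   # 000
print(to_excess(3, bits=3, bias=4))    # 111
print(to_excess(3, bits=8, bias=127))  # 10000010
patterns = [to_excess(v, bits=3, bias=4) for v in range(-4, 4)]
print(patterns == sorted(patterns))    # True
```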

Fixed-Point Representation of Real Numbers

A fixed-point real number is an integer that is scaled by a certain factor.

E.g., 1.23 = 123/100, scaling factor 100

The scaling factor is the same for all numbers of a certain fixed-point type. Floating-point types, on the other hand, store the scaling factor as part of individual numbers.

Upper bound of a fixed-point type = upper bound of underlying integer type / scaling factor

Lower bound of a fixed-point type = lower bound of underlying integer type / scaling factor

E.g., binary fixed-point type in two's complement format, with f fractional bits and a total of b bits (iii.fffff)

Smallest representable number: -2^{b-1} / 2^f

Largest representable number: (2^{b-1} - 1) / 2^f

Calculating with Real Numbers in Fixed-Point representation

To add or subtract two fixed-point numbers (of the same fixed-point type):
● add or subtract the underlying integers

To multiply or divide two fixed-point numbers:
● multiply or divide the underlying integers
● need to re-scale the result
   for multiplication: result needs to be divided by the scaling factor
   for division: result needs to be multiplied by the scaling factor

Example: multiply two fixed-point numbers a_fp and b_fp, stored as integers a_i and b_i with scaling factor S:

a_fp · b_fp = a_i/S · b_i/S = (a_i · b_i) / S^2

If we compute a_i · b_i and read it as a fixed-point number, its value is (a_i · b_i) / S, so we need to divide a_i · b_i by S to get the correct value.
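The re-scaling rules can be sketched on plain Python integers. The scaling factor `SCALE = 100` and the helper names are assumptions for this example; the sketch uses floor division for simplicity, which is one of several possible rounding choices.

```python
from fractions import Fraction

SCALE = 100  # assumed scaling factor: two decimal digits after the point


def to_fixed(text):
    """Store a decimal string such as "1.23" as the underlying integer 123."""
    return round(Fraction(text) * SCALE)


def fixed_add(a, b):
    return a + b                 # same scale, no correction needed


def fixed_mul(a, b):
    return (a * b) // SCALE      # the raw product carries SCALE twice


def fixed_div(a, b):
    return (a * SCALE) // b      # the raw quotient would lose SCALE entirely


a, b = to_fixed("1.23"), to_fixed("2.00")
print(a, b)             # 123 200
print(fixed_add(a, b))  # 323 (represents 3.23)
print(fixed_mul(a, b))  # 246 (represents 2.46)
print(fixed_div(a, b))  # 61  (represents 0.61, rounded down from 0.615)
```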

Real Numbers in Fixed Point representation

Advantages:

● Needed if no Floating Point Unit (FPU) available
● Less hardware needed (embedded)
● Less power consumed (embedded)
● More control over error/rounding (cf. BCD) than floating point
● All representable numbers “equidistant” (what's the distance?)
● Smallest/largest number that can be represented?

Base 10 Floating Point Numbers

• Floating point numbers allow very large and very small numbers to be represented using only a few digits, at the expense of precision. The precision is primarily determined by the number of digits in the fraction (or significand, which has integer and fractional parts), and the range is primarily determined by the number of digits in the exponent.

• Example (+6.023 × 10^23):

Normalization

• The base 10 number 254 can be represented in floating point form as 254 × 10^0, or equivalently as:

25.4 × 10^1, or 2.54 × 10^2, or
.254 × 10^3, or .0254 × 10^4, or ...

infinitely many other ways, which creates problems when making comparisons, with many representations of the same number.

Hence, choose a canonical representation, the unique representative of the set of mathematically equivalent numbers.

• Floating point numbers are usually normalized, in which the radix point is located in only one possible position for a given number.

• Typically, the normalized representation places the radix point
● (A) immediately to the left of the leftmost, nonzero digit in the fraction, as in .254 × 10^3, or
● (B) immediately to the right of the leftmost, nonzero digit in the fraction, as in 2.54 × 10^2

Floating Point Example

• Represent .254 × 10^3 in a 16-bit, normalized, base 8 floating point format, with a sign bit, followed by a 3-bit 2's complement exponent, followed by four base 8 digits (possible values: 0-7). The full Floating Point number must be represented as a 16-bit binary string.

• Step #1: Convert to the target base.

.254 × 10^3 = 254_10.

Using the remainder method, we find that 254_10 = 376_8:

254/8 = 31 R 6
31/8 = 3 R 7
3/8 = 0 R 3

• Step #2: Normalize (choice (A)): 376_8 = .376_8 × 8^3

• Step #3: Fill in the bit fields, with a positive sign (sign bit = 0), an exponent of 3 (in 2's complement), and 4-digit fraction = .3760_8:

0 011 . 011 111 110 000
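Step #3 can be sketched as a small packing routine for this specific 16-bit teaching format. The function name `pack_base8_float` is an assumption, and the sketch expects an already normalized fraction given as four octal digits.

```python
def pack_base8_float(sign, exponent, octal_digits):
    """Pack sign bit, 3-bit 2's complement exponent and four octal digits into 16 bits."""
    exponent_bits = format(exponent & 0b111, "03b")   # masking yields the 2's complement pattern
    fraction_bits = "".join(format(int(digit, 8), "03b") for digit in octal_digits)
    return str(sign) + exponent_bits + fraction_bits


pattern = pack_base8_float(0, 3, "3760")
print(pattern)                         # 0011011111110000
print(format(int(pattern, 2), "04X"))  # 37F0
```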

Representations

• binary representation

0011011111110000

• octal representation

000 011 011 111 110 000
0   3   3   7   6   0

• hexadecimal representation

0011 0111 1111 0000
3    7    F    0

Range, Precision/Error (~ gap size)

• In the previous example, we have the base b = 8, the number of significant digits (not bits!) in the fraction s = 4, the largest exponent value (not bit pattern) M = 3, and the smallest exponent value m = -4.

• In the previous example, there is no explicit representation of 0, but there needs to be a special bit pattern reserved for 0, otherwise there would be no way to represent 0 without violating the normalization rule (choices (A) and (B)). We will assume a bit pattern of 0 000 000 000 000 000 represents 0. Note that 0 110 000 000 000 000 etc. are not normalized numbers! Hence, not all bit combinations are used.

• Using b, s, M, and m, we would like to characterize this floating point representation in terms of the largest positive representable number, the range of representable numbers, the smallest (nonzero) positive representable number, the smallest gap between two successive numbers, the largest gap between two successive numbers, and the total number of numbers that can be represented (not all bit combinations are used!).

Range, Precision/Error (~ gap size) using choice (A)

• Largest representable number: b^M × (1 - b^-s) = 8^3 × (1 - 8^-4) = 511.875

• Range = [-b^M × (1 - b^-s), b^M × (1 - b^-s)]

• Smallest (nonzero) positive representable number: b^m × b^-1 = 8^{-4-1} = 8^-5 = 0.000030517578125

• Largest gap: b^M × b^-s = 8^{3-4} = 8^-1 = 0.125

• Smallest gap: b^m × b^-s = 8^{-4-4} = 8^-8 = 0.0000000596046447754

• Number of representable numbers: There are 5 components: (A) sign bit; for each number except 0 for this case, there is both a positive and negative version; (B) M - (m - 1) exponents; (C) b - 1 values for the first digit (0 is disallowed for the first normalized digit); (D) b^{s-1} values for the s-1 remaining digits, plus (E) a special representation for 0.

For this example: 2 × (3 - (-4 - 1)) × (8 - 1) × 8^{4-1} + 1 = 57345 numbers that can be represented. Notice how this number is smaller than the number of bit patterns that are possible in 16 bits, which is 2^16 (= 65536).

Not all bit combinations are used
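The formulas above can be evaluated mechanically. The sketch below uses exact rational arithmetic to avoid rounding in the sketch itself; the function name `characterize` and the dictionary keys are assumptions for this example, and the formulas are the choice (A) formulas from this section.

```python
from fractions import Fraction


def characterize(b, s, M, m):
    """Evaluate the range/gap/count formulas for base b, s digits, exponents m..M."""
    b = Fraction(b)
    return {
        "largest": b ** M * (1 - b ** -s),
        "smallest_positive": b ** m * b ** -1,
        "largest_gap": b ** M * b ** -s,
        "smallest_gap": b ** m * b ** -s,
        "count": 2 * ((M - m) + 1) * (b - 1) * b ** (s - 1) + 1,
    }


for name, value in characterize(b=8, s=4, M=3, m=-4).items():
    print(name, float(value))
```

For the base 8 example this reports a largest number of 511.875, a largest gap of 0.125 and 57345 representable numbers.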

Example Floating Point Format

• Number of bits = 1 + 2 + 3 = 6. Number of possible bit patterns = 64

• Largest non-zero positive number = b^M × (1 - b^-s) = 7/4

• Range = [-b^M × (1 - b^-s), b^M × (1 - b^-s)] = [-7/4, 7/4]

• Smallest non-zero positive number = b^m × b^-1 = 1/8

• Largest gap = b^M × b^-s = 1/4

• Smallest gap = b^m × b^-s = 1/32

• Number of representable numbers = 2 × ((M - m) + 1) × (b - 1) × b^{s-1} + 1 = 33

● Note for later: fill the gap around 0: de-normalized

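Gap Size Follows Exponent Size

• The relative error (remember: error ~ gap size) is approximately the same for all numbers.

• If we take the ratio of a large gap to a large number, and compare that to the ratio of a small gap to a small number, then the ratios are the same:

(b^M × b^-s) / (b^M × (1 - b^-s)) = b^-s / (1 - b^-s) = 1 / (b^s - 1)

(b^m × b^-s) / (b^m × (1 - b^-s)) = b^-s / (1 - b^-s) = 1 / (b^s - 1)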

Conversion Example

• Example: Convert (9.375 × 10^-2)_10 to 8 bit, base 2, 4 bit excess 7 exponent normalized scientific notation (using choice (B): 1 non-0 digit before the radix point)

• Start by expanding the -2 exponent in base 10: .09375.

• Next, convert from base 10 fixed point to base 2:

.09375 × 2 = 0.1875
.1875 × 2 = 0.375
.375 × 2 = 0.75
.75 × 2 = 1.5
.5 × 2 = 1.0

• Thus, (.09375)_10 = (.00011)_2.

• Finally, convert to normalized base 2 Floating Point scientific notation:

.00011 = .00011 × 2^0 = 1.1 × 2^-4

• Bit fields: sign 0, exponent -4 + 7 = 3 = 0011 (excess 7), fraction 100 (the leading 1 is hidden):

0 0011 100

(“engineering” notation: exp. multiple of 3)

IEEE-754 Floating Point Formats

Notes:
● Single precision exponent: bias 127
● Double precision exponent: bias 1023

http://dx.doi.org/10.1109/IEEESTD.2008.4610935

IEEE-754 Conversion Example

• Represent -12.625_10 in single precision IEEE-754 format.

• Step #1: Convert to target base. -12.625_10 = -1100.101_2

• Step #2: Normalize. -1100.101_2 = -1.100101_2 × 2^3

• Step #3: Fill in bit fields.

Sign is negative, so sign bit is 1. Exponent is in excess 127 (not excess 128!), so exponent is represented as the unsigned integer 3 + 127 = 130. Leading 1 of significand is hidden, so final bit pattern is:

1 1000 0010 . 1001 0100 0000 0000 0000 000
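The hand conversion can be cross-checked with Python's standard `struct` module, which packs a value as an IEEE-754 single precision number. The helper name `float32_fields` is an assumption for this example.

```python
import struct


def float32_fields(value):
    """Return (sign, exponent, fraction) bit strings of value as IEEE-754 single precision."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))  # reinterpret the 4 bytes as an integer
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]


sign, exponent, fraction = float32_fields(-12.625)
print(sign, exponent, fraction)  # 1 10000010 10010100000000000000000
print(int(exponent, 2) - 127)    # 3
```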

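IEEE-754 Floating Point Formats

[Figure: IEEE-754 single and double precision formats, annotated with the exponent patterns “1^8” (single precision) and “1^11” (double precision)]

Denormalized numbers (exponent 0), aka “subnormal” numbers (with less precision, i.e., fewer digits):
Single precision: fraction × 2^-126
Double precision: fraction × 2^-1022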

IEEE-754 Examples

Notes:
● (g): de-normalized
● (g) vs. (i)

Beware of Limited Precision!

Representation of elements of ℝ (math) as single/double precision Floating Point (computer)

>>> A = 500.0
>>> B = A + 1e-15
>>> C = B - 500.0
>>> print(C)
0.0

>>> A = 500.0
>>> B = A - 500.0
>>> C = B + 1e-15
>>> print(C)
1e-15

Limited Precision (ctd.)

1/49 * 49 = 1/51 * 51 = 1 (in ℝ)

>>> v1 = 1/49.0 * 49
>>> v2 = 1/51.0 * 51
>>> print('%.16f %.16f' % (v1, v2))
0.9999999999999999 1.0000000000000000

Remember also that some numbers have an exact decimal representation, but no exact binary representation (i.e., need an infinite number of bits)

Base 10: 0.2
Base 2: 0.001100110011...

Comparing two FP numbers

if number1 == number2: …

versus

if abs(number1 - number2) < epsilon: …

>>> from sys import float_info
>>> float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

>>> epsilon = float_info[8]
>>> epsilon
2.220446049250313e-16
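A tolerance-based comparison can be sketched as follows. The helper `nearly_equal` and its default tolerance are assumptions for illustration; it uses a relative tolerance, which scales with the magnitude of the operands, and mirrors the behaviour of the standard library function `math.isclose`.

```python
import math


def nearly_equal(a, b, rel_tol=1e-9, abs_tol=0.0):
    """True if a and b differ by at most rel_tol relative to the larger magnitude."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)


total = 0.1 + 0.2
print(total == 0.3)              # False
print(nearly_equal(total, 0.3))  # True
print(math.isclose(total, 0.3))  # True
```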

Beyond BCD: Decimal Fixed Point and Floating Point arithmetic, including Arbitrary Precision

http://docs.python.org/3.3/library/decimal.html
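A brief sketch of what the `decimal` module offers, using only `Decimal` and `localcontext` from the standard library. The chosen precision of 50 digits is an arbitrary value for this example.

```python
from decimal import Decimal, localcontext

# Decimal fractions such as 0.1 are stored exactly, unlike binary floats
print(0.1 + 0.2 == 0.3)                                   # False
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True

# The working precision (number of significant digits) is configurable
with localcontext() as context:
    context.prec = 50
    third = Decimal(1) / Decimal(3)
print(third)
```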

1994 Pentium FDIV bug in the Intel P5 Pentium floating point unit (FPU)

Effect of Loss of Precision

According to the General Accounting Office of the U.S. Government, a loss of precision in converting 24-bit integers into 24-bit floating point numbers was responsible for the failure of a Patriot anti-missile battery.

Representation of text: Alphanumeric Data

3 standards for the representation of letters (alpha) and digits (numerical)

➢ ASCII – American Standard Code for Information Interchange
➢ EBCDIC – Extended Binary-Coded Decimal Interchange Code – IBM mainframes
➢ Unicode


ASCII = American Standard Code for Information Interchange.

• ASCII code originally used only 7 bits to represent a character.

• 2^7 = 128 unique characters.

• 8th bit unused (or used as a parity bit for error checking)

• ASCII evolved to use all 8 bits: able to represent 256 unique characters.

ASCII Character Code

Columns: the 3 most significant bits. Rows: the 4 least significant bits.

      000   001   010   011   100   101   110   111
0000  NUL   DLE   SP    0     @     P     `     p
0001  SOH   DC1   !     1     A     Q     a     q
0010  STX   DC2   "     2     B     R     b     r
0011  ETX   DC3   #     3     C     S     c     s
0100  EOT   DC4   $     4     D     T     d     t
0101  ENQ   NAK   %     5     E     U     e     u
0110  ACK   SYN   &     6     F     V     f     v
0111  BEL   ETB   '     7     G     W     g     w
1000  BS    CAN   (     8     H     X     h     x
1001  HT    EM    )     9     I     Y     i     y
1010  LF    SUB   *     :     J     Z     j     z
1011  VT    ESC   +     ;     K     [     k     {
1100  FF    FS    ,     <     L     \     l     |
1101  CR    GS    -     =     M     ]     m     }
1110  SO    RS    .     >     N     ^     n     ~
1111  SI    US    /     ?     O     _     o     DEL

• ASCII is a 7-bit code, commonly stored in 8-bit bytes.

• 'A' is at 41_16. To convert upper case letters to lower case letters, add 20_16. Thus 'a' is at 41_16 + 20_16 = 61_16.

• The character '5' at position 35_16 is different from the number 5! To convert character-numbers into number-numbers, subtract 30_16: 35_16 - 30_16 = 5.

'a' = 1100001_2 = 141_8 = 61_16 = 97_10
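The two ASCII offsets mentioned above (20_16 between the cases, 30_16 between digit characters and their values) can be sketched in Python. The helper names `to_lower` and `digit_value` are assumptions for this example; real code would normally use `str.lower()` and `int()`.

```python
def to_lower(character):
    """Convert an upper case ASCII letter to lower case by adding 0x20."""
    if "A" <= character <= "Z":
        return chr(ord(character) + 0x20)
    return character


def digit_value(character):
    """Convert an ASCII digit character to its numeric value by subtracting 0x30."""
    if not "0" <= character <= "9":
        raise ValueError("not an ASCII digit")
    return ord(character) - 0x30


print(hex(ord("A")), hex(ord("a")))  # 0x41 0x61
print(to_lower("A"))                 # a
print(digit_value("5"))              # 5
print(format(ord("a"), "b"), format(ord("a"), "o"), ord("a"))  # 1100001 141 97
```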



33 Control codes

CR & LF

95 display-able codes

Alphabetic codes

“Hello, world”

Binary:
01001000 01100101 01101100 01101100 01101111 00101100 00100000 01110111 01101111 01110010 01101100 01100100

Hexadecimal:
48 65 6C 6C 6F 2C 20 77 6F 72 6C 64

Decimal:
72 101 108 108 111 44 32 119 111 114 108 100

Note: 12 characters require 12 bytes; each character requires 1 byte (8 bits)
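The three views of the string can be reproduced with Python's built-in `str.encode`, which yields the underlying bytes. This is only a cross-check of the values, not part of the original slides.

```python
text = "Hello, world"
data = text.encode("ascii")   # one byte per character

print(len(data))                                      # 12
print(" ".join(format(byte, "08b") for byte in data))  # binary, one group per byte
print(data.hex().upper())                              # 48656C6C6F2C20776F726C64
print(list(data))  # [72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100]
```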

Intermezzo: representing character strings

● C approach

● Java approach

Numeric codes

“4+15” (char)

Binary (ASCII): 00110100 00101011 00110001 00110101

Hexadecimal: 34 2B 31 35

Decimal: 52 43 49 53

Punctuation, etc.


• EBCDIC is an 8-bit code.

import sys

def lower2Upper(character):
    # distance between the upper and lower case alphabets in the encoding in use
    distanceLower2Upper = ord("A") - ord("a")
    if ord("a") <= ord(character) <= ord("z"):
        upperOrd = ord(character) + distanceLower2Upper
        return chr(upperOrd)
    else:
        # does not catch the "hole" in EBCDIC encoding
        print("lower2Upper only accepts lower case alphabet letters")
        sys.exit(1)

print(ord("a"))   # gives different results for ASCII and EBCDIC
print(chr(100))   # gives different results for ASCII and EBCDIC

print(lower2Upper("c"))
print(lower2Upper("C"))

Output:

97
d
C
lower2Upper only accepts lower case alphabet letters

# for all lower case characters c: invariant upper2Lower(lower2Upper(c)) == c

Extended (8bit) ASCII

Unicode Character Code

A character is the smallest possible component of a text (e.g., ‘A’, ‘B’, ‘È’ and ‘Í’) that has semantic value.

Even the extended (8 bit) version of ASCII is not enough for international use.

The Unicode standard (http://www.unicode.org/) describes how characters are represented by unique code points. A code point is an integer value, usually denoted in base 16. Values range from 0 through 0x10FFFF (1,114,111 decimal).

The notation U+12CA is used to denote the character with value 0x12ca (4,810 decimal).

The Unicode standard contains tables listing characters and their corresponding code points:

0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET

Unicode was designed to be an ASCII superset: the first 128 code points are identical to ASCII, and the first 256 are identical to Latin-1 (ISO-8859-1), a common 8-bit extension of ASCII.

http://www.unicode.org/charts/

U+211D ℝ

in Python(3):

>>> print("\N{DOUBLE-STRUCK CAPITAL R}")
ℝ
>>> print("\u211D")
ℝ

>>> ord("\u211D")
8477
>>> chr(8477)
'ℝ'

>>> ord('€')
8364

>>> hex(ord('€'))
'0x20ac'

>>> chr(8364)
'€'

>>> import unicodedata
>>> unicodedata.name('€')
'EURO SIGN'

>>> unicodedata.lookup('EURO SIGN')
'€'

>>> unicodedata.category('€')  # http://www.fileformat.info/info/unicode/category/index.htm
'Sc'  # [S]ymbol [c]urrency

Unicode code points

A Unicode code point represents a character

Characters are defined by their meaning in a language; glyphs are defined by their appearance.

A text-to-speech reader should pronounce “a 339 Ω resistor” as “a three hundred and thirty nine Ohm resistor” and not as “a three hundred and thirty nine uppercase omega resistor”.

The glyph Ω is represented by Unicode character
● U+03A9 when it represents the Greek letter omega
● U+2126 when it represents Ohms, the unit of electrical resistance.

The glyph M is represented by Unicode character
● U+004D when it represents a Latin letter
● U+216F when it represents the Roman numeral for 1,000.
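The distinction between look-alike glyphs and distinct characters can be inspected with the standard `unicodedata` module. This is a small illustration only; the loop simply prints the official character names for the four code points mentioned above.

```python
import unicodedata

for code_point in (0x03A9, 0x2126, 0x004D, 0x216F):
    character = chr(code_point)
    print(f"U+{code_point:04X} {character} {unicodedata.name(character)}")

# Same appearance, different characters:
print(chr(0x03A9) == chr(0x2126))  # False
```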

Glyphs are handled by font renderers

http://designrfix.com/fonts/arial-helvetica

anatomy of a glyph

typeface vs. font

Back in the good old days of analog printing, every page was laboriously set out in frames with metal letters. That was rolled in ink, and then it was pressed down onto a clean piece of paper. That was a page layout. Printers needed thousands of physical metal blocks, each with the character it was meant to represent set out in relief (the type face). If you wanted to print Garamond, for example, you needed different blocks for every different size (10 point, 12 point, 14 point, and so on) and weight (bold, light, medium).

A typeface (also known as font family) is a set of one or more fonts each composed of glyphs that share common design features. Each font of a typeface has a specific weight, style, condensation, width, slant, italicization, ornamentation, and designer or foundry (and formerly size, in metal fonts).

A font described a subset of blocks in a typeface—but each font embodied a particular size and weight. For example, bolded Garamond in 12 point was considered a different font than normal Garamond in 8 point, and italicized Times New Roman at 24 point would be considered a different font than italicized Times New Roman at 28 point.

http://whsdesignandphoto.weebly.com/typefaces.html

scalable vs. bitmap font

A scalable font is a font that is created in the required point size when needed for display or printing. The dot patterns (bitmaps) are generated from a set of outline fonts, or base fonts, which contain a mathematical representation of the typeface. The two major scalable fonts are Adobe's Type 1 PostScript and Apple/Microsoft's TrueType.

A bitmapped font designed from scratch for a particular font size always looks the best. Scalable fonts, however, eliminate storing hundreds of different sizes of fonts on disk. In most cases, only the trained eye can tell the difference. Scaling does not always retain all properties.

http://www.pcmag.com/encyclopedia/term/50836/scalable-font

character vs. glyph: ligatures

A ligature glyph is the joining together of two or more glyphs into one continuous glyph. The ligature for aesthetically combining “fi” is one glyph, but two characters.

A ligature character (Unicode standard): "The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged."

alif lām

Unicode string encodings

A Unicode string is a sequence of code points (each representing a character).

This sequence needs to be represented as a sequence of bytes (unsigned integer values from 0 through 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.

Encodings don’t have to handle every possible Unicode character, and most encodings don’t.

ASCII encoding:

If a code point is < 128, each byte is the same as the value of the code point. If a code point is >= 128, the Unicode string can not be represented in this encoding.

Latin-1, also known as ISO-8859-1 encoding:

Unicode code points 0–255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can’t be encoded into Latin-1.

>>> ord('a'.encode('ASCII'))
97

>>> '€'.encode('ASCII')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 0: ordinal not in range(128)

>>> ord('a'.encode('Latin-1'))
97

>>> '€'.encode('Latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)


UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that (one to four) 8-bit numbers are used in the encoding (i.e., a “variable length encoding”).

UTF-8 has several convenient properties:

● It can handle any Unicode code point.

● A Unicode string is turned into a string of bytes containing no embedded zero bytes (unless the string itself contains the code point U+0000). Hence, UTF-8 strings can be processed by C functions such as strcpy() and sent through (e.g., network) protocols that can’t handle zero bytes.

● A string of ASCII text is also valid UTF-8 text.

● UTF-8 is fairly compact: most commonly used characters can be represented with one or two bytes.

● If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
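The variable-length scheme behind these properties can be sketched by hand. The function `utf8_encode` below is a simplified illustration written for this example: it assumes a valid code point between 0 and 0x10FFFF and does not reject surrogates. In practice, `str.encode('UTF-8')` should be used; it serves here as a cross-check.

```python
def utf8_encode(code_point):
    """Encode one code point as 1 to 4 bytes (no validation of surrogates or range)."""
    if code_point < 0x80:        # 7 bits:  0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(ord("a")))                          # b'a'
print(utf8_encode(ord("€")))                          # b'\xe2\x82\xac'
print(utf8_encode(ord("€")) == "€".encode("UTF-8"))   # True
```

Continuation bytes always start with the bits 10, which is what allows a decoder to find the start of the next code point after corruption.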

Unicode string encodings

>>> ord('a'.encode('UTF-8'))
97

>>> '€'.encode('UTF-8')
b'\xe2\x82\xac'

>>> '€'.encode('UTF-16')
b'\xff\xfe\xac '

>>> '€'.encode('UTF-32')
b'\xff\xfe\x00\x00\xac \x00\x00'

>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'

>>> b'\xff\xfe\xac '.decode('UTF-16')
'€'

>>> b'\xff\xfe\x00\x00\xac \x00\x00'.decode('UTF-32')
'€'

Unicode string (en/de)coding

ℝ

In-browser UTF-8 test: http://www.fileformat.info/info/unicode/utf8test.htm
UTF-8 format description: http://www.fileformat.info/info/unicode/utf8.htm

The big picture
