
Page 1

INFORMATION THEORY

and Communication

Wayne Lawton

Department of Mathematics

National University of Singapore

S14-04-04, matwml@nus.edu.sg

http://math.nus.edu.sg/~matwml

Page 2

CHOICE CONCEPT

We are confronted by choices every day; for instance, we may choose to purchase an apple, a banana, or a coconut, or to withdraw k < N dollars from our bank account.

Choice has two aspects. First, we may need to make the decision: "what fruit to purchase" or "how many dollars to withdraw". We will ignore this aspect, which is the concern of Decision Theory.

We may also need to communicate this choice to our food seller or our bank. This is the concern of Information Theory.

Page 3

INCREASING CHOICE?

The number of choices increases if more elements are added to the set of existing choices. For example, suppose a shopper is to choose one fruit from a store that carries Apples, Bananas, and Coconuts, and then discovers that the store has added Durians and Elderberries. The number of choices increases from 3 to 5 by the addition of 2 extra choices.

The number of choices also increases if two or more sets of choices are combined. A shopper may have the choice of choosing one fruit from {A,B,C} on Monday and choosing one fruit from {D,E} on Thursday. Compare with the case above.

Adding choices: {A,B,C}, {D,E} -> {A,B,C,D,E}   (5 choices)

Combining choices: {A,B,C} × {D,E} = {(A,D),(A,E),(B,D),(B,E),(C,D),(C,E)}   (6 choices)

Page 4

MEASURING CHOICE?

For any set X let #X denote the number of elements in X. Then

#(X × Y) = #X · #Y.

We require that the information H(X), required to specify the choice of an element from a set X, be an additive function, therefore

H(X × Y) = H(X) + H(Y).

Theorem 1. The logarithm functions measure information:

H(X) = log_a #X,

the logarithm to the base a, where a > 0; called bits if a = 2, nats if a = e.
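A one-line numeric illustration of this additivity (our own MATLAB sketch, not part of the original slides; the set sizes 3 and 2 are arbitrary):

nX = 3; nY = 2;                  % #X and #Y
disp(log2(nX*nY))                % bits needed to specify a choice from X × Y: about 2.585
disp(log2(nX) + log2(nY))        % the same number, as additivity requires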

Page 5

FACTORIAL FUNCTION

For any positive integer n define n factorial:

n! = n (n-1) (n-2) ... 2 · 1,   and   0! = 1.

Often the choices within different sets are mutually constrained. If a shopper is to purchase a different fruit from the set {A,B,C,D,E} on each of 5 days, then there are 5 choices on the first day but only 4 choices on the second day, etc., so the total number of choices equals

5 · 4 · 3 · 2 · 1 = 5! = 120.

Page 6

STIRLING’S APPROXIMATION

Theorem 2.

log_e n! \approx n \log_e n - n + (1/2) \log_e n + constant

Proof [K] pages 111-115
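Problem 2 below asks for a program demonstrating Stirling's Approximation; a minimal MATLAB sketch (ours, not from the slides, taking the constant to be (1/2) log_e(2*pi)) could be:

n = 1:20;
exact  = gammaln(n+1);                                 % log_e n! computed via the log-gamma function
approx = n.*log(n) - n + 0.5*log(n) + 0.5*log(2*pi);   % Stirling's approximation
disp([n' exact' approx'])                              % the two columns agree more closely as n grows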

Page 7

COMBINATORICS

Theorem 2 (Binomial).

(s + t)^n = \sum_{k=0}^{n} \binom{n}{k} s^k t^{n-k},

where \binom{n}{k}, called n choose k, is the number of ways of choosing k elements from a set with n elements. Furthermore, n choose k equals n! / (k! (n-k)!).

Proof. Consider that (s + t)^n = (s + t)(s + t) ... (s + t) is the product of n factors and it equals the sum of 2^n terms; each term is obtained by specifying a choice of s or t from each factor, and the number of terms equal to s^k t^{n-k} is exactly the number of ways of choosing which k of the n factors contribute s.
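A quick numeric check of the Binomial Theorem in MATLAB (our sketch, not from the slides; s = 2, t = 3, n = 5 are arbitrary test values):

s = 2; t = 3; n = 5;
k = 0:n;
lhs = (s + t)^n;
rhs = sum(arrayfun(@(j) nchoosek(n,j), k) .* s.^k .* t.^(n-k));
disp([lhs rhs])                  % both equal 3125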

Page 8

MULTINOMIAL COEFFICIENTS

Theorem 3. The number of sequences with n_i \ge 0 symbols of type i, for i = 1, ..., M, equals

N! / (n_1! n_2! ... n_M!),

where N = n_1 + ... + n_M.
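For instance, a multinomial coefficient can be evaluated directly from Theorem 3 (our MATLAB sketch; the counts 2, 1, 3 are an arbitrary example):

n = [2 1 3];                              % n_1, n_2, n_3
N = sum(n);
disp(factorial(N)/prod(factorial(n)))     % N!/(n_1! n_2! n_3!) = 60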

Page 9

SHANNON’S FORMULA

Theorem 4. For large N, the average information per symbol of a string of length N containing M symbols with probabilities p_1, ..., p_M is

H(p_1, ..., p_M) = -\sum_{i=1}^{M} p_i \log_2 p_i   bits.

Proof. The law of large numbers says that the i-th symbol will occur approximately n_i = N p_i times, i = 1, ..., M, so the result follows from Stirling's Approximation.
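Shannon's formula is a one-liner in MATLAB (our sketch; the probabilities below are an arbitrary example):

p = [0.5 0.25 0.25];           % symbol probabilities, summing to 1
disp(-sum(p .* log2(p)))       % H = 1.5 bits per symbol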

Page 10

Page 11

ENTROPY OF A RANDOM VARIABLE

Let X denote a random variable with values in a set A = {a_1, ..., a_m} with probabilities p_1, ..., p_m. We define the entropy H(X) by

H(X) = H(p_1, ..., p_m) = -\sum_{i=1}^{m} p_i \log_2 p_i   bits.

Hence the entropy of a random variable equals the average information required to describe the values that it takes: it takes 1000 bits to describe 1000 flips of a fair coin, but we can describe the loaded-coin sequence HHHHHHTHHHHHHHHHTT by its run lengths 6H 1T 9H 2T.

Recall that for a large integer N, NH(X) equals the log of the number of strings of length N from A whose relative frequencies of letters are these probabilities.

Page 12

JOINT DISTRIBUTION

Let X denote a random variable with values in a set A = {a_1, ..., a_m} with probabilities p_1, ..., p_m, and let Y be a random variable with values in a set B = {b_1, ..., b_n} with probabilities q_1, ..., q_n. Then X × Y is a random variable with values in

A × B = { (a_i, b_j) | i = 1, ..., m;  j = 1, ..., n }

whose probabilities r_{ij}, i = 1, ..., m; j = 1, ..., n, satisfy the marginal equations (m + n - 1 of which are independent):

\sum_{i=1}^{m} r_{ij} = q_j,    \sum_{j=1}^{n} r_{ij} = p_i.

Page 13

MUTUAL INFORMATION

The joint entropy of X and Y,

H(X, Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij} \log_2 r_{ij},

satisfies

H(X, Y) \le H(X) + H(Y).

Equality holds if and only if X and Y are independent, which means that r_{ij} = p_i q_j.

The mutual information of X and Y, defined by

I(X, Y) = H(X) + H(Y) - H(X, Y),

satisfies 0 \le I(X, Y) \le min{ H(X), H(Y) } and 0 \le I(X, X) = H(X).
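These quantities can be computed directly from a joint probability matrix (our MATLAB sketch; the 2 x 2 matrix r is an arbitrary example with no zero entries):

r = [0.4 0.1; 0.1 0.4];            % r(i,j) = Prob{X = a_i, Y = b_j}
p = sum(r, 2);  q = sum(r, 1);     % marginal distributions of X and Y
Hxy = -sum(r(:) .* log2(r(:)));    % joint entropy H(X,Y)
Hx = -sum(p .* log2(p));  Hy = -sum(q .* log2(q));
disp(Hx + Hy - Hxy)                % mutual information I(X,Y), about 0.278 bits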

Page 14

CHANNELS AND THEIR CAPACITY

A channel is a relationship between the transmitted message X and the received message Y.

Typically, this relationship does not determine Y as a function of X but only determines the statistics of Y given the value of X; this determines the joint distribution of X and Y. The channel capacity is defined as C = max { I(X, Y) }.

Example: a binary channel with a 10% bit error rate, where Prob{X = 1} = p_1, has joint probabilities

               Y = 0           Y = 1
X = 0      .9 (1 - p_1)    .1 (1 - p_1)
X = 1      .1 p_1          .9 p_1

Max I(X, Y) = .531 bits for p_1 = .5.
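The .531 figure can be reproduced by sweeping p_1 and maximizing I(X, Y) (our MATLAB sketch, not part of the slides):

e = 0.1;  p1 = 0:0.01:1;                          % bit error rate and Prob{X = 1}
h = @(x) -x.*log2(x + (x==0));                    % entropy term, treating 0*log2(0) as 0
Hx = h(p1) + h(1-p1);                             % H(X)
q1 = e*(1-p1) + (1-e)*p1;                         % Prob{Y = 1}
Hy = h(q1) + h(1-q1);                             % H(Y)
Hxy = h((1-e)*(1-p1)) + h(e*(1-p1)) + h(e*p1) + h((1-e)*p1);   % joint entropy H(X,Y)
[C, k] = max(Hx + Hy - Hxy);
fprintf('C = %.3f bits at p1 = %.2f\n', C, p1(k)) % about 0.531 bits at p1 = 0.50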

Page 15

SHANNON'S THEOREM

If a channel has capacity C then it is possible to send information over that channel with a rate arbitrarily close to, but never more than, C with a probability of error arbitrarily small.

Shannon showed that this was possible to do by proving that there exists a sequence of codes whose rates approach C and whose probabilities of error approach zero.

His masterpiece, called the Channel Coding Theorem, never actually constructed any specific codes, and thus provided jobs for thousands of engineers, mathematicians and scientists.

Page 16

LANGUAGE AS A CODE

During my first visit to Indonesia I ate a curious small fruit. Back in Singapore I went to a store and asked for a small fruit with the skin of a dark brown snake and more bitter than any gourd. Now I ask for Salak – a far more efficient, if less descriptive, code to specify my choice of fruit.

When I specify the number of dollars that I want to withdraw from my bank account I use positional notation (in base 10), a code to specify nonnegative integers that was invented in Babylonia (now Iraq) about 4000 years ago (in base 60).

Digital Computers, in contrast to analog computers, represent numbers using positional notation in base 2 (shouldn't they be called binary computers?). Is this because they can't count further than 1? These lectures will explore this and other intriguing mysteries.

Page 17

WHAT IS THE BEST BASE?

A base B code of length L uses an ordered sequence of L symbols from a set of B symbols to represent B x B x ... x B = B^L (read 'B to the power L') choices.

Physically, this is represented using L devices each of which can exist in one of B states.

The cost is L times the cost of each device and the cost of each device is proportional to B since physical material is required to represent each of the B-1 ‘inactive states’ for each of the L devices that correspond to each position.

The efficiency of base B is therefore the ratio of the information in a base B sequence of length L to its cost BL:

Eff(B) = (L \log_e B) / (B L) = (\log_e B) / B
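Tabulating this efficiency for small integer bases (our MATLAB sketch; over the reals the maximum of (log_e B)/B is at B = e, so base 3 wins among the integers):

B = 2:10;
Eff = log(B) ./ B;                         % efficiency of base B
disp([B; Eff])
[~, k] = max(Eff);
fprintf('best integer base: %d\n', B(k))   % 3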

Page 18

Page 19

IS THE SKY BLUE?

If I use base 2 positional notation to specify that I want to withdraw d < 8 dollars from my bank account then

d = 0: 000   d = 1: 001   d = 2: 010   d = 3: 011
d = 4: 100   d = 5: 101   d = 6: 110   d = 7: 111

Positional notation is great for computing, but if I decide to withdraw 2 rather than 1 (or 4 rather than 3) dollars I must change my code by 2 (or 3) bits. Consider the Gray code:

d = 0: 000   d = 1: 001   d = 2: 011   d = 3: 010
d = 4: 110   d = 5: 111   d = 6: 101   d = 7: 100

What's different?
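One standard way to generate this Gray code is d XOR floor(d/2) (our MATLAB sketch; this is the usual binary-reflected construction, not necessarily the one the slides have in mind):

d = 0:7;
g = bitxor(d, bitshift(d, -1));    % Gray code value of d
disp(dec2bin(g, 3))                % 000 001 011 010 110 111 101 100: neighbours differ in one bit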

Page 20

GRAY CODE GEOMETRY

How many binary Gray codes of length 3 are there? And how can we construct them? Cube geometry gives the answers.

[Cube diagram: the 8 vertices of the unit cube labeled with the binary sequences 000, 001, 010, 011, 100, 101, 110, 111]

d = 0: 000   d = 1: 001   d = 2: 011   d = 3: 010
d = 4: 110   d = 5: 111   d = 6: 101   d = 7: 100

Bits in a code are the Cartesian Coordinates of the vertices. The d-th and (d+1)-th vertex share a common edge.

Answer the questions.

Page 21

PROBLEMS?

1. Derive Theorem 1. Hint: review properties of logarithms.

2. Write and run a simple computer program to demonstrate Stirling’s Approximation.

3. Derive the formula for n choose k by induction and then try to find another derivation. Then use the other derivation to derive the multinomial formula.

4. Complete the details for the second half of the derivation of Shannon’s Formula for Information.

5. How many binary Gray codes are there of length 3?

Page 22

ERROR CORRECTING CODES

How many bits of information can be sent reliably by sending 3 bits if one of those bits may be corrupted? Consider the 3-dimensional binary hypercube.

[Cube diagram: the 8 vertices of the unit cube labeled with the binary sequences 000, 001, 010, 011, 100, 101, 110, 111]

H = {binary seq. of length 3}

A Code C is a subset of H

The Hamming Distance d(x,y) between x and y in H is the number of bits by which they differ; hence d(010,111) = 2. The minimal distance d(C) of a code C is min { d(x,y) | x, y in C, x ≠ y }.

A code C can correct 1 error bit if and only if d(C) ≥ 3. So we can send 1 bit reliably with the code C = {(000),(111)}.

H has 8 sequences
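The claim about C = {(000),(111)} is easy to check by computing Hamming distances (our MATLAB sketch, not from the slides):

hd = @(x, y) sum(x ~= y);              % Hamming distance between binary vectors
disp(hd([0 0 0], [1 1 1]))             % d(C) = 3, so one corrupted bit can be corrected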

Page 23

PARITY CODES

If we wanted to send 4 bits reliably (to correct up to 1 bit error) then we could send each of these bits three times. This code consists of a set C of 16 sequences having length 12, and the code rate is only 33% since 12 bits are used to communicate 4 bits. However, it is possible to send 4 bits reliably using only 8 bits, by arranging the four bits in a 2 x 2 square and assigning 4 parity bits, one for each row and each column.

To send a sequence abcd, transmit the array (the subscript 2 means mod 2):

a          b          (a+b)_2
c          d          (c+d)_2
(a+c)_2    (b+d)_2

Note: a single bit error in a, b, c, d results in odd parity in its row and column. Ref: see the rectangular and triangle codes in [H].
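A sketch of how this 2 x 2 parity code locates and corrects a single flipped data bit (ours, not from the slides; the data bits 0,1,1,1 are an arbitrary example):

data = [0 1; 1 1];                               % the data bits a b; c d
rowp = mod(sum(data, 2), 2);                     % row parities (a+b)_2, (c+d)_2
colp = mod(sum(data, 1), 2);                     % column parities (a+c)_2, (b+d)_2
recv = data;  recv(1,2) = 1 - recv(1,2);         % one data bit is corrupted in transit
badrow = find(mod(sum(recv, 2), 2) ~= rowp);     % the row whose parity check fails
badcol = find(mod(sum(recv, 1), 2) ~= colp);     % the column whose parity check fails
recv(badrow, badcol) = 1 - recv(badrow, badcol); % flip that bit back
disp(isequal(recv, data))                        % 1: the single error has been corrected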

Page 24

HAMMING CODES

The following [7,4,3] Hamming Code can send 4 bits reliably using only 7 bits; it has d(C) = 3.

1111111   1000101   1100010   0110001
1011000   0101100   0010110   0001011
0000000   1101001   0111010   1110100
1001110   1010011   0100111   0011101
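A brute-force check that these sixteen codewords have minimum distance 3 (our MATLAB sketch, not from the slides):

words = ['1111111';'1000101';'1100010';'0110001';'1011000';'0101100';'0010110';'0001011'; ...
         '0000000';'1101001';'0111010';'1110100';'1001110';'1010011';'0100111';'0011101'];
C = words - '0';                       % 16 x 7 array of bits
dmin = 7;
for i = 1:16
  for j = i+1:16
    dmin = min(dmin, sum(C(i,:) ~= C(j,:)));
  end
end
disp(dmin)                             % 3, matching d(C) = 3 above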

Page 25

OTHER CODES

Hamming Codes are examples of cyclic group codes – why?

BCH (Bose-Chaudhuri-Hocquenghem) codes are another class of cyclic group codes and are generated by the coefficient sequences of certain irreducible polynomials over a finite field.

Reed-Solomon Codes were the first class of BCH codes to be discovered. They were first used by NASA for space communications and are now used for error correction in CDs.

Other codes include: Convolutional, Goethals, Golay, Goppa, Hadamard, Julin, Justesen, Nordstrom-Robinson, Pless double circulant, Preparata, Quadratic Residue, Rao-Reddy, Reed-Muller, t-designs and Steiner systems, Sugiyama, Trellis, Turbo, and Weldon codes. There are many waiting to be discovered and the number of open problems is huge.

Page 26

COUNTING STRINGS

Let T_1 \le T_2 \le ... \le T_n be positive real numbers and let m_1, ..., m_n be positive integers. Let A be an alphabet with m_1 symbols of time duration T_1, m_2 symbols of time duration T_2, ..., and m_n symbols of time duration T_n, and let

CS(t, m_1, ..., m_n, T_1, ..., T_n),   t \ge 0,

be the number of strings, made from the letters of A, whose time duration is \le t.

Page 27

MORSE CODE MESSAGES

Examples: If n = 2, A = {dot, dash}, m_1 = m_2 = 1, T_1 = 1 = time duration of a dot, and T_2 = 2 = time duration of a dash, then CS(t, 1, 1, T_1, T_2) = # messages whose duration is \le t:

CS(1, 1, 1, 1, 2) = 1
CS(2, 1, 1, 1, 2) = 3
CS(3, 1, 1, 1, 2) = 6
CS(4, 1, 1, 1, 2) = 11

Page 28

PROTEINS

n = 20, m_i = 1 for i = 1, ..., n, and A = {amino acids} = {Glycine, Alanine, Valine, Phenylalanine, Proline, Methionine, Isoleucine, Leucine, Aspartic Acid, Glutamic Acid, Lysine, Arginine, Serine, Threonine, Tyrosine, Histidine, Cysteine, Asparagine, Glutamine, Tryptophan}

T_i = weight of the corresponding peptide unit of the i-th amino acid, arranged from lightest (i = 1) to heaviest (i = 20)

[Diagram: a peptide unit H-N-C-C=O with amino acid residue R, and a single chain protein with three units capped by H and OH]

CS(t, 1, ..., 1, T_1, ..., T_20) = # single chain proteins with weight \le t - weight(H_2O)

Page 29

RECURSIVE ALGORITHM

CS(t, m_1, ..., m_n, T_1, ..., T_n) =

    0                                                                          if 0 \le t < T_1

    \sum_{i=1}^{k} [ m_i + m_i CS(t - T_i, m_1, ..., m_n, T_1, ..., T_n) ]     if T_k \le t < T_{k+1}

    \sum_{i=1}^{n} [ m_i + m_i CS(t - T_i, m_1, ..., m_n, T_1, ..., T_n) ]     if T_n \le t

Page 30

MATLAB PROGRAM

function N = cs(t,m,T)
% function N = cs(t,m,T)
% Wayne Lawton 19 / 1 / 2003
% Inputs: t = time,
%   m = array of n positive integers
%   T = array of n increasing positive numbers
% Outputs: N = number of strings composed out of these
%   m(i) symbols of duration T(i), i = 1,...,n and having duration <= t
%
k = sum(T <= t);
N = 0;
if k > 0
    for j = 1:k
        N = N + m(j)*(1+cs(t-T(j),m,T));
    end
end
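As a quick check against the Morse code counts on Page 27, the call below should return 11 (this usage example is ours; it assumes the function above is saved as cs.m on the MATLAB path):

cs(4, [1 1], [1 2])    % one dot of duration 1 and one dash of duration 2: 11 messages of duration <= 4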

Page 31

Page 32

ASYMPTOTIC GROWTH

Theorem 5. For large t,

CS(t, m_1, ..., m_n, T_1, ..., T_n) \approx c X^t,

where c is some constant and X is the unique real root of the equation

\sum_{i=1}^{n} m_i X^{-T_i} = 1.

Example: CS(t, 1, 1, 1, 2) \approx c ((1 + \sqrt{5})/2)^t \approx c (1.618)^t.

Proof. For integer T's a proof based on linear algebra works, and X is the largest eigenvalue of a matrix that represents the recursion (difference equation) for CS. Otherwise the Laplace Transform is required. We discovered a new proof based on Information Theory.
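The root condition of Theorem 5 is easy to solve numerically and to compare against the counts from cs (our MATLAB sketch; for the Morse example the root lies in [1, 2]):

m = [1 1];  T = [1 2];
f = @(x) sum(m .* x.^(-T)) - 1;        % the root condition of Theorem 5
X = fzero(f, [1 2]);
disp(X)                                % 1.6180, the golden ratio
disp(cs(20, m, T) / cs(19, m, T))      % ratios of consecutive counts approach X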

Page 33

INFORMATION THEORY PROOF

We choose probabilities p_i / m_i, i = 1, ..., n, for the symbols having time duration T_i, so as to maximize H / T, where

H = H(p_1/m_1, ..., p_1/m_1, ..., p_n/m_n, ..., p_n/m_n)   (each p_i/m_i repeated m_i times)
  = -\sum_{i=1}^{n} p_i \log_2 p_i + \sum_{i=1}^{n} p_i \log_2 m_i

is the Shannon information (or entropy) per symbol, and

T = \sum_{i=1}^{n} p_i T_i

is the average duration per symbol. Clearly H / T is the average information per time.

Page 34

INFORMATION THEORY PROOF

Since there is the constraint \sum_{i=1}^{n} p_i = 1, the maximizing probabilities satisfy, for some Lagrange multiplier \lambda,

p_j = m_j e^{-\lambda T_j} / Z(\lambda),   j = 1, ..., n,

where the denominator, called the partition function, is the sum of the numerators (why?).

Substituting these probabilities into H / T shows that the maximum occurs when Z(\lambda) = 1, hence X = e^{\lambda} satisfies the root condition in Theorem 5. The proof is complete since the probabilities that maximize information are the ones that occur in the set of all sequences.

Page 35

MORE PROBLEMS?

6. Compute H(1/2, 1/3, 1/6).

7. Show that H(X,Y) is maximal when X and Y are independent

8. Read [H] and explain what a triangular parity code is.

9. Compute all Morse Code sequences of length <= 5 if dots have duration 1 and dashes have duration 2.

10. Compute the smallest molecular weight W so that at least 100 single chain proteins have weight <= W.

Page 36

REFERENCES

[BT] Carl Branden and John Tooze, Introduction to Protein Structure, Garland Publishing, Inc., New York, 1991.

[H] Sharon Heumann, Coding theory and its application to the study of sphere-packing, Course Notes, October 1998 http://www.mdstud.chalmers.se/~md7sharo/coding/main/main.html

[SW] Claude E. Shannon and Warren Weaver, The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana, 1949.

[K] Donald E. Knuth, The Art of Computer Programming, Volume 1 Fundamental Algorithms, Addison-Wesley, Reading, 1997.

[Ham] R. W. Hamming, Coding and Information Theory, Prentice-Hall, New Jersey, 1980.

[CS] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups, Springer, New York, 1993.

[BC] James Ward Brown and Ruel V. Churchill, Complex Variables and Applications, McGraw-Hill, New York, 1996.

Page 37

MATHEMATICAL APPENDIX

Let f(t) = CS(t, m_1, ..., m_n, T_1, ..., T_n) and let

F(s) = \int_0^{\infty} f(t) e^{-st} dt

be its Laplace Transform. If M = \sum_{i=1}^{n} m_i and \lfloor t \rfloor denotes the largest integer \le t, then f(t) \le (\lfloor t/T_1 \rfloor + 1) M^{\lfloor t/T_1 \rfloor}, so F(s) exists for complex s with \Re(s) > (\log_e M)/T_1.

The recursion for f implies that

F(s) = G(s) / P(s),   where   P(s) = 1 - \sum_{i=1}^{n} m_i e^{-sT_i}

and G(s) is an explicitly computable function built from the terms m_i e^{-sT_i} and from integrals of e^{-st} f(t) over bounded intervals.

Page 38

MATHEMATICAL APPENDIX

This allows F to be defined as a meromorphic function of s with singularities to the left of a line \Re(s) = \gamma, and

f(t) = (1 / (2\pi i)) P.V. \int_{\gamma - i\infty}^{\gamma + i\infty} F(s) e^{st} ds,   t > 0.

Therefore, f is given by a Bromwich integral that can be computed by a contour integral using the method of residues (see page 235 of [BC]):

f(t) = \sum_{j} Res_{s = s_j} [ F(s) e^{st} ],

where s_j, j \ge 1, are the singularities of F. The unique real singularity is s_1 = \log_e X, and for large t, f(t) \approx c X^t, thus proving Theorem 5.