
University of Illinois at Chicago, ECE 534, Fall 2009, Natasha Devroye

Chapter 2: Entropy and Mutual Information


Chapter 2 outline

• Definitions

• Entropy

• Joint entropy, conditional entropy

• Relative entropy, mutual information

• Chain rules

• Jensen’s inequality

• Log-sum inequality

• Data processing inequality

• Fano’s inequality


Definitions

A discrete random variable $X$ takes on values $x$ from the discrete alphabet $\mathcal{X}$.

The probability mass function (pmf) is described by
$$p_X(x) = p(x) = \Pr\{X = x\}, \quad \text{for } x \in \mathcal{X}.$$


Definitions


[Excerpt from MacKay, Chapter 2: Probability, Entropy, and Inference]

This chapter, and its sibling, Chapter 8, devote some time to notation. Just as the White Knight distinguished between the song, the name of the song, and what the name of the song was called (Carroll, 1998), we will sometimes need to be careful to distinguish between a random variable, the value of the random variable, and the proposition that asserts that the random variable has a particular value. In any particular chapter, however, I will use the most simple and friendly notation possible, at the risk of upsetting pure-minded readers. For example, if something is 'true with probability 1', I will usually simply say that it is 'true'.

2.1 Probabilities and ensembles

An ensemble $X$ is a triple $(x, \mathcal{A}_X, \mathcal{P}_X)$, where the outcome $x$ is the value of a random variable, which takes on one of a set of possible values, $\mathcal{A}_X = \{a_1, a_2, \ldots, a_i, \ldots, a_I\}$, having probabilities $\mathcal{P}_X = \{p_1, p_2, \ldots, p_I\}$, with $P(x = a_i) = p_i$, $p_i \ge 0$ and $\sum_{a_i \in \mathcal{A}_X} P(x = a_i) = 1$.

The name $\mathcal{A}$ is mnemonic for 'alphabet'. One example of an ensemble is a letter that is randomly selected from an English document. This ensemble is shown in figure 2.1. There are twenty-seven possible letters: a–z, and a space character '-'.

[Figure 2.1. Probability distribution over the 27 outcomes for a randomly selected letter in an English language document (estimated from The Frequently Asked Questions Manual for Linux). The picture shows the probabilities by the areas of white squares; the probabilities $p_i$ themselves are tabulated in Table 2.9 below.]

Abbreviations. Briefer notation will sometimes be used. For example, $P(x = a_i)$ may be written as $P(a_i)$ or $P(x)$.

Probability of a subset. If $T$ is a subset of $\mathcal{A}_X$ then:
$$P(T) = P(x \in T) = \sum_{a_i \in T} P(x = a_i). \tag{2.1}$$

For example, if we define $V$ to be vowels from figure 2.1, $V = \{\mathrm{a}, \mathrm{e}, \mathrm{i}, \mathrm{o}, \mathrm{u}\}$, then
$$P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31. \tag{2.2}$$

A joint ensemble $XY$ is an ensemble in which each outcome is an ordered pair $x, y$ with $x \in \mathcal{A}_X = \{a_1, \ldots, a_I\}$ and $y \in \mathcal{A}_Y = \{b_1, \ldots, b_J\}$. We call $P(x, y)$ the joint probability of $x$ and $y$.

Commas are optional when writing ordered pairs, so $xy \equiv x, y$.

N.B. In a joint ensemble $XY$ the two variables are not necessarily independent.


[Figure 2.2. The probability distribution over the 27 × 27 possible bigrams $xy$ in an English language document, The Frequently Asked Questions Manual for Linux.]

Marginal probability. We can obtain the marginal probability $P(x)$ from the joint probability $P(x, y)$ by summation:
$$P(x = a_i) \equiv \sum_{y \in \mathcal{A}_Y} P(x = a_i, y). \tag{2.3}$$

Similarly, using briefer notation, the marginal probability of $y$ is:
$$P(y) \equiv \sum_{x \in \mathcal{A}_X} P(x, y). \tag{2.4}$$

Conditional probability

$$P(x = a_i \mid y = b_j) \equiv \frac{P(x = a_i, y = b_j)}{P(y = b_j)} \quad \text{if } P(y = b_j) \neq 0. \tag{2.5}$$

[If $P(y = b_j) = 0$ then $P(x = a_i \mid y = b_j)$ is undefined.]

We pronounce $P(x = a_i \mid y = b_j)$ 'the probability that $x$ equals $a_i$, given $y$ equals $b_j$'.
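To make these definitions concrete, here is a minimal numerical sketch (in Python; the 2×2 joint pmf is the one used later in these slides). Marginals come from summation, and conditionals from normalizing rows and columns:

```python
import numpy as np

# Joint pmf P(x, y) as a matrix: rows indexed by x, columns by y.
# (Same 2x2 example used later in these slides.)
P_xy = np.array([[1/2, 1/4],
                 [0.0, 1/4]])

# Marginals by summation (eqs. 2.3-2.4).
P_x = P_xy.sum(axis=1)              # P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)              # P(y) = sum_x P(x, y)

# Conditionals by normalizing rows / columns (eq. 2.5).
P_y_given_x = P_xy / P_x[:, None]   # P(y | x): each row sums to 1
P_x_given_y = P_xy / P_y[None, :]   # P(x | y): each column sums to 1

print(P_x)           # [0.75 0.25]
print(P_y)           # [0.5  0.5 ]
print(P_y_given_x)
print(P_x_given_y)
```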

Example 2.1. An example of a joint ensemble is the ordered pair $XY$ consisting of two successive letters in an English document. The possible outcomes are ordered pairs such as aa, ab, ac, and zz; of these, we might expect ab and ac to be more probable than aa and zz. An estimate of the joint probability distribution for two neighbouring characters is shown graphically in figure 2.2.

This joint ensemble has the special property that its two marginal distributions, $P(x)$ and $P(y)$, are identical. They are both equal to the monogram distribution shown in figure 2.1.

From this joint ensemble $P(x, y)$ we can obtain conditional distributions, $P(y \mid x)$ and $P(x \mid y)$, by normalizing the rows and columns, respectively (figure 2.3). The probability $P(y \mid x = \mathrm{q})$ is the probability distribution of the second letter given that the first letter is a q. As you can see in figure 2.3a, the two most probable values for the second letter $y$ given


Definitions

The events X = x and Y = y are statistically independent if p(x, y) = p(x)p(y).

The variables $X_1, X_2, \cdots, X_N$ are called independent if for all $(x_1, x_2, \cdots, x_N) \in \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_N$ we have
$$p(x_1, x_2, \cdots, x_N) = \prod_{i=1}^{N} p_{X_i}(x_i).$$
They are furthermore called identically distributed if all variables $X_i$ have the same distribution $p_X(x)$.


Entropy

• Intuitive notions?

• 2 ways of defining entropy of a random variable:

• axiomatic definition (want a measure with certain properties...)

• just define it, and then justify the definition by showing that it arises as the answer to a number of natural questions

Definition: The entropy $H(X)$ of a discrete random variable $X$ with pmf $p_X(x)$ is given by
$$H(X) = -\sum_{x} p_X(x) \log p_X(x) = -\mathbb{E}_{p_X(x)}[\log p_X(X)]$$


Order these in terms of entropy



Entropy examples 1

• What’s the entropy of a uniform discrete random variable taking on K values?

• What’s the entropy of a random variable $X$ taking on four values with $p_X = [1/2,\ 1/4,\ 1/8,\ 1/8]$?

• What’s the entropy of a deterministic random variable?
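A quick numerical check of these three cases (a sketch; the uniform case is shown for K = 8, and the four-value pmf is the one above):

```python
import numpy as np

def entropy(p, base=2):
    """Entropy -sum(p log p), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)) / np.log(base))

K = 8
print(entropy(np.ones(K) / K))         # uniform over K values: log2(K) = 3.0 bits
print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits
print(entropy([1.0, 0.0, 0.0]))        # deterministic: 0.0 bits
```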


Entropy: example 2 [excerpt from MacKay's textbook]

What do you notice about your solutions? Does each answer depend on the detailed contents of each urn?

The details of the other possible outcomes and their probabilities are irrelevant. All that matters is the probability of the outcome that actually happened (here, that the ball drawn was black) given the different hypotheses. We need only to know the likelihood, i.e., how the probability of the data that happened varies with the hypothesis. This simple rule about inference is known as the likelihood principle.

The likelihood principle: given a generative model for data $d$ given parameters $\theta$, $P(d \mid \theta)$, and having observed a particular outcome $d_1$, all inferences and predictions should depend only on the function $P(d_1 \mid \theta)$.

In spite of the simplicity of this principle, many classical statistical methods violate it.

2.4 Definition of entropy and related functions

The Shannon information content of an outcome $x$ is defined to be
$$h(x) = \log_2 \frac{1}{P(x)}. \tag{2.34}$$
It is measured in bits. [The word 'bit' is also used to denote a variable whose value is 0 or 1; I hope context will always make clear which of the two meanings is intended.]

In the next few chapters, we will establish that the Shannon information content $h(a_i)$ is indeed a natural measure of the information content of the event $x = a_i$. At that point, we will shorten the name of this quantity to 'the information content'.

Table 2.9. Shannon information contents of the outcomes a–z.

 i   a_i    p_i     h(p_i)
 1   a      .0575    4.1
 2   b      .0128    6.3
 3   c      .0263    5.2
 4   d      .0285    5.1
 5   e      .0913    3.5
 6   f      .0173    5.9
 7   g      .0133    6.2
 8   h      .0313    5.0
 9   i      .0599    4.1
10   j      .0006   10.7
11   k      .0084    6.9
12   l      .0335    4.9
13   m      .0235    5.4
14   n      .0596    4.1
15   o      .0689    3.9
16   p      .0192    5.7
17   q      .0008   10.3
18   r      .0508    4.3
19   s      .0567    4.1
20   t      .0706    3.8
21   u      .0334    4.9
22   v      .0069    7.2
23   w      .0119    6.4
24   x      .0073    7.1
25   y      .0164    5.9
26   z      .0007   10.4
27   -      .1928    2.4

$\sum_i p_i \log_2 \frac{1}{p_i} = 4.1$

The fourth column in table 2.9 shows the Shannon information content of the 27 possible outcomes when a random character is picked from an English document. The outcome $x = \mathrm{z}$ has a Shannon information content of 10.4 bits, and $x = \mathrm{e}$ has an information content of 3.5 bits.

The entropy of an ensemble $X$ is defined to be the average Shannon information content of an outcome:
$$H(X) \equiv \sum_{x \in \mathcal{A}_X} P(x) \log \frac{1}{P(x)}, \tag{2.35}$$
with the convention for $P(x) = 0$ that $0 \times \log 1/0 \equiv 0$, since $\lim_{\theta \to 0^+} \theta \log 1/\theta = 0$.

Like the information content, entropy is measured in bits.

When it is convenient, we may also write $H(X)$ as $H(\mathbf{p})$, where $\mathbf{p}$ is the vector $(p_1, p_2, \ldots, p_I)$. Another name for the entropy of $X$ is the uncertainty of $X$.

Example 2.12. The entropy of a randomly selected letter in an English document is about 4.11 bits, assuming its probability is as given in table 2.9. We obtain this number by averaging $\log 1/p_i$ (shown in the fourth column) under the probability distribution $p_i$ (shown in the third column).



Entropy: example 3

• A Bernoulli random variable takes on heads (0) with probability $p$ and tails (1) with probability $1-p$. Its entropy is defined as
$$H(p) := -p \log_2(p) - (1-p) \log_2(1-p)$$

[Figure 2.1 (Cover & Thomas): $H(p)$ vs. $p$.]
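A short sketch of this binary entropy function (base-2 logarithms, so values are in bits); it is symmetric about $p = 1/2$, where it reaches its maximum of 1 bit:

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log2(p) - (1-p) log2(1-p), with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    q = p[mask]
    out[mask] = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    return out

print(binary_entropy(np.array([0.0, 0.1, 0.5, 0.9, 1.0])))
# [0.     0.469  1.     0.469  0.   ]
```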

Suppose that we wish to determine the value of $X$ with the minimum number of binary questions. An efficient first question is "Is $X = a$?" This splits the probability in half. If the answer to the first question is no, the second question can be "Is $X = b$?" The third question can be "Is $X = c$?" The resulting expected number of binary questions required is 1.75. This turns out to be the minimum expected number of binary questions required to determine the value of $X$. In Chapter 5 we show that the minimum expected number of binary questions required to determine $X$ lies between $H(X)$ and $H(X) + 1$.

2.2 JOINT ENTROPY AND CONDITIONAL ENTROPY

We defined the entropy of a single random variable in Section 2.1. We now extend the definition to a pair of random variables. There is nothing really new in this definition because $(X, Y)$ can be considered to be a single vector-valued random variable.

Definition: The joint entropy $H(X, Y)$ of a pair of discrete random variables $(X, Y)$ with a joint distribution $p(x, y)$ is defined as
$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y). \tag{2.8}$$


Entropy

The entropy $H(X) = -\sum_x p(x) \log p(x)$ has the following properties:

• $H(X) \ge 0$: entropy is always non-negative. $H(X) = 0$ iff $X$ is deterministic (using $0 \log 0 = 0$).

• $H(X) \le \log(|\mathcal{X}|)$, with $H(X) = \log(|\mathcal{X}|)$ iff $X$ has a uniform distribution over $\mathcal{X}$.

• Since $H_b(X) = \log_b(a) H_a(X)$, we don't need to specify the base of the logarithm (bits vs. nats).

Moving on to multiple RVs


Joint entropy and conditional entropy

Definition: The joint entropy of a pair of discrete random variables $X$ and $Y$ is:
$$H(X, Y) := -\mathbb{E}_{p(x,y)}[\log p(X, Y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$$

The conditional entropy is defined analogously as $H(Y|X) := -\mathbb{E}_{p(x,y)}[\log p(Y|X)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)$.

Note: $H(X|Y) \neq H(Y|X)$ in general.


Joint entropy and conditional entropy

• Natural definitions, since....

Theorem (Chain rule): $H(X, Y) = H(X) + H(Y|X)$

Corollary: $H(X, Y|Z) = H(X|Z) + H(Y|X, Z)$

[Excerpt from MacKay, Chapter 8: Dependent Random Variables]

[Figure 8.1. The relationship between joint information $H(X, Y)$, marginal entropy $H(X)$, $H(Y)$, conditional entropy $H(X|Y)$, $H(Y|X)$, and mutual entropy $I(X;Y)$.]

8.2 Exercises

Exercise 8.1. [1] Consider three independent random variables $u, v, w$ with entropies $H_u, H_v, H_w$. Let $X \equiv (U, V)$ and $Y \equiv (V, W)$. What is $H(X, Y)$? What is $H(X|Y)$? What is $I(X;Y)$?

Exercise 8.2. [3, p.142] Referring to the definitions of conditional entropy (8.3–8.4), confirm (with an example) that it is possible for $H(X \mid y = b_k)$ to exceed $H(X)$, but that the average, $H(X|Y)$, is less than $H(X)$. So data are helpful – they do not increase uncertainty, on average.

Exercise 8.3. [2, p.143] Prove the chain rule for entropy, equation (8.7). [$H(X, Y) = H(X) + H(Y|X)$.]

Exercise 8.4. [2, p.143] Prove that the mutual information $I(X;Y) \equiv H(X) - H(X|Y)$ satisfies $I(X;Y) = I(Y;X)$ and $I(X;Y) \ge 0$.

[Hint: see exercise 2.26 (p.37) and note that $I(X;Y) = D_{KL}(P(x, y) \,\|\, P(x)P(y))$.] (8.11)

Exercise 8.5. [4] The 'entropy distance' between two random variables can be defined to be the difference between their joint entropy and their mutual information:
$$D_H(X, Y) \equiv H(X, Y) - I(X;Y). \tag{8.12}$$
Prove that the entropy distance satisfies the axioms for a distance – $D_H(X, Y) \ge 0$, $D_H(X, X) = 0$, $D_H(X, Y) = D_H(Y, X)$, and $D_H(X, Z) \le D_H(X, Y) + D_H(Y, Z)$. [Incidentally, we are unlikely to see $D_H(X, Y)$ again but it is a good function on which to practise inequality-proving.]

Exercise 8.6. [2] A joint ensemble $XY$ has the following joint distribution.

P(x, y)     x = 1    x = 2    x = 3    x = 4
y = 1       1/8      1/16     1/32     1/32
y = 2       1/16     1/8      1/32     1/32
y = 3       1/16     1/16     1/16     1/16
y = 4       1/4      0        0        0

What is the joint entropy $H(X, Y)$? What are the marginal entropies $H(X)$ and $H(Y)$? For each value of $y$, what is the conditional entropy $H(X \mid y)$? What is the conditional entropy $H(X|Y)$? What is the conditional entropy of $Y$ given $X$? What is the mutual information between $X$ and $Y$?


Exercise 8.7. [2, p.143] Consider the ensemble $XYZ$ in which $\mathcal{A}_X = \mathcal{A}_Y = \mathcal{A}_Z = \{0, 1\}$, $x$ and $y$ are independent with $\mathcal{P}_X = \{p, 1-p\}$ and $\mathcal{P}_Y = \{q, 1-q\}$, and
$$z = (x + y) \bmod 2. \tag{8.13}$$

(a) If $q = 1/2$, what is $\mathcal{P}_Z$? What is $I(Z;X)$?

(b) For general $p$ and $q$, what is $\mathcal{P}_Z$? What is $I(Z;X)$? Notice that this ensemble is related to the binary symmetric channel, with $x$ = input, $y$ = noise, and $z$ = output.

[Figure 8.2. A misleading representation of entropies: a two-set Venn diagram with regions labelled $H(X)$, $H(Y)$, $H(X|Y)$, $I(X;Y)$, $H(Y|X)$ and outer box $H(X, Y)$ (contrast with figure 8.1).]

Three term entropies

Exercise 8.8. [3, p.143] Many texts draw figure 8.1 in the form of a Venn diagram (figure 8.2). Discuss why this diagram is a misleading representation of entropies. Hint: consider the three-variable ensemble $XYZ$ in which $x \in \{0, 1\}$ and $y \in \{0, 1\}$ are independent binary variables and $z \in \{0, 1\}$ is defined to be $z = x + y \bmod 2$.

8.3 Further exercises

The data-processing theorem

The data processing theorem states that data processing can only destroy information.

Exercise 8.9. [3, p.144] Prove this theorem by considering an ensemble $WDR$ in which $w$ is the state of the world, $d$ is data gathered, and $r$ is the processed data, so that these three variables form a Markov chain
$$w \to d \to r, \tag{8.14}$$
that is, the probability $P(w, d, r)$ can be written as
$$P(w, d, r) = P(w)\, P(d \mid w)\, P(r \mid d). \tag{8.15}$$
Show that the average information that $R$ conveys about $W$, $I(W;R)$, is less than or equal to the average information that $D$ conveys about $W$, $I(W;D)$.

This theorem is as much a caution about our definition of 'information' as it is a caution about data processing!



Joint/conditional entropy examples

p(x, y)    y = 0    y = 1
x = 0      1/2      1/4
x = 1      0        1/4

H(X,Y)=

H(X|Y)=

H(Y|X)=

H(X)=

H(Y)=
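A sketch of these computations in code (base-2 logarithms, so the answers are in bits; the conditional entropies use the chain rule $H(X, Y) = H(Y) + H(X|Y)$):

```python
import numpy as np

P = np.array([[1/2, 1/4],    # rows: x = 0, 1; columns: y = 0, 1
              [0.0, 1/4]])

def H(p):
    """Entropy in bits of a pmf given as an array (0 log 0 = 0)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

H_XY = H(P)                   # joint entropy H(X, Y)
H_X = H(P.sum(axis=1))        # marginal entropy H(X)
H_Y = H(P.sum(axis=0))        # marginal entropy H(Y)
H_X_given_Y = H_XY - H_Y      # H(X|Y)
H_Y_given_X = H_XY - H_X      # H(Y|X)

print(H_XY, H_X, H_Y, H_X_given_Y, H_Y_given_X)
# 1.5  0.811  1.0  0.5  0.689  (approximately)
```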


Entropy is central because...

$$H(X) = -\sum_x p(x) \log_2(p(x))$$

(A) entropy is the measure of average uncertainty in the random variable

(B) entropy is the average number of bits needed to describe the random variable

(C) entropy is a lower bound on the average length of the shortest description of the random variable

(D) entropy is measured in bits?

(E)

(F) entropy of a deterministic value is 0


Mutual information

• Entropy H(X) is the uncertainty (``self-information'') of a single random variable

• Conditional entropy H(X|Y) is the entropy of one random variable conditional upon knowledge of another.

• The average amount of decrease of the randomness of X by observing Y is the average information that Y gives us about X.

Definition: The mutual information $I(X;Y)$ between the random variables $X$ and $Y$ is given by
$$I(X;Y) = H(X) - H(X|Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 \frac{p(x, y)}{p(x)p(y)} = \mathbb{E}_{p(x,y)}\!\left[\log_2 \frac{p(X, Y)}{p(X)p(Y)}\right]$$


At the heart of information theory because...

• Information channel capacity:
$$C = \max_{p(x)} I(X; Y)$$

• Operational channel capacity: the highest rate (bits/channel use) at which we can communicate reliably across the channel
$$X \;\longrightarrow\; \text{Channel: } p(y|x) \;\longrightarrow\; Y$$

• Channel coding theorem says: information capacity = operational capacity

[The slide also lists several capacity expressions as examples, e.g. $C = \tfrac{1}{2}\log_2(1 + P/N)$; $C = \tfrac{1}{2}\log_2(1 + |h|^2 P/P_N)$ or $\mathbb{E}_h\!\left[\tfrac{1}{2}\log_2(1 + |h|^2 P/P_N)\right]$; and, for $Y = HX + N$, $C = \max_{Q:\,\mathrm{Tr}(Q)=P} \tfrac{1}{2}\log_2\left|I_{M_R} + HQH^\dagger\right|$ or $\max_{Q:\,\mathrm{Tr}(Q)=P} \mathbb{E}_H\!\left[\tfrac{1}{2}\log_2\left|I_{M_R} + HQH^\dagger\right|\right]$.]


Mutual information example

p(x, y)    y = 0    y = 1
x = 0      1/2      1/4
x = 1      0        1/4

x or y    p(x)    p(y)
0         3/4     1/2
1         1/4     1/2
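A sketch of the computation for this table (in bits), both via $I(X;Y) = H(X) + H(Y) - H(X, Y)$ and via the divergence form $D(p(x,y) \,\|\, p(x)p(y))$:

```python
import numpy as np

P = np.array([[1/2, 1/4],   # rows: x = 0, 1; columns: y = 0, 1
              [0.0, 1/4]])

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

I_XY = H(P.sum(axis=1)) + H(P.sum(axis=0)) - H(P)   # H(X) + H(Y) - H(X,Y)
print(I_XY)                                          # about 0.311 bits

# Equivalently, I(X;Y) = D( p(x,y) || p(x)p(y) ):
PxPy = np.outer(P.sum(axis=1), P.sum(axis=0))
mask = P > 0
print(float(np.sum(P[mask] * np.log2(P[mask] / PxPy[mask]))))   # same value
```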


Divergence (relative entropy, K-L distance)

Definition: The relative entropy, divergence, or Kullback-Leibler distance between two distributions $p$ and $q$ on the same alphabet is
$$D(p \,\|\, q) := \mathbb{E}_p\!\left[\log \frac{p(x)}{q(x)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$
(Note: we use the conventions $0 \log \frac{0}{0} = 0$, $0 \log \frac{0}{q} = 0$, and $p \log \frac{p}{0} = \infty$.)

• $D(p \| q)$ is in a sense a measure of the "distance" between the two distributions.

• If $p = q$ then $D(p \| q) = 0$.

• Note $D(p \| q)$ is not a true distance.

$D(\,\cdot\,\|\,\cdot\,) = 0.2075, \quad D(\,\cdot\,\|\,\cdot\,) = 0.1887$ (for a pair of distributions pictured on the slide; note the two directions give different values).


K-L divergence example

• X = {1, 2, 3, 4, 5, 6}

• P = [1/6 1/6 1/6 1/6 1/6 1/6]

• Q = [1/10 1/10 1/10 1/10 1/10 1/2]

• $D(p \| q) = ?$ and $D(q \| p) = ?$
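A sketch of the computation (base-2 logarithms, so the answers are in bits); it also illustrates that $D(p\|q) \neq D(q\|p)$ in general:

```python
import numpy as np

def kl(p, q):
    """D(p || q) = sum_x p(x) log2( p(x) / q(x) ), with 0 log 0 = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.ones(6) / 6
q = np.array([1/10] * 5 + [1/2])

print(kl(p, q))   # D(p||q), about 0.35 bits
print(kl(q, p))   # D(q||p), about 0.42 bits -- not symmetric
```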



Mutual information as divergence!

• Can we express mutual information in terms of the K-L divergence?

Definition: The mutual information $I(X;Y)$ between the random variables $X$ and $Y$ is given by
$$I(X;Y) = H(X) - H(X|Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 \frac{p(x, y)}{p(x)p(y)} = \mathbb{E}_{p(x,y)}\!\left[\log_2 \frac{p(X, Y)}{p(X)p(Y)}\right]$$

$$I(X;Y) = D(p(x, y) \,\|\, p(x)p(y))$$


Mutual information and entropy

Theorem: Relationship between mutual information and entropy.

$$I(X;Y) = H(X) - H(X|Y)$$
$$I(X;Y) = H(Y) - H(Y|X)$$
$$I(X;Y) = H(X) + H(Y) - H(X, Y)$$
$$I(X;Y) = I(Y;X) \quad \text{(symmetry)}$$
$$I(X;X) = H(X) \quad \text{("self-information")}$$

``Two’s company, three’s a crowd’’

[Venn-style diagrams on the slide depict $H(X)$, $H(Y)$, $H(X|Y)$, and $I(X;Y)$ as overlapping regions.]


Chain rule for entropy

Theorem (Chain rule for entropy): Let $(X_1, X_2, \ldots, X_n) \sim p(x_1, x_2, \ldots, x_n)$. Then
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$$

[Diagram on the slide: $H(X_1, X_2, X_3) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2)$, shown as a decomposition of the entropy regions $H(X_1)$, $H(X_2)$, $H(X_3)$.]
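A numerical sanity check of the chain rule (a sketch on a randomly generated joint pmf of three binary variables, with the conditional entropies computed directly from the conditional pmfs):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                       # random joint pmf p(x1, x2, x3), strictly positive

def h(p):
    """Entropy (bits) of a pmf given as an array; 0 log 0 = 0."""
    p = np.asarray(p, float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

P12 = P.sum(axis=2)                # p(x1, x2)
P1 = P12.sum(axis=1)               # p(x1)

H_2_given_1 = -np.sum(P12 * np.log2(P12 / P1[:, None]))    # -E[log2 p(x2|x1)]
H_3_given_12 = -np.sum(P * np.log2(P / P12[:, :, None]))   # -E[log2 p(x3|x1,x2)]

print(h(P), h(P1) + H_2_given_1 + H_3_given_12)   # both sides of the chain rule agree
```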


Conditional mutual information

$$I(X;Y|Z) = H(X|Z) - H(X|Y, Z)$$

[Entropy-region diagrams on the slide illustrate this identity in terms of $H(X)$, $H(Y)$, $H(Z)$.]


Chain rule for mutual information

Theorem (Chain rule for mutual information):
$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1)$$

[Entropy-region diagrams on the slide illustrate the first step of the chain rule: $I(X, Y; Z) = I(X; Z) + I(Y; Z|X)$.]

Chain rule for relative entropy in book pg. 24


What is the grey region?

[Entropy-region diagrams over $H(X)$, $H(Y)$, $H(Z)$ with a shaded grey region to be identified.]


Another disclaimer....

[Figure 8.3. A misleading representation of entropies, continued: a three-set Venn-style diagram with regions labelled $H(X)$, $H(Y)$, $H(Z)$, $H(X|Y,Z)$, $H(Y|X,Z)$, $H(Z|X,Y)$, $H(Z|X)$, $H(Z|Y)$, $H(X,Y|Z)$, $I(X;Y)$, $I(X;Y|Z)$, and a central area labelled A.]

that the random outcome $(x, y)$ might correspond to a point in the diagram, and thus confuse entropies with probabilities.

Secondly, the depiction in terms of Venn diagrams encourages one to believe that all the areas correspond to positive quantities. In the special case of two random variables it is indeed true that $H(X|Y)$, $I(X;Y)$ and $H(Y|X)$ are positive quantities. But as soon as we progress to three-variable ensembles, we obtain a diagram with positive-looking areas that may actually correspond to negative quantities. Figure 8.3 correctly shows relationships such as
$$H(X) + H(Z|X) + H(Y|X, Z) = H(X, Y, Z). \tag{8.31}$$
But it gives the misleading impression that the conditional mutual information $I(X;Y|Z)$ is less than the mutual information $I(X;Y)$. In fact the area labelled A can correspond to a negative quantity. Consider the joint ensemble $(X, Y, Z)$ in which $x \in \{0, 1\}$ and $y \in \{0, 1\}$ are independent binary variables and $z \in \{0, 1\}$ is defined to be $z = x + y \bmod 2$. Then clearly $H(X) = H(Y) = 1$ bit. Also $H(Z) = 1$ bit. And $H(Y|X) = H(Y) = 1$ since the two variables are independent. So the mutual information between $X$ and $Y$ is zero: $I(X;Y) = 0$. However, if $z$ is observed, $X$ and $Y$ become dependent – knowing $x$, given $z$, tells you what $y$ is: $y = z - x \bmod 2$. So $I(X;Y|Z) = 1$ bit. Thus the area labelled A must correspond to $-1$ bits for the figure to give the correct answers.

The above example is not at all a capricious or exceptional illustration. The binary symmetric channel with input $X$, noise $Y$, and output $Z$ is a situation in which $I(X;Y) = 0$ (input and noise are independent) but $I(X;Y|Z) > 0$ (once you see the output, the unknown input and the unknown noise are intimately related!).

The Venn diagram representation is therefore valid only if one is aware that positive areas may represent negative quantities. With this proviso kept in mind, the interpretation of entropies in terms of sets can be helpful (Yeung, 1991).

Solution to exercise 8.9 (p.141). For any joint ensemble $XYZ$, the following chain rule for mutual information holds:
$$I(X; Y, Z) = I(X;Y) + I(X; Z|Y). \tag{8.32}$$
Now, in the case $w \to d \to r$, $w$ and $r$ are independent given $d$, so $I(W; R|D) = 0$. Using the chain rule twice, we have:
$$I(W; D, R) = I(W; D) \tag{8.33}$$
and
$$I(W; D, R) = I(W; R) + I(W; D|R), \tag{8.34}$$
so
$$I(W; R) - I(W; D) \le 0. \tag{8.35}$$

[MacKay's textbook]


Convex and concave functions

[Plots illustrating convex and concave functions.]


Jensen’s inequality

Theorem (Jensen's inequality): If $f$ is convex, then
$$\mathbb{E}[f(X)] \ge f(\mathbb{E}[X]).$$
If $f$ is strictly convex, then equality implies $X = \mathbb{E}[X]$ with probability 1.
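A quick numerical illustration (a sketch with the convex function $f(x) = x^2$):

```python
import numpy as np

x = np.array([0.0, 1.0, 4.0])       # support of X
p = np.array([0.5, 0.25, 0.25])     # pmf of X

f = lambda t: t ** 2                # a convex function

E_fX = np.sum(p * f(x))             # E[f(X)]
f_EX = f(np.sum(p * x))             # f(E[X])
print(E_fX, f_EX, E_fX >= f_EX)     # 4.25  1.5625  True
```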


Jensen’s inequality consequences

• Theorem (Information inequality): $D(p \| q) \ge 0$, with equality iff $p = q$.

• Corollary (Non-negativity of mutual information): $I(X;Y) \ge 0$, with equality iff $X$ and $Y$ are independent.

• Theorem (Conditioning reduces entropy): $H(X|Y) \le H(X)$, with equality iff $X$ and $Y$ are independent.

• Theorem: $H(X) \le \log|\mathcal{X}|$, with equality iff $X$ has a uniform distribution over $\mathcal{X}$.

• Theorem (Independence bound on entropy): $H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i)$, with equality iff the $X_i$ are independent.



Log-sum inequality

Theorem (Log-sum inequality): For nonnegative numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,
$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \;\ge\; \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$$
with equality iff $a_i / b_i =$ const.

Convention: $0 \log 0 = 0$, $a \log \frac{a}{0} = \infty$ if $a > 0$, and $0 \log \frac{0}{0} = 0$.
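A small numerical check (a sketch on random nonnegative vectors, using natural logs; the inequality is base-independent):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(5)
b = rng.random(5)

lhs = np.sum(a * np.log(a / b))
rhs = a.sum() * np.log(a.sum() / b.sum())
print(lhs >= rhs)                  # True

# Equality case: a_i / b_i constant.
a2 = 3.0 * b
print(np.isclose(np.sum(a2 * np.log(a2 / b)),
                 a2.sum() * np.log(a2.sum() / b.sum())))   # True
```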


Log-sum inequality consequences

• Theorem (Convexity of relative entropy): $D(p \| q)$ is convex in the pair $(p, q)$, so that for pmf pairs $(p_1, q_1)$ and $(p_2, q_2)$ we have, for all $0 \le \lambda \le 1$:
$$D(\lambda p_1 + (1-\lambda)p_2 \,\|\, \lambda q_1 + (1-\lambda)q_2) \le \lambda D(p_1 \| q_1) + (1-\lambda) D(p_2 \| q_2)$$

• Theorem (Concavity of entropy): For $X \sim p(x)$, we have that $H(p) := H_p(X)$ is a concave function of $p(x)$.

• Theorem (Concavity of the mutual information in $p(x)$): Let $(X, Y) \sim p(x, y) = p(x)p(y|x)$. Then $I(X;Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$.

• Theorem (Convexity of the mutual information in $p(y|x)$): Let $(X, Y) \sim p(x, y) = p(x)p(y|x)$. Then $I(X;Y)$ is a convex function of $p(y|x)$ for fixed $p(x)$.



Markov chains

Definition: $X, Y, Z$ form a Markov chain in that order ($X \to Y \to Z$) iff
$$p(x, y, z) = p(x)\,p(y|x)\,p(z|y) \;\Longleftrightarrow\; p(z|y, x) = p(z|y)$$

[Diagram on the slide: $X$ plus noise $N_1$ produces $Y$; $Y$ plus noise $N_2$ produces $Z$.]

• $X \to Y \to Z$ iff $X$ and $Z$ are conditionally independent given $Y$

• $X \to Y \to Z$ implies $Z \to Y \to X$. Thus, we can write $X \leftrightarrow Y \leftrightarrow Z$.


Data-processing inequality

[Diagrams on the slide: $X$ plus noise $N_1$ produces $Y$, with $Z = f(Y)$; alternatively $Y$ plus a second noise $N_2$ produces $Z$. In both cases $X \to Y \to Z$.]

Theorem (Data-processing inequality): if $X \to Y \to Z$, then $I(X;Y) \ge I(X;Z)$.


Markov chain questions

If $X \to Y \to Z$, then $I(X;Y) \ge I(X;Y|Z)$.

What if $X, Y, Z$ do not form a Markov chain – can $I(X;Y|Z) \ge I(X;Y)$?

If $X_1 \to X_2 \to X_3 \to X_4 \to X_5 \to X_6$, then mutual information increases as you get closer together:
$$I(X_1; X_2) \ge I(X_1; X_4) \ge I(X_1; X_5) \ge I(X_1; X_6).$$
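A sketch illustrating this on a chain of binary symmetric channels (an illustrative assumption: $X_1$ is uniform and each step flips its input with probability 0.1), computing $I(X_1; X_k)$ exactly:

```python
import numpy as np

def Hb(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

f = 0.1   # flip probability of each step X_i -> X_{i+1}
for k in range(2, 7):
    # k-1 cascaded binary symmetric channels act like one BSC with this flip probability:
    fk = 0.5 * (1 - (1 - 2 * f) ** (k - 1))
    # X_1 uniform, so X_k is uniform and I(X_1; X_k) = 1 - H(fk) bits.
    print(f"I(X1;X{k}) = {1 - Hb(fk):.4f} bits")
# The printed values decrease with k, consistent with the data-processing inequality.
```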


Consequences on sufficient statistics

• Consider a family of probability distributions $\{f_\theta(x)\}$ indexed by $\theta$. If $X \sim f(x \mid \theta)$ for fixed $\theta$ and $T(X)$ is any statistic (i.e., function of the sample $X$), then we have
$$\theta \to X \to T(X).$$

• The data processing inequality in turn implies
$$I(\theta; X) \ge I(\theta; T(X))$$
for any distribution on $\theta$.

• Is it possible to choose a statistic that preserves all of the information in $X$ about $\theta$?



Definition (Sufficient statistic): A function $T(X)$ is said to be a sufficient statistic relative to the family $\{f_\theta(x)\}$ if the conditional distribution of $X$, given $T(X) = t$, is independent of $\theta$ for any distribution on $\theta$ (Fisher–Neyman):
$$f_\theta(x) = f(x \mid t)\, f_\theta(t) \;\Rightarrow\; \theta \to T(X) \to X \;\Rightarrow\; I(\theta; T(X)) \ge I(\theta; X)$$

Example of a sufficient statistic
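One standard example (a sketch, assuming i.i.d. Bernoulli($\theta$) samples $X_1, \ldots, X_n$, for which $T(X) = \sum_i X_i$ is sufficient for $\theta$): numerically, $I(\theta; X^n) = I(\theta; T(X^n))$ for any prior on $\theta$.

```python
import numpy as np
from itertools import product

def mi(joint):
    """Mutual information (bits) from a joint pmf matrix over (A, B)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa * pb)[mask])))

thetas = np.array([0.2, 0.7])       # two-point prior support (illustrative values)
prior = np.array([0.5, 0.5])
n = 3

# Joint pmf over (theta, x^n) for all 2^n binary sequences.
seqs = list(product([0, 1], repeat=n))
P_theta_x = np.array([[prior[i] * th**sum(s) * (1 - th)**(n - sum(s)) for s in seqs]
                      for i, th in enumerate(thetas)])

# Joint pmf over (theta, T) where T = number of ones.
P_theta_T = np.zeros((len(thetas), n + 1))
for j, s in enumerate(seqs):
    P_theta_T[:, sum(s)] += P_theta_x[:, j]

print(mi(P_theta_x), mi(P_theta_T))   # equal: T preserves all the information about theta
```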


Fano’s inequality
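The standard statement (as in Cover & Thomas): if $\hat{X} = g(Y)$ is any estimate of $X$ from $Y$ and $P_e = \Pr\{\hat{X} \neq X\}$, then $H(P_e) + P_e \log(|\mathcal{X}| - 1) \ge H(X|Y)$. A numerical sketch checking this on a small, hypothetical joint pmf, using the MAP estimator for $g$:

```python
import numpy as np

# Joint pmf P[x, y] over X in {0, 1, 2}, Y in {0, 1} (illustrative values).
P = np.array([[0.30, 0.05],
              [0.10, 0.25],
              [0.10, 0.20]])

def H(p):
    p = np.asarray(p, float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

H_X_given_Y = H(P) - H(P.sum(axis=0))            # H(X|Y) = H(X,Y) - H(Y)

# MAP estimator: for each y, guess the x maximizing P(x, y).
Pe = 1.0 - sum(P[:, y].max() for y in range(P.shape[1]))

fano_bound = H([Pe, 1 - Pe]) + Pe * np.log2(P.shape[0] - 1)
print(H_X_given_Y, Pe, fano_bound >= H_X_given_Y)   # the bound holds (True)
```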


Fano’s inequality consequences