
1. Introduction to random processes

A random process is a collection of random variables (r.v.'s for short) that arise in the same probability experiment (the last clause can be replaced by the exact statement "that are defined on a common probability space;" the term probability space will be defined in the next section). Thus a random process is mathematically represented by the collection

{Xt, t ∈ I} ,

where Xt denotes the t-th random variable in the process, and the index t runs over an index set I which is arbitrary.

A random process is a mathematical idealization of a set of random measurements obtained in a physical experiment. This randomness can be quantified by a probabilistic or statistical description of the process, and the complexity of this description depends largely on the size of the index set I. In briefly discussing this issue of complexity, we consider four index set sizes, which cover most cases of interest.

(a) I consists of one index only. In this case we are measuring a single random quantity, represented by the r.v. X. From elementary probability, we know that a simple way of describing X statistically is through its cumulative distribution function (or cdf) FX, which is defined by the relationship

FX(x) = Pr{X ≤ x} .

(Notation. Throughout this course, random variables will be denoted by upper case letters, and fixed (non-random) numbers by lower case letters.)

FX is always nondecreasing, continuous from the right, and such that FX(−∞) = 0, FX(+∞) = 1. A typical cdf therefore looks like the graph in the figure below.

We also know that in most cases of interest, we can alternatively specify the statistics of X by a probability density function (or pdf) fX, which is a nonnegative function that integrates to unity over the entire real line and is related to FX by

FX(x) = ∫_{−∞}^{x} fX(u) du .

This also covers r.v.’s with discrete components, in which case fX contains δ−functions.
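(Aside, not part of the original notes: the relation between fX and FX can be checked numerically. The short Python sketch below approximates FX(x) = ∫_{−∞}^{x} fX(u) du by a Riemann sum, using the standard normal density as a hypothetical choice of fX.)

import math

def f_X(u):
    # Hypothetical pdf for illustration: the standard normal density.
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def F_X(x, lower=-10.0, steps=100_000):
    # Approximate F_X(x) = integral of f_X from -infinity to x by a midpoint
    # Riemann sum on [lower, x]; the tail mass below `lower` is negligible here.
    if x <= lower:
        return 0.0
    h = (x - lower) / steps
    return h * sum(f_X(lower + (k + 0.5) * h) for k in range(steps))

print(F_X(0.0))    # approximately 0.5
print(F_X(1.96))   # approximately 0.975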


(b) I consists of n indices, e.g. I = {1, . . . , n}. In this case the process variables form a random vector in Rn, denoted by

X = (X1, . . . , Xn) .

The statistical description of the process can be accomplished by specifying the cdf FX (or FX1,...,Xn) of the random vector X; this is a real-valued function on Rn defined by the relationship

FX(x1, . . . , xn) = Pr{X1 ≤ x1, . . . , Xn ≤ xn} .

The dimension of their arguments notwithstanding, the cdf of a random vector and the cdf of a single r.v. have quite similar behavior. FX, too, is nondecreasing: if yk ≥ xk for all values of k, then

FX(y1, . . . , yn) ≥ FX(x1, . . . , xn) .

Furthermore, in most cases of interest, we can write

FX(x1, . . . , xn) = ∫_{−∞}^{xn} · · · ∫_{−∞}^{x1} fX(u1, . . . , un) du1 · · · dun ,

for a suitable pdf fX, also defined on Rn. Recall that the cdf of any sub-vector of X can be easily determined from the cdf FX

by setting the redundant arguments of FX equal to +∞. Thus for example, the cdf of ther.v. X1 is computed via

FX1(x1) = FX(x1,∞, . . . ,∞) .

This procedure is not reversible: knowledge of the marginal distributions of X does not ingeneral suffice to determine FX. One important exception is the case where the componentsof the random vector are independent ; then the cdf of X is given by the product of thecdf’s of the individual components, i.e.,

FX(x1, . . . , xn) = FX1(x1) · · ·FXn(xn) ,

and the same relationship is true if we replace cdf’s by pdf’s (F by f).

In the last two cases, we consider infinite index sets I.

(c) I is countably infinite, say I = N (the set of positive integers or natural numbers). Here the process is equivalent to a sequence of random variables

X1, X2, . . . .

The problem of describing the statistics of infinitely many random variables is most economically solved by specifying the so-called finite-dimensional distributions, namely the distributions of all finite-dimensional vectors that can be formed with these variables. In this case, the stated procedure amounts to specifying the cdf

FXt1,...,Xtn


for every choice of n and (distinct) integers t1, . . . , tn.

Although the above specification of finite-dimensional distributions suffices to describe statistically what happens in the random process over any finite index- (or time-) window, it is not clear whether it also determines properties of the process that effectively involve the entire index set (or discrete-time axis). Consider for example the random variable

X∞ = lim_{n→∞} (X1 + · · · + Xn)/n ,

which gives the asymptotic value of the time average of the random observations. X∞ is clearly a property of the process, yet its value is not determined by any finite number of variables of the process. Thus

X∞ ≠ g(Xt1, . . . , Xtn)

for any choice of g and arguments t1, . . . , tn, and we cannot use a single finite-dimensionaldistribution of the process to determine the cdf of X∞.

As it turns out (this is a rather profound fact in probability theory), most infinitary properties of the process are determined by the set of all finite-dimensional distributions. Such properties include random quantities such as limits of time averages, and thus the statistics of X∞ are in principle deducible from the finite-dimensional distributions (in practice, the task is usually formidable!). Put differently, if two distinct random processes {Xk, k ∈ N} and {Yk, k ∈ N} have identical finite-dimensional distributions, then the variables X∞ and Y∞ will also have identical statistics.

In summary, augmentation of a finite index set to a countably infinite one necessitates the specification of an infinite set of finite-dimensional distributions. This entails a considerable jump in complexity, but ensures that all important properties of the process (including asymptotic ones) are statistically specified.

(d) In this last case we consider an uncountably infinite index set, namely I = R. If we think of the process as evolving in time, then we are effectively dealing with a continuous-time process observed at all times.

Continuing our previous discussion on finite-dimensional distributions, we note another rather profound fact: finite-dimensional distributions no longer suffice to determine all salient characteristics of the process. As an example, suppose one wishes to model the number of calls handled by a telephone exchange up to time t, where t ∈ R. A typical graph of this random time-varying quantity would be a nondecreasing, integer-valued staircase function of t, as in the figure below.


It is now possible to construct two models {Xt, t ∈ R} and {Yt, t ∈ R} that have identical finite-dimensional distributions, yet differ in the following important aspect: {Xt, t ∈ R} (almost) always gives observations of the above typical form, whereas {Yt, t ∈ R} is not known to do the same. More precisely,

Pr{Xt is integer-valued and nondecreasing for all t}

equals unity, whereas the quantity

Pr{Yt is integer-valued and nondecreasing for all t}

cannot be defined, and hence does not exist.* Of the two processes, only {Xt, t ∈ R} is (possibly) suitable for modeling the random physical system at hand.

The reason for the above discrepancy is that the two random processes {Xt, t ∈ R} and {Yt, t ∈ R} are constructed in entirely different ways. This illustrates the general principle that random processes are not fully characterized by distributions alone; their construction amounts to the specification of a family of random variables on the same probability space. Precise understanding of the concepts probability space and random variable is therefore essential.

2. A simple stochastic process

Gray & Davisson, p. 103.
Billingsley, Sec. 1, The unit interval.

Consider the probability experiment in which we choose a point ω at random from the unit interval (0, 1]. (Notation. A parenthesis implies that the endpoint lies outside the interval; a square bracket that it lies inside.) We assume that the selection of ω is uniform, in that

Pr{ω ∈ (a, b]} = b − a .

As expected, Pr{ω ∈ (0, 1]} = 1.

We now consider the binary expansion of the random point ω. We can write

ω = .X1(ω)X2(ω) . . . = ∑_{k=1}^{∞} Xk(ω)/2^k ,

where Xk(ω) stands for the kth digit in the binary expansion. An iterative algorithm for deriving these digits is as follows. We divide the unit interval into two equal subintervals, and set the first digit equal to 0 if ω falls in the left-hand subinterval, and to 1 otherwise. On the subinterval containing ω we perform a similar division to obtain the second digit; and so forth. This is illustrated in the figure below.

* The pivotal difference between the statements "X∞ ≤ 0" in (c) and "Xt is integer-valued and nondecreasing for all t" in (d) is that the former involves a countable infinity of time indices, whereas the latter involves an uncountable one. Agreement of two processes over finite-dimensional distributions implies agreement over "countably expressible" properties, but does not guarantee the same for "uncountably expressible" ones.

(Notation. Endpoints marked "◦" lie outside, those marked "•" inside, the set or curve depicted.)

The variation of the kth digit Xk(ω) with ω has the following distinctive feature. Starting from the left, Xk alternates in value between 0 and 1 on adjacent intervals of length 2^−k. Thus the graphs of X1 and X2 look like this:

From the above observation we deduce that the vector X = (X1, . . . , Xn) has the following behavior as ω varies:

ω ∈ (0, 2^−n]               :   X(ω) = 000 . . . 000
ω ∈ (2^−n, 2 · 2^−n]        :   X(ω) = 000 . . . 001
ω ∈ (2 · 2^−n, 3 · 2^−n]    :   X(ω) = 000 . . . 010
     ...                             ...
ω ∈ ((2^n − 1) 2^−n, 1]     :   X(ω) = 111 . . . 111

Thus each of the 2^n binary words of length n is obtained over an interval of length 2^−n. In terms of probabilities (here length = probability), all binary words of length n are equally likely candidates for the truncated expansion of a point drawn uniformly at random from the unit interval. Noting also that any fixed digit Xk is 0 or 1 with equal probability, we conclude that for a binary word (a1, . . . , an),

Pr{X1 = a1, . . . , Xn = an} = 2^−n = Pr{X1 = a1} · · · Pr{Xn = an} .

Now compare the above with the situation in which Y1, Y2, . . . are the outcomes of asequence of independent tosses of a fair coin labeled 0 and 1. By independence, we have

Pr{Y1 = a1, . . . , Yn = an} = 2^−n = Pr{Y1 = a1} · · · Pr{Yn = an} .

Thus we have two sequences of random quantities with identical probabilistic descriptions (it is easy to verify that the processes {Xk, k ∈ N} and {Yk, k ∈ N} have the same finite-dimensional distributions). Since drawing a point from an interval is in a sense simpler than tossing a coin infinitely many times, we can use the process {Xk, k ∈ N} instead of {Yk, k ∈ N} to model the outcomes of independent coin tosses. This choice has the interesting implication that one can construct infinitely many random quantities without explicit reference to their (joint) statistical description by defining these quantities as functions of the outcome of a single probability experiment.
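(Aside, not in the original notes: the equivalence just described is easy to check empirically. The Python sketch below draws ω uniformly from the unit interval, extracts its first n binary digits by the interval-halving algorithm of this section, and verifies that each digit comes out 0 or 1 about half the time, exactly as independent fair coin tosses would.)

import random

def binary_digits(omega, n):
    # Return the first n digits X_1(omega), ..., X_n(omega) of the binary
    # expansion of omega in (0, 1], using the interval-halving algorithm.
    digits = []
    for _ in range(n):
        omega *= 2.0
        if omega > 1.0:
            digits.append(1)
            omega -= 1.0
        else:
            digits.append(0)
    return digits

random.seed(0)
n, trials = 8, 100_000
counts = [0] * n
for _ in range(trials):
    omega = random.random()          # uniform draw; endpoint issues are negligible
    for k, d in enumerate(binary_digits(omega, n)):
        counts[k] += d

# Each digit should be 1 roughly half of the time, like an independent fair coin toss.
print([round(c / trials, 3) for c in counts])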

The above leads to the following interpretation, which will prevail in this course: arandom variable is a real-valued function of the outcome of a probability experiment. Arandom process is a collection of such functions, all of which are defined in terms of thesame probability experiment.

3. The notion of a probability space

For references, see subsequent sections.

A probability space is a mathematical model for a random experiment (or probability experiment). It consists of three entities.

(i) An abstract set of points, called sample space, and usually denoted by Ω. Thepoints, or elements, of Ω are usually denoted by ω.

Interpretation: Ω is the set of all possible outcomes of the random experiment.

(ii) A collection of subsets of Ω, called event space, and usually denoted by an uppercase script character such as F . The sets that constitute the event space are called events.

Interpretation: The event space essentially represents all possible modes of observ-ing the experiment. A subset A of Ω is an event if we can set up an observation mechanismto detect whether the outcome ω of the experiment lies in A or not, i.e., whether A occursor not.

(iii) A function P, called probability measure, which is defined on the event space and takes values in the interval [0, 1].

Interpretation: For every event A, P (A) provides a numerical assessment of thelikelihood that A occurs; the quantity P (A) is the probability of A.


The standard representation of a probability space is a triple with the above threeentities in their respective order, i.e.,

(Ω,F , P ) .

The pair (Ω,F) is referred to as a measurable space. It describes the outcomes andmodes of observation of the experiment without reference to the likelihood of the observ-ables. In general, the same measurable space can give rise to many different probabilityspaces.

4. Event spaces and fields

Gray & Davisson, pp. 27–36.
Billingsley, Sec. 2, Spaces and Classes of Sets.

From a mathematical viewpoint, the sample space Ω is entirely unconstrained; it isan arbitrary set of points. Constraints on Ω are imposed only by modeling considerations:Ω should be “rich” enough to represent all outcomes of the physical experiment that wewish to model. This does not mean that a point ω should be of the same form as theoutcome of the physical experiment; it merely suggests that one should be able to set upa correspondence between the actual outcomes and the points in Ω. Thus in the Exampleof Sec. 2 above, the sample space Ω = (0, 1] adequately represented the outcomes of asequence of coin tosses, in spite of the fact that the points in Ω were not themselves binarysequences. This was because it was possible to identify every ω with a distinct binarysequence by taking its binary expansion (conversely, every binary sequence that does notconverge to 0 can be identified with a distinct point in (0, 1]).

In contrast to the above, the mathematical constraints on the event space F are rigid;they stem from the earlier interpretation of events as sets of outcomes that are observableby available mechanisms. Three such constraints are given below.

1. ∅ ∈ F , Ω ∈ F .

This is reasonable in view of the fact that no observation is needed to determine whether the outcome lies in ∅ (impossible) or Ω (certain).

2. A ∈ F ⇒ Ac ∈ F (closure under complementation)

Obvious, since the same observation mechanism is used for both A and Ac.

3. A ∈ F , B ∈ F ⇒ A ∪B ∈ F (closure under union)

By combining the two observation mechanisms (A versus Ac and B versus Bc), one obtains a single observation mechanism for A ∪ B versus (A ∪ B)c.

Definition. An algebra or field is a collection of subsets of Ω satisfying conditions(1)–(3) above.

Examples of fields.

(i) Ω arbitrary. F = {∅, Ω}.


F is easily seen to satisfy conditions (1)–(3). It is the smallest field that can be builtfrom a sample space Ω, and is often referred to as the trivial field. Clearly, no usefulobservations can be made in the experiment represented here.

(ii) Ω arbitrary. F = power set of Ω = the collection of all subsets of Ω.Again F is easily seen to satisfy conditions (1)–(3): by convention, the empty set is a

subset of every set, and set operations on subsets of Ω always yield subsets of Ω. In theexperiment modeled here, every subset of Ω can be tested for occurrence; we thus have theexact opposite of example (i).(Notation. The power set of Ω is denoted by 2Ω.)

(iii) Here Ω is again arbitrary, and we consider sets C1, . . . , CM that form a finitepartition or decomposition of Ω; that is,

(∀ i, j s.t. i ≠ j)  Ci ∩ Cj = ∅   and   ⋃_{i=1}^{M} Ci = Ω .

The sets Ci are referred to as cells or atoms of the partition. The definition of F is asfollows:

F = { A : A = ⋃_{i∈I} Ci ,  I ⊂ {1, . . . , M} } .

Thus F consists of all unions of sets Ci; by convention, we let ⋃_{i∈∅} Ci = ∅ .

To see whether F is a field, we check conditions (1)–(3).

(1) ∅ = ⋃_{i∈∅} Ci ∈ F ,   Ω = ⋃_{i∈{1,...,M}} Ci ∈ F ;

(2) A = ⋃_{i∈I} Ci  ⇒  Ac = ⋃_{i∈{1,...,M}−I} Ci ∈ F ;

(3) A = ⋃_{i∈J} Ci ,  B = ⋃_{i∈K} Ci  ⇒  A ∪ B = ⋃_{i∈J∪K} Ci ∈ F ,

and thus F is a field.
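(Aside, not in the original notes: for a finite sample space the verification above can be mirrored directly in code. The Python sketch below builds the collection F of all unions of cells of a hypothetical three-cell partition and checks conditions (1)–(3).)

from itertools import chain, combinations

# Hypothetical sample space and a partition of it into three cells (illustration only).
Omega = frozenset(range(6))
cells = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]

def index_subsets(items):
    # All subsets of the list of cells, including the empty subset.
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

# F = all unions of cells; the empty union gives the empty set.
F = {frozenset().union(*group) for group in index_subsets(cells)}

assert frozenset() in F and Omega in F                # condition (1)
assert all(Omega - A in F for A in F)                 # condition (2): complements
assert all(A | B in F for A in F for B in F)          # condition (3): unions
print(f"F contains {len(F)} sets (= 2^{len(cells)}) and is a field.")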

(iv) Here we take Ω = (0, 1], and we define F as the collection consisting of the emptyset and all finite unions of semi-open subintervals of (0,1], i.e.,

F = {∅} ∪ { A : A = ⋃_{i=1}^{M} (ai, bi] ,  M < ∞ ,  (ai, bi] ⊂ (0, 1] } .


Here condition (1) is easily seen to be satisfied: ∅ is explicitly included in F , and thechoice M = 1, a1 = 0, b1 = 1 yields Ω ∈ F . The same is true of condition (3), since theunion of two finite unions of intervals is itself a finite union of intervals.

To check condition (2), we first note that if two intervals (a1, b1] and (a2, b2] overlap,their union is a single semi-open interval (c, d]. Based on this observation, we can usea simple inductive argument to show that a finite union of semi-open intervals can beexpressed as a finite union of non-overlapping semi-open intervals. In other words,

⋃_{i=1}^{M} (ai, bi] = ⋃_{i=1}^{N} (ci, di] ,

where 0 ≤ c1 < d1 < c2 < d2 < . . . < cN < dN ≤ 1 and N ≤ M . Now

( ⋃_{i=1}^{N} (ci, di] )^c = (0, c1] ∪ (d1, c2] ∪ · · · ∪ (dN−1, cN ] ∪ (dN , 1] ,

where both (0, 0] and (1, 1] are taken to be the empty set. This equality (illustrated in thefigure below) verifies condition (2), thereby proving that F is a field.
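(Aside, not in the original notes: the two steps used above, merging overlapping semi-open intervals and complementing a disjoint union within (0, 1], translate into a short Python sketch; the interval endpoints below are hypothetical.)

def merge(intervals):
    # Merge overlapping or touching semi-open intervals (a, b] into disjoint ones.
    merged = []
    for a, b in sorted(intervals):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged

def complement(intervals):
    # Complement within (0, 1] of a finite union of (c_i, d_i], following
    # the formula (0, c_1] ∪ (d_1, c_2] ∪ ... ∪ (d_N, 1] above.
    comp, left = [], 0.0
    for c, d in merge(intervals):
        if left < c:
            comp.append((left, c))
        left = d
    if left < 1.0:
        comp.append((left, 1.0))
    return comp

A = [(0.1, 0.4), (0.3, 0.5), (0.7, 0.9)]   # hypothetical finite union of (a_i, b_i]
print(merge(A))        # [(0.1, 0.5), (0.7, 0.9)]
print(complement(A))   # [(0.0, 0.1), (0.5, 0.7), (0.9, 1.0)]  -- read each pair as (., .]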

Two further properties of fields

(4) Closure under intersection: A ∈ F , B ∈ F ⇒ A ∩ B ∈ F.

To see this, recall De Morgan's law:

(A ∩B)c = Ac ∪Bc .

Suppose now that A and B lie in F . By axioms (2) and (3), the same is true of the setsAc, Bc, Ac ∪Bc and (Ac ∪Bc)c. The last set is precisely A ∩B.

(5) Closure under finite unions: A1, . . . , An ∈ F ⇒ A1 ∪ · · · ∪ An ∈ F.

We prove this by an easy induction: suppose the statement is true for any n sets A1, . . . , An ∈ F, and that An+1 is also a set in F. Then by axiom (3), we have that

A1 ∪ · · · ∪An+1 = (A1 ∪ · · · ∪An) ∪An+1

also lies in F , which proves that the statement is true for any n + 1 sets in F . As thestatement is obviously true in the case n = 1, the induction is complete.


Remark. From (4) and (5) it easily follows that every field is closed under finiteintersections.

5. Event spaces and sigma-fields

Gray & Davisson, pp. 36–40.
Billingsley, Sec. 1, Classes of Sets.

Unions and intersections over arbitrary index sets

Suppose {Ai, i ∈ I} is a collection of subsets of Ω; here the index set I is entirely arbitrary.

The union of the sets Ai over I is defined as the set of points ω that lie in at leastone of the sets in the collection; i.e.,

⋃_{i∈I} Ai = { ω : (∃ i ∈ I) ω ∈ Ai } .

The intersection of the sets Ai over I is defined as the set of points ω that lie inevery one of the sets in the collection, i.e.,

⋂_{i∈I} Ai = { ω : (∀ i ∈ I) ω ∈ Ai } .

(Notation. The symbol ∀ reads “for all,” and ∃ reads “there exists one.”)

Fields and countable unions

We saw that fields are closed under the operation of taking unions of finitely manyconstituent sets. However, closure does not always hold if we take unions of infinitely manysuch sets. Thus if we have a sequence A1, A2, . . . of sets in a field F , the union

⋃_{i=1}^{∞} Ai  def=  ⋃_{i∈N} Ai

will not always lie in F .

To see an instance where such a countable union lies outside the field, consider Ex-ample (iv) introduced earlier. If we take

Ai = (0, 1/2 − 1/(3i)] ,

then the countable union ⋃_{i=1}^{∞} Ai


will be a subset of (0, 1/2), since each of the Ai’s is a subset of that open interval. Weclaim that this union is actually equal to (0, 1/2). Indeed, if ω is any point in (0, 1/2),then for a sufficiently large value of i we will have

ω ≤ 1/2 − 1/(3i) ,

and thus ω will lie in Ai for that value of i. This is illustrated in the figure below.

We have therefore shown that the union of all Ai’s is given by the open interval(0, 1/2), which cannot be expressed as a finite union of semi-open intervals and hence liesoutside F .

Remark. An often-asked question is: what happens for i = ∞? The answer is, i never takes infinity as a value, and the inclusion of ∞ in the symbol for the above countable union is purely a matter of convention (just as in the case of an infinite series). Thus the definition of the sequence A1, A2, . . . does not encompass a set such as A∞, which could be naïvely taken as (0, 1/2 − 1/∞] = (0, 1/2] .

The definition of a sigma-field

A σ-field or σ-algebra is a field that is closed under countable unions. Thus a collectionF of subsets of Ω is a σ-field if it satisfies the following axioms.

1. ∅ ∈ F , Ω ∈ F .

2. A ∈ F ⇒ Ac ∈ F (closure under complementation).

3′. (∀ i ∈ N) Ai ∈ F  ⇒  ⋃_{i=1}^{∞} Ai ∈ F (closure under countable unions).

Remark. Countable means either finite or countably infinite; a set is countably in-finite if its elements can be arranged in the form of an infinite sequence, or equivalently,put in a one-to-one correspondence with the natural numbers. Thus strictly speaking, (3′)should be labeled closure under countably infinite unions. Yet the distinction is unimpor-tant, since a finite union is a countably infinite union where all but finitely many sets areempty. In particular, (3′) readily implies axiom (3) in the definition of a field (closureunder union), as well as property (5) of the previous section (closure under finite unions).

The following statement is a direct consequence of the above considerations.

Corollary. If a field consists of finitely many sets, it is also a σ-field.

Examples of sigma fields

Let us again consider the examples of fields given in Section 4.


(i) F = {∅, Ω} is a σ-field by the above Corollary.

(ii) F = 2Ω is a σ-field since any set operation (including taking countable unions)yields a subset of Ω.

(iii) In this case F consists of all unions of cells in a finite partition of Ω. Clearly F is finite (if there are M cells, then there are 2^M sets in F) and thus, by the earlier corollary, F is a σ-field.

If we take a countably infinite partition of Ω into sets C1, C2, . . ., then the collection

F′ = { A : A = ⋃_{i∈I} Ci ,  I ⊂ N } ,

will also be a σ-field. It is easy to check the first two axioms; for closure under countableunions, we note that given any sequence of sets A1, A2, . . . in F ′ such that

Ak = ⋃_{i∈Ik} Ci ,

we can write

⋃_{k=1}^{∞} Ak = ⋃_{i∈I} Ci ,

where I = I1 ∪ I2 ∪ · · ·.

Remark. A measurable space (Ω,F) in which F is a σ-field consisting of all unions of atoms in a countable partition of Ω is called discrete.

(iv) In this example, F consisted of the empty set and all finite unions of semi-open subintervals of (0, 1]. As we saw earlier in this section, there exists a sequence of semi-open intervals in F, the countable union of which is an open interval lying outside F. Thus F is not closed under countable unions, and hence it is not a σ-field.

6. Generated sigma-fields and the Borel field

Gray & Davisson, pp. 40–47.
Billingsley, Sec. 1, Classes of Sets.

The sigma-field generated by a collection

As we saw in the previous section, a field of subsets of Ω is not always a σ-field. Aquestion that arises naturally is in what ways such a field (or more generally, an arbitrarycollection of subsets of Ω) can be augmented so as to form a σ-field.

It is easy to see that this is always possible, since the power set 2Ω is a σ-field whichcontains (as a subcollection) every collection G of subsets of Ω. A far more interesting factis that there also exists a minimal such augmentation: that is, given any G, there existsa unique σ-field of subsets of Ω that both contains G and is contained in every σ-fieldcontaining G.


(Notation. The term “contains” can mean either “contains as a subset” or “contains asan element;” which meaning is pertinent depends on the context.)

Before proving the existence of a minimal such σ-field, it is worth giving a simpleexample. Suppose Ω = (0, 1], and define the collection G by

G = { (0, 1/3], (2/3, 1] } .

As pointed out above, the power set

F1 = 2Ω

is (trivially) a σ-field that contains G. To find the smallest σ-field with this property, wereason as follows.

The sets ∅ and (0, 1] clearly lie in every σ-field containing G, as do (0, 1/3] and (2/3, 1].Hence the union

(0, 1/3] ∪ (2/3, 1]

also lies in every such σ-field, and so does its complement (1/3, 2/3]. By closure underunion, the same is true of the sets (0, 2/3] and (1/3, 1]. Thus every σ-field containing Gmust also contain the collection

F2 = { ∅, Ω, (0, 1/3], (1/3, 2/3], (2/3, 1], (0, 2/3], (1/3, 1], (0, 1/3] ∪ (2/3, 1] } .

Since F2 is itself a σ-field, we conclude that F2 is the smallest σ-field containing G.
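(Aside, not in the original notes: on a finite space the σ-field generated by a collection can be computed by brute force, closing under complements and pairwise unions until nothing new appears. The Python sketch below discretizes (0, 1] into thirds, so that G corresponds to {{0}, {2}}; the construction recovers the eight sets of F2.)

def generated_sigma_field(Omega, G):
    # For a finite Omega, close G under complementation and pairwise union
    # until no new sets appear; on a finite space this yields sigma(G).
    F = set(G) | {frozenset(), frozenset(Omega)}
    changed = True
    while changed:
        changed = False
        new = {frozenset(Omega) - A for A in F}
        new |= {A | B for A in F for B in F}
        if not new <= F:
            F |= new
            changed = True
    return F

# Discretize (0, 1]: 0 ~ (0, 1/3], 1 ~ (1/3, 2/3], 2 ~ (2/3, 1].
Omega = {0, 1, 2}
G = [frozenset({0}), frozenset({2})]
F2 = generated_sigma_field(Omega, G)
print(len(F2))      # 8, matching the collection F2 listed in the text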

In the case of G consisting of infinitely many sets, the construction of a minimalσ-field containing G is often impossible. In contrast, the proof of its existence is quitestraightforward, and relies on the simple fact that the intersection of an arbitrary class ofσ-fields is itself a σ-field.(Remark. In taking the intersection of two collections of subsets of Ω, we identify thosesubsets of Ω that are common to both collections; we do not take intersections of subsetsof Ω. For example, if Ω = (0, 1] and G, F1 and F2 are defined as above, then

G ∩ F2 = G ∩ F1 = G , F2 ∩ F1 = F2 .

If also F3 = { (0, 1/2] }, then

F3 ∩ G = F3 ∩ F2 = ∅ , F3 ∩ F1 = F3 .

Analogous statements can be made for every set operation and relation applied to collec-tions. Thus for example,

G ⊂ F2 ⊂ F1 ,

while F3 is a subset of neither G nor F2.)


To show that an arbitrary intersection of σ-fields is itself a σ-field, consider

F∩ = ⋂_{k∈K} Fk ,

where each Fk is a σ-field of subsets of Ω and the index set K is arbitrary. We check eachof the three axioms in turn.

(1) Both ∅ and Ω lie in every Fk, thus also in F∩.

(2) If A lies in Fk for some k, then Ac also lies in that Fk. Thus if A lies in every Fk, then so does Ac, i.e., Ac ∈ F∩.

(3′) The argument here is essentially the same as above: if A1, A2, . . . lie in every Fk,then by closure of each Fk under countable unions, the same will be true of

⋃_{i=1}^{∞} Ai .

With the above fact in mind, we can easily show that the minimal σ-field containinga collection G is the intersection of all σ-fields containing G. Indeed, the said intersection

(i) is itself a σ-field (by the above fact);

(ii) is contained in every σ-field containing G; and

(iii) contains G .

We summarize the above information in the following definition.

Given a collection G of subsets of Ω, the minimal σ-field containing G, or equivalently the σ-field generated by G, is defined as the intersection of all σ-fields containing G, and is denoted by σ(G).

One last remark before proceeding to the next topic is the following: it is possible for two or more distinct collections to generate the same σ-field. A simple illustration of this fact can be given in terms of our earlier example, where

G = { (0, 1/3], (2/3, 1] } .

If we now take

G′ = { (0, 1/3], (1/3, 2/3], (2/3, 1] }   and   G′′ = { (0, 1/3], (0, 2/3] } ,

then one can easily verify that

σ(G) = σ(G′) = σ(G′′) .


The Borel field

As noted above, the σ-fields generated by finite collections of subsets of Ω are rathereasy to construct; such σ-fields are simply described in terms of finite partitions of Ω (cf.Section 4, Example (iii)). For a countably infinite G, one might be tempted to extrapolatethat σ(G) will consist of unions of cells in a countable partition of Ω. This will always betrue if the sample space Ω is countably infinite, but not so if Ω is uncountable.

To elaborate on the last statement, we return to the uncountable space

Ω = (0, 1] .

As we saw in Section 4, the field

F = {∅} ∪ { A : A = ⋃_{i=1}^{M} (ai, bi] ,  M < ∞ ,  (ai, bi] ⊂ (0, 1] }

is not a σ-field. By the foregoing discussion, F can be augmented to a minimal σ-fieldσ(F). This σ-field is called the Borel field of the unit interval, and is denoted by B((0, 1]).Thus

B((0, 1]) def= σ(F) .

The Borel field is of crucial importance in probability theory, being the basis for thedefinition of a random variable and its distribution. It contains, among others, all sets thatarise from intervals by countably many set operations. It does not, however, contain everysubset of the unit interval; it is possible to give (admittedly contrived) counterexamplesto that effect.

What other collections of subsets of (0, 1] generate B((0, 1])? The answer is many,including some that are easier to describe than F . In what follows we give an example ofsuch an alternative collection, principally in order to illustrate a general method of provingthat two given collections generate the same σ-field.

We claim that the σ-field generated by the collection

G = { A : A = (0, a),  a < 1 } ,

is B((0, 1]), i.e., that σ(G) = σ(F).

To prove the above equality, we must prove each of the inclusions σ(G) ⊂ σ(F) andσ(F) ⊂ σ(G). For the former inclusion, it is sufficient to show that G is contained in σ(F).This is because any σ-field that contains G will, by definition of σ(·), also contain σ(G).The same argument can be made with F and G interchanged, and thus we conclude that

G ⊂ σ(F) and F ⊂ σ(G) ⇒ σ(G) = σ(F) .


To prove G ⊂ σ(F): By the method given under Fields and countable unions (Section5), we can write an arbitrary interval (0, a) in G as

(0, a) = ⋃_{i=1}^{∞} (0, (1 − (i + 1)^−1) a ] .

Each of the sets in the above union lies in F . Thus (0, a) is expressible as a countableunion of sets in F , and hence lies in σ(F).

To prove F ⊂ σ(G): The empty set trivially lies in σ(G). It thus remains to prove thatevery finite union of semi-open intervals (ai, bi] lies in σ(G); this is equivalent to provingthat every single semi-open interval (a, b] lies in σ(G).

We express (a, b] as

(a, b] = (a, 1]− (b, 1] = (a, 1] ∩ (b, 1]c ,

so that it suffices to show that every (a, 1] lies in σ(G). We now write (a, 1] as

(a, 1] = ⋃_{i=1}^{∞} [ a + (1 − a) i^−1 , 1 ] .

Each of the sets in the above union is a complement of a set in G, and hence lies in σ(G).Thus (a, 1] also lies in σ(G), and the proof is complete.

We emphasize again that the choice of the alternative generating collection G is not unique; one can easily show that substitution of the generic set (0, a) by any of the intervals (0, a], [a, b], etc., still yields the Borel field. More importantly, G can be replaced by a (sub-)collection of intervals (0, a) such that a is a rational number (expressible as a ratio of integers). Since any real number can be written as the limit of an increasing or decreasing sequence of rationals, we can easily adapt the above proof to suit the modified G by using rational endpoints in the appropriate unions. And since the set of rationals is countable, this implies that the Borel field can be generated by a countable collection of intervals.

We can now justify our earlier statement that σ-fields generated by countable col-lections on uncountable sample spaces are not always described in terms of countablepartitions. We do so by noting that the Borel field contains (among others) all sets thatconsist of single points on the unit interval; these sets alone form an uncountable partitionof that interval.

The Borel field of the entire real line can be defined in a similar fashion:

B(R)  def=  σ( { (−∞, a] : a ∈ R } ) .

Here again the choice of generating intervals is not unique, and rational endpoints are fullyacceptable.


We can also define the Borel field of an arbitrary subset Ω of the real line by

B(Ω)  def=  { A : A = C ∩ Ω ,  C ∈ B(R) } .

An interesting exercise is to prove that in the case of the unit interval, the above definitionof B((0, 1]) is consistent with the one given originally.

7. Definition of the event space

Billingsley, Sec. 4, Limit sets.

As we argued in Section 4, it is desirable that every event space contain ∅ and Ω, and be closed under complementation and finite unions. Thus every event space should at least be a field. That it should also be a σ-field is not so obvious. The axiom of closure under countable unions implies the following: if we have a sequence of observation mechanisms M1, M2, . . . (where Mi observes the occurrence of event Ai), then we can effectively combine these mechanisms into one that will decide whether the union

⋃_{i=1}^{∞} Ai

occurred or not. This implication is not always true in practice, as the example given inthis section illustrates. Despite this shortcoming, the structure of the σ-field is chosen forthe event space because it allows us to use the powerful mathematical machinery associatedwith the probability measure (which will be formally defined in the following section).

For the remainder of this course, the event space F will always be a σ-field.

Consider now the following example which involves an infinite sequence of eventsA1, A2, . . . in a measurable space (Ω,F). We are interested in descriptions of the set A∗ ofsample points ω that lie in infinitely many (but not necessarily all) of the sets Ai. Thus

A∗ = { ω : ω ∈ Ai for infinitely many i } .

In deriving an alternative description of A∗, we argue as follows. If a point ω lies in finitelymany sets Ai, then there exists an index k such that ω does not lie in any of the setsAk, Ak+1, . . .. Conversely, if a point ω lies in infinitely many Ai’s, then for every k thatpoint will lie in at least one of the sets Ak, Ak+1, . . .. Thus

A∗ = { ω : (∀ k) ω ∈ at least one of Ak, Ak+1, . . . } = { ω : (∀ k) ω ∈ ⋃_{i≥k} Ai } .

Writing Bk for ⋃_{i≥k} Ai, we have

A∗ = { ω : (∀ k) ω ∈ Bk } = ⋂_{k≥1} Bk = ⋂_{k≥1} ⋃_{i≥k} Ai .


To show that A∗ is an event, i.e., A∗ ∈ F , we argue as follows. Every Bk is anevent, since it can be written as a countable union of events Ai; and thus A∗, which is theintersection of the Bk’s, is also an event.(Remark. De Morgan’s law is true for arbitrary collections of events, and thus closureunder countable unions is equivalent to closure under countable intersections.)

The event A∗ is called the limit superior (lim sup) of the sequence A1, A2, . . ., andis often described in words as “Ai occurs infinitely often (i.o.).” Thus

lim sup_i Ai  def=  {Ai i.o.}  def=  ⋂_{k≥1} ⋃_{i≥k} Ai .
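(Aside, not in the original notes: the representation of lim sup as ⋂_{k≥1} ⋃_{i≥k} Ai can be explored numerically on a truncated sequence. The Python sketch below uses a hypothetical alternating sequence of finite sets and intersects the truncated tail unions Bk; both points occur infinitely often, so both survive.)

def tail_union(A, k, n):
    # The set B_k of the text, truncated at index n: union of A_k, ..., A_{n-1}.
    return set().union(*A[k:n])

# Hypothetical alternating sequence: A_i = {0} for even i, {1} for odd i.
A = [({0} if i % 2 == 0 else {1}) for i in range(1000)]

# Finite-horizon stand-in for limsup: intersect the tail unions B_0, ..., B_99,
# each truncated at n = 1000 (the true limsup needs the whole infinite sequence).
limsup_approx = tail_union(A, 0, 1000)
for k in range(1, 100):
    limsup_approx &= tail_union(A, k, 1000)

print(limsup_approx)   # {0, 1}: both points lie in infinitely many of the A_i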

We can think of the above situation in terms of a random experiment whose actual outcome ω is unknown to us. Our information about ω is limited to a sequence of partial observations of the experiment: for every i we know whether ω ∈ Ai or ω ∈ Ai^c, i.e., whether the event Ai has occurred or not. Since the set A∗ of outcomes is expressible in terms of the sequence A1, A2, . . ., it is reasonable to assume that we can process our observations so as to determine whether or not ω ∈ A∗, i.e., whether or not infinitely many of the events Ai have occurred.

Unfortunately, this is easier said than done. Consider for instance the case in whichthe observations are made sequentially in discrete time. If we assume that the Ai’s aresuch that every intersection of the form

⋂_{i≥1} Ci ,    (Ci is either Ai or Ai^c) ,

is nonempty, then we have no means of determining in finite time whether infinitely manyof the Ai’s have occurred. Thus the set of outcomes A∗ = lim supi Ai does not correspondto any “real” observation of the experiment; it is an event only because F is a σ-field.

Remark. In a similar manner we can define the limit inferior (lim inf) of thesequence A1, A2, . . . as the event that “Ai occurs eventually,” or equivalently, “Ai occursfor all but finitely many values of i.” It is easy to check that this definition is consistentwith the representation

lim inf_i Ai = ⋃_{k≥1} ⋂_{i≥k} Ai ,

and that (lim inf_i Ai)^c = lim sup_i Ai^c .

8. Probability measures

Gray & Davisson, pp. 31, 47–50.
Billingsley, Sec. 2, Probability Measures.

Definition

A probability measure on the measurable space (Ω,F) is a real-valued function Pdefined on F that satisfies the following axioms:


(P1) Nonnegativity: (∀A ∈ F) P (A) ≥ 0.

(P2) Normalization: P (Ω) = 1.

(P3) Countable additivity: if A1, A2, . . . are pairwise disjoint events (i.e., Ai ∩ Aj = ∅ for i ≠ j), then

P( ⋃_{i=1}^{∞} Ai ) = ∑_{i=1}^{∞} P(Ai) .

Note that in (P3) above, the union of the Ai's lies in F by the assumption that the event space is a σ-field. (P1)–(P3) are also known as the Kolmogorov axioms.
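(Aside, not in the original notes: on a finite sample space, countable additivity reduces to finite additivity, so the Kolmogorov axioms can be verified exhaustively. The Python sketch below does this for a hypothetical four-point space with F = 2^Ω.)

from itertools import chain, combinations

# Hypothetical finite probability space: Omega = {0, 1, 2, 3} with a pmf on atoms.
Omega = [0, 1, 2, 3]
p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

def P(A):
    # Probability of an event A (any subset of Omega) as the sum of atom weights.
    return sum(p[w] for w in A)

events = [frozenset(s) for s in chain.from_iterable(
    combinations(Omega, r) for r in range(len(Omega) + 1))]   # F = 2^Omega

assert all(P(A) >= 0 for A in events)                 # (P1) nonnegativity
assert abs(P(frozenset(Omega)) - 1.0) < 1e-12         # (P2) normalization
for A in events:                                      # additivity on disjoint pairs
    for B in events:
        if not (A & B):
            assert abs(P(A | B) - (P(A) + P(B))) < 1e-12
print("P satisfies the Kolmogorov axioms on this finite space.")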

Simple properties

From (P1) and (P3) we can deduce that P is finitely additive and that the probability of the empty set is 0. Indeed, if A1, . . . , An are pairwise disjoint events, we can write

⋃_{i=1}^{n} Ai = ⋃_{i=1}^{∞} Ai ,

where An+1 = An+2 = . . . = ∅. The sequence A1, A2, . . . still consists of disjoint events, sowe can apply (P3) to obtain

P( ⋃_{i=1}^{n} Ai ) = ∑_{i=1}^{n} P(Ai) + ∑_{i=n+1}^{∞} P(Ai) .

The infinite sum on the right-hand side consists of terms equal to P (∅), and hence it willbe equal to 0 if P (∅) = 0, and +∞ if P (∅) > 0. As the probability measure cannot takeinfinity as a value (it is assumed to be a real-valued function), it must be that

∑_{i=n+1}^{∞} P(Ai) = P(∅) = 0 .

Therefore

P( ⋃_{i=1}^{n} Ai ) = ∑_{i=1}^{n} P(Ai) .

We thus obtain

(P4) P(∅) = 0;

and

(P5) Finite additivity: if A1, . . . , An are pairwise disjoint events,

P( ⋃_{i=1}^{n} Ai ) = ∑_{i=1}^{n} P(Ai) .


(Remark. We can obtain (P4) without assuming that P (A) < ∞ by applying (P3) to asequence where A1 = Ω and the remaining Ai’s equal to ∅ ((P1) and (P2) are also neededhere). Having obtained (P4), we can apply (P3) and (P1) as before to obtain (P5).)

(P6) For all A ∈ F , P (A) + P (Ac) = 1 .

This follows from (P2) and (P5) since A ∩Ac = ∅, A ∪Ac = Ω.

(P7) Monotonicity under inclusion: B ⊃ A ⇒ P(B) ≥ P(A).

This is because we can write B as A ∪ (B ∩ Ac), which is a disjoint union. Hence by (P5) and (P1),

P(B) = P(A) + P(B ∩ Ac) ≥ P(A) .

In particular, Ω ⊃ A for every event A, so

(P8) For all A ∈ F , P(A) ≤ 1.

Notation. The set difference A−B or A \B is defined by

A − B  def=  A \ B  def=  A ∩ Bc .

The symmetric set difference A △ B is defined by

A △ B = B △ A = (A ∩ Bc) ∪ (B ∩ Ac) .

These operations are illustrated in the figure below.

We say that a sequence of events (An)n∈N is increasing if

A1 ⊂ A2 ⊂ . . . ;

it is decreasing if

A1 ⊃ A2 ⊃ . . . .

For such sequences (which are also called monotone) we can define a limiting event lim_n An as follows: if (An)n∈N is increasing,

lim_n An  def=  ⋃_{n=1}^{∞} An ;


whereas if (An)n∈N is decreasing,

lim_n An  def=  ⋂_{n=1}^{∞} An .

For brevity we write An ↑ A and An ↓ A (respectively), where A = limn An.

Probability measures are continuous on monotone sequences of events; in other words,

P(lim_n An) = lim_n P(An) .

To prove this for an increasing sequence A1, A2, . . ., we generate a sequence of pairwisedisjoint events B1, B2, . . . as follows:

B1 = A1,              A1 = B1,
B2 = A2 − A1,         A2 = B1 ∪ B2,
   ...                   ...
Bn = An − An−1,       An = B1 ∪ · · · ∪ Bn .

From the above construction, it is easy to see that

⋃_{i=1}^{∞} Ai = ⋃_{i=1}^{∞} Bi .

This is because any ω that lies in one of the Ai's will also lie in one of the Bi's, and vice versa. Hence

P( ⋃_{i=1}^{∞} Ai ) = P( ⋃_{i=1}^{∞} Bi ) ,

and since the Bi’s are disjoint, countable additivity gives

P( ⋃_{i=1}^{∞} Ai ) = ∑_{i=1}^{∞} P(Bi) = lim_n ∑_{i=1}^{n} P(Bi) .

Now since An is the union of the first n Bi’s, we also have

P(An) = ∑_{i=1}^{n} P(Bi) ,


and hence

P( ⋃_{i=1}^{∞} Ai ) = lim_n P(An) .

The analogous result for decreasing sequences of sets follows easily. If A1, A2, . . . is decreasing, then A1^c, A2^c, . . . is increasing; by De Morgan's law and the above result we then have

P( ⋂_{i=1}^{∞} Ai ) = P( ( ⋃_{i=1}^{∞} Ai^c )^c ) = 1 − lim_n P(An^c) = 1 − [1 − lim_n P(An)] = lim_n P(An) .

We have thus obtained

(P9) Monotone continuity from below: If An ↑ A, then lim_n P(An) = P(A).

(P10) Monotone continuity from above: If An ↓ A, then lim_n P(An) = P(A).

Remark. As we have seen, (P9) and (P10) follow directly from the Kolmogorov axioms (P1)–(P3). It is not difficult to show that under the assumption of nonnegativity (P1), the countable additivity axiom (P3) is actually equivalent to the two axioms of finite additivity (P5) and monotone continuity from below (P9) combined. Thus an alternative to the Kolmogorov axioms is the set of axioms consisting of (P1), (P2), (P5) and either (P9) or (P10):

{ Nonnegativity, Normalization, Countable additivity }   ⇐⇒   { Nonnegativity, Normalization, Finite additivity, Continuity from above or below }

Convex mixtures of probability measures

Let P1, P2, . . . be probability measures on the same measurable space (Ω,F). We saythat the set function P is a convex mixture (or convex combination) of these measures ifit can be expressed as a weighted sum of the Pi’s with nonnegative weights that add tounity. In other words, for every A ∈ F , P (A) is defined as

P(A) = ∑_{i=1}^{∞} λi Pi(A) ,

where the real coefficients λi satisfy

(∀ i) λi ≥ 0 ,    ∑_{i=1}^{∞} λi = 1 .

Claim. P is a probability measure on (Ω,F).


Proof. Nonnegativity of P follows directly from that of the λi’s and Pi’s. Normal-ization is also easily established via

P(Ω) = ∑_{i=1}^{∞} λi Pi(Ω) = ∑_{i=1}^{∞} λi = 1 .

To prove countable additivity, consider a sequence of disjoint events A1, A2, . . .. We have

P( ⋃_{j=1}^{∞} Aj ) = ∑_{i=1}^{∞} λi Pi( ⋃_{j=1}^{∞} Aj ) = ∑_{i=1}^{∞} λi ∑_{j=1}^{∞} Pi(Aj) ,

where the last equality follows by countable additivity of the measures Pi. The iterated sum has nonnegative summands, so we can change the order of summation to obtain

P( ⋃_{j=1}^{∞} Aj ) = ∑_{j=1}^{∞} ∑_{i=1}^{∞} λi Pi(Aj) = ∑_{j=1}^{∞} P(Aj) .

Thus P is a probability measure.
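(Aside, not in the original notes: a minimal Python sketch of a convex mixture of two hypothetical pmfs on a three-point space, checking normalization and additivity numerically.)

# Hypothetical component measures: two pmfs on {0, 1, 2} and mixture weights.
P1 = {0: 0.5, 1: 0.5, 2: 0.0}
P2 = {0: 0.1, 1: 0.2, 2: 0.7}
lam = [0.3, 0.7]                       # nonnegative weights summing to 1

def mixture(A):
    # P(A) = lambda_1 * P1(A) + lambda_2 * P2(A) for an event A within {0, 1, 2}.
    return lam[0] * sum(P1[w] for w in A) + lam[1] * sum(P2[w] for w in A)

Omega = {0, 1, 2}
assert abs(mixture(Omega) - 1.0) < 1e-12                              # normalization
assert abs(mixture({0, 1}) - (mixture({0}) + mixture({1}))) < 1e-12   # additivity
print(mixture({2}))    # 0.3*0.0 + 0.7*0.7 = 0.49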

9. Specification of probability measures

Gray & Davisson, pp. 50–52.
Billingsley, Sec. 3, Lebesgue measure on the unit interval; Sec. 4.

Discrete spaces

As we saw in Section 5, a measurable space (Ω,F) is discrete if the σ-field F isgenerated by a countable partition of Ω into atoms

C1, C2, . . . .

Then for every event A there exists an index set I ⊂ N such that

A = ⋃_{i∈I} Ci ,

and if P is a probability measure on (Ω,F), we have

P(A) = ∑_{i∈I} P(Ci) .

The above demonstrates that in order to define a probability measure P on (Ω,F), itsuffices to specify the quantity

pi = P (Ci)


for every atomic event Ci. Clearly, the pi’s satisfy

(∀ i) pi ≥ 0 ,    ∑_{i=1}^{∞} pi = 1 .

That any sequence (pi)i∈N satisfying this nonnegativity/normalization condition generatesa probability measure is not difficult to see: if we let

P(A) = ∑_{i∈I} pi

with A and I defined as before, then the set function P is nonnegative and such thatP (Ω) = 1. To establish countable additivity, we simply note that disjoint events can beexpressed as unions of cells over likewise disjoint index sets.

Definition. A probability mass function (pmf) is a sequence of nonnegativenumbers whose sum equals unity. In the context of a given discrete probability space, thepmf is the function that assigns probability to each atomic event.

Example. Ω = {0, 1, . . .}, F = 2Ω. It is easy to see in this case that F is generated by the one-point sets (or singletons) {k}. To define the measure P, we use the Poisson pmf:

pk = e^−λ λ^k / k! ,

where k = 0, 1, . . .. We have

∑_{k=0}^{∞} pk = e^−λ ∑_{k=0}^{∞} λ^k/k! = e^−λ e^λ = 1

as required. If we let P{k} = pk, then the probability that the outcome of the experiment is odd is

P{1, 3, 5, . . .} = e^−λ ( λ + λ^3/3! + λ^5/5! + · · · ) = e^−λ (e^λ − e^−λ)/2 = (1 − e^−2λ)/2 .
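(Aside, not in the original notes: the closed form just derived can be checked numerically. The Python sketch below sums the Poisson pmf over the odd integers for a hypothetical rate λ and compares the result with (1 − e^{−2λ})/2.)

import math

lam = 2.5                                              # hypothetical rate parameter
p = lambda k: math.exp(-lam) * lam ** k / math.factorial(k)

p_odd = sum(p(k) for k in range(1, 100, 2))            # tail beyond k ~ 100 is negligible
print(p_odd, (1 - math.exp(-2 * lam)) / 2)             # the two values agree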

Non-discrete spaces

As we saw above, we can concisely specify a probability measure on a discrete space by quoting the probabilities of the atomic events. This is not possible in the case of a non-discrete space, since the event space is no longer generated by a countable partition of Ω and thus there is no countable family of "minimal" events. Yet a concise specification of measures on non-discrete spaces is absolutely necessary if such spaces are to be used in probability models. Without such specification, the task of explicitly defining a set function on all (uncountably many) events and subsequently testing it for the Kolmogorov axioms becomes impracticable.

A method often used for defining a probability measure on a non-discrete σ-field Finvolves the construction of a preliminary set function Q on a field F0 that generatesF . If the function Q, as constructed on F0, satisfies certain conditions similar (but notidentical) to the Kolmogorov axioms, then it is possible to extend Q to a unique probabilitymeasure P on F = σ(F0). Thus under these conditions, specification of Q on F0 sufficesto determine the probability measure P on F without ambiguity. (One word of caution:that Q uniquely determines P does not imply that for an arbitrary event A /∈ F0 one caneasily compute P (A) based on the values taken by Q; in many cases, this computation willbe highly complex or even infeasible.)

The above method of defining measures is based on the following theorem, whoseproof can be found in Billingsley, Section 3.

Theorem. Let F0 be a field of subsets of Ω, and let Q be a nonnegative, countably additive set function on F0 such that Q(Ω) = 1. Then there exists a unique probability measure P on σ(F0) such that P ≡ Q on F0.

Remark. As F0 will not in general be closed under countable unions, the statement "Q is countably additive on F0" is understood as "if A1, A2, . . . are disjoint sets in F0 and ⋃_i Ai also lies in F0, then

Q( ⋃_i Ai ) = ∑_i Q(Ai) ."

The Lebesgue measure on the unit interval

To illustrate the extension technique outlined in the previous subsection, we brieflyconsider the problem of defining a probability measure P on the Borel field of the unitinterval such that

P (a, b] = b− a

for every (a, b] ⊂ (0, 1].

We first need to identify a field F0 that generates the Borel field, i.e., such that σ(F0) = B((0, 1]). As we saw in Section 6, one such choice of F0 consists of the empty set and all finite disjoint unions of semi-open intervals

⋃_{i=1}^{N} (ci, di] ,

where 0 ≤ c1 < d1 < c2 < d2 < . . . < cN < dN ≤ 1. Since we would like the probability of any semi-open interval to be equal to its length, we must define the set function Q on F0 by

Q(∅) = 0 ,     Q( ⋃_{i=1}^{N} (ci, di] ) = ∑_{i=1}^{N} (di − ci) .


The set function Q is clearly nonnegative and satisfies Q(0, 1] = 1. It is not difficult to showthat Q is finitely additive: if two sets in F0 are disjoint, then their constituent intervals arenon-overlapping. Countable additivity of Q on F0 can be also established, but the proofis slightly more involved (see Billingsley, Sec. 2, Lebesgue measure on the unit interval).

By the extension theorem of the previous subsection, there exists a unique probabilitymeasure P on σ(F0) = B((0, 1]) such that P = Q on F0. P is called the Lebesguemeasure on the unit interval. It is the only probability measure on B((0, 1]) that assignsto every semi-open interval a probability equal to its length.

What is the probability of a singleton {x} under the Lebesgue measure? Intuitively, it should be zero (a point has no length). This is easily verified by writing {x} as the limit of a decreasing sequence of semi-open intervals,

{x} = ⋂_{i=1}^{∞} ( (1 − (i + 1)^−1) x ,  x ] ,

and invoking monotone continuity from above:

P{x} = lim_n P( (1 − (n + 1)^−1) x ,  x ] = lim_n (n + 1)^−1 x = 0 .

Thus the Lebesgue measure of any interval (whether open, semi-open or closed) equals the interval length.
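(Aside, not in the original notes: under the Lebesgue measure, P(a, b] = b − a can also be seen empirically by drawing points uniformly from the unit interval, which is exactly the experiment of Section 2. The Python sketch below uses hypothetical endpoints a and b.)

import random

random.seed(0)
a, b, n = 0.25, 0.6, 1_000_000     # hypothetical interval (a, b] and sample size

# Under the Lebesgue (uniform) measure, P((a, b]) should equal b - a = 0.35.
hits = sum(a < random.random() <= b for _ in range(n))
print(hits / n, b - a)             # the empirical frequency is close to 0.35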

Question. Do countable subsets of the unit interval (e.g., the set of rationals in thatinterval) lie in its Borel field? What is the Lebesgue measure of such sets?

10. Definition of random variable

Gray & Davisson, pp. 64–74.
Billingsley, Sec. 5, Definition; Sec. 13, Measurable Mappings, Mappings into Rk.

Preliminaries

In this section we consider real–valued functions f defined on a sample space Ω, i.e.,

f : Ω → R .

We recall the definition of the image of a set A ⊂ Ω under f as

f(A)  def=  { x ∈ R : (∃ω ∈ A) f(ω) = x } ;

the inverse image of a set H ⊂ R under f is defined by

f−1(H)  def=  { ω ∈ Ω : f(ω) ∈ H } .


In developing the concept of the random variable, we will employ images and inverseimages (in particular) extensively. The following simple property will be quite useful: ifHi, i ∈ I is an arbitrary collection of subsets of R, then

f−1( ⋃_{i∈I} Hi ) = ⋃_{i∈I} f−1(Hi) .

To see this, let ω lie in the inverse image of the union. Then f(ω) ∈ Hi for some i = i(ω), and thus ω ∈ f−1(Hi). Conversely, if ω′ lies in the union of the inverse images, then f(ω′) ∈ Hj for some j = j(ω′); thus f(ω′) lies in the union of all the Hi's, and ω′ lies in the inverse image of that union.

By similar reasoning, we can show that

f−1( ⋂_{i∈I} Hi ) = ⋂_{i∈I} f−1(Hi) ,

and for (forward) images,

f( ⋃_{i∈I} Ai ) = ⋃_{i∈I} f(Ai) ,     f( ⋂_{i∈I} Ai ) ⊂ ⋂_{i∈I} f(Ai) .

Definition

A random variable (r.v.) on a measurable space (Ω,F) is a real-valued function X = X(·) on Ω such that for all a ∈ R, the set

X−1(−∞, a] = { ω : X(ω) ≤ a }

lies in F (i.e., is an event).

We can think of X(ω) as the result of a measurement taken in the course of a randomexperiment: if the outcome of the experiment is ω, we obtain a reading X(ω) which providespartial information about ω. A more precise interpretation will be given in Section 11.For the moment, we consider some simple examples of random variables.

Examples of random variables

(i) Let (Ω,F) be arbitrary, and A ∈ F . The indicator function IA(.) of the eventA is defined by

IA(ω) =  1, if ω ∈ A;   0, if ω ∈ Ac.

We illustrate this definition in the figure below, where for the sake of simplicity, the samplespace Ω is represented by an interval on the real line.


To see that X = IA defines a random variable on (Ω,F), note that as a ranges over the real line, the set X−1(−∞, a] = {ω : X(ω) ≤ a} is given by

X−1(−∞, a] =  ∅, if a < 0;   Ac, if 0 ≤ a < 1;   Ω, if a ≥ 1.

Thus X−1(−∞, a] is always an event, and hence X = IA is a r.v. by the above definition. If B ∉ F, then also Bc ∉ F, and thus IB is not a r.v.

(ii) (Ω,F) is arbitrary, E1, E2, . . . is a countable partition of Ω, and (cj)j∈N is asequence of distinct real numbers. We define X(·) as the function that takes the constantvalue cj on every event Ej . In terms of indicator functions,

X(ω) = ∑_{j=1}^{∞} cj IEj(ω) .

We then have for a ∈ R

X−1(−∞, a] = ⋃_{j∈Ja} Ej ,

where Ja = {j : a ≥ cj}. As the above inverse image is a countable union of events, it is itself an event and thus X is a r.v. on (Ω,F).
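(Aside, not in the original notes: the inverse-image computation for a discrete r.v. is easy to mirror in code. The Python sketch below uses a hypothetical three-cell partition with distinct values cj and forms {ω : X(ω) ≤ a} as the union of the cells Ej with cj ≤ a.)

# Hypothetical discrete r.v. on Omega = {0, ..., 5}: a partition into three cells
# E_1, E_2, E_3 and distinct values c_1, c_2, c_3 taken on those cells.
cells = {1: {0, 1}, 2: {2, 3}, 3: {4, 5}}
c = {1: -1.0, 2: 0.5, 3: 2.0}

def X(omega):
    # X(omega) = sum_j c_j * I_{E_j}(omega): the value c_j of the cell containing omega.
    return next(c[j] for j, E in cells.items() if omega in E)

def inverse_image_leq(a):
    # {omega : X(omega) <= a} = union of the cells E_j with c_j <= a.
    return set().union(*(E for j, E in cells.items() if c[j] <= a))

print(inverse_image_leq(1.0))   # {0, 1, 2, 3}: the union E_1 ∪ E_2, an event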


Remark. X as defined above takes the general form of a discrete random variableon the measurable space (Ω,F). Thus a discrete r.v. is one whose range is countable. Asimple random variable is one whose range is finite.

(iii) Let (Ω,F) = (R,B(R)) and let X(·) be a continuous, strictly increasing real-valued function such that

lim_{ω↓−∞} X(ω) = −∞ ,     lim_{ω↑+∞} X(ω) = +∞ .

Then X−1(−∞, a] = (−∞, X−1(a)]. Since every interval lies in B(R), X is a r.v.

Remark. There are many possible variations on this example. If we merely assumethat X is nondecreasing (as opposed to strictly increasing) the inverse images are againof the form (−∞, b] for suitable values of b. If we remove the constraint that the limits asω → ±∞ be infinite, we admit ∅ and R as possible inverse images. Finally, if we drop thecontinuity assumption, then the inverse images will take the form (−∞, b] or (−∞, b). Inall these cases, X remains a r.v. on (Ω,F).

11. The sigma field generated by a random variable

Billingsley, Sec. 5, Definition; Sec. 13, Measurable Mappings, Mappings into Rk; Sec. 20,Subfields.

As noted in the previous section, we can think of the random variable X as a mea-surement taken in the course of the random experiment (Ω,F , P ). Thus if the outcome ofthe experiment is ω, we can observe X(ω) with the aid of a measuring instrument MX .

In introducing the important concept of the σ-field generated by X, we will assumethat the instrument MX can be set in an infinity of observation modes indexed by thereal numbers a and denoted by MX,a. In mode MX,a, the instrument determines whetherX(ω) ≤ a or X(ω) > a; that is, it decides which of the complementary events

X−1(−∞, a] = { ω : X(ω) ≤ a }   and   X−1(a,∞) = { ω : X(ω) > a }

occurs. (Note that the above subsets of Ω are events by definition of the random variable;this is also consistent with our earlier interpretation of an event as a set of outcomes whoseoccurrence can be determined.)


The class of events observable by MX is not limited to inverse images of intervals.As in our earlier discussion of the event space, we may assume that this class containsthe empty set and the sample space, and is closed under complementation and countableunions, i.e., it is a σ-field. Since the “nominal” events observable by MX are the inverseimages of the intervals (−∞, a], it is reasonable to postulate that the σ-field associatedwith the instrument MX is the smallest σ-field that contains these events.

Definition.

Let X be a random variable on (Ω,F, P). We denote by σ(X) the σ-field generated by the events {ω : X(ω) ≤ a} as a varies over the real line. Thus

σ(X)  def=  σ( { X−1(−∞, a] : a ∈ R } ) .

σ(X) is referred to as the σ-field generated by X.

Corollary. σ(X) ⊂ F .

To see this, note that the generating collection

G = { X−1(−∞, a] : a ∈ R }

is contained in F . Thus σ(X) = σ(G) ⊂ F .

Examples

Consider the three examples of the previous section.

(i) (Ω,F) arbitrary, A ∈ F , X = IA. We have seen that the inverse image X−1(−∞, a]is one of the three sets

∅ , Ac , Ω .

Thus

σ(X) = σ({∅, Ac, Ω}) = {∅, Ω, A, Ac} ,

and IA allows us to determine the occurrence of a single nontrivial event, namely A.

(ii) (Ω,F) is again arbitrary, and

X = ∑_{j=1}^{∞} cj IEj ,

where the cj ’s are distinct and the events Ej form a countable partition of Ω. In this casethe inverse images satisfy

X−1(−∞, a] = ⋃_{j∈Ja} Ej


for Ja = {j : a ≥ cj}. Thus every inverse image lies in the σ-field generated by the partition E1, E2, . . ., and hence

σ(X) ⊂ σ(E1, E2, . . .) .

We claim that the reverse inclusion is also true. To show this, it suffices to prove that every atom Ej of the partition lies in σ(X). Noting that (by distinctness of the cj's) Ej = X−1{cj}, and that

{cj} = (−∞, cj ] − ⋃_{n=1}^{∞} (−∞, cj − n^−1] ,

we obtain

Ej = X−1(−∞, cj ] − ⋃_{n=1}^{∞} X−1(−∞, cj − n^−1] .

Since the above expression involves countably many operations on inverse images of inter-vals (−∞, a], we have that Ej ∈ σ(X). Hence we conclude that

σ(X) = σ(E1, E2, . . .) .

Thus the σ-field of a discrete r.v. is the σ-field generated by the corresponding countablepartition.

(iii) (Ω,F) = (R,B(R)), and X(·) is continuous, strictly increasing and unboundedboth from above and below. In this case an inverse function X−1 exists, and its rangeequals R. Then

X−1(−∞, a] = (−∞, X−1(a)]

and since X−1(a) takes all possible real values, we have

{ X−1(−∞, a] : a ∈ R } = { (−∞, a] : a ∈ R } .

As the collection on the r.h.s. generates the Borel field of the real line, we have

σ(X) = B(R) = F .

Thus the r.v. X is completely informative about the underlying experiment: the occurrenceor not of any event can be determined using X.

Remark. As it turns out, the above result is true even without assuming that X(·)is continuous and unbounded. However, it is essential that X(·) be strictly increasing. Ifit were not, e.g., if X(·) were constant over an interval (c, d), then the instrument MX

would not be able to distinguish between outcomes lying in that interval. Thus no propersubset of (c, d) could lie in σ(X), and σ(X) 6= B(R).

31

Page 32: random process.pdf

An alternative, somewhat more direct, representation of σ(X) exists by virtue of thefollowing theorem.

Theorem. The σ-field generated by a random variable is the collection of all inverseimages of Borel sets on the real line. Thus if X is a r.v. on (Ω,F , P ),

σ(X) = X−1(H) : H ∈ B(R) .

To prove this theorem, we use the following simple lemma.

Lemma. Let Ω be arbitrary, and f a real-valued function on Ω. Then(i) if B is a σ-field of subsets of R, the collection f−1(H) : H ∈ B is a σ-field of

subsets of Ω;(ii) if A is a σ-field of subsets of Ω, the collection H ⊂ R : f−1(H) ∈ A is a σ-field

of subsets of R.

The proof of the lemma is straightforward and is left as an exercise. The identityf−1(

⋃i Hi) =

⋃i f−1(Hi) is useful in establishing closure under countable unions.

Proof of Theorem. For convenience let L = X−1(H) : H ∈ B(R).L ⊃ σ(X): Since B(R) is a σ-field of subsets of R, the collection L is a σ-field of

subsets of Ω by statement (i) of the above lemma. Since all intervals lie in B(R), we havethat

L ⊃

X−1(−∞, a] : a ∈ R

,

and hence L also contains the σ-field generated by the collection on the r.h.s. ThusL ⊃ σ(X).

L ⊂ σ(X): LetH = H ⊂ R : X−1(H) ∈ σ(X).

This collection is a σ-field by statement (ii) of the above lemma. Furthermore, it containsevery interval (−∞, a] since X−1(−∞, a] ∈ σ(X). Hence H also contains the σ-fieldgenerated by such intervals, namely the Borel field B(R) of the real line. Thus X−1(H) ∈σ(X) for every Borel set H, or equivalently, L ⊂ σ(X).

The above characterization of σ(X) conforms with our intuition about event spacesand random variables. Indeed, the class of subsets H of the real line for which X ∈ H canbe tested should be a σ-field. One would also expect this σ-field to be generated by thesets that are directly observable using the instrument MX , namely the intervals (−∞, a].Thus we obtain the Borel field, and infer that the class of events in F that can be observedusing X or MX is

X−1(H) : H ∈ B(R) .

This is precisely the statement of the above theorem.

Remark. Note that in Example (ii), the fact that Ej ∈ σ(X) follows directly fromthe above theorem by noting that cj ∈ B(R).

32

Page 33: random process.pdf

Equivalent definition of random variable. From the above theorem it followsthat X : Ω 7→ R is a r.v. on (Ω,F) if and only if X−1(H) ∈ F for every H ∈ B(R).

12. Operations on random variables

Billingsley, Sec. 13, Measurable Mappings, Mappings onto Rk.

Convention. We shall reserve the term random variable for functions defined ona measurable space (Ω,F) which is part of a probability space (Ω,F , P ). For functionswhich satisfy the definition of the random variable as given in Section 10, but which arenot defined on a measurable space associated with a probability experiment, we will usethe term measurable function.

Operations on finite sets of random variables

1. Every piecewise continuous real-valued function on R is a measurable function on(R,B(R)) (proof of this fact can be found in Billingsley). Thus most real-valued functionson R encountered in practice are measurable functions.

2. If X is a r.v. on (Ω,F) and g is a measurable function on (R,B(R)), then thecomposition Y = g X defined by

Y (ω) = (g X)(ω) = g(X(ω))

is also a random variable on (Ω,F).

To prove this, we show that Y −1(H) ∈ F for every Borel set H. Indeed,

Y −1(H) = ω : g(X(ω)) ∈ H = ω : X(ω) ∈ g−1(H)= X−1(g−1(H)) .

Since g is a measurable function on (R,B(R)), the set g−1(H) is a Borel set; since X is ar.v. on (Ω,F), the set X−1(g−1(H)) is an event.

Thus for example, if X is a r.v. on (Ω,F), so are the functions αX, eX and I[a,b](X).

3. (without proof) If X and Y are r.v.’s on the same measurable space, then so areX + Y , XY and max(X,Y ).

Thus if X1, . . . , Xn are random variables on the same probability space, and

g : Rn 7→ R

is a function composed by a finite number of additions, multiplications and maximizations,then

Y (ω) = g(X1(ω), . . . , Xn(ω))

defines yet another random variable on (Ω,F). As we shall see later, this property extendsto most scalar-valued functions on Rn encountered in practice.

33

Page 34: random process.pdf

Operations on sequences of random variables

Suppose X1, X2, . . . is a sequence of random variables on (Ω,F). If

g : RN 7→ R ,

then we can define a function Y (·) on Ω by

Y (ω) = g(X1(ω), X2(ω), . . .) .

Operations such as the above on sequences of r.v.’s often yield infinite values. Forexample, if we take

Y (ω) =∞∑

i=1

|Xi(ω)| ,

then it is conceivable that Y (ω) will be infinite for some ω ∈ Ω. Thus it seems desirable toextend the definition of a random variable to functions that take ±∞ as values, providedthese values are “observable” according to the interpretation of the previous section. Wedo so as follows.

DefinitionAn extended random variable on (Ω,F) is a function Y on Ω into R∪−∞,+∞ =

[−∞, +∞] such that the set

Y −1[−∞, a] = ω : Y (ω) ≤ a

lies in F for every a ∈ [−∞,∞].

An equivalent definition is the following: Y is an extended r.v. on (Ω,F) if there existsa r.v. X on (Ω,F) and events A+ and A− in F such that

X(ω) =

+∞ if ω ∈ A+ ;−∞ if ω ∈ A− ;X(ω) if ω ∈ (A+ ∪A−)c .

For most transformations g of interest, the function Y = g(X1, X2, . . .) will be arandom variable. We show this for the important case in which Y is the supremum (orinfimum) of the random variables X1, X2, . . ..

Digression on suprema and infima

We say that a nonempty set S of real numbers is bounded from above if thereexists a number b ∈ R such that every element in S is less than or equal to b. Such b iscalled an upper bound of the set S.

Example. S1 = (−∞, 1) is bounded from above by 2, which is an upper bound ofS1. The set S2 = N is not bounded from above.

34

Page 35: random process.pdf

Axiom. Every nonempty set S ⊂ R that is bounded from above has a least upperbound (l.u.b.).

Based on this axiom, we define the supremum of a nonempty set S as follows:

sup S =

l.u.b. of S, if S is bounded from above;+∞, otherwise.

Listed below are a few elementary properties of the supremum:1. If sup S ∈ S, then supS = max S.2. If S is finite, then sup S ∈ S. If S is infinite, then sup S may or may not lie in S.

For example,sup(0, 1] = 1 ∈ S , sup(0, 1) = 1 /∈ S .

3. If S is bounded from above, then sup S is the unique real number t with theproperty that for every a < t < b,

(a, t] ∩ S 6= ∅ , (t, b) ∩ S = ∅ .

4. For all a ∈ R, the following equivalence is true:

sup S ≤ a ⇔ (∀x ∈ S) x ≤ a .

If S comprises the terms of an infinite sequence x1, x2, . . ., it is customary to writesupn xn for sup S.

The notion of the infimum can be developed along parallel lines. The conceptsbounded from below and greatest lower bound should be transparent; we let inf S equalthe greatest lower bound if it exists, −∞ otherwise. Counterparts of properties (1–4) abovecan be obtained by noting that

inf S = − sup(−S) .

End of Digression.

We now claim that the function Y (·) defined by

Y (ω) = supn

Xn(ω)

is an extended random variable on the space (Ω,F). To see this, consider the inverseimages Y −1[−∞, a] for all finite and infinite values of a. Clearly

Y −1−∞ = ∅ , Y −1[−∞,∞] = Ω .

35

Page 36: random process.pdf

For a finite, we use the equivalence stated in (4) above. Thus

Y −1[−∞, a] = ω : supn

Xn(ω) ≤ a = ω : (∀n) Xn(ω) ≤ a

=⋂

n≥1

ω : Xn(ω) ≤ a =⋂

n≥1

X−1n (−∞, a] .

Since each Xn is a random variable, the final expression is a countable intersection ofevents, hence also an event. This concludes our proof.

A similar argument establishes that infn Xn is also an extended random variable.One can proceed from this point to show that the set C of sample points ω over whichlimn Xn(ω) exists is an event, and the function thus defined on C is an extended r.v. onthe restriction of (Ω,F) on C.

The outline of the proof is as follows: The set C consists of those ω for which

supn

infk≥n

X(ω) = infn

supk≥n

X(ω) ,

the common value being equal to limn X(ω). By our earlier result on suprema and infima,both sides of the above equation define extended r.v.’s on (Ω,F), and thus the set C is anevent. Also, for all a ∈ [−∞,+∞], we have the equality

ω : limn

Xn(ω) ≤ a = C ∩ supn

infk≥n

X(ω) ≤ a ,

which implies that the above set is an event. In this sense, the mapping ω 7→ limn Xn(ω)defines an extended r.v. on the restriction of (Ω,F) on C.

13. The distribution of a random variable

Gray & Davisson, pp. 75–84.Billingsley, Sec. 12, Specifying Measures on the Line; Sec. 14, Distribution Functions; Sec.20, Distributions.

In continuing our discussion of random variables, we bring probability measures intoplay. Indeed, the most marked characteristic of a random variable X on (Ω,F , P ) is thatit induces a probability measure on (R,B(R)) in the following fashion.

By the definition of random variable, the inverse image of any Borel set H under Xis an event in F . Thus if we let

PX(H) def= P (X−1(H)) = Pω : X(ω) ∈ H ,

we obtain a well-defined set function PX on B(R).We claim that PX is a probability measure on (R,B(R)). Nonnegativity is self-evident,

and normalization follows from

PX(R) = Pω : X(ω) ∈ R = P (Ω) = 1 .

36

Page 37: random process.pdf

For countable additivity, consider a sequence H1,H2, . . . of pairwise disjoint Borel sets. Itis easy to see that the inverse images under X will also be pairwise disjoint. Recalling that

X−1

(⋃

i

Hi

)=

i

X−1(Hi) ,

and that P is countably additive, we obtain

PX

(⋃

i

Hi

)= P

[X−1

(⋃

i

Hi

)]= P

[⋃

i

X−1(Hi)]

=∑

i

P(X−1(Hi)

)=

i

PX(Hi) .

Definition

The distribution of a random variable X on (Ω,F , P ) is the probability measure PX

on (R,B(R)) defined by

(∀H ∈ B(R)) . PX(H) = P(X−1(H)

)

The interpretation of PX is clear if we associate the random variable X with a sub-experiment of (Ω,F , P ). The sample space for this sub-experiment is the real line, theevents are the Borel sets, and the probabilities of these events are given by the distributionof PX . Thus in describing the random measurement X, we can restrict ourselves to theprobability space (R,B(R), PX).

Specification of distributions

The cumulative distribution function (cdf) FX of the random variable X on(Ω,F , P ) is defined for all x ∈ R by

FX(x) def= PX(−∞, x] = Pω : X(ω) ≤ x .

FX has three essential properties:

(D1) FX(x) is nondecreasing in x;

(D2) limx↑∞ FX(x) = 1, limx↓−∞ FX(x) = 0;

(D3) FX is everywhere right continuous: (∀x) limε↓0 FX(x + ε) = FX(x).

To show (D1), we note that

x < x′ ⇒ (−∞, x] ⊂ (−∞, x′] ⇒ PX(−∞, x] ≤ PX(−∞, x′] .

37

Page 38: random process.pdf

Monotone continuity of PX yields (D2) and (D3) as follows:

limx→∞

FX(x) = limn→∞

PX(−∞, n] = PX

( ∞⋃n=1

(−∞, n])

= PX(R) = 1 ,

limx→−∞

FX(x) = limn→∞

PX(−∞,−n] = PX

( ∞⋂n=1

(−∞,−n])

= PX(∅) = 0 ,

limε↓0

FX(x + ε) = limn→∞

PX(−∞, x + n−1] = PX

( ∞⋂n=1

(−∞, x + n−1])

= PX(−∞, x] .

It is worth noting that X is not always left continuous: if we evaluate the quantitylimε↓0 FX(x− ε) in the above fashion, we obtain

limε↓0

FX(x− ε) = limn→∞

PX(−∞, x− n−1] = PX

( ∞⋃n=1

(−∞, x− n−1])

= PX(−∞, x) .

Thuslimε↓0

FX(x− ε) = FX(x)− PXx ,

and FX is continuous at x (both from the left and the right) if and only if PXx = 0.

The general form of FX is illustrated below. The magnitude of each jump is given bythe probability of the corresponding abscissa.

Conditions (D1–3) are clearly necessary for a function to be a cumulative distributionfunction. As it turns out, they are also sufficient, in that any function possessing theseproperties is the cdf of some random variable. This is by virtue of the following theorem.

Theorem. Let F be a function on the real line satisfying conditions (D1–3). Thenthere exists a measure P on (R,B(R)) such that

P (−∞, a] = F (a)

for every a ∈ R. Furthermore, there exists on some probability space (Ω,F , P ) a randomvariable X such that PX ≡ P , FX ≡ F .

38

Page 39: random process.pdf

To prove this theorem, we follow a procedure similar to the construction of theLebesgue measure in Section 9. Briefly, we consider the field F0 consisting of finite disjointunions of bounded semi-open intervals on the real line. We define a set function Q on F0

by lettingQ(∅) = 0 , Q(c, d] = F (d)− F (c) ,

and extending Q to all members of F0 in the obvious manner. It is possible to show that Qis countably additive on F0 and thus possesses a unique extension to a probability measureP on σ(F0) = B(R). Then for all a ∈ R,

P (−∞, a] = limn→∞

P (−n, a] = Q(a)− limn→∞

Q(−n) = Q(a) .

The second statement of the theorem follows immediately from the above constructionif we take (R,B(R)) as the probability space, with X being the identity transformationX(ω) = ω. Indeed for every Borel set H,

PX(H) = Pω : X(ω) ∈ H = Pω : ω ∈ H = P (H) ,

and thus also FX ≡ F .

Thus there is a one-to-one correspondence between probability measures on the realline and functions satisfying (D1–3). Any such measure is completely specified by its cdf,and conversely, any function possessing these three properties defines a probability measureon the real line.

The Lebesgue measure on the unit interval can be obtained from the cdf

F (x) =

1, if x > 1;x, if 0 < x ≤ 1;0, if x ≤ 0.

Indeed, the corresponding measure P is such that

P (0, 1] = F (1)− F (0) = 1 ,

and for every (a, b] ⊂ (0, 1],

P (a, b] = F (b)− F (a) = b− a .

Thus the restriction of P on the Borel subsets of the unit interval is identical to theLebesgue measure on that interval.

Notation. We will also use the symbols µ and ν to denote measures on the real lineor subsets thereof.

Decomposition of Distributions

Any cdf F can be decomposed into cdf’s of three distinct types: discrete, absolutelycontinuous and singular. By this we mean that we can write

F ≡ α1F1 + α2F2 + α3F3 ,

39

Page 40: random process.pdf

where

α1, α2, α3 ≥ 0 , α1 + α2 + α3 = 1 ,

and the functions F1, F2 and F3 have the following properties.

F1 is a discrete cdf. A discrete cdf corresponds to a measure which assigns probabilityone to a countable set c1, c2, . . . of real numbers. It is easily seen that all discrete randomvariables have discrete cdf’s. Assuming that the probabilities of the ci’s are given by thepdf (pi)i∈N, we can write

F1(x) =∑

i:ci≤x

pi =∑

i∈N

piI[ci,∞)(x) ,

and thus F1 is the countable sum of weighted step functions.

F2 is an absolutely continuous cdf. An absolutely continuous cdf has no jumps. Itsdefining property is

(∀x) F2(x) =∫ x

−∞f(t)dt

for a nonnegative function f called the probability density function, or pdf, or simplydensity of F2.

Remark. The above integral can be interpreted in the usual (Riemann) sense; thisis certainly adequate from the computational viewpoint, as most densities encountered incalculations are piecewise continuous, hence Riemann integrable. For such densities, it isalso true that

F ′2(x) = f(x)

at all points of continuity of f ; furthermore, the values taken by f at its points of discon-tinuity do not affect the resulting cdf F2. Thus in most cases of interest, the density of anabsolutely continuous cdf is obtained by simple differentiation.

F3 is a singular cdf. Singular (or more precisely, continuous and singular) cdf’s areseldom encountered in probability models. Such functions combine the seemingly incom-patible features of continuity, monotone increase from 0 to 1, and almost everywhereexistence of a derivative that is equal to zero. The last property is interpreted by sayingthat any interval of unit length has a Borel subset A of full (i.e., unit) measure such thatF ′3(x) = 0 on A. Examples of such unusual functions can be found in Billingsley (underSingular Functions, Sec. 31) and various textbooks on real analysis (under Cantor ternaryfunction). We shall not concern ourselves with explicit forms of singular cdf’s.

The general forms of F1, F2 and F3 is illustrated below; unfortunately, the microscopicirregularities of F3 are not discernible on graphs of finite precision and resolution.

40

Page 41: random process.pdf

Before looking at a numerical example, we note that the above decomposition of cdf’s alsoholds for probability measures on the real line. Thus if µ is the measure on (R,B(R))corresponding to the cdf F , we can write

µ ≡ α1µ1 + α2µ2 + α3µ3 ,

where for every i, µi corresponds to Fi.

Example. Let (Ω,F , P ) = (R,B(R)), where the measure µ has the Laplace density

f(x) =12e−|x| (x ∈ R) .

Define the random variable X on (Ω,F , P ) by X(ω) = ω2I[0,∞)(ω).

Since X is nonnegative, we have for x < 0

FX(x) = 0 .

For x ≥ 0 we have X(ω) ≤ x ⇔ ω ≤ √x, and thus

FX(x) = µ(−∞,√

x]

=∫ √

x

−∞f(t)dt = = 1− 1

2

∫ ∞

√x

e−tdt = 1− 12e−√

x .

41

Page 42: random process.pdf

Thus

FX(x) =(

1− 12e−√

x

)I[0,∞)(ω) =

12F1(x) +

12F2(x) ,

whereF1(x) = I[0,∞)(x) , F2(x) = (1− e−

√x)I[0,∞)(x) .

F2 is differentiable everywhere except at the origin:

F ′2(x) = 0 (x < 0) , F ′2(x) =1

2√

xe−√

x (x > 0) .

Thus we may take the density of F2 as

f2(x) =e−√

x

2√

xI(0,∞)(x) .

14. Random vectors

Gray & Davisson, pp. 84-95Billingsley, Sec. 12, Specifying Measures in Rk; Sec. 13, Mappings into Rk; Sec. 20, Ran-dom Variables and Vectors, Distributions.

A k-dimensional random vector on (Ω,F , P ) is an ordered k-tuple of random vari-ables on the same probability space. Thus any k random variables X1, . . . , Xk define arandom vector

X = (X1, . . . , Xk) ,

and X defines a mapping on Ω into Rk through

X(ω) = (X1(ω), . . . , Xk(ω)) .

The figure below illustrates a three-dimensional random vector X = (X1, X2, X3). Atthe particular sample point ω shown, X(ω) = (a1, a2, a3).

42

Page 43: random process.pdf

To interpret the concept of a k-dimensional random vector X, we follow our earlierdevelopment on random variables. We postulate the existence of an instrument MX thatcan be set in an infinity of modes MX,a indexed by

a = (a1, . . . , ak) ∈ Rk .

For any such a, it is possible to determine whether or not

X1(ω) ≤ a1 , . . . , Xk(ω) ≤ ak

simultaneously, or equivalently, whether or not X ∈ C, where

C = (−∞, a1]× · · · × (−∞, ak] .

The corresponding event in F is

X−1(C) = ω : X1(ω) ≤ a1, . . . , Xk(ω) ≤ ak .

A set C ⊂ Rk of the above form will be called a k-dimensional lower rectangle.In many respects, lower rectangles are to random vectors what intervals (−∞, a] are torandom variables (in the one-dimensional case, all lower rectangles are intervals of thatform). This will become transparent in what follows.

The sigma-field generated by a random vector

For a k-dimensional random vector X on (Ω,F , P ), we define σ(X) as the sigma-fieldgenerated by the directly observable events, i.e., the inverse images of lower rectangles.This is in direct analogy to our definition of σ(X) in the one-dimensional case. Thus

σ(X) def= σ(X−1(Ca) : a ∈ Rk

),

where Ca denotes the lower rectangle with vertex at the point a = (a1, . . . , an):

Cadef= (−∞, a1]× · · · × (−∞, ak] .

43

Page 44: random process.pdf

Recall that in the one-dimensional case, a direct representation for σ(X) was obtainedin terms of the inverse images of all Borel sets in R, or sets “generated” by intervals(−∞, a]. The same argument can be applied in the k-dimensional case to show thatσ(X) consists of the inverse images of all Borel sets in Rk, i.e., sets “generated” by lowerrectangles Ca. We therefore have

σ(X) = X−1(H) : H ∈ B(Rk) ,

where B(Rk) is the Borel field of the k-dimensional Euclidean space:

B(Rk) def= σCa : a ∈ Rk .

The collection B(Rk) contains most sets of interest in Rk. In particular, smoothshapes and surfaces associated with continuous mappings on Rk−j into Rj are Borel sets.In the two-dimensional case, for example, most “nice” sets are Borel because they areexpressible in terms of countably many operations on rectangles (a1, b1] × (a2, b2]; theseare Borel since

(a1, b1]× (a2, b2] =(

Cb1b2 − Cb1a2

)∩

(Cb1b2 − Ca1b2

).

This is illustrated in the figure below.

The distribution of a random vector

The distribution of the random vector X = (X1, . . . , Xk) on (Ω,F , P ) is the proba-bility measure PX on (Rk,B(Rk)) defined by the following relationship:

PX(H) def= P (X−1(H)) = Pω : X(ω) ∈ H .

That PX is indeed a probability measure is easily established as in the one-dimensionalcase. In describing the sub-experiment that entails observation of the random vector X,it suffices to consider the probability space (Rk,B(Rk), PX).

The distribution PX yields a cumulative distribution function FX, which gives,for every point x = (x1, . . . , xk), the probability of the lower rectangle with vertex at thatpoint:

FX(x) def= PX(Cx) = Pω : X1(ω) ≤ x1, . . . , Xk(ω) ≤ xk .

44

Page 45: random process.pdf

We also call FX the joint cdf of X1, . . . , Xk, and use the alternative forms

F(X1,...,Xk) ≡ FX1...Xk≡ FX .

A j-dimensional marginal cdf of X is the cdf of any j-dimensional subvector of X,i.e., any random vector obtained from X by eliminating k − j of its constituent randomvariables. Any marginal cdf can be easily computed from FX using the formula

FX1...Xj(x1, . . . , xj) = lim

xj+1↑∞,...,xk↑∞FX1...Xk

(x1, . . . , xk) .

This is so because given any k− j sequences (a(j+1)n)n∈N,. . . , (a(k)

n)n∈N that increase toinfinity with n, we can write

ω : X1(ω) ≤ x1, . . . , Xj(ω) ≤ xj =

= limnω : X1(ω) ≤ x1, . . . , Xj(ω) ≤ xj , Xj+1 ≤ a(j+1)

n , . . . , Xk ≤ a(k)n

The sought result then follows by monotone continuity of P .

Counterparts of properties (D1–3) are easily obtained by re-interpreting inequalitiesand limits componentwise. Thus

(1) if xi < yi for all i, then FX(x) ≤ FX(y).(2) FX approaches unity when all arguments jointly increase to +∞; it approaches

zero when one or more arguments decrease to −∞;(3) if yi ↓ xi for all i, then FX(y) ↓ FX(x).

It is worth mentioning that unlike the one-dimensional case, the above conditions arenot sufficient for a k-dimensional cdf if k > 1. One needs to strengthen (1) so as to ensurethat every rectangle

(a1, b1]× · · · × (ak, bn]

is assigned nonnegative probability; in its present form, (1) only guarantees this for properdifferences of lower rectangles. Details on the needed modifications and proofs are given inBillingsley, Sec. 18. For our purposes, it suffices to note that any k-dimensional distributionis completely specified by its cdf.

We say that FX is discrete if the measure PX on (Rk,B(Rk)) assigns probability1 to a countable set in Rk; this is certainly true if the vector X is itself discrete. Thegeneral form of FX can be obtained as in the one-dimensional case, and can be written asa weighted sum of countably many indicator functions of upper rectangles

[a1,∞)× · · · × [ak,∞) .

FX is absolutely continuous if it possesses a nonnegative pdf fX = fX1...Xksuch

that for all x ∈ Rk,

FX1...Xk(x1, . . . , xk) =

∫ xk

−∞. . .

∫ x1

−∞fX1...Xk

(t1, . . . , tk) dt1 · · · dtk .

45

Page 46: random process.pdf

By our earlier result, the marginal cdf for the first j components is given by

FX1...Xj(x1, . . . , xj) =

tk=∞∫

tk=−∞. . .

tk=∞∫

tj+1=−∞

tj=xj∫

tj=−∞. . .

t1=x1∫

t1=−∞fX1...Xk

(t1, . . . , tk) dt1 · · · dtk .

It follows that FX1...Xjis itself absolutely continuous with density

fX1...Xj=

∫ tk=∞

tk=−∞. . .

∫ tj+1=∞

tj+1=−∞fX1...Xk

(t1, . . . , tk) dtj+1 · · · dtk .

In most practical cases, a pdf f of an absolutely continuous F can be obtained bytaking a mixed partial derivative of F where such derivative exists, and setting f equal to0 elsewhere (usually on a simple boundary A ⊂ Rk):

f(x1, . . . , xk) =∂kF (x1, . . . , xk)

∂x1 · · · ∂xkIAc(x1, . . . , xk) .

As in the one-dimensional case, it is possible to decompose a k-dimensional distribu-tion F into components F1 (discrete), F2 (absolutely continuous) and F3 (singular). Oneimportant difference here is that for k > 1, the singular component F3 need not be acontinuous function if k > 1.

To see an instance of this, consider the space((0, 1],B((0, 1]), µ

)with µ = Lebesgue

measure, and let X1(ω) = ω, X2(ω) = 0. It is easily shown that the cdf of the randomvector (X1, X2) is given by

FX1X2(x1, x2) =(

x1I(0,1](x1) + I(1,∞)(x1))

I[0,∞)(x2) .

This function is illustrated below, where the discontinuity along the half-line x1 ≥ 0,x2 = 0, becomes evident.

46

Page 47: random process.pdf

The mixed partial derivative ∂2FX1X2(x1, x2)/∂x1∂x2 is zero everywhere except on ahandful of half-lines. Now lines on a plane are as paltry as points on a line: if we definea Lebesgue measure on the unit square by assigning to each rectangle a measure equal toits area, then any line segment contained in that square will have probability zero. Thus∂2F (x1, x2)/∂x1∂x2 is “almost everywhere” zero, and by extrapolating from our definitionof singularity in the one dimensional-case, we can say that FX1X2 is singular.

It is worth noting here that the discontinuity of FX1X2 in the above example is some-what coincidental: the corresponding measure PX1X2 is uniformly distributed over aninterval of unit length that happens to be aligned with one of the axes. A simple rotationof that interval can remove this discontinuity. Thus if on the above probability space wedefine

X3(ω) = X4(ω) = ω/√

2 ,

then PX3X4 is uniform over an interval of unit length, yet

FX3X4(x3, x4) =(√

2x3 ∧√

2x4 ∧ 1)∨ 0 .

(Here “∧” denotes minimum and “∨” maximum.) This function is clearly continuouseverywhere.

In conclusion, we should note that discrete and singular measures on Rk share oneessential feature: they assign probability one to Borel sets whose intersection with anyk-dimensional unit cube has zero Lebesgue measure, or “volume.” The key difference isthat discrete measures are concentrated on (countably many) points, whereas singular onesare not. This raises a question as to why singletons alone should receive special treatmentin higher-dimensional spaces (in the case k = 3, for example, one could also distinguishbetween singular measures concentrated on curves and ones concentrated on surfaces). It isdifficult to give a satisfactory answer; in fact, in the mathematical literature, all measuresassigning unit probability to sets of Lebesgue measure zero are termed singular, and thusdiscrete measures are subsumed under singular ones.

15. Independence

Gray & Davisson, pp. 95–99Billingsley, Sec. 4, Independent Events; Sec. 20, Independence.

Independent events

The concept of independence stems from elementary considerations of conditionalprobability. Recall that if A and B are two events in a probability space (Ω,F , P ) suchthat P (A) > 0, then the conditional probability of B given A is defined by

P (B|A) =P (B ∩A)

P (A).

Independence of A and B is defined by

P (A ∩B) = P (A)P (B) .

47

Page 48: random process.pdf

This is easily seen to imply P (B|A) = P (B) if P (A) > 0, and P (A|B) = P (A) if P (B) > 0.Thus intuitively, two events are independent if knowledge of occurrence of either event doesnot affect the assessment of likelihood of the other.

A finite collection A1, . . . , An of events in (Ω,F , P ) is independent if for every setof distinct indices i1, . . . , ir⊂ 1, . . . , n, the following is true:

P (Ai1 ∩ · · · ∩Air) = P (Ai1) · · ·P (Ain

) .

It is worth noting that independence is not affected if we replace one or more events inthe collection by their complement(s). To see this, assume independence of the abovecollection and write

P (Ai1 ∩ · · · ∩Air ) = P (Ai1) · · ·P (Ain) ,

P (Ai2 ∩ · · · ∩Air ) = P (Ai2) · · ·P (Ain) .

Upon subtraction, we obtain

P (Aci1 ∩Ai2 ∩ . . . ∩Air ) = P (Ac

i1)P (Ai2) · · ·P (Ain) ,

and thus independence will still hold if we replace Ai1 by Aci1

.

The above observation leads to the following alternative (albeit hardly more econom-ical) definition of independence: A1, . . . , An are independent if and only if

P (B1 ∩ · · · ∩Bn) = P (B1) · · ·P (Bn)

for every choice of events Bi such that Bi = Ai or Bi = Aci . The “only if” part was

established above. To see that this condition implies independence (the “if” part), we canshow that the above product relationship is true for all choices of n − 1 events Bi, andthen proceed to n− 2, n− 3, etc. (or use induction for brevity).

Remark. Independence of a collection of events implies pairwise independence ofevents in the collection:

A1, . . . , An independent ⇐⇒ (∀i 6= j) Ai, Aj independent .

It is not difficult to show by counterexample that the reverse implication is not true. Thusit should be borne in mind that independence of a collection is equivalent to independenceof all its subcollections; there is nothing special about subcollections of size two.

Independent random variables and vectors

A finite collection of random variables X1, . . . , Xn on the same probability space(Ω,F , P ) is independent if for every choice of linear Borel sets Hi, . . . , Hn,

PX1...Xn(H1 × · · · ×Hn) = PX1(H1) · · ·PXn(Hn) ,

48

Page 49: random process.pdf

or equivalently,

P(X1

−1(H1) ∩ · · · ∩Xn−1(Hn)

)= P (X1

−1(H1)) · · ·P (Xn−1(Hn)) .

Put differently, X1, . . . , Xn is independent if for every choice of A1 ∈ σ(X1), . . ., An ∈σ(Xn), the collection A1, . . . , An is itself independent (convince yourselves!).

Independence of random variables implies that their joint cdf can be written in aproduct form:

FX1...Xn(x1, . . . , xn) = FX1(x1) · · ·FXn

(xn) . (1)

This is seen by substituting Hi = (−∞, xi] in the above definition.

If each cdf FXiis absolutely continuous with pdf fXi

, the above product relationship(1) can be written in the equivalent form

fX1...Xn(x1, . . . , xn) = fX1(x1) · · · fXn(xn) . (2)

To see this, consider the integral of the n-variate function on right-hand side over the lowerrectangle with vertex at (x1, . . . , xn):

∫ xn

−∞. . .

∫ x1

−∞fX1(t1) · · · fXn(tn) dt1 · · · dtn =

(∫ x1

−∞fX1(t1) dt1

)· · ·

(∫ xn

−∞fX1(tn) dtn

)

= FX1(x1) · · ·FXn(xn) .

Thus if (1) is true, the joint pdf can be taken to be the product of the marginal pdf’s.Conversely, if (2) is true, the joint cdf is the product of the marginal cdf’s.

For discrete r.v.’s, we can write a similar product relationship for the pmf’s by takingthe sets Hi to be singletons. Thus if x1, . . . , xn are points in the range of X1, . . . , Xn, then

PX1...Xn

(x1, . . . , xn)

= PX1x1 · · ·PXnxn . (3)

The equivalence of (1) and (3) can be established as above; integrals are simply replacedby sums.

In the foregoing discussion we saw that the product relationship (1) for cdf’s is neces-sary for independence. As it turns out, it is also sufficient; thus to establish independenceof X1, . . . , Xn, it suffices to show that

FX1...Xn(x1, . . . , xn) = FX1(x1) · · ·FXn(xn) .

This equivalence is proved with the aid of the following important fact

Theorem. Let B be a fixed event and A a collection of events that is closed underintersection and such that

P (A ∩B) = P (A)P (B)

49

Page 50: random process.pdf

for every A in A. Then the same product relationship also holds for every A in σ(A).

For a proof of the above statement, see Billingsley, Sec. 4. We apply it by consideringthe collection

A = X−11 (−∞, x1] : x1 ∈ R ,

which is both closed under intersection and such that σ(A) = σ(X1). We also let

B = X−12 (−∞, x2] ∩ · · · ∩X−1

n (−∞, xn]

for fixed x2, . . . , xn. Assuming that (1) is valid, we can write

FX1...Xn(x1, . . . , xn) = FX1(x1)FX2...Xn(x2, . . . , xn) ,

or equivalently,P (A ∩B) = P (A)P (B) .

By the above theorem, this will also be true for all A ∈ σ(A) = σ(X1), and thus for everyBorel set H1,

PX1...Xn(H1 × Cx2...xn) = PX1(H1)PX2...Xn

(Cx2...xn

).

We continue in the same fashion to replace (−∞, x2] by H2, and so on.

Exercise. Using a similar technique, prove the following intuitively plausible fact: ifthe collection X, Y, Z is independent and g, h are Borel measurable functions on R andR2, respectively, then the collection g(X, Y ), h(Z) is also independent.

The concept of independence can be extended to random vectors by replacing uni-variate distributions by multivariate ones. Thus two random vectors X ∈ Rk and Y ∈ Rl

are independent (or, more precisely, form an independent pair) if

PX,Y(G×H) = PX(G)PY(H)

for all G ∈ B(Rk) and H ∈ B(Rl); equivalently,

FX,Y(x,y) = FX(x)FY(y)

for all x ∈ Rk and y ∈ Rl. Note that this does not imply that FX and FY can each bewritten in product form, i.e., that the individual components of X and Y are independent.On the other hand, by taking suitable limits on both sides of the last relationship, it iseasily seen that any subvector of X and any subvector of Y will form an independent pair.For example, if X = (X1, X2, X3) and Y = (Y1, Y2) are independent, then so are (X1, X2)and Y1; yet neither of the collections X1, X2, X3 and Y1, Y2 need be independent. Theabove definition can be extended to a finite collection of random vectors X(1), . . . ,X(n)in an obvious way, and product relationships for pdf’s and pmf’s can also be established.

50

Page 51: random process.pdf

It is worth noting here that independence of random variables can be used to defineindependence of events:

A1, . . . , An independent ⇐⇒ IA1, . . . , IAn independent .

To see this, choose events B1, . . . , Bn such that Bi is either Ai or Aci ; correspondingly, let

bi = 1 if Bi = Ai, and bi = 0 if Bi = Aci . Then

(∀i) P (Bi) = Pω : IAi(ω) = bi ,

P (B1 ∩ · · · ∩Bn) = P

ω : IA1(ω) = b1, . . . , IAn(ω) = bn

.

Thus A1, . . . , An is an independent collection if and only if the indicator functionsIA1, . . . , IAn satisfy the product relationship (3) above, i.e., they are independent.

An infinite collection of events, random variables, or random vectors is termed inde-pendent if every finite subcollection is independent. Thus a sequence of random variablesX1, X2, . . . is independent if and only if for every value of n, the collection X1, . . . , Xnis independent.

Definition. A collection of random variables Xi (or random vectors X(i) of the samedimension) is identically distributed if every Xi (or every X(i)) has the same distribu-tion. If, in addition, the collection is independent, the collection is termed independentand identically distributed (i.i.d.).

Examples

1. Let X1, . . . , Xn be independent random variables and define Y by

Y (ω) = maxX1(ω), . . . , Xn(ω) .

As we saw in Sec. 12, Y is also a random variable, and

ω : Y (ω) ≤ y = ω : X1(ω) ≤ y, . . . , Xn(ω) ≤ y .

ThusFY (y) = FX1...Xn(y, . . . , y) ,

and by invoking independence, we obtain

FY (y) = FX1(y) · · ·FXn(y) .

(If the variables are also identically distributed with common cdf FX , then FY = (FX)n.

2. Convolution. Let X and Y be independent with absolutely continuous distribu-tions. We are interested in computing the distribution of

Z = X + Y.

51

Page 52: random process.pdf

We have

FZ(z) = PXY (x, y) : x + y ≤ z =∫ ∫

x+y≤z

fXY (x, y) dxdy .

The region of integration is shaded in the figure below.

Using independence, we obtain

FZ(z) =∫ ∞

−∞fX(x)

∫ z−x

−∞fY (y) dy

dx

(y = t− x) =∫ ∞

−∞fX(x)

∫ z

−∞fY (t− x) dt

dx

=∫ z

−∞

∫ ∞

−∞fX(x)fY (t− x) dx

dt

.

Thus FZ is absolutely continuous with density fZ given by the convolution of fX and fY :

fZ(z) =∫ ∞

−∞fX(x)fY (z − x) dx ,

or fZ = fX ∗ fY .

16. Expectation

Gray & Davisson, pp. 130–138.Billingsley, Sections 15, 16, 17; Sec. 21, Expected Value as Integral, Expected Values andDistributions, Independence and Expected Value.Wong, pp. 12–14.

Expectation as integral over (Ω,F , P )

The expectation of a discrete nonnegative random variable X is the weighted averageof the countably many values taken by X, computed using the pmf of X as weighting

52

Page 53: random process.pdf

function. Thus if X takes nonnegative values c1, c2, . . . with probabilities pi = PXci, theexpectation of X is given by

EXdef=

∞∑

i=1

pici =∑

x:PXx>0

xPXx .

If X is defined on some probability space (Ω,F , P ), it is possible to express X in terms ofthe countable partition A1, A2, . . . that it induces:

X(ω) =∞∑

i=1

ciIAi(ω) .

The expectation of X can then be rewritten as

EX =∞∑

i=1

ciP (Ai).

We have often depicted the abstract sample space Ω by an interval of the real line.If we go one step further and replace measure by length, then we can interpret the lastexpression for EX as the “area” under the graph of X(ω); this is illustrated in the figureon the left below.

Note also that “area” and “average height” are identical in the above context, since thebase is of unit “length.”

In order to extend the notion of expectation, or “area,” to an arbitrary nonnegativerandom variable X, we use a discrete approximation: the “area” under X(ω) is the leastupper bound to the “areas” of all discrete random variables that we can fit under X(ω);i.e.,

EX = supEY : 0 ≤ Y ≤ X, Y discrete .

If the EY ’s are unbounded, then EX takes the value +∞. This definition is illustrated inthe previous figure, on the right. We also say that EX equals the Lebesgue integral ofX with respect to P , and we write

EX =∫

X(ω) dP (ω) .

53

Page 54: random process.pdf

The above equation emphasizes the fact that EX is defined through discrete approxima-tions of X on the probability space (Ω,F , P ): expectation is an integral, or “area,” underthe graph of a measurable function, evaluated using a probability measure in lieu of lengthon the horizontal axis. As we shall soon see, it is possible to express EX in terms ofthe distribution PX ; for the moment, however, we can only do this for discrete randomvariables.

It turns out that for every nonnegative random variable X there exists a systematicapproximation of X by a nondecreasing sequence of random variables

X1 ≤ X2 ≤ . . . ≤ X

such thatEX = sup

nEXn = lim

nEXn.

This approximation is obtained by successive truncations and quantizations of X as follows.For every n, we restrict the range space from R to [0, n], and we partition [0, n] into 2n

adjoining intervals, each of length 2−n. If X(ω) falls in one of these intervals, we set Xn(ω)equal to the left endpoint of that interval; otherwise (if X(ω) ≥ n) we set Xn(ω) equal ton. This is illustrated in the figure below.

In short, we write

Xn(ω) =M(n)∑

k=0

c(n)k I

A(n)k

(ω) ,

where M(n) = 2n; c(n)k = k2−n if k ≤ M(n), c

(n)M(n)+1 = +∞; and

A(n)k = X−1

[c(n)k , c

(n)k+1

).

It can be shown (see Billingsley for details) that EX = limn EXn, and thus

EX =∫

X(ω) dP (ω) = limn

M(n)∑

k=0

c(n)k P

(A

(n)k

).

54

Page 55: random process.pdf

Expectation as integral over (R,B(R), PX)

The above approximation is important in that it allows us to express the expectationof X as the Lebesgue integral of a measurable function h ≥ 0 on the space (R,B(R), PX).The function h is the ramp function

h(t) = tI(0,∞)(t) ,

and can be approximated by piecewise constant functions in the same fashion as X.

More precisely,

hn(t) =M(n)∑

k=0

c(n)k I

H(n)k

(t) ,

where M(n) and c(n)k are defined as before, and

H(n)k = h−1

[c(n)k , c

(n)k+1

)=

[c(n)k , c

(n)k+1

).

To see why EX is the integral of h with respect to PX , we write∫

h(t) dPX(t) = limn

∫hn(t) dPX(t)

= limn

M(n)∑

k=0

c(n)k PX

[c(n)k , c

(n)k+1

)

= limn

M(n)∑

k=0

c(n)k P

(A

(n)k

)=

∫X(ω) dP (ω) = EX .

We conclude that the expected value of a random variable X on (Ω,F , P ) can beexpressed solely in terms of the distribution of X:

EX =∫

tI(0,∞)(t) dPX(t) =∫

(0,∞)

t dPX(t) .

55

Page 56: random process.pdf

(Notation. A subscript A in the Lebesgue integral sign denotes multiplication of theintegrand by the indicator function of the measurable set A.)

In most cases of interest, evaluation of the above integral is quite straightforward. Fordiscrete distributions PX , we have already seen that

EX =∫

(0,∞)

t dPX(t) =∑

x:PXx>0

xPXx .

For absolutely continuous PX with piecewise continuous density fX , it can be shown usingthe approximation (hn)n∈N that the Lebesgue integral with respect to PX is given byRiemann integral as follows:

EX =∫

(0,∞)

t dPX(t) =∫ ∞

0

tfX(t) dt .

(Since X is nonnegative, we may replace the lower limit in the Riemann integral by −∞.For the present, we must keep the subscript (0,∞) in the Lebesgue integral, as we havenot yet defined this integral for functions that take both positive and negative values.)

For distributions that are mixtures of the above two types, we use the following simplerelationship: if PX = λP1 + (1− λ)P2 for 0 ≤ λ ≤ 1, and g is any nonnegative measurablefunction, then

∫g(t) dPX(t) = λ

∫g(t) dP1(t) + (1− λ)

∫g(t) dP2(t) .

Expectations of functions of random variables

We now turn to expectations of nonnegative functions of random variables. Let X bean arbitrary random variable (not necessarily nonnegative) on (Ω,F , P ), and let g : R 7→ Rbe a nonnegative measurable function on (R,B(R)). As we saw in section 13, the function

(g X)(ω) = g(X(ω))

is a nonnegative random variable on (Ω,F , P ). By the result of the previous subsection,we can write

E[g X] = E[g(X)] =∫

g(X(ω)) dP (ω) =∫

(0,∞)

t dPgX(t) . (1)

Consider now the function g as a random variable on the probability space (R,B(R), PX).The probability of a Borel set H under the distribution of g is given by

PX(g−1(H)) = P(X−1(g−1(H))

)= P

((g X)−1(H)

)= PgX(H) .

56

Page 57: random process.pdf

Thus the r.v. g on (R,B(R), PX) has the same distribution as the r.v. g X on (Ω,F , P ).By the result of the previous subsection, we have

∫g(t) dPX(t) =

(0,∞)

t dPgX(t) . (2)

Combining (1) and (2), we obtain

E[g(X)] = E[g X] =∫

(0,∞)

t dPX(t) =∫

g(t) dPX(t) .

The above result is extremely useful in that it allows us to compute the expectationof any nonnegative function g of a random variable X in terms of the distribution of X;thus the distribution of g X is not needed in order to evaluate E[g(X)].

For PX discrete, it is easy to show that

E[g(X)] =∑

x:PXx>0

g(x)Px .

If PX is absolutely continuous with piecewise continuous density f , and g itself is piecewisecontinuous, we can again use a Riemann integral to evaluate E[g(X)]:

E[g(X)] =∫

g(t) dPX(t) =∫ ∞

0

g(t)fX(t) dt .

Example If Y = X2, then

EY =∫

(0,∞)

t dPY (t) =∫

t2 dPX(t) .

If PX is absolutely continuous with piecewise continuous density, then the same is true ofPY , and

EY =∫ ∞

0

tfY (t) d(t) =∫ ∞

0

t2fX(t) dt .

It is both interesting and useful to note that the proof of the above result is alsovalid for a random vector X replacing the random variable X. Thus if g : Rk 7→ R is anonnegative measurable function on (Rk,B(Rk)), we can write

E[g(X)] =∫

g(t) dPX(t) .

In the discrete case, evaluation of the above is again straightforward. In the absolutelycontinuous case, if we assume piecewise continuity of fX and g on Rk, we have

E[g(X)] =∫ ∞

−∞. . .

∫ ∞

−∞g(t1, . . . , tk)fX(t1, . . . , tk) dt1 · · · dtk .

57

Page 58: random process.pdf

Expectation of arbitrary random variables

If X is an arbitrary (i.e., not necessarily nonnegative) random variable on (Ω,F , P ) ,the expectation of X is defined through a decomposition of X into positive part X+ anda negative part X−:

X+(ω) def= X(ω)Iω:X(ω)>0(ω) , X−(ω) def= −X(ω)Iω:X(ω)<0(ω) .

Thus X+ and X− are nonnegative random variables such that

X = X+ −X− , |X| = X+ + X− .

Furthermore, EX+ and EX− are well-defined by the foregoing discussion.

Definition. The expectation of X is defined by

EXdef= EX+ − EX−

if both EX+ and EX− are finite, or if at most one of EX+ and EX− is infinite. If bothEX+ and EX− are infinite, then EX is not defined (i.e., it does not exist).

If EX exists, we can also express it as a Lebesgue integral:

EX =∫

X(ω) dP (ω) =∫

X+(ω) dP (ω)−∫

X−(ω) dP (ω) .

If both EX+ and EX− are finite, we say that X is integrable; this is also equivalent tosaying that E|X| < ∞, or EX 6= ±∞. If E|X| is infinite (in which case EX is itselfinfinite or does not exist), we say that X is non-integrable.

Example. Consider the probability space consisting of the interval [−π/2, π/2], itsBorel field, and the measure P that has uniform density over [−π/2, π/2] (P is a scaledversion of the Lebesgue measure). Define the random variables X and Y by

X(ω) = sin ω , Y (ω) = tan ω .

We have

EX+ =∫

(0,π/2)

sin ω dP (ω) =1π

∫ π/2

0

sin ω dω =1π

,

EY + =∫

(0,π/2)

tan ω dP (ω) =1π

∫ π/2

0

tan ω dω = +∞ .

By symmetry, EX− = EX+ and EY − = EY +. Thus EX = 0, while EY does not exist.

Further properties of expectation

(a) Functions of random variables. If g : R 7→ R is Borel measurable (but notnecessarily nonnegative) and E[g(X)] exists, then we can express E[g(X)] in terms of PX

as before. To see this, observe that

(g X)+(ω) = g+(X(ω)) , = (g X)−(ω) = g−(X(ω)) ,

58

Page 59: random process.pdf

and hence by our earlier result,

E[(g X)+] = E[g+(X)] =∫

g+(t) dPX(t) ,

E[(g X)−] = E[g−(X)] =∫

g−(t) dPX(t) .

Our latest extension of the definition of expectation yields

E[g(X)] = E[g X] =∫

g(t) dPX(t) ,

whenever either side of the equation exists. By taking g(t) = t, we obtain the importantrelationship

EX =∫

g(t) dPX(t) .

Once again, the Lebesgue integrals in the above relationships are usually evaluated bysums (when PX is discrete) or Riemann integrals (when PX is absolutely continuous withpiecewise continuous density, and g is itself piecewise continuous). One must, however,exercise care in using such sums and integrals; as a rule, one should compute E[g+(X)]and E[g−(X)] separately before giving the final answer. For instance, if g(t) = t and PX

is absolutely continuous, one would need to evaluate both

EX+ =∫ ∞

0

tfX(t) dt and EX− = −∫ 0

−∞tfX(t) dt .

Example The random variable Y of the previous example has the Cauchy density fY (t) =π−1(1 + t2)−1. It is tempting to write

∫ ∞

−∞tfY (t) dt = 0

by virtue of the fact that the integrand is an odd function of t. Yet as we have seen,EY does not exist, and restriction of the above integral to (0, ∞) and (−∞, 0) yields thecorrect results EY + = ∞, and EY − = ∞, respectively. The same error could have beenmade in the evaluation of EY in the previous example, as Y (ω) is an odd function of ω.

As in the previous section, we can extend the above results to measurable functionsof random vectors. Again one should exercise care in handling multiple integrals such as

∫ ∞

−∞. . .

∫ ∞

−∞g(t1, . . . , tk)fX(t1, . . . , tk) dt1 · · · dtk .

As a rule, one should first decompose g into positive and negative parts, and then evaluatethe two resulting integrals in the usual iterated fashion. One important exception is thecase where integrability of g can be independently established (e.g. by evaluating the above

59

Page 60: random process.pdf

integral with |g| replacing g); then decomposition into positive and negative parts is notnecessary.

(b) Linearity. If X and Y are integrable random variables and α, β are real constants,then

E[αX + βY ] = αEX + βEY .

This is easily shown using linearity of sums and integrals for cases in which the pair (X, Y )is discrete or absolutely continuous with “nice” density. In the latter case, for instance, wehave

E[αX + βY ] =∫ ∞

−∞

∫ ∞

−∞(αx + βy)fXY (x, y) dxdy

= α

∫ ∞

−∞x dx

∫ ∞

−∞fXY (x, y) dy + β

∫ ∞

−∞y dy

∫ ∞

−∞fXY (x, y) dx

= α

∫ ∞

−∞xfX(x) dx + β

∫ ∞

−∞yfY (y) dy = αEX + βEY .

(Note that the assumption of integrability of X and Y enabled us to decompose theintegrals into components without having to worry about indeterminate forms such as+∞−∞.)

The proof of linearity in the general case appears in Billingsley.

(c) Independence and Expectation. If X1, . . . , Xn are independent and integrablerandom variables, then the expectation of their product is given by the product of theirexpectations:

E[X1 · · ·Xn] = E[X1] · · ·E[Xn] .

The proof of this fact for arbitrary random variables is given in Billingsley. For the caseof absolutely continuous distributions with piecewise continuous density, we first showthe above result for nonnegative independent (but not necessarily integrable) variablesY1, . . . , Yn:

E[Y1 · · ·Yn] =∫ ∞

0

. . .

∫ ∞

0

t1 · · · tnfY1(t1) · · · fYn(tn) dt1 · · · dtn

=

(∫ ∞

0

t1fY1(t1) dt1

)· · ·

(∫ ∞

0

tnfYn(tn) dtn

)

= E[Y1] · · ·E[Yn] .

Thus if X1, . . . , Xn are integrable, we can utilize the above identity with Yi = |Xi| toconclude that the product X1 · · ·Xn is also integrable. By repeating the steps with Xi

replacing Yi for every i and −∞ replacing 0 in all lower limits, we obtain

E[X1 · · ·Xn] = E[X1] · · ·E[Xn] .

60

Page 61: random process.pdf

Example. The assumption of integrability was essential in establishing the aboveproduct relationship. Consider a counterexample in which X and Y ≥ 0 are independentwith

fX(t) =12I[−1,1](t) , fY (t) =

2π(1 + t2)

I[0,∞)(t) .

In this case E[X+] = E[X−] = 1/2, while EY = +∞. We have

E[(XY )+] = E[(XY )−] = (1/2)(+∞) = +∞ ,

and thus E[XY ] does not exist. On the other hand, E[X]E[Y ] equals 0(+∞) which isby convention equal to zero. The same error can be made by setting E[XY ] equal to theiterated integral ∫ ∞

−∞yfY (y)

∫ ∞

−∞xfX(x) dx

dy ,

which is equal to zero.

17. Applications of expectation

Gray & Davisson, pp. 139–145, 147–148.Billingsley, Sec. 21, Applications of expectation.

Moments and Correlation

For positive integer k, the kth moment of a random variable X is the expectation ofXk (provided it exists).

In what follows we will assume random variables X and Y are square-integrable,i.e., they have finite second moments. This also implies that X and Y also have finiteexpectations (or first moments).

The correlation of X and Y is the expectation of the product XY . This exists andis finite by virtue of the fact that |XY | ≤ X2 + Y 2 and the assumption that X and Y aresquare-integrable.

The covariance of X and Y is the correlation of X − EX and Y − EY , i.e., thecorrelation of the original random variables centered at their expectations:

Cov(X, Y ) def= E[(X − EX)(Y − EY )] .

We also have

Cov(X, Y ) = E[XY −X(EY )− Y (EX) + (EX)(EY )] = E[XY ]− E[X]E[Y ] .

If we let X = Y , then we obtain the variance of X:

VarX def= Cov(X, X) = E(X − EX)2 = EX2 − (EX)2 .

61

Page 62: random process.pdf

Since the variance equals the expectation of the nonnegative random variable (X −EX)2,it is always nonnegative; from the above we also deduce that the second moment of X isalways greater than or equal to the square of its first moment.

We say that X and Y are uncorrelated if Cov(X, Y ) = 0, or equivalently, if E[XY ] =E[X]E[Y ]. Thus two independent random variables are always uncorrelated; the converseis not true, as the following example shows.

Example. Let X be such that EX = 0 and EX3 6= 0. Then for α, β ∈ R, therandom variable

Y = αX2 + βX

is clearly not independent of X. Yet

Cov(X, Y ) = E[XY ] = αEX3 + βEX2 ,

which can be zero for suitable α and β.

Thus uncorrelatedness is a much weaker condition than independence. It should beemphasized that the product relationship by which uncorrelatedness is defined always in-volves two random variables. Thus when speaking of a collection of uncorrelated variables,we understand that every two variables in the collection satisfy the said relationship; we donot imply the validity of product relationships involving three or more variables at once.This is in clear contrast to the case of independence. Another salient difference betweenuncorrelatedness and independence is that the former is not preserved under measurabletransformations of the variables involved, whereas the latter is.

Using the definition of covariance and the linearity of expectation, it is easy to showthat for square integrable variables X1, . . . , XM and Y1, . . . , YN ,

Cov( M∑

i=1

aiXi + bi,

N∑

j=1

cjYj + dj

)=

M∑

i=1

N∑

j=1

aicjCov(Xi, Xj) .

In particular, covariance is not affected by additive constants.

The Cauchy-Schwarz inequality

The Cauchy-Schwarz inequality states that for square-integrable random variables Xand Y ,

(E[XY ])2 ≤ E[X2]E[Y 2] ,

with equality if and only if there exists a constant λ such that X +λY = 0 with probabilityone.

To prove this, consider the nonnegative random variable Zλ defined by

Zλ = (X + λY )2 .

62

Page 63: random process.pdf

Here λ is a (nonrandom) constant. We have

EZλ = E[X2 + 2λXY + λ2Y 2]

= E[X2] + 2λE[XY ] + λ2E[Y 2] ≥ 0 ,

where the last inequality follows from the fact that Zλ ≥ 0. Since the inequality holds forall λ, we have (by the elementary theory of quadratics) that

(2E[XY ])2 ≤ 4E[X2]E[Y 2] , or (E[XY ])2 ≤ E[X2]E[Y 2] .

This establishes the sought inequality. Equality will hold if and only if the quadratic hasexactly one real root (of multiplicity two), i.e., there exists a unique λ such that

EZλ = 0 .

Since Zλ ≥ 0, the above statement is equivalent to

Zλ = 0 , or X + λY = 0

with probability one.

Remark. It is clear that if Zλ = 0 with probability one, then EZλ = 0. To see whyEZλ > 0 if the inequality Zλ > 0 is true with positive probability, note that the eventover which Zλ > 0 holds can be decomposed into events over which r−1 ≤ Z < (r − 1)−1

for r ∈ N. If Zλ > 0 with positive probability, one of the above disjoint events haspositive probability, and thus Zλ is lower bounded by a simple random variable of positiveexpectation.

A corollary of the Cauchy-Schwarz inequality is obtained by centering the randomvariables at their expectations:

(Cov(X,Y )

)2

≤ VarX ·VarY ,

with equality if and only if there exist constants λ and c such that X + λY = c withprobability one.

The Markov and Chebyshev inequalities

Notation. Where there is no ambiguity about the choice of probability space, we willuse “Pr” as an abbreviation for “probability that.” For example, on a probability space(Ω,F , P ),

Prinfinitely many Ai’s occur = P (lim supi

Ai) ,

PrX ∈ H = Pω : X(ω) ∈ H = PX(ω) .

63

Page 64: random process.pdf

Consider a nonnegative random variable U . For a positive constant α, the Markovinequality provides an upper bound to PrU ≥ α in terms of the expectation EU . Toderive this inequality, we consider the functions

g(t) = αI[α,∞)(t) and h(t) = tI[0,∞)(t)

defined on the real line.

Sinceg(t) ≤ h(t) ,

it is also true that ∫g(t) dPU (t) ≤

∫h(t) dPU (t) .

or equivalently,αPU [α,∞) ≤ EU .

Thus we have obtained the Markov inequality:

PrU ≥ α ≤ EU

α.

The Markov inequality can be applied to an arbitrary r.v. X in order to obtain anupper bound to Pr|X| ≥ α in terms of the rth moment of |X|. Indeed, if we set U = |X|,then

Pr|X| ≥ α = Pr|X|r ≥ αr ≤ E|X|rαr

.

For r = 2, U = |X| and |X−EX|, we obtain two versions of the Chebyshev inequality:

Pr|X| ≥ α ≤ EX2

α2;

Pr|X − EX| ≥ α ≤ VarXα2

.

Remark. Except for trivial cases, the Markov and Chebyshev inequalities are alsovalid with strict inequality signs on both sides of the appropriate expressions. This can beseen by taking g = I(α,∞) in the above.

64

Page 65: random process.pdf

As an application of the Chebyshev inequality, consider a sequence X1, X2, . . . ofsquare-integrable uncorrelated random variables such that EXi = µi, VarXi = σ2

i .We are interested in the behavior of the sample (or time) average

X1 + · · ·Xn

n

of the first n variables. Clearly

E

[1n

n∑

i=1

Xi

]=

1n

n∑

i=1

EXi =1n

n∑

i=1

µi

and

Var

(1n

n∑

i=1

Xi

)= Cov

1

n

n∑

i=1

Xi,1n

n∑

j=1

Xj

=

n∑

i=1

n∑

j=1

1n2

Cov(Xi, Xj)

(by uncorrelatedness) =1n2

n∑

i=1

Cov(Xi, Xi) =1n2

n∑

i=1

σ2i .

Thus by the Chebyshev inequality,

Pr

∣∣∣∣1n

n∑

i=1

Xi − 1n

n∑

i=1

µi

∣∣∣∣ ≥ α

≤ 1

n2α2

n∑

i=1

σ2i .

In particular, if µi = µ, σ2i = σ2, then

Pr

∣∣∣∣1n

n∑

i=1

Xi − µ

∣∣∣∣ ≥ α

≤ 1

n2α2(nσ2) =

σ2

nα2.

Thus regardless of the choice of α ≥ 0, we have

limn

Pr

∣∣∣∣1n

n∑

i=1

Xi − µ

∣∣∣∣ ≥ α

= 0 .

As we shall see in the following section, the above statement can be read as “the sampleaverage converges to the expectation (mean) µ in probability.”

18. Convergence of sequences of random variables

Gray & Davisson, pp. 145–152.Wong, pp. 49–55.Billingsley, Sec. 20, Convergence in Probability; Sec. 25, Convergence in Distribution,Convergence in Probability.Wong & Hajek, pp. 18–24.

65

Page 66: random process.pdf

The notion of convergence of a sequence of random variables was introduced in Section 12. As we saw in that section, if X1, X2, . . . are random variables on the same probability space (Ω, F, P), then
(i) the set C of ω's for which Xn(ω) converges to a finite or infinite limit is an event (i.e., C ∈ F);
(ii) the mapping that carries every ω in C to the corresponding limit defines an extended random variable on the restriction of (Ω, F) to C.
In this section we will discuss different modes in which a sequence of random variables X1, X2, . . . may converge to a limiting random variable X. Some modes of convergence are defined in the framework of the previous paragraph, i.e., in regard to the limiting behavior of Xn(ω) as ω varies over a suitable event C. Other modes are defined in an aggregate sense, and thus do not entail the convergence of individual sequences X1(ω), X2(ω), . . . .

Remark. It is important to appreciate the difference between two expressions such as

X1(ω), X2(ω), . . .

and

X1, X2, . . . .

For a given ω, the former is a sequence of numbers; the latter is a sequence of functions on Ω. Thus there is nothing equivocal about the statement "Xn(ω) converges;" on the other hand, as we shall soon see, "Xn converges" could mean one of several things.

In what follows, all random variables are defined on the same probability space and are finite-valued (i.e., they are not extended random variables).

Definitions

(a) Pointwise convergence on Ω. The sequence (Xn)n∈N converges to X pointwise on Ω if

(∀ω ∈ Ω) lim_n Xn(ω) = X(ω) .

Thus in this case, the set C introduced above coincides with Ω.

(b) Almost sure convergence. The sequence (Xn)n∈N converges to X almost surely if the event A over which

lim_n Xn(ω) = X(ω)

has probability one. In this case, the set C introduced above contains A as a subset, and thus P(C) = 1. We write Xn →a.s. X.

It is instructive to derive an alternative representation of the event

A = {ω : lim_n Xn(ω) = X(ω)} .

To this end we recall the definition of convergence of a sequence (an)n∈N to a finite limit a:

lim_n an = a ⇐⇒ (∀q)(∃m)(∀n ≥ m) |a − an| < q⁻¹ .


Here m, n, and q are taken to range over N. We also note that in Section 12 we gave an alternative definition of convergence of real sequences in terms of iterated infima and suprema. The one given directly above is more elementary, and leads to the following representation for A:

A = {ω : (∀q)(∃m)(∀n ≥ m) |X(ω) − Xn(ω)| < q⁻¹}
  = ∩_q ∪_m ∩_{n≥m} {ω : |X(ω) − Xn(ω)| < q⁻¹} .

Thus the event A over which Xn(ω) converges to X(ω) is the set of ω's with the property that for every q, the distance |X(ω) − Xn(ω)| is eventually smaller than 1/q. In Section 7 we linked the qualifier "eventually" with the limit inferior of a sequence of events; this connection is again apparent in the last expression, as each set in the first intersection (over q) is a limit inferior of a sequence of events.

Pointwise convergence on Ω and almost sure convergence are illustrated in the figure below, where downward arrows denote convergence at the given abscissae.

[Figure: pointwise convergence on Ω (left panel); almost sure convergence (right panel)]

(c) Convergence in probability. The sequence (Xn)n∈N converges to X in probability if for every ε > 0,

lim_n Pr{|X − Xn| > ε} = 0 .

We write Xn →P X. It is easily seen that the > sign in the above expression can be replaced by ≥, and that ε need only range over the reciprocals of positive integers. Thus for Xn →P X to be true, we require that for every q ∈ N, the sequence of events B1,q, B2,q, . . . defined by

Bn,q = {ω : |X(ω) − Xn(ω)| ≥ q⁻¹}

satisfy

lim_n P(Bn,q) = 0 .


(d) Convergence in rth mean. The sequence (Xn)n∈N converges to X in rth mean if

lim_n E[|X − Xn|^r] = 0 .

We write Xn →Lr X. In the special case r = 2 we speak of convergence in quadratic mean, or mean-square convergence, which we also denote by

Xn →q.m. X and Xn →m.s. X .

Thus convergence in probability and in rth mean differ from the first two modes of convergence discussed above in that their definition does not entail convergence of the sequence (Xn(ω))n∈N. For convergence in probability, we require that certain sequences of events exhibit vanishing probabilities; for convergence in rth mean, we require that a certain sequence of integrals exhibit vanishing values. This is illustrated in the figure below.

[Figure: convergence in probability (left panel); convergence in rth mean (right panel)]

(e) Convergence in distribution. The sequence (Xn)n∈N converges to X in distribution if at every continuity point x of FX (i.e., wherever PX{x} = 0), the following is true:

lim_n FXn(x) = FX(x) .

We write Xn →d X. As this definition only involves the (marginal) distributions FX and (for all n) FXn, it is acceptable to identify convergence in distribution with convergence of distributions. Thus if we let Fn ≡ FXn and F ≡ FX, we can also express Xn →d X as

FXn →w FX or Fn →w F .

Here "w" stands for "weakly," and is interpreted as convergence at all points of continuity.

The following figure illustrates convergence in distribution.
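To make the definitions (a)–(e) more concrete, the following sketch (an added illustration; the particular sequence Xn = X + Zn/n is an arbitrary choice) estimates by Monte Carlo the quantities appearing in the definitions of convergence in probability, in rth mean, and in distribution.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 100_000
    X = rng.standard_normal(N)                       # the limiting random variable
    eps, r, x0 = 0.1, 2, 1.0

    for n in (1, 10, 100):
        Xn = X + rng.standard_normal(N) / n          # Xn = X + Zn/n with independent Zn
        print("n =", n,
              "P(|Xn-X|>eps) ~", np.mean(np.abs(Xn - X) > eps),   # -> 0 (in probability)
              "E|Xn-X|^r ~", np.mean(np.abs(Xn - X)**r),          # -> 0 (in rth mean)
              "F_Xn(x0) ~", np.mean(Xn <= x0))                    # -> F_X(x0) (in distribution)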


Uniqueness of limits

In each of the five modes of convergence discussed above, the limiting random variable X is to a larger or smaller extent determined by the sequence (Xn)n∈N. More specifically, if both X and Y are limits (in the appropriate mode) of the sequence (Xn)n∈N, then

(a) under pointwise convergence,

(∀ω) X(ω) = Y(ω) , i.e., X ≡ Y ;

(b) under almost sure convergence, convergence in probability, or convergence in rth mean,

(∃D ∈ F, P(D) = 1) (∀ω ∈ D) X(ω) = Y(ω) , i.e., X = Y a.s. ;

(c) under convergence in distribution,

FX ≡ FY ,

and thus, unlike the other modes of convergence, the mappings X(·) and Y(·) here need not be related in any particular way.

The above facts are fairly easy to establish, and the proofs appear in the references for this section.

Criteria for convergence

One is often interested in establishing convergence of (Xn)n∈N in one of the five modes discussed above without regard to the properties of the limiting random variable X. It is thus useful to have recourse to convergence criteria that do not explicitly involve X, especially in situations where a plausible candidate for X may be difficult to find. For the first four modes of convergence, such criteria exist and are expressed in terms of the behavior of the random variable

|Xm − Xn|

for large values of m and n.


The idea here stems from the concept of mutual convergence of real sequences: the sequence (an)n∈N converges mutually if

(∀q ∈ N)(∃K)(∀m, n ≥ K) |am − an| < q⁻¹ .

We also write the above as

lim_{m,n→∞} |am − an| = 0 ,

where the above limit is understood over all sequences of pairs (m, n) such that both m and n increase to infinity.

As it turns out, the sequence (an)n∈N converges to a finite limit if and only if it converges mutually. This equivalence can be readily exploited to yield the following criteria for pointwise and almost sure convergence:

(a) (Xn)n∈N converges pointwise (to some random variable) if and only if

(∀ω ∈ Ω) lim_{m,n→∞} |Xm(ω) − Xn(ω)| = 0 ;

(b) (Xn)n∈N converges almost surely if and only if

P{ω : lim_{m,n→∞} |Xm(ω) − Xn(ω)| = 0} = 1 .

For convergence in probability and rth mean, the criteria are similar but rather more difficult to obtain (see Wong and Hajek):

(c) (Xn)n∈N converges in probability if and only if for every q ∈ N,

lim_{m,n→∞} P{ω : |Xm(ω) − Xn(ω)| > q⁻¹} = 0 ;

(d) (Xn)n∈N converges in rth mean if and only if

lim_{m,n→∞} E|Xm − Xn|^r = 0 .

Recall that Xn →d X entails convergence of the cdf of Xn to that of X at every continuity point of the latter cdf. Although it is tempting to think that a sufficient condition for convergence in distribution is the existence of

lim_n FXn(x)

for all x, this is not so; the above limit should also coincide with a legitimate cdf at all points where the latter is continuous.

To see a simple example where the above limit exists but does not define a cdf, take

Xn ≡ n .


Then the cdf of Xn is the indicator function I_[n,∞), and for every x,

lim_n FXn(x) = 0 .

Clearly there exists no cdf F such that F = 0 at all points of continuity of F, and thus (Xn)n∈N does not converge in distribution.

Relationships between types of convergence

The following diagram summarizes the principal relationships between the five types of convergence defined in this section. As usual, a double arrow denotes implication.

Xn converges to X pointwise on Ω  ⇒  Xn →a.s. X  ⇒  Xn →P X  ⇒  Xn →d X ;
Xn →Lr X (r ≥ 1)  ⇒  Xn →P X .

Pointwise convergence on Ω ⇒ Almost sure convergence: obvious by definition.

Almost sure convergence ⇒ Convergence in probability: Recall that the set of ω's for which Xn(ω) converges to X(ω) is given by

A = ∩_q ∪_m ∩_{n≥m} {ω : |X(ω) − Xn(ω)| < q⁻¹} ,

where q, m and n range over the positive integers. If P(A) = 1, then for any given q,

P( ∪_m ∩_{n≥m} {ω : |X(ω) − Xn(ω)| < q⁻¹} ) = 1 .

As the outer union is over an increasing sequence of events, we have

lim_m P( ∩_{n≥m} {ω : |X(ω) − Xn(ω)| < q⁻¹} ) = 1 ,

and thus also

lim_m P{ω : |X(ω) − Xm(ω)| < q⁻¹} = 1 .


Upon complementation we obtain

lim_m P{ω : |X(ω) − Xm(ω)| ≥ q⁻¹} = 0 ,

which proves that Xn →P X.

Convergence in rth mean ⇒ Convergence in probability: By the Markov inequality, for every ε > 0 we have

Pr{|X − Xn| ≥ ε} ≤ E|X − Xn|^r / ε^r .

If Xn →Lr X, then the r.h.s. approaches 0 as n tends to infinity, and thus

lim_n Pr{|X − Xn| ≥ ε} = 0 .

Convergence in probability ⇒ Convergence in distribution: See Billingsley.

To see that no equivalences or other implications exist between the five modes of convergence, we consider some counterexamples.

(a) Convergence in distribution ⇏ Convergence in probability: Let Ω = {0, 1} and P{0} = P{1} = 1/2. Also let

Xn(ω) = ω, if n is odd; 1 − ω, if n is even.

The above sequence is identically distributed with

Pr{Xn = 0} = Pr{Xn = 1} = 1/2 ,

and thus trivially (Xn)n∈N converges in distribution. By the criterion of the previous subsection, (Xn)n∈N does not converge in probability, since for all n,

Pr{|Xn − Xn+1| = 1} = 1 .

(b) Convergence in probability ⇏ Almost sure convergence or convergence in rth mean: Consider the probability space (Ω, F, P) where Ω = (0, 1], F = B(Ω), and P is the Lebesgue measure. We construct a sequence of events (Gn)n∈N ranging over the dyadic intervals:

G1 = (0, 1] ;
G2 = (0, 1/2] , G3 = (1/2, 1] ;
G4 = (0, 1/4] , G5 = (1/4, 1/2] , G6 = (1/2, 3/4] , G7 = (3/4, 1] ;
etc.


If Gn = (an, bn], we define the simple random variable Xn by

Xn(ω) = (bn − an)⁻¹ I_Gn(ω) = P(Gn)⁻¹ I_Gn(ω) .

The sequence (Xn)n∈N converges to 0 in probability, since for every ε > 0,

P{ω : |Xn(ω)| > ε} ≤ P(Gn) ,

and P(Gn) converges to zero by construction of the Gn's. To see that it does not converge almost surely, first note that every ω lies in infinitely many Gn's, and thus

Xn(ω) = (bn − an)⁻¹ ≥ 1

for infinitely many values of n. On the other hand, every ω also lies outside infinitely many Gn's, and thus Xn(ω) = 0 infinitely often. Therefore for every ω the sequence (Xn(ω))n∈N fails to converge, and certainly (Xn)n∈N does not converge almost surely.

To see that (Xn)n∈N does not converge in rth mean (where r ≥ 1), note that for every K there exist m, n > K such that Gm ∩ Gn = ∅. Then

E|Xm − Xn|^r = E|Xm|^r + E|Xn|^r ≥ 2 ,

whence we conclude that

lim_{m,n→∞} E|Xm − Xn|^r ≠ 0

and (Xn)n∈N does not converge in rth mean for r ≥ 1.

(c) Convergence in rth mean ⇏ almost sure convergence: We modify the above example by letting Xn = I_Gn. Then

lim_n E|Xn|^r = lim_n P(Gn) = 0

and Xn →Lr 0; yet (Xn)n∈N again does not converge almost surely.

(d) Almost sure convergence ⇏ convergence in rth mean: On the same probability space as above we let Gn = (0, 1/n], Xn = n I_Gn. Then for every ω in (0, 1] we have Xn(ω) = 0 for n > 1/ω, and thus (Xn)n∈N converges to 0 pointwise on Ω. On the other hand, for n = 2m we have

E|Xn − Xm|^r = m^{r−1} ,

whence we conclude that for r ≥ 1,

lim_{m,n→∞} E|Xm − Xn|^r ≠ 0 ,

and (Xn)n∈N does not converge in rth mean for r ≥ 1.

(e) Almost sure convergence ⇏ pointwise convergence on Ω: left as an easy exercise.


Remark. As it turns out, the following weaker implications are also true (see Billingsley or Wong & Hajek):

(i) if Xn →P X, then a subsequence of (Xn)n∈N converges to X almost surely;

(ii) if Xn →a.s. X and there exists an integrable random variable Y ≥ 0 such that (∀n) Y ≥ |Xn|^r, then Xn →Lr X.

Thus in (b) above, the subsequence (Xn(k))k∈N, where n(k) = 2^k, converges to 0 almost surely; whereas the condition on the existence of an auxiliary r.v. Y is violated in (d).
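The dyadic-interval sequence of counterexample (b) is easy to explore numerically. The sketch below (an added illustration) enumerates the intervals Gn, shows that P(Gn) → 0, and shows that a fixed ω keeps falling inside later Gn's, so that Xn(ω) ≥ 1 infinitely often.

    import numpy as np

    def G(n):
        """Return the dyadic interval (a, b] with index n >= 1 (G1 = (0,1], G2 = (0,1/2], ...)."""
        k = int(np.floor(np.log2(n)))        # level: 2^k intervals of length 2^-k
        j = n - 2**k                         # position within the level, 0 <= j < 2^k
        return j / 2**k, (j + 1) / 2**k

    omega = 0.3
    hits = [n for n in range(1, 2000) if G(n)[0] < omega <= G(n)[1]]
    print("P(G_n) for n = 1, 10, 100, 1000:",
          [1 / 2**int(np.floor(np.log2(n))) for n in (1, 10, 100, 1000)])
    print("first indices n with X_n(omega) > 0:", hits[:10], "(exactly one per level, so infinitely many)")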

19. Convergence and expectation

Billingsley, Sec. 5, Inequalities; Sec. 16, Integration to the Limit; Sec. 21, Inequalities.

A question often asked is the following: if a sequence (Xn)n∈N converges (in some sense) to a random variable X, is it true that the expectations EXn converge to EX?

As it turns out, this is always true if convergence to X is in the rth mean, where r ≥ 1. For the other standard modes of convergence, the above statement is not always true. To see this, consider the example given under (d) in the previous section. We had Xn → X = 0 pointwise on Ω (and hence also almost surely, in probability and in distribution), yet EXn = 1 ≠ 0.

To show that Xn →Lr X for r ≥ 1 implies EXn → EX, we invoke the following variant of the powerful Hölder inequality (see Billingsley, Section 5, Inequalities):

(E[|Y|^s])^{1/s} ≤ (E[|Y|^r])^{1/r} , 0 < s ≤ r .

Setting Y = X − Xn in the above, we obtain the general fact

Xn →Lr X =⇒ Xn →Ls X , for s ≤ r .

The desired conclusion is reached by setting s = 1 and using the simple inequality

|EX − EXn| ≤ E|X − Xn| . (1)

(Remark. It is actually true that Xn →Lr X implies E|Xn|^s → E|X|^s for all 1 ≤ s ≤ r. To prove this, instead of (1) use

| (E[|X|^s])^{1/s} − (E[|Xn|^s])^{1/s} | ≤ (E[|X − Xn|^s])^{1/s} ,

which follows from Minkowski's inequality (ibid.):

(E[|Y + Z|^s])^{1/s} ≤ (E[|Y|^s])^{1/s} + (E[|Z|^s])^{1/s} , s ≥ 1 .)


In regard to other modes of convergence, we give sufficient conditions for the interchange of limits and expectation in the case of almost sure convergence. These conditions are contained in the following important theorems, which are discussed in Billingsley, Section 16.

Monotone convergence theorem. If (Xn)n∈N is a nondecreasing sequence of r.v.'s that converges almost surely to X and is bounded from below by an integrable r.v., then EXn → EX. In other words,

Xn →a.s. X, (∀n) Y ≤ Xn ≤ Xn+1, E|Y| < ∞ =⇒ EXn → EX .

(Remark. Since Y ≤ X and Y is integrable, EX is well defined with −∞ < EX ≤ +∞. The theorem is true even if EX = +∞.)

Dominated convergence theorem. If (Xn)n∈N is a sequence of r.v.'s that converges almost surely to X and is absolutely bounded by an integrable r.v., then EXn → EX. In other words,

Xn →a.s. X, (∀n) |Xn| ≤ Y, E|Y| < ∞ =⇒ EXn → EX .

As an application of the monotone convergence theorem, consider a sequence of nonnegative random variables Yn, and let Xn = Y1 + · · · + Yn. Then the Xn's form a nondecreasing sequence that converges pointwise to ∑_{i=1}^∞ Yi and is bounded from below by 0. By the monotone convergence theorem,

lim_n EXn = E[∑_{i=1}^∞ Yi] , or equivalently, E[∑_{i=1}^∞ Yi] = ∑_{i=1}^∞ EYi .

Note that by the remark following the statement of the theorem, the above holds even if the sum of the expectations is infinite in value.
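As an added numerical illustration (the particular choice Yi = 2⁻ⁱ Bi with independent Bernoulli(1/2) variables Bi is arbitrary), one can check that a Monte Carlo estimate of E[∑i Yi] agrees with ∑i EYi = 1/2:

    import numpy as np

    rng = np.random.default_rng(3)
    m, runs = 40, 100_000                          # truncate the sum at 40 terms (2^-40 is negligible)
    B = rng.integers(0, 2, size=(runs, m))         # independent Bernoulli(1/2) variables B_i
    Y = B * 0.5 ** np.arange(1, m + 1)             # Y_i = 2^-i * B_i, nonnegative
    print(Y.sum(axis=1).mean())                    # Monte Carlo estimate of E[sum_i Y_i]
    print(sum(0.5 ** i * 0.5 for i in range(1, m + 1)))   # sum_i E[Y_i], approximately 1/2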

As an application of the dominated convergence theorem, consider the following example. Let Y ≥ 0 be integrable, X ≥ 0 be arbitrary, and define

Zn = (XY/n) ∧ αY ,

where α > 0. Note that (Zn)n∈N is a nonincreasing sequence such that

(∀ω) 0 ≤ lim_n Zn(ω) ≤ lim_n X(ω)Y(ω)/n = 0 .

Thus Zn → Z = 0 pointwise on Ω. Since we also have 0 ≤ Zn ≤ αY and Y is integrable, we can apply the dominated convergence theorem to conclude that

lim_n EZn = EZ = 0 .

Note that the above conclusion does not necessarily hold if Y is nonintegrable. For example, if EY = +∞ and X is a positive constant r.v., then EZn = +∞ for all n, and thus EZn does not converge to 0.


It is instructive to verify that the hypotheses of the above two theorems do not hold in the example given under (d) in the previous section. Indeed, the sequence (Xn)n∈N is not monotone nondecreasing, and hence the monotone convergence theorem does not apply here. Furthermore, any dominating r.v. Y will have to dominate sup_n Xn; the latter is a discrete r.v. whose expectation is

∑_{n=1}^∞ n (1/n − 1/(n+1)) = ∑_{n=1}^∞ 1/(n+1) = ∞ .

Hence the dominated convergence theorem does not apply here either.

20. Laws of large numbers

Gray & Davisson, pp. 143–145.
Wong, pp. 55–56.
Billingsley, Sec. 6; Sec. 21, Moment Generating Functions; Sec. 22, Kolmogorov's Zero-One Law, Kolmogorov's Inequality, The Strong Law of Large Numbers.

Consider a sequence (Xn)n∈N of random variables in a probability experiment (Ω, F, P), and suppose that the marginal distributions (FXn)n∈N have some common numerical attribute whose value is unknown to us. In particular, we will assume that this attribute coincides with the mean µ of each FXn (by a simple change of variable, we can see that this restriction also covers cases in which the common attribute is expressible as an expectation of a function of Xn).

Suppose now that we wish to estimate the value of µ on the basis of the observed sequence

X1(ω), X2(ω), . . . ;

here ω is the actual outcome of the probability experiment. The weak and strong laws of large numbers ensure that such inference is possible (with reasonable accuracy) provided the dependencies between the Xn's are suitably restricted: the weak law is valid for uncorrelated Xn's, while the strong law is valid for independent Xn's. Since independence is a more restrictive condition than absence of correlation (for square-integrable random variables), one expects the strong law to be more powerful than the weak law. This is indeed the case, as the weak law states that the sample average

(X1 + · · · + Xn)/n

converges to the constant µ in probability, while the strong law asserts that this convergence takes place almost surely.

Weak law of large numbers

Theorem. Let (Xn)n∈N be a sequence of uncorrelated random variables with common mean EXi = µ. If the variables also have common variance, or more generally, if

lim_n (1/n²) ∑_{i=1}^n VarXi = 0 ,

then the sample average

(X1 + · · · + Xn)/n

converges to the mean µ in probability.

Proof. As was shown in Sec. 17, for every α > 0 the Chebyshev inequality gives

Pr{ |(1/n) ∑_{i=1}^n Xi − µ| ≥ α } ≤ (1/(n²α²)) ∑_{i=1}^n VarXi .

Thus under the stated assumptions on the variances, we have

lim_n Pr{ |(1/n) ∑_{i=1}^n Xi − µ| ≥ α } = 0 ,

which proves that the sample average converges to µ in probability.

Remark. Note that as it appears above, the right-hand side in the Chebyshev inequality is just the second moment of the difference between the n-sample average and the mean µ. Thus the variance constraint is equivalent to the statement that the sample average converges to µ in quadratic mean, which in turn implies that it converges to µ in probability.

Example. Let Ω = (0, 2π], F = B(Ω), P = uniform measure, and define Xn by

Xn(ω) = an sin nω , (an ∈ R).

Then

EXn = ∫ Xn(ω) dP(ω) = (1/2π) ∫_0^{2π} an sin nω dω = 0 ,

and

Cov(Xm, Xn) = (am an/2π) ∫_0^{2π} sin mω sin nω dω = { an²/2, if n = m; 0, otherwise.

Thus (Xn)n∈N is an uncorrelated sequence. Assuming that lim_n (1/n²) ∑_{i=1}^n ai² = 0, we can apply the weak law of large numbers to conclude that for every ε > 0,

lim_n P{ ω : (1/n) |∑_{i=1}^n ai sin iω| ≥ ε } = 0 .

Note that the set in the above expression is a union of intervals, since a function of the form

h(ω) = a0 + ∑_{i=1}^n ai sin iω

has only finitely many roots in the interval (0, 2π]. Thus the law of large numbers implies that the total length of those subintervals of (0, 2π] over which

(1/n) |∑_{i=1}^n ai sin iω| ≥ ε

approaches zero as n tends to infinity.
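An added simulation of this example (with the arbitrary choice ai ≡ 1, for which (1/n²) ∑ ai² = 1/n → 0) estimates the probability that (1/n)|∑ ai sin iω| exceeds ε for a uniformly drawn ω:

    import numpy as np

    rng = np.random.default_rng(4)
    eps, runs = 0.05, 5000
    omega = rng.uniform(0.0, 2 * np.pi, size=runs)

    for n in (10, 100, 1000):
        i = np.arange(1, n + 1)
        a = np.ones(n)                                     # a_i = 1 for all i
        S = (a * np.sin(np.outer(omega, i))).sum(axis=1)   # sum_i a_i sin(i * omega)
        print(n, np.mean(np.abs(S) / n >= eps))            # estimate of P{ (1/n)|sum| >= eps }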

A version of the strong law of large numbers

The moment generating function MX(s) of a random variable X is defined for real s by

MX(s) def= E[e^{sX}] = ∫ e^{st} dPX(t) .

It is easily seen that for absolutely continuous PX, the above coincides with the bilateral Laplace transform of the density fX.

We always have MX(0) = 1; if X is almost surely bounded, then it is also true that MX(s) < ∞ for every s. For unbounded X, it is possible to obtain MX(s) = +∞ for some or all values of s other than 0. However, for many distributions encountered in practice (including the Gaussian, gamma, geometric, Laplace and Poisson distributions) MX(s) is finite for s ranging over some interval containing the origin. For such distributions, we can prove the following version of the strong law of large numbers.

Theorem. Let (Xn)n∈N be an i.i.d. sequence whose marginal moment generating function MX(·) is finite over a nontrivial interval containing the origin. Then

(1/n) ∑_{i=1}^n Xi →a.s. EX1 = µ .

Proof. Let MX(s) be finite for s ∈ [−s0, s0], and note that since

s0|X1| < e^{s0 X1} + e^{−s0 X1} ,

the mean µ is also finite. Thus we can assume µ = 0 without loss of generality.

We will compute an upper bound to the probability that the sample average of the first n terms is greater than or equal to some α > 0. For this purpose we use the Chernoff bound:

(∀s > 0) Pr{Y ≥ α} ≤ E[e^{s(Y−α)}] = e^{−sα} MY(s) .

Thus we obtain

Pr{ (X1 + · · · + Xn)/n ≥ α } = Pr{ X1 + · · · + Xn ≥ nα }
  ≤ E[e^{s(X1 + ··· + Xn − nα)}]
  = E[ ∏_{i=1}^n e^{s(Xi − α)} ] = ∏_{i=1}^n E[e^{s(Xi − α)}] = (e^{−sα} MX(s))^n ,

where the last two equalities follow from the i.i.d. assumption.

The next step is to show that there exists a point s in (0, s0) such that e^{−sα} MX(s) < 1. We know that e^{−0·α} MX(0) = 1, hence it suffices to show that the derivative of e^{−sα} MX(s) takes a negative value at the origin. We have

(d/ds)(e^{−sα} MX(s)) |_{s=0} = −α MX(0) + M′X(0) = −α + M′X(0) .

To compute M′X(0), we write

M′X(s) = (d/ds) E[e^{sX}] = E[(d/ds) e^{sX}] = E[X e^{sX}] ,

where the interchange of derivative and Lebesgue integral can be justified using the dominated convergence theorem (see also the discussion of moment generating functions in Billingsley). We thus obtain

(d/ds)(e^{−sα} MX(s)) |_{s=0} = −α + E[X] = −α < 0 ,

as sought.

Hence for every α > 0 there exists a constant ρ+ < 1 such that

Pr{ (X1 + · · · + Xn)/n ≥ α } ≤ ρ+^n .

Using the same argument with −Xi replacing Xi and −s replacing s, we infer that there exists a constant ρ− < 1 such that

Pr{ (X1 + · · · + Xn)/n ≤ −α } ≤ ρ−^n .

Thus if ρ = max(ρ+, ρ−), we have

Pr{ |(X1 + · · · + Xn)/n| ≥ α } ≤ 2ρ^n ,

and since ρ < 1,

∑_{n=1}^∞ Pr{ |(X1 + · · · + Xn)/n| ≥ α } ≤ ∑_{n=1}^∞ 2ρ^n = 2ρ/(1 − ρ) < ∞ . (1)

Now the event over which the sample average converges to zero is given by

A = ∩_{q∈N} ∪_m ∩_{n≥m} { ω : |(X1(ω) + · · · + Xn(ω))/n| < q⁻¹ } ,

and thus

A^c = ∪_{q∈N} ∩_m ∪_{n≥m} { ω : |(X1(ω) + · · · + Xn(ω))/n| ≥ q⁻¹ }
    = ∪_{q∈N} lim sup_n { ω : |(X1(ω) + · · · + Xn(ω))/n| ≥ q⁻¹ } .

By the first Borel-Cantelli lemma (Billingsley, Sec. 4),

∑_{n=1}^∞ P(Bn) < ∞ =⇒ P(lim sup_n Bn) = 0 ,

and hence by virtue of (1), we have P(A^c) = 0. Thus the sample average converges to zero almost surely.
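The exponential bound in the proof can be illustrated numerically. The sketch below (added here; it assumes standard normal Xi, for which MX(s) = e^{s²/2} is finite everywhere) approximates ρ by minimizing e^{−sα} MX(s) over a grid of s and compares 2ρ^n with a Monte Carlo estimate of Pr{|(X1 + · · · + Xn)/n| ≥ α}.

    import numpy as np

    rng = np.random.default_rng(5)
    alpha = 0.5
    s = np.linspace(0.01, 3.0, 300)
    M = np.exp(s**2 / 2)                        # m.g.f. of N(0,1)
    rho = np.min(np.exp(-s * alpha) * M)        # best Chernoff factor on the grid (about exp(-alpha^2/2))
    print("rho =", rho)

    runs = 100_000
    for n in (5, 20, 50):
        means = rng.standard_normal((runs, n)).mean(axis=1)
        print(n, np.mean(np.abs(means) >= alpha), 2 * rho**n)

For small n the bound 2ρ^n may exceed one and is vacuous, but it decays geometrically and eventually dominates the empirical probability by a wide margin.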

Kolmogorov’s Strong Law of Large Numbers

The general version of the strong law of large numbers is due to Kolmogorov. It is appreciably stronger than the statement of the previous theorem, in that it posits convergence of sample averages under constraints on the first two moments of the independent variables. For a proof, see Billingsley.

Theorem. Let (Xn)n∈N be an independent sequence of random variables with common mean EXn = µ. If either
(i) the Xn's are identically distributed, or
(ii) the Xn's are square-integrable with

∑_{n=1}^∞ VarXn / n² < ∞ ,

then the sample average

(X1 + · · · + Xn)/n

converges to µ almost surely.

Remarks.

1. Note that the i.i.d. assumption (case (i) above) does not exclude the possibility µ = ±∞, in which case the sample average converges almost surely to a constant extended random variable.

2. For most practical purposes, when considering sequences of independent (not just uncorrelated) square-integrable random variables, Kolmogorov's strong law of large numbers subsumes the weak law as stated earlier in this section. This is because almost sure convergence always implies convergence in probability. It is worth noting, however, that the variance constraint employed in the statement of the weak law, namely

lim_n (1/n²) ∑_{i=1}^n VarXi = 0 ,

is somewhat more general than the condition ∑_{n=1}^∞ VarXn/n² < ∞ in Kolmogorov's strong law of large numbers. Thus there are instances of independent sequences to which the weak law applies, but the strong law does not.

Examples

(a) Consider again the sequence (Xn)n∈N of random variables of Section 2. For a random point ω drawn uniformly from the unit interval (0, 1], we defined

Xk(ω) = kth digit in the binary expansion of ω .

The Xk's were easily shown to be i.i.d. with EXk = 1/2. By the strong law of large numbers,

P{ ω : lim_{n→∞} (X1(ω) + · · · + Xn(ω))/n = 1/2 } = 1 .

As a consequence, the set A_{1/2} of points of the unit interval whose binary expansions have running averages that converge to 1/2 has measure 1 under the Lebesgue measure; thus A_{1/2} is in a sense as large as the unit interval itself. Yet in a different sense, A_{1/2} is much smaller than the unit interval: A_{1/2} does not overlap with any of the equally populous sets D (consisting of all points whose binary expansions have divergent running averages) or Ap (defined as A_{1/2} with p ≠ 1/2 replacing 1/2). We have thus exhibited a Borel subset of the unit interval whose Lebesgue measure equals one and whose complement is uncountable.
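As an added illustration of this example, simulating i.i.d. fair bits (which is equivalent to reading off the binary digits of a uniformly drawn ω) shows the running averages settling near 1/2:

    import numpy as np

    rng = np.random.default_rng(6)
    bits = rng.integers(0, 2, size=1_000_000)          # X_k = kth binary digit of a uniform omega
    running = np.cumsum(bits) / np.arange(1, bits.size + 1)
    print(running[[99, 9_999, 999_999]])               # running averages after 10^2, 10^4, 10^6 digits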

Question. What is the Lebesgue measure of the set of all points in the unit interval whose ternary expansions have running averages that converge to 1/3?

(b) Note that if (Xn)n∈N is an i.i.d. sequence, then so is (g(Xn))n∈N, where g : R → R is Borel measurable. Thus if E[g(X1)] exists,

(g(X1) + · · · + g(Xn))/n →a.s. E[g(X1)] ,

and a simple variable transformation can be used to estimate E[g(X1)] from the observations. Thus for example, we can use

(X1² + · · · + Xn²)/n

to estimate the second moment, which always exists.

The assumption that the Xn's are identically distributed is clearly crucial here. Also note that we cannot give a similar argument for uncorrelated random variables based on the weak law, even under the assumption that the marginals are identical; this is so because g(Yi) and g(Yj) can be correlated even if Yi and Yj are not.

Finally, consider the transformation g = IH, where H is a linear Borel set. As before, we have

(IH(X1) + · · · + IH(Xn))/n →a.s. E[IH(X1)] ,

or equivalently,

(number of k's, 1 ≤ k ≤ n, for which Xk ∈ H)/n →a.s. PX1(H) .

We can interpret the above result as follows. (Xn)n∈N defines a sequence of subexperiments of (Ω, F, P) that serve as mathematical models for the independent repetitions of some physical experiment X. The last relationship effectively states that the relative frequency of the event {X ∈ H} amongst the first n independent repetitions of X converges almost surely to the probability of that event. This corroborates the so-called frequentist view of probability, according to which the probability of an uncertain event in a random experiment is an objective entity which is revealed to an (infinitely patient) observer by the limit of the relative frequency of that event in a sequence of independent repetitions of the experiment.

21. Characteristic functions

Gray & Davisson, pp. 135–136, 236–240.
Billingsley, Sec. 26.
Wong, pp. 29–30.

The characteristic function φX(u) of a random variable X is defined for real u by

φX(u) def= E[e^{iuX}] = E[cos uX] + iE[sin uX] .

If X has density fX, then

φX(u) = ∫_{−∞}^{∞} e^{iut} fX(t) dt ,

and thus φX is the Fourier transform of fX (the moment generating function was seen to coincide with the bilateral Laplace transform of fX).

The function φX is continuous on the real line, and is such that

φX(0) = 1 , (∀u) |φX(u)| ≤ 1 .

When X has a symmetric distribution PX (e.g., when fX is an even function), the characteristic function φX is real-valued and symmetric.

Characteristic functions are discussed at length in Billingsley. We briefly note the following essential properties.

(a) PX completely specifies φX and vice versa. The forward assertion is true by definition of φX. The converse can also be shown to be true: the increment FX(b) − FX(a) equals the quantity

lim_{M→∞} (1/2π) ∫_{−M}^{M} ((e^{−iua} − e^{−iub})/(iu)) φX(u) du

whenever PX{a} = PX{b} = 0, and thus it is possible to deduce the values of FX at all points of continuity. In the special case in which the modulus |φX| is integrable, the distribution PX has a continuous density fX which can be recovered from φX using the inverse Fourier transform:

fX(t) = (1/2π) ∫_{−∞}^{∞} e^{−iut} φX(u) du .

(b) Convergence of characteristic functions. Recall the definition of convergence in distribution: Xn →d X if FXn converges to FX at every point of continuity of FX. As it turns out, convergence in distribution can also be defined in terms of characteristic functions: Xn →d X if and only if φXn(u) converges to φX(u) at every point u.

It is often advantageous to study convergence in distribution in terms of characteristic functions. If a sequence (φn)n∈N of characteristic functions has the property that φn(u) converges at every u to a limit φ(u) that is continuous at u = 0, then the limiting function φ is also a characteristic function and convergence in distribution is ensured. This gives us a convenient criterion for this mode of convergence that was absent in the original formulation in terms of cdf's (see Section 18).

(c) Independence and characteristic functions. Suppose X1, . . . , Xn are independent random variables. Then the characteristic function of the sum

Y = X1 + · · · + Xn

is given by

φY(u) = E[e^{i(uX1 + ··· + uXn)}]
(by independence) = E[e^{iuX1}] · · · E[e^{iuXn}] = φX1(u) · · · φXn(u) .

An analogous property was seen to be true for the moment generating function. If each Xk has an absolutely continuous distribution, the density of Y is given by the convolution of the densities of the Xk's. This is consistent with the well-known fact of Fourier analysis that convolution in the time (t) domain is equivalent to multiplication in the frequency (u) domain.
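A rough numerical check of the product rule (added here; the exponential and normal summands, and the point u, are arbitrary choices) compares the empirical characteristic function of a sum with the product of the empirical characteristic functions of its independent terms:

    import numpy as np

    rng = np.random.default_rng(7)
    N = 200_000
    X1 = rng.exponential(1.0, N)
    X2 = rng.standard_normal(N)
    Y = X1 + X2
    u = 0.7

    ecf = lambda Z: np.mean(np.exp(1j * u * Z))        # empirical characteristic function at u
    print(ecf(Y))                                       # approximately phi_Y(u)
    print(ecf(X1) * ecf(X2))                            # approximately phi_X1(u) * phi_X2(u)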

Characteristic function of a random vector

Convention. In all algebraic manipulations that follow, boldface symbols such as a and X will denote column vectors. Thus row vectors will be denoted by a^T and X^T. The transposition symbol will be omitted where no confusion is likely to arise, e.g., when writing X = (X1, . . . , Xn).

The characteristic function of an n-dimensional random vector X = (X1, . . . , Xn) is the function φX : Rn → C defined by

φX(u) def= E[e^{iu^T X}] ,

or equivalently,

φX(u1, . . . , un) = E[exp i(u1X1 + · · · + unXn)] .

If X has absolutely continuous distribution,

φX(u1, . . . , un) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} e^{i(u1x1 + ··· + unxn)} fX(x1, . . . , xn) dx1 · · · dxn ,

and thus φX coincides with the n-dimensional Fourier transform of the density fX.

The general properties of n-variate characteristic functions are similar to those of univariate ones (see (a)–(c) in the previous subsection). A simple property that will prove quite useful is the following: if A is an m × n matrix and b is an m-dimensional (column) vector, then the characteristic function of Y = AX + b evaluated at u ∈ Rm is given by

φY(u) = E[e^{iu^T(AX + b)}] = e^{ib^T u} E[e^{i(A^T u)^T X}] = e^{ib^T u} φX(A^T u) .

Also, if the components of X are independent,

φX(u1, . . . , un) = φX1(u1) · · · φXn(un) .

Expectation of vectors and the covariance matrix

If X = (X1, . . . , Xn)^T is a random (column) vector and Θ = [Xij] is an m × n matrix of random variables, we define

EX = (EX1, . . . , EXn)^T and EΘ = [EXij] ;

that is, the expectation of a random vector or matrix is taken entry by entry.

The covariance matrix of X is the n × n matrix CX defined by

CX def= E[(X − EX)(X − EX)^T] .

Thus the (i, j)th entry of CX is simply given by

(CX)ij = E[(Xi − EXi)(Xj − EXj)] = Cov(Xi, Xj) ,

and CX is symmetric.


It is easy to verify that expectation is linear, in that

E[AX] = A EX , E[XA] = E[X] A

and

E[BΘ] = B E[Θ] , E[ΘB] = E[Θ] B

for constant matrices A and B of appropriate size. We can use linearity to evaluate the covariance matrix of AX as follows:

C_AX = E[(AX − E(AX))(AX − E(AX))^T]
     = E[A(X − EX)(X − EX)^T A^T]
     = E[A(X − EX)(X − EX)^T] A^T
     = A E[(X − EX)(X − EX)^T] A^T
     = A CX A^T .
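The identity C_AX = A CX A^T can be checked with a few lines of code (an added sketch with arbitrary A and data; the identity holds for sample covariances as well, so the agreement is exact up to floating-point error):

    import numpy as np

    rng = np.random.default_rng(8)
    n, N = 3, 50_000
    A = rng.standard_normal((2, n))                                # arbitrary 2 x n matrix
    X = rng.standard_normal((N, n)) @ rng.standard_normal((n, n))  # rows are samples of X^T
    CX = np.cov(X, rowvar=False)                                   # sample covariance of X
    CAX = np.cov(X @ A.T, rowvar=False)                            # sample covariance of AX
    print(np.allclose(CAX, A @ CX @ A.T))                          # True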

22. Gaussian random variables and vectors

Gray & Davisson, pp. 122–123, 256–258.
Billingsley, Sec. 29, Normal Distributions in Rk.
Wong, pp. 39–46.

Gaussian variables

Definition. A random variable X is Gaussian if either of the following is true:
(a) X is constant with probability 1;
(b) X has density

fX(x) = (1/√(2πσ²)) exp{ −(x − µ)²/(2σ²) }

for some σ² > 0.

In case (a), we have X = EX with probability 1 and VarX = 0. In case (b), we have EX = µ and VarX = σ². In both cases we use the notation

X ∼ N(EX, VarX) ;

here N stands for "normal," which is synonymous with "Gaussian." Thus the univariate Gaussian distribution is parametrically specified through its mean and its variance.

The characteristic function of a N(0, 1) distribution is given by

φ(u) = (1/√(2π)) ∫_{−∞}^{∞} e^{iux − x²/2} dx = (e^{−u²/2}/√(2π)) ∫_{−∞}^{∞} e^{−(x − iu)²/2} dx .


The last integral in the above expression is equal to that of the analytic function exp(−z²/2) along a path parallel to the real axis, and reduces by a standard argument to the integral of the same function over the real axis itself. Thus

φ(u) = (e^{−u²/2}/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx = e^{−u²/2} .

If X ∼ N(µ, σ²) and σ² > 0, we can write

X = σY + µ

for appropriate Y ∼ N(0, 1), and thus

φX(u) = e^{iµu} φY(σu) = e^{iµu − σ²u²/2} .

If X ∼ N(µ, 0), then X = µ with probability 1, and

φX(u) = e^{iµu} .

Thus the identity

φX(u) = exp{ i(EX)u − (1/2)(VarX)u² }

holds for all Gaussian X.

Gaussian vectors

Definition. An n-dimensional random vector X is Gaussian if it can be expressed in the form

X = AY + b ,

where A is a nonrandom n × n matrix, b is a nonrandom n-dimensional vector, and Y is an n-dimensional random vector of independent N(0, 0) or N(0, 1) components.

Two alternative definitions of a Gaussian random vector are suggested by the following theorem.

Theorem.
(i) An n-dimensional random vector X is Gaussian if and only if for all a ∈ Rn, a^T X is a Gaussian random variable.
(ii) An n-dimensional random vector X is Gaussian if and only if there exist a vector m ∈ Rn and an n × n nonnegative-definite symmetric matrix Q such that

φX(u) = exp{ im^T u − (1/2) u^T Q u } .

Proof. We will establish each of the following implications:

1. X is Gaussian =⇒ X satisfies the condition of statement (i);
2. X satisfies the condition of statement (i) =⇒ X satisfies the condition of statement (ii);
3. X satisfies the condition of statement (ii) =⇒ X is Gaussian.


1. Let X = AY + b, where Y is an n-dimensional random vector with independent N(0, 0) or N(0, 1) components; for concreteness, let Yk ∼ N(0, σk²), so that (CY)kk = σk². By virtue of independence, φY is given by

φY(u1, . . . , un) = ∏_{k=1}^n φYk(uk) = ∏_{k=1}^n e^{−σk² uk²/2} = exp{ −(1/2) u^T CY u } .

Since a^T X = a^T A Y + a^T b, we have

φ_{a^T X}(u) = exp{i(a^T b)u} φY(A^T a u) = exp{ i(a^T b)u − (1/2)(a^T A) CY (a^T A)^T u² } .

Since E[a^T X] = a^T b and

Var(a^T X) = C_{a^T X} = C_{a^T A Y} = (a^T A) CY (a^T A)^T ,

we conclude that a^T X is a Gaussian random variable.

2. Assume that a^T X is Gaussian for every choice of a ∈ Rn. Then for u ∈ Rn we can write

φX(u) = E[e^{iu^T X}] = φ_{u^T X}(1) = exp{ iE[u^T X] − (1/2)Var(u^T X) } .

Now E[u^T X] = u^T E[X] and Var(u^T X) = u^T CX u, whence we conclude that

φX(u) = exp{ i(EX)^T u − (1/2) u^T CX u } .

CX is always symmetric; for nonnegative-definiteness we need u^T CX u ≥ 0 for all u, which is true since u^T CX u = Var(u^T X).

3. Let the n-dimensional random vector X have characteristic function

φX(u) = exp{ im^T u − (1/2) u^T Q u } ,

where Q is an n × n nonnegative-definite symmetric matrix. From elementary linear algebra, we can express any n × n symmetric matrix Q as

Q = B Λ B^T ,

where Λ is an n × n diagonal matrix whose diagonal entries coincide with the eigenvalues of Q, and B is an n × n matrix whose columns coincide with the column eigenvectors of Q. B is orthogonal, i.e., B^T B = I.

The assumption that Q is nonnegative-definite is the same as requiring that all eigenvalues of Q be nonnegative. If all eigenvalues are positive (i.e., if Q is positive-definite), then both Q⁻¹ and Λ⁻¹ exist. If, however, one or more eigenvalues are zero, then neither Q⁻¹ nor Λ⁻¹ exists. We surmount such difficulties by defining the diagonal matrix ∆ as follows:

∆kk = (Λkk)^{−1/2}, if Λkk > 0; ∆kk = 1, otherwise.

∆ is clearly invertible, and

J = ∆ Λ ∆^T

is a diagonal matrix such that Jkk = 1 if Λkk > 0, Jkk = 0 otherwise.

Consider now the transformation

Y = ∆ B^T (X − m) .

We have

φY(u) = φ_{X−m}(B ∆^T u)
      = exp{ −(1/2) u^T (∆B^T) Q (∆B^T)^T u }
      = exp{ −(1/2) u^T (∆B^T) B Λ B^T (∆B^T)^T u }
      = exp{ −(1/2) u^T J u } .

Thus Y is a vector of independent N(0, 0) or N(0, 1) Gaussian components. It is easily verified that

X = B ∆⁻¹ Y + m .

Further properties of Gaussian vectors

(a) If X is a Gaussian vector in Rn and A : Rn → Rm is a linear transformation, then AX is also a Gaussian vector in Rm. Thus the Gaussian property is preserved under all linear transformations. This fact follows easily from the characterization of Gaussian vectors in terms of their characteristic function; the special case m = 1 was also treated in the theorem of the previous subsection.

Remark. Transformations of the form Y = AX + b are known as affine. Clearly the Gaussian property is also preserved under affine transformations.

(b) The distribution of a Gaussian vector X is fully specified by its expectation EX and covariance matrix CX; no other parameters are needed.


(c) If the components of a Gaussian vector X are uncorrelated, then they are also independent. This is so because for a diagonal covariance matrix CX, the characteristic function φX can be decomposed into a product of univariate Gaussian characteristic functions. Thus in the Gaussian case, the two properties of independence and absence of correlation are equivalent. In the general non-Gaussian case, we have seen that uncorrelatedness does not imply independence.

Whitening of random vectors

Given any random vector X in Rn (not necessarily Gaussian) with EX = m, we can always find a zero-mean vector Y in Rn such that

X = AY + m

and the components of Y are uncorrelated. We have essentially shown this in step 3 of the proof given earlier: CX is a nonnegative-definite symmetric matrix, and can thus be represented as

CX = B Λ B^T

for B orthogonal and Λ diagonal with nonnegative entries. Defining ∆ as before, we take

Y = ∆ B^T (X − m) , X = B ∆⁻¹ Y + m .

Hence EY = 0, and

CY = ∆ B^T CX B ∆^T = ∆ Λ ∆^T = J ,

where J is a diagonal matrix such that Jkk = 1 if Λkk > 0, Jkk = 0 otherwise. Note that if X is Gaussian, then Y is also Gaussian with uncorrelated, hence independent, components. In the non-Gaussian case, the components of Y need not be independent.

The choice of transformation A is not unique; the above construction in terms of B, Λ and ∆ is just one standard method of whitening X, i.e., expressing X as an affine transformation of a zero-mean vector Y with uncorrelated components. In the special case in which CX is positive definite (hence invertible), any matrix A such that

CX = A A^T

can yield a whitening transformation. Indeed, A is then also invertible, and if

Y = A⁻¹(X − m) ,

we have

X = AY + m , CY = A⁻¹ CX (A⁻¹)^T = A⁻¹ A A^T (A⁻¹)^T = I .

Different choices of A include B Λ^{1/2} (which coincides with B ∆⁻¹), B Λ^{1/2} B^T (symmetric), as well as the upper- and lower-triangular forms that are obtained via Gram-Schmidt orthogonalization.
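Here is a short sketch of the eigendecomposition-based whitening described above (added for illustration; numpy.linalg.eigh returns the eigenvalues and the orthogonal matrix B of a symmetric matrix, and the 1e-12 threshold for "zero" eigenvalues is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(9)
    N, n = 100_000, 3
    L = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.3, 0.1]])
    X = rng.standard_normal((N, n)) @ L.T + 5.0      # samples with covariance roughly L L^T and mean 5
    m = X.mean(axis=0)
    CX = np.cov(X, rowvar=False)

    lam, B = np.linalg.eigh(CX)                      # CX = B diag(lam) B^T, B orthogonal
    Delta = np.diag(np.where(lam > 1e-12, 1 / np.sqrt(np.maximum(lam, 1e-12)), 1.0))
    Y = (X - m) @ (Delta @ B.T).T                    # Y = Delta B^T (X - m), one row per sample
    print(np.round(np.cov(Y, rowvar=False), 3))      # close to the identity (the matrix J in general)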


The density of a Gaussian vector.

Let X ∈ Rn be a Gaussian random vector on (Ω, F, P) such that EX = m, CX = Q. We seek an expression for the density of X in Rn.

(a) Q invertible. As in the previous subsection, we write

CX = A A^T

and consider the transformation Y = A⁻¹(X − m). From the foregoing discussion we know that CY = I, and hence the components of Y are independent N(0, 1) Gaussian variables. The density of Y is then given by

fY(y) = ∏_{k=1}^n (1/√(2π)) e^{−yk²/2} = (2π)^{−n/2} exp{ −(1/2) y^T y } .

To derive the density of X, we note that the probability of any n-dimensional rectangle H under PX is

PX(H) = PY(A⁻¹(H − m)) = ∫ · · · ∫_{A⁻¹(H−m)} fY(y) dy1 · · · dyn .

To evaluate the above integral, we use the invertible mapping

y = A⁻¹(x − m)

and invoke the rule for change of variables via Jacobians:

dy1 · · · dyn = |det S| dx1 · · · dxn ,

where the matrix-valued function S of x is such that

Sij = ∂yi/∂xj .

In this case we simply have S = A⁻¹. We can therefore write

PX(H) = ∫ · · · ∫_H fY(A⁻¹(x − m)) |det A⁻¹| dx1 · · · dxn
      = ∫ · · · ∫_H (2π)^{−n/2} |det A⁻¹| exp{ −(1/2)(x − m)^T (A⁻¹)^T A⁻¹ (x − m) } dx1 · · · dxn .

Recalling that Q = A A^T, so that (A⁻¹)^T A⁻¹ = Q⁻¹ and |det A⁻¹| = (det Q)^{−1/2}, we obtain the final expression for the density of X:

fX(x) = (2π)^{−n/2} (det Q)^{−1/2} exp{ −(1/2)(x − m)^T Q⁻¹ (x − m) } .


(b) Q non-invertible. In this case Q has at least one zero eigenvalue. If b is an eigenvector corresponding to a zero eigenvalue, then

Qb = 0 ,

which implies that

Var(b^T X) = b^T Q b = 0 .

Thus the random variable b^T X is constant (equal to its expectation) with probability one, and the random vector X lies on the hyperplane

{x : b^T x = b^T m}

almost surely. Since hyperplanes are of dimension n − 1, PX is a singular distribution on Rn and fX does not exist.

In general, if exactly k eigenvalues of Q are equal to zero, then X lies in an (n − k)-dimensional set which is the intersection of k orthogonal hyperplanes. This means that the randomness of X is effectively "limited" to n − k components, from which the remaining k components are obtained via affine combinations. This is consistent with our earlier discussion on whitening: the random vector

Y = ∆ B^T (X − m)

will be such that

CY = J ,

and will thus have only n − k truly random (and independent) components; the remaining components will be zero with probability one. Thus in writing

X = B ∆⁻¹ Y + m ,

we are effectively constructing X using n − k random variables only.
