
CHAPTER 1

Basic Probability Theory

The way probability deals with randomness and uncertainty is to assemble all possible outcomes of the experiment under consideration into a big set called a sample space, Ω. The realization of any of these outcomes then becomes a process of drawing lots. Each lot is assigned a 'weight' and probabilities are computed by 'adding' the weights of the lots involved. It is along these lines that we will try here to set the guidelines for the construction of a probabilistic model for a wide range of experiments.

1. Sample Spaces and σ-fields

Definition 1. A σ-field F on Ω is a family of subsets of Ω which satisfies the following properties:

(1) Ω belongs to F,
(2) if A ∈ F then A^c ∈ F,
(3) if a countable sequence of sets A_1, A_2, ... is in F, then the union ⋃_i A_i ∈ F.

When a set A belongs to F, we say that it is F-measurable or, if no confusion is possible, that it is measurable. Measurable sets are also called events.

Example 2. The set {∅, Ω} is a σ-field. It is called the trivial σ-field and is the smallest possible σ-field. The set of all subsets of Ω is also a σ-field on Ω. It is called the total σ-field and is the largest σ-field on Ω. The set {∅, A, A^c, Ω}, where A is a proper subset of Ω (∅ ⊂ A ⊂ Ω), is another σ-field on Ω; it is called the σ-field generated by the event A.

The σ-field F generated by two proper subsets A and B, such that none of the sets A, B, A^c or B^c are included in one another, is F = {∅, A ∩ B, A ∩ B^c, A^c ∩ B, A^c ∩ B^c, A, B, (AΔB)^c, AΔB, B^c, A^c, A ∪ B, A ∪ B^c, A^c ∪ B, (A ∩ B)^c, Ω}. A more systematic way of representing the elements of F is {(A ∩ B)^{α_1} ∪ (A ∩ B^c)^{α_2} ∪ (A^c ∩ B)^{α_3} ∪ (A^c ∩ B^c)^{α_4}; α_1, α_2, α_3, α_4 ∈ {0, 1}}, where E^α = ∅ if α = 0, and E^α = E if α = 1.
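These sixteen sets can be listed mechanically as unions of the four atoms A ∩ B, A ∩ B^c, A^c ∩ B and A^c ∩ B^c. The following minimal Python sketch (with an assumed toy sample space and assumed sets A and B, purely for illustration) does exactly that:

from itertools import combinations

Omega = frozenset(range(1, 9))          # assumed toy sample space {1, ..., 8}
A = frozenset({1, 2, 3, 4})             # assumed example event
B = frozenset({3, 4, 5, 6})             # assumed example event

# the four atoms A∩B, A∩B^c, A^c∩B, A^c∩B^c (dropping any that happen to be empty)
atoms = [a for a in (A & B, A - B, B - A, Omega - (A | B)) if a]

sigma_field = set()
for r in range(len(atoms) + 1):
    for combo in combinations(atoms, r):
        sigma_field.add(frozenset().union(*combo))   # union of the chosen atoms

print(len(sigma_field))   # 16 = 2**4, since all four atoms are non-empty here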

A σ-field is often seen as a particular piece of information available (but not yet observed) to a given observer. Different pieces of information can be included in one another (see Figure 1) but can also complement one another (see Figure 2).

Figure 1. The σ-fields generated by A (left) and by A and B (right)


Figure 2. The σ-fields generated by A (left) and by B (right)

The information conveyed by the boxes in Figures 1 and 2 is as follows. After the experiment is run, one (and only one) area of each box is selected, describing the whereabouts of the outcome. For example, if the outcome ω of the experiment belongs to A ∩ B^c, then the areas A, A ∩ B^c and B^c would be selected. Clearly the observer looking at the box on the right of Figure 1 is best informed; his knowledge is that of the two observers in Figure 2 combined.

Theorem 3. Let (F_i)_i be a collection of σ-fields on Ω. Then ⋂_i F_i is also a σ-field on Ω.

Let C be a class of subsets of Ω. There exists a smallest σ-field containing C, i.e. it is a σ-field, it contains C and every σ-field containing C will contain it. It will be denoted by σ(C) and is equal to ⋂_i F_i, where the F_i's describe the set of all σ-fields containing C. We say that the σ-field σ(C) is generated by C.

Example 4. The most important example of a σ-field is probably the class, B(R), of Borel sets on R. These sets are the subsets of R that we use most; intervals (of all four types) are examples of such sets. In fact, if C is the set of semi-closed on the right intervals of the real line, i.e. C = {(a, b]; −∞ < a < b < +∞}, then B(R) = σ(C). B(R) contains, of course, all intervals but also all countable unions of semi-closed on the right intervals, the singletons and a wide range of countable combinations of all the above. One of the most famous examples of a non-trivial Borel set is the Cantor set.

Example 5. The union of two σ-fields, say F_1 and F_2, is, in general, not a σ-field (unless one is included in the other). Indeed, let A and B be two proper subsets of Ω which are different and not complements of one another, in other words the four events A, B, A^c and B^c are different. Then the union of F_1 = σ(A) and F_2 = σ(B) is equal to {∅, A, B, A^c, B^c, Ω} and is not a σ-field as it does not contain, for example, A ∪ B. This is a situation where a σ-field big enough to contain F_1 and F_2 needs to be constructed. The smallest amongst them, σ(F_1 ∪ F_2), is denoted by F_1 ∨ F_2.

Definition 6. A partition of Ω is a collection D of exhaustive and mutually exclusive subsets, D_1, D_2, ..., such that for m ≠ n, D_m ∩ D_n = ∅, and ⋃_n D_n = Ω.

The σ-field F generated by a finite partition D is the collection of all finite unions of D_j's and their complements. The sets in D are the basic building blocks for the σ-field.

2. Probability measures

We now assume that we are given a measurable space, i.e. a pair (Ω, F) made of a sample space Ω and a σ-field F on Ω.


Definition 7. A probability measure (or probability) P is a function defined on F, taking values in [0, 1], which satisfies the following two properties:

(1) P(Ω) = 1,
(2) if (A_i)_i is a countable sequence of disjoint (i.e. ∀i ≠ j, A_i ∩ A_j = ∅) F-measurable events, then

(1)    P(⋃_i A_i) = Σ_i P(A_i).

(1) is called countable additivity. When (1) is only satisfied for finite sequences, it is called finite additivity. In fact, finite additivity reduces to (1) being satisfied for sequences of 2 events (check it).

The triple (Ω, F, P) is called a probability space.

A number of first properties are directly deduced from the definition of a probability P. We give here some of them and leave the proof to the reader.

Theorem 8.
(1) ∀A ∈ F, P(A^c) = 1 − P(A),
(2) P(∅) = 0,
(3) ∀A, B ∈ F, A ⊆ B ⇒ P(A) ≤ P(B),
(4) ∀A, B ∈ F, P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
(5) ∀(A_n)_n, P(⋃_n A_n) ≤ Σ_n P(A_n).

Proposition 9. If A_1, A_2, ... is a decreasing sequence of events, then

P(⋂_n A_n) = lim_n P(A_n).

We say that P is continuous.

Theorem 10. The countable additivity condition in Definition 7 can be replaced by that of finite additivity and continuity.

An important application of these basic results for probability measures is the celebrated Borel-Cantelli Lemma. For any sequence of events (A_n)_n, we denote by lim sup_n A_n, or A_n (i.o.) ((i.o.) stands for infinitely often), the set of ω's which belong to infinitely many A_n's. Clearly,

A_n (i.o.) = lim sup_n A_n = ⋂_n ⋃_{m≥n} A_m,

and

P[A_n (i.o.)] = P(⋂_n ⋃_{m≥n} A_m) = lim_n P(⋃_{m≥n} A_m) ≤ lim_n Σ_{m≥n} P(A_m).

The Borel-Cantelli Lemma now easily follows.

Theorem 11 (Borel-Cantelli Lemma I). Let (A_n)_n be a sequence of events (in F).

Σ_n P(A_n) < +∞ ⟹ P(lim sup_n A_n) = 0.


3. Random Variables

Very often the exact outcome of an experiment is of little importance and all we are interested in is a particular 'measurement'. By measurement, we mean here a real number associated, in some way, to the outcome. For example, in the process of tossing a coin three times, there are eight possible outcomes (Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}), but if we are only interested in how many times H comes up, it is enough to keep track of the number of occurrences of H and not of the actual outcome. That is, we will associate the value 0 to the outcome TTT, the value 1 to the outcomes HTT, THT and TTH, the value 2 to the outcomes HHT, HTH and THH, and the value 3 to the outcome HHH. Such a correspondence (or mapping, or function) is called a random variable. As will become clear later, an important requirement for random variables is that we can make probability statements about them. We may, for example, want to evaluate the probability that the number, say X, of heads in 3 tosses of a fair coin equals 2. Since {ω; X(ω) = 2} = {HHT, HTH, THH} and P({HHT}) = P({HTH}) = P({THH}) = 1/8, it is easily found that the probability in question equals 3/8. It is therefore essential that sets such as {ω; X(ω) = x}, {ω; X(ω) ≤ x}, {ω; x < X(ω) ≤ y}, ..., be measurable, i.e. that they belong to the σ-field on which the probability P is defined. This requirement is automatically met in the countable case (finite or infinite) since the σ-field used is the total σ-field. For this reason any real-valued function on a countable sample space is a random variable. This is not the case in general. For an arbitrary Ω we will require that all sets {ω; X(ω) ∈ A}, where A is a Borel set, be F-measurable. We now assume that we are given a probability space (Ω, F, P).
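For readers who want to verify such a count, here is a minimal Python sketch (purely illustrative) that enumerates the eight outcomes and recovers P[X = 2] = 3/8:

from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=3))    # the sample space: 8 equally likely outcomes
weight = Fraction(1, len(outcomes))         # each outcome carries weight 1/8

def X(omega):
    return omega.count("H")                 # X counts the number of heads

print(sum(weight for omega in outcomes if X(omega) == 2))   # 3/8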

Definition 12. A real-valued function X is a random variable if and only if, for any x in R, the set {ω; X(ω) ≤ x} belongs to F. A random variable is also said to be F-measurable.

The smallest σ-field for which X is a random variable is called the σ-field generated by X and is denoted by σ(X).

4. Independence and Conditioning

It is often the case that part of the outcome of a given experiment is known to the observer, while the remaining part is not disclosed. The aim is then to reassess, based on the (partial) observations made, the chances of a particular event happening.

Consider, for example, a game of rolling two fair six-sided dice where the gain, in dollars, is the maximum of the two dice. A straightforward calculation shows that the expected payoff of the game is (roughly) $4.47. Suppose now that you are told that the sum of the two dice is 9; how does the expected payoff change? There are 4 equally likely ways of getting 9: (3,6), (4,5), (5,4) and (6,3). The probability of getting a maximum of, say 6, increases from 11/36 to 1/2 and the expected return of the new game is $5.50. The bit of information made available to the player has restricted the number of possible outcomes (from 36 to 4) and has redistributed the probabilities of each simple event happening; the probability of a simple event increases from 1/36 to 1/4.
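Both dollar figures can be checked by enumerating the 36 equally likely outcomes; the short Python sketch below (illustrative only) does so:

from fractions import Fraction
from itertools import product

dice = list(product(range(1, 7), repeat=2))                  # the 36 equally likely outcomes

expected_max = sum(Fraction(max(d), 36) for d in dice)
print(float(expected_max))                                   # 4.4722..., roughly $4.47

given_sum_9 = [d for d in dice if sum(d) == 9]               # (3,6), (4,5), (5,4), (6,3)
print(sum(max(d) for d in given_sum_9) / len(given_sum_9))   # 5.5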

In the same manner,


Definition 13. If A and B are two arbitrary events, the latter being of positive probability, the conditional probability of A given B, denoted by P(A|B), is

(2)    P(A|B) = P(A ∩ B) / P(B).

We immediately see that the properties of Definition 7 are satisfied by the set function P(·|B), for a fixed event of positive probability, B. That is,

P(∅|B) = 0,   P(Ω|B) = 1,

and, for any countable sequence (An)n of disjoint events,

P(⋃_n A_n | B) = (1/P(B)) P((⋃_n A_n) ∩ B)
            = (1/P(B)) P(⋃_n (A_n ∩ B))
            = (1/P(B)) Σ_n P(A_n ∩ B)
            = Σ_n P(A_n | B).

P(·|B) is therefore a probability measure. It satisfies the well-known law of total probability and Bayes' formula.

Theorem 14. Let D_1, D_2, ... be a measurable partition of Ω, and A be an arbitrary event. Then

(3)    P(A) = Σ_n P(A|D_n) P(D_n)

and

(4)    P(D_n|A) = P(A|D_n) P(D_n) / Σ_k P(A|D_k) P(D_k).
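As a small numerical illustration of (3) and (4), the Python sketch below applies them to an assumed three-set partition with made-up probabilities (the numbers are not from the text):

from fractions import Fraction

P_D = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]            # assumed P(D_n) for D_1, D_2, D_3
P_A_given_D = [Fraction(1, 10), Fraction(1, 2), Fraction(1, 4)]   # assumed P(A | D_n)

P_A = sum(pa * pd for pa, pd in zip(P_A_given_D, P_D))            # law of total probability (3)
posteriors = [pa * pd / P_A for pa, pd in zip(P_A_given_D, P_D)]  # Bayes' formula (4)

print(P_A)               # 31/120
print(posteriors)        # the P(D_n | A)
print(sum(posteriors))   # 1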

When A and B are two events of positive probability, the realization of one of them may increase the chances of the other one happening, may decrease them, or may have no effect on them. For example, in the game of rolling two fair six-sided dice, knowing that the sum is 9 increases (from 11/36 to 1/2) the chances of the maximum being 6, and knowing that the maximum is 5 decreases (from 11/36 to 2/9) the chances of the minimum being 1. But knowing that the first die shows a 6 does not increase and does not decrease the probability that the second will also show a 6. In this case we say that the two events are independent. Note that in this example P[both dice show a 6] = 1/36 = P[first die shows a 6] × P[second die shows a 6]. In general,

Definition 15. The two events A and B are independent if P(A ∩ B) = P(A)P(B).

The condition P(A ∩ B) = P(A)P(B) being equivalent to P(A|B) = P(A) and to P(B|A) = P(B), for events of positive probability, either of the latter equalities could be used as a definition of independence. In fact, expressing independence in terms of conditional probabilities is more natural. The reason for adopting Definition 15 is to encompass the case of events of nil probability.

Independence is obviously dependent upon the probability measure in use. The following example shows how two events may be independent under one probability and dependent under another.


Example 16. Let Ω = {ω_1, ω_2, ω_3, ω_4}, P be the probability

P({ω_i}) = 1/4, ∀i ∈ {1, 2, 3, 4},

Q be the probability

Q({ω_i}) = 1/3, ∀i ∈ {1, 2, 3},

A be the event {ω_1, ω_2} and B be the event {ω_1, ω_3}. Then P(A ∩ B) = 1/4 = P(A)P(B) and Q(A ∩ B) = 1/3 ≠ 4/9 = Q(A)Q(B), which shows that A and B are P-independent but not Q-independent.
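A quick verification of Example 16 (an illustrative sketch only):

from fractions import Fraction

P = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 4), 4: Fraction(1, 4)}
Q = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3), 4: Fraction(0)}
A, B = {1, 2}, {1, 3}

def prob(measure, event):
    return sum(measure[w] for w in event)     # weight of an event under a discrete measure

for name, m in (("P", P), ("Q", Q)):
    print(name, prob(m, A & B), prob(m, A) * prob(m, B))   # P: 1/4 and 1/4; Q: 1/3 and 4/9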

The concept of two independent events can be generalized to three or more events.

Definition 17. The sets A_i, i ∈ I, are said to be independent if for any n and any subset {i_1, ..., i_n} of I,

(5)    P(A_{i_1} ∩ ... ∩ A_{i_n}) = P(A_{i_1}) ··· P(A_{i_n}).

They are said to be pairwise independent if (5) is satisfied for subsets of I of size two.

Note that the index set, I, can be finite or infinite.

The following example shows that independence is stronger than pairwise independence.

Example 18. Consider the previous example. Under the probability P, the events A, B and C = {ω_1, ω_4} are pairwise independent, but

P(A ∩ B ∩ C) = P({ω_1}) = 1/4 ≠ 1/8 = P(A)P(B)P(C).

Furthermore, if A and B are independent events, then so are the events A and B^c, the events A^c and B, and the events A^c and B^c. In fact, every set of {∅, A, A^c, Ω} is independent of every set in {∅, B, B^c, Ω}. Let us verify the independence of A and B^c. The others have an identical proof.

P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)(1 − P(B)) = P(A)P(B^c).

It is then natural to say that

Definition 19.
(1) The families 𝒜_i, i ∈ I, are independent if for any n, any subset {i_1, ..., i_n} of I and any A_{i_1}, ..., A_{i_n} in 𝒜_{i_1}, ..., 𝒜_{i_n} respectively,

P(A_{i_1} ∩ ... ∩ A_{i_n}) = P(A_{i_1}) ··· P(A_{i_n}).

(2) The random variables X_i, i ∈ I, are independent if the families σ(X_i), i ∈ I, are independent.

For all x and y, the events {X ≤ x} and {Y ≤ y} belong to σ(X) and σ(Y) respectively. Thus, the independence of X and Y implies that of the events {X ≤ x} and {Y ≤ y}, and in fact that of the events {X ∈ A} and {Y ∈ B}, for any couple (A, B) of Borel sets. This is the key to the next theorem.

Theorem 20. Let X and Y be two random variables. The following statements are equivalent:
(1) X and Y are independent,
(2) g(X) and h(Y) are independent, for any pair of borelian functions g and h,


(3) {X ∈ A} and {Y ∈ B} are independent, for any pair of borelian sets A and B.

Let us now reconsider the game of rolling two fair six-sided dice and let us try to assess the chances of getting a maximum of, say 6, given that the sum is either of 11 possible values: 2, ..., 12. We find

s                      | ≤ 6 |  7  |  8  |  9  | 10  | ≥ 11
P[Max. = 6 | Sum = s]  |  0  | 1/3 | 2/5 | 1/2 | 2/3 |  1

This table defines a random variable taking the values 0, 1/3, 2/5, 1/2, 2/3 and 1 on the events {Sum ≤ 6}, {Sum = 7}, {Sum = 8}, {Sum = 9}, {Sum = 10} and {Sum ≥ 11} respectively. These events constitute a partition of Ω, and every time the dice are rolled and the sum made known to the observer, a quick look at the table, or at the random variable it defines, gives the chances of a 6 having been drawn.
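The table itself can be recomputed by counting outcomes; a short, illustrative Python sketch:

from fractions import Fraction
from itertools import product

dice = list(product(range(1, 7), repeat=2))

def p_max6_given(pred):
    restricted = [d for d in dice if pred(d)]     # condition on the value of the sum
    return Fraction(sum(1 for d in restricted if max(d) == 6), len(restricted))

conditions = [("s <= 6", lambda d: sum(d) <= 6), ("s = 7", lambda d: sum(d) == 7),
              ("s = 8", lambda d: sum(d) == 8), ("s = 9", lambda d: sum(d) == 9),
              ("s = 10", lambda d: sum(d) == 10), ("s >= 11", lambda d: sum(d) >= 11)]
for label, pred in conditions:
    print(label, p_max6_given(pred))              # 0, 1/3, 2/5, 1/2, 2/3, 1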

One can, of course, generalize the above, and condition any event with respect to any (finite) partition.

Definition 21. Let D = {D_1, D_2, ...} be a measurable partition of Ω. The conditional probability of an event A given the partition D is the random variable

(6)    P(A|D) = Σ_n P(A|D_n) 1_{D_n}.

The conditional probability of an event A with respect to the trivial partition D = {Ω} simply reduces to the probability of A;

P(A|Ω) = P(A).

Furthermore, P(A|D_n) can easily be obtained from P(A|D) by selecting an ω in D_n and then computing P(A|D)(ω). This is no longer true, for example, for P(A|D_m ∪ D_n), where m ≠ n.

5. One-dimensional distributions

Definition 22. The probability measure P_X on the set of Borel sets of R, defined by

(7)    P_X(A) = P(X ∈ A),

is called the law or distribution of the random variable X, and the function F_X defined by

(8)    F_X(x) = P_X((−∞, x]) = P[X ≤ x]

is called the distribution function of the random variable X.

Theorem 23. The function F_X defined by (8) is non-decreasing, right-continuous and satisfies:

(9)    F_X(−∞) := lim_{x↓−∞} F_X(x) = 0 and F_X(+∞) := lim_{x↑+∞} F_X(x) = 1.

Clearly the knowledge of P_X induces that of F_X. It turns out that it is possible to 'reconstruct' P_X by means of F_X. The key to this inversion is the celebrated Carathéodory's theorem.

Theorem 24. There is a one-to-one correspondence between probability measures on (R, B(R)) and distribution functions (i.e. non-decreasing, right-continuous real-valued functions which satisfy (9)).

The next results describe the behaviour of distribution functions.

Theorem 25. The set of points of jump of a distribution function is countable.


Theorem 26. A distribution function which increases only by jumps is called a discrete df. It can be written as

Σ_{j=1}^{+∞} δ_j 1_{[a_j, +∞)},

where the a_j's are the different points of jump and the δ_j's the corresponding sizes. A distribution function with no jumps (i.e. continuous) is called a continuous distribution function.

Let F be a distribution function. There exists a discrete distribution function F_d, a continuous distribution function F_c and a real number α ∈ [0, 1] such that

(10)    F = α F_d + (1 − α) F_c.

Such a decomposition is unique¹. F_d is called the discrete part of F and F_c its continuous part.

Definition 27. Let X be a random variable. Its distribution P_X is said to be discrete if the distribution function F_X is discrete, and continuous if F_X is continuous.

Definition 28. A (continuous) distribution function is said to have a density f if f is a non-negative function and

(11)    F(x) = ∫_{−∞}^{x} f(t) dt.

Such a distribution function is also said to be absolutely continuous. If no such f exists (anywhere), the distribution function is said to be singular.

Theorem 29. A continuous distribution function F can be decomposed as

F = β F_a + (1 − β) F_s,

where F_a and F_s are the absolutely continuous part and the singular part of F, respectively, and β ∈ [0, 1].

Discrete distribution functions can be described by their jumps. Namely, if F_X is the discrete distribution function of a random variable X, it can be expressed in terms of the function

p_X(x) = ΔF_X(x) = F_X(x) − F_X(x−)

by

F_X(x) = Σ_{y≤x} p_X(y).

Absolutely continuous distribution functions are expressed by means of their density,

F_X(x) = ∫_{−∞}^{x} f_X(y) dy.

These densities can, in turn, be formulated in terms of their distribution functions; a density f_X (almost everywhere) equals the derivative of its distribution function F_X,

f_X(x) = F'_X(x).

¹ Unless α = 0 or α = 1.


6. Expectation

The notion of expectation originated in gambling. People were interested in assessing their 'expected' fortune at the end of a particular game. In a coin-tossing game, for example, where a win increases your capital by, say $w, and a loss changes it by, say $l (a negative amount), the expected gain (or loss) after one game is clearly $(pw + (1 − p)l), where p is the probability of winning. Any triple (w, l, p) for which the quantity pw + (1 − p)l is nil represents a fair game, and a favourable one if this quantity is positive. It is worth stressing that a fair game is not one in which the outcome is always zero, but only one in which, in a long sequence of such games, the average loss (or gain) is expected to get very small. This idea of expectation was quickly extended to more complex games with the same rule in mind; the expected return of a particular game should be the weighted average of all possible returns.

Following the model of a gambling game, we introduce the following definitions.

Definition 30. If X is a discrete random variable, then the expectation of X is defined as

(12)    IE[X] = Σ_x x p_X(x).

Example 31. Let X be a discrete random variable which takes the values 1, ..., n with equal probabilities, i.e. ∀k = 1, ..., n, P[X = k] = 1/n. It is easily seen that

IE[X] = Σ_{k=1}^{n} k P[X = k] = (1/n) Σ_{k=1}^{n} k = n(n + 1)/(2n) = (n + 1)/2.
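A one-line numerical check of this computation (illustrative sketch):

from fractions import Fraction

n = 10
mean = sum(k * Fraction(1, n) for k in range(1, n + 1))   # sum of k * P[X = k]
print(mean == Fraction(n + 1, 2))                         # True: the mean is (n + 1)/2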

Definition 32. If X is an absolutely continuous random variable, then the expectation of X is defined as

(13)    IE[X] = ∫ x f_X(x) dx.

Often, one is led to consider a (measurable) function of a random variable. The following result allows us to compute its expected value without having to compute its distribution.

Theorem 33 (Change of variables formula). Let g be a Borel function and X be a random variable on (Ω, F, P). Then

(14)    IE[g(X)] = Σ_x g(x) p_X(x)    if X is discrete,
        IE[g(X)] = ∫ g(x) f_X(x) dx   if X is continuous.

Definition 34. The nth moment of a random variable X is IE[X^n] (whenever it exists). Its variance is var(X) = IE[X^2] − IE[X]^2.

Proposition 35. var(X) = IE[X^2] − IE[X]^2 = IE[(X − IE[X])^2], and IE[X^2] ≥ IE[X]^2 with equality only if X is non-random.

7. Jointly Distributed Random Variables

Definition 36. The joint distribution function of X and Y is

F(x, y) = P(X ≤ x, Y ≤ y).

The marginal distribution functions of X and Y are given by

F_X(x) = F(x, +∞) and F_Y(y) = F(+∞, y).


Definition 37. In the case of two discrete random variables, we say that X and Y are jointly discrete with joint probability mass function

p(x, y) = P(X = x, Y = y).

Definition 38. When, for two random variables X and Y, we can write

F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) du dv,

we say that X and Y are jointly absolutely continuous with joint density f(x, y).

Proposition 39. In the jointly discrete case we have

p_X(x) = Σ_y p(x, y) and p_Y(y) = Σ_x p(x, y).

In the jointly absolutely continuous case we have

f(x, y) = ∂²F/∂x∂y (x, y),

f_X(x) = ∫ f(x, y) dy and f_Y(y) = ∫ f(x, y) dx.

Proposition 40. For any (measurable) function φ of two variables,

IE[φ(X, Y)] = Σ_x Σ_y φ(x, y) p(x, y)

in the discrete case, and

IE[φ(X, Y)] = ∫∫ φ(x, y) f(x, y) dx dy

in the jointly absolutely continuous case. In particular, IE[X + Y] = IE[X] + IE[Y] and cov(X, Y) = IE[XY] − IE[X] IE[Y] = IE[(X − IE[X])(Y − IE[Y])].

Definition 41. Two random variables X and Y are said to be independent if either of the following two equivalent statements holds:
• ∀x, y, F(x, y) = F_X(x) F_Y(y),
• ∀x, y, p(x, y) = p_X(x) p_Y(y), in the discrete case, or ∀x, y, f(x, y) = f_X(x) f_Y(y), in the jointly absolutely continuous case.

Proposition 42. If X and Y are independent then, for any φ and ψ,

IE[φ(X) ψ(Y)] = IE[φ(X)] IE[ψ(Y)].

In particular, if X and Y are independent, then IE[XY] = IE[X] IE[Y], cov(X, Y) = 0 and var(X + Y) = var(X) + var(Y).

Proposition 43. For any sequence of random variables,

var(Σ_{k=1}^{n} X_k) = Σ_{k=1}^{n} var(X_k) + 2 Σ_{l=2}^{n} Σ_{k=1}^{l−1} cov(X_k, X_l).

In particular, if the random variables X_1, ..., X_n are pairwise independent (i.e. for any k ≠ l, X_k and X_l are independent), then

var(Σ_{k=1}^{n} X_k) = Σ_{k=1}^{n} var(X_k).


8. Moment Generating Functions

Definition 44. The moment generating function of a random variable X is given by

φ(t) = IE[e^{tX}].

Theorem 45. If φ(t) exists in an open interval containing 0, then the moment generating function of X determines the distribution of X.

Proposition 46. If φ(t) exists in an open interval containing 0, then φ'(0) = IE[X], φ''(0) = IE[X^2], φ^(n)(0) = IE[X^n], and

φ(t) = Σ_n (IE[X^n] / n!) t^n.

Also, if X and Y are independent, then

φ_{X+Y}(t) = φ_X(t) φ_Y(t).

Example 47. The moment generating function of a uniform over [0, 1] random variable is

φ(t) = (e^t − 1)/t = Σ_n t^n / (n + 1)!.

Thus

IE[X^n] = 1/(n + 1),

which can easily be checked by direct integration.
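An illustrative numerical check of this claim (it assumes SciPy is available; it is not part of the derivation above):

from scipy.integrate import quad

for n in range(1, 6):
    moment, _ = quad(lambda x, n=n: x**n, 0.0, 1.0)   # IE[X^n] = integral of x^n over [0, 1]
    print(n, round(moment, 6), 1 / (n + 1))           # the two values agree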

9. Conditioning Revisited

Definition 48 (The discrete case). Let X and Y be discrete random variables.
• The conditional probability mass function of Y given X = x is given by

p_{Y|X=x}(y) = p_{(X,Y)}(x, y) / p_X(x).

• The conditional distribution function of Y given X = x is given by

F_{Y|X=x}(y) = Σ_{z≤y} p_{Y|X=x}(z).

• The conditional expectation of Y given X = x is given by

IE[Y | X = x] = Σ_y y p_{Y|X=x}(y).

Proposition 49. If X and Y are independent discrete random variables, then

p_{Y|X=x}(y) = p_Y(y), F_{Y|X=x}(y) = F_Y(y) and IE[Y | X = x] = IE[Y].

Definition 50 (The continuous case). Let X and Y be jointly absolutely continuous random variables.
• The conditional probability density function of Y given X = x is given by

f_{Y|X=x}(y) = f_{(X,Y)}(x, y) / f_X(x).

• The conditional distribution function of Y given X = x is given by

F_{Y|X=x}(y) = ∫_{−∞}^{y} f_{Y|X=x}(z) dz.


• The conditional expectation of Y given X = x is given by

IE[Y | X = x] = ∫ y f_{Y|X=x}(y) dy.

Proposition 51. If X and Y are independent jointly absolutely continuous random variables, then

f_{Y|X=x}(y) = f_Y(y), F_{Y|X=x}(y) = F_Y(y) and IE[Y | X = x] = IE[Y].

Computing expectations by conditioning. If we replace x by X in IE[Y | X = x], we obtain a random variable (a measurable function of X) which we denote by IE[Y | X]. The following result will prove to be a valuable tool.

Theorem 52. IE[IE[Y | X]] = IE[Y].

Example 53. A message requires N time units to be transmitted, where N is a geometric random variable with probability mass function p_N(n) = (1 − a)a^{n−1}, n = 1, 2, .... A single new message arrives during a time unit with probability p, and no messages arrive with probability 1 − p. Let K be the number of messages that arrive during the transmission of a single message.

(1) Find the probability mass function of K. Hint: (1 − β)^{−(k+1)} = Σ_{n=k}^{∞} (n choose k) β^{n−k}.
(2) Find IE[K] and var(K) using conditional expectation.
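A Monte Carlo sketch of this example (illustrative only; the parameter values a = 0.6 and p = 0.3 are assumed, and the comparison value IE[K] = p/(1 − a) comes from Theorem 52, since K given N = n is Binomial(n, p) and IE[N] = 1/(1 − a)):

import random

a, p = 0.6, 0.3            # assumed parameter values, for illustration only
trials = 200_000
total = 0
for _ in range(trials):
    n = 1                                   # draw N with P[N = n] = (1 - a) * a**(n - 1)
    while random.random() < a:
        n += 1
    total += sum(random.random() < p for _ in range(n))   # arrivals during the n time units

print(total / trials, p / (1 - a))          # the estimate should be close to 0.75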

10. Conditioning with respect to σ-fields

Contrary to conditioning with respect to partitions and random variables, the conditional expectation given a σ-field cannot be defined explicitly; only its existence can be obtained.

The approach uses the Radon-Nikodym theorem. We explain here the aim of this result, observe its meaning in the particular case of countable sample spaces, then state it (without proof) in the general case.

The idea is that, if P and Q are two probability measures on the same measurable space (Ω, F), it is often possible to 'balance out' the weight of any event under P in order to obtain the weight of the same event under Q.

Example 54. Let Ω = {1, 2, ...}, and let P and Q be the probabilities defined by, ∀k = 1, 2, ..., P({k}) = p_k and Q({k}) = q_k. Let us assume that if q_k is positive, so is p_k. Thus

Q(A) = Σ_{k∈A} q_k = Σ_{k∈A, q_k≠0} q_k = Σ_{k∈A, q_k≠0} (q_k/p_k) p_k.

Now let X be the rv defined, P-almost surely, by X(k) = q_k/p_k. Then

Q(A) = Σ_{k∈A, q_k≠0} X(k) p_k;

i.e. the weight of A under Q is obtained by 'correcting' by the factor X the weights under P of the atoms of A.

This, of course, becomes impossible when the event in question is P-negligible (i.e. of nil probability under P) but not Q-negligible.

Definition 55. A probability measure Q is said to be absolutely continuous with respect to a probability measure P, and we write Q ≪ P, if any P-negligible event is Q-negligible (i.e. P(A) = 0 ⟹ Q(A) = 0).


Theorem 56 (Radon-Nikodym Theorem). Let P and Q be probability measures on the same measurable space (Ω, F) and assume that Q is absolutely continuous with respect to P. Then there is a random variable X (i.e. F-measurable) such that, ∀A ∈ F,

(15)    Q(A) = IE_P[X 1_A].

The random variable X is unique up to an event of probability zero under P, i.e. if X' is an F-measurable random variable which satisfies (15), then P(X ≠ X') = 0.

Theorem 57. Let X be an integrable random variable and G be a sub-σ-field of F. Then there is a unique (up to an event of probability zero) G-measurable random variable Y which satisfies

(16)    ∀A ∈ G, IE[X 1_A] = IE[Y 1_A].

It is denoted by IE[X|G] and is called the conditional expectation of X given G.

The properties of conditional expectations are very similar to those of expectations.

Theorem 58. Let X and Y be two integrable rv's, a and b be two real numbers, and G, G_1 and G_2 be three sub-σ-fields of F. Then,
(1) IE[aX + bY | G] = a IE[X|G] + b IE[Y|G] (a.s.),
(2) if X ≤ Y (a.s.), then IE[X|G] ≤ IE[Y|G] (a.s.),
(3) IE[X | {∅, Ω}] = IE[X],
(4) if X is G-measurable, then IE[XY | G] = X IE[Y|G] (a.s.), and
(5) if G_1 ⊆ G_2, then

IE[IE[X|G_1] | G_2] = IE[IE[X|G_2] | G_1] = IE[X|G_1] (a.s.).

Proposition 59. When G is generated by a random variable Y, IE[X|Y] coincides with IE[X|G].

Example 60. Let X be a symmetrical rv of nil expectation. Thus, given that X² = a (a > 0), X has equal chances of being +√a or −√a. We should then have P[X = √a | X² = a] = P[X = −√a | X² = a] = 1/2, and IE[X | X² = a] = √a/2 + (−√a)/2 = 0. Show that this is the case by proving that IE[X | σ(X²)] = 0 (a.s.).

Theorem 61. The random variables X and Y are independent if and only if

IE[g(X) h(Y)] = IE[g(X)] IE[h(Y)]

is satisfied for any couple (g, h) of bounded borelian functions, in which case

IE[X|Y] = IE[X] (a.s.).

10.1. The L² approach. The conditional expectation IE[X | G] of the random variable X given the σ-algebra G is often called the projection of X on G. This section aims at explaining the reasons for such a terminology.

The space L^p(F) of random variables measurable with respect to F with finite moments of order p is a linear space. When any two almost surely equal random variables are identified, the function ||X||_p = (IE[|X|^p])^{1/p} is a norm on L^p(F). In fact, ||·||_p is a norm on the space, also denoted hereafter by L^p(F), of equivalence classes of elements of L^p(F), where the equivalence relation is defined by

X ∼ Y ⟺ P[X ≠ Y] = 0.


One of the desirable properties that a normed linear space can have is completeness. We say that a normed linear space is complete if any Cauchy (or fundamental) sequence is convergent; i.e. for any X_1, X_2, ... ∈ L^p(F),

lim_{m,n} ||X_n − X_m||_p = 0 ⟹ ∃X ∈ L^p(F) such that lim_n ||X_n − X||_p = 0.

A complete normed linear space is called a Banach space.

Theorem 62. (L^p(F), ||·||_p) is complete and is therefore a Banach space.

When p = 2, the mapping

⟨·,·⟩ : L²(F)² → R, (X, Y) ↦ IE[XY]

defines a scalar product which induces the norm ||·||_2. That is, for all X, Y, Z ∈ L²(F) and α, β ∈ R,
(1) ⟨X, Y⟩ = ⟨Y, X⟩,
(2) ⟨αX + βY, Z⟩ = α⟨X, Z⟩ + β⟨Y, Z⟩,
(3) ⟨X, X⟩ ≥ 0,
(4) ⟨X, X⟩ = 0 ⇔ X = 0 (a.s.),
(5) ||X||_2^2 = ⟨X, X⟩.

L²(F) is therefore what is known as a Hilbert space. Hilbert spaces are the 'natural' extension of the Euclidean spaces R^n to infinite dimension.

One of the first concepts encountered when studying Hilbert spaces (as it is for Euclidean spaces) is that of the orthogonality of two vectors (i.e. random variables in the case of L²(F)). Two random variables X and Y are said to be orthogonal, and we write X ⊥ Y, if their scalar product is nil, ⟨X, Y⟩ = IE[XY] = 0.

Another concept is that of the orthogonal projection on a closed linear subspace. The orthogonal projection of an element x of a Hilbert space H on a closed linear subspace V is the one and only element y in V such that x − y is orthogonal to all elements in V. The orthogonal projection y of x on V is known to minimize the distance from x to the elements of V,

||x − y|| = inf_{z∈V} ||x − z||.

Now if G is a sub-σ-field of F, then it is easy to see that L²(G) is a closed linear subspace of L²(F). The orthogonal projection of X ∈ L²(F) on L²(G) is the only (up to a set of probability 0) random variable Y ∈ L²(G) such that for any Z ∈ L²(G), IE[(X − Y)Z] = 0. Clearly Y is (up to a set of probability 0) the conditional expectation of X given G.

11. Characteristic Functions

A complex-valued random variable Z is a sum of the type

Z = X + iY,

where X and Y are two (real-valued) random variables. As for complex numbers, complex-valued random variables can be looked at as random vectors in R² (C and R² are often identified). Defining quantities such as means or distribution functions of a complex-valued random variable Z = X + iY then comes back to defining the same quantities for both components, X and Y. Furthermore, Z is integrable if and only if both X and Y are integrable, and

IE[Z] = IE[X] + i IE[Y].


Definition 63. The characteristic function of a random variable X, of its distribution P_X or of its distribution function F_X, is the complex-valued function

ϕ_X(t) = IE[e^{itX}] = ∫_R e^{itx} F_X(dx).

Since |cos tX| ≤ 1 and |sin tX| ≤ 1, both cos tX and sin tX are integrable, regardless of the nature of X itself. It then follows that ϕ_X(t) is perfectly defined for all t's.

Example 64. Let X =ᵈ Pn(λ). Then

ϕ_X(t) = IE[e^{itX}] = Σ_{k=0}^{+∞} e^{itk} e^{−λ} λ^k/k! = e^{−λ} Σ_{k=0}^{+∞} (λe^{it})^k/k! = e^{−λ} e^{λe^{it}}.

Example 65. Let X =ᵈ N(0, 1). Then

ϕ_X(t) = IE[e^{itX}] = ∫_{−∞}^{+∞} e^{itx} (1/√(2π)) e^{−x²/2} dx = e^{−t²/2} ∫_{−∞}^{+∞} (1/√(2π)) e^{−(x−it)²/2} dx = e^{−t²/2}.

The last equality is proved by integrating (the analytic function) e^{−z²/2} around a rectangular contour.

Let us now enumerate a number of first properties, some of which follow immediately from the definition of characteristic functions.

Theorem 66.
P1. ϕ_X(0) = 1 and |ϕ_X(t)| ≤ 1.
P2. ϕ_{aX+b}(t) = e^{ibt} ϕ_X(at).
P3. If X and Y are independent random variables, then ϕ_{X+Y}(t) = ϕ_X(t) ϕ_Y(t).
P4. ϕ_X(−t) is the complex conjugate of ϕ_X(t).
P5. ϕ_X is continuous.

Proof. P5. Let (t_n)_n be a sequence of real numbers that converges to some t. It is easily checked that the assumptions (|e^{it_nX}| ≤ 1 and lim_n e^{it_nX} = e^{itX}) of the Dominated Convergence Theorem are met for the sequence (e^{it_nX})_n. Thus,

lim_n ϕ_X(t_n) = lim_n IE[e^{it_nX}] = IE[e^{itX}] = ϕ_X(t),

and ϕ_X is continuous at t. □

Example 67. Let X =ᵈ N(µ, σ²). Then X = σY + µ for some N(0, 1) random variable Y. Thus, by P2,

ϕ_X(t) = ϕ_{σY+µ}(t) = e^{iµt} e^{−σ²t²/2}.

Example 68. Let Y =ᵈ Bernoulli(p). Then

ϕ_Y(t) = IE[e^{itY}] = q + pe^{it}, where q = 1 − p.


Now if X =ᵈ Bi(n, p), then X is the sum of n independent Bernoulli(p) random variables. Thus, by P3,

ϕ_X(t) = Π_{k=1}^{n} (q + pe^{it}) = (q + pe^{it})^n.

As will be seen later, characteristic functions actually characterize their distributions (Theorem 72) and it is possible to obtain their distribution functions by inversion (17). But let us first show how to use characteristic functions to compute, when they exist, moments of different orders.

Theorem 69. If IE[|X|] < +∞, then ϕ_X is differentiable and

ϕ'_X(t) = IE[iX e^{itX}].

Proof. Our aim is to show that lim_{h→0} (ϕ_X(t + h) − ϕ_X(t))/h exists (∀t) and is equal to IE[iX e^{itX}]. Now

(ϕ_X(t + h) − ϕ_X(t))/h = IE[e^{i(t+h)X} − e^{itX}]/h = IE[e^{itX} (e^{ihX} − 1)/h].

Furthermore,

|e^{itX} (e^{ihX} − 1)/h| ≤ |(e^{ihX} − 1)/h| ≤ |X|.

The last inequality follows from the fact that ∀x ∈ R, |e^{ix} − 1| ≤ |x|. The proof is ended using the Dominated Convergence Theorem; because X is integrable,

lim_h (ϕ_X(t + h) − ϕ_X(t))/h = IE[e^{itX} lim_h (e^{ihX} − 1)/h] = IE[e^{itX} iX]. □

The following corollary follows from the above theorem by induction.

Corollary 70. If IE[|X|^n] < +∞, then ϕ_X is n times differentiable,

ϕ_X^{(k)}(t) = IE[(iX)^k e^{itX}],   k = 1, ..., n,

and

IE[X^k] = ϕ_X^{(k)}(0)/i^k,   k = 1, ..., n.

Example 71. Let X =ᵈ N(0, 1). Then

ϕ_X(t) = e^{−t²/2},
ϕ'_X(t) = −t e^{−t²/2} ⇒ ϕ'_X(0) = 0,
ϕ''_X(t) = (t² − 1) e^{−t²/2} ⇒ ϕ''_X(0) = −1.

Thus IE[X] = 0 and IE[X²] = −1/i² = 1.

Theorem 72. Let X and Y be two random variables, F_X and F_Y be their distribution functions, and ϕ_X and ϕ_Y be their characteristic functions. Then

F_X = F_Y ⇔ ϕ_X = ϕ_Y.

Theorem 73 (Inversion Formula). Let F be a distribution function and ϕ be its characteristic function. If

∫_{−∞}^{+∞} |ϕ(t)| dt < +∞,


then F has a density and

(17)    f(x) = (1/2π) ∫_{−∞}^{+∞} e^{−itx} ϕ(t) dt.

Example 74. Let X =ᵈ Exp(1). Then

ϕ_X(t) = ∫_{0}^{+∞} e^{itx} e^{−x} dx = [e^{(it−1)x}]_{0}^{+∞} / (it − 1) = 1/(1 − it).

A random variable Y having a density

f_Y(y) = (1/2) e^{−|y|}

is said to have the Laplace distribution with parameter 1. Its characteristic function is

ϕ_Y(t) = ∫_{−∞}^{+∞} (1/2) e^{ity} e^{−|y|} dy = (1/2)[ϕ_X(t) + ϕ_X(−t)] = 1/(1 + t²).

Furthermore,

∫_{−∞}^{+∞} |ϕ_Y(t)| dt = ∫_{−∞}^{+∞} 1/(1 + t²) dt = π < +∞.

(17) can therefore be used and we have

e^{−|y|} = 2 f_Y(−y) = (1/π) ∫_{−∞}^{+∞} e^{ity} 1/(1 + t²) dt,

which shows that the cf of the C(1) distribution is e^{−|t|}.
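A numerical check of the Laplace characteristic function computed above (illustrative; it assumes SciPy is available):

import numpy as np
from scipy.integrate import quad

def laplace_cf(t):
    # only the cosine part survives, since the density (1/2) e^{-|y|} is symmetric
    value, _ = quad(lambda y: 0.5 * np.exp(-abs(y)) * np.cos(t * y), -np.inf, np.inf)
    return value

for t in (0.0, 0.5, 1.0, 2.0):
    print(t, round(laplace_cf(t), 6), round(1 / (1 + t**2), 6))   # the two columns agree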

Theorem 75. Two random variables X and Y are independent if and only if

(18)    ∀s, t, IE[e^{i(sX+tY)}] = IE[e^{isX}] IE[e^{itY}].

Note that (18) can be written in the form

ϕ_{X,Y}(s, t) = ϕ_X(s) ϕ_Y(t),

where ϕ_{X,Y}(s, t) = IE[e^{i(sX+tY)}] is called the joint characteristic function of X and Y.

Proof. Let G(x, y) = F_X(x) F_Y(y). As the product of two distribution functions, G itself is a distribution function (on R²), and

∫_{R²} e^{i(sx+ty)} G(dx, dy) = ∫_R e^{isx} F_X(dx) ∫_R e^{ity} F_Y(dy) = IE[e^{isX}] IE[e^{itY}] = IE[e^{i(sX+tY)}] = ∫_{R²} e^{i(sx+ty)} F_{X,Y}(dx, dy).

The proof is ended using the multidimensional analogue of Theorem 72, which implies that G = F_{X,Y}, i.e. X and Y are independent. □

Example 76. In general the equality ϕ_{X+Y}(t) = ϕ_X(t) ϕ_Y(t), ∀t (i.e. (18) for s = t) is not enough to imply the independence of X and Y. Here is a counterexample. Let X =ᵈ C(1). Then

ϕ_{2X}(t) = ϕ_X(2t) = e^{−2|t|} = (e^{−|t|})² = ϕ_X(t)²,


but, of course, X is not independent of itself.

12. Exercises

1. Describe the σ-field generated by three events A, B and C assuming that A ∩ B ∩ C ≠ ∅. Show that the size of the σ-field generated by the events A_1, ..., A_n which satisfy A_1 ∩ ... ∩ A_n ≠ ∅ is equal to 2^{2^n}. Discuss the assumption A_1 ∩ ... ∩ A_n ≠ ∅.

2. Let (F_i)_i be a non-decreasing (i.e. F_i ⊆ F_{i+1}) countable family of σ-fields on Ω. Show that the union F = ⋃_i F_i may or may not be a σ-field. Illustrate with examples.

3. It can be shown that every real number between 0 and 1 can be written as a series Σ_{n≥1} α_n/3^n where α_1, α_2, ... = 0, 1 or 2. Let then C be the set

{Σ_{n≥1} α_n/3^n; α_1, α_2, ... = 0 or 2}.

Show that C is B(R)-measurable and that it is uncountable. C is known as the Cantor set.

4. Describe the set of all possible random variables when F is the trivial σ-field, the total σ-field and when it is generated by a set A (by giving first an example, then by attempting, in words, a general description).

5. We know that if X is F-measurable, then so is |X|. Is the converse true?

6. Let X be a random variable taking at least the two values −1 and 1. Show that X is not σ(X²)-measurable.

7. Let E and F be mutually exclusive events in the sample space of an experiment. Suppose that the experiment is repeated until either event E or event F occurs. What does the sample space of the new super-experiment look like? Show that the probability that event E occurs before event F is P(E)/(P(E) + P(F)).

8. An urn contains b black balls and r red balls. One of the balls is drawn at random, but when it is put back in the urn, c additional balls of the same colour are put in with it. Now suppose that we draw another ball. Show that the probability that the first ball was black given that the second ball was red is b/(b + r + c).

9. An electronic system is made of n identical subsystems, at least k (k ≤ n) of which need to be operational for the system to function.
(1) Suppose that at any given instant, each subsystem functions with probability p independently of all other subsystems. What is the probability that the system is operational?
(2) In fact, under severe conditions, the probability of a given subsystem functioning at a given time is reduced to q, the events describing which subsystem functions remaining independent of each other. We further suppose that severe conditions occur with probability r. Obtain the probability that the system is operational.

10. Let X =ᵈ Pn(λ) and Y =ᵈ Pn(µ) be independent random variables.
• Obtain the conditional probability mass function of X given X + Y = z.
• Deduce the conditional mean of X given X + Y = z.
• Are X and X + Y independent random variables?


11. An urn contains three white, six red, and five black balls. Six of these balls are randomly selected from the urn. Let X and Y denote respectively the number of white and black balls selected. Compute the conditional probability mass function of X given that Y = 3. Also compute IE[X | Y = 1].

An unbiased die is successively rolled. Let X and Y denote respectively the number of rolls necessary to obtain a six and a five. Find
• IE[X]
• IE[X | Y = 1]
• IE[X | Y = 5]

12. Let X be a continuous random variable and Y = aX + b (a ≠ 0). Show that

f_Y(y) = (1/|a|) f_X((y − b)/a).

More generally, if Y = g(X) where g is a strictly monotone and differentiable function, show that

f_Y(y) = f_X(g^{-1}(y)) / |g'(g^{-1}(y))|.

13. Let X be a continuous random variable. Assume that F_X is strictly increasing on the range of X. Show that F_X(X) is a uniform over [0, 1] random variable. Conversely, if U is a uniform over [0, 1] random variable, show that F_X^{-1}(U) is distributed like X.

14. Let Z be a standard normal random variable. Show that

IE[Z^{2k}] = (2k)! / (2^k k!) and IE[Z^{2k+1}] = 0,   k = 0, 1, 2, ...

15. Let X =ᵈ N(µ, σ²) and Y =ᵈ N(ν, τ²) be independent random variables.
• Obtain the conditional probability density function of X given X + Y = z.
• Deduce the conditional mean of X given X + Y = z.
• Are X and X + Y independent random variables?

16. The joint density of X and Y is given by

f_{(X,Y)}(x, y) = c x^k e^{−λx} / y^{k+1},   y > x > 0,

with λ > 0 and k > 2. Find
• c
• the density of X, its mean and variance
• the conditional density and conditional expectation of Y given X = x
• the mean and the variance of Y
• the covariance of X and Y, as well as the variance of X + Y

17. The joint density of X and Y is given by

f(x, y) = e^{−x/y} e^{−y} / y,   x, y > 0.

Show that IE[X | Y = y] = y.

18. The joint density of X and Y is given by

f(x, y) = e^{−y} / y,   0 < x < y.

Compute IE[X² | Y = y].


19. The conditional variance of X, given the random variable Y, is defined by

var(X|Y) = IE[(X − IE[X|Y])² | Y].

Show that

var(X) = IE[var(X|Y)] + var(IE[X|Y]).

20. Let X_n =ᵈ Bi(n, p) and Z_n = (X_n − np)/√(np(1 − p)). Show that

∀t, lim_n ϕ_{Z_n}(t) = e^{−t²/2}.

21. Show that

∫_{−π}^{+π} cos kt dt = ∫_{−π}^{+π} sin kt dt = 0,   k ∈ Z \ {0}.

Let ξ be an integer-valued random variable and ϕ be its characteristic function. Show that

P[ξ = k] = (1/2π) ∫_{−π}^{+π} e^{−ikt} ϕ(t) dt,   k ∈ Z.

Hint: If ∀t, ω, |X(t, ω)| ≤ 1, then ∫_{−π}^{π} IE[X(t)] dt = IE[∫_{−π}^{π} X(t) dt].

22. Use a characteristic function approach to prove that there is a sequence of independent identically distributed random variables, X_1, X_2, ..., such that

(1/n) Σ_{k=1}^{n} X_k =ᵈ X_1.

23. Let X and Y be two independent identically distributed random variables such that X + Y and X − Y are also independent. For simplicity we assume that IE[X] = 0 and that var(X) = 1. Use a characteristic function approach to first show that X is symmetrical, then that X is normal.
Hint: If lim_n z_n = z, then lim_n (1 + z_n/n)^n = e^z.

24. Let X_n =ᵈ Bi(n, p) and Z_n = (X_n − np)/√(np(1 − p)). Show that

∀t, lim_n ϕ_{Z_n}(t) = e^{−t²/2}.

25. Let X and Y be two independent identically distributed random variables. Can X − Y have a R(−1, 1) distribution?

26. For which values of n and p, is a Bi(n, p) random variable symmetrical?

27. Show that

∫_{−π}^{+π} cos kt dt = ∫_{−π}^{+π} sin kt dt = 0,   k ∈ Z \ {0}.

Let ξ be an integer-valued random variable and ϕ be its characteristic function. Show that

P[ξ = k] = (1/2π) ∫_{−π}^{+π} e^{−ikt} ϕ(t) dt,   k ∈ Z.

Hint: If ∀t, ω, |X(t, ω)| ≤ 1, then ∫_{−π}^{π} IE[X(t)] dt = IE[∫_{−π}^{π} X(t) dt].

28. Use a characteristic function approach to prove that there is a sequence of independent identically distributed random variables, X_1, X_2, ..., such that

(1/n) Σ_{k=1}^{n} X_k =ᵈ X_1.


29. Let X and Y be two independent identically distributed random variables such that X + Y and X − Y are also independent. For simplicity we assume that IE[X] = 0 and that var(X) = 1. Use a characteristic function approach to first show that X is symmetrical, then that X is normal.
Hint: If lim_n z_n = z, then lim_n (1 + z_n/n)^n = e^z.


Appendix A - Operations on Sets

• The Union of two sets A and B is A ∪ B = {ω ∈ Ω; ω ∈ A or ω ∈ B}.
• The Intersection of two sets A and B is A ∩ B = {ω ∈ Ω; ω ∈ A and ω ∈ B}.
• The Complement of a set A is A^c = {ω ∈ Ω; ω ∉ A}.
• The Difference of two sets A and B is A \ B = {ω ∈ Ω; ω ∈ A and ω ∉ B}.
• The Symmetric difference of two sets A and B is AΔB = (A \ B) ∪ (B \ A) = (A ∪ B) \ (B ∩ A) = {ω ∈ Ω; ω ∈ A or ω ∈ B but not both}.

Appendix B - Some Common Random Variables

Bernoulli random variable.
• PMF: p ∈ (0, 1)

p(x) = p if x = 1,   1 − p if x = 0

• mean: IE[X] = p
• variance: var(X) = p(1 − p)
• Indicates when a particular event occurs
• Moment generating function: φ(t) = pe^t + (1 − p)

Binomial random variable.
• PMF: n integer and p ∈ (0, 1)

p(x) = (n choose x) p^x (1 − p)^{n−x},   x = 0, 1, ..., n

• mean: IE[X] = np
• variance: var(X) = np(1 − p)
• Counts the number of successes in n Bernoulli trials
• Moment generating function: φ(t) = (pe^t + (1 − p))^n

Geometric random variable.
• PMF: p ∈ (0, 1)

p(x) = p(1 − p)^{x−1},   x = 1, 2, ...

• mean: IE[X] = 1/p
• variance: var(X) = (1 − p)/p²
• Counts the number of trials to the first success in a Bernoulli sequence
• Moment generating function: φ(t) = pe^t / (1 − (1 − p)e^t)

Negative binomial random variable.
• PMF: r integer and p ∈ (0, 1)

p(x) = (x−1 choose r−1) p^r (1 − p)^{x−r},   x = r, r + 1, ...

• mean: IE[X] = r/p
• variance: var(X) = r(1 − p)/p²
• Counts the number of trials to the rth success in a Bernoulli sequence
• Moment generating function: φ(t) = (pe^t / (1 − (1 − p)e^t))^r

Poisson random variable.
• PMF: λ > 0

p(x) = e^{−λ} λ^x / x!,   x = 0, 1, ...

• mean: IE[X] = λ


• variance: var(X) = λ
• Moment generating function: φ(t) = exp[λ(e^t − 1)]

Uniform random variable.
• density: a < b

f(x) = 1/(b − a),   x ∈ [a, b]

• mean: IE[X] = (a + b)/2
• variance: var(X) = (b − a)²/12
• Moment generating function: φ(t) = (e^{bt} − e^{at}) / ((b − a)t)

Exponential random variable.
• density: λ > 0

f(x) = λe^{−λx},   x > 0

• mean: IE[X] = 1/λ
• variance: var(X) = 1/λ²
• Moment generating function: φ(t) = λ/(λ − t)

Gamma random variable.
• density: λ > 0 and r > 0

f(x) = λ^r x^{r−1} e^{−λx} / Γ(r),   x > 0,

where Γ(r) = ∫_0^{+∞} x^{r−1} e^{−x} dx. Γ(r) = (r − 1)Γ(r − 1) and, for r integer, Γ(r) = (r − 1)!
• mean: IE[X] = r/λ
• variance: var(X) = r/λ²
• Moment generating function: φ(t) = (λ/(λ − t))^r

Normal random variable.
• density: µ and σ > 0

f(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))

• mean: IE[X] = µ
• variance: var(X) = σ²
• Moment generating function: φ(t) = exp[µt + σ²t²/2]


Appendix C - Matrix Algebra and Random Variables

In this section A′ denotes the transpose of A and |A| its determinant.

• |A′| = |A|, |AB| = |A||B| and (AB)^{-1} = B^{-1}A^{-1}.
•  – A is symmetric if A′ = A.
   – A is orthogonal if A′A = AA′ = I.
   – A is non-negative definite if for any x, x′Ax ≥ 0.
   – A is positive definite if for any x ≠ 0, x′Ax > 0.
   – (λ, x) is an eigenvalue-eigenvector pair of A if x ≠ 0 and Ax = λx.

• Let A be a (k × k) symmetric matrix.
   – A has k eigenvalue-eigenvector pairs.
   – The eigenvectors can be chosen to be mutually perpendicular and to have unit norm.
   – The eigenvalues are solutions to the characteristic equation |A − λI| = 0.
   – Spectral Decomposition. A can be expressed in terms of its k eigenvalue-eigenvector pairs (λ_i, e_i) as

     A = Σ_{i=1}^{k} λ_i e_i e′_i.

   – A is positive definite (resp. non-negative definite) iff every eigenvalue of A is positive (resp. non-negative).
   – If P = [e_1 e_2 ... e_k], then P is orthogonal and A = PΛP′, where Λ = diag(λ_1, ..., λ_k).
   – A^{-1} = PΛ^{-1}P′ = Σ_{i=1}^{k} (1/λ_i) e_i e′_i, A^{1/2} = PΛ^{1/2}P′ = Σ_{i=1}^{k} √λ_i e_i e′_i, and A^{-1/2} = PΛ^{-1/2}P′ = Σ_{i=1}^{k} (1/√λ_i) e_i e′_i.

• Let X be a k-dimensional random vector and A be a (p × k) matrix. Then IE[AX] = A IE[X] and cov(AX) = A cov(X) A′.

• If A = ( A_11  A_12 ; A_21  A_22 ) and A_22 is square and non-singular, then

  A^{-1} = ( D^{-1}                   −D^{-1} A_12 A_22^{-1}
             −A_22^{-1} A_21 D^{-1}   A_22^{-1} A_21 D^{-1} A_12 A_22^{-1} + A_22^{-1} ),

  where D = A_11 − A_12 A_22^{-1} A_21.

Appendix D - The Multivariate Normal Distribution

• Let X =ᵈ N_p(µ, Σ).
   – The density of X is

     f_X(x) = 1/((2π)^{p/2} |Σ|^{1/2}) exp(−(1/2)(x − µ)′ Σ^{-1} (x − µ)).

   – If A is (q × p) (q ≤ p), then AX =ᵈ N_q(Aµ, AΣA′).
   – If X =ᵈ N_p(µ, Σ) with |Σ| > 0, then (X − µ)′ Σ^{-1} (X − µ) =ᵈ χ²_p.
• If µ is (p × 1), ν is (q × 1), Σ_11 is (p × p), Σ_22 is (q × q), Σ_12 is (p × q), Σ_21 is (q × p), and

  Z = (X; Y) =ᵈ N_{p+q}((µ; ν), (Σ_11 Σ_12; Σ_21 Σ_22)),

  then

  X | Y = y =ᵈ N_p(µ + Σ_12 Σ_22^{-1}(y − ν), Σ_11 − Σ_12 Σ_22^{-1} Σ_21).


• Let Z_k =ᵈ N_p(0, Σ), k = 1, ..., m, be independent. Then W = Σ_{k=1}^{m} Z_k Z′_k is said to have a Wishart distribution with m degrees of freedom and covariance matrix Σ. We write W =ᵈ W_m(Σ).
• Let X_k =ᵈ N_p(µ_k, Σ), k = 1, ..., n, be independent.
   – Y = Σ_{k=1}^{n} b_k X_k =ᵈ N_p(Σ_{k=1}^{n} b_k µ_k, (Σ_{k=1}^{n} b_k²) Σ).
   – Σ_{k=1}^{n} b_k X_k and Σ_{k=1}^{n} c_k X_k are independent iff b′c = 0.
   – X̄ = (1/n) Σ_{k=1}^{n} X_k =ᵈ N_p(µ, (1/n) Σ).
   – (n − 1)S = Σ_{k=1}^{n} (X_k − X̄)(X_k − X̄)′ =ᵈ W_{n−1}(Σ).
   – X̄ and S are independent.
   – n(X̄ − µ)′ S^{-1} (X̄ − µ) =ᵈ ((n − 1)p/(n − p)) F_{p,n−p}.