Basics of Probability Theory
Stefan Bruder
University of Zurich
September 1, 2015
Textbooks:
Jean Jacod and Philip Protter: Probability Essentials
Albert N. Shiryaev: Probability
Sidney I. Resnick: A Probability Path
Achim Klenke: Probability Theory - A Comprehensive Course
Marc Paolella: Intermediate Probability - A Computational Approach
Marc Paolella: Fundamental Probability - A Computational Approach
Patrick Billingsley: Probability and Measure
Olav Kallenberg: Foundations of Modern Probability.
Overview
1 Probability Space
2 Finite or Countably Infinite Ω
3 Probability Measures on (R,B(R))
4 Random Variables
5 Moments
6 Inequalities
7 Moment Generating Functions
8 Transformations of Random Variables
9 Convergence Concepts
10 Law of Large Numbers
11 Central Limit Theorem
12 Delta Method
Probability Space
Definition (Probability space)
A probability space is a triple (Ω,B,P) where
Ω is the sample space
B is the σ-algebra on Ω
P is a probability measure; that is, P is a set function with domain B and range [0, 1] such that
1 P(A) ≥ 0 ∀A ∈ B
2 P is σ-additive: if An ∈ B are pairwise disjoint events (i.e. Al ∩ Ak = ∅ for l ≠ k), then
P(∪_{n=1}^∞ An) = ∑_{n=1}^∞ P(An)
3 P(Ω) = 1.
The sample space is the set of all possible outcomes. Examples: two successive tosses of a coin, Ω = {hh, tt, ht, th}; the lifetime of a light bulb, Ω = R+; a toss of two dice, Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}.
An event is a subset of Ω with the following properties:
A probability is defined for the subset.
After the experiment has been completed, it is observable whether the event has occurred or not.
If A, B are events, then the contrary event is interpreted as the complement set Ac, the event "A or B" is interpreted as the union A ∪ B, the event "A and B" is interpreted as the intersection A ∩ B, the sure event is Ω and the impossible event is ∅.
The family of all events is denoted by B. B should have the property that if A, B ∈ B, then Ac ∈ B, A ∩ B ∈ B, A ∪ B ∈ B, Ω ∈ B and ∅ ∈ B.
Definition (σ-algebra)
A σ-algebra on Ω is defined as a nonempty collection B of subsets of Ω such that
1 Ω, ∅ ∈ B.
2 If A ∈ B then Ac ∈ B.
3 If A1, A2, . . . ∈ B then ∪_{i=1}^∞ Ai ∈ B.
A σ-algebra is closed under complements (2) and countable unions (3).
By De Morgan's law, a σ-algebra is also closed under countable intersections.
The smallest σ-algebra is {∅, Ω} and the largest σ-algebra is the power set 2^Ω.
Intuitively one would always choose B = 2^Ω, but it turns out that, depending on Ω, the power set can be too big: for example Ω = R.
Definition
If C ⊂ 2^Ω, the σ-algebra generated by C, written σ(C), is the smallest σ-algebra containing C.
Definition
Suppose Ω = R. Let C = {(a, b] : −∞ ≤ a < b < ∞}. Then the Borel σ-algebra on R is defined by
B(R) = σ(C)
The elements of B(R) are called Borel sets.
There are many equivalent ways to generate B(R). The Borel σ-algebra can be generated with any kind of interval: open, closed, semi-open, finite, and semi-infinite.
Generalization for R^d: let C = {∏_{i=1}^d (ai, bi] : −∞ ≤ ai < bi < ∞}; then B(R^d) = σ(C).
Example
Let Ω = {1, 2, 3, 4, 5, 6} and C = {{2, 4}, {6}}. Then,
σ(C) = {∅, {2, 4}, {6}, {1, 3, 5, 6}, {1, 2, 3, 4, 5}, {2, 4, 6}, {1, 3, 5}, Ω}
We can check the three conditions:
Ω, ∅ ∈ σ(C) ✓
If A ∈ σ(C) then Ac ∈ σ(C) ✓
If A1, A2, . . . ∈ σ(C) then ∪_{i=1}^∞ Ai ∈ σ(C) ✓
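For a finite Ω, σ(C) can be computed by brute force: close the generating collection under complements and pairwise unions until nothing new appears. A minimal Python sketch of this idea (the function name is ours, not from the slides):

```python
from itertools import combinations

def generated_sigma_algebra(omega, generators):
    """Brute-force sigma(C) on a finite omega: close the collection
    under complements and pairwise unions until a fixed point."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        new = {omega - a for a in sets}
        new |= {a | b for a, b in combinations(sets, 2)}
        if new <= sets:
            return sets
        sets |= new

sigma = generated_sigma_algebra({1, 2, 3, 4, 5, 6}, [{2, 4}, {6}])
print(len(sigma))                       # 8 sets, as in the example
print(sorted(sorted(s) for s in sigma))
```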
Exercise
Show that a σ-algebra is closed under countable intersections, i.e. if A1, A2, . . . ∈ B, then ∩_{i=1}^∞ Ai ∈ B.
Exercise
Define the following two generators of the Borel σ-algebra on R:
C() = {(a, b) : −∞ ≤ a < b ≤ ∞}
C(] = {(a, b] : −∞ ≤ a < b < ∞}.
1 Show that σ(C()) ⊂ σ(C(]). Use the fact that (a, b) = ∪_{n=1}^∞ (a, b − 1/n].
2 Show that σ(C(]) ⊂ σ(C()). Use the fact that (a, b] = ∩_{n=1}^∞ (a, b + 1/n).
The definition of a probability measure already has several important implications:
Proposition
1 P(Ac) = 1− P(A)
2 P(∅) = 0
3 For A,B ∈ B, P(A ∪ B) = P(A) + P(B)− P(A ∩ B)
4 For A,B ∈ B, if A ⊂ B then P(A) ≤ P(B)
5 For An ∈ B, P(∪_{n=1}^∞ An) ≤ ∑_{n=1}^∞ P(An)
6 For An ∈ B:
If An ↑ A, then lim_{n→∞} P(An) = P(A)
If An ↓ A, then lim_{n→∞} P(An) = P(A)
(4) is the monotonicity property, (5) is the σ-subadditivity property and (6) is the continuity property of a probability measure.
Proof of (1), (3), and (6a).
Proof of (1).
Ω = A ∪ Ac . Then by the σ-additivity of P and A ∩ Ac = ∅
P(Ω) = P(A ∪ Ac) (1)
= P(A) + P(Ac) (2)
⇒ P(Ac) = 1− P(A).
Proof of (3).
First note that A = A ∩ Ω = A ∩ (B ∪ Bc) = (A ∩ B) ∪ (A ∩ Bc). Thus, P(A) = P(A ∩ B) + P(A ∩ Bc) because (A ∩ B) and (A ∩ Bc) are disjoint.
P(A ∪ B) = P((A ∩ Bc) ∪ (A ∩ B) ∪ (Ac ∩ B)) (3)
= P(A ∩ Bc) + P(A ∩ B) + P(Ac ∩ B) (4)
= P(A)− P(A ∩ B) + P(A ∩ B) + P(B)− P(A ∩ B) (5)
= P(A) + P(B)− P(A ∩ B) (6)
Proof of (6).
Let An be an increasing sequence of events, i.e. An ⊂ An+1 ⊂ . . .; then
lim_{n→∞} An = ∪_{n=1}^∞ An =: A.
Define B1 = A1, B2 = A2 \ A1, . . . , Bn = An \ An−1. Then (Bn) is a disjoint sequence of events with
An = ∪_{i=1}^n Bi and ∪_{i=1}^∞ Bi = ∪_{i=1}^∞ Ai = A
P(A) = P(lim_{n→∞} An)   (7)
= P(∪_{i=1}^∞ Ai)   (8)
= P(∪_{i=1}^∞ Bi)   (9)
Proof of (6) cont’d.
= ∑_{i=1}^∞ P(Bi)   (10)
= lim_{n→∞} ∑_{i=1}^n P(Bi)   (11)
= lim_{n→∞} P(∪_{i=1}^n Bi)   (12)
= lim_{n→∞} P(An)   (13)
The proof of lim_{n→∞} P(An) = P(A) if An ↓ A is conceptually similar.
Definition (Conditional Probability)
Let A, B ∈ B and P(B) > 0. The conditional probability of A given B is
P(A|B) = P(A ∩ B)/P(B).
Proposition
Given P(B) > 0, the conditional probability Q(·) = P(·|B) is a probability measure on (Ω, B).
Proof.
Define Q(A) = P(A|B) ∀A ∈ B (with B fixed).
Q(Ω) = P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1
Proof.
Let (An)n≥1 be a sequence of pairwise disjoint events; then
Q(∪_{n=1}^∞ An) = P((∪_{n=1}^∞ An) ∩ B)/P(B)   (14)
= P(∪_{n=1}^∞ (An ∩ B))/P(B)   (15)
= ∑_{n=1}^∞ P(An ∩ B)/P(B)   (16)
= ∑_{n=1}^∞ P(An|B)   (17)
= ∑_{n=1}^∞ Q(An)   (18)
Definition (Independence of Events)
Two events A and B are independent if P(A ∩ B) = P(A)P(B).
A (possibly infinite) collection of events (Ai)i∈I is an independent collection if for every finite subset J ⊂ I it holds that P(∩_{i∈J} Ai) = ∏_{i∈J} P(Ai).
Independence of events (Ai)i∈I implies pairwise independence, but the converse is not true! Example: Ω = {1, 2, 3, 4}, P({i}) = 1/4 ∀i ∈ {1, 2, 3, 4}, A = {1, 2}, B = {1, 3}, C = {2, 3}. Then A, B, C are pairwise independent but not independent.
A and B are independent if and only if P(A|B) = P(A) (for P(B) > 0).
Definition
A collection of events (En) is called a partition of Ω if En ∈ B, En ∩ Em = ∅ for m ≠ n, P(En) > 0 and ∪n En = Ω.
Theorem (Law of Total Probability)
Let (En)n≥1 be a finite or countable partition of Ω. Then if A ∈ B,
P(A) = ∑_n P(A|En)P(En)
Theorem (Bayes' Theorem)
Let (En)n≥1 be a finite or countable partition of Ω, and suppose P(A) > 0. Then
P(En|A) = P(A|En)P(En) / ∑_m P(A|Em)P(Em)
Famous applications of Bayes’ theorem:
HIV test
Monty Hall problem
. . .
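A hedged numerical illustration in the spirit of the HIV-test application; the prevalence, sensitivity and specificity values below are assumptions chosen for illustration, not values from the slides:

```python
# Bayes' theorem for a diagnostic test over the partition {infected, not infected}.
prevalence  = 0.001   # P(E1): person is infected (assumed)
sensitivity = 0.99    # P(A | E1): test positive given infected (assumed)
specificity = 0.99    # P(Ac | E2): test negative given not infected (assumed)

# Law of total probability:
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: posterior probability of infection given a positive test
print(sensitivity * prevalence / p_positive)  # ~0.09, small despite an accurate test
```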
Proof of the Law of Total Probability.
Note that A = (A ∩ Ω) = (A ∩ (∪n En)) = ∪n (A ∩ En), where ((A ∩ En))n≥1 is a family of pairwise disjoint sets. Thus,
P(A) = P(∪n (A ∩ En)) = ∑_n P(A ∩ En) = ∑_n P(A|En)P(En)
The proof of Bayes' Theorem is straightforward once we have proven the law of total probability:
Proof of Bayes' Theorem.
P(En|A) = P(En ∩ A)/P(A) = P(A|En)P(En) / ∑_m P(A|Em)P(Em)
Proposition
Let P1, P2, . . . be probability measures on the same measurable space (Ω, B). Define
P(A) = ∑_{i=1}^∞ λi Pi(A) ∀A ∈ B,   (19)
where λi ≥ 0 and ∑_{i=1}^∞ λi = 1. Then P, as defined in (19), is a probability measure on (Ω, B).
Exercise
Show that P(A ∩ Bc) = P(A)− P(A ∩ B).
Exercise
Suppose that P(C) > 0. Show that P(A ∪ B|C) = P(A|C) + P(B|C) − P(A ∩ B|C).
Exercise
Suppose that P(C) > 0 and A1, . . . , An are pairwise disjoint. Show that P(∪_{i=1}^n Ai|C) = ∑_{i=1}^n P(Ai|C).
Exercise
Show that if events A and B are independent, then so are A and Bc, and Ac and Bc.
Exercise
Show that the mixture probability measure defined in (19) is a probability measure on (Ω, B).
Finite or Countably Infinite Ω
Assume that Ω is finite or countably infinite, for example Ω = {1, 2, 3, 4, 5, 6} or Ω = N.
In this case it is possible to take the power set as the σ-algebra, i.e. A = 2^Ω.
With a finite or countably infinite Ω the construction of a probability measure is straightforward. Intuitively, one simply needs to determine the probability of every element of Ω, that is P({ω}) for all ω ∈ Ω.
If Ω is not countable (for example Ω = [0, 1]), then A = 2^Ω is no longer possible. One has to work with a σ-algebra that is "smaller" than 2^Ω.
For the general case the construction of a probability measure is much more demanding and requires measure theory.
Theorem
1 A probability measure on the finite or countable set Ω is characterized by its values on the atoms {ω}:
P({ω}) = pω, ω ∈ Ω.
2 Let (pω)ω∈Ω be a family of real numbers indexed by the finite or countable set Ω. Then there exists a unique probability measure P such that P({ω}) = pω if and only if pω ≥ 0 and ∑_{ω∈Ω} pω = 1.
Proof of (1).
Let A ∈ 2^Ω; then A = ∪_{ω∈A} {ω} is a finite or countable union of pairwise disjoint singletons. Then,
P(A) = P(∪_{ω∈A} {ω}) = ∑_{ω∈A} P({ω}) = ∑_{ω∈A} pω
Proof of (2).
⇒: If P({ω}) = pω, then pω ≥ 0 by definition. Additionally we have
1 = P(Ω) = P(∪_{ω∈Ω} {ω}) = ∑_{ω∈Ω} P({ω}) = ∑_{ω∈Ω} pω.
⇐: Suppose (pω)ω∈Ω satisfies pω ≥ 0 and ∑_{ω∈Ω} pω = 1. Define P by P(A) = ∑_{ω∈A} pω. Then,
P(∅) = 0 and P(Ω) = ∑_{ω∈Ω} pω = 1.
Countable additivity is trivial when Ω is finite; when Ω is countable it holds that
∑_{i∈I} ∑_{ω∈Ai} pω = ∑_{ω∈∪i Ai} pω.
Example
Let Ω be a finite set and let A = 2^Ω. The uniform probability measure on Ω is
P(A) = card A / card Ω  ∀A ∈ 2^Ω.
P(A) ≥ 0 ∀A ∈ 2^Ω follows directly from the definition of cardinality.
P(Ω) = card Ω / card Ω = 1
Let Ai ∈ 2^Ω be disjoint events; then
P(∪_{i=1}^∞ Ai) = card(∪_{i=1}^∞ Ai)/card Ω = (∑_{i=1}^∞ card Ai)/card Ω = ∑_{i=1}^∞ card Ai/card Ω = ∑_{i=1}^∞ P(Ai)
On a given finite set Ω the uniform probability measure is unique.
Example
Let a probability space be given by (N, 2^N, P). A probability measure P is defined by its atomistic values (parametrized with λ > 0)
pn = e^{−λ} λ^n / n!,  n ∈ N.
pn ≥ 0 and ∑_n pn = e^{−λ} ∑_{n=0}^∞ λ^n/n! = e^{−λ} e^{λ} = 1.
Example
Let a probability space be given by (N+, 2^{N+}, P). A probability measure P is defined by its atomistic values (parametrized with α ∈ [0, 1))
pn = (1 − α)α^{n−1},  n ∈ N+.
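Both families of atoms can be checked numerically; a small sketch (the values of λ and α are arbitrary illustrative choices, and the infinite sums are truncated):

```python
import math

lam, alpha = 2.0, 0.75  # assumed parameter values for illustration

poisson = [math.exp(-lam) * lam**n / math.factorial(n) for n in range(200)]
geometric = [(1 - alpha) * alpha ** (n - 1) for n in range(1, 200)]

print(sum(poisson))    # ≈ 1.0 (truncated series)
print(sum(geometric))  # ≈ 1.0 (truncated series)
```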
Exercise
Let a probability space be given by (N+, 2^{N+}, P). Let two probability measures be given by
P1({ω}) = 1/((e − 1) ω!) and P2({ω}) = (1/3)(3/4)^ω
Show that both uniquely define probability measures on (N+, 2^{N+}).
Show that both probability measures are σ-additive, with P(N+) = 1 and P(A) ≥ 0 ∀A ∈ 2^{N+}.
Exercise
Show that there is no uniform probability measure on (N, 2^N), that is, there is no probability measure P : 2^N → [0, 1] such that
P({i}) = P({j}) ∀i, j ∈ N.
Probability Measures on (R, B(R))
Assume that Ω = R and B is the Borel σ-algebra of R. Thus, the probability space is given by (R, B(R), P).
Probability measures on (R, B(R)) are very important for the analysis of random variables.
Definition (Distribution Function)
The distribution function F : R → [0, 1] of P is the function
F(x) = P((−∞, x]), ∀x ∈ R
Proposition
The distribution function F(x) has the following properties:
1 F is monotone non-decreasing
2 F is right continuous
3 lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
Proof of (1).
For x < y it holds that
(−∞, x ] ⊂ (−∞, y ].
Using the monotonicity property of a probability measure
F (x) = P((−∞, x ]) ≤ P((−∞, y ]) = F (y).
Thus, F (x) ≤ F (y) for x < y .
Right-continuity means that if xn ↓ x then lim_{xn↓x} F(xn) = F(x).
Proof of (2).
Take a decreasing sequence xn such that xn ↓ x as n → ∞, for example xn = x + 1/n. The sequence of events (−∞, xn], n ≥ 1, is also a decreasing sequence. Then,
lim_{xn↓x} F(xn) = lim_{n→∞} F(xn)   (20)
= lim_{n→∞} P((−∞, xn])   (21)
= P(lim_{n→∞} (−∞, xn])   (22)
= P(∩_{n=1}^∞ (−∞, xn])   (23)
= P((−∞, x])   (24)
= F(x)   (25)
Proof of (3).
Take an increasing sequence xn such that xn ↑ ∞ as n → ∞. The sequence of events (−∞, xn], n ≥ 1, is an increasing sequence of events. Then,
F(∞) = lim_{x→∞} P((−∞, x])   (26)
= lim_{n→∞} P((−∞, xn])   (27)
= P(lim_{n→∞} (−∞, xn])   (28)
= P(∪_{n=1}^∞ (−∞, xn])   (29)
= P(R)   (30)
= P(Ω)   (31)
= 1   (32)
Proof of (3) cont’d.
Take a decreasing sequence xn such that xn ↓ −∞ as n → ∞. The sequence of events (−∞, xn], n ≥ 1, is a decreasing sequence of events. Then,
F(−∞) = lim_{x→−∞} P((−∞, x])   (33)
= lim_{n→∞} P((−∞, xn])   (34)
= P(lim_{n→∞} (−∞, xn])   (35)
= P(∩_{n=1}^∞ (−∞, xn])   (36)
= P(∅)   (37)
= 0   (38)
Starting from the probability measure P we get its distribution function by the definition of F. Thus, P = Q ⇒ FP = FQ.
It can be shown that the converse is also true: FP = FQ ⇒ P = Q. This implies that the complete probability measure P is known if we know its distribution function F, and therefore the probability of any given set A ∈ B(R) can be determined from F.
It can also be shown that any function F : R → [0, 1] which is monotonically non-decreasing, right-continuous, and satisfies lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1 is the distribution function of a uniquely determined probability measure on (R, B(R)).
Example:
F(x) = 0, if x < 0
F(x) = (1/8)x³, if x ∈ [0, 2)
F(x) = 1, if x ≥ 2
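A quick check of the example F above: probabilities of half-open intervals follow from P((a, b]) = F(b) − F(a). A minimal sketch:

```python
def F(x: float) -> float:
    """Distribution function from the example above."""
    if x < 0:
        return 0.0
    if x < 2:
        return x**3 / 8
    return 1.0

def prob_interval(a: float, b: float) -> float:
    """P((a, b]) for the measure with distribution function F."""
    return F(b) - F(a)

print(prob_interval(0, 1))   # 1/8
print(prob_interval(1, 2))   # 7/8
print(prob_interval(-5, 0))  # 0
```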
Example (Dirac probability measure)
Let c ∈ R. A point mass at c satisfies
P(A) = 1, if c ∈ A; 0, otherwise.
The distribution function of P is given by
FP(x) = P((−∞, x]) = 0, if x < c; 1, if x ≥ c.
Indicator function: 1A : X → {0, 1}
Previous example: FP(x) = 1_{[c,∞)}(x)
Example (Lebesgue probability measure)
If a, b ∈ R, then define m((a, b]) = b − a (Lebesgue measure on (R, B(R))). Further define (for fixed a, b)
m_{a,b}(A) = m(A ∩ (a, b]) / (b − a)  ∀A ∈ B(R).
Then m_{a,b}(A) is a probability measure on (R, B(R)) and its distribution function is given by
F_{a,b}(x) = m_{a,b}((−∞, x]) = 0, if x < a; (x − a)/(b − a), if a ≤ x < b; 1, if b ≤ x.
The distribution function gives the probability of intervals of the form (−∞, x] for all x ∈ R.
How is the probability of intervals of the form (x, y], [x, y], [x, y), (x, y) or {x} for x < y determined?
Proposition
Let F be the distribution function of P on R and let F(x−) denote the left limit of F at x. Then for all x < y:
1 P((x , y ]) = F (y)− F (x)
2 P([x , y ]) = F (y)− F (x−)
3 P([x , y)) = F (y−)− F (x−)
4 P((x , y)) = F (y−)− F (x)
5 P({x}) = F(x) − F(x−)
Next, we will prove (1) and (2). (3), (4) and (5) are proven in a similar way.
Proof of (1).
P((x , y ]) = P (((−∞, x ] ∪ (y ,+∞))c) (39)
= 1− P((−∞, x ] ∪ (y ,+∞)) (40)
= 1− [P((−∞, x ]) + P((y ,+∞))] (41)
= 1− [F (x) + (1− P((−∞, y ]))] (42)
= F (y)− F (x) (43)
Proof of (2).
Define an increasing sequence xn such that xn ↑ x, for example xn = x − 1/n. Then, using (1) yields
P((xn, y]) = F(y) − F(xn).
Proof of (2) cont’d.
First, consider the lhs:
lim_{n→∞} P((xn, y]) = P(lim_{n→∞} (xn, y])   (44)
= P(∩_{n=1}^∞ (xn, y])   (45)
= P([x, y])   (46)
The rhs converges to F(y) − F(x−) by the definition of the left limit of F. Thus, P([x, y]) = F(y) − F(x−).
Proposition
Let P be a probability measure on (R, B(R)) and let F be the corresponding distribution function. Then,
P({x}) = 0 ∀x ∈ R ⇔ F is a continuous function.
Proof.
"⇒": Suppose P({x}) = 0 ∀x ∈ R. P({x}) = F(x) − F(x−) = 0 implies that F(x) = F(x−), which means that F is left-continuous at x. Since F is a distribution function it is right-continuous, i.e. F(x) = F(x+). Thus, F(x) = F(x−) = F(x+) ∀x ∈ R.
"⇐": Suppose F is continuous on R. This implies that F(x) = F(x−) = F(x+) ∀x ∈ R. Thus, P({x}) = F(x) − F(x−) = F(x) − F(x) = 0.
Exercise
Let c, d ∈ R (c < d) and let εc and εd denote point masses at c and d, respectively. Define
P = (1/3)εc + (2/3)εd
Find the distribution function of P.
Exercise
Suppose a distribution function is given by
F(x) = (1/4)1_{[0,∞)}(x) + (1/2)1_{[1,∞)}(x) + (1/4)1_{[2,∞)}(x).
Let P be given by P((−∞, x]) = F(x).
Find the probabilities of the following events:
A = (−1/2, 1/2)
B = (−1/2, 3/2)
C = (−2/3, 5/2)
D = [0, 2)
E = (3, ∞).
Exercise
Suppose a distribution function is given by
F(x) = ∑_{i=1}^∞ (1/2^i) 1_{[1/i,∞)}(x).
Let P be given by P((−∞, x]) = F(x).
Find the probabilities of the following events:
A = [1, ∞)
B = [1/10, ∞)
C = {0}
D = [0, 1/2)
E = (−∞, 0)
F = (0, ∞).
Definition (Discrete probability measures)
A discrete probability measure is a probability measure for which the corresponding distribution function is piecewise constant.
Discrete probability measures are countable convex combinations of point masses.
The measure is concentrated on an at most countable set.
The distribution function increases only with "jumps": ∆F(xk) = F(xk) − F(xk−) = P({xk}).
Example
Let P = (1/9)ε1 + (8/9)ε0. Then the distribution function is given by
F(x) = (8/9)1_{[0,∞)}(x) + (1/9)1_{[1,∞)}(x)
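A minimal sketch of how such a piecewise-constant distribution function can be built and evaluated, using the example P = (1/9)ε1 + (8/9)ε0 (the helper name is ours):

```python
def make_cdf(atoms):
    """atoms: list of (location, weight) pairs with weights summing to 1.
    Returns the distribution function of the discrete measure."""
    def F(x):
        return sum(w for c, w in atoms if x >= c)
    return F

F = make_cdf([(0, 8/9), (1, 1/9)])
print(F(-0.5), F(0), F(0.5), F(1))  # 0.0  0.888...  0.888...  1.0
```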
Definition (Absolutely continuous probability measures)
An absolutely continuous probability measure is a measure for which the corresponding distribution function is of the form
F(x) = P((−∞, x]) = ∫_{−∞}^x f(t) dt,
where f(t) is a nonnegative function.
The function f : R → R+ is called the density of the distribution.
Generalization: integral in the sense of Lebesgue instead of Riemann.
Not every distribution function admits a density. Example: point mass.
Every non-negative function f : R → R+ that is Riemann (Lebesgue) integrable and such that ∫_{−∞}^∞ f(x) dx = 1 defines a distribution function via the above definition.
Example
Let a probability space be given by (R, B(R), P). An absolutely continuous probability measure is characterized by the following distribution function (λ, k ∈ R++):
FP(x) = P((−∞, x]) = ∫_{−∞}^x (k/λ)(t/λ)^{k−1} e^{−(t/λ)^k} 1_{[0,∞)}(t) dt  ∀x ∈ R.
Let B = (1, 10] ∈ B(R). P(B) = P((1, 10]) = FP(10) − FP(1) = 0.367834
Example
Let a probability space be given by (R, B(R), P). An absolutely continuous probability measure is characterized by the following distribution function
FP(x) = P((−∞, x]) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt  ∀x ∈ R
Let B = (1, 10] ∈ B(R). P(B) = P((1, 10]) = FP(10) − FP(1) = 0.158655
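Both interval probabilities can be checked numerically, e.g. with scipy.stats. The first density is that of a Weibull distribution; the quoted value 0.367834 matches the parameter choice k = λ = 1, which is our assumption since the slide leaves them unspecified:

```python
from scipy.stats import weibull_min, norm

# Weibull example (assumed shape k = 1, scale lam = 1)
k, lam = 1.0, 1.0
print(weibull_min.cdf(10, k, scale=lam) - weibull_min.cdf(1, k, scale=lam))
# 0.36783...

# Standard normal example
print(norm.cdf(10) - norm.cdf(1))  # 0.15865...
```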
More dimensions: probability space (R^d, B(R^d), P).
Definition
The distribution function F : R^d → [0, 1] of a probability measure on (R^d, B(R^d)) is the function
F(x1, . . . , xd) = P(∏_{i=1}^d (−∞, xi]),  ∀x = (x1, . . . , xd)′ ∈ R^d
Absolutely continuous measure:
F(x1, . . . , xd) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xd} f(t1, . . . , td) dt1 · · · dtd,
where f(t1, . . . , td) is a non-negative function that integrates to one.
Define ∆_{ai,bi} F(x1, . . . , xd) = F(x1, . . . , x_{i−1}, bi, x_{i+1}, . . .) − F(x1, . . . , x_{i−1}, ai, x_{i+1}, . . .).
P(∏_{i=1}^d (ai, bi]) = ∆_{a1,b1} · · · ∆_{ad,bd} F(x1, . . . , xd).
Example
An absolutely continuous probability measure on (R², B(R²)) is given by the following distribution function
F(x1, x2) = P(∏_{i=1}^2 (−∞, xi]) = ∫_{−∞}^{x1} ∫_{−∞}^{x2} 1_{[0,1]}(t1) 1_{[0,1]}(t2) dt1 dt2.
The integral can be simplified to the following expression:
F(x1, x2) = 0, if x1 ∨ x2 < 0
F(x1, x2) = x1 x2, if x1 ∧ x2 ∈ [0, 1)
F(x1, x2) = x1, if x1 ∈ [0, 1) ∧ x2 ≥ 1
F(x1, x2) = x2, if x2 ∈ [0, 1) ∧ x1 ≥ 1
F(x1, x2) = 1, if x1 ∧ x2 ≥ 1.
Let B = (0.25, 0.5] × (0.25, 0.5]. Then,
P(B) = F(0.5, 0.5) − F(0.25, 0.5) − F(0.5, 0.25) + F(0.25, 0.25) = 1/16.
Exercise
Derive an expression for the distribution function of a uniform distribution on [a, b] ⊂ R.
Exercise
Suppose P is an absolutely continuous measure on (R, B(R)) with density
fP(x) = (1/(2σ)) exp(−|x − µ|/σ)  ∀x ∈ R, µ ∈ R, σ ∈ R++
Show that ∫_{−∞}^∞ fP(x) dx = 1.
Derive an expression for FP(x).
Show that FP(x) has the properties of a distribution function.
Random Variables
Definition
Let X : Ω → R be a function. The inverse image of B ⊂ R is the set
X−1(B) = {ω ∈ Ω : X(ω) ∈ B} ⊂ Ω
Let (Ω, A, P) be a probability space.
Definition
A random variable on (Ω, A, P) is a function X : Ω → R such that
X−1(B) ∈ A  ∀B ∈ B(R).
The measurability condition states that the inverse image is a measurable subset of Ω, i.e. X−1(B) ∈ A. This is essential since probabilities are defined only on A.
A random vector is a function X : Ω → R^d where each component function Xi : Ω → R is a random variable. This means that each component of a random vector is a random variable.
Example
The simplest example of a random variable is the indicator random variable 1A : Ω → R defined by
1A(ω) = 1, if ω ∈ A; 0, if ω ∉ A.
Let us check the measurability condition for every B ∈ B(R):
1A^{−1}(B) = ∅, if 0 ∉ B ∧ 1 ∉ B
1A^{−1}(B) = Ac, if 0 ∈ B ∧ 1 ∉ B
1A^{−1}(B) = A, if 0 ∉ B ∧ 1 ∈ B
1A^{−1}(B) = Ω, if 0 ∈ B ∧ 1 ∈ B.
Thus, the measurability condition is satisfied.
The definition of a random variable states that a function X : Ω → R is a random variable if X is A-measurable, i.e. ∀B ∈ B(R): X−1(B) ∈ A.
Given this definition, it seems that in order to establish that a function X : Ω → R is a random variable it is necessary to check the measurability condition for all B ∈ B(R).
Fortunately, there is a simplified criterion.
Proposition
A function X : Ω → R is a random variable if and only if
X−1((−∞, t]) ∈ A  ∀t ∈ R.
Example
Let (Ω, A, P) be a probability space. Define X : Ω → R by
X(ω) = c  ∀ω ∈ Ω, c ∈ R.
Check the measurability condition:
∀B ∈ B(R): X−1(B) = Ω, if c ∈ B; ∅, if c ∉ B.
Check the simplified criterion:
∀t ∈ R: X−1((−∞, t]) = ∅, if t < c; Ω, if t ≥ c.
∅ ∈ A and Ω ∈ A ⇒ X is a random variable.
Let X be a random variable defined on (Ω, A, P) with values in (R, B(R)).
Definition
The distribution of X, PX : B(R) → [0, 1], is defined by
PX(B) = P(X−1(B)) = P({ω : X(ω) ∈ B})  ∀B ∈ B(R).
Definition
The distribution function of X, FX : R → [0, 1], is defined by
FX(t) = PX((−∞, t]) = P(X−1((−∞, t]))  ∀t ∈ R.
Notation: write P(X ≤ t) instead of P(X−1((−∞, t])) = P({ω ∈ Ω : X(ω) ≤ t}).
FX(t) has the following properties: monotonically non-decreasing, right-continuous, lim_{x→−∞} FX(x) = 0, and lim_{x→∞} FX(x) = 1.
Proposition
PX is a probability measure on (R,B(R)).
Proof.
∀B ∈ B(R): PX(B) = P(X−1(B)) ≥ 0, since X−1(B) ∈ A.
PX(R) = P(X−1(R)) = P(Ω) = 1, since X : Ω → R.
Let Bn ∈ B(R) be pairwise disjoint. Then,
PX(∪_{n=1}^∞ Bn) = P(X−1(∪_{n=1}^∞ Bn))   (47)
= P(∪_{n=1}^∞ X−1(Bn))   (48)
= ∑_{n=1}^∞ P(X−1(Bn))   (49)
= ∑_{n=1}^∞ PX(Bn)   (50)
Example
Consider the experiment of throwing two dice. The sample space is given by Ω = {(i, j) : 1 ≤ i, j ≤ 6}. Define X : Ω → {2, 3, 4, . . . , 12} by
X((i, j)) = i + j  ∀(i, j) ∈ Ω.
Then, for example,
{X = 4} = X−1({4}) = {(1, 3), (3, 1), (2, 2)} ⊂ Ω
The induced probability measure is given by
PX({X = i}) = P(X−1({i})).
For example,
PX(X ∈ {2, 3}) = P({(1, 1), (1, 2), (2, 1)}) = 3/36.
Proposition
Let X : Ω → R and Y : Ω → R be random variables. Then Z = f(X, Y) is also a random variable for the following functions of X and Y:
Z = cX + dY, ∀c, d ∈ R
Z = XY
Z = X/Y, if Y ≠ 0
Z = max{X, Y}
Z = min{X, Y}
Thus, the set of random variables is closed under
addition
multiplication
division
maximum
minimum
scalar multiplication.
Definition
A family (Xi)i∈I of random variables is called identically distributed if
P_{Xi} = P_{Xj}  ∀i, j ∈ I.
If PX = PY, then this is denoted by X =d Y.
Definition
The support of a real-valued random variable X, denoted by Supp(X), is the smallest closed set C such that P(X ∈ C) = 1.
Example
Let X ∼ N(0, 1); then Supp(X) = R.
Let X ∼ U(a, b); then Supp(X) = [a, b].
Let X ∼ Bern(p); then Supp(X) = {0, 1}.
Random d-vectors: X : Ω → R^d.
Definition
The distribution function of X, FX : R^d → [0, 1], is defined by
FX(x) = PX(∏_{i=1}^d (−∞, xi]) = P(X−1(∏_{i=1}^d (−∞, xi]))  ∀x ∈ R^d
Note that
X−1(∏_{i=1}^d (−∞, xi]) = ∩_{i=1}^d Xi−1((−∞, xi]) = ∩_{i=1}^d {Xi ≤ xi}.
Absolutely continuous random vector with density fX(x):
FX(x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xd} fX(t1, . . . , td) dt1 · · · dtd
Proposition
Let X be a d-dimensional random vector; then the marginal distribution of Xi is given by
F_{Xi}(xi) = lim_{xj→∞, j≠i} FX(x1, . . . , xd)  ∀xi ∈ R, ∀i ∈ {1, . . . , d}.
Proposition
Let fX(x) be the density of an absolutely continuous d-dimensional random vector; then the marginal density is given by
f_{Xi}(xi) = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ fX(x1, . . . , xd) dx1 · · · dx_{i−1} dx_{i+1} · · · dxd  ∀i ∈ {1, . . . , d}.
Example
Let X be a 2-dimensional random vector with fX given by
fX(x1, x2) = (1/π) 1_{[−1,1]}(x1) 1_{[−√(1−x1²), +√(1−x1²)]}(x2).
The marginal density of X1 is then given by
f_{X1}(x1) = ∫_{−∞}^∞ (1/π) 1_{[−√(1−x1²), +√(1−x1²)]}(x2) 1_{[−1,1]}(x1) dx2   (51)
= ∫_{−√(1−x1²)}^{+√(1−x1²)} (1/π) 1_{[−1,1]}(x1) dx2   (52)
= (2/π) √(1 − x1²) 1_{[−1,1]}(x1).   (53)
Exercise
Let Ω = {1, 2, 3, 4}, A = 2^Ω and P = U_{{1,2,3,4}}. Define X by
X(ω) = 3, if ω ∈ {1, 2, 3}; 7, if ω ∈ {4}.
Is X a random variable?
Characterize the induced probability measure PX.
Exercise
Let (X, Y) be a random point chosen uniformly from the region
R ≡ {(x, y) : |x| + |y| ≤ 1} ⊂ R².
Derive an expression for fX(x) and fY(y).
Exercise
Give an example to show that X =d Y does not imply X = Y.
Intuition for independence: the information provided by any individual random variable should not affect the behavior of the other random variables in the family.
The definition of independence of random variables is abstract:
Definition
The family (Xi)i∈I of random variables is called independent if the family (σ(Xi))i∈I of induced σ-algebras is independent.
Induced σ-algebra of a random variable X:
σ(X) = {X−1(B) : B ∈ B(R)}
Two σ-algebras A and B are said to be independent if
P(A ∩ B) = P(A)P(B)  ∀A ∈ A, ∀B ∈ B
Need for more accessible condition(s).
Proposition
A family (Xi)i∈I of random variables is independent if and only if, for every J ⊂ I and every x ∈ R^J,
F_J(x) = ∏_{j∈J} F_j(xj),
where F_J : R^J → [0, 1] is defined by F_J(x) = P(Xj ≤ xj ∀j ∈ J).
This is a general result, since J is an arbitrary index set. With finitely many random variables, the above result reduces to the familiar necessary and sufficient condition for independence:
Proposition
The random variables (X1, . . . , Xn) are independent if and only if
F_{(X1,...,Xn)}(x) = ∏_{i=1}^n F_{Xi}(xi)  ∀x ∈ R^n
There are alternative (necessary and sufficient) criteria for absolutely continuous and discrete random variables:
Proposition
The discrete random variables X1, . . . , Xn with countable support S are independent if and only if
P(X1 = x1, . . . , Xn = xn) = ∏_{i=1}^n P(Xi = xi)  ∀xi ∈ S, i = 1, . . . , n
Proposition
Let X be an absolutely continuous, n-dimensional random vector with density fX(x); then X1, . . . , Xn are independent if and only if
fX(x) = ∏_{i=1}^n f_{Xi}(xi)  ∀x ∈ R^n
Example
Let X1, . . . , Xn be independent exponentially distributed random variables with parameters λi > 0. The joint distribution function is then given by
F_{(X1,...,Xn)}(x1, . . . , xn) = ∏_{i=1}^n F_{Xi}(xi) = ∏_{i=1}^n (1 − e^{−λi xi}).
Define Z = min{X1, . . . , Xn}; then
FZ(z) = P(Z ≤ z)   (54)
= 1 − P(min{X1, . . . , Xn} > z)   (55)
= 1 − P(Xi > z ∀i = 1, . . . , n)   (56)
= 1 − P(∩_{i=1}^n {Xi > z})   (57)
= 1 − P(X1 > z) × · · · × P(Xn > z)   (58)
= 1 − ∏_{i=1}^n e^{−λi z}   (59)
Proposition
1 Let Xi i.i.d.∼ Bern(p); then Y = ∑_{i=1}^n Xi ∼ Bin(n, p).
2 Let Xi ind.∼ Poisson(λi); then Y = ∑_{i=1}^n Xi ∼ Poisson(∑_{i=1}^n λi).
Proposition
1 Let Xi i.i.d.∼ N(0, 1); then Y = ∑_{i=1}^n Xi² ∼ χ²(n).
2 Let Xi ind.∼ χ²(νi); then Y = ∑_{i=1}^n Xi ∼ χ²(∑_{i=1}^n νi).
3 Let X ∼ N(0, 1); then Y = µ + σX ∼ N(µ, σ²).
4 Let Xi i.i.d.∼ Exp(λ); then Y = ∑_{i=1}^n Xi ∼ Gamma(n, λ).
5 Let X ∼ N(0, 1) and Y ∼ χ²(ν) be independent; then T = X/√(Y/ν) ∼ t(ν).
6 Let Xi ind.∼ N(µi, σi²); then Y = ∑_{i=1}^n Xi ∼ N(∑_{i=1}^n µi, ∑_{i=1}^n σi²).
Exercise
Let X and Y be independent random variables with S = N and
P(X = i) = P(Y = i) = 1/2^i,  i ∈ N.
Find the following probabilities:
P(min(X, Y) ≤ i)
P(X = Y)
P(Y > X)
Exercise
Let (X, Y)′ be an absolutely continuous, bivariate random vector with f_{X,Y}(x, y) given by
f_{X,Y}(x, y) = e^{−y} 1_{[0,y]}(x) 1_{[0,∞)}(y)
Are X and Y independent?
Moments
For a simple random variable X = ∑_{i=1}^n ai 1_{Ai} on (Ω, A, P), the expectation is defined as
E[X] = ∫ X dP = ∑_{i=1}^n ai P(Ai)
For a non-negative random variable X on (Ω, A, P), the expectation is defined as
E[X] = ∫ X dP = lim_{n→∞} ∫ ξn dP,
where ξn is an approximating sequence of simple random variables.
The expectation of an arbitrary random variable X on (Ω, A, P) is then defined via the decomposition of X into two non-negative random variables X+ and X−.
Definition
Let X : Ω → R be a random variable.
The positive part of X is defined by X+ = max{X, 0}.
The negative part of X is defined by X− = −min{X, 0}.
X+ and X− are nonnegative random variables.
X = X+ − X−
|X| = X+ + X−
If X ≥ 0, then X− = 0 and X = X+.
Definition (Expectation)
The expectation of a random variable, E[X], is said to exist, or to be defined, if at least one of E[X+] and E[X−] is finite:
min(E[X+], E[X−]) < ∞.
In this case we define
E[X] = E[X+] − E[X−].
Definition (Finite Expectation)
The expectation of X is said to be finite if E[X+] < ∞ and E[X−] < ∞.
Thus,
E[X] = E[X+] − E[X−], if E[X+] < ∞ and E[X−] < ∞
E[X] = ∞, if E[X+] = ∞ and E[X−] < ∞
E[X] = −∞, if E[X+] < ∞ and E[X−] = ∞
E[X] is undefined, if E[X+] = ∞ and E[X−] = ∞
Proposition
E[X] is finite ⇔ E[|X|] < ∞.
Proof.
Follows from |X| = X+ + X−.
Definition
Let L1(Ω, B, P) be the set of all random variables with E[|X|] < ∞:
L1(Ω, B, P) = {X : Ω → R : E[|X|] < ∞}
Proposition (Properties of expectation)
Let X, Y ∈ L1 and c ∈ R.
1 If X = c almost surely, then E[X] = c.
2 E[cX] = cE[X].
3 E[X + Y] = E[X] + E[Y].
4 If X ≤ Y, then E[X] ≤ E[Y].
5 If X ≥ 0, then E[X] ≥ 0.
(3) generalizes to the sum of n random variables (proof by induction).
We need computational formulas for the expectation of a random variable.
For discrete and absolutely continuous random variables the expectation, provided it exists, can be calculated using the familiar formulas.
Proposition
Let X be a discrete random variable; then
E[X] = ∑_{i=−∞}^{+∞} xi P(X = xi)
Let X be an absolutely continuous random variable with density fX(x); then
E[X] = ∫_{−∞}^∞ x fX(x) dx.
Example
Let X be an absolutely continuous random variable with density function
fX(x) = x^{−2} 1_{(1,∞)}(x)
Then for the negative part we find
E[X−] = E[−min{X, 0}] = E[0] = 0 < ∞
and for the positive part we find
E[X+] = E[max{X, 0}] = ∫_1^∞ x^{−1} dx = ∞
Thus, min{E[X−], E[X+]} < ∞.
⇒ E[X] exists and E[X] = ∞.
Example
Let X be an absolutely continuous random variable with density function
fX(x) = (1/2)|x|^{−2} 1_{(1,∞)}(|x|)
Then for the negative part we find
E[X−] = E[−min{X, 0}] = (1/2) ∫_{−∞}^{−1} |x|^{−1} dx = ∞
and for the positive part we find
E[X+] = E[max{X, 0}] = (1/2) ∫_1^∞ x^{−1} dx = ∞
Thus, E[X] does not exist.
Example
Let X ∼ U[a,b] with b > a. Then,
E[X] = ∫_{−∞}^∞ x fX(x) dx   (60)
= ∫_{−∞}^∞ (x/(b − a)) 1_{[a,b]}(x) dx   (61)
= ∫_a^b x/(b − a) dx   (62)
= (a + b)/2   (63)
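A quick Monte Carlo check of E[X] = (a + b)/2; the values of a and b below are arbitrary illustrative choices:

```python
import numpy as np

a, b = 2.0, 5.0
rng = np.random.default_rng(1)
x = rng.uniform(a, b, size=1_000_000)
print(x.mean(), (a + b) / 2)  # both ≈ 3.5
```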
Exercise
Let X be log-normally distributed with parameters µ and σ². Derive an expression for E[X].
Exercise
Let X ∼ Cauchy(0, 1). The density function is then given by
fX(x) = 1/(π(1 + x²)) 1_{(−∞,∞)}(x).
Does E[X] exist?
Exercise
Let X ∼ Pareto(α) with α ∈ (0, 1]. The density function is given by
fX(x) = (α/x^{α+1}) 1_{[1,∞)}(x)
Does E[X] exist?
Functions of random variables are often of great interest, and therefore so are their expectations.
For the special cases of discrete and absolutely continuous random variables there are the following computational formulas.
Proposition
Suppose X is a discrete random variable. If g(X) ∈ L1 or if g is positive, then
E[g(X)] = ∑_{i=−∞}^∞ g(xi) P(X = xi)
Proposition
Suppose X is an absolutely continuous random variable with density fX. If g(X) ∈ L1 or if g is positive, then
E[g(X)] = ∫_{−∞}^∞ g(x) fX(x) dx
Definition
Let Ln(Ω, B, P) be the set of all random variables with E[|X|^n] < ∞:
Ln(Ω, B, P) = {X : Ω → R : E[|X|^n] < ∞}
Proposition
If X ∈ Ln, then E[|X|^k] < ∞ for k = 1, 2, . . . , n.
Proof.
Write |X|^n = (|X|^k)^{n/k} for k = 1, 2, . . . , n. Note that the function f : R+ → R+ defined by f(x) = x^{r/s} is convex on R+ for r > s. Then by Jensen's inequality
+∞ > E[|X|^n] = E[(|X|^k)^{n/k}] ≥ (E[|X|^k])^{n/k}.
⇒ E[|X|^k] < +∞ for k = 1, 2, . . . , n ⇒ E[X^k] < +∞ for k = 1, 2, . . . , n.
Definition
If X ∈ Lp, then the r-th moment of X is given by
E[X^r] for r = 1, 2, . . . , p
Definition
If X ∈ Lp, then the r-th absolute moment of X is given by
E[|X|^r] for r = 1, 2, . . . , p
Definition
If X ∈ Lp, then the r-th central moment of X is given by
E[(X − E[X])^r] for r = 1, 2, . . . , p
Example
Let X follow a Beta distribution with density function
fX(x) = (1/B(r, s)) x^{r−1} (1 − x)^{s−1} 1_{[0,1]}(x).
Then,
E[X^k] = ∫_0^1 (1/B(r, s)) x^{k+r−1} (1 − x)^{s−1} dx.
Using the definition of the Beta function yields
E[X^k] = B(r + k, s)/B(r, s) = Γ(r + k)Γ(r + s) / (Γ(r + s + k)Γ(r))
For example:
E[X] = rΓ(r)Γ(r + s) / ((r + s)Γ(r + s)Γ(r)) = r/(r + s)
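The closed form E[X^k] = B(r + k, s)/B(r, s) can be checked against scipy's numeric moments; the values of r, s, k below are arbitrary choices:

```python
from scipy.special import beta as B
from scipy.stats import beta

r, s, k = 2.0, 3.0, 2   # assumed illustrative parameters
print(B(r + k, s) / B(r, s))   # closed form: 0.2
print(beta(r, s).moment(k))    # scipy's k-th moment: 0.2
```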
Definition (Variance)
Let X ∈ L2. The variance of X is defined by
σX² = Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
and the standard deviation of X is given by
σX = SD(X) = +√Var(X)
Proposition (Properties of Variance)
Var(X) ≥ 0
Var(aX + b) = a² Var(X) ∀a, b ∈ R
Var(X) = 0 ⇔ X = E[X] almost surely, i.e. P(X = E[X]) = 1
Proposition
If X and Y are in L2, then XY ∈ L1.
Proof.
Note that (X − Y)² = X² + Y² − 2XY ≥ 0 and (X + Y)² = X² + Y² + 2XY ≥ 0; together these yield |XY| ≤ X² + Y². Then by the inequality-preserving property of the expectation,
E[|XY|] ≤ E[X²] + E[Y²].
Thus, E[|XY|] < ∞.
Definition (Covariance)
Let X, Y ∈ L2. The covariance of X and Y is defined by
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
Proposition (Properties of Covariance)
Let X, Y, Z, T ∈ L2 and a, b, c, d ∈ R.
1 Cov(X, X) = Var(X)
2 Cov(X, Y) = Cov(Y, X)
3 Cov(aX + b, Y) = a Cov(X, Y)
4 Cov(X + Z, Y) = Cov(X, Y) + Cov(Z, Y)
5 Cov(aX + bZ, cY + dT) = ac Cov(X, Y) + ad Cov(X, T) + bc Cov(Z, Y) + bd Cov(Z, T)
Proof of (4).
Cov(X + Z, Y) = E[(X + Z − E[X + Z])(Y − E[Y])]   (64)
= E[XY] − E[X]E[Y] + E[ZY] − E[Z]E[Y]   (65)
= Cov(X, Y) + Cov(Z, Y).   (66)
Proposition
Let X ,Y ∈ L1 be independent. Then, E[XY ] = E[X ]E[Y ].
Proposition
If X and Y are independent, then Cov(X ,Y ) = 0.
Proof.
Follows trivially from the fact that independence implies E[XY] = E[X]E[Y].
The converse is false in general: Cov(X, Y) = 0 does not imply that X and Y are independent.
Special case: if X and Y are jointly normally distributed, then X and Y are independent iff Cov(X, Y) = 0.
Definition (Correlation)
Let X, Y be two real-valued random variables in L2. The correlation of X and Y is defined by
ρ_{X,Y} = Cov(X, Y) / (√Var(X) √Var(Y))
Proposition (Properties of Correlation)
Let X, Y ∈ L2 and a, b, c, d ∈ R.
−1 ≤ ρ_{X,Y} ≤ 1
|ρ_{X,Y}| = 1 ⇔ P(Y = a + bX) = 1 for some a, b ∈ R
ρ_{aX+b, cY+d} = ρ_{X,Y} (for ac > 0)
Exercise
Provide an example to show that E[XY] = E[X]E[Y] does not imply independence of X and Y.
Exercise
If a ∈ R \ {0} and b ∈ R, show that
ρ_{X,aX+b} = a/|a|.
Exercise
Let X, Y ∈ L2 and let
Z = (1/σY) Y − (ρ_{X,Y}/σX) X.
Show that σZ² = 1 − ρ²_{X,Y}.
The moments of R^n-valued random variables are defined componentwise. Thus, we can simply adapt the concepts from the univariate case.
Definition
Let X = (X1, . . . , Xn) be an R^n-valued random variable. Provided that Xi ∈ L1, the first moment is defined by
E[X] = (E[X1], . . . , E[Xn])′
Definition
Let X = (X1, . . . , Xn) be an R^n-valued random variable. Provided that Xi ∈ L2, the covariance matrix of X, ΣX, is the n × n matrix defined by
σ_{i,j} = Cov(Xi, Xj)
Proposition (Properties of Covariance Matrices)
1 Var(Xi) = σ_{i,i}
2 σ_{i,j} = σ_{j,i} (symmetric)
3 ∀a ∈ R^n: a′ΣXa ≥ 0 (positive semidefinite)
4 Let A ∈ R^{m×n}. Then Σ_{AX} = A ΣX A′.
Proof of (3).
a′ΣXa = ∑_{i=1}^n ∑_{j=1}^n ai aj σ_{i,j}   (67)
= Var(∑_{i=1}^n ai Xi) ≥ 0  ∀a ∈ R^n   (68)
Definition
Let X = (X1, . . . , Xn) be an R^n-valued random variable. Provided that Xi ∈ L2, the correlation matrix of X, ΞX, is the n × n matrix defined by
Ξ_{i,j} = Cov(Xi, Xj) / (√Var(Xi) √Var(Xj))
Exercise
Let (X, Y)′ be an absolutely continuous, bivariate random vector with density given by
f_{(X,Y)}(x, y) = (1/π) 1_{[−1,1]}(x) 1_{[−√(1−x²), +√(1−x²)]}(y)
Derive Cov(X, Y).
Are X and Y independent?
Exercise
Show that |ρ_{X,Y}| ≤ 1. Use the fact that |E[XY]| ≤ +√(E[X²]E[Y²]).
Exercise
Let Xi ∈ L2. Show that Var(∑_{i=1}^n ai Xi) = ∑_{i=1}^n ∑_{j=1}^n ai aj Cov(Xi, Xj).
Inequalities
Proposition (Modulus inequality)
If X ∈ L1, then
|E[X]| ≤ E[|X|]
Proof.
|E[X]| = |E[X+] − E[X−]| ≤ E[X+] + E[X−] = E[|X|]
Proposition (Jensen's inequality)
Let X ∈ L1 and let f : R → R be a convex (concave) function such that f(X) ∈ L1. Then,
E[f(X)] ≥ f(E[X]), if f is convex,
E[f(X)] ≤ f(E[X]), if f is concave.
If f is strictly convex (concave) and X is not almost surely constant, the inequality is strict.
Proposition (Cauchy-Schwarz inequality)
Let X, Y ∈ L2. Then,
|E[XY]| ≤ +√(E[X²]E[Y²])
Proof.
For every t ∈ R it holds that
0 ≤ E[(tX + Y)²] = t²E[X²] + 2tE[XY] + E[Y²].
This is a quadratic in t with at most one real root. First consider the case where E[(tX + Y)²] = 0 for some t: then Y = −tX almost surely, and
|E[XY]| = |E[−tX²]| = |t|E[X²]
√(E[X²]E[Y²]) = √(E[X²]E[t²X²]) = |t|E[X²].
Proof cont'd.
Next, assume there is no t with Y = −tX. This implies that E[(tX + Y)²] > 0 for all t, so the quadratic has no real root. Thus its discriminant must be negative:
4[(E[XY])² − E[X²]E[Y²]] < 0.
It then follows that
(E[XY])² − E[X²]E[Y²] < 0 ⇔ |E[XY]| < +√(E[X²]E[Y²]).
Combining both cases yields the Cauchy-Schwarz inequality.
Another version of the Cauchy-Schwarz inequality: E[|XY|] ≤ +√(E[X²]E[Y²]).
Derivation: apply the previously proven inequality to the random variables |X| and |Y|.
Proposition (Markov's Inequality)
Let X ∈ Lr; then for a ∈ R++
P(|X| ≥ a) ≤ E[|X|^r]/a^r.
The most common special case of Markov's inequality:
Proposition (Markov's Inequality)
Let X be a nonnegative random variable in L1; then for a ∈ R++
P(X ≥ a) ≤ E[X]/a.
Proposition (Chebyshev's Inequality)
Let X ∈ L2; then for a ∈ R++
P(|X − E[X]| ≥ a) ≤ E[(X − E[X])²]/a²
Proof.
Define Z = (X − E[X])² and apply Markov's inequality:
P(|X − E[X]| ≥ a) = P(Z ≥ a²) ≤ E[Z]/a² = E[(X − E[X])²]/a²
Alternative Proof.
Define the event A = {|X − E[X]| ≥ a} and the indicator random variable I = 1A. Then it holds that |X − E[X]|²/a² ≥ I, and by the inequality-preserving property of the expectation
E[(X − E[X])²]/a² ≥ E[1A] = P(|X − E[X]| ≥ a).
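An empirical illustration of Chebyshev's bound, using an assumed distribution (standard normal) and threshold a = 2:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000)
a = 2.0

empirical = np.mean(np.abs(x - x.mean()) >= a)  # P(|X - E[X]| >= a)
bound = x.var() / a**2                          # Chebyshev upper bound
print(empirical, bound)  # ~0.0455 <= 0.25
```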
Proposition (Minkowski's Inequality)
Let X, Y ∈ Lp; then for p > 1
(E[|X + Y|^p])^{1/p} ≤ (E[|X|^p])^{1/p} + (E[|Y|^p])^{1/p}
Proposition (Hölder's Inequality)
Let X ∈ Lp and Y ∈ Lq, with p, q > 1 and 1/p + 1/q = 1. Then,
E[|XY|] ≤ (E[|X|^p])^{1/p} (E[|Y|^q])^{1/q}
Proposition (Lyapunov's Inequality)
Let X ∈ Ls; then
(E[|X|^r])^{1/r} ≤ (E[|X|^s])^{1/s}, 1 ≤ r ≤ s.
Exercise
Let X ∈ L2. Show that Var(X) = 0 ⇒ P(X = E[X]) = 1.
Exercise
Show that the following more general version of Markov's inequality holds: let X be a real random variable and let f : [0, ∞) → [0, ∞) be monotone increasing. Then for any a > 0 with f(a) > 0,
P(|X| ≥ a) ≤ E[f(|X|)]/f(a).
Hints:
R = {X ≥ a} ∪ {X < a}
f(X) = f(X) 1_R(X)
Moment Generating Functions
Definition (Moment Generating Function)
Let X be a real-valued random variable. Its moment generating function M : R → R is defined by
MX(t) = E[e^{tX}]
Example
Let Z ∼ U[0,1].
MZ(t) = E[e^{tZ}] = ∫_0^1 e^{tz} dz = [e^{tz}/t]_0^1 = (e^t − 1)/t,  t ≠ 0
Let Y ∼ DU(θ) with p.m.f. fY(y) = θ^{−1} 1_{{1,...,θ}}(y):
MY(t) = E[e^{tY}] = ∑_{i=1}^θ (1/θ) e^{ti}
Definition
The moment generating function is said to exist if it is finite on an open neighbourhood of zero, i.e. if there is an h ∈ R++ such that, ∀t ∈ (−h, h), M(t) < +∞.
Proposition
If the moment generating function exists in an open interval containing zero, it uniquely determines the probability distribution.
Proposition
If MX(t) exists, then
all positive moments are finite: ∀r ∈ R++, E[|X|^r] < +∞
E[X^j] = M_X^{(j)}(t)|_{t=0}
Why the expectation of e^{tX}?
The power series of e^{tX} is given by ∑_{n=0}^∞ (t^n/n!) X^n.
By the linearity of the expectation:
E[e^{tX}] = E[∑_{n=0}^∞ (t^n/n!) X^n] = ∑_{n=0}^∞ (t^n/n!) E[X^n]  ∀|t| < h.
It can be shown that termwise differentiation is valid; then
M_X^{(j)}(t) = ∑_{i=j}^∞ (t^{i−j}/(i − j)!) E[X^i] = E[X^j ∑_{n=0}^∞ (tX)^n/n!] = E[X^j e^{tX}].
It follows that M_X^{(j)}(t)|_{t=0} = E[X^j].
Example
Let X ∼ Gamma(α, λ) with α, λ ∈ R++.
MX(t) = ∫_0^{+∞} e^{tx} (λ^α/Γ(α)) x^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫_0^{+∞} x^{α−1} e^{x(t−λ)} dx
The integral converges for t < λ:
MX(t) = (λ^α/Γ(α)) · Γ(α)/(λ − t)^α = (λ/(λ − t))^α  ∀t ∈ (−∞, λ)
Thus, MX(t) < ∞ ∀t ∈ (−λ, λ).
E[X] = M′X(0) = α/λ
E[X²] = M″X(0) = α(α + 1)/λ²
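The moments E[X] = α/λ and E[X²] = α(α + 1)/λ² can be recovered by differentiating the m.g.f. symbolically; a sketch using sympy:

```python
import sympy as sp

t = sp.symbols('t', real=True)
alpha, lam = sp.symbols('alpha lamda', positive=True)
M = (lam / (lam - t)) ** alpha   # Gamma m.g.f. from the example

print(sp.simplify(M.diff(t).subs(t, 0)))      # alpha/lamda
print(sp.simplify(M.diff(t, 2).subs(t, 0)))   # alpha*(alpha + 1)/lamda**2
```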
Example
Let Y ∼ U[0,1].
MY(t) = (e^t − 1)/t, if t ≠ 0;  MY(t) = 1, if t = 0.
MY(t) is continuous at zero:
lim_{t→0} (e^t − 1)/t = lim_{t→0} e^t/1 = 1 = MY(0).
By l'Hôpital's rule (applied twice),
M′Y(0) = lim_{t→0} (t e^t − e^t + 1)/t² = lim_{t→0} (t e^t + e^t)/2 = 1/2.
Thus, E[Y] = 1/2.
Proposition
Let X be a real-valued random variable with m.g.f. MX(t) and define Y = µ + σX; then MY(t) = e^{µt} MX(σt).
Let Xi be n independent random variables with m.g.f.s M_{Xi}(t), and define Z = ∑_{i=1}^n Xi; then MZ(t) = M_{X1}(t) · · · M_{Xn}(t) on the common interval where all m.g.f.s exist.
Proof.
MY(t) = E[e^{tY}]   (69)
= E[e^{t(µ+σX)}]   (70)
= E[e^{tµ} e^{σtX}]   (71)
= e^{tµ} E[e^{σtX}]   (72)
= e^{tµ} MX(σt)   (73)
Proof.
MZ(t) = E[e^{tZ}]   (74)
= E[e^{t(∑_{i=1}^n Xi)}]   (75)
= E[e^{tX1} · · · e^{tXn}]   (76)
= E[e^{tX1}] · · · E[e^{tXn}]   (77)
= M_{X1}(t) · · · M_{Xn}(t)   (78)
This property is very helpful for determining the distribution of sums of independent random variables.
Example
Show that if Xi i.i.d.∼ N(0, 1), then ∑_{i=1}^n Xi² ∼ χ²(n). First derive the m.g.f. of X²:
E[e^{tX²}] = ∫_{−∞}^∞ (1/√(2π)) e^{−x²(1/2 − t)} dx = ∫_{−∞}^∞ (1/√(2π)) e^{−(x²/2)(1−2t)} dx.
Make the change of variable u = x√(1 − 2t), so dx = du/√(1 − 2t):
E[e^{tX²}] = (1/√(1 − 2t)) ∫_{−∞}^∞ (1/√(2π)) e^{−u²/2} du = 1/√(1 − 2t).
Thus, E[e^{tX²}] < ∞ for t < 1/2.
M_{∑_{i=1}^n Xi²}(t) = (1/√(1 − 2t))^n = M_Z(t) for Z ∼ χ²(n)
Exercise
Let Xi ind.∼ Poisson(λi) and define Y = ∑_{i=1}^n Xi.
Derive an expression for M_{X1}(t).
Show that Y ∼ Poisson(∑_{i=1}^n λi).
Exercise
Let X follow an exponential distribution with density
fX(x) = λ e^{−λx} 1_{[0,∞)}(x).
Derive Var(X) via the moment generating function.
Exercise
Let Xi ind.∼ N(µi, σi²) and define Sn = ∑_{i=1}^n Xi and D = X1 − X2.
Show that Sn ∼ N(∑_{i=1}^n µi, ∑_{i=1}^n σi²).
Show that if σ1² = σ2², then S2 and D are independent.
Transformations of Random Variables
Let X be an absolutely continuous random variable with density fX and suppose that Y = g(X). Is it possible to express the density of Y in terms of fX?
Proposition
Let X have a continuous density function fX. Let g : D ⊆ R → R be a C¹ function and strictly monotone. Then Y = g(X) has the density
fY(y) = fX(g^{−1}(y)) |d g^{−1}(y)/dy|
If FY(y) is differentiable, then fY(y) = d FY(y)/dy.
A modified version holds for piecewise strictly monotone and piecewise C¹ functions.
Proof.
Suppose g is increasing and let h(y) = g^{−1}(y). Then,
FY(y) = P(g(X) ≤ y)   (79)
= P(h(g(X)) ≤ h(y))   (80)
= P(X ≤ h(y))   (81)
= ∫_{−∞}^{h(y)} fX(t) dt   (82)
h(y) is differentiable, and therefore by applying Leibniz's rule
d FY(y)/dy = d/dy ∫_{−∞}^{h(y)} fX(t) dt = fX(h(y)) h′(y).
If g is decreasing,
d FY(y)/dy = fX(h(y))(−h′(y)).
Proof cont’d.
Thus, summarizing both cases,
fY(y) = fX(g^{−1}(y)) |d g^{−1}(y)/dy|
Proposition
Let X have a continuous density fX. Let g : R → R be piecewise strictly monotone and piecewise continuously differentiable: that is, there exist intervals I1, I2, . . . , In which partition R such that g is strictly monotone and continuously differentiable on the interior of each Ii. For each i, g : Ii → R is invertible on g(Ii); let hi be the inverse function. Let Y = g(X). Then the density fY of Y exists and is given by
fY(y) = ∑_{i=1}^n fX(hi(y)) |d hi(y)/dy| 1_{g(Ii)}(y).
Example
Let X ∼ N(0, 1) and let Y = X². Define I1 = [0, ∞) and I2 = (−∞, 0). Then g is injective and strictly monotone on I1 and I2. The inverse functions h1 and h2 are given by
h1(y) = √y and h2(y) = −√y.
Thus,
fY(y) = fX(h1(y)) |h′1(y)| 1_{(0,∞)}(y) + fX(h2(y)) |h′2(y)| 1_{(0,∞)}(y)   (83)
= (1/√(2π)) e^{−y/2} (1/(2√y)) 1_{(0,∞)}(y) + (1/√(2π)) e^{−y/2} (1/(2√y)) 1_{(0,∞)}(y)   (84)
= (1/√(2π)) (1/√y) e^{−y/2} 1_{(0,∞)}(y)   (85)
⇒ Y ∼ χ²(1).
Example (Cont’d)
Alternative derivation:
FY(y) = P(Y ≤ y)   (86)
= P(X² ≤ y)   (87)
= P(−√y ≤ X ≤ √y)   (88)
= FX(√y) − FX(−√y)   (89)
= ∫_{−∞}^{√y} (1/√(2π)) e^{−x²/2} dx − ∫_{−∞}^{−√y} (1/√(2π)) e^{−x²/2} dx.   (90)
Thus,
fY(y) = d FY(y)/dy = d FX(√y)/dy − d FX(−√y)/dy = (1/√(2π)) (1/√y) e^{−y/2} 1_{(0,∞)}(y).
⇒ Y ∼ χ²(1).
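The conclusion Y ∼ χ²(1) is easy to confirm by simulation, comparing empirical probabilities with scipy's chi-square cdf:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
y = rng.standard_normal(1_000_000) ** 2   # squares of standard normals

for q in (0.5, 1.0, 2.0):
    print(np.mean(y <= q), chi2.cdf(q, df=1))  # empirical vs exact
```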
Proposition
Let X be an R^n-valued absolutely continuous random variable with density fX. Let g : R^n → R^n be a C¹ function, injective, with non-vanishing Jacobian. Then Y = g(X) has density
fY(y) = fX(g^{−1}(y)) |det J_{g^{−1}}(y)|, if y is in the range of g; 0, otherwise.
Proposition
Let S ⊂ R^n be partitioned into disjoint subsets S0, S1, . . . , Sm such that ∪_{i=0}^m Si = S, S0 has Lebesgue measure zero, and for each i = 1, . . . , m, gi : Si → R^n is injective and C¹ with non-vanishing Jacobian. Let Y = g(X); then
fY(y) = ∑_{i=1}^m fX(g_i^{−1}(y)) |det J_{g_i^{−1}}(y)|.
Exercise
Let X ∼ U[0,1] and let Y = −(1/λ) ln(X) with λ > 0. Derive the density function of Y.
Exercise
Let X ∼ U[−1,1]. Derive the density of Y = X^k for k ∈ N \ {0}.
Exercise
Let X be an absolutely continuous and positive random variable with density fX. Define Y = 1/(X + 1). Find the density of Y.
Convergence Concepts
Let X be a random variable and let (Xn)n≥1 be a sequence of random variables defined on the same probability space (Ω, B, P).
Definition (Almost Sure Convergence)
A sequence of random variables (Xn)n≥1 converges almost surely to a limiting random variable X if
P({ω ∈ Ω : lim_{n→∞} Xn(ω) = X(ω)}) = 1,
denoted by Xn a.s.→ X.
Equivalent definition:
Definition
A sequence of random variables (Xn)n≥1 converges almost surely to a limiting random variable X if
P({ω ∈ Ω : lim_{n→∞} Xn(ω) ≠ X(ω)}) = 0
Let X be a random variable and let (Xn)n≥1 be a sequence of random variables defined on the same probability space (Ω, B, P).
Definition (Convergence in Probability)
A sequence of random variables (Xn)n≥1 converges in probability to a limiting random variable X if for any ε > 0 we have
lim_{n→∞} P({ω ∈ Ω : |Xn(ω) − X(ω)| > ε}) = 0,
denoted by Xn p→ X.
More common notation: lim_{n→∞} P(|Xn − X| > ε) = 0.
Thus, Xn p→ X states that the probability that Xn and X are more than a prescribed ε > 0 apart converges to zero as n → ∞.
Statistics: an estimator β̂n is consistent for β if β̂n p→ β.
Application:
Theorem (WLLN)

Let (X_n)_{n≥1} be a sequence of i.i.d. random variables in L^2 with µ = E[X_1] and σ^2 = Var(X_1). Define \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then

\bar{X}_n \xrightarrow{p} µ.

Proof.

By Chebyshev's inequality,

P(|\bar{X}_n − E[\bar{X}_n]| ≥ ε) = P(|\bar{X}_n − µ| ≥ ε)   (91)
≤ \frac{Var(\bar{X}_n)}{ε^2}   (92)
= \frac{Var(X_1)}{nε^2}.   (93)

Hence, \lim_{n\to\infty} P(|\bar{X}_n − µ| ≥ ε) = 0 ⇒ \bar{X}_n \xrightarrow{p} µ.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 129 / 160
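The proof rests on Chebyshev's inequality, and the bound can be compared with the actual probability by simulation. A sketch (numpy assumed; standard normal X_i and ε = 0.1 are illustrative choices, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, reps = 0.1, 10_000
for n in (10, 100, 1000):
    xbar = rng.standard_normal((reps, n)).mean(axis=1)  # X_i ~ N(0,1): mu = 0, Var(X_1) = 1
    p_emp = np.mean(np.abs(xbar) >= eps)                # P(|sample mean - mu| >= eps)
    print(n, p_emp, 1 / (n * eps**2))                   # empirical probability vs Chebyshev bound
```

The bound is conservative (it even exceeds 1 for small n), but both it and the empirical probability vanish as n → ∞, which is all the proof needs.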
Convergence Concepts
Let X be a random variable and let (X_n)_{n≥1} be a sequence of random variables, not necessarily defined on the same probability space, and let C_{F_X} ⊆ R denote the set of all points at which F_X is continuous.

Definition (Convergence in Distribution)

The sequence (X_n)_{n≥1} converges to X in distribution if

\lim_{n\to\infty} F_{X_n}(t) = F_X(t) for all t ∈ C_{F_X},

denoted by X_n \xrightarrow{d} X.

Convergence in distribution is the weakest of the three convergence concepts considered here.

It only requires convergence of the distribution functions; this is why X and each X_n may be defined on different probability spaces.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 130 / 160
Convergence Concepts
Example
Let X be a point mass at 0 (i.e. X = 0 almost surely) and let the distribution function of X_n be given by

F_{X_n}(x) = \frac{e^{nx}}{1 + e^{nx}} for all x ∈ R.

Clearly, the distribution function of X is given by

F_X(x) = 0 if x < 0, and F_X(x) = 1 if x ≥ 0,

and the limiting distribution function is given by

\lim_{n\to\infty} F_{X_n}(x) = F^*(x) = 0 if x < 0, 1/2 if x = 0, and 1 if x > 0.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 131 / 160
Convergence Concepts
Example (cont’d)
\lim_{x\to 0^+} F^*(x) ≠ F^*(0). Thus, the limiting function F^* is not right-continuous and therefore not a distribution function.

The definition of convergence in distribution requires only

\lim_{n\to\infty} F_{X_n}(t) = F_X(t) for all t ∈ C_{F_X}.

Here C_{F_X} = R \ {0}, and

\lim_{n\to\infty} F_{X_n}(x) = F_X(x) for all x ∈ C_{F_X} (the limit is 0 = F_X(x) for x < 0 and 1 = F_X(x) for x > 0)

⇒ X_n \xrightarrow{d} X.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 132 / 160
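The pointwise limits in this example are easy to check numerically. A small sketch (numpy assumed), evaluating F_{X_n} left of, at, and right of zero:

```python
import numpy as np

def F(n, x):
    # F_{X_n}(x) = e^{nx} / (1 + e^{nx}), rewritten stably for large |nx|
    return 1.0 / (1.0 + np.exp(-n * x))

for n in (1, 10, 100, 1000):
    print(n, F(n, -0.1), F(n, 0.0), F(n, 0.1))   # -> 0, stays at 1/2, -> 1
```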
Convergence Concepts
Convergence in distribution can also be established via moment generating functions.

Proposition

Let (X_n)_{n≥1} be a sequence of random variables such that M_{X_n}(t) exists for |t| < h, h ∈ R_{++}, and all n ∈ N. If X is a random variable such that M_X(t) exists for |t| ≤ h_1 < h, then

\lim_{n\to\infty} M_{X_n}(t) = M_X(t) for |t| < h_1 ⇒ X_n \xrightarrow{d} X.

Proposition (Scheffé's Lemma)

Let (X_n)_{n≥1} be a sequence of absolutely continuous random variables with corresponding densities (f_{X_n})_{n≥1} and let X be an absolutely continuous random variable with density f_X. Then

f_{X_n}(x) → f_X(x) for almost every x ⇒ X_n \xrightarrow{d} X.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 133 / 160
Convergence Concepts
Exercise
Let (Ω, B, P) be a probability space, where Ω = [0, 1], B = B([0, 1]) and P = U[0,1]. Define the sequence of random variables X_n : Ω → R by

X_n(ω) = ω + ω^n for all n ∈ N,

and the random variable X(ω) = ω.

Does it hold that X_n \xrightarrow{a.s.} X, i.e. P({ω ∈ Ω : \lim_{n\to\infty} X_n(ω) = X(ω)}) = 1?
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 134 / 160
Convergence Concepts
Proposition
Let (X_n)_{n≥1} be a sequence of random variables. Then

X_n \xrightarrow{a.s.} X ⇒ X_n \xrightarrow{p} X ⇒ X_n \xrightarrow{d} X.
Proposition
Let (X_n)_{n≥1} and (Y_n)_{n≥1} be sequences of random variables and c ∈ R. Then

If X_n \xrightarrow{a.s.} X and Y_n \xrightarrow{a.s.} Y, then X_n + Y_n \xrightarrow{a.s.} X + Y.
If X_n \xrightarrow{p} X and Y_n \xrightarrow{p} Y, then X_n + Y_n \xrightarrow{p} X + Y.
If X_n \xrightarrow{d} X and Y_n \xrightarrow{d} c, then X_n + Y_n \xrightarrow{d} X + c.
If X_n \xrightarrow{a.s.} X and Y_n \xrightarrow{a.s.} Y, then X_n Y_n \xrightarrow{a.s.} XY.
If X_n \xrightarrow{p} X and Y_n \xrightarrow{p} Y, then X_n Y_n \xrightarrow{p} XY.
If X_n \xrightarrow{d} X and Y_n \xrightarrow{d} c, then X_n Y_n \xrightarrow{d} cX.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 135 / 160
Convergence Concepts
Proposition
If for some c ∈ R it holds that X = c almost surely, i.e. P(X = c) = 1, then X_n \xrightarrow{d} X ⇒ X_n \xrightarrow{p} X.

Proof.

Fix ε > 0. Then

P(|X_n − c| > ε) = P({X_n > c + ε} ∪ {X_n < c − ε})   (94)
= P(X_n > c + ε) + P(X_n < c − ε)   (95)
≤ (1 − F_{X_n}(c + ε)) + F_{X_n}(c − ε).   (96)

For any ε > 0, c ± ε ∈ C_{F_X}, therefore

\lim_{n\to\infty} [(1 − F_{X_n}(c + ε)) + F_{X_n}(c − ε)] = 0.

Thus, \lim_{n\to\infty} P(|X_n − c| > ε) = 0.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 136 / 160
Convergence Concepts
Let X be an R^d-valued random variable and let (X_n)_{n≥1} be a sequence of R^d-valued random variables defined on the same probability space (Ω, B, P).

Definition (Almost Sure Convergence)

A sequence of R^d-valued random variables (X_n)_{n≥1} converges almost surely to a limiting random variable X if

P({ω ∈ Ω : \lim_{n\to\infty} X_n(ω) = X(ω)}) = 1,

denoted by X_n \xrightarrow{a.s.} X.

Almost sure convergence can be established either componentwise or directly for the vector sequence:

X_n \xrightarrow{a.s.} X ⇔ X_{n,i} \xrightarrow{a.s.} X_i for all i ∈ {1, ..., d}.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 137 / 160
Convergence Concepts
Let X be an R^d-valued random variable and let (X_n)_{n≥1} be a sequence of R^d-valued random variables defined on the same probability space (Ω, B, P).

Definition (Convergence in Probability)

A sequence of R^d-valued random variables (X_n)_{n≥1} converges in probability to a limiting random variable X if for every ε > 0 we have

\lim_{n\to\infty} P({ω ∈ Ω : ||X_n(ω) − X(ω)|| > ε}) = 0,

denoted by X_n \xrightarrow{p} X.

Convergence in probability can also be established either componentwise or via the vector sequence:

X_n \xrightarrow{p} X ⇔ X_{n,i} \xrightarrow{p} X_i for all i ∈ {1, ..., d}.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 138 / 160
Convergence Concepts
The extension of the definition of convergence in distribution to R^d-valued random variables is immediate:

\lim_{n\to\infty} F_{X_n}(t) = F_X(t) for all t ∈ C_{F_X}.

However, multivariate distribution functions are often hard to work with directly.

Unfortunately, componentwise convergence in distribution does not imply convergence in distribution of the vector sequence. The converse is true:

X_n \xrightarrow{d} X ⇒ X_{n,i} \xrightarrow{d} X_i for all i ∈ {1, ..., d}.

Proposition (Cramér-Wold device)

Let (X_n)_{n≥1} be a sequence of R^d-valued random variables. Then

X_n \xrightarrow{d} X ⇔ λ′X_n \xrightarrow{d} λ′X for all λ ∈ R^d.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 139 / 160
Convergence Concepts
Almost sure convergence, convergence in probability and convergence in distribution are preserved under continuous transformations.

Theorem (Continuous Mapping Theorem)

Let g : D ⊆ R^d → R^r be a continuous function that does not depend on n. Then

If X_n \xrightarrow{a.s.} X, then g(X_n) \xrightarrow{a.s.} g(X).
If X_n \xrightarrow{p} X, then g(X_n) \xrightarrow{p} g(X).
If X_n \xrightarrow{d} X, then g(X_n) \xrightarrow{d} g(X).

Theorem (Slutzky's theorem)

Let X_n \xrightarrow{d} X, Y_n \xrightarrow{p} c for a constant vector c ∈ R^k, and A_n \xrightarrow{p} A for a constant matrix A ∈ R^{k×r}. Then

A_n X_n + Y_n \xrightarrow{d} AX + c.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 140 / 160
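A brief numerical illustration of the continuous mapping theorem (a sketch, not from the slides; numpy assumed, with g = exp and µ = 0.5 as illustrative choices): since the sample mean converges in probability to µ and exp is continuous, exp of the sample mean converges in probability to exp(µ).

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5
for n in (10, 1_000, 100_000):
    xbar = rng.normal(mu, 1.0, size=n).mean()    # sample mean of X_i ~ N(mu, 1)
    print(n, np.exp(xbar), np.exp(mu))           # g(sample mean) approaches exp(mu)
```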
Convergence Concepts
Exercise
Show that \sqrt{n}(\hat{β}_n − β) \xrightarrow{d} N(0, Ω) implies that \hat{β}_n \xrightarrow{p} β.
p→ β.
Exercise
Provide an example of two sequences of random variables (X_n)_{n≥1} and (Y_n)_{n≥1} and two random variables X and Y such that X_n \xrightarrow{d} X and Y_n \xrightarrow{d} Y, but (X_n, Y_n)′ does not converge in distribution to (X, Y)′.
Exercise
Let X_n ∼ U[−1/n, 1/n] and let X = 0 almost surely. Is it true that X_n \xrightarrow{p} X?
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 141 / 160
Law of Large Numbers
Law of Large Numbers
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 142 / 160
Law of Large Numbers
There are many different laws of large numbers, depending on the assumptions imposed on the random variables.

We have already seen a weak law of large numbers under the strong assumptions that the sequence of random variables is i.i.d. and in L^2. However, the proof of the WLLN still works if the random variables are only uncorrelated instead of independent. Thus, the theorem can be restated as follows:

Theorem (WLLN)

Let (X_n)_{n≥1} be a sequence of identically distributed and uncorrelated (i.e. Cov(X_i, X_j) = 0 for all j ≠ i) random variables in L^2 with µ = E[X_1]. Define \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then

\bar{X}_n \xrightarrow{p} µ.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 143 / 160
Law of Large Numbers
Proof.

By Chebyshev's inequality,

P(|\bar{X}_n − E[\bar{X}_n]| ≥ ε) ≤ \frac{Var(\bar{X}_n)}{ε^2}   (97)
= \frac{Var(\frac{1}{n}\sum_{i=1}^{n} X_i)}{ε^2}   (98)
= \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} Cov(X_i, X_j)}{n^2 ε^2}   (99)
= \frac{\sum_{i=1}^{n} Var(X_i)}{n^2 ε^2}   (100)
= \frac{n Var(X_1)}{n^2 ε^2}   (101)
= \frac{Var(X_1)}{n ε^2}.   (102)

Uncorrelatedness removes the cross-covariance terms in (99); identical distributions give (101).

Hence, \lim_{n\to\infty} P(|\bar{X}_n − µ| ≥ ε) = 0 ⇒ \bar{X}_n \xrightarrow{p} µ.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 144 / 160
Law of Large Numbers
There is a weak law of large numbers which drops the assumption about the finiteness of the second moment. This is known as Khintchin's WLLN:

Theorem (Khintchin's WLLN)

Let (X_i)_{i≥1} be a sequence of i.i.d. random variables with E[|X_1|] < ∞. Define \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then

\bar{X}_n \xrightarrow{p} E[X_1].

Example

Let X_i \overset{i.i.d.}{\sim} t(2). Then E[X_i] = 0 and σ^2_{X_i} = ∞. The WLLN above does not apply because the second moment fails to be finite. However, Khintchin's WLLN does apply and we can conclude that \bar{X}_n \xrightarrow{p} E[X_i] = 0.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 145 / 160
Law of Large Numbers
[Figure: running average \bar{X}_n of i.i.d. t(2) samples plotted against n = 0, ..., 10000 (vertical axis "Average" from −6 to 6); despite occasional large spikes caused by the heavy tails, the average settles near 0.]
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 146 / 160
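The figure can be reproduced along the following lines (a sketch; numpy/matplotlib and the seed are assumptions of this illustration, not part of the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.standard_t(df=2, size=10_000)                  # i.i.d. t(2): E|X_1| < inf, Var(X_1) = inf
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

plt.plot(running_mean)
plt.axhline(0.0, linestyle="--")                       # Khintchin's WLLN: running mean -> 0
plt.xlabel("n")
plt.ylabel("Average")
plt.show()
```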
Law of Large Numbers
The strong law of large numbers refers to almost sure convergence.

Kolmogorov's SLLN requires the same assumptions as Khintchin's WLLN, but establishes almost sure convergence.

Theorem (Kolmogorov's SLLN)

Let (X_i)_{i≥1} be a sequence of i.i.d. random variables with E[|X_1|] < ∞. Define \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then

\bar{X}_n \xrightarrow{a.s.} E[X_1].

Almost sure convergence and convergence in probability of R^d-valued random variables can be established componentwise. Thus, the laws of large numbers extend easily to R^d-valued random variables.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 147 / 160
Law of Large Numbers
The fact that (measurable) functions of independent random variables are again independent implies the following proposition.

Proposition

Let (X_n)_{n≥1} be a sequence of i.i.d. random variables in L^k and define the sequence (Y_n)_{n≥1} := (X_n^k)_{n≥1}. Then

\frac{1}{n}\sum_{i=1}^{n} X_i^k \xrightarrow{p} E[X_1^k].

Example

Let X_i \overset{i.i.d.}{\sim} N(0, 4). Then the first four moments are E[X_1] = 0, E[X_1^2] = 4, E[X_1^3] = 0 and E[X_1^4] = 3 \cdot 4^2 = 48.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 148 / 160
Law of Large Numbers
[Figure: four panels showing the running sample moments \frac{1}{n}\sum_{i=1}^{n} X_i^k for k = 1 (First Moment), 2 (Second Moment), 3 (Third Moment) and 4 (Fourth Moment) of i.i.d. N(0,4) samples against n = 0, ..., 10000, converging to 0, 4, 0 and 48 respectively.]
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 149 / 160
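A sketch reproducing the four panels (numpy/matplotlib and the seed are assumptions of this illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=10_000)            # N(0,4): standard deviation 2
n = np.arange(1, x.size + 1)

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for k, ax, limit in zip((1, 2, 3, 4), axes, (0, 4, 0, 48)):
    ax.plot(np.cumsum(x**k) / n)                 # running k-th sample moment
    ax.axhline(limit, linestyle="--")            # E[X_1^k] for N(0,4)
    ax.set_title(f"Moment {k}")
    ax.set_xlabel("n")
plt.show()
```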
Central Limit Theorem
Central Limit Theorem
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 150 / 160
Central Limit Theorem
Theorem (CLT for i.i.d. random variables)

Let (X_n)_{n≥1} be an i.i.d. sequence of random variables in L^2 with µ = E[X_1] and σ^2 = Var(X_1), and define \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then

\frac{\sqrt{n}(\bar{X}_n − µ)}{σ} \xrightarrow{d} N(0, 1),

or in terms of sums,

\frac{\sum_{i=1}^{n} X_i − nµ}{\sqrt{n}σ} \xrightarrow{d} N(0, 1).

There are many different CLTs, depending on the underlying assumptions.

Asymptotic approximation: \bar{X}_n \overset{a}{\sim} N(µ, σ^2/n).

Proof via characteristic functions and Lévy's Continuity Theorem.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 151 / 160
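A simulation sketch of the CLT (not from the slides; numpy/scipy are assumed, and exponential summands with µ = σ = 1 and n = 200 are illustrative choices): the empirical distribution of the standardized sample mean is compared with Φ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 200, 50_000
x = rng.exponential(scale=1.0, size=(reps, n))    # Exp(1): mu = 1, sigma^2 = 1
z = np.sqrt(n) * (x.mean(axis=1) - 1.0)           # sqrt(n)(sample mean - mu)/sigma, sigma = 1

for t in (-1.0, 0.0, 1.0, 2.0):
    print(t, np.mean(z <= t), stats.norm.cdf(t))  # empirical CDF of z vs Phi(t)
```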
Central Limit Theorem
The CLT extends to the multivariate case:

Theorem

Let (X_n)_{n≥1} be an i.i.d. sequence of R^d-valued random variables with mean vector µ = E[X_1] = (E[X_{1,1}], ..., E[X_{1,d}])′ and covariance matrix Σ_X = (σ_{i,j})_{1≤i,j≤d}. Define \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. Then

\sqrt{n}(\bar{X}_n − µ) \xrightarrow{d} N(0_{[d×1]}, Σ_X).

Sidenote: Σ_X is not required to be invertible. The limiting normal random variable therefore may not have a density, because a multivariate normal distribution admits a density only if Σ_X is invertible.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 152 / 160
Central Limit Theorem
Example
Let (X_i)_{i≥1} be an i.i.d. sequence of random variables with P(X_i = 1) = p and P(X_i = 0) = 1 − p. Define S_n = \sum_{i=1}^{n} X_i. Then S_n ∼ Bin(n, p). We have E[X_i] = p and Var(X_i) = p(1 − p). By the central limit theorem

\frac{S_n − np}{\sqrt{np(1 − p)}} \xrightarrow{d} N(0, 1)

and

S_n \overset{a}{\sim} N(np, np(1 − p)).
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 153 / 160
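The quality of the normal approximation can be checked directly (a sketch; scipy assumed, with n = 1000 and p = 0.3 as illustrative choices):

```python
from scipy import stats

n, p = 1_000, 0.3
mean, sd = n * p, (n * p * (1 - p)) ** 0.5            # S_n approx N(np, np(1-p))
for k in (280, 300, 320):
    # exact Bin(n, p) CDF vs normal approximation (a continuity correction would do better)
    print(k, stats.binom.cdf(k, n, p), stats.norm.cdf((k - mean) / sd))
```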
Delta Method
Delta method
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 154 / 160
Delta Method
Theorem (Delta method)

Let (X_n)_{n≥1} be a sequence of R^d-valued random variables such that for some γ > 0,

n^γ(X_n − ψ) \xrightarrow{d} X.

Further, let f : R^d → R^r be a C^1 function with J(ψ) denoting the Jacobian matrix of f evaluated at ψ. Then

n^γ(f(X_n) − f(ψ)) \xrightarrow{d} J(ψ)X.

Corollary

Let (X_n)_{n≥1} be a sequence of R^d-valued random variables such that \sqrt{n}(X_n − ψ) \xrightarrow{d} N(0, Ω). Then

\sqrt{n}(f(X_n) − f(ψ)) \xrightarrow{d} N(0, J(ψ)ΩJ(ψ)′).
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 155 / 160
Delta Method
The Delta method can be proven starting either from the mean value theorem or from a Taylor series expansion.

Theorem (Mean Value Theorem)

Let h : R^d → R^k be a C^1 function. Then h(x) = h(x_0) + J(\bar{x})(x − x_0), where \bar{x} lies between x and x_0 (applied row by row for k > 1).

Proof.

By the MVT there exists Y_n ∈ R^d between X_n and ψ (elementwise) such that

f(X_n) = f(ψ) + J(Y_n)(X_n − ψ).

Rearranging and multiplying both sides by n^γ yields

n^γ(f(X_n) − f(ψ)) = J(Y_n) n^γ(X_n − ψ).
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 156 / 160
Delta Method
Proof.

The fact that n^γ(X_n − ψ) \xrightarrow{d} X implies that X_n \xrightarrow{p} ψ, and since Y_n lies between X_n and ψ, i.e. ||Y_n − ψ|| ≤ ||X_n − ψ||, it follows that

Y_n \xrightarrow{p} ψ.

f has continuous first derivatives, so by the continuous mapping theorem

J(Y_n) \xrightarrow{p} J(ψ).

Finally, applying Slutzky's theorem yields the desired result:

n^γ(f(X_n) − f(ψ)) = J(Y_n) n^γ(X_n − ψ) \xrightarrow{d} J(ψ)X.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 157 / 160
Delta Method
Example (Asymptotic distribution of (\bar{X}_n)^2)

Let (X_i)_{i≥1} be a sequence of i.i.d. random variables with µ = E[X_1] ≠ 0 and σ^2 = Var(X_1) < ∞. The CLT states that

\sqrt{n}(\bar{X}_n − µ) \xrightarrow{d} N(0, σ^2).

The Delta method with f(x) = x^2, so that J(µ) = 2µ, yields

\sqrt{n}((\bar{X}_n)^2 − µ^2) \xrightarrow{d} J(µ)Z, Z ∼ N(0, σ^2).

Finally, the asymptotic distribution is given by

\sqrt{n}((\bar{X}_n)^2 − µ^2) \xrightarrow{d} N(0, 4µ^2σ^2).

Thus,

(\bar{X}_n)^2 \overset{a}{\sim} N(µ^2, 4µ^2σ^2/n).
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 158 / 160
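The example lends itself to a quick Monte Carlo check (a sketch; numpy assumed, with µ = 2, σ = 1 and n = 500 as illustrative choices): the standard deviation of \sqrt{n}((\bar{X}_n)^2 − µ^2) should be close to 2|µ|σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 1.0, 500, 20_000
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar**2 - mu**2)                 # delta-method quantity
print(z.std(), 2 * abs(mu) * sigma)                # both close to 2|mu|*sigma = 4
```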
Delta Method
Exercise
Assume X_i ∼ t_ν for all i ∈ {1, ..., n} and ν > 4. The MoM estimator of ν from an i.i.d. sample of size n is given by

\hat{ν}_{MoM} = \frac{2 \cdot \frac{1}{n}\sum_{i=1}^{n} X_i^2}{\frac{1}{n}\sum_{i=1}^{n} X_i^2 − 1}.

Find a standard error for \hat{ν}_{MoM}.
Exercise
Let (X_i)_{i≥1} be an i.i.d. sequence of real-valued random variables in L^2 with µ = E[X_1] and σ^2 = Var(X_1). Define s^2 = \frac{1}{n−1}\sum_{i=1}^{n}(X_i − \bar{X}_n)^2. Show that

\frac{\sqrt{n}(\bar{X}_n − µ)}{s} \xrightarrow{d} N(0, 1).
Exercise
Prove the corollary to the Delta method.
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 159 / 160
Delta Method
Thank you for your attention!
Stefan Bruder (UZH) Basics of Probability Theory September 1, 2015 160 / 160