
DS-GA 1002 Lecture notes 3 September 28, 2015

Probability (continued)

1 Expectation

As we have seen in the previous section, we can characterize the behavior of a random variable completely through its cdf, its pmf or its pdf (depending on whether the variable is discrete or continuous). However, it is often useful to work with quantities that describe the behavior of a random variable more parsimoniously. This is the purpose of the expected-value operator, which maps functions of a random variable to a single number.

1.1 Definition

The expected value or expectation of a function of a random variable is an average of its values weighted by their probabilities. The expected value of a function of several random variables is defined analogously.

Definition 1.1 (Expectation for discrete random variables). Let X be a discrete random variable with range R. The expected value of any real-valued function g : R → R is

E(g(X)) = ∑_{x∈R} g(x) p_X(x). (1)

If X, Y are discrete random variables defined on the same probability space with range R′, the expected value of a function h : R² → R is

E(h(X, Y)) = ∑_{(x,y)∈R′} h(x, y) p_{X,Y}(x, y). (2)

Definition 1.2 (Expectation for continuous random variables). Let X be a continuous random variable. The expected value of any real-valued function g : R → R is

E(g(X)) = ∫_{x=−∞}^{∞} g(x) f_X(x) dx. (3)

If X, Y are continuous random variables defined on the same probability space, the expected value of a function h : R² → R is

E(h(X, Y)) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} h(x, y) f_{X,Y}(x, y) dx dy. (4)
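As a quick numerical illustration (not part of the original derivations), the weighted sum in (1) can be evaluated directly. The following Python sketch assumes numpy is available and uses a made-up pmf and function g purely for illustration.

    import numpy as np

    # Hypothetical discrete random variable: its range and pmf (must sum to 1).
    x = np.array([0, 1, 2, 3])
    p = np.array([0.1, 0.2, 0.3, 0.4])

    g = lambda t: t ** 2               # any real-valued function g
    expectation = np.sum(g(x) * p)     # E(g(X)) = sum over x of g(x) p_X(x)
    print(expectation)                 # 0.2 + 1.2 + 3.6 = 5.0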


In the case of quantities that depend on both continuous and discrete random variables, the product of the marginal and conditional distributions (see (78) in Lecture Notes 2) plays the role of the joint pdf or pmf.

Definition 1.3 (Expectation for quantities that depend on continuous and discrete random variables). If C is a continuous random variable and D a discrete random variable with range R_D defined on the same probability space, the expected value of a function h : R² → R is

E(h(C, D)) = ∫_{c=−∞}^{∞} ∑_{d∈R_D} h(c, d) f_C(c) p_{D|C}(d|c) dc (5)
           = ∑_{d∈R_D} ∫_{c=−∞}^{∞} h(c, d) p_D(d) f_{C|D}(c|d) dc. (6)

Note that the expected value of a certain quantity may be infinite or not even exist if the corresponding sum or integral tends towards infinity or has an undefined value. This is illustrated by Examples 1.4 and 1.8 below.

Example 1.4 (St Petersburg paradox). A casino offers you the following game. You will flip an unbiased coin until it lands on heads and the casino will pay you 2^k dollars, where k is the number of flips. How much are you willing to pay in order to play?

You compute the expected gain. You model the number of flips X as a geometric random variable such that p_X(k) = 1/2^k. The gain is 2^X, so

E(Gain) = ∑_{k=1}^{∞} 2^k · (1/2^k) = ∞. (7)

The expected gain is ∞, but since you only get to play once, perhaps you should not pay an arbitrarily large amount of money. This is known as the St Petersburg paradox.
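The divergence in (7) can also be seen by simulation: the average payoff over many plays does not stabilize but keeps growing (roughly like log2 of the number of plays). A minimal sketch, assuming numpy is available; the sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    # The number of flips is geometric with p = 1/2, so P(X = k) = 1/2**k.
    for n in [10**3, 10**5, 10**7]:
        flips = rng.geometric(0.5, size=n)
        gains = 2.0 ** flips
        print(n, gains.mean())   # the sample mean keeps growing with n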

A fundamental property of the expectation operator is that it is linear.

Theorem 1.5 (Linearity of expectation). For any constants a and b and any functions g_1, g_2 : R² → R,

E(a g_1(X, Y) + b g_2(X, Y)) = a E(g_1(X, Y)) + b E(g_2(X, Y)). (8)

Proof. The theorem follows immediately from the linearity of sums and integrals.


If two random variables are independent, then the expectation of the product factors into a product of expectations.

Theorem 1.6 (Expectation of functions of independent random variables). If X, Y are independent random variables defined on the same probability space, and g, h : R → R are univariate real-valued functions, then

E(g(X) h(Y)) = E(g(X)) E(h(Y)). (9)

Proof. The result follows directly from Definitions 2.20 and 2.21 in Lecture Notes 2. We will prove the result for continuous random variables, but the proof for discrete random variables is almost the same.

E(g(X) h(Y)) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x) h(y) f_{X,Y}(x, y) dx dy (10)
             = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x) h(y) f_X(x) f_Y(y) dx dy  (by independence) (11)
             = E(g(X)) E(h(Y)). (12)

1.2 Mean and variance

The expected value or mean of a random variable has special significance. It is the center of mass of the probability measure of the random variable.

Definition 1.7 (Mean). The mean or first moment of X is the expected value of X: E (X).

If the distribution of a random variable is very heavy tailed, which means that the probability of the random variable taking large values decays slowly, the mean of the random variable may be infinite. This was the case for the random variable representing the gain in Example 1.4. The following example shows that the mean may not exist if the value of the corresponding sum or integral is not defined.

Example 1.8 (Cauchy random variable). The pdf of the Cauchy random variable is given by

f_X(x) = 1 / (π(1 + x²)).


By the definition of expected value,

E(X) = ∫_{−∞}^{∞} x / (π(1 + x²)) dx = ∫_{0}^{∞} x / (π(1 + x²)) dx − ∫_{0}^{∞} x / (π(1 + x²)) dx.

Now, by the change of variables t = x²,

∫_{0}^{∞} x / (π(1 + x²)) dx = ∫_{0}^{∞} 1 / (2π(1 + t)) dt = lim_{t→∞} log(1 + t) / (2π) = ∞,

so E(X) does not exist, as it is the difference of two limits that tend to infinity.
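The nonexistence of the mean shows up empirically: the running average of Cauchy samples never settles down, in contrast with distributions that do have a finite mean. A minimal sketch, assuming numpy is available.

    import numpy as np

    rng = np.random.default_rng(1)
    samples = rng.standard_cauchy(10**6)
    running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)
    # The running mean keeps jumping around instead of converging.
    print(running_mean[[10**3 - 1, 10**4 - 1, 10**5 - 1, 10**6 - 1]])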

The expected value of the square of a random variable quantifies its expected energy.

Definition 1.9 (Second moment). The mean square value or second moment of X is the expected value of X²: E(X²).

The variance of a random variable quantifies its deviation from the mean. Figure 1 shows the pdfs of Gaussian random variables with different variances. Intuitively, a larger variance implies a larger spread of the values of the random variable around the mean.

Definition 1.10 (Variance and standard deviation). The variance of X is the mean square deviation from the mean

Var(X) := E((X − E(X))²) (13)
        = E(X²) − E²(X). (14)

The standard deviation σ_X of X is the square root of the variance

σ_X := √Var(X). (15)

The variance operator is not linear, but it is straightforward to determine the variance of a linear function of a random variable.

Lemma 1.11 (Variance of linear functions). For any constants a and b,

Var(aX + b) = a² Var(X). (16)

Proof.

Var(aX + b) = E((aX + b − E(aX + b))²) (17)
            = E((aX + b − a E(X) − b)²) (18)
            = a² E((X − E(X))²) (19)
            = a² Var(X). (20)


[Figure: pdfs f_X(x) of Gaussian random variables with standard deviations σ_X = 1, σ_X = 2 and σ_X = 3, plotted for x between −6 and 6.]

Figure 1: Gaussian random variables with different standard deviations.

We have compiled the means and variances of some important random variables in Table 1. The derivations can be found in Section A of the appendix.

1.3 Bounding probabilities using expectations

It is often much more tractable to estimate the mean and the variance of a random variable than its whole distribution. The following inequalities allow us to characterize the behavior of a random variable to some extent just from its mean and variance.

Theorem 1.12 (Markov's inequality). Let X be a nonnegative random variable. For any positive constant a > 0,

P(X > a) ≤ E(X) / a. (21)

Proof. Consider the indicator variable 1_{X>a}. Clearly the random variable

X − a 1_{X>a} ≥ 0. (22)

So in particular its expectation is nonnegative (as it is the sum or integral of a nonnegative quantity over the positive real line). By linearity of expectation and the fact that 1_{X>a} is a Bernoulli random variable with expectation P(X > a), we have

E(X) ≥ a E(1_{X>a}) = a P(X > a). (23)


Random variable    Parameters    Mean          Variance

Bernoulli          p             p             p(1 − p)
Geometric          p             1/p           (1 − p)/p²
Binomial           n, p          np            np(1 − p)
Poisson            λ             λ             λ
Uniform            a, b          (a + b)/2     (b − a)²/12
Exponential        λ             1/λ           1/λ²
Gaussian           µ, σ          µ             σ²

Table 1: Means and variances of common random variables, derived in Section A of the appendix.
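The entries of Table 1 can be checked empirically by comparing sample means and variances of simulated data with the formulas. A sketch assuming numpy is available; the parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10**6
    # (samples, theoretical mean, theoretical variance)
    checks = {
        "Bernoulli(0.3)":  (rng.binomial(1, 0.3, n),   0.3,     0.3 * 0.7),
        "Geometric(0.2)":  (rng.geometric(0.2, n),     1 / 0.2, 0.8 / 0.2**2),
        "Poisson(4)":      (rng.poisson(4, n),         4,       4),
        "Uniform(2, 5)":   (rng.uniform(2, 5, n),      3.5,     (5 - 2)**2 / 12),
        "Exponential(2)":  (rng.exponential(1 / 2, n), 1 / 2,   1 / 2**2),
        "Gaussian(1, 3)":  (rng.normal(1, 3, n),       1,       3**2),
    }
    for name, (samples, mean, var) in checks.items():
        print(name, samples.mean(), mean, samples.var(), var)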

Example 1.13 (Age of students). You hear that the mean age of NYU students is 20 years, but you know quite a few students that are older than 30. You decide to apply Markov's inequality to bound the fraction of students above 30 by modeling age as a nonnegative random variable A:

P(A > 30) ≤ E(A) / 30 = 20 / 30 = 2/3,

so at most two thirds of the students are over 30.

As you can see from Example 1.16, Markov's inequality is rather loose. The reason is that it barely uses any information about the distribution of the random variable. Chebyshev's inequality takes into account the variance, allowing us to control the deviation of the random variable from its mean.

Theorem 1.14 (Chebyshev's inequality). For any positive constant a > 0 and any random variable X with bounded variance,

P(|X − E(X)| > a) ≤ Var(X) / a². (24)


Proof. Applying Markov's inequality to the random variable Y = (X − E(X))² yields the result.

An interesting corollary to Chebyshev's inequality shows that if the variance of a random variable is zero, then the random variable is a constant or, to be precise, the probability that it deviates from its mean is zero.

Corollary 1.15. If Var(X) = 0 then P(X ≠ E(X)) = 0.

Proof. Take any ε > 0. By Chebyshev's inequality,

P(|X − E(X)| > ε) ≤ Var(X) / ε² = 0. (25)

Since this holds for every ε > 0, the probability that X deviates from E(X) is zero.

Example 1.16 (Age of students (continued)). You are not very satisfied with your bound on the number of students above 30. You ask around and discover that the standard deviation of student age is actually just 3 years. You apply Chebyshev's inequality to this information and obtain

P(|A − E(A)| > 10) ≤ Var(A) / 100 = 9 / 100,

so actually at least 91% of the students are under 30 (and above 10).
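To see how much sharper Chebyshev's bound is, one can compare both bounds with the actual tail probability under a hypothetical age distribution consistent with the stated mean and standard deviation (the true distribution is of course unknown). The sketch below assumes numpy and a Gaussian age model with mean 20 and standard deviation 3; it is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    ages = rng.normal(20, 3, 10**6)      # hypothetical age model

    markov = 20 / 30                     # P(A > 30) <= E(A)/30
    chebyshev = 9 / 100                  # P(|A - 20| > 10) <= Var(A)/10**2
    empirical = np.mean(ages > 30)       # tail probability under this model
    print(markov, chebyshev, empirical)  # 0.667, 0.09, roughly 0.0004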

1.4 Correlation and covariance

The covariance of two random variables provides a first-order characterization of their joint behavior. It measures whether the two random variables deviate similarly from their respective means.

Definition 1.17 (Covariance). The covariance of X and Y is

Cov(X, Y) := E((X − E(X))(Y − E(Y))) (26)
           = E(XY) − E(X) E(Y). (27)

If Cov(X, Y) = 0, X and Y are uncorrelated.


The variance of the sum of two random variables can be expressed in terms of their individual variances and their covariance. As a result, their fluctuations reinforce each other if the covariance is positive and cancel each other if it is negative.

Theorem 1.18 (Variance of the sum of two random variables).

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). (28)

Proof.

Var(X + Y) = E((X + Y − E(X + Y))²) (29)
           = E((X − E(X))²) + E((Y − E(Y))²) + 2 E((X − E(X))(Y − E(Y))) (30)
           = Var(X) + Var(Y) + 2 Cov(X, Y). (31)

Corollary 1.19. If X and Y are uncorrelated, then

Var(X + Y) = Var(X) + Var(Y). (32)

The following lemma and example show that independence implies uncorrelation, but uncorrelation does not always imply independence.

Lemma 1.20 (Independence implies uncorrelation). If two random variables are independent, then they are uncorrelated.

Proof. By Theorem 1.6, if X and Y are independent,

Cov(X, Y) = E(XY) − E(X) E(Y) = E(X) E(Y) − E(X) E(Y) = 0. (33)

Example 1.21 (Uncorrelation does not imply independence). Let X and Y be two independent Bernoulli random variables with parameter 1/2. Consider the random variables

U = X + Y, (34)
V = X − Y. (35)


Note that

p_U(0) = P(X = 0, Y = 0) = 1/4, (36)
p_V(0) = P(X = 1, Y = 1) + P(X = 0, Y = 0) = 1/2, (37)
p_{U,V}(0, 0) = P(X = 0, Y = 0) = 1/4 ≠ p_U(0) p_V(0) = 1/8, (38)

so U and V are not independent. However, they are uncorrelated, as

Cov(U, V) = E(UV) − E(U) E(V) (39)
          = E((X + Y)(X − Y)) − E(X + Y) E(X − Y) (40)
          = E(X²) − E(Y²) − E²(X) + E²(Y) = 0. (41)

The final equality holds because X and Y have the same distribution.
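The computations in Example 1.21 can be reproduced by enumerating the four equally likely outcomes of (X, Y): the covariance of U and V is zero, yet their joint pmf does not factor. A short sketch assuming numpy is available.

    import itertools
    import numpy as np

    # The four outcomes (x, y) each have probability 1/4, so averaging over
    # this list is the same as taking expectations.
    outcomes = list(itertools.product([0, 1], repeat=2))
    U = np.array([x + y for x, y in outcomes])
    V = np.array([x - y for x, y in outcomes])

    cov = np.mean(U * V) - np.mean(U) * np.mean(V)
    print(cov)                                      # 0.0: uncorrelated

    p_uv_00 = np.mean((U == 0) & (V == 0))          # 1/4
    p_u0, p_v0 = np.mean(U == 0), np.mean(V == 0)   # 1/4 and 1/2
    print(p_uv_00, p_u0 * p_v0)                     # 0.25 vs 0.125: dependent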

The covariance does not take into account the magnitude of the variances of the random variables involved. The Pearson correlation coefficient is obtained by normalizing the covariance using the standard deviations of both variables.

Definition 1.22 (Pearson correlation coefficient). The Pearson correlation coefficient of two random variables X and Y is

ρ_{X,Y} := Cov(X, Y) / (σ_X σ_Y). (42)

Although it might not be immediately obvious, the magnitude of the correlation coefficient is bounded by one. A useful interpretation of this coefficient is that it quantifies to what extent X and Y are linearly related. In fact, if it is equal to 1 or −1 then one of the variables is a linear function of the other! This follows from the well-known Cauchy-Schwarz inequality.

Theorem 1.23 (Cauchy-Schwarz inequality). For any random variables X and Y defined on the same probability space,

|E(XY)| ≤ √(E(X²) E(Y²)). (43)

Assume E(X²) ≠ 0. Then

E(XY) = √(E(X²) E(Y²)) ⇐⇒ Y = √(E(Y²)/E(X²)) X, (44)

E(XY) = −√(E(X²) E(Y²)) ⇐⇒ Y = −√(E(Y²)/E(X²)) X. (45)


Proof. If E(X²) = 0, then X = 0 with probability one by Corollary 1.15, which implies E(XY) = 0 and consequently that equality holds in (43). The same is true if E(Y²) = 0.

Now assume that E(X²) ≠ 0 and E(Y²) ≠ 0. Let us define the constants a = √(E(Y²)) and b = √(E(X²)). By linearity of expectation,

E((aX + bY)²) = a² E(X²) + b² E(Y²) + 2ab E(XY) (46)
              = 2 (E(X²) E(Y²) + √(E(X²) E(Y²)) E(XY)), (47)

E((aX − bY)²) = a² E(X²) + b² E(Y²) − 2ab E(XY) (48)
              = 2 (E(X²) E(Y²) − √(E(X²) E(Y²)) E(XY)). (49)

The expectation of a nonnegative quantity is nonnegative because the integral or sum of a nonnegative quantity is nonnegative. Consequently, the left-hand sides of (46) and (48) are nonnegative, so (47) and (49) are both nonnegative, which implies (43).

Let us prove (45) by proving both implications.

(⇒). Assume E(XY) = −√(E(X²) E(Y²)). Then (47) equals zero, so

E((√(E(Y²)) X + √(E(X²)) Y)²) = 0, (50)

which by Corollary 1.15 means that √(E(Y²)) X = −√(E(X²)) Y with probability one.

(⇐). Assume Y = −√(E(Y²)/E(X²)) X. Then one can easily check that (47) equals zero, which implies E(XY) = −√(E(X²) E(Y²)).

The proof of (44) is almost identical (using (49) instead of (47)).

Corollary 1.24. The Pearson correlation coefficient of two random variables X and Y satisfies

|ρ_{X,Y}| ≤ 1, (51)

with equality if and only if there is a linear relationship between X and Y,

|ρ_{X,Y}| = 1 ⇐⇒ Y = cX + d, (52)

where

c := σ_Y/σ_X if ρ_{X,Y} = 1,   c := −σ_Y/σ_X if ρ_{X,Y} = −1,   d := E(Y) − c E(X). (53)


Proof. The result follows from defining

U := X − E(X), (54)
V := Y − E(Y). (55)

From the definition of the variance and the correlation coefficient,

E(U²) = Var(X), (56)
E(V²) = Var(Y), (57)
ρ_{X,Y} = E(UV) / √(E(U²) E(V²)). (58)

The result now follows from applying Theorem 1.23 to U and V.

Figure 2 shows the joint pdf of two Gaussian random variables for different values of the correlation coefficient. If the coefficient is zero, then the joint pdf has a spherically symmetric form. If the coefficient approaches 1, then the joint pdf becomes skewed so that the two variables tend to have similar values. If the coefficient approaches −1, then the same happens, only now the random variables will tend to have similar values with opposite sign.
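To connect the correlation coefficient with Figure 2, one can draw samples from a bivariate Gaussian with a prescribed ρ and check that the sample correlation coefficient is close to it. A sketch assuming numpy is available; ρ = 0.8 matches the center panel of the figure.

    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8
    cov = np.array([[1.0, rho],
                    [rho, 1.0]])          # unit variances, correlation rho
    xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=10**5)

    sample_rho = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
    print(sample_rho)                      # close to 0.8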

1.5 Conditional expectation

The expectation of a function of two random variables conditioned on one of them taking a certain value can be computed using the conditional pmf or pdf:

E(g(X, Y)|X = x) = ∑_{y∈R} g(x, y) p_{Y|X}(y|x), (59)

if Y is discrete and has range R, whereas

E(g(X, Y)|X = x) = ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy, (60)

if Y is continuous.

Note that E(g(X, Y)|X = x) can actually be interpreted as a function of x, since it maps every value of x to a real number. We can then define the conditional expectation of g(X, Y) given X as follows.

Definition 1.25 (Conditional expectation). The conditional expectation of g(X, Y) given X is

E(g(X, Y)|X) := h(X), (61)

where

h(x) := E(g(X, Y)|X = x). (62)


[Figure: surface plots of the joint pdf f_{X,Y}(x, y) for the three values of the correlation coefficient described in the caption.]

Figure 2: Joint pdf of a bivariate Gaussian random variable X, Y such that X and Y have zero mean and unit variance, for ρ = 0 (top), ρ = 0.8 (center) and ρ = −0.8 (bottom).


Beware the potentially confusing notation: the conditional expectation E(g(X, Y)|X) is itself a random variable.

Iterated expectation is a useful tool to compute the expected value of a quantity that depends on several random variables. The idea is that the expected value can be obtained as the expectation of the conditional expectation.

Theorem 1.26 (Iterated expectation). For any random variables X and Y and any function g : R² → R,

E(g(X, Y)) = E(E(g(X, Y)|X)). (63)

Proof. We prove the result for continuous random variables; the proof for discrete random variables, and for quantities that depend on both continuous and discrete random variables, is almost identical. To make the explanation clearer, we define

h(x) := E(g(X, Y)|X = x) (64)
      = ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy. (65)

Now,

E(E(g(X, Y)|X)) = E(h(X)) (66)
                = ∫_{x=−∞}^{∞} h(x) f_X(x) dx (67)
                = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} f_X(x) f_{Y|X}(y|x) g(x, y) dy dx (68)
                = E(g(X, Y)). (69)

Example 1.27 (Lost cellphones). Chris loses his cellphone quite often when he is traveling. He calculates that the probability that he loses it on any given trip is between 0.1 and 0.2. He decides to model his uncertainty over that probability as a uniform random variable in [0.1, 0.2]. If he takes n trips next year, what is the expected number of times he will lose his cellphone? Assume that he never learns to be careful (i.e. the event that he loses his cellphone on a trip is independent of whether he loses it on other trips).

Let L be the probability that Chris loses his cellphone on a given trip and C the number of lost cellphones. Under the independence assumption, the number of lost cellphones is distributed as a binomial random variable with parameters n and L. This implies that

E(C|L = l) = nl, (70)


so the conditional expectation of C given L is the random variable nL. By iterated expectation,

E(C) = E(E(C|L)) = ∫_{0.1}^{0.2} (nl / 0.1) dl = 0.15 n. (71)

According to the probabilistic model, on average he will lose 0.15 n phones.
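The value 0.15 n obtained in (71) can be checked by simulating the two-stage model directly: draw L uniformly on [0.1, 0.2] and then draw C from a binomial with parameters n and L. A sketch assuming numpy is available, with n = 20 trips chosen arbitrarily.

    import numpy as np

    rng = np.random.default_rng(0)
    n_trips, n_sims = 20, 10**6

    L = rng.uniform(0.1, 0.2, n_sims)    # probability of losing the phone
    C = rng.binomial(n_trips, L)         # lost phones over n_trips trips
    print(C.mean(), 0.15 * n_trips)      # both approximately 3.0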

2 Random vectors

In this section, we extend the definitions in Section 2 of Lecture Notes 2 concerning bivariate distributions to multiple random variables that are jointly distributed. When several random variables are defined on the same probability space, we group them into a vector.

Definition 2.1 (Random vector). Let X1, X2, . . . , Xn be n random variables defined on the same probability space. We define a random vector as

X := [X1 X2 · · · Xn]^T. (72)

2.1 Joint distribution of a random vector

The distribution of a discrete random vector is completely determined by its joint probability mass function.

Definition 2.2 (Joint probability mass function). Let X : Ω → R^n be a random vector of dimension n on a probability space (Ω, F, P). The joint pmf of X is defined as

p_X(x) := P(X1 = x1, X2 = x2, . . . , Xn = xn). (73)

In words, p_X(x) is the probability of X being equal to x.

As in the case of two-dimensional random variables, we characterize the distribution of random vectors defined on continuous spaces by restricting them to belong to multidimensional Borel sets, which can be thought of as unions, intersections, etc. of Cartesian products of intervals. As a result, the distribution of a continuous random vector is completely determined by its joint cumulative distribution function and, if it exists, its joint probability density function.


Definition 2.3 (Joint cumulative distribution function). Let X : Ω → R^n be a random vector of dimension n on a probability space (Ω, F, P). The joint cdf of X is defined as

F_X(x) := P(X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn). (74)

In words, F_X(x) is the probability that Xi ≤ xi for all i = 1, 2, . . . , n.

Definition 2.4 (Joint probability density function). If the joint cdf of a random vector X is differentiable, then its joint pdf is defined as

f_X(x) := ∂^n F_X(x) / (∂x1 ∂x2 · · · ∂xn). (75)

In order to ease the exposition, we introduce some notation.

Definition 2.5 (Subvector). A subvector of a vector x is the vector formed by a subset of its entries, indexed by a set I = {i1, i2, . . . , im} ⊆ {1, 2, . . . , n},

x_I := [x_{i1} x_{i2} · · · x_{im}]^T. (76)

Definition 2.6 (Summing and integrating over a vector). Given a subset of the entries of a vector, I = {i1, i2, . . . , im} ⊆ {1, 2, . . . , n}, we denote the operation of summing or integrating over those entries by

∑_{I} g(x) := ∑_{x_{i1}∈R_{i1}} ∑_{x_{i2}∈R_{i2}} · · · ∑_{x_{im}∈R_{im}} g(x1, x2, . . . , xn), (77)

∫_{I} g(x) dx_I := ∫_{x_{i1}=−∞}^{∞} ∫_{x_{i2}=−∞}^{∞} · · · ∫_{x_{im}=−∞}^{∞} g(x1, x2, . . . , xn) dx_{i1} dx_{i2} · · · dx_{im}, (78)

where R_i represents the range of X_i for 1 ≤ i ≤ n.

As in the case of bivariate distributions, it follows from the axioms of probability measures that

p_X(x) ≥ 0, (79)
f_X(x) ≥ 0, (80)
∑_{{1,2,...,n}} p_X(x) = 1, (81)
∫_{{1,2,...,n}} f_X(x) dx = 1, (82)


and for any set S ⊆ R^n,

P(X ∈ S) = ∑_{x∈S} p_X(x) (83)

in the case of discrete random vectors and

P(X ∈ S) = ∫_{x∈S} f_X(x) dx (84)

in the case of continuous random vectors.

Generalizing the concept of marginalization, to obtain the joint distribution of a subset of the variables in the vector we either sum or integrate over the rest of the variables.

Theorem 2.7 (Marginalization). The marginal pmf of a discrete random subvector X_I, I ⊆ {1, 2, . . . , n}, is obtained by summing over the components not included in the subvector,

p_{X_I}(x_I) = ∑_{{1,2,...,n}/I} p_X(x). (85)

Similarly, the marginal pdf is obtained by integrating over the rest of the components,

f_{X_I}(x_I) = ∫_{{1,2,...,n}/I} f_X(x) dx_{{1,2,...,n}/I}. (86)

Proof. Let {1, 2, . . . , n}/I = {j1, j2, . . . , j_{n−m}}. For discrete random vectors, (85) follows from applying (83) to the event

{X_{i1} = x_{i1}, . . . , X_{im} = x_{im}} (87)
= ⋃_{x_{j1}∈R_{j1}} · · · ⋃_{x_{j_{n−m}}∈R_{j_{n−m}}} {X_{i1} = x_{i1}, . . . , X_{im} = x_{im}, X_{j1} = x_{j1}, . . . , X_{j_{n−m}} = x_{j_{n−m}}}, (88)

where R_i represents the range of X_i for 1 ≤ i ≤ n. For continuous random vectors, (86) follows from applying (84) to the event

{X_{i1} ≤ x_{i1}, . . . , X_{im} ≤ x_{im}} (89)
= ⋃_{x_{j1}=−∞}^{∞} · · · ⋃_{x_{j_{n−m}}=−∞}^{∞} {X_{i1} ≤ x_{i1}, . . . , X_{im} ≤ x_{im}, X_{j1} = x_{j1}, . . . , X_{j_{n−m}} = x_{j_{n−m}}} (90)

in order to obtain the marginal cdf and then differentiating.

Applying the same ideas as in the bivariate case, we define the conditional distribution of a subvector given the rest of the random vector.


Definition 2.8 (Conditional pmf and pdf). The conditional pmf of a discrete random subvector X_I, I ⊆ {1, 2, . . . , n}, given the subvector X_{{1,...,n}/I} is

p_{X_I | X_{{1,...,n}/I}}(x_I | x_{{1,...,n}/I}) := p_X(x) / p_{X_{{1,...,n}/I}}(x_{{1,...,n}/I}). (91)

The conditional pdf of a continuous random subvector X_I, I ⊆ {1, 2, . . . , n}, given the subvector X_{{1,...,n}/I} is

f_{X_I | X_{{1,...,n}/I}}(x_I | x_{{1,...,n}/I}) := f_X(x) / f_{X_{{1,...,n}/I}}(x_{{1,...,n}/I}). (92)

It is often useful to represent the joint pmf or pdf of a random vector by factoring it into conditional pmfs or pdfs using the chain rule.

Lemma 2.9 (Chain rule for random vectors). The joint pmf of a random vector X can be decomposed into

p_X(x) = p_{X1}(x1) p_{X2|X1}(x2|x1) · · · p_{Xn|X1,...,Xn−1}(xn|x1, . . . , xn−1) (93)
       = ∏_{i=1}^{n} p_{Xi | X_{{1,...,i−1}}}(xi | x_{{1,...,i−1}}). (94)

The joint pdf of a random vector X can be decomposed into

f_X(x) = f_{X1}(x1) f_{X2|X1}(x2|x1) · · · f_{Xn|X1,...,Xn−1}(xn|x1, . . . , xn−1) (95)
       = ∏_{i=1}^{n} f_{Xi | X_{{1,...,i−1}}}(xi | x_{{1,...,i−1}}). (96)

The order is arbitrary: the components of the vector can be reordered in any way.

Proof. The result follows from applying the definitions of conditional pmf and pdf recursively.

Example 2.10 (Desert). Dani and Felix are traveling through the desert in Arizona. They become concerned that their car might break down and decide to build a probabilistic model to evaluate the risk. They model the time that will pass until the car breaks down as an exponential random variable with a parameter that depends on the state of the motor and the state of the road. These three quantities form a random vector

X = [X1 X2 X3]^T, (97)


where X1 is the time until the car breaks down, X2 is the state of the motor and X3 is the state of the road. Unfortunately, they have no idea what the state of the motor is, so they assume that it is uniform between 0 (no problem with the motor) and 1 (the motor is almost dead). Similarly, they have no information about the road, so they also assume that its state is a uniform random variable between 0 (no problem with the road) and 1 (the road is terrible). In addition, they assume that the states of the road and the motor are independent and that the parameter of the exponential random variable that represents the time in hours until there is a breakdown is equal to X2 + X3.

To find the joint distribution of the random vector, we apply the chain rule to obtain

f_X(x) = f_{X2}(x2) f_{X3|X2}(x3|x2) f_{X1|X2,X3}(x1|x2, x3) (98)
       = f_{X2}(x2) f_{X3}(x3) f_{X1|X2,X3}(x1|x2, x3)  (by independence of X2 and X3) (99)
       = (x2 + x3) e^{−(x2+x3) x1}  for x1 ≥ 0, 0 ≤ x2 ≤ 1, 0 ≤ x3 ≤ 1, and 0 otherwise. (100)

Note that we start with X2 and X3 because we know their marginal distributions, whereas we only know the conditional distribution of X1 given X2 and X3.

After 15 minutes, the car breaks down. The road seems OK, about a 0.2 on the scale they defined for the value of X3, so they naturally wonder about the state the motor was in. This is given by the conditional distribution of X2 given X1 and X3.

To compute it, we first need to compute the joint distribution of X1 and X3 by marginalizing over X2. In order to simplify the computations, we use the following simple lemma.

Lemma 2.11. For any constant c > 0,

∫_{0}^{1} e^{−cx} dx = (1 − e^{−c}) / c, (101)

∫_{0}^{1} x e^{−cx} dx = (1 − (1 + c) e^{−c}) / c². (102)

Proof. Equation (101) is obtained using the antiderivative of the exponential function (itself), whereas integrating by parts yields (102).

We have

f_{X_{1,3}}(x_{1,3}) = e^{−x1 x3} (∫_{x2=0}^{1} x2 e^{−x1 x2} dx2 + x3 ∫_{x2=0}^{1} e^{−x1 x2} dx2) (103)
= e^{−x1 x3} ((1 − (1 + x1) e^{−x1}) / x1² + x3 (1 − e^{−x1}) / x1)  (by (101) and (102)) (104)
= (e^{−x1 x3} / x1²) (1 + x1 x3 − e^{−x1} (1 + x1 + x1 x3)), (105)


for x1 ≥ 0, 0 ≤ x3 ≤ 1.

The conditional pdf of X2 given X1 and X3 is

f_{X2|X_{1,3}}(x2 | x_{1,3}) = f_X(x) / f_{X_{1,3}}(x_{1,3}) (106)
= (x2 + x3) e^{−(x2+x3) x1} / ((e^{−x1 x3} / x1²) (1 + x1 x3 − e^{−x1} (1 + x1 + x1 x3))) (107)
= (x2 + x3) x1² e^{−x1 x2} / (1 + x1 x3 − e^{−x1} (1 + x1 + x1 x3)), (108)

for x1 ≥ 0, 0 ≤ x2 ≤ 1, 0 ≤ x3 ≤ 1. Plugging in the observed values x1 = 0.25 and x3 = 0.2, the conditional pdf is equal to

f_{X2|X_{1,3}}(x2 | [0.25, 0.2]^T) = (x2 + 0.2) 0.25² e^{−0.25 x2} / (1 + 0.25 · 0.2 − e^{−0.25} (1 + 0.25 + 0.25 · 0.2)) (109)
≈ 1.66 (x2 + 0.2) e^{−0.25 x2} (110)

for 0 ≤ x2 ≤ 1, and to zero otherwise. The pdf is plotted in Figure 3. According to the model, it seems quite likely that the state of the motor was not good.
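The numerical constant in (110) can be recovered directly from (109); a quick check in Python (assuming numpy for the exponential):

    import numpy as np

    x1, x3 = 0.25, 0.2
    constant = x1**2 / (1 + x1 * x3 - np.exp(-x1) * (1 + x1 + x1 * x3))
    print(constant)   # approximately 1.66, the factor appearing in (110)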

[Figure: plot of the conditional pdf f_{X2|X_{1,3}}(x2 | [0.25, 0.2]^T) for x2 between 0 and 1.]

Figure 3: Conditional pdf of X2 given X1 = 0.25 and X3 = 0.2 in Example 2.10.


2.2 Independence of random vectors

The components of a random vector are mutually independent if their joint pmf or pdf factors into the product of the individual marginal pmfs or pdfs.

Definition 2.12 (Jointly independent random variables). The components of a random vector X are mutually independent if

p_X(x) = ∏_{i=1}^{n} p_{Xi}(xi) (111)

in the case of discrete random vectors and if

f_X(x) = ∏_{i=1}^{n} f_{Xi}(xi) (112)

in the case of continuous random vectors.

The definition of conditional independence is very similar.

Definition 2.13 (Mutually conditionally independent random variables). The components of a subvector X_I, I ⊆ {1, 2, . . . , n}, are mutually conditionally independent given another subvector X_J, J ⊆ {1, 2, . . . , n} with I ∩ J = ∅, if

p_{X_I|X_J}(x_I|x_J) = ∏_{i∈I} p_{Xi|X_J}(xi|x_J) (113)

in the case of discrete random vectors and

f_{X_I|X_J}(x_I|x_J) = ∏_{i∈I} f_{Xi|X_J}(xi|x_J) (114)

in the case of continuous random vectors.

As we saw in Examples 4.3 and 4.4 of Lecture Notes 1, independence does not imply conditional independence or vice versa. The following example shows that pairwise independence does not imply mutual independence.

Example 2.14 (Pairwise independence does not imply joint independence). Let X1 and X2 be the outcomes of independent unbiased coin flips. Let X3 be the indicator of the event {X1 and X2 have the same outcome},

X3 = 1 if X1 = X2,   X3 = 0 if X1 ≠ X2. (115)


The pmf of X3 is

p_{X3}(1) = P({X1 = 1, X2 = 1} ∪ {X1 = 0, X2 = 0}) (116)
          = P(X1 = 1, X2 = 1) + P(X1 = 0, X2 = 0) = 1/2, (117)
p_{X3}(0) = P({X1 = 0, X2 = 1} ∪ {X1 = 1, X2 = 0}) (118)
          = P(X1 = 0, X2 = 1) + P(X1 = 1, X2 = 0) = 1/2. (119)

X1 and X2 are independent by assumption. X1 and X3 are independent because

p_{X1,X3}(0, 0) = P(X1 = 0, X2 = 1) = 1/4 = p_{X1}(0) p_{X3}(0), (120)
p_{X1,X3}(1, 0) = P(X1 = 1, X2 = 0) = 1/4 = p_{X1}(1) p_{X3}(0), (121)
p_{X1,X3}(0, 1) = P(X1 = 0, X2 = 0) = 1/4 = p_{X1}(0) p_{X3}(1), (122)
p_{X1,X3}(1, 1) = P(X1 = 1, X2 = 1) = 1/4 = p_{X1}(1) p_{X3}(1). (123)

X2 and X3 are independent too (the reasoning is the same).

However, are X1, X2 and X3 mutually independent?

p_{X1,X2,X3}(1, 1, 1) = P(X1 = 1, X2 = 1) = 1/4 ≠ p_{X1}(1) p_{X2}(1) p_{X3}(1) = 1/8. (124)

They are not.

2.3 Mean and covariance of a random vector

The expected value of a function g : R^n → R that depends on the value of a random vector is defined in the same way as in the case of bivariate distributions.

Definition 2.15 (Expected value of a function of a random vector). The expected value of a function g : R^n → R applied to a random vector X is defined as

E(g(X)) = ∑_{{1,...,n}} g(x) p_X(x) (125)

if X is discrete and

E(g(X)) = ∫_{{1,...,n}} g(x) f_X(x) dx (126)

if X is continuous.


The mean of a random vector is the vector of the means of its components.

Definition 2.16 (Mean of a random vector). The mean of a random vector X is defined as

E(X) := [E(X1) E(X2) · · · E(Xn)]^T. (127)

Similarly, if the entries of a matrix A are random variables, we define the mean of the matrix as the matrix of the means of the entries,

(E(A))_{ij} := E(A_{ij}) for 1 ≤ i, j ≤ n. (128)

It follows immediately from the linearity of the expectation operator in one dimension that the mean operator is linear.

Theorem 2.17 (Mean of a linear transformation of a random vector). For any random vector X of dimension n, any matrix A ∈ R^{m×n} and any b ∈ R^m,

E(AX + b) = A E(X) + b. (129)

Proof.

E(AX + b) = [E(∑_{i=1}^{n} A_{1i} Xi + b1), E(∑_{i=1}^{n} A_{2i} Xi + b2), . . . , E(∑_{i=1}^{n} A_{mi} Xi + bm)]^T (130)
          = [∑_{i=1}^{n} A_{1i} E(Xi) + b1, ∑_{i=1}^{n} A_{2i} E(Xi) + b2, . . . , ∑_{i=1}^{n} A_{mi} E(Xi) + bm]^T  (by linearity of expectation) (131)
          = A E(X) + b. (132)

The covariance matrix of a random vector captures the interaction between the components of the vector. It contains the variance of each component on the diagonal and the covariances between the different components off the diagonal.


Definition 2.18. The covariance matrix of a random vector X is defined as

Σ_X := [ Var(X1)      Cov(X1, X2)  · · ·  Cov(X1, Xn)
         Cov(X2, X1)  Var(X2)      · · ·  Cov(X2, Xn)
         ···          ···          · · ·  ···
         Cov(Xn, X1)  Cov(Xn, X2)  · · ·  Var(Xn) ] (133)

     = E(XX^T) − E(X) E(X)^T. (134)

From Theorem 2.17 we obtain a simple expression for the covariance matrix of a linear transformation of a random vector.

Theorem 2.19 (Covariance matrix after a linear transformation). Let X be a random vector of dimension n with covariance matrix Σ_X. For any matrix A ∈ R^{m×n} and b ∈ R^m,

Σ_{AX+b} = A Σ_X A^T. (135)

Proof.

Σ_{AX+b} = E((AX + b)(AX + b)^T) − E(AX + b) E(AX + b)^T (136)
         = E(AXX^T A^T + bX^T A^T + AXb^T + bb^T) − (A E(X) + b)(A E(X) + b)^T (137)
         = A E(XX^T) A^T + b E(X)^T A^T + A E(X) b^T + bb^T (138)
           − A E(X) E(X)^T A^T − A E(X) b^T − b E(X)^T A^T − bb^T (139)
         = A (E(XX^T) − E(X) E(X)^T) A^T (140)
         = A Σ_X A^T. (141)
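Theorem 2.19 is easy to check numerically: draw samples of a random vector, apply an arbitrary A and b, and compare the empirical covariance matrix of AX + b with A Σ_X A^T. A sketch assuming numpy; the covariance matrix and the transformation below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    Sigma_X = np.array([[2.0, 0.5, 0.0],
                        [0.5, 1.0, 0.3],
                        [0.0, 0.3, 1.5]])
    X = rng.multivariate_normal(np.zeros(3), Sigma_X, size=10**6)

    A = np.array([[1.0, -1.0, 0.0],
                  [0.5,  2.0, 1.0]])
    b = np.array([3.0, -2.0])
    Y = X @ A.T + b                      # each row is A x + b

    print(np.cov(Y, rowvar=False))       # empirical covariance of AX + b
    print(A @ Sigma_X @ A.T)             # theoretical A Sigma_X A^T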

2.4 Gaussian random vectors

Gaussian random vectors are a multidimensional generalization of Gaussian random variables. They are parametrized by their mean and covariance matrix. Figure 2 shows examples of the joint pdf of a two-dimensional Gaussian random vector.

Definition 2.20 (Gaussian random vector). A Gaussian random vector X is a random vector with joint pdf

f_X(x) = (1 / √((2π)^n |Σ|)) exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ)), (142)

where the mean vector µ ∈ R^n and the covariance matrix Σ, which is symmetric and positive definite, parametrize the distribution.


A fundamental property of Gaussian random vectors is that performing linear transformations on them always yields vectors with joint distributions that are also Gaussian.

Theorem 2.21 (Linear transformations of Gaussian random vectors are Gaussian). Let X be a Gaussian random vector of dimension n with mean µ and covariance matrix Σ. For any matrix A ∈ R^{m×n} and b ∈ R^m, Y = AX + b is a Gaussian random vector with mean Aµ + b and covariance matrix AΣA^T.

Proof. By Theorems 2.17 and 2.19, AX + b has mean Aµ + b and covariance matrix AΣA^T. We still need to prove that the joint distribution of Y is Gaussian. We will show that this is the case in the univariate case (the argument can be extended to the vector case). Let Y = aX + b with a ≠ 0:

F_Y(y) = P(aX + b ≤ y) (143)
       = P(X ≤ (y − b)/a) = F_X((y − b)/a)   if a > 0,
         P(X ≥ (y − b)/a) = 1 − F_X((y − b)/a)   if a < 0. (144)

To obtain the pdf of Y we differentiate with respect to y. If a > 0,

f_Y(y) = dF_Y(y)/dy (145)
       = dF_X((y − b)/a)/dy (146)
       = (1/a) f_X((y − b)/a) (147)
       = (1/(√(2π) a σ)) e^{−(y − b − aµ)² / (2a²σ²)}. (148)

Similarly, if a < 0,

f_Y(y) = d(1 − F_X((y − b)/a))/dy (149)
       = −(1/a) f_X((y − b)/a) (150)
       = (1/(√(2π) |a| σ)) e^{−(y − b − aµ)² / (2a²σ²)}. (151)

In both cases the pdf of Y is a Gaussian pdf with mean aµ + b and variance a²σ². The multivariate case can be established in a similar way.

A corollary of this result is that the joint pdf of a subvector of a Gaussian random vector is also Gaussian.


Corollary 2.22 (Marginals of Gaussian random vectors are Gaussian). The joint pdf of any subvector of a Gaussian random vector is Gaussian. Without loss of generality, assume that the subvector X consists of the first m entries of the Gaussian random vector

Z := [X
      Y], (152)

where

Σ_Z = [Σ_X      Σ_XY
       Σ_XY^T   Σ_Y],   Σ_XY := E(XY^T) − E(X) E(Y)^T. (153)

Then X is a Gaussian random vector with mean E(X) and covariance matrix Σ_X.

Proof. Note that

X = [I_m  0_{m×(n−m)}] [X
                        Y] = [I_m  0_{m×(n−m)}] Z, (154)

where I_m ∈ R^{m×m} is an identity matrix and 0_{c×d} represents a matrix of zeros of dimensions c × d. The result then follows from Theorem 2.21.

Interestingly, as a result of the dependence of the Gaussian joint pdf on the covariance matrix, in the case of Gaussian random vectors uncorrelation implies mutual independence.

Lemma 2.23 (Uncorrelation implies mutual independence for Gaussian random vectors). If all the components of a Gaussian random vector are uncorrelated, then they are also mutually independent.

Proof. The parameter Σ of the joint pdf of a Gaussian random vector is its covariance matrix (one can verify this by applying the definition of covariance and integrating). If all the components are uncorrelated, then

Σ_X = [ σ1²  0    · · ·  0
        0    σ2²  · · ·  0
        ···  ···  · · ·  ···
        0    0    · · ·  σn² ], (155)

where σi is the standard deviation of the ith component. Now, the inverse of this diagonal matrix is just

Σ_X^{−1} = [ 1/σ1²  0      · · ·  0
             0      1/σ2²  · · ·  0
             ···    ···    · · ·  ···
             0      0      · · ·  1/σn² ], (156)


and its determinant is |Σ_X| = ∏_{i=1}^{n} σi², so that from (142)

f_X(x) = (1 / √((2π)^n |Σ_X|)) exp(−(1/2) (x − µ)^T Σ_X^{−1} (x − µ)) (157)
       = ∏_{i=1}^{n} (1 / (√(2π) σi)) exp(−(xi − µi)² / (2σi²)) (158)
       = ∏_{i=1}^{n} f_{Xi}(xi). (159)
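The factorization (157)-(159) can be sanity-checked numerically for a diagonal covariance matrix by evaluating (142) by hand and comparing it with the product of the univariate Gaussian pdfs. A sketch assuming numpy; the parameters and the evaluation point are arbitrary.

    import numpy as np

    mu = np.array([1.0, -2.0, 0.5])
    sigmas = np.array([0.5, 1.0, 2.0])       # standard deviations
    Sigma = np.diag(sigmas**2)               # uncorrelated components
    x = np.array([0.3, -1.0, 2.0])           # arbitrary evaluation point
    n = len(x)

    # Joint pdf from (142).
    diff = x - mu
    joint = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
            / np.sqrt((2 * np.pi)**n * np.linalg.det(Sigma))

    # Product of the univariate Gaussian pdfs, as in (158)-(159).
    marginals = np.exp(-diff**2 / (2 * sigmas**2)) / (np.sqrt(2 * np.pi) * sigmas)
    print(joint, np.prod(marginals))         # the two values coincide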


A Derivation of means and variances in Table 1

A.1 Bernoulli

E(X) = p_X(1) = p, (160)
E(X²) = p_X(1) = p, (161)
Var(X) = E(X²) − E²(X) = p(1 − p). (162)

A.2 Geometric

To compute the mean of a geometric random variable, we apply Lemma B.3:

E(X) = ∑_{k=1}^{∞} k p_X(k) (163)
     = ∑_{k=1}^{∞} k p (1 − p)^{k−1} (164)
     = (p / (1 − p)) ∑_{k=1}^{∞} k (1 − p)^k (165)
     = 1/p. (166)

To compute the mean square value we apply Lemma B.4:

E(X²) = ∑_{k=1}^{∞} k² p_X(k) (167)
      = ∑_{k=1}^{∞} k² p (1 − p)^{k−1} (168)
      = (p / (1 − p)) ∑_{k=1}^{∞} k² (1 − p)^k (169)
      = (2 − p) / p². (170)

Var(X) = E(X²) − E²(X) = (1 − p) / p². (171)


A.3 Binomial

By Lemma 5.8 in Lecture Notes 1, if we define n Bernoulli random variables with parameter p, we can write a binomial random variable with parameters n and p as

X = ∑_{i=1}^{n} B_i, (172)

where B_1, B_2, . . . , B_n are mutually independent Bernoulli random variables with parameter p. Since the mean of all the Bernoulli random variables is p, by linearity of expectation

E(X) = ∑_{i=1}^{n} E(B_i) = np. (173)

Note that E(B_i²) = p and E(B_i B_j) = p² for i ≠ j by independence, so

E(X²) = E(∑_{i=1}^{n} ∑_{j=1}^{n} B_i B_j) (174)
      = ∑_{i=1}^{n} E(B_i²) + 2 ∑_{i=1}^{n−1} ∑_{j=i+1}^{n} E(B_i B_j) = np + n(n − 1) p². (175)

Var(X) = E(X²) − E²(X) = np(1 − p). (176)

A.4 Poisson

From calculus we have

∑_{k=0}^{∞} λ^k / k! = e^λ, (177)

which is the Taylor series expansion of the exponential function. Now we can establish that

E(X) = ∑_{k=1}^{∞} k p_X(k) (178)
     = ∑_{k=1}^{∞} λ^k e^{−λ} / (k − 1)! (179)
     = e^{−λ} ∑_{m=0}^{∞} λ^{m+1} / m! (180)
     = λ, (181)


and

E(X²) = ∑_{k=1}^{∞} k² p_X(k) (182)
      = ∑_{k=1}^{∞} k λ^k e^{−λ} / (k − 1)! (183)
      = e^{−λ} ∑_{k=1}^{∞} ((k − 1) λ^k / (k − 1)! + λ^k / (k − 1)!) (184)
      = e^{−λ} (∑_{m=0}^{∞} λ^{m+2} / m! + ∑_{m=0}^{∞} λ^{m+1} / m!) (185)
      = λ² + λ. (186)

Var(X) = E(X²) − E²(X) = λ. (187)

A.5 Uniform

We apply the definition of expected value for continuous random variables to obtain

E(X) = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{a}^{b} x / (b − a) dx (188)
     = (b² − a²) / (2(b − a)) = (a + b) / 2. (189)

Similarly,

E(X²) = ∫_{a}^{b} x² / (b − a) dx (190)
      = (b³ − a³) / (3(b − a)) (191)
      = (a² + ab + b²) / 3. (192)

Var(X) = E(X²) − E²(X) (193)
       = (a² + ab + b²) / 3 − (a² + 2ab + b²) / 4 = (b − a)² / 12. (194)


A.6 Exponential

Applying integration by parts,

E(X) = ∫_{−∞}^{∞} x f_X(x) dx (195)
     = ∫_{0}^{∞} x λ e^{−λx} dx (196)
     = [−x e^{−λx}]_{0}^{∞} + ∫_{0}^{∞} e^{−λx} dx (197)
     = 1/λ. (198)

Similarly,

E(X²) = ∫_{0}^{∞} x² λ e^{−λx} dx (199)
      = [−x² e^{−λx}]_{0}^{∞} + 2 ∫_{0}^{∞} x e^{−λx} dx (200)
      = 2/λ². (201)

Var(X) = E(X²) − E²(X) = 1/λ². (202)

A.7 Gaussian

We apply the change of variables t = (x − µ)/σ:

E(X) = ∫_{−∞}^{∞} x f_X(x) dx (203)
     = ∫_{−∞}^{∞} (x / (√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx (204)
     = (σ / √(2π)) ∫_{−∞}^{∞} t e^{−t²/2} dt + (µ / √(2π)) ∫_{−∞}^{∞} e^{−t²/2} dt (205)
     = µ, (206)

where the last step follows from the fact that the integral of a bounded odd function over a symmetric interval is zero and from Lemma 5.21 in Lecture Notes 1.


Applying the change of variables t = (x − µ)/σ and integrating by parts, we obtain

E(X²) = ∫_{−∞}^{∞} x² f_X(x) dx (207)
      = ∫_{−∞}^{∞} (x² / (√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx (208)
      = (σ² / √(2π)) ∫_{−∞}^{∞} t² e^{−t²/2} dt + (2µσ / √(2π)) ∫_{−∞}^{∞} t e^{−t²/2} dt + (µ² / √(2π)) ∫_{−∞}^{∞} e^{−t²/2} dt (209)
      = (σ² / √(2π)) ([−t e^{−t²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} e^{−t²/2} dt) + µ² (210)
      = σ² + µ². (211)

Var(X) = E(X²) − E²(X) = σ². (212)

B Geometric series

Lemma B.1. For any α ≠ 1 and any integers n1 ≤ n2,

∑_{k=n1}^{n2} α^k = (α^{n1} − α^{n2+1}) / (1 − α). (213)

Corollary B.2. If 0 < α < 1,

∑_{k=0}^{∞} α^k = 1 / (1 − α). (214)

Proof. We just multiply the sum by the factor (1 − α)/(1 − α), which obviously equals one:

α^{n1} + α^{n1+1} + · · · + α^{n2−1} + α^{n2} = (1 − α)(α^{n1} + α^{n1+1} + · · · + α^{n2−1} + α^{n2}) / (1 − α) (215)
= (α^{n1} − α^{n1+1} + α^{n1+1} − · · · − α^{n2} + α^{n2} − α^{n2+1}) / (1 − α) (216)
= (α^{n1} − α^{n2+1}) / (1 − α). (217)

Lemma B.3. For 0 < α < 1,

∑_{k=1}^{∞} k α^k = α / (1 − α)². (218)


Proof. By Corollary B.2,

∑_{k=0}^{∞} α^k = 1 / (1 − α). (219)

Since the series converges for 0 < α < 1, we can differentiate term by term on both sides to obtain

∑_{k=1}^{∞} k α^{k−1} = 1 / (1 − α)², (220)

and multiplying both sides by α yields the result.

Lemma B.4. For 0 < α < 1,

∑_{k=1}^{∞} k² α^k = α(1 + α) / (1 − α)³. (221)

Proof. By Lemma B.3,

∑_{k=1}^{∞} k α^k = α / (1 − α)². (222)

Since the series converges for 0 < α < 1, we can differentiate term by term on both sides to obtain

∑_{k=1}^{∞} k² α^{k−1} = (1 + α) / (1 − α)³, (223)

and multiplying both sides by α yields the result.
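The identities above (and Corollary B.2) can be checked numerically by truncating the series at a large index, since the tails are negligible for 0 < α < 1. A quick sketch assuming numpy.

    import numpy as np

    alpha = 0.6
    k = np.arange(1, 1000)                                   # truncated index range
    print(np.sum(alpha**k) + 1, 1 / (1 - alpha))             # Corollary B.2 (k from 0)
    print(np.sum(k * alpha**k), alpha / (1 - alpha)**2)      # Lemma B.3
    print(np.sum(k**2 * alpha**k), alpha * (1 + alpha) / (1 - alpha)**3)  # Lemma B.4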
