MATH2901 Revision Seminar
Part I: Probability Theory
Jeffrey Yang
UNSW Society of Statistics/UNSW Mathematics Society
August 2020
Jeffrey Yang (UNSW Society of Statistics/UNSW Mathematics Society) MATH2901 Revision Seminar August 2020 1 / 116
Note
The theoretical content was mostly sourced from Libo’s notes.
Most of the examples were taken from the slides Rui Tong (2018 StatSoc Team) made for that year's revision seminar.
All examples are presented at the end so that it isn't as obvious which techniques/methods should be used.
It is recommended that you refer to the official lecture notes when quoting definitions/results.
Table of Contents
1 Basic Probability Theory
2 Review of Random Variables
3 Computations for Common Distributions
4 Inequalities
5 Limit of Random Variables
6 Tips and Assorted Examples
Basic Probability Theory
1 Axioms and Basic Results
2 Conditional Probability and Independence
3 Basic Laws
4 Bayes Formula
Introduction to Probability Theory
Probability theory is about modelling and analysing random experiments. We can do this mathematically by specifying an appropriate probability space consisting of three components:
1 a sample space, Ω, which is the set of all possible outcomes.
2 an event space, F, which is the set of all events "of interest". Here, we can understand F as being a set of subsets of Ω which satisfies certain conditions.
3 a probability function, P : F → [0, 1], which assigns each event in the event space a probability.
Of course, for the model to be meaningful, P should satisfy certain axioms.
Axioms
Definition (Probability Space)
A probability space is the triple (Ω, F, P). Here, P satisfies the axioms
1 P(A) ≥ 0 for all A ∈ F
2 P(Ω) = 1
3 P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) for mutually exclusive events A1, A2, ... ∈ F
Exercise for the reader: derive the elementary results on the next slide from these axioms.
Elementary Results
From the axioms, we are able to derive the following fundamental results:
1 If A1, A2, ..., Ak are mutually exclusive, then
P(⋃_{i=1}^k Ai) = Σ_{i=1}^k P(Ai).
2 P(∅) = 0.
3 For any A ⊆ Ω, 0 ≤ P(A) ≤ 1 and P(Aᶜ) = 1 − P(A).
4 If B ⊂ A, then P(B) ≤ P(A). Hence, if "B occurs" implies "A occurs", then P(B) ≤ P(A).
These results can be used without proof.
Conditional Probability
Definition (Conditional Probability)
The conditional probability that an event A occurs, given that an event B has occurred, is
P(A|B) = P(A ∩ B) / P(B).
Here, we require P(B) ≠ 0.
Given that B has occurred, the total probability for the possible results of an experiment equals P(B). Visually, we see that the only outcomes in A that are now possible are those in A ∩ B.
Independence
Definition (Independent Events)
Events A and B are independent if P(A ∩ B) = P(A)P(B).
Recall that for any two events A and B, we have P(A ∩ B) = P(A|B)P(B). Thus, A and B are independent if and only if P(A|B) = P(A) (or equivalently, P(B|A) = P(B)). Intuitively, this means that knowing that event A has occurred does not give us any information on the probability of event B occurring (and vice versa).
Independence
Definition (Pairwise Independent Events)
For a countable sequence of events Ai, the events are pairwise independent if
P(Ai ∩ Aj) = P(Ai)P(Aj)
for all i ≠ j.
Definition ((Mutually) Independent Events)
For a countable sequence of events Ai, the events are (mutually) independent if for any finite collection Ai1, Ai2, ..., Ain, we have
P(Ai1 ∩ ... ∩ Ain) = P(Ai1) ... P(Ain).
Independence implies pairwise independence but not vice versa. Can you think of an example where pairwise independence does not imply independence?
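One standard example (the coin-toss setup here is our own illustration, not from the slides): toss two fair coins and let A = "first toss is heads", B = "second toss is heads", C = "the two tosses agree". The sketch below verifies pairwise independence and the failure of mutual independence by exact enumeration.

```python
from fractions import Fraction

# Two fair coin tosses; each of the four outcomes has probability 1/4.
outcomes = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

def prob(event):
    """Exact probability of an event, i.e. a set of outcomes."""
    return Fraction(len(event), len(outcomes))

A = {o for o in outcomes if o[0] == "H"}   # first toss is heads
B = {o for o in outcomes if o[1] == "H"}   # second toss is heads
C = {o for o in outcomes if o[0] == o[1]}  # the two tosses agree

# Pairwise independence: P(X ∩ Y) = P(X)P(Y) for every pair.
for X, Y in [(A, B), (A, C), (B, C)]:
    assert prob(X & Y) == prob(X) * prob(Y)

# Mutual independence fails: P(A ∩ B ∩ C) = 1/4 but P(A)P(B)P(C) = 1/8.
print(prob(A & B & C), prob(A) * prob(B) * prob(C))
```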
The Multiplicative Law
Definition (The Multiplicative Law)
For events A1, A2, we have
P(A1 ∩ A2) = P(A2|A1)P(A1).
For events A1, A2, A3, we have
P(A1 ∩ A2 ∩ A3) = P(A3 ∩ A2 ∩ A1)
= P(A3 | A2 ∩ A1) P(A2 ∩ A1)
= P(A3 | A1 ∩ A2) P(A2 | A1) P(A1).
We trust that the reader is able to generalise this to cases involving a greater number of events.
The Multiplicative Law is highly useful when dealing with a sequence of dependent trials.
The Additive Law
Definition (The Additive Law)
For events A and B, we have
P(A ∪ B) = P(A) + P(B)− P(A ∩ B).
You might have noticed that the Additive Law resembles the inclusion-exclusion principle. The following quote from Wikipedia sheds some light on this:
"As finite probabilities are computed as counts relative to the cardinality of the probability space, the formulas for the principle of inclusion-exclusion remain valid when the cardinalities of the sets are replaced by finite probabilities."
https://en.wikipedia.org/wiki/Inclusion-exclusion_principle
The Law of Total Probability
Definition (The Law of Total Probability)
Suppose A1, A2, ..., Ak are mutually exclusive (Ai ∩ Aj = ∅ for all i ≠ j) and exhaustive (⋃_{i=1}^k Ai = Ω, the sample space); that is, A1, ..., Ak forms a partition of Ω. Then, for any event B, we have
P(B) = Σ_{i=1}^k P(B|Ai)P(Ai).
The Law of Total Probability relates marginal probabilities to conditional probabilities and is often used in calculations involving Bayes' Formula.
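As a small numerical sketch (the factory shares and defect rates below are invented purely for illustration), the law reduces to a weighted sum over the partition:

```python
# Invented numbers: three factories partition production; B = "item is defective".
P_A = [0.5, 0.3, 0.2]             # P(A_i): share of items from factory i
P_B_given_A = [0.01, 0.02, 0.05]  # P(B | A_i): defect rate at factory i

# Law of total probability: P(B) = sum over i of P(B | A_i) P(A_i).
P_B = sum(pb * pa for pb, pa in zip(P_B_given_A, P_A))
print(round(P_B, 6))  # ≈ 0.021
```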
Bayes’ Formula/Bayes’ Theorem/Bayes’ Law
Definition (Bayes’ Theorem)
For a partition A1,A2, ...,Ak and an event B,
P(Aj | B) = P(B|Aj)P(Aj) / Σ_{i=1}^k P(B|Ai)P(Ai) = P(B|Aj)P(Aj) / P(B).
In essence, Bayes' Theorem allows us to reverse the order of conditioning, provided that we know the marginal probabilities of event A and event B. When dealing with problems involving Bayes' Theorem, it is recommended that one draws a tree diagram.
Here, we shall adopt the frequentist interpretation of probability, where probability measures a proportion of outcomes, as opposed to the Bayesian interpretation, where probability measures a degree of belief.
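A worked sketch of reversing the conditioning (the screening-test numbers below are invented for illustration): take the partition A1 = "has the disease", A2 = "does not", and B = "tests positive".

```python
# Invented screening-test numbers, for illustration only.
p_disease = 0.01      # P(A1): prior probability of the disease
p_pos_given_d = 0.99  # P(B | A1): sensitivity
p_pos_given_h = 0.05  # P(B | A2): false-positive rate (A2 = no disease)

# Denominator: law of total probability over the partition {A1, A2}.
p_pos = p_pos_given_d * p_disease + p_pos_given_h * (1 - p_disease)

# Bayes' formula reverses the order of conditioning.
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(round(p_d_given_pos, 3))  # ≈ 0.167: most positives are false positives
```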
Table of Contents
1 Basic Probability Theory
2 Review of Random Variables
3 Computations for Common Distributions
4 Inequalities
5 Limit of Random Variables
6 Tips and Assorted Examples
Review of Random Variables
1 Discrete Random Variables
2 Continuous Random Variables
3 Cumulative Distribution Function
4 Expectation and Moments
Discrete Random Variables
Definition (Discrete Random Variable)
The random variable X is discrete if there are countably many values x for which P(X = x) > 0.
The probability structure of X is typically described by its probability (mass) function.
Definition (Probability (Mass) Function)
The probability function of the discrete random variable X is given by
fX(x) = P(X = x).
The following two properties are important and apply to all discrete random variables:
1 fX(x) ≥ 0 for all x ∈ R
2 Σ_{all x} fX(x) = 1
Continuous Random Variables
A continuous random variable has a continuum of possible values. Continuous random variables do not have a probability (mass) function, but have the analogous (probability) density function.
Definition ((Probability) Density Function)
The density function of a continuous random variable X is a real-valued function fX on R with the property
∫_A fX(x) dx = P(X ∈ A)
for any (measurable) set A ⊆ R.
The following two properties are important and apply to all continuous random variables:
1 fX(x) ≥ 0 for all x ∈ R.
2 ∫_{−∞}^∞ fX(x) dx = 1.
Cumulative Distribution Function (cdf)
Definition (Cumulative Distribution Function)
The cumulative distribution function (cdf) of a random variable X is defined by
FX(x) = P(X ≤ x).
Here, X can be either continuous or discrete.
In the case where X is a continuous random variable, we have the following important results:
1 FX(x) = ∫_{−∞}^x fX(t) dt.
2 fX(x) = F′X(x).
3 P(a ≤ X ≤ b) = ∫_a^b fX(x) dx = area under fX between a and b.
Thus, if we know one of FX or fX, we are able to derive the other. Moreover, once we know these, we are able to derive any probability/property of X.
Important Remarks on Continuous Random Variables
Suppose X is a continuous random variable. Then we have
P(X = a) = 0 for any a ∈ R.
Hence, it is only meaningful to talk about the probability of X lying in some subinterval(s) of R. Consequently, we have
P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b).
Thus, we typically do not have to worry about whether an interval contains its boundary points or not.
This is NOT the case for discrete random variables.
Expectation
Definition (Expected Value of a Discrete Random Variable)
The expected value or mean of a discrete random variable X is
EX = E[X] := Σ_{all x} x · P(X = x) = Σ_{all x} x · fX(x),
where fX is the probability function of X.
Definition (Expected Value of a Continuous Random Variable)
The expected value or mean of a continuous random variable X is
E[X] = ∫_{−∞}^∞ x · fX(x) dx,
where fX is the density function of X.
In either case, E[X] can be interpreted as the long run average of X.
Expectation of Transformed Random Variables
Often, we are interested in a transformation of a random variable. In particular, we often examine the rth moment of X about some constant a, E[(X − a)^r]. The following results provide a method of calculating the expectation of a transformation of a random variable.
Result (Transformation of a Discrete Random Variable)
Let X be a discrete random variable and let g be a function of X . Then
Eg(X) = E[g(X)] = Σ_{all x} g(x) · fX(x).
Result (Transformation of a Continuous Random Variable)
Let X be a continuous random variable and let g be a function of X. Then
Eg(X) = E[g(X)] = ∫_{−∞}^∞ g(x) fX(x) dx.
Properties of Expectation
Result (Linearity of Expectation)
Let X, Y be random variables and a, b be constants. Then
E[aX + bY] = aE[X] + bE[Y].
Result (Expected Value of a Constant)
For a constant c, E[c] = c.
In general, for dependent random variables X, Y,
E[XY] ≠ E[X]E[Y].
Also, if g is a transformation of X, then typically,
E[g(X)] ≠ g(E[X]).
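A quick exact check of E[g(X)] ≠ g(E[X]), using a fair six-sided die and g(x) = x² (our own illustration, not from the slides):

```python
from fractions import Fraction

# X = a fair six-sided die; expectations computed exactly by enumeration.
support = range(1, 7)
p = Fraction(1, 6)

E_X = sum(x * p for x in support)       # E[X] = 7/2
E_gX = sum(x**2 * p for x in support)   # E[g(X)] = E[X^2] = 91/6

# E[g(X)] = 91/6 but g(E[X]) = (7/2)^2 = 49/4: not equal.
print(E_gX, E_X**2)
```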
Variance and Standard Deviation
Definition (Variance)
Let µ = E[X ]. Then the variance of X , denoted Var[X ], is defined as
Var[X] = E[(X − µ)²].
Observe that this is the second moment of X about µ.
Definition (Standard Deviation)
The standard deviation of X is the square root of its variance:
σ = √Var[X].
Both variance and standard deviation are measures of the spread of a random variable. Standard deviations are in the same units as X and thus can be more readily interpreted. However, variances are easier to work with theoretically.
Properties of Variance
Result (Alternative Formula for Variance)
Let X be a random variable and let µ = E[X]. Then
Var[X] = E[X²] − µ².
The variance will often be calculated using this formula.
Result (Nonlinearity of Variance)
Let X be a random variable and let a be a constant. Then
Var[X + a] = Var[X]
Var[aX] = a² Var[X].
Pay attention to the nonlinearity of variance. A common mistake is treating variance as being linear.
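Both the shortcut formula and the nonlinearity rules can be verified exactly for a fair die (an illustration of ours, not from the slides):

```python
from fractions import Fraction

support = range(1, 7)  # fair six-sided die
p = Fraction(1, 6)
E = lambda f: sum(f(x) * p for x in support)

mu = E(lambda x: x)
var_def = E(lambda x: (x - mu) ** 2)     # definition: E[(X - mu)^2]
var_alt = E(lambda x: x ** 2) - mu ** 2  # shortcut: E[X^2] - mu^2
assert var_def == var_alt == Fraction(35, 12)

# Nonlinearity: Var[aX + b] = a^2 Var[X]; the shift b drops out.
a, b = 3, 10
var_lin = E(lambda x: (a * x + b - (a * mu + b)) ** 2)
print(var_lin, a**2 * var_def)
```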
Moment Generating Functions
Definition (Moment Generating Function)
The moment generating function (mgf) of a random variable X is
mX(u) = E[e^{uX}].
We say that the moment generating function of X exists if mX(u) is finite in some interval containing zero.
The following result regarding the rth moment of X, E[X^r], shows why the moment generating function is called as such.
Result (r th Moment of a random variable)
Let X be a random variable. Then in general, we have
E[X^r] = mX^(r)(0)
for r = 0, 1, 2, ....
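A numerical sanity check (our own illustration): the standard normal has the known closed form mgf m(u) = e^{u²/2}, and finite differences at 0 recover the first two moments, E[X] = 0 and E[X²] = 1.

```python
import math

# Known closed form: the mgf of N(0, 1) is m(u) = exp(u^2 / 2).
def m(u):
    return math.exp(u * u / 2)

h = 1e-4
m1 = (m(h) - m(-h)) / (2 * h)             # central difference ~ m'(0) = E[X]
m2 = (m(h) - 2 * m(0) + m(-h)) / (h * h)  # ~ m''(0) = E[X^2]
print(m1, round(m2, 4))  # E[X] = 0 and E[X^2] = 1 for N(0, 1)
```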
Properties of Moment Generating Functions
Result (Uniqueness)
Let X and Y be two random variables all of whose moments exist. If
mX(u) = mY(u)
for all u in a neighbourhood of 0 (i.e., for all |u| < ε for some ε > 0), then
FX(x) = FY(x) for all x ∈ R.
That is, the moment generating function of a random variable uniquely determines its distribution.
Properties of Moment Generating Functions
Result (Convergence)
Let {Xn : n = 1, 2, ...} be a sequence of random variables, each with moment generating function mXn(u). Furthermore, suppose that
lim_{n→∞} mXn(u) = mX(u) for all u in a neighbourhood of 0,
where mX(u) is the moment generating function of a random variable X. Then
lim_{n→∞} FXn(x) = FX(x) for all x ∈ R.
That is, convergence of moment generating functions implies convergence of cumulative distribution functions.
Location and Scale Families of Densities
Result (Location Family of Densities)
A location family of densities based on the random variable U is the family of densities fX(x) where X = U + c for all possible c. Here, fX(x) is given by
fX(x) = fU(x − c).
Result (Scale Family of Densities)
A scale family of densities based on the random variable U is the family of densities fX(x) where X = cU for all possible c > 0. fX(x) is given by
fX(x) = c⁻¹ fU(x/c).
In essence, the density function of a continuous random variable may belong to a family of density functions that all have a similar form. This relates to the concept of parameters in statistics.
Proving the above results is not difficult and is a good exercise.
Table of Contents
1 Basic Probability Theory
2 Review of Random Variables
3 Computations for Common Distributions
4 Inequalities
5 Limit of Random Variables
6 Tips and Assorted Examples
Computations for Common Distributions
1 Common distributions summary
2 Computing the MGF of random variables
3 Computing the moments directly
4 Computing the distribution of simple transformed random variables
5 Bivariate density computations
6 Computing the sum of random variables using the convolution formula, bivariate transformation and MGFs
7 Identifying distributions from the MGF
Bernoulli Distribution
Definition (Bernoulli Distribution)
For a Bernoulli trial, define the random variable
X =
1 if the trial results in success
0 otherwise
Then X is said to have a Bernoulli distribution.
Result (Probability Function of X)
If X is a Bernoulli random variable defined according to a Bernoulli trial with success probability 0 < p < 1, then the probability function of X is
fX(x) =
p if x = 1
1 − p if x = 0.
An equivalent way of writing this is fX(x) = p^x (1 − p)^{1−x}, x = 0, 1.
Binomial Distribution
Definition (Binomial Random Variable)
Consider a sequence of n independent Bernoulli trials, each with success probability p. If
X = total number of successes,
then X is a Binomial random variable with parameters n and p. A common shorthand is
X ∼ Bin(n, p).
Here, "∼" means "is distributed as" or "has distribution".
Results
If X ∼ Bin(n, p), then
1 fX(x) = (n choose x) p^x (1 − p)^{n−x}, x = 0, ..., n,
2 E[X] = np,
3 Var(X) = np(1 − p).
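These three results can be verified by direct enumeration of the pmf (the values of n and p below are arbitrary illustrative choices):

```python
from math import comb

n, p = 10, 0.3  # arbitrary illustrative parameters
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

mean = sum(x * q for x, q in enumerate(pmf))
var = sum(x**2 * q for x, q in enumerate(pmf)) - mean**2

# Should match np = 3.0 and np(1-p) = 2.1 up to rounding error.
print(round(mean, 6), round(var, 6))
```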
Geometric Distribution
Definition (Geometric Distribution)
If
X = number of trials until first success,
then X is said to have a geometric distribution with parameter p (the probability of success on each trial).
Results
If X ∼ Geom(p), then
1 fX(x; p) = p(1 − p)^{x−1}, x = 1, 2, ...
2 E[X] = 1/p,
3 Var(X) = (1 − p)/p².
Poisson Distribution
Definition (Poisson Distribution)
The random variable X has a Poisson distribution with parameter λ > 0 if its probability function is
fX(x; λ) = P(X = x) = e^{−λ} λ^x / x!, x = 0, 1, 2, ...
A common abbreviation is X ∼ Poisson(λ).
The Poisson distribution is a model of the occurrence of point events in a continuum. The number of points occurring in a time interval of length t is a random variable with a Poisson(λt) distribution.
Results
If X ∼ Poisson(λ), then
1 E(X) = λ,
2 Var(X) = λ.
Exponential Distribution
The exponential distribution is useful for describing the probability structure of positive random variables. The exponential distribution is closely related to the Poisson distribution; the time until the next event has an exponential distribution with parameter β = 1/λ.
Definition
A random variable X is said to have an exponential distribution with parameter β > 0 if X has density function
fX(x; β) = (1/β) e^{−x/β}, x > 0.
Results
1 E(X) = β,
2 Var(X) = β²,
3 Memoryless property: P(X > s + t | X > s) = P(X > t).
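The memoryless property can be checked directly from the survival function P(X > x) = e^{−x/β} (the parameter value and test points below are arbitrary):

```python
import math

beta = 2.0  # arbitrary scale parameter

def surv(x):
    """P(X > x) for X ~ Exponential(beta)."""
    return math.exp(-x / beta)

s, t = 1.5, 0.7              # arbitrary test points
lhs = surv(s + t) / surv(s)  # P(X > s + t | X > s)
rhs = surv(t)                # P(X > t)
print(abs(lhs - rhs) < 1e-12)
```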
Uniform Distribution
Definition (Uniform Distribution)
A continuous random variable X that can take values in the interval (a, b) with equal likelihood is said to have a uniform distribution on (a, b). A common shorthand is
X ∼ Unif(a, b).
Results
If X ∼ Unif(a, b), then
1 fX(x; a, b) = 1/(b − a), a < x < b,
2 E(X) = (a + b)/2,
3 Var(X) = (b − a)²/12.
Gamma Function
The Gamma Function extends the factorial function to the real numbers.
Definition (Gamma Function)
The Gamma function at x > 0 is given by
Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt.
Results
1 Γ(x) = (x − 1)Γ(x − 1),
2 Γ(n) = (n − 1)!, n = 1, 2, 3, ...
3 Γ(1/2) = √π,
4 ∫_0^∞ x^m e^{−x} dx = m! for m = 0, 1, 2, ....
Beta Function
Definition (Beta Function)
The Beta function at x, y > 0 is given by
B(x, y) = ∫_0^1 t^{x−1} (1 − t)^{y−1} dt.
Result
For all x, y > 0,
B(x, y) = Γ(x)Γ(y) / Γ(x + y).
Phi Function
Definition (Φ)
For all x ∈ R,
Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.
We cannot simplify the above expression for Φ(x) as there is no closed-form anti-derivative. Φ gives the cumulative distribution function of the standard normal distribution.
Results
1 lim_{x→−∞} Φ(x) = 0,
2 lim_{x→∞} Φ(x) = 1,
3 Φ(0) = 1/2,
4 Φ is monotonically increasing over R.
Normal Distribution
Definition (Normal Distribution)
The random variable X is said to have a normal distribution with parameters µ and σ² (where −∞ < µ < ∞ and σ² > 0) if X has density function
fX(x; µ, σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}, −∞ < x < ∞.
A common shorthand is
X ∼ N(µ, σ²).
Results
If X ∼ N(µ, σ²),
1 E(X) = µ,
2 Var(X) = σ².
Computing Normal Distribution Probabilities
Result
If Z ∼ N(0, 1), then
P(Z ≤ x) = FZ(x) = Φ(x).
In other words, the Φ function is the cumulative distribution function of the N(0, 1) random variable.
Result
If X ∼ N(µ, σ²), then
Z = (X − µ)/σ ∼ N(0, 1).
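In code, Φ can be expressed via the error function, Φ(x) = (1 + erf(x/√2))/2, and standardisation turns any normal probability into a difference of Φ values (the µ, σ, a, b below are arbitrary illustrative values):

```python
import math

# Phi(x) in terms of the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
def Phi(x):
    return (1 + math.erf(x / math.sqrt(2))) / 2

mu, sigma = 100.0, 15.0  # arbitrary illustrative parameters
a, b = 85.0, 115.0

# Standardise: P(a < X < b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma).
p = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)  # = Phi(1) - Phi(-1)
print(round(p, 4))  # ≈ 0.6827, the "68%" of the 68-95-99.7 rule
```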
Gamma Distribution
Definition (Gamma Distribution)
A random variable X is said to have a Gamma distribution with parameters α and β (where α, β > 0) if X has density function
fX(x; α, β) = x^{α−1} e^{−x/β} / (Γ(α) β^α), x > 0.
A common shorthand is
X ∼ Gamma(α, β).
Results
If X ∼ Gamma(α, β), then
1 E(X) = αβ,
2 Var(X) = αβ².
Moreover, Y has an Exponential distribution iff Y ∼ Gamma(1, β).
Beta Distribution
Definition (Beta Distribution)
A random variable X is said to have a Beta distribution with parameters α, β > 0 if its density function is
fX(x; α, β) = x^{α−1} (1 − x)^{β−1} / B(α, β), 0 < x < 1.
The Beta distribution generalises the Unif(0, 1) distribution, which can be thought of as a Beta distribution with α = β = 1.
Results
If X has a Beta distribution with parameters α and β, then
1 E(X) = α/(α + β),
2 Var(X) = αβ/((α + β + 1)(α + β)²).
Joint Probability Function and Density Function
Definition (Joint Probability Function)
If X and Y are discrete random variables, then the joint probability function of X and Y is
fX,Y(x, y) = P(X = x, Y = y),
the probability that X = x and Y = y.
Definition (Joint Density Function)
The joint density function of continuous random variables X and Y is a bivariate function fX,Y with the property
∫∫_A fX,Y(x, y) dx dy = P((X, Y) ∈ A)
for any (measurable) subset A of R².
Joint Cumulative Distribution Function
Definition (Joint Cumulative Distribution Function)
The joint cdf of X and Y is
FX,Y(x, y) = P(X ≤ x, Y ≤ y)
= Σ_{u≤x} Σ_{v≤y} P(X = u, Y = v) (X, Y discrete)
= ∫_{−∞}^y ∫_{−∞}^x fX,Y(u, v) du dv (X, Y continuous).
Result
1 If X and Y are discrete random variables, then Σ_{all x} Σ_{all y} fX,Y(x, y) = 1.
2 If X and Y are continuous random variables, then ∫_{−∞}^∞ ∫_{−∞}^∞ fX,Y(x, y) dx dy = 1.
Expectation of Joint Functions
Result (Expectation of Joint Functions)
If g is any function of (X, Y), then
E[g(X, Y)] = Σ_{all x} Σ_{all y} g(x, y) P(X = x, Y = y) (discrete)
= ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) fX,Y(x, y) dx dy (continuous).
Marginal Probability Function
Result (Marginal Probability Function)
If X and Y are discrete, then fX(x) and fY(y) can be calculated from fX,Y(x, y) as follows:
fX(x) = Σ_{all y} fX,Y(x, y)
fY(y) = Σ_{all x} fX,Y(x, y).
fX(x) is sometimes referred to as the marginal probability function of X.
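A short sketch with a made-up 2×2 joint probability function, marginalising by summing over the other variable:

```python
from fractions import Fraction

# A made-up joint probability function f_{X,Y}(x, y) on {0,1} x {0,1}.
f = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(3, 8),
}
assert sum(f.values()) == 1  # a valid joint probability function

# Marginals: sum the joint pf over the other variable.
f_X = {x: sum(q for (u, v), q in f.items() if u == x) for x in (0, 1)}
f_Y = {y: sum(q for (u, v), q in f.items() if v == y) for y in (0, 1)}
print(f_X, f_Y)
```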
Marginal Density Function
Result (Marginal Density Function)
If X and Y are continuous, then fX(x) and fY(y) can be calculated from fX,Y(x, y) as follows:
fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy
fY(y) = ∫_{−∞}^∞ fX,Y(x, y) dx.
fX(x) is sometimes referred to as the marginal density function of X.
Conditional Probability Function
Definition (Conditional Probability Function)
If X and Y are discrete, the conditional probability function of X given Y = y is
fX|Y(x|y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = fX,Y(x, y) / fY(y).
Similarly,
fY|X(y|x) = P(Y = y | X = x) = fX,Y(x, y) / fX(x).
Result
P(Y ∈ A | X = x) = Σ_{y∈A} fY|X(y|x).
Conditional Density Function
Definition (Conditional Density Function)
If X and Y are continuous, the conditional density function of X given Y = y is
fX|Y(x | Y = y) = fX,Y(x, y) / fY(y).
Similarly,
fY|X(y | X = x) = fX,Y(x, y) / fX(x).
Shorthand: fY|X(y|x).
Result
P(a ≤ Y ≤ b | X = x) = ∫_a^b fY|X(y|x) dy.
Conditional Expected Value and Variance
Result (Conditional Expected Value)
The conditional expected value of X given Y = y is
E(X | Y = y) = Σ_{all x} x P(X = x | Y = y) if X is discrete, or ∫_{−∞}^∞ x fX|Y(x|y) dx if X is continuous.
Result (Conditional Variance)
The conditional variance of X given Y = y is
Var(X | Y = y) = E(X² | Y = y) − [E(X | Y = y)]²
where
E(X² | Y = y) = Σ_{all x} x² P(X = x | Y = y) in the discrete case, or ∫_{−∞}^∞ x² fX|Y(x|y) dx in the continuous case.
Independent Random Variables
Definition (Independent)
Random variables X and Y are independent if and only if, for all x, y,
fX,Y(x, y) = fX(x) fY(y).
Result
Random variables X and Y are independent if and only if, for all x, y,
fY|X(y|x) = fY(y).
Result
If X and Y are independent, then
FX,Y(x, y) = FX(x) · FY(y).
Covariance
Definition (Covariance)
The covariance of X and Y is
Cov(X, Y) = E[(X − µX)(Y − µY)].
Cov(X, Y) measures how much X and Y vary about their means and also how much they vary together linearly.
Results
1 Cov(X, X) = Var(X)
2 Cov(X, Y) = E(XY) − µX µY
3 If X and Y are independent, then Cov(X, Y) = 0.
Correlation
Definition (Correlation)
The correlation between X and Y is
Corr(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y)).
Corr(X, Y) measures the strength of the linear association between X and Y. Independent random variables are uncorrelated, but uncorrelated variables are not necessarily independent (consider the case where E(X) = 0 and Y = X²).
Results
1 |Corr(X, Y)| ≤ 1
2 |Corr(X, Y)| = 1 if and only if P(Y = a + bX) = 1 for some constants a, b.
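The parenthetical example can be checked exactly: take X uniform on {−1, 0, 1} and Y = X²; then Cov(X, Y) = 0 even though Y is a function of X (our own concrete instance of the hint):

```python
from fractions import Fraction

# X uniform on {-1, 0, 1} and Y = X^2: uncorrelated yet clearly dependent.
support = [-1, 0, 1]
p = Fraction(1, 3)
E = lambda f: sum(f(x) * p for x in support)

# Cov(X, Y) = E[XY] - E[X]E[Y] with Y = X^2, so E[XY] = E[X^3].
cov = E(lambda x: x**3) - E(lambda x: x) * E(lambda x: x**2)
print(cov)  # 0, even though Y is a deterministic function of X
```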
Transformations
Result
For discrete X and Y = h(X),
fY(y) = P(Y = y) = P[h(X) = y] = Σ_{x: h(x)=y} P(X = x).
Result
For continuous X, if h is monotonic over the set {x : fX(x) > 0}, then
fY(y) = fX(x) |dx/dy| = fX(h⁻¹(y)) |dx/dy|
for y such that fX(h⁻¹(y)) > 0.
Linear Transformation
Result
For a continuous random variable X, if Y = aX + b is a linear transformation of X with a ≠ 0, then
fY(y) = (1/|a|) fX((y − b)/a)
for all y such that fX((y − b)/a) > 0.
Leading into Bivariate Transformations...
If X and Y have joint density fX,Y(x, y) and U is a function of X and Y, we can find the density of U by calculating FU(u) = P(U ≤ u) and differentiating.
Bivariate Transformations
Result
If U and V are functions of continuous random variables X and Y, then
fU,V(u, v) = fX,Y(x, y) · |J|
where
J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ]
is a determinant called the Jacobian of the transformation.
The full specification of fU,V(u, v) requires that the range of (u, v) values corresponding to those (x, y) for which fX,Y(x, y) > 0 is determined.
Bivariate Transformations
To find fU(u) by bivariate transformation:
1 Define some bivariate transformation (U, V).
2 Find fU,V(u, v).
3 We want the marginal distribution of U, so now find ∫_{−∞}^∞ fU,V(u, v) dv.
Using a bivariate transformation to find the distribution of U is often more convenient than deriving it via the cumulative distribution function. Using the cdf requires double integration, which we can avoid when we use a bivariate transformation.
Sum of Independent Random Variables - Probability Function/Density Function Approach
Result (Discrete Convolution Formula)
Suppose that X and Y are independent random variables taking only non-negative integer values, and let Z = X + Y. Then
fZ(z) = Σ_{y=0}^z fX(z − y) fY(y), z = 0, 1, ...
Result (Continuous Convolution Formula)
Suppose X and Y are independent continuous random variables with X ∼ fX(x) and Y ∼ fY(y). Then Z = X + Y has density
fZ(z) = ∫_{all possible y} fX(z − y) fY(y) dy.
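The discrete convolution formula can be sanity-checked numerically: for independent X ∼ Poisson(λ1) and Y ∼ Poisson(λ2), it agrees with the known fact that X + Y ∼ Poisson(λ1 + λ2). The parameter values and evaluation point below are arbitrary.

```python
import math

def pois(lam, x):
    """Poisson(lam) probability function at x."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam1, lam2 = 1.5, 2.5  # arbitrary illustrative rates
z = 4                  # arbitrary evaluation point

# Discrete convolution: f_Z(z) = sum_{y=0}^{z} f_X(z - y) f_Y(y).
conv = sum(pois(lam1, z - y) * pois(lam2, y) for y in range(z + 1))
direct = pois(lam1 + lam2, z)  # Poisson(lam1 + lam2) pf at z
print(abs(conv - direct) < 1e-12)
```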
Sum of Gammas
Result
If X1, X2, ..., Xn are independent with Xi ∼ Gamma(αi, β), then
Σ_{i=1}^n Xi ∼ Gamma(Σ_{i=1}^n αi, β).
Sum of Independent Random Variables - Moment Generating Function Approach
Result
Suppose that X and Y are independent random variables with moment generating functions mX and mY. Then
mX+Y(u) = mX(u) mY(u).
This generalises to n independent random variables.
Result
If X ∼ N(µX, σ²X) and Y ∼ N(µY, σ²Y) are independent, then
X + Y ∼ N(µX + µY, σ²X + σ²Y).
This also generalises.
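A numerical check of the mgf product rule for normals, using the known closed form m(u) = exp(µu + σ²u²/2) (all parameter values below are arbitrary):

```python
import math

def mgf_normal(mu, var, u):
    """mgf of N(mu, var): exp(mu*u + var*u^2/2)."""
    return math.exp(mu * u + var * u * u / 2)

mu_x, var_x = 1.0, 2.0   # arbitrary parameters for X
mu_y, var_y = -0.5, 3.0  # arbitrary parameters for Y
u = 0.3                  # arbitrary evaluation point

# Independence gives m_{X+Y}(u) = m_X(u) m_Y(u), which here equals the
# mgf of N(mu_x + mu_y, var_x + var_y).
lhs = mgf_normal(mu_x, var_x, u) * mgf_normal(mu_y, var_y, u)
rhs = mgf_normal(mu_x + mu_y, var_x + var_y, u)
print(abs(lhs - rhs) < 1e-12)
```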
Table of Contents
1 Basic Probability Theory
2 Review of Random Variables
3 Computations for Common Distributions
4 Inequalities
5 Limit of Random Variables
6 Tips and Assorted Examples
Inequalities
1 Jensen’s Inequality
2 Markov’s inequality
3 Chebyshev’s inequality
Jensen’s Inequality
Result (Jensen’s Inequality)
If h is a convex function (i.e., concave up) and X is a random variable, then
E[h(X)] ≥ h(E[X]).
There are several formulations of Jensen's inequality, and the above formulation is the one most relevant for probability theory. Students studying MATH2701 next term will be delighted to re-encounter Jensen's inequality, albeit in a different formulation.
Markov’s Inequality
Result (Markov’s Inequality)
Let X be a nonnegative random variable and a > 0. Then
P(X ≥ a) ≤ E[X ]
a.
That is, the probability that X is at least a is at most the expectation of Xdivided by a.
Jeffrey Yang (UNSW Society of Statistics/UNSW Mathematics Society)MATH2901 Revision Seminar August 2020 67 / 116
Chebyshev's Inequality

Result (Chebyshev's Inequality)

If X is any random variable with E[X] = µ and Var(X) = σ^2, then

P(|X − µ| > kσ) ≤ \frac{1}{k^2}.

That is, the probability that X is more than k standard deviations from its mean is at most \frac{1}{k^2}.

The significance of Chebyshev's Inequality is that we can make specific probabilistic statements about a random variable given only its mean and standard deviation – observe that we have made no assumptions on the distribution of X.

Interestingly, Chebyshev's Inequality can be easily derived as a corollary of Markov's Inequality (hint: consider the random variable (X − E[X])^2).
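A quick empirical sketch of the bound (the shifted-Exponential choice below is arbitrary, picked because it is skewed and decidedly non-Normal): the observed tail frequency should never exceed 1/k².

```python
import random

random.seed(0)
mu, sigma, k = 0.0, 1.0, 2.0
n = 100_000
# Shifted Exponential(1): mean 0 and standard deviation 1, but heavily skewed
samples = [random.expovariate(1.0) - 1.0 for _ in range(n)]
tail_freq = sum(abs(x - mu) > k * sigma for x in samples) / n
# Chebyshev guarantees the tail probability is at most 1/k^2 = 0.25;
# for this particular distribution the true value is e^{-3} ~ 0.0498
assert tail_freq <= 1 / k ** 2
```

Note how loose the bound is here (0.05 versus 0.25) — the price of assuming nothing about the distribution.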
Limit of Random Variables
1 Almost Sure Convergence
2 Convergence in Probability
3 Convergence in Distribution
4 Central Limit Theorem
5 Law of Large Numbers
6 The Statement and Application of the Delta Method
Almost Sure Convergence
Definition (Almost Sure Convergence)

The sequence of numerical random variables X_1, X_2, \ldots is said to converge almost surely to a numerical random variable X, denoted X_n \xrightarrow{a.s.} X, if

P\left( \omega : \lim_{n→\infty} X_n(\omega) = X(\omega) \right) = 1.

Within the context of probability theory, an event is said to happen almost surely if it happens with probability 1. Note that it is possible for the set of exceptions to be non-empty as long as it has probability 0.
Almost Sure Convergence
The previous definition of almost sure convergence can be difficult to work with, so we will make use of an alternative definition.

Result (Alternative Definition of Almost Sure Convergence)

X_n \xrightarrow{a.s.} X if and only if, for every ε > 0,

\lim_{n→\infty} P\left( \sup_{k ≥ n} |X_k − X| > ε \right) = 0.

Almost sure convergence is the mode of convergence used in the Strong Law of Large Numbers.
Convergence in Probability
The main idea behind Convergence in Probability is that the probability of an "unusual" event becomes smaller as the sequence progresses.

Definition (Convergence in Probability)

The sequence of random variables X_1, X_2, \ldots converges in probability to a random variable X if, for all ε > 0,

\lim_{n→\infty} P(|X_n − X| > ε) = 0.

This is usually written as X_n \xrightarrow{P} X.

For context, an estimator is called consistent if it converges in probability to the quantity being estimated. Furthermore, convergence in probability is the type of convergence established by the Weak Law of Large Numbers.
Relationship between Almost Sure Convergence and Convergence in Probability

Result (Almost Sure Convergence implies Convergence in Probability)

X_n \xrightarrow{a.s.} X \implies X_n \xrightarrow{P} X

and

X_n \xrightarrow{a.s.} 0 \iff \sup_{k ≥ n} |X_k| \xrightarrow{P} 0.

To see why almost sure convergence is stronger than convergence in probability: almost sure convergence depends on the joint distribution of the whole sequence, whereas convergence in probability depends only on the marginal distributions.

A helpful picture: almost sure convergence means no noodle leaves the strip (for large enough n), while convergence in probability means the proportion of noodles leaving the strip goes to 0 (as n → ∞).
Convergence in Distribution
Convergence in Distribution is concerned with whether the distributions of the X_i converge to the distribution of some random variable X. In other words, we increasingly expect the next outcome in a sequence of random experiments to be well modelled by a given probability distribution.

Definition (Convergence in Distribution)

Let X_1, X_2, \ldots be a sequence of random variables. We say that X_n converges in distribution to X if

\lim_{n→\infty} F_{X_n}(x) = F_X(x)

for all x where F_X is continuous. A common shorthand is X_n \xrightarrow{d} X.

We say that F_X is the limiting distribution of X_n.

Convergence in distribution often arises in applications of the Central Limit Theorem.
More on Convergence in Distribution
Convergence in distribution allows us to make approximate probability statements about X_n, for large n, if we can derive the limiting distribution F_X.

Result (Establishing Convergence in Distribution using Moment Generating Functions)

Let \{X_n\}_{n ∈ \mathbb{N}} be a sequence of random variables, each with mgf M_{X_n}(t). Furthermore, suppose that

\lim_{n→\infty} M_{X_n}(t) = M_X(t).

If M_X(t) is a moment generating function, then there is a unique F_X (which gives a random variable X) whose moments are determined by M_X(t), and for all points of continuity of F_X we have

\lim_{n→\infty} F_{X_n}(x) = F_X(x).
Relationship between Convergence in Probability and Convergence in Distribution

Result (Convergence in Probability implies Convergence in Distribution)

X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X.

Convergence in probability is concerned with the convergence of the actual values (the x_i's), whereas convergence in distribution is concerned with the convergence of the distributions (the F_{X_i}(x)'s).
Weak Law of Large Numbers
Result (Weak Law of Large Numbers)

Suppose X_1, X_2, \ldots are independent, each with mean µ and variance 0 < σ^2 < \infty. If

\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i, \quad \text{then} \quad \bar{X}_n \xrightarrow{P} µ.

The Weak Law of Large Numbers describes how the sample average converges to the distributional average as the sample size increases.
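A small simulation sketch of the law (Uniform(0, 1) draws are an arbitrary illustrative choice, so µ = 0.5): the sample mean settles near µ as n grows.

```python
import random

random.seed(1)
mu = 0.5  # mean of Uniform(0, 1)
errors = []
for n in [100, 10_000, 1_000_000]:
    xbar = sum(random.random() for _ in range(n)) / n
    errors.append(abs(xbar - mu))
# The deviation |Xbar_n - mu| should be tiny for the largest n
assert errors[-1] < 0.005
```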
Slutsky’s Theorem
Result (Slutsky's Theorem)

Let X_1, X_2, \ldots be a sequence of random variables such that X_n \xrightarrow{d} X for some random variable X, and let Y_1, Y_2, \ldots be another sequence of random variables such that Y_n \xrightarrow{P} c for some constant c. Then

1 X_n + Y_n \xrightarrow{d} X + c

2 X_n Y_n \xrightarrow{d} cX.

Slutsky's Theorem extends some properties of algebraic operations on convergent sequences of real numbers to sequences of random variables, and is useful for establishing convergence in distribution results.
Strong Law of Large Numbers
The Weak Law corresponds to convergence in probability, while the Strong Law corresponds to almost sure convergence.

Result (Strong Law of Large Numbers)

Let X_1, X_2, \ldots be independent with common mean E[X_i] = µ and variance Var(X_i) = σ^2 < \infty. Then

\bar{X}_n \xrightarrow{a.s.} µ.
Central Limit Theorem
Result (Central Limit Theorem)

Suppose X_1, X_2, \ldots are independent and identically distributed random variables with common mean µ = E(X_i) and common variance σ^2 = Var(X_i) < \infty. For each n ≥ 1, let \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. Then

\frac{\bar{X}_n − µ}{σ/\sqrt{n}} \xrightarrow{d} Z

where Z ∼ N(0, 1). It is common to write

\frac{\bar{X}_n − µ}{σ/\sqrt{n}} \xrightarrow{d} N(0, 1).

Note that E(\bar{X}_n) = µ and Var(\bar{X}_n) = \frac{σ^2}{n}, so the Central Limit Theorem states that the limiting distribution of a standardised average of i.i.d. random variables is the standard Normal distribution.
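The statement can be checked by simulation. In the sketch below (Uniform(0, 1) summands are an illustrative choice), the empirical frequency of the standardised mean falling below 1.96 should be close to the Normal value Φ(1.96) ≈ 0.975.

```python
import random
from math import sqrt

random.seed(2)
n, reps = 200, 20_000
mu, sigma = 0.5, sqrt(1 / 12)  # mean and sd of Uniform(0, 1)
zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append((xbar - mu) / (sigma / sqrt(n)))  # standardised average
# Empirical P(Z_n <= 1.96) should be close to Phi(1.96) ~ 0.975
frac = sum(z <= 1.96 for z in zs) / reps
assert abs(frac - 0.975) < 0.01
```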
Alternative Forms of the Central Limit Theorem
Sometimes, probabilities involving related quantities such as the sum \sum_{i=1}^{n} X_i are required. Since \sum_{i=1}^{n} X_i = n\bar{X}_n, the Central Limit Theorem also applies to the sum of a sequence of random variables.

Results

1 \sqrt{n}(\bar{X}_n − µ) \xrightarrow{d} N(0, σ^2)

2 \frac{\sum_i X_i − nµ}{σ\sqrt{n}} \xrightarrow{d} N(0, 1)

3 \frac{\sum_i X_i − nµ}{\sqrt{n}} \xrightarrow{d} N(0, σ^2)
Applications of the Central Limit Theorem
Central Limit Theorem for the Binomial Distribution

Suppose X ∼ Bin(n, p). Then

\frac{X − np}{\sqrt{np(1 − p)}} \xrightarrow{d} N(0, 1).

Normal Approximation to the Poisson Distribution

Suppose X ∼ Poisson(λ). Then

\lim_{λ→\infty} P\left( \frac{X − λ}{\sqrt{λ}} ≤ x \right) = P(Z ≤ x)

where Z ∼ N(0, 1).

Continuity correction: add \frac{1}{2} to the numerator.
The Delta Method
The Delta Method

Let Y_1, Y_2, \ldots be a sequence of random variables such that

\frac{\sqrt{n}(Y_n − θ)}{σ} \xrightarrow{d} N(0, 1).

Suppose the function g is differentiable in a neighbourhood of θ and g′(θ) ≠ 0. Then

\sqrt{n}\left( g(Y_n) − g(θ) \right) \xrightarrow{d} N(0, σ^2 [g′(θ)]^2).

Alternatively,

\frac{g(Y_n) − g(θ)}{σ g′(θ)/\sqrt{n}} \xrightarrow{d} N(0, 1).

There are several ways of stating the Delta Method.
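A numeric sketch of the Delta Method (illustrative choices, not from the slides: Uniform(0, 1) data and g(x) = x², so θ = 0.5 and g′(θ) = 1): the variance of g(Ȳ_n) should be close to σ²[g′(θ)]²/n for large n.

```python
import random

random.seed(3)
n, reps = 400, 10_000
theta, sigma2 = 0.5, 1 / 12      # mean and variance of Uniform(0, 1)
g = lambda x: x * x              # smooth transform; g'(theta) = 2 * theta = 1
vals = []
for _ in range(reps):
    ybar = sum(random.random() for _ in range(n)) / n
    vals.append(g(ybar))
m = sum(vals) / reps
var = sum((v - m) ** 2 for v in vals) / reps
# Delta-method prediction: Var(g(Ybar_n)) ~ sigma^2 [g'(theta)]^2 / n
predicted = sigma2 * (2 * theta) ** 2 / n
assert abs(var - predicted) / predicted < 0.1
```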
General Tips
1 When finding a pdf/cdf, always specify the domain over which it is defined.

2 Make sure that the relevant conditions are satisfied before applying a result/theorem (e.g., are the random variables i.i.d.? Does a sequence of random variables have the right mode of convergence?).

3 Does the pdf in question belong to a known family? If so, you can typically make use of known results.

4 \xrightarrow{a.s.} \implies \xrightarrow{P} \implies \xrightarrow{d}

5 Know how to do calculations involving bivariate distributions.

6 Know how to apply (bivariate) transformations.

7 Know how to sum independent random variables using both the convolution formula approach and the moment generating function approach.

8 Know the Laws of Large Numbers, the CLT and the Delta Method.
Example 1
Example (MATH1251)

99% of the people with the disease receive a positive test. 98% of those without receive a negative test. If 2% of the population have the disease, determine the probability of someone having the disease given they received a positive test.
Example 1
We require P(D | T) = \frac{P(T | D)\, P(D)}{P(T)}.

P(T) = P(T | D)\, P(D) + P(T | D^c)\, P(D^c)
     = P(T | D)\, P(D) + \left( 1 − P(T^c | D^c) \right) P(D^c)
     = 0.99 × 0.02 + (1 − 0.98) × 0.98 = 0.0394

∴ P(D | T) = \frac{0.99 × 0.02}{0.0394} ≈ 0.5025
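The same calculation in a few lines of Python (the numbers come straight from the example):

```python
# Sensitivity P(T|D), specificity P(T^c|D^c), prevalence P(D) from the example
sens, spec, prev = 0.99, 0.98, 0.02

p_T = sens * prev + (1 - spec) * (1 - prev)  # law of total probability
p_D_given_T = sens * prev / p_T              # Bayes' formula

assert abs(p_T - 0.0394) < 1e-9
assert round(p_D_given_T, 4) == 0.5025
```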
Example 1
A lot of people get stuck with Bayes' law, especially when used with other results. Use a tree diagram!
Example 2

Example

Given the distribution of X below, compute its expectation and standard deviation.

x          0     3     9     27
P(X = x)   0.3   0.1   0.5   0.1

E[X] = \sum_{\text{all } x} x\, P(X = x) = 0 × 0.3 + 3 × 0.1 + 9 × 0.5 + 27 × 0.1 = 7.5

E[X^2] = 0^2 × 0.3 + 3^2 × 0.1 + 9^2 × 0.5 + 27^2 × 0.1 = 114.3

σ_X = \sqrt{E[X^2] − (E[X])^2} = \sqrt{114.3 − 7.5^2} = \sqrt{58.05} ≈ 7.619
Example 2

Example (2901 oriented)

Let X ∼ Geom(p). Prove that E[X] = \frac{1}{p}.

Recall: P(X = x) = p(1 − p)^{x−1} for x = 1, 2, \ldots

E[X] = \sum_{x=1}^{\infty} x\, p(1 − p)^{x−1}
     = \sum_{y=0}^{\infty} (y + 1)\, p(1 − p)^{y} \quad (y = x − 1)
     = (1 − p) \sum_{y=0}^{\infty} (y + 1)\, p(1 − p)^{y−1}
     = (1 − p) \sum_{y=0}^{\infty} y\, p(1 − p)^{y−1} + (1 − p) \sum_{y=0}^{\infty} p(1 − p)^{y−1}
     = (1 − p) \sum_{y=1}^{\infty} y\, p(1 − p)^{y−1} + (1 − p) \left( \sum_{y=1}^{\infty} p(1 − p)^{y−1} + p(1 − p)^{−1} \right) \quad \text{(splitting off the } y = 0 \text{ term; the first sum's } y = 0 \text{ term is } 0\text{)}
     = (1 − p)\, E[X] + (1 − p) \left( 1 + p(1 − p)^{−1} \right)
Example 3

Example (2901 oriented)

Let X ∼ Geom(p). Prove that E[X] = \frac{1}{p}.

Since (1 − p)\left( 1 + p(1 − p)^{−1} \right) = (1 − p) + p = 1, the previous display gives E[X] = (1 − p)E[X] + 1.

∴ pE[X] = E[X] − (1 − p)E[X] = 1, \quad \text{so} \quad E[X] = \frac{1}{p}.
Example 3
In general, this can be done with the aid of Taylor series or the binomial theorem. But preferably just do this:

Method (Deriving the Expected Value from the definition) (2901)

Keep rearranging the expression until you make the entire density, or E[X], appear again.

Discrete case - use a change of summation index at some point.

Continuous case - use integration by parts (or occasionally integration by substitution).
Example 4
Example
Let f_X(x) = \frac{2}{θ^2} x for 0 < x < θ. Compute the MGF and (2901) assert its existence.
Example 4

Integrate by parts, then slowly tidy everything up:

m_X(u) = E[e^{uX}] = \frac{2}{θ^2} \int_0^θ x e^{ux}\, dx
       = \frac{2}{θ^2} \left( \left. \frac{x e^{ux}}{u} \right|_0^θ − \int_0^θ \frac{e^{ux}}{u}\, dx \right)
       = \frac{2θ e^{uθ}}{uθ^2} − \frac{2}{θ^2} \left( \left. \frac{e^{ux}}{u^2} \right|_0^θ \right)
       = \frac{2(uθ e^{uθ} − e^{uθ} + 1)}{u^2 θ^2}
Example 4

Idea: Can check that the limit as u → 0 is finite. The finiteness of the limit implies the required result.

\lim_{u→0} \frac{2(uθ e^{uθ} − e^{uθ} + 1)}{u^2 θ^2}
  \overset{\text{L'H}}{=} \lim_{u→0} \frac{2\left( θ e^{uθ} + uθ^2 e^{uθ} − θ e^{uθ} \right)}{2uθ^2}
  = \lim_{u→0} e^{uθ}
  = 1
Example 5

Example

Use the MGF of X ∼ Bin(n, p) to prove that E[X] = np.

E[X] = \lim_{u→0} \frac{d}{du} (1 − p + pe^u)^n
     = \lim_{u→0} n(1 − p + pe^u)^{n−1} · pe^u
     = n(1 − p + p)^{n−1} · p
     = np
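A quick numerical cross-check of this MGF differentiation (n = 10 and p = 0.3 are arbitrary choices): approximate m′(0) by central differences and compare with np.

```python
from math import exp

n, p = 10, 0.3
m = lambda u: (1 - p + p * exp(u)) ** n   # mgf of Bin(n, p)

h = 1e-6
deriv = (m(h) - m(-h)) / (2 * h)          # central-difference approximation of m'(0)
assert abs(deriv - n * p) < 1e-4          # m'(0) = E[X] = np = 3
```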
Example 6

Example

A busy switchboard receives 150 calls an hour on average. Assume that calls are independent of each other and can be modelled with a Poisson distribution. Find the probability of

1 exactly 3 calls in a given minute;

2 at least 10 calls in a given 5 minute period.

Naive (and wrong): X ∼ Poisson(150) — the rate must first be rescaled to the relevant time interval.
Example 6

In Q1, take X ∼ Poisson(150/60) = Poisson(2.5) calls per minute. Then

P(X = 3) = e^{−2.5} \frac{2.5^3}{3!} ≈ 0.2138
Example 6

In Q2, take Y ∼ Poisson(2.5 × 5) = Poisson(12.5) calls per 5 minutes. Then

P(Y ≥ 10) = 1 − P(Y ≤ 9)
          = 1 − e^{−12.5} \left( \frac{12.5^0}{0!} + \cdots + \frac{12.5^9}{9!} \right)
          = 1 - ppois(9, lambda=12.5, lower=TRUE)   (in R)
          ≈ 0.7985689
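The two answers can also be reproduced without R; a small Python sketch that sums the Poisson pmf directly:

```python
from math import exp, factorial

def pois_pmf(lam, k):
    # Poisson(lam) probability mass function at k
    return exp(-lam) * lam ** k / factorial(k)

def pois_cdf(lam, k):
    # P(X <= k) for X ~ Poisson(lam), by direct summation
    return sum(pois_pmf(lam, j) for j in range(k + 1))

p1 = pois_pmf(2.5, 3)          # exactly 3 calls in a minute, rate 2.5/min
p2 = 1 - pois_cdf(12.5, 9)     # at least 10 calls in 5 minutes, rate 12.5

assert abs(p1 - 0.2138) < 5e-4
assert abs(p2 - 0.7985689) < 1e-6
```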
Example 7

Example (2901 course pack)

If, on average, 5 servers go offline during the day, what is the chance that no servers will go offline in the next hour?

The number of servers going offline in a day is X ∼ Poisson(5).

So the time taken for the next server to go offline is T ∼ Exp(0.2) (mean 0.2 days, equivalently rate 5 per day), measured in days. We require P(T > \frac{1}{24}).

P\left( T > \frac{1}{24} \right) = \int_{1/24}^{\infty} 5e^{−5t}\, dt = e^{−5/24}
Example 8
Formula (Transforming a Discrete r.v.)

P(h(X) = y) = \sum_{x : h(x) = y} P(X = x)

Um, ye wat?
Example 8

Example

A random variable has the following distribution:

x          −1     0      1      2
P(X = x)   0.38   0.21   0.14   0.27

Determine the distribution of Y = X^3 and Z = X^2.
Example 8

If X can take the values −1, 0, 1, 2, then Y = X^3 takes the values −1, 0, 1, 8.

P(Y = −1) = P(X^3 = −1) = P(X = −1) = 0.38

Similarly, P(Y = 0) = 0.21, P(Y = 1) = 0.14, P(Y = 8) = 0.27.
Example 8

On the other hand, Z = X^2 can only take the values 0, 1, 4.

P(Z = 0) = P(X^2 = 0) = P(X = 0) = 0.21

P(Z = 1) = P(X^2 = 1) = P(X = ±1) = 0.38 + 0.14 = 0.62

...and P(Z = 4) is still equal to 0.27.
Example 8

Just to think about... (2901 oriented)

If X ∼ Poisson(λ), what must be the distribution of Y = X^2?

P(Y = y) = e^{−λ} \frac{λ^{\sqrt{y}}}{(\sqrt{y})!} \text{ if } y = 0, 1, 4, 9, \ldots, \text{ and } 0 \text{ otherwise.}
Example 9

Method 1 (Continuous random variable transform theorem)

Consider the transform y = h(x). If h is monotonic wherever f_X(x) is non-zero, then the density of Y = h(X) is

f_Y(y) = f_X\left( h^{−1}(y) \right) \left| \frac{dx}{dy} \right|

Example

Let X ∼ Exp(λ). What is the density of Y = X^2?
Example 9

f_X(x) = \frac{1}{λ} e^{−x/λ} for all x > 0.

h(x) = x^2 is invertible for all x > 0, with h^{−1}(y) = \sqrt{y}.

x = \sqrt{y}, so \frac{dx}{dy} = \frac{1}{2\sqrt{y}}.

∴ f_Y(y) = f_X(\sqrt{y}) \left| \frac{1}{2\sqrt{y}} \right|
         = \frac{1}{λ} e^{−\sqrt{y}/λ} \left| \frac{1}{2\sqrt{y}} \right|
         = \frac{1}{2λ\sqrt{y}} e^{−\sqrt{y}/λ}
Example 10
Example
Let X ∼ Exp(λ). What is the density of Y = X^2?

f_Y(y) = \frac{1}{2λ\sqrt{y}} e^{−\sqrt{y}/λ}

Since x > 0 and y = x^2, y > 0 as well.
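As a sanity check (a sketch; λ = 2 is an arbitrary choice), the derived density should integrate to 1 over y > 0. Substituting y = t² removes the 1/√y singularity and recovers the Exp(λ) density:

```python
from math import exp, sqrt

lam = 2.0
f_Y = lambda y: exp(-sqrt(y) / lam) / (2 * lam * sqrt(y))

# Midpoint rule on [0, T] after the substitution y = t^2 (dy = 2t dt),
# under which the integrand becomes the Exp(lam) density (1/lam) e^{-t/lam}
N, T = 100_000, 60.0
dt = T / N
total = 0.0
for i in range(N):
    t = (i + 0.5) * dt
    total += f_Y(t * t) * 2 * t * dt

assert abs(total - 1.0) < 1e-3
```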
Example 10

Example

Let X ∼ Unif(−10, 10). What is the density of Y = X^2?

Here h(x) = x^2 is not monotonic on (−10, 10), so we sum the contributions from the two branches x = ±\sqrt{y}:

f_Y(y) = 2 × \frac{1}{20} × \frac{1}{2\sqrt{y}} = \frac{1}{20\sqrt{y}}

Since −10 < x < 10 and y = x^2, we must have 0 < y < 100.
Example 11

Example

The joint probability distribution of X and Y is

         y = 0   y = 1   y = 2
x = 0    1/16    1/8     1/8
x = 1    1/8     1/16    0
x = 2    3/16    1/4     1/16

Determine P(X = 0, Y = 1), P(X ≥ 1, Y < 1) and P(X − Y = 1).

P(X = 0, Y = 1) = \frac{1}{8}
Example 11

P(X ≥ 1, Y < 1) = P(X = 1, Y = 0) + P(X = 2, Y = 0) = \frac{1}{8} + \frac{3}{16} = \frac{5}{16}
Example 11

P(X − Y = 1) = P(X = 2, Y = 1) + P(X = 1, Y = 0) = \frac{1}{4} + \frac{1}{8} = \frac{3}{8}
Example 12

Joint continuous distributions

Unless you know how to use indicator functions really well (2901), sketch the region!

Example

f_{X,Y}(x, y) = \frac{1}{x^2 y^2}, \quad x ≥ 1, y ≥ 1

is the joint density of the continuous r.v.s X and Y. Find P(X < 2, Y ≥ 4) and P(X ≤ Y^2).
Example 12

P(X < 2, Y ≥ 4) = \int_1^2 \int_4^{\infty} \frac{1}{x^2 y^2}\, dy\, dx = \int_1^2 \frac{1}{4x^2}\, dx = \frac{1}{8}
Example 12

P(X ≤ Y^2) = \int_1^{\infty} \int_1^{y^2} \frac{1}{x^2 y^2}\, dx\, dy = \int_1^{\infty} \left( \frac{1}{y^2} − \frac{1}{y^4} \right) dy = \frac{2}{3}
Example 13

Example

Find E[Y^2 \ln X] for the following distribution

         y = 1   y = 2
x = 1    1/10    1/5
x = 2    3/10    2/5

E[Y^2 \ln X] = 1^2 \ln 1 · P(X = 1, Y = 1) + 2^2 \ln 1 · P(X = 1, Y = 2)
             + 1^2 \ln 2 · P(X = 2, Y = 1) + 2^2 \ln 2 · P(X = 2, Y = 2)
             = \left( \frac{3}{10} + 4 × \frac{2}{5} \right) \ln 2 = \frac{19 \ln 2}{10}
Example 14

Problem

Examine the existence of E[XY] for the earlier example:

f_{X,Y}(x, y) = \frac{1}{x^2 y^2} \quad \text{for } x, y ≥ 1.
Example 15

Definition (Cumulative Distribution Function)

The CDF F_{X,Y}(x, y) is the function given by

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)

Finding a CDF (Continuous case)

F_{X,Y}(x, y) = \int_{−\infty}^{x} \int_{−\infty}^{y} f_{X,Y}(u, v)\, dv\, du

Example

For the earlier example, F_{X,Y}(x, y) = 0 if x < 1 or y < 1. Else:

F_{X,Y}(x, y) = \int_1^x \int_1^y \frac{1}{u^2 v^2}\, dv\, du = \left( 1 − \frac{1}{x} \right)\left( 1 − \frac{1}{y} \right)
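A quick numerical sketch (the test points x = 3, y = 5 are arbitrary choices) confirming the closed form: the integrand is separable, so the double integral factors into two one-dimensional midpoint sums.

```python
def int_inv_sq(a, b, n=2000):
    # Midpoint-rule approximation of the integral of 1/u^2 over [a, b]
    h = (b - a) / n
    return sum(h / (a + (i + 0.5) * h) ** 2 for i in range(n))

x, y = 3.0, 5.0
F_num = int_inv_sq(1.0, x) * int_inv_sq(1.0, y)   # separable integrand factors
F_closed = (1 - 1 / x) * (1 - 1 / y)
assert abs(F_num - F_closed) < 1e-5
```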
Example 16

Recall that for independent events, P(A ∩ B) = P(A)P(B).

Definition (Independence of random variables)

Two random variables are independent when:

P(X = x, Y = y) = P(X = x)P(Y = y) (discrete case)

f_{X,Y}(x, y) = f_X(x)\, f_Y(y) (continuous case)

Example

Test if X and Y are independent, for

f_{X,Y}(x, y) = \frac{1}{x^2 y^2}, \quad x, y ≥ 1.
Example 16

f_X(x) = \int_1^{\infty} \frac{1}{x^2 y^2}\, dy = \frac{1}{x^2}, \quad x ≥ 1

Similarly, f_Y(y) = \frac{1}{y^2} for y ≥ 1.

Therefore, since f_{X,Y}(x, y) = f_X(x)\, f_Y(y), X and Y are independent.
Example 17

Example

Determine P(X = x | Y = 2), i.e. f_{X|Y}(x | 2), for

         y = 1   y = 2
x = 1    1/10    1/5
x = 2    3/10    2/5

P(Y = 2) = P(X = 1, Y = 2) + P(X = 2, Y = 2) = \frac{1}{5} + \frac{2}{5} = \frac{3}{5}.
Example 17

P(X = 1 | Y = 2) = \frac{P(X = 1, Y = 2)}{P(Y = 2)} = \frac{1/5}{3/5} = \frac{1}{3}

P(X = 2 | Y = 2) = \frac{P(X = 2, Y = 2)}{P(Y = 2)} = \frac{2/5}{3/5} = \frac{2}{3}
Example 18

Lemma (Independence of random variables)

Two random variables are independent if and only if

f_{Y|X}(y | x) = f_Y(y) \quad \text{or equivalently} \quad f_{X|Y}(x | y) = f_X(x).

Investigation

For the earlier example with f_{X,Y}(x, y) = x^{−2} y^{−2} for x ≥ 1, y ≥ 1, prove the independence of X and Y using this lemma instead.
Example 19

Definition (Conditional Expectation)

E[X | Y = y] = \sum_{\text{all } x} x\, P(X = x | Y = y) \quad \text{(discrete case)}

E[X | Y = y] = \int_{−\infty}^{\infty} x\, f_{X|Y}(x | y)\, dx \quad \text{(continuous case)}

Definition (Conditional Variance)

Var(X | Y = y) = E[X^2 | Y = y] − \left( E[X | Y = y] \right)^2

(And similarly for Y. Basically, just add the condition to the original formula.)
Example 19

Example

Find E[X | Y = 2] and Var(X | Y = 2) for

         y = 1   y = 2
x = 1    1/10    1/5
x = 2    3/10    2/5

E[X | Y = 2] = 1 · P(X = 1 | Y = 2) + 2 · P(X = 2 | Y = 2) = 1 × \frac{1}{3} + 2 × \frac{2}{3} = \frac{5}{3}.
Example 19
Example
Find E[X | Y = 2] and Var(X | Y = 2) for
y
1 2
x 1 1/10 1/52 3/10 2/5
E[X 2 | Y = 2] = 12 · P(X = 1 | Y = 2) + 22 · P(X = 2 | Y = 2)
= 12 × 1
3+ 22 × 2
3= 3.
Jeffrey Yang (UNSW Society of Statistics/UNSW Mathematics Society)MATH2901 Revision Seminar August 2020 105 / 116
Example 19
Example
Find E[X | Y = 2] and Var(X | Y = 2) for
y
1 2
x 1 1/10 1/52 3/10 2/5
Var(X 2 | Y = 2) = 3−(
5
3
)2
=2
9
Jeffrey Yang (UNSW Society of Statistics/UNSW Mathematics Society)MATH2901 Revision Seminar August 2020 105 / 116
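These conditional moments can be verified exactly with rational arithmetic (an illustrative Python sketch, starting from the conditional pmf derived above):

```python
from fractions import Fraction as F

cond = {1: F(1, 3), 2: F(2, 3)}  # f_{X|Y}(x | 2) from the example

mean = sum(x * p for x, p in cond.items())        # E[X | Y = 2]
second = sum(x * x * p for x, p in cond.items())  # E[X^2 | Y = 2]
var = second - mean**2                            # Var(X | Y = 2)

assert mean == F(5, 3)
assert second == 3
assert var == F(2, 9)
```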
Example 20

Example
Let $f_{X,Y}(x,y) = xy$ for $x \in [0,1]$, $y \in [0,2]$. Determine their covariance the old-fashioned way.

Step 1: Determine the marginal densities.
\[ f_X(x) = \int_0^2 xy \, dy = 2x \quad (0 \le x \le 1), \qquad f_Y(y) = \int_0^1 xy \, dx = \frac{y}{2} \quad (0 \le y \le 2). \]

Step 2: Find the marginal expectations $E[X]$ and $E[Y]$.
\[ E[X] = \int_0^1 2x^2 \, dx = \frac{2}{3}, \qquad E[Y] = \int_0^2 \frac{y^2}{2} \, dy = \frac{4}{3}. \]

Step 3: Find $E[XY]$.
\[ E[XY] = \int_0^1 \int_0^2 xy \, f_{X,Y}(x,y) \, dy \, dx = \int_0^1 \int_0^2 x^2 y^2 \, dy \, dx = \frac{8}{9}. \]

Step 4: Plug in:
\[ \mathrm{Cov}(X, Y) = E[XY] - E[X] E[Y] = \frac{8}{9} - \frac{2}{3} \times \frac{4}{3} = 0. \]

That was a horrible idea.

- Can prove that $X$ and $Y$ are independent (the joint density factorises over a rectangular support), so the covariance is $0$ immediately.
- Can use the Fubini–Tonelli theorem to just check that $E[XY]$ equals $E[X]E[Y]$.
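The three integrals can be approximated numerically to confirm the covariance really is zero (an illustrative Python sketch using a crude midpoint rule; grid sizes are arbitrary):

```python
def f(x, y):
    return x * y  # joint density on [0, 1] x [0, 2]

def integrate2d(g, nx=400, ny=400):
    # midpoint rule over the support [0, 1] x [0, 2]
    hx, hy = 1.0 / nx, 2.0 / ny
    return hx * hy * sum(
        g((i + 0.5) * hx, (j + 0.5) * hy)
        for i in range(nx) for j in range(ny)
    )

e_xy = integrate2d(lambda x, y: x * y * f(x, y))  # E[XY]
e_x = integrate2d(lambda x, y: x * f(x, y))       # E[X]
e_y = integrate2d(lambda x, y: y * f(x, y))       # E[Y]

assert abs(e_xy - 8 / 9) < 1e-4
assert abs(e_xy - e_x * e_y) < 1e-5  # Cov(X, Y) = 0
```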
Example 21

Example (2901)
Let $Z \sim \mathcal{N}(0, 1)$ and let $W$ satisfy $P(W = 1) = P(W = -1) = \frac{1}{2}$. Suppose that $W$ and $Z$ are independent and define $X := WZ$.

Show that $\mathrm{Cov}(X, Z) = 0$.

Noting that $E[Z] = 0$,
\[ \mathrm{Cov}(X, Z) = E[XZ] - E[X] E[Z] = E[XZ]. \]

Subbing in $X = WZ$ and using independence gives
\[ \mathrm{Cov}(X, Z) = E[W Z^2] = E[W] E[Z^2]. \]

Observe that
\[ E[W] = 1 \cdot P(W = 1) - 1 \cdot P(W = -1) = 0. \]

Hence $\mathrm{Cov}(X, Z) = E[W] E[Z^2] = 0$.
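A Monte Carlo simulation makes this plausible empirically (an illustrative Python sketch; the sample size, seed, and tolerance are arbitrary choices):

```python
import random

random.seed(0)
n = 200_000
zs = [random.gauss(0.0, 1.0) for _ in range(n)]     # Z ~ N(0, 1)
ws = [random.choice((-1.0, 1.0)) for _ in range(n)] # W = +-1, each w.p. 1/2
xs = [w * z for w, z in zip(ws, zs)]                # X = WZ

def mean(v):
    return sum(v) / len(v)

# sample covariance of X and Z
cov = mean([x * z for x, z in zip(xs, zs)]) - mean(xs) * mean(zs)
assert abs(cov) < 0.02  # sampling error is of order 1/sqrt(n)
```

(Despite the zero covariance, $X$ and $Z$ are not independent, e.g. $|X| = |Z|$, which is the usual moral of this example.)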
Example 22

Theorem (Bivariate Transform Formula)
Suppose $X$ and $Y$ have joint density function $f_{X,Y}$ and let $U$ and $V$ be transformations of these random variables. Then the joint density of $U, V$ is
\[ f_{U,V}(u, v) = f_{X,Y}(x, y) \, |\det(J)| \]
where $J$ is the Jacobian matrix
\[ J = \begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\[1ex] \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix}. \]
Remember: $x$ above $y$, and $u$ left of $v$.

Example (Course pack)
Let $X$ and $Y$ be i.i.d. Exp(4) r.v.s. Find the joint density of $U$ and $V$ if $U = \frac{1}{2}(X - Y)$ and $V = Y$.

We have $y = v$ and
\[ u = \frac{1}{2}(x - v) \implies x = 2u + v. \]

\[ \therefore J = \begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix} \quad \text{and} \quad \det(J) = 2. \]

The joint density is
\[ f_{X,Y}(x, y) = \frac{1}{16} e^{-(x+y)/4}. \]
Since $y = v$ and $x = 2u + v$, we get $x + y = 2u + 2v$. Therefore
\[ f_{U,V}(u, v) = \frac{1}{8} e^{-(u+v)/2}. \]

For the support: we know that $y > 0$. Since $v = y$, it immediately follows that $v > 0$. However, $x > 0$ and $x = 2u + v$. Therefore
\[ 2u + v > 0, \quad \text{i.e.} \quad u > -\frac{v}{2}. \]
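Simulation gives a cheap check of the support and symmetry of $(U, V)$ (an illustrative Python sketch; here Exp(4) is taken as mean 4, the parameterization implied by the density $\frac{1}{16}e^{-(x+y)/4}$):

```python
import random

random.seed(1)
n = 100_000
# exponential draws with mean 4 (rate 1/4)
xs = [random.expovariate(0.25) for _ in range(n)]
ys = [random.expovariate(0.25) for _ in range(n)]

us = [(x - y) / 2 for x, y in zip(xs, ys)]  # U = (X - Y)/2
vs = ys                                     # V = Y

# every draw lies in the support u > -v/2 (equivalently 2u + v = x > 0)
assert all(2 * u + v > 0 for u, v in zip(us, vs))
# E[U] = (E[X] - E[Y]) / 2 = 0 by symmetry
assert abs(sum(us) / n) < 0.1
```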
Example 23

Example
Let $X$ and $Y$ be i.i.d. Geom($p$). Use convolutions to find the probability function of $Z := X + Y$.

The probability functions are $P(X = x) = p(1-p)^{x-1}$ for $x = 1, 2, 3, \ldots$, and $P(Y = y) = p(1-p)^{y-1}$ for $y = 1, 2, 3, \ldots$. Therefore
\[ P(X = z - y) = p(1-p)^{z-y-1} \]
for $z - y = 1, 2, 3, \ldots$, i.e.
\[ y = z - 1, \, z - 2, \, z - 3, \ldots \]

Hence
\[ P(X = z - y) P(Y = y) = p(1-p)^{z-y-1} \cdot p(1-p)^{y-1} = p^2 (1-p)^{z-2}, \]
when $y = 1, 2, 3, \ldots$ and $y = z - 1, z - 2, z - 3, \ldots$, i.e. $y = 1, 2, \ldots, z - 1$.

\[ \therefore P(Z = z) = \sum_{y=1}^{z-1} p^2 (1-p)^{z-2} = (z - 1) \, p^2 (1-p)^{z-2} \quad \text{(the summand does not depend on } y\text{!)} \]

Since $x = 1, 2, \ldots$ and $y = 1, 2, \ldots$, i.e. $x$ and $y$ are natural numbers greater than or equal to 1, $z = x + y = 2, 3, 4, \ldots$
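The closed form can be checked against a direct convolution sum (an illustrative Python sketch; $p = 0.3$ and the range of $z$ are arbitrary):

```python
p = 0.3
q = 1 - p

def geom_pmf(k):
    # P(X = k) = p (1 - p)^(k - 1) on the support k = 1, 2, 3, ...
    return p * q ** (k - 1) if k >= 1 else 0.0

for z in range(2, 30):
    # convolution: P(Z = z) = sum over y of P(X = z - y) P(Y = y)
    conv = sum(geom_pmf(z - y) * geom_pmf(y) for y in range(1, z))
    closed = (z - 1) * p**2 * q ** (z - 2)
    assert abs(conv - closed) < 1e-12
```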
Example 24

Example
Let $X$ and $Y$ be i.i.d. Exp(1). Prove that $Z := X + Y$ follows a Gamma(2, 1) distribution using a convolution.

The densities are $f_X(x) = e^{-x}$ for $x > 0$, and $f_Y(y) = e^{-y}$ for $y > 0$. Therefore
\[ f_X(z - y) = e^{-(z-y)}, \quad \text{for } z - y > 0, \text{ i.e. } y < z. \]

Hence $f_X(z - y) f_Y(y) = e^{-z}$ when $y < z$ and $y > 0$, i.e.
\[ f_X(z - y) f_Y(y) = e^{-z} \quad \text{for } 0 < y < z. \]

\[ \therefore f_Z(z) = \int_0^z e^{-z} \, dy = z e^{-z} = \frac{e^{-z/1} \, z^{2-1}}{\Gamma(2) \, 1^2}. \]

Since $x > 0$ and $y > 0$, $z = x + y > 0$. Thus $Z$ has the density of a Gamma(2, 1) random variable.
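A numerical version of the convolution integral confirms the answer (an illustrative Python sketch; the step count and test points are arbitrary):

```python
import math

def f_exp(t):
    return math.exp(-t) if t > 0 else 0.0  # Exp(1) density

def f_z_numeric(z, steps=20_000):
    # midpoint rule for the convolution integral over y in (0, z)
    h = z / steps
    return h * sum(
        f_exp(z - (k + 0.5) * h) * f_exp((k + 0.5) * h) for k in range(steps)
    )

for z in (0.5, 1.0, 3.0):
    # compare with the Gamma(2, 1) density z e^{-z}
    assert abs(f_z_numeric(z) - z * math.exp(-z)) < 1e-9
```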
Example 25

Theorem (MGF of a sum)
If $X$ and $Y$ are independent random variables, then
\[ m_{X+Y}(u) = m_X(u) \, m_Y(u). \]

Example
Let $X$ and $Y$ be i.i.d. Exp(1). Prove that $Z := X + Y$ follows a Gamma(2, 1) distribution from quoting MGFs.

$m_X(u) = \frac{1}{1-u}$ and $m_Y(u) = \frac{1}{1-u}$ (for $u < 1$). So clearly
\[ m_Z(u) = m_X(u) \, m_Y(u) = \left( \frac{1}{1-u} \right)^2, \]
which is the MGF of a Gamma(2, 1) distribution. Since an MGF that exists in a neighbourhood of $0$ uniquely determines the distribution, $Z$ follows this distribution as well.
Example 26

Example
Let $U_1, \ldots, U_n$ be i.i.d. Unif(0, 1) random variables. Define $Y_n = n \min\{U_1, \ldots, U_n\}$. Prove that $Y_n \xrightarrow{d} Y$, where $Y \sim$ Exp(1).

For $0 \le y \le n$,
\[
\begin{aligned}
F_{Y_n}(y) &= P(Y_n \le y) = P(n \min\{U_1, \ldots, U_n\} \le y) \\
&= P\left( \min\{U_1, \ldots, U_n\} \le \frac{y}{n} \right) \\
&= 1 - P\left( \min\{U_1, \ldots, U_n\} > \frac{y}{n} \right) \\
&= 1 - P\left( U_1 > \frac{y}{n}, \ldots, U_n > \frac{y}{n} \right)
\end{aligned}
\]

In general, if $\min\{u_1, \ldots, u_n\} \le u$, it is not true that every $u_i \le u$. But it is true that if $\min\{u_1, \ldots, u_n\} > u$, then every $u_i > u$, which is why we pass to the complement.

\[
\begin{aligned}
F_{Y_n}(y) &= 1 - P\left( U_1 > \frac{y}{n} \right) \cdots P\left( U_n > \frac{y}{n} \right) \quad \text{(independence)} \\
&= 1 - \left[ P\left( U_1 > \frac{y}{n} \right) \right]^n \quad \text{(identically distributed)} \\
&= 1 - \left[ \int_{y/n}^1 1 \, dt \right]^n = 1 - \left( 1 - \frac{y}{n} \right)^n
\end{aligned}
\]

\[ \therefore \lim_{n \to \infty} F_{Y_n}(y) = 1 - e^{-y} = F_Y(y). \]

Hence $Y_n \xrightarrow{d} Y$.
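The convergence is easy to see by simulation: for moderately large $n$ the empirical CDF of $Y_n$ is already close to $1 - e^{-y}$ (an illustrative Python sketch; $n$, the replication count, seed, and tolerance are arbitrary):

```python
import math
import random

random.seed(2)
n, reps = 200, 10_000
# draws of Y_n = n * min(U_1, ..., U_n)
ys = [n * min(random.random() for _ in range(n)) for _ in range(reps)]

for y in (0.5, 1.0, 2.0):
    empirical = sum(1 for v in ys if v <= y) / reps
    # compare with the Exp(1) CDF, 1 - e^{-y}
    assert abs(empirical - (1 - math.exp(-y))) < 0.03
```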
Example 27

Example (Libo's notes)
Australians have average weight about 68 kg and variance about 16 kg². Suppose 40 random Australians are chosen. What is the (approximate) probability that the average weight of these Australians is over 80?

Let $X_1, \ldots, X_{40}$ be the weights of the Australians. Then $n = 40$, $\mu = 68$ and $\sigma = 4$, so by the CLT:
\[ \frac{\bar{X} - 68}{4/\sqrt{40}} \xrightarrow{d} Z, \]
where $Z \sim \mathcal{N}(0, 1)$.

\[ \therefore P(\bar{X}_{40} > 80) = P\left( \frac{\bar{X}_{40} - 68}{4/\sqrt{40}} > \frac{80 - 68}{4/\sqrt{40}} \right) \approx P\left( Z > \frac{80 - 68}{4/\sqrt{40}} \right) = P(Z > 3\sqrt{40}). \]

In R: 1-pnorm(3*sqrt(40)) or pnorm(3*sqrt(40), lower.tail=FALSE).
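The same tail probability can be evaluated without R (an illustrative Python sketch using the identity $1 - \Phi(z) = \frac{1}{2}\operatorname{erfc}(z/\sqrt{2})$):

```python
import math

z = 3 * math.sqrt(40)  # about 18.97 standard deviations
# standard normal upper-tail probability via the complementary error function
p = 0.5 * math.erfc(z / math.sqrt(2))

assert 0 < p < 1e-75  # the probability is astronomically small
```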
Example 28

Lemma (Normal Approximation to the Binomial)
Let $X \sim \mathrm{Bin}(n, p)$, which is a sum of $n$ independent Ber($p$) r.v.s. Then
\[ \frac{X - np}{\sqrt{np(1-p)}} \xrightarrow{d} \mathcal{N}(0, 1). \]

Example
An unfortunate soul decided to sit his exam despite having a migraine and the flu. Fortunately, it was not a university exam, and the paper involved only 200 multiple choice questions with 5 options each. Therefore, he randomly guesses every answer. What is the (approximate) probability he fails?

Let $X$ be how many he gets correct. Then $X \sim \mathrm{Bin}\left(200, \frac{1}{5}\right)$.

Since $np = 40$ and $np(1-p) = 32$, we may approximate $X$ with $Y \sim \mathcal{N}(40, 32)$. Then
\[ P(X < 100) \approx P(Y < 100) = P\left( \frac{Y - 40}{\sqrt{32}} < \frac{100 - 40}{\sqrt{32}} \right) = P\left( Z < \frac{60}{\sqrt{32}} \right) = P(Z < 10.6066\ldots) \]

Oh my...
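Comparing the exact binomial probability with the normal approximation shows both are indistinguishable from 1 here (an illustrative Python sketch; "fails" is read as scoring under 100 of 200, as in the working above):

```python
import math

n, p = 200, 0.2
# exact P(X < 100) for X ~ Bin(200, 1/5)
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(100))
# normal approximation P(Z < 60 / sqrt(32)), using Phi(z) = erfc(-z / sqrt 2) / 2
approx = 0.5 * math.erfc(-(60 / math.sqrt(32)) / math.sqrt(2))

assert exact > 0.999999 and approx > 0.999999  # he fails, essentially surely
```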
Example 29

Example (Libo's notes)
Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables with mean 2 and variance 7. Obtain a large sample approximation for the distribution of $(\bar{X}_n)^3$.

The CLT gives
\[ \sqrt{n}(\bar{X}_n - 2) \xrightarrow{d} \mathcal{N}(0, 7). \]

Applying the Delta Method with $g(x) = x^3$ leads to $g'(x) = 3x^2$ and then
\[ \sqrt{n}\left[ (\bar{X}_n)^3 - 2^3 \right] \xrightarrow{d} \mathcal{N}\left(0, \, 7 \cdot (3 \cdot 2^2)^2\right). \]

Simplifying, we have
\[ \sqrt{n}\left[ (\bar{X}_n)^3 - 8 \right] \xrightarrow{d} \mathcal{N}(0, 1008). \]

Thus, for large $n$, the approximate distribution of $(\bar{X}_n)^3$ is $\mathcal{N}\left(8, \frac{1008}{n}\right)$.
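The Delta Method approximation can be checked by simulation (an illustrative Python sketch; the underlying distribution is taken to be normal with mean 2 and variance 7 purely for convenience — any distribution with those moments would do, and the sample sizes and tolerances are arbitrary):

```python
import random

random.seed(3)
n, reps = 400, 5_000
sigma = 7 ** 0.5

cubes = []
for _ in range(reps):
    # one sample mean of n i.i.d. draws with mean 2, variance 7, then cubed
    xbar = sum(random.gauss(2.0, sigma) for _ in range(n)) / n
    cubes.append(xbar**3)

m = sum(cubes) / reps                       # should be near 8
v = sum((c - m) ** 2 for c in cubes) / reps # should be near 1008 / n = 2.52

assert abs(m - 8) < 0.3
assert abs(v - 1008 / n) < 0.5
```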
Example 30

Example (More 2801 focused)
The Riemann zeta function is defined for complex $s$ with real part greater than 1 by the absolutely convergent infinite series
\[ \zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}. \]
Prove that the real part of every non-trivial zero of the Riemann zeta function is $\frac{1}{2}$.