Statistical Inference, Lecture 3: Common Families of Distributions

MING GAO

DaSE @ ECNU (for course-related communications)

[email protected]

Mar. 24, 2020

Outline

1 Discrete Distributions

2 Continuous Distributions

3 Exponential Family

4 Location and Scale Families

5 Take-aways

MING GAO (DaSE@ECNU) Statistical Inference Mar. 24, 2020 2 / 28

Discrete Distributions

Discrete distributions

A r.v. X is said to have a discrete distribution if the range of X is countable. In most situations, the r.v. has integer-valued outcomes.

Discrete uniform distribution

A r.v. X has a discrete uniform (1,N) distribution if

$$P(X = x \mid N) = \frac{1}{N}, \quad x = 1, 2, \dots, N,$$

where N is a specified integer. This distribution puts equal mass on each of the outcomes 1, 2, ..., N.

Using $\sum_{i=1}^{k} i = \frac{k(k+1)}{2}$ and $\sum_{i=1}^{k} i^2 = \frac{k(k+1)(2k+1)}{6}$, we obtain

$$E(X) = \sum_{x=1}^{N} x\, P(X = x \mid N) = \frac{N+1}{2};$$

$$\mathrm{Var}(X) = E(X^2) - E(X)^2 = \frac{(N+1)(N-1)}{12}.$$
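
The two closed forms for the discrete uniform distribution can be checked by direct enumeration. The following is a minimal sketch using exact rational arithmetic; the function name `discrete_uniform_moments` and the choice N = 6 (a fair die) are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction

def discrete_uniform_moments(N):
    """Return (mean, variance) of the discrete uniform distribution on 1..N."""
    pmf = Fraction(1, N)                                 # P(X = x | N) = 1/N
    mean = sum(x * pmf for x in range(1, N + 1))         # E(X)
    second = sum(x * x * pmf for x in range(1, N + 1))   # E(X^2)
    return mean, second - mean ** 2                      # Var(X) = E(X^2) - E(X)^2

# Illustrative choice: a fair six-sided die.
mean, var = discrete_uniform_moments(6)
print(mean, var)
```

For any N the enumerated values should agree with (N+1)/2 and (N+1)(N−1)/12.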

Bernoulli Trials

Definition

Each performance of an experiment with two possible outcomes is called a Bernoulli trial.

In general, a possible outcome of a Bernoulli trial is called a success or a failure.

If p is the probability of a success and q is the probability of a failure, it follows that p + q = 1.

Let X = 1 for a success and X = 0 for a failure. Then $E(X) = 0 \cdot (1-p) + 1 \cdot p = p$ and $E(X^2) = 0^2 \cdot (1-p) + 1^2 \cdot p = p$;

$\mathrm{Var}(X) = E(X^2) - E(X)^2 = p(1-p).$

Binomial distribution

Many problems can be solved by determining the probability of k successes when an experiment consists of n mutually independent Bernoulli trials.

Let the r.v. $X_i$ be the outcome of the i-th trial (i = 1, 2, ..., n), indicating whether that trial is a success. Hence, we have

$$X_i = \begin{cases} 1, & \text{with probability } p \text{ (e.g., we obtain a head)}; \\ 0, & \text{otherwise, with probability } 1 - p. \end{cases}$$

Let the r.v. $X = \sum_{i=1}^{n} X_i$. We have

$$P(X = x \mid n, p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \dots, n.$$

We call this function the binomial distribution, i.e., $B(k; n, p) = P(X = k) = C(n, k)\, p^k q^{n-k}$, where q = 1 − p.

Expected value of Binomial r.v.s

Theorem

The expected number of successes when n mutually independent Bernoulli trials are performed, where p is the probability of success on each trial, is np.

Proof.

Let X be the r.v. equal to the number of successes in n trials. We know that $P(X = k) = C(n, k)\, p^k q^{n-k}$. Hence, using $k\, C(n, k) = n\, C(n-1, k-1)$, we have

$$E(X) = \sum_{k=0}^{n} k \cdot P(X = k) = \sum_{k=1}^{n} k \cdot C(n, k)\, p^k q^{n-k}$$

$$= \sum_{k=1}^{n} n \binom{n-1}{k-1} p^k q^{n-k} = np \sum_{j=0}^{n-1} \binom{n-1}{j} p^j q^{n-1-j}$$

$$= np\,(p + q)^{n-1} = np.$$

Variance of Binomial r.v.s

Question: Let the r.v. X be the number of successes in n mutually independent Bernoulli trials, where p is the probability of success on each trial. What is the variance of X?

Solution:

$$E(X^2) = \sum_{k=0}^{n} k^2 \cdot P(X = k) = \sum_{k=0}^{n} k(k-1) \cdot P(X = k) + \sum_{k=0}^{n} k \cdot P(X = k)$$

$$= n(n-1)p^2 \sum_{k=2}^{n} \binom{n-2}{k-2} p^{k-2} q^{n-k} + np$$

$$= n(n-1)p^2 \sum_{j=0}^{n-2} \binom{n-2}{j} p^{j} q^{n-2-j} + np$$

$$= n(n-1)p^2 (p + q)^{n-2} + np = n(n-1)p^2 + np,$$

$$V(X) = E(X^2) - (E(X))^2 = n(n-1)p^2 + np - (np)^2 = np(1-p).$$
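
The binomial derivation can be sanity-checked numerically by summing against the pmf. The sketch below is only an illustration: the helper name `binomial_moments` and the parameters n = 20, p = 0.25 are assumptions made for this example. It computes E(X), the factorial moment E[X(X−1)] used in the derivation, and the variance.

```python
from math import comb

def binomial_moments(n, p):
    """Return (E[X], E[X(X-1)], Var[X]) for X ~ binomial(n, p) by direct summation."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    factorial2 = sum(k * (k - 1) * pk for k, pk in enumerate(pmf))
    # E(X^2) = E[X(X-1)] + E(X), so Var(X) = E[X(X-1)] + E(X) - E(X)^2.
    var = factorial2 + mean - mean**2
    return mean, factorial2, var

# Illustrative parameters (not from the slides).
mean, factorial2, var = binomial_moments(20, 0.25)
print(mean, factorial2, var)
```

Up to floating-point error, the three values should match np, n(n−1)p², and np(1−p).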

Geometric distribution

Let the r.v. Y be the number of trials until the first success is obtained in independent Bernoulli trials.

$$P(Y = k) = P(X_1 = 0 \wedge X_2 = 0 \wedge \cdots \wedge X_{k-1} = 0 \wedge X_k = 1)$$

$$= \prod_{i=1}^{k-1} P(X_i = 0) \cdot P(X_k = 1) = p\,q^{k-1}, \quad k = 1, 2, \dots$$

We call this function the geometric distribution, i.e.,

$$G(k; p) = p\,q^{k-1}.$$

The geometric distribution is sometimes used to model "lifetimes" or "time until failure" of components.

For example, if the probability is 0.001 that a light bulb will fail on any given day, what is the probability that it will last at least 30 days?
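
One way to work the light-bulb question is sketched below. It assumes that "failure" plays the role of the "success" event with p = 0.001, and that "lasting at least 30 days" means no failure on days 1 through 30, i.e. P(Y > 30); a different reading of the question would shift the exponent by one. The variable names are illustrative.

```python
p = 0.001      # daily failure probability, from the example
q = 1 - p
days = 30

# Assumed reading: the bulb survives days 1..30, i.e. P(Y > 30) = q^30.
survive = q ** days

# The same value from the geometric pmf: 1 - sum_{k=1}^{30} p q^(k-1).
via_pmf = 1 - sum(p * q ** (k - 1) for k in range(1, days + 1))

print(round(survive, 4), round(via_pmf, 4))
```

Under this reading the probability is roughly 0.97, so an early failure is unlikely.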

Expectation of Geometric r.v.s

Theorem

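If a r.v. X follows a geometric distribution, then $E(X) = \frac{1}{p}$ and $\mathrm{Var}(X) = \frac{q}{p^2}$, where p is the probability of success on each trial.

Proof.

We know that $P(X = k) = q^{k-1} p$ for k = 1, 2, .... Writing $k = \sum_{m=1}^{k} 1$ and exchanging the order of summation, we have

$$E(X) = \sum_{k=1}^{\infty} k \cdot q^{k-1} p = p \left( \sum_{m=1}^{\infty} \sum_{k=m}^{\infty} q^{k-1} \right)$$

$$= p \left( \sum_{m=1}^{\infty} \frac{q^{m-1}}{1-q} \right) = \sum_{m=1}^{\infty} q^{m-1} = \frac{1}{1-q} = \frac{1}{p}.$$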

Variance of Geometric r.v.s

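Using $k(k-1) = 2\sum_{j=1}^{k-1} j$ and $E(X) = \frac{1}{p}$, we have

$$E(X^2) = \sum_{k=1}^{\infty} k^2 \cdot P(X = k) = \sum_{k=1}^{\infty} [k(k-1) + k] \cdot P(X = k)$$

$$= p \sum_{k=2}^{\infty} \left( 2 \sum_{j=1}^{k-1} j \right) q^{k-1} + \frac{1}{p}$$

$$= 2p \sum_{j=1}^{\infty} \sum_{k=j+1}^{\infty} \left( j\, q^{k-1} \right) + \frac{1}{p} = 2 \sum_{j=1}^{\infty} j\, q^{j} + \frac{1}{p} = \frac{2q}{p^2} + \frac{1}{p} = \frac{2q + p}{p^2},$$

$$V(X) = \frac{2q + p}{p^2} - \left( \frac{1}{p} \right)^2 = \frac{2q - (1-p)}{p^2} = \frac{q}{p^2}.$$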
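
As a numerical cross-check of the geometric mean and variance, the infinite sums can be truncated once the remaining tail is negligible. This is only a sketch: the function name `geometric_moments`, the truncation point, and the value p = 0.2 are assumptions made for illustration.

```python
def geometric_moments(p, terms=10_000):
    """Approximate (E[X], Var[X]) for P(X = k) = p q^(k-1), k >= 1, by a truncated sum."""
    q = 1 - p
    mean = sum(k * p * q ** (k - 1) for k in range(1, terms + 1))
    second = sum(k * k * p * q ** (k - 1) for k in range(1, terms + 1))
    return mean, second - mean ** 2

# Illustrative value p = 0.2, for which 1/p = 5 and q/p^2 = 20.
mean, var = geometric_moments(0.2)
print(mean, var)
```

The truncation is harmless here because the terms decay geometrically; for very small p a larger `terms` value would be needed.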

Hypergeometric distributions

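Suppose we have a large urn filled with N balls that are identical in every way except that M are red and N − M are green. Let the r.v. X be the number of red balls in a sample of size K drawn without replacement.

The r.v. X has a hypergeometric distribution given by

$$P(X = x \mid N, M, K) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}}, \quad x = 0, 1, \dots, K,$$

with the convention that $\binom{a}{b} = 0$ when b > a.

Since $\sum_{x=0}^{K} \binom{M}{x}\binom{N-M}{K-x} = \binom{N}{K}$, the probabilities sum to one:

$$\sum_{x=0}^{K} P(X = x) = \sum_{x=0}^{K} \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}} = 1.$$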

Hypergeometric distributions Cont’d

Expectation and variance

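$$E(X) = \sum_{x=0}^{K} x\, \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}} = \sum_{x=1}^{K} \frac{M \binom{M-1}{x-1}\binom{N-M}{K-x}}{\frac{N}{K}\binom{N-1}{K-1}}$$

$$= \frac{KM}{N} \sum_{y=0}^{K-1} \frac{\binom{M-1}{y}\binom{(N-1)-(M-1)}{K-1-y}}{\binom{N-1}{K-1}} = \frac{KM}{N};$$

$$\mathrm{Var}(X) = \frac{KM}{N} \cdot \frac{(N-M)(N-K)}{N(N-1)}.$$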
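
The hypergeometric formulas can be verified exactly for a small urn. The sketch below uses rational arithmetic; the function name `hypergeom_pmf` and the urn sizes N = 20, M = 7, K = 5 are assumptions chosen only for illustration.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(X = x | N, M, K): x red balls in a sample of K from M red and N - M green."""
    # math.comb returns 0 when the lower index exceeds the upper one.
    return Fraction(comb(M, x) * comb(N - M, K - x), comb(N, K))

# Illustrative urn: 20 balls, 7 red, sample of 5.
N, M, K = 20, 7, 5
pmf = [hypergeom_pmf(x, N, M, K) for x in range(K + 1)]

total = sum(pmf)
mean = sum(x * px for x, px in enumerate(pmf))
var = sum(x * x * px for x, px in enumerate(pmf)) - mean ** 2
print(total, mean, var)
```

The exact results can be compared against KM/N and (KM/N)(N−M)(N−K)/(N(N−1)).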

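Poisson distribution

A r.v. X, taking values in the nonnegative integers, has a Poisson(λ) distribution if

$$P(X = x \mid \lambda) = \frac{\lambda^x}{x!} e^{-\lambda}, \quad x = 0, 1, 2, \dots$$

Note that $\sum_{k=0}^{\infty} \frac{x^k}{k!} = e^x$;

$$E(X) = \sum_{x=0}^{\infty} x\, \frac{\lambda^x}{x!} e^{-\lambda} = \lambda e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} = \lambda;$$

$$E(X^2) = \sum_{x=0}^{\infty} x^2\, \frac{\lambda^x}{x!} e^{-\lambda} = \sum_{x=1}^{\infty} [x(x-1) + x]\, \frac{\lambda^x}{x!} e^{-\lambda} = \lambda^2 + \lambda;$$

$$\mathrm{Var}(X) = E(X^2) - E(X)^2 = \lambda.$$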
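
A short numerical check that the Poisson mean and variance both equal λ is sketched below. The function name `poisson_moments`, the truncation at 200 terms, and the value λ = 3.5 are illustrative assumptions. The pmf is built recursively to avoid large factorials.

```python
from math import exp

def poisson_moments(lam, terms=200):
    """Approximate (E[X], Var[X]) for X ~ Poisson(lam) by a truncated sum."""
    pmf = exp(-lam)            # P(X = 0)
    mean = second = 0.0
    for x in range(terms):
        mean += x * pmf
        second += x * x * pmf
        pmf *= lam / (x + 1)   # P(X = x + 1) = P(X = x) * lam / (x + 1)
    return mean, second - mean ** 2

# Illustrative value of lambda.
mean, var = poisson_moments(3.5)
print(mean, var)
```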

Continuous Distributions

Continuous uniform distribution

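The continuous uniform distribution is defined by spreading mass uniformly over an interval [a, b]. Its pdf is given by

$$f(x \mid a, b) = \begin{cases} \frac{1}{b-a}, & \text{if } x \in [a, b]; \\ 0, & \text{otherwise.} \end{cases}$$

$$E(X) = \int_a^b \frac{x}{b-a}\, dx = \frac{a+b}{2};$$

$$\mathrm{Var}(X) = \int_a^b \frac{\left(x - \frac{a+b}{2}\right)^2}{b-a}\, dx = \frac{(b-a)^2}{12}.$$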

Gamma distribution

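Note that, if α > 0, then $\int_0^{+\infty} t^{\alpha-1} e^{-t}\, dt < \infty$. Let $\Gamma(\alpha) = \int_0^{+\infty} t^{\alpha-1} e^{-t}\, dt$ denote the gamma function. Then

$\Gamma(\alpha + 1) = \alpha\, \Gamma(\alpha)$;

$\Gamma(n) = (n-1)!$ for any positive integer n;

$\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.

The gamma distribution is defined on the interval [0, +∞). Its pdf is given by

$$f(x \mid \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha-1} e^{-x/\beta}, \quad 0 \le x < \infty, \ \alpha > 0, \ \beta > 0.$$

$$E(X) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{+\infty} x^{\alpha} e^{-x/\beta}\, dx = \alpha\beta;$$

$$\mathrm{Var}(X) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{+\infty} x^{\alpha+1} e^{-x/\beta}\, dx - (\alpha\beta)^2 = \alpha\beta^2.$$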
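
The three gamma-function facts used here (the recursion Γ(α+1) = αΓ(α), the factorial identity, and the value at 1/2) can be spot-checked with the standard library's `math.gamma`. This is a small sketch; the test value α = 2.7 is an arbitrary assumption.

```python
from math import factorial, gamma, isclose, pi, sqrt

alpha = 2.7  # arbitrary positive value, chosen only for illustration

# Gamma(alpha + 1) = alpha * Gamma(alpha)
recursion_ok = isclose(gamma(alpha + 1), alpha * gamma(alpha))

# Gamma(n) = (n - 1)! for positive integers n
factorial_ok = all(isclose(gamma(n), factorial(n - 1)) for n in range(1, 11))

# Gamma(1/2) = sqrt(pi)
half_ok = isclose(gamma(0.5), sqrt(pi))

print(recursion_ok, factorial_ok, half_ok)
```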

Special cases of Gamma distribution

Chi squared distribution

Let α = p/2, where p ∈ Z+, and β = 2. Then the pdf is given by

$$f(x \mid p) = \frac{1}{\Gamma(\frac{p}{2})\, 2^{p/2}}\, x^{\frac{p}{2}-1} e^{-x/2}, \quad 0 \le x < \infty,$$

which is the chi squared pdf with p degrees of freedom. Note that E(X) = p and Var(X) = 2p.

Exponential distribution

If we set α = 1 in the gamma distribution, then the pdf is given by

$$f(x \mid \beta) = \frac{1}{\beta}\, e^{-x/\beta}, \quad 0 \le x < \infty.$$

Note that E(X) = β and Var(X) = β².

Normal distribution/Gaussian distribution

The pdf of the normal distribution with mean µ and variance σ², denoted as N(µ, σ²), is given by

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty.$$

E(X) = µ, Var(X) = σ²;

The standard normal distribution is N(0, 1) with pdf

$$f(x \mid 0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}};$$

If X ∼ N(µ, σ²), the r.v. $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$;

$$P(|X - \mu| \le \sigma) = P(|Z| \le 1) = 0.6826; \tag{1}$$

$$P(|X - \mu| \le 2\sigma) = P(|Z| \le 2) = 0.9544; \tag{2}$$

$$P(|X - \mu| \le 3\sigma) = P(|Z| \le 3) = 0.9974. \tag{3}$$
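
The one-, two-, and three-sigma probabilities can be recomputed from the error function, since P(|Z| ≤ k) = erf(k/√2) for a standard normal Z. The sketch below uses only the standard library; the helper name `prob_within` is an assumption. The values quoted above appear to come from a rounded table, so the computed figures may differ from them in the fourth decimal place.

```python
from math import erf, sqrt

def prob_within(k):
    """P(|Z| <= k) for Z ~ N(0, 1), via the error function."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```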

Normal approximation

Let X ∼ binomial(25, 0.6). We can approximate X with a normal r.v. Y with mean µ = 25 × 0.6 = 15 and standard deviation $\sigma = \sqrt{25 \times 0.6 \times (1 - 0.6)} = 2.45$. Thus

$$P(X \le 13) \approx P(Y \le 13) = P\left(Z \le \frac{13 - 15}{2.45}\right) \tag{4}$$

$$= P(Z \le -0.82) = 0.206; \tag{5}$$

whereas the exact binomial probability is

$$P(X \le 13) = \sum_{x=0}^{13} \binom{25}{x} 0.6^x\, 0.4^{25-x} = 0.267. \tag{6}$$

In general, if X ∼ binomial(n, p), then E(X) = np and Var(X) = np(1 − p), and we can approximate the distribution with N(np, np(1 − p)).
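
The comparison between the normal approximation and the exact binomial probability can be reproduced as follows. This is a sketch using only the standard library; the helper `normal_cdf` is an assumed name. Because z is not rounded to −0.82 here, the approximate value may differ slightly from the figure quoted above.

```python
from math import comb, erf, sqrt

def normal_cdf(z):
    """Standard normal cdf, expressed through the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p, cutoff = 25, 0.6, 13
mu = n * p                        # 15
sigma = sqrt(n * p * (1 - p))     # about 2.45

# Normal approximation P(Y <= 13) with Y ~ N(mu, sigma^2).
approx = normal_cdf((cutoff - mu) / sigma)

# Exact binomial probability P(X <= 13).
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(cutoff + 1))

print(round(approx, 3), round(exact, 3))
```

The gap between the two numbers illustrates that, for n = 25, the plain approximation is fairly rough.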

Beta distribution

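The beta family of distributions is a continuous family on (0, 1) indexed by two parameters. The Beta(α, β) pdf is

$$f(x \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 < x < 1, \ \alpha > 0, \ \beta > 0,$$

where B(α, β) denotes the beta function,

$$B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\, dx.$$

$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$.

$E(X) = \frac{\alpha}{\alpha+\beta}$, and $\mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$;

$E(X^n) = \frac{\Gamma(\alpha+n)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+n)\,\Gamma(\alpha)}$.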
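
The general moment formula for the beta distribution implies the stated mean and variance, which the following sketch checks numerically. The function name `beta_moment` and the parameter values α = 2, β = 5 are assumptions made for illustration.

```python
from math import gamma

def beta_moment(n, a, b):
    """E(X^n) for X ~ Beta(a, b), from the gamma-function formula."""
    return gamma(a + n) * gamma(a + b) / (gamma(a + b + n) * gamma(a))

# Illustrative parameters.
a, b = 2.0, 5.0
mean = beta_moment(1, a, b)
var = beta_moment(2, a, b) - mean ** 2

print(mean, var)
```

The results can be compared with α/(α+β) and αβ/((α+β)²(α+β+1)).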

Cauchy distribution

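The Cauchy distribution is a symmetric, bell-shaped distribution on (−∞, +∞) with pdf

$$f(x \mid \theta) = \frac{1}{\pi}\, \frac{1}{1 + (x - \theta)^2}, \quad -\infty < x < +\infty, \ -\infty < \theta < +\infty.$$

The mean does not exist, since $E|X| = \int_{-\infty}^{+\infty} \frac{1}{\pi}\, \frac{|x|}{1 + (x-\theta)^2}\, dx = \infty$;

The parameter θ does measure the center of the distribution; it is the median:

$P(X \ge \theta) = \frac{1}{2}$.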
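
To see why E|X| diverges, it may help to truncate the integral. The short computation below treats the case θ = 0, chosen here only to simplify the algebra; T is an auxiliary truncation point that does not appear in the slides.

```latex
\int_{-T}^{T} \frac{1}{\pi}\,\frac{|x|}{1+x^{2}}\,dx
  = \frac{2}{\pi}\int_{0}^{T} \frac{x}{1+x^{2}}\,dx
  = \frac{1}{\pi}\log\left(1+T^{2}\right)
  \longrightarrow \infty \quad \text{as } T \to \infty .
```

The truncated integral grows only logarithmically, but it is unbounded, so the expectation is not finite.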

Lognormal distribution

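If X is a r.v. whose logarithm is normally distributed, that is, log X ∼ N(µ, σ²), then X has a lognormal distribution. Its pdf is

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \frac{1}{x}\, e^{-\frac{(\log x - \mu)^2}{2\sigma^2}}, \quad x > 0, \ -\infty < \mu < +\infty, \ \sigma > 0.$$

$E(X) = E(e^{\log X}) = e^{\mu + \sigma^2/2}$;

$\mathrm{Var}(X) = e^{2(\mu + \sigma^2)} - e^{2\mu + \sigma^2}$.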

Exponential Family

Exponential family

A family of pdfs or pmfs is called an exponential family if it can be expressed as

$$f(x \mid \theta) = h(x)\, c(\theta) \exp\left( \sum_{i=1}^{k} w_i(\theta)\, t_i(x) \right).$$

Here h(x) ≥ 0 and the $t_i(x)$ are real-valued functions of the observation x, and c(θ) ≥ 0 and the $w_i(\theta)$ are real-valued functions of the possibly vector-valued parameter θ.

To verify that a family of pdfs or pmfs is an exponential family, we must identify the functions h(x), c(θ), $w_i(\theta)$, and $t_i(x)$ and show that the family has the above form.

The Bernoulli, Gaussian, binomial (with n fixed), Poisson, exponential, Weibull (with known shape), Laplace (with known location), gamma, beta, multinomial, and Wishart distributions are all exponential families.

Example: binomial distribution

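Consider the binomial(n, p) family, where n ∈ Z+ is fixed and 0 < p < 1:

$$f(x \mid p) = \binom{n}{x} p^x (1-p)^{n-x} = \binom{n}{x} (1-p)^n \left( \frac{p}{1-p} \right)^x = \binom{n}{x} (1-p)^n \exp\left( x \log\left( \frac{p}{1-p} \right) \right) \tag{7}$$

Define

$$h(x) = \begin{cases} \binom{n}{x}, & x = 0, 1, \dots, n; \\ 0, & \text{otherwise,} \end{cases} \qquad c(p) = (1-p)^n, \tag{8}$$

$$w_1(p) = \log\left( \frac{p}{1-p} \right), \qquad t_1(x) = x. \tag{9}$$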
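
The factorization of the binomial pmf can be checked numerically by evaluating h(x) c(p) exp(w₁(p) t₁(x)) and comparing it with the pmf itself. The function names and the parameters n = 8, p = 0.35 below are assumptions made for this sketch.

```python
from math import comb, exp, isclose, log

def binomial_pmf(x, n, p):
    """Binomial pmf in its usual form."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_expfam(x, n, p):
    """The same pmf written as h(x) * c(p) * exp(w1(p) * t1(x))."""
    h = comb(n, x) if 0 <= x <= n else 0   # h(x)
    c = (1 - p) ** n                       # c(p)
    w1 = log(p / (1 - p))                  # w1(p), the log-odds
    t1 = x                                 # t1(x)
    return h * c * exp(w1 * t1)

# Illustrative parameters.
n, p = 8, 0.35
agree = all(isclose(binomial_pmf(x, n, p), binomial_expfam(x, n, p)) for x in range(n + 1))
print(agree)
```

Note that the factorization requires 0 < p < 1, since the log-odds are undefined at the endpoints.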

Example: normal distribution

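Consider the normal family N(µ, σ²):

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\mu^2}{2\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2}} \tag{10}$$

Define

$$h(x) = 1, \qquad c(\theta) = c(\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\mu^2}{2\sigma^2}},$$

$$w_1(\mu, \sigma) = \frac{1}{\sigma^2}, \qquad w_2(\mu, \sigma) = \frac{\mu}{\sigma^2},$$

$$t_1(x) = -\frac{x^2}{2}, \qquad t_2(x) = x. \tag{11}$$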

Theorem

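If X is a r.v. with pdf or pmf of the exponential family form $f(x \mid \theta) = h(x)\, c(\theta) \exp\left( \sum_{i=1}^{k} w_i(\theta)\, t_i(x) \right)$, then

$$E\left( \sum_{i=1}^{k} \frac{\partial w_i(\theta)}{\partial \theta_j}\, t_i(X) \right) = -\frac{\partial}{\partial \theta_j} \log c(\theta);$$

$$\mathrm{Var}\left( \sum_{i=1}^{k} \frac{\partial w_i(\theta)}{\partial \theta_j}\, t_i(X) \right) = -\frac{\partial^2}{\partial \theta_j^2} \log c(\theta) - E\left( \sum_{i=1}^{k} \frac{\partial^2 w_i(\theta)}{\partial \theta_j^2}\, t_i(X) \right).$$

The advantage of these identities is that we can replace integration or summation by differentiation, which is often more straightforward.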

Binomial mean and variance

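For the binomial distribution, we have

$$\frac{d}{dp} w_1(p) = \frac{d}{dp} \log \frac{p}{1-p} = \frac{1}{p(1-p)}, \tag{15}$$

$$\frac{d}{dp} \log c(p) = \frac{d}{dp}\, n \log(1-p) = \frac{-n}{1-p}.$$

Thus, we have

$$E\left( \frac{1}{p(1-p)}\, X \right) = \frac{n}{1-p},$$

and hence E(X) = np. The variance identity yields Var(X) = np(1 − p) in the same way.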
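
Both identities from the theorem can be checked numerically for the binomial family, where k = 1, w₁(p) = log(p/(1−p)), t₁(x) = x, and c(p) = (1−p)ⁿ. The sketch below hard-codes the derivatives of w₁ and log c (worked out by hand for this example) and compares each side of the mean and variance identities. The function name and the parameters n = 12, p = 0.3 are assumptions.

```python
from math import comb, isclose

def binomial_identity_sides(n, p):
    """Return (lhs, rhs) pairs for the mean and variance identities, binomial(n, p)."""
    pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
    mean = sum(x * px for x, px in enumerate(pmf))
    var = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

    dw = 1 / (p * (1 - p))                   # d/dp   log(p / (1 - p))
    d2w = (2 * p - 1) / (p * (1 - p)) ** 2   # d2/dp2 log(p / (1 - p))
    dlogc = -n / (1 - p)                     # d/dp   n log(1 - p)
    d2logc = -n / (1 - p) ** 2               # d2/dp2 n log(1 - p)

    mean_sides = (dw * mean, -dlogc)                  # E(w1' X) vs -(log c)'
    var_sides = (dw ** 2 * var, -d2logc - d2w * mean) # Var(w1' X) vs -(log c)'' - E(w1'' X)
    return mean_sides, var_sides

# Illustrative parameters.
mean_sides, var_sides = binomial_identity_sides(12, 0.3)
print(isclose(mean_sides[0], mean_sides[1]), isclose(var_sides[0], var_sides[1]))
```

Here the moments on the left-hand sides are computed by summation, which is exactly the work the theorem lets us avoid.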

Take-aways

Conclusions

Discrete distributions

Continuous distributions

Exponential family

Location and scale families

Inequalities
