7/27/2019 Stat Infer
LECTURE NOTES ON STATISTICAL INFERENCE
KRZYSZTOF PODGÓRSKI
Department of Mathematics and Statistics
University of Limerick, Ireland
November 23, 2009
Contents
1 Introduction 4
1.1 Models of Randomness and Statistical Inference . . . . . . . . . . . . 4
1.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Probability vs. likelihood . . . . . . . . . . . . . . . . . . . . 8
1.2.2 More data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Likelihood and theory of statistics . . . . . . . . . . . . . . . . . . . 15
1.4 Computationally intensive methods of statistics . . . . . . . . . . . . 15
1.4.1 Monte Carlo methods: studying statistical methods using computer generated random samples . . . . . . . . . 16
1.4.2 Bootstrap: performing statistical inference using computers . . 18
2 Review of Probability 21
2.1 Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Distribution of a Function of a Random Variable . . . . . . . . . . . . 22
2.3 Transforms Method: Characteristic, Probability Generating and Moment Generating Functions . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Sums of Independent Random Variables . . . . . . . . . . . . 26
2.4.2 Covariance and Correlation . . . . . . . . . . . . . . . . . . 27
2.4.3 The Bivariate Change of Variables Formula . . . . . . . . . . 28
2.5 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . 29
2.5.3 Negative Binomial and Geometric Distribution . . . . . . . . 30
2.5.4 Hypergeometric Distribution . . . . . . . . . . . . . . . . . 31
2.5.5 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . 32
2.5.6 Discrete Uniform Distribution . . . . . . . . . . . . . . . . . 33
2.5.7 The Multinomial Distribution . . . . . . . . . . . . . . . . . 33
2.6 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . 34
2.6.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . 34
2.6.2 Exponential Distribution . . . . . . . . . . . . . . . . . . . . 35
2.6.3 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . 35
2.6.4 Gaussian (Normal) Distribution . . . . . . . . . . . . . . . . 36
2.6.5 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . 38
2.6.6 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.7 Chi-square Distribution . . . . . . . . . . . . . . . . . . . . . 39
2.6.8 The Bivariate Normal Distribution . . . . . . . . . . . . . . . 39
2.6.9 The Multivariate Normal Distribution . . . . . . . . . . . . . 40
2.7 Distributions: further properties . . . . . . . . . . . . . . . . . . . 42
2.7.1 Sum of Independent Random Variables: special cases . . . . 42
2.7.2 Common Distributions: Summarizing Tables . . . . . . . . 45
3 Likelihood 48
3.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . 48
3.2 Multi-parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 The Invariance Principle . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Estimation 61
4.1 General properties of estimators . . . . . . . . . . . . . . . . . . . . 61
4.2 Minimum-Variance Unbiased Estimation . . . . . . . . . . . . . . . . 64
4.3 Optimality Properties of the MLE . . . . . . . . . . . . . . . . . . . 69
5 The Theory of Confidence Intervals 71
5.1 Exact Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Pivotal Quantities for Use with Normal Data . . . . . . . . . . . . . . 75
5.3 Approximate Confidence Intervals . . . . . . . . . . . . . . . . . . . 80
6 The Theory of Hypothesis Testing 87
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Hypothesis Testing for Normal Data . . . . . . . . . . . . . . . . . . 92
6.3 Generally Applicable Test Procedures . . . . . . . . . . . . . . . . . 97
6.4 The Neyman-Pearson Lemma . . . . . . . . . . . . . . . . . . . . . 101
6.5 Goodness of Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.6 The χ² Test for Contingency Tables . . . . . . . . . . . . . . . . 109
Chapter 1
Introduction
Everything existing in the universe is the fruit of chance.
Democritus, the 5th Century BC
1.1 Models of Randomness and Statistical Inference
Statistics is a discipline that provides a methodology for making inference from real random data about the parameters of the probabilistic models that are believed to generate such data. The position of statistics in relation to real-world data and the corresponding mathematical models of probability theory is presented in the following diagram.
The following is a list of a few of the many phenomena to which randomness is attributed.
Games of chance
Tossing a coin
Rolling a die
Playing Poker
Natural Sciences
[Diagram] Real World: Random Phenomena, Data Samples, Prediction and Discovery; Science & Mathematics: Probability Theory, Models, Statistical Inference; the two columns are linked through Statistics.
Figure 1.1: Position of statistics in the context of real world phenomena and mathematical models representing them.
Physics (notably Quantum Physics)
Genetics
Climate
Engineering
Risk and safety analysis
Ocean engineering
Economics and Social Sciences
Currency exchange rates
Stock market fluctuations
Insurance claims
Polls and election results
etc.
1.2 Motivating Example
Let X denote the number of particles that will be emitted from a radioactive source
in the next one minute period. We know that X will turn out to be equal to one of
the non-negative integers but, apart from that, we know nothing about which of the
possible values are more or less likely to occur. The quantity X is said to be a random
variable.
Suppose we are told that the random variable X has a Poisson distribution with parameter θ = 2. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x) = θ^x exp(−θ)/x!,

where θ = 2. So, for instance, the probability that X takes the value x = 4 is

P(X = 4) = 2⁴ exp(−2)/4! = 0.0902.
We have here a probability model for the random variable X. Note that we are using upper case letters for random variables and lower case letters for the values taken by random variables. We shall persist with this convention throughout the course.
Let us still assume that the random variable X has a Poisson distribution with parameter θ, but where θ is some unspecified positive number. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x|θ) = θ^x exp(−θ)/x!,   (1.1)

for θ ∈ ℝ₊. However, we cannot calculate probabilities such as the probability that X takes the value x = 4 without knowing the value of θ.
Suppose that, in order to learn something about the value of θ, we decide to measure the value of X for each of the next 5 one minute time periods. Let us use the notation X1 to denote the number of particles emitted in the first period, X2 to denote the number emitted in the second period, and so forth. We shall end up with data consisting of a random vector X = (X1, X2, . . . , X5). Consider x = (x1, x2, x3, x4, x5) = (2, 1, 0, 3, 4). Then x is a possible value for the random vector X. We know that the probability that X1 takes the value x1 = 2 is given by the formula

P(X1 = 2|θ) = θ² exp(−θ)/2!

and similarly that the probability that X2 takes the value x2 = 1 is given by

P(X2 = 1|θ) = θ exp(−θ)/1!

and so on. However, what about the probability that X takes the value x? In order for this probability to be specified we need to know something about the joint distribution of the random variables X1, X2, . . . , X5. A simple assumption to make is that the random variables X1, X2, . . . , X5 are mutually independent. (Note that this assumption may not be correct, since X2 may tend to be more similar to X1 than it would be to X5.) However, with this assumption we can say that the probability that X takes the value x
is given by

P(X = x|θ) = ∏_{i=1}^5 θ^{xi} exp(−θ)/xi!
           = [θ² exp(−θ)/2!] · [θ exp(−θ)/1!] · [θ⁰ exp(−θ)/0!] · [θ³ exp(−θ)/3!] · [θ⁴ exp(−θ)/4!]
           = θ^{10} exp(−5θ)/288.

In general, if x = (x1, x2, x3, x4, x5) is any vector of 5 non-negative integers, then the probability that X takes the value x is given by

P(X = x|θ) = ∏_{i=1}^5 θ^{xi} exp(−θ)/xi! = θ^{Σ_{i=1}^5 xi} exp(−5θ) / ∏_{i=1}^5 xi!.

We have here a probability model for the random vector X.
Our plan is to use the value x of X that we actually observe to learn something about the value of θ. The ways and means to accomplish this task make up the subject matter of this course. The central tool for various statistical inference techniques is the likelihood method. Below we present a simple introduction to it using the Poisson model for radioactive decay.
1.2.1 Probability vs. likelihood
In the introduced Poisson model, for a given θ, say θ = 2, we have a function p(x) of the probabilities of observing the values x = 0, 1, 2, . . . . This function is referred to as the probability mass function. Its graph is presented below.
Such a function can be utilized in bidding for the recorded result of future experiments. If one wants to optimize the chances of correctly predicting the future, the choice of the number of recorded particles would be either 1 or 2.
So far, we have been told that the random variable X has a Poisson distribution with parameter θ, where θ is some positive number, and there are physical reasons to assume
[Figure] Figure 1.2: Probability mass function for the Poisson model with θ = 2.
that such a model is correct. However, we have arbitrarily set θ = 2 and this is more questionable. How can we know that it is a correct value of the parameter? Let us analyze this issue in detail.
If x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x|θ) = θ^x e^{−θ}/x!,

for θ > 0. But without knowing the true value of θ, we cannot calculate probabilities such as the probability that X takes the value x = 1.
Suppose that, in order to learn something about the value of θ, an experiment is performed and a value of X = 5 is recorded. Let us take a look at the probability mass function for θ = 2 in Figure 1.2. What is the probability of X taking the value 5? Do we like what we see? Why? Would you bet 1 or 2 in the next experiment?
We certainly have some serious doubt about our choice of θ = 2, which was arbitrary anyway. One can consider, for example, θ = 7 as an alternative to θ = 2. Here are graphs of the pmf for the two cases. Which of the two choices do we like? Since it
[Figure] Figure 1.3: The probability mass function for the Poisson model with θ = 2 vs. the one with θ = 7.
was more probable to get X = 5 under the assumption θ = 7 than when θ = 2, we say that θ = 7 is more likely to produce X = 5 than θ = 2. Based on this observation we can develop a general strategy for choosing θ.
Let us summarize our position. So far we know (or assume) about the radioactive emission that it follows the Poisson model with some unknown θ > 0, and the value x = 5 has been observed once. Our goal is to utilize this knowledge somehow. First, we note that the Poisson model is in fact a function not only of x but also of θ:

p(x|θ) = θ^x e^{−θ}/x!.

Let us plug in the observed x = 5, so that we get a function of θ that is called the likelihood function

l(θ) = θ⁵ e^{−θ}/120.

Its graph is presented in the next figure. Can you locate on this graph the values of the probabilities that were used to choose θ = 7 over θ = 2? What value of θ appears to be the most preferable if the same argument is extended to all possible values of θ? We observe that the value θ = 5 is the most likely to produce the value x = 5. As a result of our likelihood approach we have used the data x = 5 and the Poisson model to make an inference, an example of statistical inference.
[Figure] Figure 1.4: Likelihood function for the Poisson model when the observed value is x = 5.
Likelihood: the Poisson model backward
The Poisson model can be stated as a probability mass function that maps possible values x into probabilities p(x), or, if we emphasize the dependence on θ, into p(x|θ), given below:

p(x|θ) = l(θ|x) = θ^x e^{−θ}/x!.

With the Poisson model with given θ, one can compute the probabilities that various possible numbers x of emitted particles will be recorded, i.e. we consider

x ↦ p(x|θ)

with θ fixed. We get the answer how probable various outcomes x are.
With the Poisson model where x is observed and thus fixed, one can evaluate how likely it would be to get x under various values of θ, i.e. we consider

θ ↦ l(θ|x)

with x fixed. We get the answer how likely it is that various θ could have produced the observed x.
Exercise 1. For the general Poisson model

p(x|θ) = l(θ|x) = θ^x e^{−θ}/x!,

1. for a given θ > 0 find the most probable value of the observation x;
2. for a given observation x find the most likely value of θ.
Give a mathematical argument for your claims.
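The likelihood reasoning above is easy to check numerically. The exercises in these notes use R; the following is an illustrative Python sketch (all function names are invented here) that evaluates l(θ) = θ⁵e^{−θ}/5! on a grid of θ values and locates its maximum:

```python
import math

def poisson_likelihood(theta, x=5):
    """l(theta) = theta^x e^(-theta)/x! for a single observation x."""
    return theta ** x * math.exp(-theta) / math.factorial(x)

grid = [i / 100 for i in range(1, 1501)]   # theta = 0.01, 0.02, ..., 15.00
best = max(grid, key=poisson_likelihood)
print(best)  # 5.0: the likelihood is maximized at the observed value
```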
1.2.2 More data
Suppose that we perform another measurement of the number of emitted particles. Let
us use the notation X1 to denote the number of particles emitted in the first period, X2
to denote the number emitted in the second period. We shall end up with data consisting
of a random vector X = (X1, X2). The second measurement yielded x2 = 2, so that
x = (x1, x2) = (5, 2).
We know that the probability that X1 takes the value x1 = 5 is given by the formula

P(X1 = 5|θ) = θ⁵ e^{−θ}/5!

and similarly that the probability that X2 takes the value x2 = 2 is given by

P(X2 = 2|θ) = θ² e^{−θ}/2!.

However, what about the probability that X takes the value x = (5, 2)? In order for this probability to be specified we need to know something about the joint distribution of the random variables X1, X2. A simple assumption to make is that the random variables X1, X2 are mutually independent. In such a case the probability that X takes the value x = (x1, x2) is given by

P(X = (x1, x2)|θ) = [θ^{x1} e^{−θ}/x1!] · [θ^{x2} e^{−θ}/x2!] = e^{−2θ} θ^{x1+x2}/(x1! x2!).

After a little algebra we easily find the likelihood function of observing X = (5, 2) to be

l(θ|(5, 2)) = e^{−2θ} θ⁷/240
[Figure] Figure 1.5: Likelihood of observing (5, 2) (top) vs. the one of observing 5 (bottom).
and its graph is presented in Figure 1.5 in comparison with the previous likelihood for
a single observation.
Two important effects of adding extra information should be noted:
We observe that the location of the maximum shifted from 5 to 3.5 compared to the single observation.
We also note that the range of likely values for θ has diminished.
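Both effects can be seen numerically. This illustrative Python sketch compares the maximizer of the likelihood for the single observation x = 5 with that for the pair (5, 2):

```python
import math

def likelihood(theta, data):
    """Product of Poisson probabilities theta^x e^(-theta)/x! over the data."""
    p = 1.0
    for x in data:
        p *= theta ** x * math.exp(-theta) / math.factorial(x)
    return p

grid = [i / 100 for i in range(1, 1501)]            # theta = 0.01, ..., 15.00
one = max(grid, key=lambda t: likelihood(t, [5]))   # single observation
two = max(grid, key=lambda t: likelihood(t, [5, 2]))
print(one, two)  # 5.0 and 3.5: the maximum moves toward the sample mean
```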
Let us suppose that eventually we decide to measure three more values of X.
Let us use the vector notation X = (X1, X2, . . . , X5) to denote the observable random
vector. Assume that the three extra measurements yielded 3, 7, 7, so that we have x = (x1, x2, x3, x4, x5) = (5, 2, 3, 7, 7). Under the assumption of independence the probability that X takes the value x is given by

P(X = x|θ) = ∏_{i=1}^5 θ^{xi} e^{−θ}/xi!.

The likelihood function of observing X = (5, 2, 3, 7, 7) under independence can be easily derived to be

l(θ) = θ^{24} e^{−5θ}/(5! 2! 3! 7! 7!).

In general, if x = (x1, . . . , xn) is any vector of n non-negative integers, then the likelihood is given by

l(θ|(x1, . . . , xn)) = θ^{Σ_{i=1}^n xi} e^{−nθ} / ∏_{i=1}^n xi!.

The value that maximizes this likelihood is called the maximum likelihood estimator of θ.
In order to find the values that effectively maximize the likelihood, the methods of calculus can be applied. We note that in our example we deal with only one variable and the computation of the derivative is rather straightforward.
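Setting the derivative of the log-likelihood to zero yields the sample mean as the maximizer; deriving this is the subject of the next exercise, so the illustrative Python sketch below only confirms the numbers for our data against a grid search:

```python
import math

data = [5, 2, 3, 7, 7]

def log_likelihood(theta):
    # log l(theta) = (sum x_i) log(theta) - n*theta - log(prod x_i!)
    return (sum(data) * math.log(theta) - len(data) * theta
            - sum(math.log(math.factorial(x)) for x in data))

theta_hat = sum(data) / len(data)              # closed form: the sample mean
grid = [i / 1000 for i in range(1, 15001)]     # theta = 0.001, ..., 15.000
numeric = max(grid, key=log_likelihood)
print(theta_hat, numeric)  # both equal 4.8
```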
Exercise 2. For the general case of the likelihood based on the Poisson model

l(θ|x1, . . . , xn) = θ^{Σ_{i=1}^n xi} e^{−nθ} / ∏_{i=1}^n xi!,

using the methods of calculus derive a general formula for the maximum likelihood estimator of θ. Using the result, find it for (x1, x2, x3, x4, x5) = (5, 2, 3, 7, 7).

Exercise 3. It is generally believed that the time X that passes until half of the original radioactive material remains follows an exponential distribution f(x|θ) = θe^{−θx}, x > 0. For beryllium 11 five experiments have been performed and the values 13.21, 13.12, 13.95, 13.54, 13.88 seconds have been obtained. Find and plot the likelihood function for θ and based on this determine the most likely θ.
1.3 Likelihood and theory of statistics
The strategy of making statistical inference based on the likelihood function as described above is a recurrent theme in mathematical statistics and thus in our lectures. Using mathematical arguments we will compare various strategies for inferring about the parameters, and often we will demonstrate that the likelihood based methods are optimal. The likelihood will also show its strength as a criterion for deciding between various claims about the parameters of the model, which is the leading story of testing hypotheses.
In modern days, the role of computers in statistical methodology has increased. New computationally intensive methods of data exploration have become one of the central areas of modern statistics. Even there, methods that refer to the likelihood play dominant roles, in particular in Bayesian methodology.
Despite this extensive penetration of statistical methodology by likelihood techniques, statistics can by no means be reduced to the analysis of likelihood. In every area of statistics there are important aspects that require reaching beyond the likelihood; in many cases the likelihood is not even a focus of studies and development. The purpose of this course is to present the importance of the likelihood approach across statistics, but also to present topics for which the likelihood plays a secondary role, if any.
1.4 Computationally intensive methods of statistics
The second part of our presentation of modern statistical inference is devoted to compu-
tationally intensive statistical methods. The area of data explorations is rapidly growing
in importance due to
common access to inexpensive but advanced computing tools,
the emergence of new challenges associated with massive high-dimensional data far exceeding the traditional assumptions on which classical methods of statistics have been based.
In this introduction we give two examples that illustrate the power of modern computers
and computing software both in analysis of statistical models and in performing actual
statistical inference. We start by analyzing the performance of a statistical procedure using random sample generation.
1.4.1 Monte Carlo methods: studying statistical methods using computer generated random samples
Randomness can be used to study properties of a mathematical model. The model itself may be probabilistic or not, but here we focus on the probabilistic ones. Essentially, the approach is based on repetitive simulation of random samples corresponding to the model and observing the behavior of the objects of interest. An example of the Monte Carlo method is approximating the area of a circle by randomly tossing a point (typically computer generated) onto the paper where the circle is drawn. The percentage of points that fall inside the circle represents (approximately) the percentage of the area covered by the circle, as illustrated in Figure 1.6.
Exercise 4. Write an R code that would explore the area of an ellipsoid using the Monte Carlo method.
Below we present an application of Monte Carlo approach to studying fitting meth-
ods for the Poisson model.
Deciding for the Poisson model
Recall that the Poisson model is given by

P(X = x|θ) = θ^x e^{−θ}/x!.

It is relatively easy to demonstrate that the mean value of this distribution is equal to θ and that the variance is also equal to θ.

Exercise 5. Present a formal argument showing that for a Poisson random variable X with parameter θ, EX = θ and VarX = θ.
Thus for a sample of observations x = (x1, . . . , xn) it is reasonable to consider both

θ̂1 = x̄,    θ̂2 = (1/n) Σ_{i=1}^n xi² − x̄²

as estimators of θ.

[Figure] Figure 1.6: Monte Carlo study of the circle area: the approximation for a sample size of 10000 is 3.1248, which compares to the true value of π = 3.141593.
We want to employ the Monte Carlo method to decide which one is better. In the process we generate many samples from the Poisson distribution and check which of the estimates performs better.

[Figure] Figure 1.7: Monte Carlo results of comparing estimation of θ = 4 by the sample mean (left) vs. estimation using the sample variance (right).

The resulting histograms of the values of the estimators are presented in Figure 1.7. It is quite clear from the graphs that the estimator based on the mean is better than the one based on the variance.
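The study can be reproduced with a short Monte Carlo sketch in Python (illustrative; the Poisson sampler below is Knuth's classical multiplication method, and all names are invented here):

```python
import math
import random
import statistics

random.seed(2)
theta = 4.0   # true parameter of the simulated Poisson samples

def poisson_sample(lam):
    # Knuth's multiplication method for one Poisson(lam) variate.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p < threshold:
            return k
        k += 1

means, variances = [], []
for _ in range(1000):                      # 1000 Monte Carlo repetitions
    sample = [poisson_sample(theta) for _ in range(30)]
    m = statistics.fmean(sample)
    means.append(m)                        # estimator theta_hat_1
    variances.append(statistics.fmean([x * x for x in sample]) - m * m)

spread_mean = statistics.pstdev(means)
spread_var = statistics.pstdev(variances)
print(spread_mean < spread_var)  # True: the mean-based estimates fluctuate less
```

The spread of the mean-based estimates is markedly smaller, in line with the histograms in the figure.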
1.4.2 Bootstrap: performing statistical inference using computers
Bootstrap (resampling) methods are one of the examples of Monte Carlo based statis-
tical analysis. The methodology can be summarized as follows
Collect a statistical sample, i.e. the same type of data as in classical statistics.
Use properly chosen Monte Carlo based resampling from the data (using a random number generator) to create so-called bootstrap samples.
Analyze the bootstrap samples to draw conclusions about the random mechanism that produced the original statistical data.
This way randomness is used to analyze statistical samples that, by the way, are also a
result of randomness. An example illustrating the approach is presented next.
Estimating nitrate ion concentration
Nitrate ion concentration measurements in a certain chemical lab have been collected and the results are given in the following table. The goal is to estimate, based on
0.51 0.51 0.51 0.50 0.51 0.49 0.52 0.53 0.50 0.47
0.51 0.52 0.53 0.48 0.49 0.50 0.52 0.49 0.49 0.50
0.49 0.48 0.46 0.49 0.49 0.48 0.49 0.49 0.51 0.47
0.51 0.51 0.51 0.48 0.50 0.47 0.50 0.51 0.49 0.48
0.51 0.50 0.50 0.53 0.52 0.52 0.50 0.50 0.51 0.51
Table 1.1: Results of 50 determinations of nitrate ion concentration in µg per ml.
these values, the actual nitrate ion concentration. The overall mean of all observations is 0.4998. It is natural to ask what the error of this determination of the nitrate concentration is. If we repeated our experiment of collecting 50 samples of nitrate concentration many times, we would see the range of errors that are made. However, that would be a waste of resources and not a viable method at all. Instead, we resample new data from our data and use the so-obtained new samples to assess the error, comparing the obtained means (bootstrap means) with the original one. The differences between these represent the bootstrap estimation errors; their distribution is viewed as a good representation of the distribution of the true error. In Figure 1.8 we see the bootstrap counterpart of the distribution of the estimation error.
Based on this we can safely say that the nitrate concentration is 0.4998 ± 0.005.

Exercise 6. Consider a sample of daily numbers of buyers in a furniture store
8, 5, 2, 3, 1, 3, 9, 5, 5, 2, 3, 3, 8, 4, 7, 11, 7, 5, 12, 5
Consider the two estimators of for a Poisson distribution as discussed in the previous
section. Describe formally the procedure (in steps) of obtaining a bootstrap confidence
[Figure] Figure 1.8: Bootstrap estimation error distribution.
interval for θ using each of the discussed estimators, and provide 95% bootstrap confidence intervals for each of them.
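For orientation only (not a model solution to the exercise), a percentile bootstrap for the mean-based estimator can be sketched in Python as follows:

```python
import random
import statistics

random.seed(3)
data = [8, 5, 2, 3, 1, 3, 9, 5, 5, 2, 3, 3, 8, 4, 7, 11, 7, 5, 12, 5]

def percentile_interval(estimator, data, reps=5000, level=0.95):
    """Percentile bootstrap: resample with replacement, re-estimate,
    keep the central `level` portion of the bootstrap estimates."""
    boot = sorted(estimator([random.choice(data) for _ in data])
                  for _ in range(reps))
    lo = boot[int((1 - level) / 2 * reps)]
    hi = boot[int((1 + level) / 2 * reps) - 1]
    return lo, hi

lo, hi = percentile_interval(statistics.fmean, data)
print(lo, hi)  # an interval around the sample mean 5.4
```

Replacing `statistics.fmean` with a variance-based estimator gives the second interval asked for.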
Chapter 2
Review of Probability
2.1 Expectation and Variance
The expected value E[Y] of a random variable Y is defined as

E[Y] = Σ_i yi P(yi)

if Y is discrete, and

E[Y] = ∫ y f(y) dy

if Y is continuous, where f(y) is the probability density function. The variance Var[Y] of a random variable Y is defined as

Var[Y] = E[(Y − E[Y])²],

or

Var[Y] = Σ_i (yi − E[Y])² P(yi)

if Y is discrete, and

Var[Y] = ∫ (y − E[Y])² f(y) dy

if Y is continuous. When there is no ambiguity we often write EY for E[Y], and VarY for Var[Y].
A function of a random variable is itself a random variable. If h(Y) is a function of the random variable Y, then the expected value of h(Y) is given by

E[h(Y)] = Σ_i h(yi) P(yi)

if Y is discrete, and, if Y is continuous,

E[h(Y)] = ∫ h(y) f(y) dy.

It is relatively straightforward to derive the following results for the expectation and variance of a linear function of Y:

E[aY + b] = a E[Y] + b,
Var[aY + b] = a² Var[Y],

where a and b are constants. Also

Var[Y] = E[Y²] − (E[Y])².   (2.1)

For expectations, it can be shown more generally that

E[Σ_{i=1}^k ai hi(Y)] = Σ_{i=1}^k ai E[hi(Y)],

where ai, i = 1, 2, . . . , k, are constants and hi(Y), i = 1, 2, . . . , k, are functions of the random variable Y.
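These identities can be sanity-checked by direct computation on any small discrete distribution; the values and probabilities in this Python sketch are arbitrary illustrations:

```python
# Check E[aY+b] = aE[Y] + b and Var[aY+b] = a^2 Var[Y] for a small
# discrete distribution (values and probabilities chosen for illustration).
vals = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.3, 0.4]

def expect(h):
    return sum(h(y) * p for y, p in zip(vals, probs))

a, b = 2.0, 5.0
ey = expect(lambda y: y)
var_y = expect(lambda y: y * y) - ey ** 2             # identity (2.1)
e_lin = expect(lambda y: a * y + b)
var_lin = expect(lambda y: (a * y + b) ** 2) - e_lin ** 2

lin_ok = abs(e_lin - (a * ey + b)) < 1e-9
var_ok = abs(var_lin - a * a * var_y) < 1e-9
print(lin_ok, var_ok)  # True True
```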
2.2 Distribution of a Function of a Random Variable
If Y is a random variable, then for any regular function g, X = g(Y) is also a random variable. The cumulative distribution function of X is given by

FX(x) = P(X ≤ x) = P(Y ∈ g^{−1}((−∞, x])).

The density function of X, if it exists, can be found by differentiating the right hand side of the above equality.
Example 1. Let Y have a density fY and let X = Y². Then

FX(x) = P(Y² ≤ x) = P(−√x ≤ Y ≤ √x) = FY(√x) − FY(−√x).

By taking the derivative in x we obtain

fX(x) = (1/(2√x)) (fY(√x) + fY(−√x)).

If additionally the distribution of Y is symmetric around zero, i.e. fY(−y) = fY(y), then

fX(x) = (1/√x) fY(√x).
Exercise 7. Let Z be a random variable with the density fZ(z) = e^{−z²/2}/√(2π), the so-called standard normal (Gaussian) random variable. Show that Z² is a Gamma(1/2, 1/2) random variable, i.e. that it has the density given by

(1/√(2π)) x^{−1/2} e^{−x/2}.

The distribution of Z² is also called the chi-square distribution with one degree of freedom.
Exercise 8. Let FY(y) be the cumulative distribution function of some random variable Y that with probability one takes values in a set RY. Assume that there is an inverse function FY^{−1}: [0, 1] → RY so that FY(FY^{−1}(u)) = u for u ∈ [0, 1]. Check that for U ∼ Unif(0, 1) the random variable Y = FY^{−1}(U) has FY as its cumulative distribution function.
The densities of g(Y) are particularly easy to express if g is strictly monotone, as shown in the next result.

Theorem 2.2.1. Let Y be a continuous random variable with probability density function fY. Suppose that g(y) is a strictly monotone (increasing or decreasing) differentiable (and hence continuous) function of y. The random variable Z defined by Z = g(Y) has probability density function given by

fZ(z) = fY(g^{−1}(z)) |d/dz g^{−1}(z)|,   (2.2)

where g^{−1}(z) is defined to be the inverse function of g(y).
Proof. Let g(y) be a monotone increasing (decreasing) function and let FY(y) and FZ(z) denote the probability distribution functions of the random variables Y and Z. Then

FZ(z) = P(Z ≤ z) = P(g(Y) ≤ z) = P(Y ≤ (≥) g^{−1}(z)) = (1 −) FY(g^{−1}(z)).

By the chain rule,

fZ(z) = d/dz FZ(z) = (−) d/dz FY(g^{−1}(z)) = fY(g^{−1}(z)) |dg^{−1}/dz (z)|.
Exercise 9. (The Log-Normal Distribution) Suppose Z is a standard normal random variable and g(z) = e^{az+b}. Then Y = g(Z) is called a log-normal random variable. Demonstrate that the density of Y is given by

fY(y) = (1/√(2πa²)) y^{−1} exp(−log²(y/e^b)/(2a²)).
2.3 Transforms Method: Characteristic, Probability Generating and Moment Generating Functions
The probability generating function of a random variable Y is a function denoted by GY(t) and defined by

GY(t) = E(t^Y),

for those t ∈ ℝ for which the above expectation is convergent. The expectation defining GY(t) converges absolutely if |t| ≤ 1. As the name implies, the p.g.f. generates the probabilities associated with a discrete distribution P(Y = j) = pj, j = 0, 1, 2, . . . :

GY(0) = p0,  G′Y(0) = p1,  G′′Y(0) = 2! p2.

In general the kth derivative of the p.g.f. of Y satisfies

G^{(k)}Y(0) = k! pk.
The p.g.f. can be used to calculate the mean and variance of a random variable Y. Note that in the discrete case G′Y(t) = Σ_{j=1}^∞ j pj t^{j−1} for −1 < t < 1. Let t approach one from the left, t ↑ 1, to obtain

G′Y(1) = Σ_{j=1}^∞ j pj = E(Y) = μY.

The second derivative of GY(t) satisfies

G′′Y(t) = Σ_{j=2}^∞ j(j − 1) pj t^{j−2},

and consequently

G′′Y(1) = Σ_{j=2}^∞ j(j − 1) pj = E(Y²) − E(Y).

The variance of Y satisfies

σ²Y = EY² − EY + EY − (EY)² = G′′Y(1) + G′Y(1) − (G′Y(1))².
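As a sanity check of the last formula, take Y Poisson with parameter θ, for which GY(t) = e^{θ(t−1)}, so G′Y(1) = θ and G′′Y(1) = θ², and the formula should return Var Y = θ. An illustrative Python sketch using central finite differences in place of the derivatives:

```python
import math

theta = 2.0
G = lambda t: math.exp(theta * (t - 1.0))   # p.g.f. of Poisson(theta)

h = 1e-5
g1 = (G(1 + h) - G(1 - h)) / (2 * h)               # approx G'(1)  = theta
g2 = (G(1 + h) - 2 * G(1) + G(1 - h)) / h ** 2     # approx G''(1) = theta^2
print(g1, g2, g2 + g1 - g1 ** 2)  # approx 2, 4 and the variance 2
```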
The moment generating function (m.g.f.) of a random variable Y is denoted by MY(t) and defined as

MY(t) = E(e^{tY}),

for those t ∈ ℝ for which the expectation is finite. The moment generating function generates the moments EY^k:

MY(0) = 1,  M′Y(0) = μY = E(Y),  M′′Y(0) = EY²,

and, in general,

M^{(k)}Y(0) = EY^k.

The characteristic function (ch.f.) of a random variable Y is defined by

φY(t) = E e^{itY},

where i = √−1.
A very important result concerning generating functions states that the moment
generating function uniquely defines the probability distribution (provided it exists in
an open interval around zero). The characteristic function also uniquely defines the
probability distribution.
Property 1. If Y has the characteristic function φY(t) and the moment generating function MY(t), then for X = a + bY

φX(t) = e^{iat} φY(bt),
MX(t) = e^{at} MY(bt).
2.4 Random Vectors
2.4.1 Sums of Independent Random Variables
Suppose that Y1, Y2, . . . , Yn are independent random variables. Then the moment generating function of the linear combination Z = Σ_{i=1}^n ai Yi is the product of the individual moment generating functions:

MZ(t) = E e^{t Σ ai Yi} = E e^{a1 t Y1} E e^{a2 t Y2} ⋯ E e^{an t Yn} = ∏_{i=1}^n MYi(ai t).

The same argument also gives φZ(t) = ∏_{i=1}^n φYi(ai t).
When X and Y are discrete random variables, the condition of independence is equivalent to pX,Y(x, y) = pX(x) pY(y) for all x, y. In the jointly continuous case the condition of independence is equivalent to fX,Y(x, y) = fX(x) fY(y) for all x, y.
Consider independent random variables X and Y with probability densities fX(x) and fY(y), respectively. We seek the probability density of the random variable X + Y. Our general result follows from

F_{X+Y}(a) = P(X + Y ≤ a) = ∫∫_{x+y≤a} fX(x) fY(y) dx dy = ∫ FX(a − y) fY(y) dy.

Thus the density function is f_{X+Y}(a) = ∫ fX(a − y) fY(y) dy, which is called the convolution of the densities fX and fY.
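A minimal numerical illustration of the convolution formula: for two standard uniform densities the convolution is the triangular density on [0, 2] with peak value 1 at z = 1. A plain Riemann-sum sketch in Python:

```python
# f_{X+Y}(z) = integral of f_X(z - y) f_Y(y) dy, approximated by a
# midpoint Riemann sum over [0, 1] (f_Y vanishes outside that interval).
def unif(t):
    return 1.0 if 0.0 <= t <= 1.0 else 0.0

def convolve(z, steps=10000):
    h = 1.0 / steps
    return sum(unif(z - (k + 0.5) * h) * unif((k + 0.5) * h)
               for k in range(steps)) * h

print(convolve(0.5), convolve(1.0), convolve(1.5))  # approx 0.5, 1.0, 0.5
```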
2.4.2 Covariance and Correlation
Suppose that X and Y are real-valued random variables for some random experiment.
The covariance ofX and Y is defined by
Cov(X, Y) = E[(X EX)(Y EY)]
and (assuming the variances are positive) the correlation of X and Y is defined by

ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).
Note that the covariance and correlation always have the same sign (positive, nega-
tive, or 0). When the sign is positive, the variables are said to be positively correlated,
when the sign is negative, the variables are said to be negatively correlated, and when
the sign is 0, the variables are said to be uncorrelated. For an intuitive understanding
of correlation, suppose that we run the experiment a large number of times and that
for each run, we plot the values (X, Y) in a scatterplot. The scatterplot for positively
correlated variables shows a linear trend with positive slope, while the scatterplot for
negatively correlated variables shows a linear trend with negative slope. For uncorre-
lated variables, the scatterplot should look like an amorphous blob of points with no
discernible linear trend.
Property 2. You should satisfy yourself that the following are true:

Cov(X, Y) = EXY − EX EY,
Cov(X, Y) = Cov(Y, X),
Cov(Y, Y) = Var(Y),
Cov(aX + bY + c, Z) = a Cov(X, Z) + b Cov(Y, Z),
Var(Σ_{i=1}^n Yi) = Σ_{i,j=1}^n Cov(Yi, Yj).
If X and Y are independent, then they are uncorrelated. The converse is not true
however.
2.4.3 The Bivariate Change of Variables Formula
Suppose that (X, Y) is a random vector taking values in a subset S of R² with probability density function f. Suppose that U and V are random variables that are functions of X and Y:

U = U(X, Y),  V = V(X, Y).

If these functions have derivatives, there is a simple way to get the joint probability density function g of (U, V). First, we will assume that the transformation (x, y) → (u, v) is one-to-one and maps S onto a subset T of R². Thus, the inverse transformation (u, v) → (x, y) is well defined and maps T onto S. We will assume that the inverse transformation is smooth, in the sense that the partial derivatives

∂x/∂u,  ∂x/∂v,  ∂y/∂u,  ∂y/∂v

exist on T, and the Jacobian

∂(x, y)/∂(u, v) = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u)

is nonzero on T. Now, let B be an arbitrary subset of T. The inverse transformation maps B onto a subset A of S. Therefore,

P((U, V) ∈ B) = P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy.

But, by the change of variables formula for double integrals, this can be written as

P((U, V) ∈ B) = ∫∫_B f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)| du dv.

By the very meaning of density, it follows that the probability density function of (U, V) is

g(u, v) = f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)|,  (u, v) ∈ T.

The change of variables formula generalizes to R^n.
The change of variables formula generalizes to Rn.
Exercise 10. Let U1 and U2 be independent random variables with density equal to one over [0, 1], i.e. standard uniform random variables. Find the density of the following vector of variables

(Z1, Z2) = (√(−2 log U1) cos(2πU2), √(−2 log U1) sin(2πU2)).
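This is the Box-Muller transform; a short simulation (an illustrative sketch, not a proof) suggests the answer by checking that the first output coordinate has the first two moments of a standard normal variable.

```python
import math
import random

# Simulation sketch of Exercise 10: transform pairs of independent U(0,1)
# variables and check that the first coordinate has mean 0 and variance 1.
random.seed(1)
z1 = []
for _ in range(200_000):
    u1, u2 = random.random(), random.random()
    z1.append(math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2))
n = len(z1)
mean = sum(z1) / n
var = sum(z * z for z in z1) / n - mean ** 2
print(abs(mean) < 0.02, abs(var - 1.0) < 0.05)
```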
2.5 Discrete Random Variables
2.5.1 Bernoulli Distribution
A Bernoulli trial is a probabilistic experiment which can have one of two outcomes, success (Y = 1) or failure (Y = 0), and in which the probability of success is θ. We refer to θ as the Bernoulli probability parameter. The value of the random variable Y is used as an indicator of the outcome, which may also be interpreted as the presence or absence of a particular characteristic. A Bernoulli random variable Y has probability mass function

P(Y = y | θ) = θ^y (1 − θ)^{1−y}  (2.4)

for y = 0, 1 and some θ ∈ (0, 1). The notation Y ~ Ber(θ) should be read as the random variable Y follows a Bernoulli distribution with parameter θ.

A Bernoulli random variable Y has expected value E[Y] = 0 · P(Y = 0) + 1 · P(Y = 1) = 0(1 − θ) + 1θ = θ, and variance Var[Y] = (0 − θ)²(1 − θ) + (1 − θ)²θ = θ(1 − θ).
2.5.2 Binomial Distribution
Consider independent repetitions of Bernoulli experiments, each with probability of success θ. Next consider the random variable Y, defined as the number of successes in a fixed number n of independent Bernoulli trials. That is,

Y = Σ_{i=1}^n X_i,

where X_i ~ Ber(θ) for i = 1, . . . , n. Each sequence of length n containing y ones and (n − y) zeros occurs with probability θ^y (1 − θ)^{n−y}. The number of sequences with y successes, and consequently (n − y) failures, is

n! / (y!(n − y)!) = C(n, y).

The random variable Y can take on values y = 0, 1, 2, . . . , n with probabilities

P(Y = y | θ) = C(n, y) θ^y (1 − θ)^{n−y}.  (2.5)
The notation Y ~ Bin(n, θ) should be read as the random variable Y follows a binomial distribution with parameters n and θ. Finally, using the fact that Y is the sum of n independent Bernoulli random variables, we can calculate the expected value as E[Y] = E[Σ X_i] = Σ E[X_i] = Σ θ = nθ, and the variance as Var[Y] = Var[Σ X_i] = Σ Var[X_i] = Σ θ(1 − θ) = nθ(1 − θ).
2.5.3 Negative Binomial and Geometric Distribution
Instead of fixing the number of trials, suppose now that the number of successes, r, is fixed, and that the sample size required in order to reach this fixed number is the random variable N. This is sometimes called inverse sampling. In the case of r = 1, using the independence argument again leads to the geometric distribution

P(N = n | θ) = θ(1 − θ)^{n−1},  n = 1, 2, . . .  (2.6)

which is the geometric probability function with parameter θ. The distribution is so named as successive probabilities form a geometric series. The notation N ~ Geo(θ) should be read as the random variable N follows a geometric distribution with parameter θ. Write (1 − θ) = q. Then

E[N] = θ Σ_{n=1}^∞ n q^{n−1} = θ Σ_{n=0}^∞ d/dq (q^n) = θ d/dq Σ_{n=0}^∞ q^n = θ d/dq [1/(1 − q)] = θ/(1 − q)² = 1/θ.
Also,

E[N²] = θ Σ_{n=1}^∞ n² q^{n−1} = θ Σ_{n=1}^∞ d/dq (n q^n) = θ d/dq [Σ_{n=1}^∞ n q^n] = θ d/dq [q/(1 − q)²] = θ(1 + q)/(1 − q)³ = 2/θ² − 1/θ.

Using Var[N] = E[N²] − (E[N])², we get Var[N] = (1 − θ)/θ².
Suppose now that sampling continues until a total of r successes is observed. Again, let the random variable N denote the number of trials required. If the rth success occurs
on the nth trial, then a total of (r − 1) successes must be observed by the (n − 1)th trial. The probability of this happening can be calculated using the binomial distribution as

C(n − 1, r − 1) θ^{r−1} (1 − θ)^{n−r}.

The probability that the nth trial is a success is θ. As these two events are independent we have that

P(N = n | r, θ) = C(n − 1, r − 1) θ^r (1 − θ)^{n−r}  (2.7)

for n = r, r + 1, . . . . The notation N ~ NegBin(r, θ) should be read as the random variable N follows a negative binomial distribution with parameters r and θ. This is also known as the Pascal distribution.
E[N^k] = Σ_{n=r}^∞ n^k C(n − 1, r − 1) θ^r (1 − θ)^{n−r}
= (r/θ) Σ_{n=r}^∞ n^{k−1} C(n, r) θ^{r+1} (1 − θ)^{n−r}   since n C(n − 1, r − 1) = r C(n, r)
= (r/θ) Σ_{m=r+1}^∞ (m − 1)^{k−1} C(m − 1, r) θ^{r+1} (1 − θ)^{m−(r+1)}
= (r/θ) E[(X − 1)^{k−1}],

where X ~ NegBin(r + 1, θ). Setting k = 1 we get E[N] = r/θ. Setting k = 2 gives

E[N²] = (r/θ) E[X − 1] = (r/θ) [(r + 1)/θ − 1].

Therefore Var[N] = r(1 − θ)/θ².
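These two moments can be checked by simulating the inverse sampling scheme directly. The sketch below (an illustration with arbitrary r and θ, not part of the original notes) draws Bernoulli trials until the rth success and compares the empirical mean and variance of N with r/θ and r(1 − θ)/θ².

```python
import random

# Sketch: draw Bernoulli(theta) trials until the r-th success; the number of
# trials N should have mean r/theta and variance r(1-theta)/theta^2.
random.seed(11)
theta, r, reps = 0.3, 4, 100_000
draws = []
for _ in range(reps):
    trials = successes = 0
    while successes < r:
        trials += 1
        if random.random() < theta:
            successes += 1
    draws.append(trials)
mean = sum(draws) / reps
var = sum((d - mean) ** 2 for d in draws) / reps
print(abs(mean - r / theta) < 0.2, abs(var - r * (1 - theta) / theta ** 2) < 1.5)
```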
2.5.4 Hypergeometric Distribution
The hypergeometric distribution is used to describe sampling without replacement. Consider an urn containing b balls, of which w are white and b − w are red. We intend to draw a sample of size n from the urn. Let Y denote the number of white balls selected. Then, for y = 0, 1, 2, . . . , n we have

P(Y = y | b, w, n) = C(w, y) C(b − w, n − y) / C(b, n).  (2.8)
The jth moment of a hypergeometric random variable is

E[Y^j] = Σ_{y=0}^n y^j P(Y = y) = Σ_{y=1}^n y^j C(w, y) C(b − w, n − y) / C(b, n).

The identities

y C(w, y) = w C(w − 1, y − 1),   n C(b, n) = b C(b − 1, n − 1)

can be used to obtain

E[Y^j] = (nw/b) Σ_{y=1}^n y^{j−1} C(w − 1, y − 1) C(b − w, n − y) / C(b − 1, n − 1)
= (nw/b) Σ_{x=0}^{n−1} (x + 1)^{j−1} C(w − 1, x) C(b − w, n − 1 − x) / C(b − 1, n − 1)
= (nw/b) E[(X + 1)^{j−1}],

where X is a hypergeometric random variable with parameters n − 1, b − 1, w − 1. From this it is easy to establish that E[Y] = nθ and Var[Y] = nθ(1 − θ)(b − n)/(b − 1), where θ = w/b is the fraction of white balls in the population.
2.5.5 Poisson Distribution
Certain problems involve counting the number of events that have occurred in a fixed
time period. A random variable Y, taking on one of the values 0, 1, 2, . . . , is said to be
a Poisson random variable with parameter λ if for some λ > 0,

P(Y = y | λ) = (λ^y / y!) e^{−λ},  y = 0, 1, 2, . . .  (2.9)

The notation Y ~ Pois(λ) should be read as the random variable Y follows a Poisson distribution with parameter λ. Equation (2.9) defines a probability mass function, since

Σ_{y=0}^∞ (λ^y / y!) e^{−λ} = e^{−λ} Σ_{y=0}^∞ λ^y / y! = e^{−λ} e^{λ} = 1.

The expected value of a Poisson random variable is

E[Y] = Σ_{y=0}^∞ y e^{−λ} λ^y / y! = λ e^{−λ} Σ_{y=1}^∞ λ^{y−1}/(y − 1)! = λ e^{−λ} Σ_{j=0}^∞ λ^j / j! = λ.
To get the variance we first compute the second moment
E[Y²] = Σ_{y=0}^∞ y² e^{−λ} λ^y / y! = λ Σ_{y=1}^∞ y e^{−λ} λ^{y−1}/(y − 1)! = λ Σ_{j=0}^∞ (j + 1) e^{−λ} λ^j / j! = λ(λ + 1).
Since we already have E[Y] = λ, we obtain Var[Y] = E[Y²] − (E[Y])² = λ.

Suppose that Y ~ Bin(n, p), and let λ = np. Then

P(Y = y | n, p) = C(n, y) p^y (1 − p)^{n−y}
= C(n, y) (λ/n)^y (1 − λ/n)^{n−y}
= [n(n − 1) · · · (n − y + 1)/n^y] (λ^y / y!) (1 − λ/n)^n / (1 − λ/n)^y.

For n large and λ moderate, we have that

(1 − λ/n)^n ≈ e^{−λ},   n(n − 1) · · · (n − y + 1)/n^y ≈ 1,   (1 − λ/n)^y ≈ 1.

Our result is that a binomial random variable Bin(n, p) is well approximated by a Poisson random variable Pois(λ = np) when n is large and p is small. That is,

P(Y = y | n, p) ≈ e^{−np} (np)^y / y!.
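The quality of this approximation is easy to probe numerically. The sketch below (an illustration with arbitrary n and p, not part of the original notes) compares the two probability mass functions point by point.

```python
import math

# Sketch: compare Bin(n, p) and Pois(np) probabilities for large n, small p.
n, p = 1000, 0.004          # lam = np = 4
lam = n * p
for y in range(12):
    binom = math.comb(n, y) * p ** y * (1 - p) ** (n - y)
    pois = math.exp(-lam) * lam ** y / math.factorial(y)
    assert abs(binom - pois) < 0.01  # pointwise agreement
print("Bin(1000, 0.004) is close to Pois(4) at every point checked")
```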
2.5.6 Discrete Uniform Distribution
The discrete uniform distribution with integer parameter N has a random variable Y that can take the values y = 1, 2, . . . , N with equal probability 1/N. It is easy to show that the mean and variance of Y are E[Y] = (N + 1)/2 and Var[Y] = (N² − 1)/12.
2.5.7 The Multinomial Distribution
Suppose that we perform n independent and identical experiments, where each experiment can result in any one of r possible outcomes, with respective probabilities p_1, p_2, . . . , p_r, where Σ_{i=1}^r p_i = 1. If we denote by Y_i the number of the n experiments that result in outcome number i, then

P(Y_1 = n_1, Y_2 = n_2, . . . , Y_r = n_r) = [n! / (n_1! n_2! · · · n_r!)] p_1^{n_1} p_2^{n_2} · · · p_r^{n_r}  (2.10)

where Σ_{i=1}^r n_i = n. Equation (2.10) is justified by noting that any sequence of outcomes that leads to outcome i occurring n_i times for i = 1, 2, . . . , r will, by the assumption of independence of experiments, have probability p_1^{n_1} p_2^{n_2} · · · p_r^{n_r} of occurring. As there are n!/(n_1! n_2! · · · n_r!) such sequences of outcomes, equation (2.10) is established.
2.6 Continuous Random Variables
2.6.1 Uniform Distribution
A random variable Y is said to be uniformly distributed over the interval (a, b) if its probability density function is given by

f(y | a, b) = 1/(b − a),  if a < y < b,

and equals 0 for all other values of y. Since F(u) = ∫_{−∞}^u f(y) dy, the distribution function of a uniform random variable on the interval (a, b) is

F(u) = 0 for u ≤ a;  (u − a)/(b − a) for a < u ≤ b;  1 for u > b.

The expected value of a uniform random variable turns out to be the mid-point of the interval, that is,

E[Y] = ∫ y f(y) dy = ∫_a^b y/(b − a) dy = (b² − a²)/(2(b − a)) = (b + a)/2.

The second moment is calculated as

E[Y²] = ∫_a^b y²/(b − a) dy = (b³ − a³)/(3(b − a)) = (b² + ab + a²)/3,

hence the variance is

Var[Y] = E[Y²] − (E[Y])² = (b − a)²/12.

The notation Y ~ U(a, b) should be read as the random variable Y follows a uniform distribution on the interval (a, b).
2.6.2 Exponential Distribution
A random variable Y is said to be an exponential random variable if its probability
density function is given by
f(y | λ) = λ e^{−λy},  y > 0, λ > 0.

The cumulative distribution function of an exponential random variable is given by

F(a) = ∫_0^a λ e^{−λy} dy = −e^{−λy} |_0^a = 1 − e^{−λa},  a > 0.

The expected value E[Y] = ∫_0^∞ y λ e^{−λy} dy requires integration by parts, yielding

E[Y] = −y e^{−λy} |_0^∞ + ∫_0^∞ e^{−λy} dy = −e^{−λy}/λ |_0^∞ = 1/λ.

Integration by parts can also be used to verify that E[Y²] = 2/λ². Hence Var[Y] = 1/λ². The notation Y ~ Exp(λ) should be read as the random variable Y follows an exponential distribution with parameter λ.

Exercise 11. Let U ~ U[0, 1]. Find the distribution of Y = −log U. Can you identify it as one of the common distributions?
2.6.3 Gamma Distribution
A random variable Y is said to have a gamma distribution if its density function is given by

f(y | α, λ) = λ e^{−λy} (λy)^{α−1} / Γ(α),  y > 0, α > 0, λ > 0,

where Γ(α) is called the gamma function and is defined by

Γ(α) = ∫_0^∞ e^{−u} u^{α−1} du.

Integration by parts of Γ(α) yields the recursive relationship

Γ(α) = −e^{−u} u^{α−1} |_0^∞ + ∫_0^∞ e^{−u} (α − 1) u^{α−2} du  (2.11)
= (α − 1) ∫_0^∞ e^{−u} u^{α−2} du = (α − 1) Γ(α − 1).  (2.12)
For integer values α = n, this recursive relationship gives Γ(n + 1) = n!. Note that by setting α = 1 the gamma distribution reduces to the exponential distribution. The expected value of a gamma random variable is given by

E[Y] = [λ/Γ(α)] ∫_0^∞ y (λy)^{α−1} e^{−λy} dy = [1/(λ Γ(α))] ∫_0^∞ u^α e^{−u} du,

after the change of variable u = λy. Hence E[Y] = Γ(α + 1)/(λ Γ(α)) = α/λ. Using the same substitution,

E[Y²] = Γ(α + 2)/(λ² Γ(α)) = α(α + 1)/λ²,

so that Var[Y] = α/λ². The notation Y ~ Gamma(α, λ) should be read as the random variable Y follows a gamma distribution with parameters α and λ.

Exercise 12. Let Y ~ Gamma(α, λ). Show that the moment generating function for Y is given for t ∈ (−∞, λ) by

M_Y(t) = 1/(1 − t/λ)^α.
2.6.4 Gaussian (Normal) Distribution
A random variable Z is a standard normal (or Gaussian) random variable if the density of Z is specified by

f(z) = (1/√(2π)) e^{−z²/2}.  (2.13)

It is not immediately obvious that (2.13) specifies a probability density. To show that this is the case we need to prove

(1/√(2π)) ∫_{−∞}^∞ e^{−z²/2} dz = 1,

or, equivalently, that I = ∫_{−∞}^∞ e^{−z²/2} dz = √(2π). This is a classic result and so is well worth confirming. Consider

I² = ∫_{−∞}^∞ e^{−z²/2} dz ∫_{−∞}^∞ e^{−w²/2} dw = ∫∫ e^{−(z²+w²)/2} dz dw.

The double integral can be evaluated by a change of variables to polar coordinates. Substituting z = r cos θ, w = r sin θ, and dz dw = r dθ dr, we get

I² = ∫_0^∞ ∫_0^{2π} e^{−r²/2} r dθ dr = 2π ∫_0^∞ r e^{−r²/2} dr = −2π e^{−r²/2} |_0^∞ = 2π.
Taking the square root we get I = √(2π). The result I = √(2π) can also be used to establish that Γ(1/2) = √π. To prove that this is the case, note that (substituting u = z²)

Γ(1/2) = ∫_0^∞ e^{−u} u^{−1/2} du = 2 ∫_0^∞ e^{−z²} dz = √π.
The expected value of Z equals zero because z e^{−z²/2} is integrable and antisymmetric around zero. The variance of Z is given by

Var[Z] = (1/√(2π)) ∫_{−∞}^∞ z² e^{−z²/2} dz.

Integrating by parts,

Var[Z] = (1/√(2π)) ∫_{−∞}^∞ z² e^{−z²/2} dz
= (1/√(2π)) [ −z e^{−z²/2} |_{−∞}^{+∞} + ∫_{−∞}^{+∞} e^{−z²/2} dz ]
= (1/√(2π)) ∫_{−∞}^∞ e^{−z²/2} dz
= 1.
If Z is a standard normal random variable then Y = μ + σZ is called a general normal (Gaussian) random variable with parameters μ and σ. The density of Y is given by

f(y | μ, σ) = (1/√(2πσ²)) e^{−(y−μ)²/(2σ²)}.

We have obviously E[Y] = μ and Var[Y] = σ². The notation Y ~ N(μ, σ²) should be read as the random variable Y follows a normal distribution with mean parameter μ and variance parameter σ². From the definition of Y it follows immediately that a + bY, where a and b are known constants, is again normally distributed.

Exercise 13. Let Y ~ N(μ, σ²). What is the distribution of X = a + bY?

Exercise 14. Let Y ~ N(μ, σ²). Show that the moment generating function of Y is given by

M_Y(t) = e^{μt + σ²t²/2}.

Hint: Consider first the standard normal variable and then apply Property 1.
2.6.5 Weibull Distribution
The Weibull distribution function has the form

F(y) = 1 − exp[−(y/b)^a],  y > 0.

The Weibull density can be obtained by differentiation as

f(y | a, b) = (a/b)(y/b)^{a−1} exp[−(y/b)^a].

To calculate the expected value

E[Y] = ∫_0^∞ y (a/b)(y/b)^{a−1} exp[−(y/b)^a] dy

we use the substitutions u = (y/b)^a and du = (a/b^a) y^{a−1} dy. These yield

E[Y] = b ∫_0^∞ u^{1/a} e^{−u} du = b Γ((a + 1)/a).

In a similar manner, it is straightforward to verify that

E[Y²] = b² Γ((a + 2)/a),

and thus

Var[Y] = b² [ Γ((a + 2)/a) − Γ²((a + 1)/a) ].
2.6.6 Beta Distribution
A random variable is said to have a beta distribution if its density is given by
f(y | a, b) = [1/B(a, b)] y^{a−1} (1 − y)^{b−1},  0 < y < 1.

Here the function

B(a, b) = ∫_0^1 u^{a−1} (1 − u)^{b−1} du

is the beta function, and is related to the gamma function through

B(a, b) = Γ(a)Γ(b)/Γ(a + b).

Proceeding in the usual manner, we can show that

E[Y] = a/(a + b),   Var[Y] = ab/[(a + b)²(a + b + 1)].
2.6.7 Chi-square Distribution
Let Z ~ N(0, 1), and let Y = Z². Then the cumulative distribution function is

F_Y(y) = P(Y ≤ y) = P(Z² ≤ y) = P(−√y ≤ Z ≤ √y) = F_Z(√y) − F_Z(−√y),

so that by differentiating in y we arrive at the density

f_Y(y) = [1/(2√y)][f_Z(√y) + f_Z(−√y)] = (1/√(2πy)) e^{−y/2},

in which we recognize Gamma(1/2, 1/2). Suppose that Y = Σ_{i=1}^n Z_i², where the Z_i ~ N(0, 1) for i = 1, . . . , n are independent. From results on the sum of independent gamma random variables, Y ~ Gamma(n/2, 1/2). This density has the form

f_Y(y | n) = e^{−y/2} y^{n/2−1} / [2^{n/2} Γ(n/2)],  y > 0,  (2.14)

and is referred to as a chi-squared distribution on n degrees of freedom. The notation Y ~ Chi(n) should be read as the random variable Y follows a chi-squared distribution with n degrees of freedom. Later we will show that if X ~ Chi(u) and Y ~ Chi(v) are independent, it follows that X + Y ~ Chi(u + v).
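The construction of Chi(n) as a sum of squared standard normals is easy to illustrate by simulation. The sketch below (an illustration, not part of the original notes) checks the two moments this construction implies: mean n and variance 2n.

```python
import random

# Sketch: the sum of n squared independent N(0,1) draws should have
# mean n and variance 2n (chi-squared with n degrees of freedom).
random.seed(7)
n, reps = 5, 100_000
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
           for _ in range(reps)]
mean = sum(samples) / reps
var = sum((s - mean) ** 2 for s in samples) / reps
print(abs(mean - n) < 0.1, abs(var - 2 * n) < 0.5)
```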
2.6.8 The Bivariate Normal Distribution
Suppose that U and V are independent random variables, each with the standard normal distribution. We will need the following parameters: μ_X, μ_Y, σ_X > 0, σ_Y > 0, ρ ∈ [−1, 1]. Now let X and Y be new random variables defined by

X = μ_X + σ_X U,
Y = μ_Y + ρ σ_Y U + σ_Y √(1 − ρ²) V.
Using basic properties of mean, variance, covariance, and the normal distribution, satisfy yourself of the following.

Property 3. The following properties hold:

1. X is normally distributed with mean μ_X and standard deviation σ_X,
2. Y is normally distributed with mean μ_Y and standard deviation σ_Y,
3. Corr(X, Y) = ρ,
4. X and Y are independent if and only if ρ = 0.
The inverse transformation is

u = (x − μ_X)/σ_X,
v = (y − μ_Y)/(σ_Y √(1 − ρ²)) − ρ(x − μ_X)/(σ_X √(1 − ρ²)),

so that the Jacobian of this transformation is

∂(u, v)/∂(x, y) = 1/(σ_X σ_Y √(1 − ρ²)).
Since U and V are independent standard normal variables, their joint probability density function is

g(u, v) = (1/(2π)) e^{−(u²+v²)/2}.

Using the bivariate change of variables formula, the joint density of (X, Y) is

f(x, y) = [1/(2π σ_X σ_Y √(1 − ρ²))] exp{ −[1/(2(1 − ρ²))] [ (x − μ_X)²/σ_X² − 2ρ(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y² ] }.
Bivariate Normal Conditional Distributions
In the last section we derived the joint probability density function f of the bivariate normal random variables X and Y. The marginal densities are known. Then

f_{Y|X}(y | x) = f_{X,Y}(x, y)/f_X(x) = [1/√(2π σ_Y²(1 − ρ²))] exp{ −[y − (μ_Y + ρ σ_Y (x − μ_X)/σ_X)]² / (2σ_Y²(1 − ρ²)) }.

Thus the conditional distribution of Y given X = x is also Gaussian, with

E(Y | X = x) = μ_Y + ρ σ_Y (x − μ_X)/σ_X,
Var(Y | X = x) = σ_Y² (1 − ρ²).
2.6.9 The Multivariate Normal Distribution
Let Σ denote the 2 × 2 symmetric matrix

[ σ_X²        ρ σ_X σ_Y ]
[ ρ σ_X σ_Y   σ_Y²      ]

Then

det Σ = σ_X² σ_Y² − (ρ σ_X σ_Y)² = σ_X² σ_Y² (1 − ρ²)

and

Σ^{−1} = [1/(1 − ρ²)] [ 1/σ_X²          −ρ/(σ_X σ_Y) ]
                       [ −ρ/(σ_X σ_Y)   1/σ_Y²       ]

Hence the bivariate normal density of (X, Y) can be written in matrix notation as

f_{(X,Y)}(x, y) = [1/(2π √(det Σ))] exp{ −(1/2) (x − μ_X, y − μ_Y)^T Σ^{−1} (x − μ_X, y − μ_Y) }.

Let Y = (Y_1, . . . , Y_p) be a random vector. Let E(Y_i) = μ_i, i = 1, . . . , p, and define the p-length vector μ = (μ_1, . . . , μ_p). Define the p × p matrix Σ through its elements Cov(Y_i, Y_j) for i, j = 1, . . . , p. Then the random vector Y has a p-dimensional multivariate Gaussian distribution if its density function is specified by

f_Y(y) = [1/((2π)^{p/2} |Σ|^{1/2})] exp{ −(1/2) (y − μ)^T Σ^{−1} (y − μ) }.  (2.15)

The notation Y ~ MVN_p(μ, Σ) should be read as the random vector Y follows a multivariate Gaussian (normal) distribution with p-vector mean μ and p × p variance-covariance matrix Σ.
2.7 Distributions: further properties

2.7.1 Sums of Independent Random Variables: special cases
Poisson variables
Suppose X ~ Pois(λ) and Y ~ Pois(μ). Assume that X and Y are independent. Then

P(X + Y = n) = Σ_{k=0}^n P(X = k, Y = n − k)
= Σ_{k=0}^n P(X = k) P(Y = n − k)
= Σ_{k=0}^n [e^{−λ} λ^k / k!] [e^{−μ} μ^{n−k} / (n − k)!]
= [e^{−(λ+μ)} / n!] Σ_{k=0}^n [n! / (k!(n − k)!)] λ^k μ^{n−k}
= e^{−(λ+μ)} (λ + μ)^n / n!.

That is, X + Y ~ Pois(λ + μ).
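The convolution identity above holds term by term, so it can be verified exactly (up to floating-point error). The sketch below (an illustration with arbitrary rates, not part of the original notes) sums the convolution terms and compares with the Pois(λ + μ) probability.

```python
import math

# Sketch: re-derive P(X + Y = n) by summing the convolution terms and
# compare with the Pois(lam + mu) probability.
lam, mu = 1.5, 2.5

def pois(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

for n in range(12):
    conv = sum(pois(k, lam) * pois(n - k, mu) for k in range(n + 1))
    assert abs(conv - pois(n, lam + mu)) < 1e-12
print("convolution of Pois(1.5) and Pois(2.5) matches Pois(4.0)")
```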
Binomial Random Variables
We seek the distribution of X + Y, where Y ~ Bin(n, θ) and X ~ Bin(m, θ). X + Y models the situation where the total number of trials is fixed at n + m and the probability of a success in a single trial equals θ. Without performing any calculations, we expect to find that X + Y ~ Bin(n + m, θ). To verify this, note that X = X_1 + · · · + X_m and Y = Y_1 + · · · + Y_n, where the X_i and Y_i are independent Bernoulli variables with parameter θ. Assuming that the X_i are independent of the Y_i, we obtain that X + Y is the sum of n + m independent Bernoulli random variables with parameter θ, i.e. X + Y has the Bin(n + m, θ) distribution.
42
-
7/27/2019 Stat Infer
44/114
Gamma, Chi-square, and Exponential Random Variables
Let X ~ Gamma(α, λ) and Y ~ Gamma(β, λ) be independent. Then the moment generating function of X + Y is given by

M_{X+Y}(t) = M_X(t) M_Y(t) = [1/(1 − t/λ)^α][1/(1 − t/λ)^β] = 1/(1 − t/λ)^{α+β}.

But this is the moment generating function of a Gamma random variable distributed as Gamma(α + β, λ). The result X + Y ~ Chi(u + v), where X ~ Chi(u) and Y ~ Chi(v) are independent, follows as a corollary.

Let Y_1, . . . , Y_n be n independent exponential random variables, each with parameter λ. Then Z = Y_1 + Y_2 + · · · + Y_n is a Gamma(n, λ) random variable. To see that this is indeed the case, write Y_i ~ Exp(λ), or alternatively, Y_i ~ Gamma(1, λ). Then Y_1 + Y_2 ~ Gamma(2, λ), and by induction Σ_{i=1}^n Y_i ~ Gamma(n, λ).
Gaussian Random Variables
Let X ~ N(μ_X, σ_X²) and Y ~ N(μ_Y, σ_Y²) be independent. Then the moment generating function of X + Y is given by

M_{X+Y}(t) = M_X(t) M_Y(t) = e^{μ_X t + σ_X² t²/2} e^{μ_Y t + σ_Y² t²/2} = e^{(μ_X + μ_Y)t + (σ_X² + σ_Y²) t²/2},

which proves that X + Y ~ N(μ_X + μ_Y, σ_X² + σ_Y²).
2.7.2 Common Distributions Summarizing Tables
Discrete Distributions

Bernoulli(θ)
  pmf: P(Y = y | θ) = θ^y (1 − θ)^{1−y}, y = 0, 1; 0 ≤ θ ≤ 1
  mean/variance: E[Y] = θ, Var[Y] = θ(1 − θ)
  mgf: M_Y(t) = θe^t + (1 − θ)

Binomial(n, θ)
  pmf: P(Y = y | θ) = C(n, y) θ^y (1 − θ)^{n−y}, y = 0, 1, . . . , n; 0 ≤ θ ≤ 1
  mean/variance: E[Y] = nθ, Var[Y] = nθ(1 − θ)
  mgf: M_Y(t) = [θe^t + (1 − θ)]^n

Discrete uniform(N)
  pmf: P(Y = y | N) = 1/N, y = 1, 2, . . . , N
  mean/variance: E[Y] = (N + 1)/2, Var[Y] = (N + 1)(N − 1)/12
  mgf: M_Y(t) = e^t (e^{Nt} − 1) / [N(e^t − 1)]

Geometric(θ)
  pmf: P(Y = y | θ) = θ(1 − θ)^{y−1}, y = 1, 2, . . . ; 0 ≤ θ ≤ 1
  mean/variance: E[Y] = 1/θ, Var[Y] = (1 − θ)/θ²
  mgf: M_Y(t) = θe^t/[1 − (1 − θ)e^t], t < −log(1 − θ)
  notes: The random variable X = Y − 1 is NegBin(1, θ).

Hypergeometric(b, w, n)
  pmf: P(Y = y | b, w, n) = C(w, y) C(b − w, n − y)/C(b, n),
       max(0, n − (b − w)) ≤ y ≤ min(n, w); b, w, n ≥ 0
  mean/variance: E[Y] = nw/b, Var[Y] = nw(b − w)(b − n)/[b²(b − 1)]

Negative binomial(r, θ)
  pmf: P(Y = y | r, θ) = C(r + y − 1, y) θ^r (1 − θ)^y, y = 0, 1, 2, . . . ; 0 < θ ≤ 1
  mean/variance: E[Y] = r(1 − θ)/θ, Var[Y] = r(1 − θ)/θ²
  mgf: M_Y(t) = [θ/(1 − (1 − θ)e^t)]^r, t < −log(1 − θ)
  notes: An alternative form of the pmf, used in the derivation in our notes, is given by P(N = n | r, θ) = C(n − 1, r − 1) θ^r (1 − θ)^{n−r}, n = r, r + 1, . . . , where the random variable N = Y + r. The negative binomial can also be derived as a mixture of Poisson random variables.

Poisson(λ)
  pmf: P(Y = y | λ) = λ^y e^{−λ}/y!, y = 0, 1, 2, . . . ; 0 < λ
  mean/variance: E[Y] = λ, Var[Y] = λ
  mgf: M_Y(t) = e^{λ(e^t − 1)}
Continuous Distributions
Uniform U(a, b)
  pdf: f(y | a, b) = 1/(b − a), a < y < b
  mean/variance: E[Y] = (b + a)/2, Var[Y] = (b − a)²/12
  mgf: M_Y(t) = (e^{bt} − e^{at})/[(b − a)t]
  notes: A uniform distribution with a = 0 and b = 1 is a special case of the beta distribution (α = β = 1).

Exponential E(λ)
  pdf: f(y | λ) = λe^{−λy}, y > 0; λ > 0
  mean/variance: E[Y] = 1/λ, Var[Y] = 1/λ²
  mgf: M_Y(t) = 1/(1 − t/λ)
  notes: Special case of the gamma distribution. Suitable transformations of Y give the Weibull (a power Y^{1/a}), Rayleigh and Gumbel distributions.

Gamma G(α, λ)
  pdf: f(y | α, λ) = λe^{−λy}(λy)^{α−1}/Γ(α), y > 0; α, λ > 0
  mean/variance: E[Y] = α/λ, Var[Y] = α/λ²
  mgf: M_Y(t) = 1/(1 − t/λ)^α
  notes: Includes the exponential (α = 1) and chi-squared (α = n/2, λ = 1/2).

Normal N(μ, σ²)
  pdf: f(y | μ, σ²) = (1/√(2πσ²)) e^{−(y−μ)²/(2σ²)}; σ > 0
  mean/variance: E[Y] = μ, Var[Y] = σ²
  mgf: M_Y(t) = e^{μt + σ²t²/2}
  notes: Often called the Gaussian distribution.
Transforms
The generating functions of the discrete and continuous random variables discussed thus far are given in Table 2.1.
Bi(n, θ):      p.g.f. (θt + θ̄)^n;  m.g.f. (θe^t + θ̄)^n;  ch.f. (θe^{it} + θ̄)^n
Geo(θ):        p.g.f. θt/(1 − θ̄t);  m.g.f. θe^t/(1 − θ̄e^t);  ch.f. θe^{it}/(1 − θ̄e^{it})
NegBin(r, θ):  p.g.f. θ^r(1 − θ̄t)^{−r};  m.g.f. θ^r(1 − θ̄e^t)^{−r};  ch.f. θ^r(1 − θ̄e^{it})^{−r}
Poi(λ):        p.g.f. e^{−λ(1−t)};  m.g.f. e^{λ(e^t − 1)};  ch.f. e^{λ(e^{it} − 1)}
Unif(α, β):    m.g.f. (e^{βt} − e^{αt})/[(β − α)t];  ch.f. (e^{iβt} − e^{iαt})/[i(β − α)t]
Exp(λ):        m.g.f. (1 − t/λ)^{−1};  ch.f. (1 − it/λ)^{−1}
Ga(c, λ):      m.g.f. (1 − t/λ)^{−c};  ch.f. (1 − it/λ)^{−c}
N(μ, σ²):      m.g.f. exp(μt + σ²t²/2);  ch.f. exp(iμt − σ²t²/2)

Table 2.1: Transforms of distributions. In the formulas θ̄ = 1 − θ.
Chapter 3
Likelihood
3.1 Maximum Likelihood Estimation
Let x be a realization of the random variable X with probability density f_X(x | θ), where θ = (θ_1, θ_2, . . . , θ_m)^T is a vector of m unknown parameters to be estimated. The set of allowable values for θ, denoted by Θ, is called the parameter space. Define the likelihood function

l(θ | x) = f_X(x | θ).  (3.1)

It is crucial to stress that the argument of f_X(x | θ) is x, but the argument of l(θ | x) is θ. It is therefore convenient to view the likelihood function l(θ) as the probability of the observed data x considered as a function of θ. Usually it is convenient to work with the natural logarithm of the likelihood, called the log-likelihood, denoted by log l(θ | x).

When θ ∈ R¹ we can define the score function as the first derivative of the log-likelihood,

S(θ) = (∂/∂θ) log l(θ).

The maximum likelihood estimate (MLE) θ̂ of θ is the solution to the score equation

S(θ) = 0.
At the maximum, the second partial derivative of the log-likelihood is negative, so we define the curvature at θ̂ as I(θ̂), where

I(θ) = −(∂²/∂θ²) log l(θ).

We can check that a solution of the equation S(θ) = 0 is actually a maximum by checking that I(θ̂) > 0. A large curvature I(θ̂) is associated with a tight or strong peak, intuitively indicating less uncertainty about θ.

The likelihood function l(θ | x) supplies an order of preference or plausibility among possible values of θ based on the observed x. It ranks the plausibility of possible values of θ by how probable they make the observed x. If P(x | θ = θ_1) > P(x | θ = θ_2), then the observed x makes θ = θ_1 more plausible than θ = θ_2, and consequently from (3.1), l(θ_1 | x) > l(θ_2 | x). The likelihood ratio l(θ_1 | x)/l(θ_2 | x) is a measure of the plausibility of θ_1 relative to θ_2 based on the observed x. The relative likelihood l(θ_1 | x)/l(θ_2 | x) = k means that the observed value x will occur k times more frequently in repeated samples from the population defined by the value θ_1 than from the population defined by θ_2. Since only ratios of likelihoods are meaningful, it is convenient to standardize the likelihood with respect to its maximum.
When the random variables X_1, . . . , X_n are mutually independent we can write the joint density as

f_X(x) = Π_{j=1}^n f_{X_j}(x_j),

where x = (x_1, . . . , x_n) is a realization of the random vector X = (X_1, . . . , X_n), and the likelihood function becomes

l(θ | x) = Π_{j=1}^n f_{X_j}(x_j | θ).

When the densities f_{X_j}(x_j) are identical, we unambiguously write f(x_j).
Example 2 (Bernoulli Trials). Consider n independent Bernoulli trials. The jth observation is either a success or failure, coded x_j = 1 and x_j = 0 respectively, with

P(X_j = x_j | θ) = θ^{x_j}(1 − θ)^{1−x_j}
for j = 1, . . . , n. The vector of observations x = (x_1, x_2, . . . , x_n)^T is a sequence of ones and zeros, and is a realization of the random vector X = (X_1, X_2, . . . , X_n)^T. As the Bernoulli outcomes are assumed to be independent, we can write the joint probability mass function of X as the product of the marginal probabilities, that is,

l(θ) = Π_{j=1}^n P(X_j = x_j | θ)
= Π_{j=1}^n θ^{x_j}(1 − θ)^{1−x_j}
= θ^{Σ x_j}(1 − θ)^{n − Σ x_j}
= θ^r (1 − θ)^{n−r},

where r = Σ_{j=1}^n x_j is the number of observed successes (1s) in the vector x. The log-likelihood function is then

log l(θ) = r log θ + (n − r) log(1 − θ),

and the score function is

S(θ) = (∂/∂θ) log l(θ) = r/θ − (n − r)/(1 − θ).

Solving S(θ) = 0 we get θ̂ = r/n. We also have

I(θ) = r/θ² + (n − r)/(1 − θ)² > 0,
guaranteeing that θ̂ is the MLE. Each X_i is a Bernoulli random variable with expected value E(X_i) = θ and variance Var(X_i) = θ(1 − θ). The MLE θ̂ is itself a random variable and has expected value

E(θ̂) = E(r/n) = E(Σ_{i=1}^n X_i / n) = (1/n) Σ_{i=1}^n E(X_i) = (1/n) Σ_{i=1}^n θ = θ.

If an estimator has on average the value of the parameter that it is intended to estimate, then we call it unbiased, i.e. if E(θ̂) = θ. From the above calculation it follows that θ̂ is an unbiased estimator of θ. The variance of θ̂ is

Var(θ̂) = Var(Σ_{i=1}^n X_i / n) = (1/n²) Σ_{i=1}^n Var(X_i) = (1/n²) Σ_{i=1}^n θ(1 − θ) = θ(1 − θ)/n.
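The fact that the log-likelihood peaks at r/n is easy to confirm numerically. The sketch below (an illustration with made-up values of n and r, not part of the original notes) maximises the Bernoulli log-likelihood over a fine grid.

```python
import math

# Sketch: for r successes in n Bernoulli trials, the log-likelihood
# r*log(theta) + (n - r)*log(1 - theta) peaks at theta-hat = r/n.
n, r = 40, 13  # hypothetical data

def loglik(theta):
    return r * math.log(theta) + (n - r) * math.log(1 - theta)

grid = [i / 10_000 for i in range(1, 10_000)]  # crude search over (0, 1)
theta_hat = max(grid, key=loglik)
print(abs(theta_hat - r / n) < 1e-3)  # the grid maximiser sits at r/n
```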
Example 3 (Binomial sampling). The number of successes in n Bernoulli trials is a
random variable R taking on values r = 0, 1, . . . , n with probability mass function

P(R = r) = C(n, r) θ^r (1 − θ)^{n−r}.

This is the exact same sampling scheme as in the previous example, except that instead of observing the sequence x we only observe the total number of successes r. Hence the likelihood function has the form

l_R(θ | r) = C(n, r) θ^r (1 − θ)^{n−r}.
The relevant mathematical calculations are as follows:

log l_R(θ | r) = log C(n, r) + r log θ + (n − r) log(1 − θ),
S(θ) = r/θ − (n − r)/(1 − θ),  θ̂ = r/n,
I(θ) = r/θ² + (n − r)/(1 − θ)² > 0,
E(θ̂) = E(R)/n = nθ/n = θ  (unbiased),
Var(θ̂) = Var(R)/n² = nθ(1 − θ)/n² = θ(1 − θ)/n.
Example 4 (Prevalence of a Genotype). Geneticists interested in the prevalence of a certain genotype observe that the genotype makes its first appearance in the 22nd subject analysed. If we assume that the subjects are independent, the likelihood function can be computed based on the geometric distribution as l(θ) = θ(1 − θ)^{n−1}. The score function is then S(θ) = θ^{−1} − (n − 1)(1 − θ)^{−1}. Setting S(θ) = 0 we get θ̂ = n^{−1} = 22^{−1}. Moreover, I(θ) = θ^{−2} + (n − 1)(1 − θ)^{−2} is greater than zero for all θ, implying that θ̂ is the MLE.

Suppose that the geneticists had planned to stop sampling once they observed r = 10 subjects with the specified genotype, and the tenth subject with the genotype was the 100th subject analysed overall. The likelihood of θ can be computed based on the negative binomial distribution as

l(θ) = C(n − 1, r − 1) θ^r (1 − θ)^{n−r}

for n = 100, r = 10. The usual calculation will confirm that θ̂ = r/n is the MLE.
Example 5 (Radioactive Decay). In this classic set of data Rutherford and Geiger counted the number of scintillations in 7.5-second intervals caused by radioactive decay of a quantity of the element polonium. Altogether there were 10097 scintillations during 2608 such intervals:

Count     0   1   2   3   4   5   6   7   8  9  10 11 12 13 14
Observed  57 203 383 525 532 408 273 139 45 27 10  4  0  1  1
The Poisson probability mass function with mean parameter λ is

f_X(x | λ) = λ^x exp(−λ)/x!.

The likelihood function equals

l(λ) = Π λ^{x_i} exp(−λ)/x_i! = λ^{Σ x_i} exp(−nλ)/Π x_i!.

The relevant mathematical calculations are

log l(λ) = (Σ x_i) log λ − nλ − log Π(x_i!),
S(λ) = Σ x_i/λ − n,  λ̂ = Σ x_i/n = x̄,
I(λ) = Σ x_i/λ² > 0,

implying λ̂ is the MLE. Also, E(λ̂) = (1/n) Σ E(x_i) = λ, so λ̂ is an unbiased estimator. Next, Var(λ̂) = (1/n²) Σ Var(x_i) = λ/n. It is always useful to compare the fitted values from a model against the observed values.
i    0   1   2   3   4   5   6   7   8  9  10 11 12 13 14
O_i  57 203 383 525 532 408 273 139 45 27 10  4  0  1  1
E_i  54 211 407 525 508 393 254 140 68 29 11  4  1  0  0
     +3  -8 -24   0 +24 +15 +19  -1 -23 -2 -1  0 -1 +1 +1

The Poisson law agrees with the observed variation to within about one-twentieth of its range.
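The fitted row can be reproduced directly (a sketch, not part of the original notes; the cell counts are taken from the table above): λ̂ is the sample mean, and the expected frequency in cell i is 2608 · P(Y = i | λ̂).

```python
import math

# Sketch: recompute lambda-hat and the expected Poisson frequencies for
# the Rutherford-Geiger data tabulated above.
observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]
n = sum(observed)                                   # 2608 intervals
total = sum(i * o for i, o in enumerate(observed))  # 10097 scintillations
lam = total / n                                     # the MLE, the sample mean
expected = [n * math.exp(-lam) * lam ** i / math.factorial(i)
            for i in range(len(observed))]
print(n, total, round(lam, 2))  # 2608 10097 3.87
```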
Example 6 (Exponential distribution). Suppose random variables X_1, . . . , X_n are i.i.d. as Exp(λ). Then

l(λ) = Π_{i=1}^n λ exp(−λ x_i) = λ^n exp(−λ Σ x_i),
log l(λ) = n log λ − λ Σ x_i,
S(λ) = n/λ − Σ_{i=1}^n x_i,  λ̂ = n/Σ x_i,
I(λ) = n/λ² > 0.
Exercise 15. Demonstrate that the expectation and variance of λ̂ are given as follows:

E[λ̂] = nλ/(n − 1),
Var[λ̂] = n²λ²/[(n − 1)²(n − 2)].

Hint: Find the probability distribution of Z = Σ_{i=1}^n X_i, where X_i ~ Exp(λ).

Exercise 16. Propose the alternative estimator λ̃ = λ̂(n − 1)/n. Show that λ̃ is an unbiased estimator of λ with variance

Var[λ̃] = λ²/(n − 2).
As this example demonstrates, maximum likelihood estimation does not automatically produce unbiased estimates. If it is thought that this property is (in some sense) desirable, then some adjustments to the MLEs, usually in the form of scaling, may be required.
Example 7 (Gaussian Distribution). Consider data X_1, X_2, . . . , X_n distributed as N(μ, σ²). Then the likelihood function is

l(μ, σ²) = (2πσ²)^{−n/2} exp[ −Σ_{i=1}^n (x_i − μ)²/(2σ²) ]
and the log-likelihood function is

log l(μ, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (x_i − μ)².  (3.2)

Unknown mean and known variance. As σ² is known, we treat this parameter as a constant when differentiating with respect to μ. Then

S(μ) = (1/σ²) Σ_{i=1}^n (x_i − μ),  μ̂ = (1/n) Σ_{i=1}^n x_i,  and  I(μ) = n/σ² > 0.

Also, E[μ̂] = nμ/n = μ, and so the MLE of μ is unbiased. Finally,

Var[μ̂] = (1/n²) Var(Σ_{i=1}^n x_i) = σ²/n = (E[I(μ)])^{−1}.
Known mean and unknown variance. Differentiating (3.2) with respect to σ² returns

S(σ²) = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − μ)²,

and setting S(σ²) = 0 implies

σ̂² = (1/n) Σ_{i=1}^n (x_i − μ)².

Differentiating again, and multiplying by −1, yields

I(σ²) = −n/(2σ⁴) + (1/σ⁶) Σ_{i=1}^n (x_i − μ)².

Clearly σ̂² is the MLE since

I(σ̂²) = n/(2σ̂⁴) > 0.

Define

Z_i = (X_i − μ)/σ,

so that Z_i ~ N(0, 1). From the appendix on probability, Σ_{i=1}^n Z_i² ~ χ²_n, implying E[Σ Z_i²] = n and Var[Σ Z_i²] = 2n. The MLE

σ̂² = (σ²/n) Σ_{i=1}^n Z_i².
Then

E[σ̂²] = (σ²/n) E[Σ_{i=1}^n Z_i²] = σ²,

and

Var[σ̂²] = (σ²/n)² Var[Σ_{i=1}^n Z_i²] = 2σ⁴/n.
Our treatment of the two parameters of the Gaussian distribution in the last example
was to (i) fix the variance and estimate the mean using maximum likelihood; and then
(ii) fix the mean and estimate the variance using maximum likelihood. In practice we
would like to consider the simultaneous estimation of these parameters. In the next
section of these notes we extend MLE to multiple parameter estimation.
3.2 Multi-parameter Estimation
Suppose that a statistical model specifies that the data y has a probability distribution f(y; θ, φ) depending on two unknown parameters θ and φ. In this case the likelihood function is a function of the two variables θ and φ, and having observed the value y it is defined as l(θ, φ) = f(y; θ, φ), with log-likelihood log l(θ, φ). The MLE of (θ, φ) is a value (θ̂, φ̂) for which l(θ, φ), or equivalently log l(θ, φ), attains its maximum value.

Define S_1(θ, φ) = ∂ log l/∂θ and S_2(θ, φ) = ∂ log l/∂φ. The MLEs (θ̂, φ̂) can be obtained by solving the pair of simultaneous equations

S_1(θ, φ) = 0,
S_2(θ, φ) = 0.
Let us consider the matrix I(θ, φ) with entries

I_11(θ, φ) = −∂² log l/∂θ²,   I_12(θ, φ) = −∂² log l/∂θ∂φ,
I_21(θ, φ) = −∂² log l/∂φ∂θ,   I_22(θ, φ) = −∂² log l/∂φ².

The conditions for a value (θ_0, φ_0) satisfying S_1(θ_0, φ_0) = 0 and S_2(θ_0, φ_0) = 0 to be a MLE are that

I_11(θ_0, φ_0) > 0,   I_22(θ_0, φ_0) > 0,
and

det I(θ_0, φ_0) = I_11(θ_0, φ_0) I_22(θ_0, φ_0) − I_12(θ_0, φ_0)² > 0.
This is equivalent to requiring that both eigenvalues of the matrix I(0, 0) be positive.
Example 8 (Gaussian distribution). Let X_1, X_2, . . . , X_n be iid observations from a N(μ, σ²) density in which both μ and σ² are unknown. The log-likelihood is

log l(μ, σ²) = Σ_{i=1}^n log{ (1/√(2πσ²)) exp[−(x_i − μ)²/(2σ²)] }
= Σ_{i=1}^n { −(1/2) log(2π) − (1/2) log σ² − (x_i − μ)²/(2σ²) }
= −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (x_i − μ)².
Hence for $v = \sigma^2$,
$$S_1(\mu, v) = \frac{\partial \log l}{\partial \mu} = \frac{1}{v}\sum_{i=1}^n (x_i - \mu) = 0,$$
which implies that
$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}. \tag{3.3}$$
Also
$$S_2(\mu, v) = \frac{\partial \log l}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^n (x_i - \mu)^2 = 0$$
implies that
$$\hat\sigma^2 = \hat{v} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2. \tag{3.4}$$
Calculating second derivatives and multiplying by $-1$ gives
$$I(\mu, v) = \begin{pmatrix} \dfrac{n}{v} & \dfrac{1}{v^2}\displaystyle\sum_{i=1}^n (x_i - \mu) \\ \dfrac{1}{v^2}\displaystyle\sum_{i=1}^n (x_i - \mu) & -\dfrac{n}{2v^2} + \dfrac{1}{v^3}\displaystyle\sum_{i=1}^n (x_i - \mu)^2 \end{pmatrix}.$$
Hence $I(\hat\mu, \hat{v})$ is given by
$$I(\hat\mu, \hat{v}) = \begin{pmatrix} \dfrac{n}{\hat v} & 0 \\ 0 & \dfrac{n}{2\hat v^2} \end{pmatrix}.$$
Clearly both diagonal terms are positive and the determinant is positive, and so $(\hat\mu, \hat{v})$ are, indeed, the MLEs of $(\mu, v)$.
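The closed forms (3.3)–(3.4) can be verified numerically. The sketch below (simulated data with an arbitrary seed; not part of the notes) checks that $(\hat\mu, \hat{v})$ makes both score functions vanish and that nearby parameter values give a strictly lower log likelihood.

```python
import numpy as np

# Numerical sketch: (mu_hat, v_hat) from (3.3)-(3.4) solves the score
# equations, and the log likelihood is lower at nearby parameter values.
rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, size=200)
n = len(x)

mu_hat = x.mean()
v_hat = ((x - mu_hat) ** 2).mean()     # note: divisor n, not n - 1

def log_lik(mu, v):
    return -0.5 * n * np.log(2 * np.pi * v) - ((x - mu) ** 2).sum() / (2 * v)

S1 = (x - mu_hat).sum() / v_hat                                        # d log l / d mu
S2 = -n / (2 * v_hat) + ((x - mu_hat) ** 2).sum() / (2 * v_hat ** 2)   # d log l / d v

best = log_lik(mu_hat, v_hat)
nearby = [log_lik(mu_hat + dm, v_hat + dv)
          for dm in (-0.1, 0.1) for dv in (-0.1, 0.1)]
```

Both scores are zero up to floating-point error, and every perturbed point has a smaller log likelihood, as the positive-definiteness of $I(\hat\mu, \hat{v})$ guarantees.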
Go back to equation (3.3): $\bar{X} \sim N(\mu, v/n)$. Clearly $E(\bar{X}) = \mu$ (unbiased) and $\mathrm{Var}(\bar{X}) = v/n$. Go back to equation (3.4). Then from Lemma 1, which is proven below, we have
$$\frac{n\hat{v}}{v} \sim \chi^2_{n-1},$$
so that
$$E\Big[\frac{n\hat{v}}{v}\Big] = n - 1, \qquad E(\hat{v}) = \frac{n-1}{n}\,v.$$
Instead, propose the (unbiased) estimator of $\sigma^2$
$$S^2 = \tilde{v} = \frac{n}{n-1}\,\hat{v} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2. \tag{3.5}$$
Observe that
$$E(\tilde{v}) = \frac{n}{n-1}\,E(\hat{v}) = \frac{n}{n-1}\cdot\frac{n-1}{n}\,v = v,$$
and $\tilde{v}$ is unbiased as suggested. We can easily show that
$$\mathrm{Var}(\tilde{v}) = \frac{2v^2}{n-1}.$$
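The bias of $\hat{v}$ and the unbiasedness of $S^2$ from (3.5) are visible in a short simulation. This sketch (arbitrary seed and parameter values) uses the fact that NumPy's `var` exposes the choice of divisor through its `ddof` argument.

```python
import numpy as np

# Simulation sketch: the MLE v_hat (divisor n) underestimates v by the
# factor (n-1)/n, while S^2 from (3.5) (divisor n-1) is unbiased with
# variance 2v^2/(n-1).
rng = np.random.default_rng(2)
v_true, n, reps = 9.0, 10, 40000

x = rng.normal(0.0, np.sqrt(v_true), size=(reps, n))
v_mle = x.var(axis=1, ddof=0)   # divisor n     -> biased
s2 = x.var(axis=1, ddof=1)      # divisor n - 1 -> unbiased

mean_mle = v_mle.mean()   # theory: (n-1)/n * 9 = 8.1
mean_s2 = s2.mean()       # theory: 9
var_s2 = s2.var()         # theory: 2*81/9 = 18
```

For $n = 10$ the downward bias of the MLE is substantial (about 10%), which is why the $n - 1$ divisor is the common default.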
Lemma 1 (Joint distribution of the sample mean and sample variance). If $X_1, \ldots, X_n$ are iid $N(\mu, v)$, then the sample mean $\bar{X}$ and sample variance $S^2$ are independent. Also, $\bar{X}$ is distributed $N(\mu, v/n)$ and $(n-1)S^2/v$ is a chi-squared random variable with $n - 1$ degrees of freedom.
Proof. Define
$$W = \sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(\bar{X} - \mu)^2,$$
so that
$$\frac{W}{v} + \frac{(\bar{X} - \mu)^2}{v/n} = \sum_{i=1}^n \frac{(X_i - \mu)^2}{v}.$$
The RHS is the sum of $n$ independent standard normal random variables squared, and so is distributed $\chi^2_n$. Also, $\bar{X} \sim N(\mu, v/n)$, therefore $(\bar{X} - \mu)^2/(v/n)$ is the square of a standard normal and so is distributed $\chi^2_1$. These chi-squared random variables have moment generating functions $(1-2t)^{-n/2}$ and $(1-2t)^{-1/2}$ respectively. Next, $W/v$ and $(\bar{X} - \mu)^2/(v/n)$ are independent, since
$$\mathrm{Cov}(X_i - \bar{X}, \bar{X}) = \mathrm{Cov}(X_i, \bar{X}) - \mathrm{Cov}(\bar{X}, \bar{X}) = \mathrm{Cov}\Big(X_i, \frac{1}{n}\sum_j X_j\Big) - \mathrm{Var}(\bar{X})$$
$$= \frac{1}{n}\sum_j \mathrm{Cov}(X_i, X_j) - \frac{v}{n} = \frac{v}{n} - \frac{v}{n} = 0.$$
But $\mathrm{Cov}(X_i - \bar{X}, \bar{X} - \mu) = \mathrm{Cov}(X_i - \bar{X}, \bar{X}) = 0$, hence
$$\mathrm{Cov}\Big(\sum_i (X_i - \bar{X}), \bar{X} - \mu\Big) = 0.$$
Since all these variables are jointly normal, zero covariance implies that the deviations $X_i - \bar{X}$, and hence $W$, are independent of $\bar{X}$.
As the moment generating function of the sum of independent random variables is equal to the product of their individual moment generating functions, we see
$$E\big[e^{t(W/v)}\big](1-2t)^{-1/2} = (1-2t)^{-n/2},$$
so that
$$E\big[e^{t(W/v)}\big] = (1-2t)^{-(n-1)/2}.$$
But $(1-2t)^{-(n-1)/2}$ is the moment generating function of a $\chi^2$ random variable with $n - 1$ degrees of freedom, and the moment generating function uniquely characterizes the distribution; hence $W/v = (n-1)S^2/v \sim \chi^2_{n-1}$. $\Box$
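Lemma 1 is also easy to corroborate by simulation. The sketch below (arbitrary seed and parameters, not from the notes) checks that $\bar{X}$ and $S^2$ are uncorrelated across replications and that $W/v = (n-1)S^2/v$ has the first two moments of a $\chi^2_{n-1}$ variable.

```python
import numpy as np

# Simulation sketch of Lemma 1: X_bar and S^2 are uncorrelated (in fact
# independent), and (n-1)S^2/v has chi-squared(n-1) moments:
# mean n-1 and variance 2(n-1).
rng = np.random.default_rng(3)
mu, v, n, reps = 1.0, 4.0, 8, 50000

x = rng.normal(mu, np.sqrt(v), size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)
w_over_v = (n - 1) * s2 / v

corr = np.corrcoef(xbar, s2)[0, 1]   # theory: 0 (independence)
m = w_over_v.mean()                  # theory: n - 1 = 7
vv = w_over_v.var()                  # theory: 2(n - 1) = 14
```

A vanishing correlation is of course weaker than independence, but together with joint normality (as used in the proof) it is equivalent to it.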
Suppose that a statistical model specifies that the data $x$ has a probability distribution $f(x; \theta)$ depending on a vector of $m$ unknown parameters $\theta = (\theta_1, \ldots, \theta_m)$. In this case the likelihood function is a function of the $m$ parameters $\theta_1, \ldots, \theta_m$ and, having observed the value of $x$, is defined as $l(\theta) = f(x; \theta)$, with log-likelihood $\log l(\theta)$. The MLE of $\theta$ is a value $\hat\theta$ for which $l(\theta)$, or equivalently $\log l(\theta)$, attains its maximum value. For $r = 1, \ldots, m$ define $S_r(\theta) = \partial \log l/\partial\theta_r$. Then we can (usually) find the MLE by solving the set of $m$ simultaneous equations $S_r(\theta) = 0$ for $r = 1, \ldots, m$. The matrix $I(\theta)$ is defined to be the $m \times m$ matrix whose $(r, s)$ element is given by $I_{rs} = -\partial^2 \log l/\partial\theta_r\,\partial\theta_s$. The conditions for a value $\hat\theta$ satisfying $S_r(\hat\theta) = 0$ for $r = 1, \ldots, m$ to be a MLE are that all the eigenvalues of the matrix $I(\hat\theta)$ are positive.
3.3 The Invariance Principle
How do we deal with parameter transformation? We will assume a one-to-one transformation, but the idea applies generally. Consider a binomial sample with $n = 10$ independent trials resulting in data $x = 8$ successes. The likelihood ratio of $\theta_1 = 0.8$ versus $\theta_2 = 0.3$ is
$$\frac{l(\theta_1 = 0.8)}{l(\theta_2 = 0.3)} = \frac{\theta_1^8(1 - \theta_1)^2}{\theta_2^8(1 - \theta_2)^2} = 208.7,$$
that is, given the data, $\theta = 0.8$ is about 200 times more likely than $\theta = 0.3$.

Suppose we are interested in expressing $\theta$ on the logit scale as
$$\psi \equiv \log\{\theta/(1 - \theta)\};$$
then intuitively our relative information about $\psi_1 = \log(0.8/0.2) = 1.39$ versus $\psi_2 = \log(0.3/0.7) = -0.85$ should be
$$\frac{L(\psi_1)}{L(\psi_2)} = \frac{l(\theta_1)}{l(\theta_2)} = 208.7.$$
That is, our information should be invariant to the choice of parameterization. (For the purposes of this example we are not too concerned about how to calculate $L(\psi)$.)
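The numbers in this example can be reproduced directly. The sketch below computes the likelihood ratio on the probability scale and again through the logit parameterization (the binomial coefficient cancels in any ratio, so it is omitted):

```python
import numpy as np

# Sketch of the binomial example: likelihood ratio for theta1 = 0.8 versus
# theta2 = 0.3 given x = 8 successes in n = 10 trials, on the probability
# scale and again via the logit scale.
def log_lik(theta, x=8, n=10):
    # binomial coefficient omitted: it cancels in any likelihood ratio
    return x * np.log(theta) + (n - x) * np.log(1 - theta)

ratio = np.exp(log_lik(0.8) - log_lik(0.3))   # ~ 208.7

def expit(psi):
    return 1.0 / (1.0 + np.exp(-psi))

# Induced likelihood on the logit scale: evaluate at theta = expit(psi).
psi1, psi2 = np.log(0.8 / 0.2), np.log(0.3 / 0.7)   # 1.39 and -0.85
ratio_logit = np.exp(log_lik(expit(psi1)) - log_lik(expit(psi2)))
```

Because $\psi \mapsto \theta$ is one-to-one, the ratio computed on the logit scale is identical to the one on the probability scale, which is exactly the invariance being claimed.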
Theorem 3.3.1 (Invariance of the MLE). If $g$ is a one-to-one function, and $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$.

Proof. This is trivially true: if we let $\psi = g(\theta)$, then $\theta = g^{-1}(\psi)$, and $f\{y \mid g^{-1}(\psi)\}$ is maximized in $\psi$ exactly when $\psi = g(\hat\theta)$. When $g$ is not one-to-one the discussion becomes more subtle, but we simply choose to define $g(\theta)_{\mathrm{MLE}} = g(\hat\theta)$. $\Box$
It seems intuitive that if $\hat\theta$ is most likely for $\theta$ and our knowledge (data) remains unchanged, then $g(\hat\theta)$ is most likely for $g(\theta)$. In fact, we would find it strange if $\hat\theta$ is an
estimate of $\theta$, but $\hat\theta^2$ is not an estimate of $\theta^2$. In the binomial example with $n = 10$ and $x = 8$ we get $\hat\theta = 0.8$, so the MLE of $g(\theta) = \theta/(1 - \theta)$ is
$$g(\hat\theta) = \hat\theta/(1 - \hat\theta) = 0.8/0.2 = 4.$$
Chapter 4
Estimation
In the previous chapter we have seen an approach to estimation that is based on the likelihood of observed results. Next we study the general theory of estimation, which is used to compare different estimators and to decide on the most efficient one.
4.1 General properties of estimators
Suppose that we are going to observe a value of a random vector $X$. Let $\mathcal{X}$ denote the set of possible values $X$ can take and, for $x \in \mathcal{X}$, let $f(x \mid \theta)$ denote the probability that $X$ takes the value $x$, where the parameter $\theta$ is some unknown element of the set $\Theta$.

The problem we face is that of estimating $\theta$. An estimator $\hat\theta$ is a procedure which for each possible value $x \in \mathcal{X}$ specifies which element of $\Theta$ we should quote as an estimate of $\theta$. When we observe $X = x$ we quote $\hat\theta(x)$ as our estimate of $\theta$. Thus $\hat\theta$ is a function of the random vector $X$. Sometimes we write $\hat\theta(X)$ to emphasise this point.
Given any estimator $\hat\theta$ we can calculate its expected value for each possible value of $\theta \in \Theta$. As we have already mentioned when discussing maximum likelihood estimation, an estimator is said to be unbiased if this expected value is identically equal to $\theta$. If an estimator is unbiased then we can conclude that if we repeat the experiment an infinite number of times with $\theta$ fixed and calculate the value of the estimator each time, then the average of the estimator values will be exactly equal to $\theta$. To evaluate
the usefulness of an estimator $\hat\theta = \hat\theta(x)$ of $\theta$, examine the properties of the random variable $\hat\theta = \hat\theta(X)$.
Definition 1 (Unbiased estimators). An estimator $\hat\theta = \hat\theta(X)$ is said to be unbiased for a parameter $\theta$ if it equals $\theta$ in expectation:
$$E[\hat\theta(X)] = E(\hat\theta) = \theta.$$
Intuitively, an unbiased estimator is right on target. $\Box$

Definition 2 (Bias of an estimator). The bias of an estimator $\hat\theta = \hat\theta(X)$ of $\theta$ is defined as $\mathrm{bias}(\hat\theta) = E[\hat\theta(X) - \theta]$. $\Box$
Note that even if $\hat\theta$ is an unbiased estimator of $\theta$, $g(\hat\theta)$ will generally not be an unbiased estimator of $g(\theta)$ unless $g$ is linear or affine. This limits the importance of the notion of unbiasedness. It might be at least as important that an estimator is accurate in the sense that its distribution is highly concentrated around $\theta$.
Exercise 17. Show that for an arbitrary distribution the estimator $S^2$ as defined in (3.5) is an unbiased estimator of the variance of this distribution.

Exercise 18. Consider the estimator $S^2$ of the variance $\sigma^2$ in the case of the normal distribution. Demonstrate that although $S^2$ is an unbiased estimator of $\sigma^2$, $S$ is not an unbiased estimator of $\sigma$. Compute its bias.
Definition 3 (Mean squared error). The mean squared error of the estimator $\hat\theta$ is defined as $\mathrm{MSE}(\hat\theta) = E(\hat\theta - \theta)^2$. Given the same set of data, $\hat\theta_1$ is better than $\hat\theta_2$ if $\mathrm{MSE}(\hat\theta_1) \le \mathrm{MSE}(\hat\theta_2)$ (uniformly better if this holds for every value of $\theta$). $\Box$
Lemma 2 (The MSE variance–bias tradeoff). The MSE decomposes as
$$\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + \mathrm{bias}(\hat\theta)^2.$$
Proof. We have
$$\mathrm{MSE}(\hat\theta) = E(\hat\theta - \theta)^2 = E\big\{[\hat\theta - E(\hat\theta)] + [E(\hat\theta) - \theta]\big\}^2$$
$$= E[\hat\theta - E(\hat\theta)]^2 + [E(\hat\theta) - \theta]^2 + 2\underbrace{E\big\{[\hat\theta - E(\hat\theta)][E(\hat\theta) - \theta]\big\}}_{=0}$$
$$= \mathrm{Var}(\hat\theta) + \underbrace{[E(\hat\theta) - \theta]^2}_{\mathrm{bias}(\hat\theta)^2}. \qquad \Box$$
NOTE: This lemma implies that the mean squared error of an unbiased estimator is equal to the variance of the estimator.
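The decomposition can be seen numerically for a deliberately biased estimator. This sketch (arbitrary seed and parameters, not from the notes) shifts the sample mean by $0.5$, so the theory predicts $\mathrm{MSE} = \sigma^2/n + 0.5^2$.

```python
import numpy as np

# Monte Carlo sketch of Lemma 2: for the biased estimator
# theta_hat = X_bar + 0.5 of theta = mu, the empirical MSE matches
# Var(theta_hat) + bias^2 (here 0.2 + 0.25 = 0.45).
rng = np.random.default_rng(4)
mu, sigma, n, reps = 3.0, 2.0, 20, 30000

x = rng.normal(mu, sigma, size=(reps, n))
theta_hat = x.mean(axis=1) + 0.5   # bias 0.5, variance sigma^2/n = 0.2

mse = ((theta_hat - mu) ** 2).mean()
decomp = theta_hat.var() + (theta_hat.mean() - mu) ** 2
```

In fact the identity here holds exactly for the empirical quantities (with the divisor-$n$ variance), not just in expectation, which is a convenient way to remember the lemma.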
Exercise 19. Consider $X_1, \ldots, X_n$ where $X_i \sim N(\mu, \sigma^2)$ and $\sigma$ is known. Three estimators of $\mu$ are $\hat\mu_1 = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$, $\hat\mu_2 = X_1$, and $\hat\mu_3 = (X_1 + \bar{X})/2$. Discuss their properties; which one would you recommend, and why?
Example 9. Consider $X_1, \ldots, X_n$ to be independent random variables with means $E(X_i) = \mu$ and variances $\mathrm{Var}(X_i) = \sigma_i^2$. Consider pooling the estimators of $\mu$ into a common estimator using the linear combination $\hat\mu = w_1 X_1 + w_2 X_2 + \cdots + w_n X_n$. We will see that the following is true:

(i) The estimator $\hat\mu$ is unbiased if and only if $\sum_i w_i = 1$.

(ii) The estimator $\hat\mu$ has minimum variance among this class of estimators when the weights are inversely proportional to the variances $\sigma_i^2$.

(iii) The variance of $\hat\mu$ for optimal weights $w_i$ is $\mathrm{Var}(\hat\mu) = 1/\sum_i \sigma_i^{-2}$.

Indeed, we have $E(\hat\mu) = E(w_1 X_1 + \cdots + w_n X_n) = \sum_i w_i E(X_i) = \sum_i w_i \mu = \mu \sum_i w_i$, so $\hat\mu$ is unbiased if and only if $\sum_i w_i = 1$. The variance of our estimator is $\mathrm{Var}(\hat\mu) = \sum_i w_i^2 \sigma_i^2$, which should be minimized subject to the constraint $\sum_i w_i = 1$.
Differentiating the Lagrangian $L = \sum_i w_i^2 \sigma_i^2 - \lambda(\sum_i w_i - 1)$ with respect to $w_i$ and setting equal to zero yields $2 w_i \sigma_i^2 = \lambda$, so that $w_i = \sigma_i^{-2}/(\sum_j \sigma_j^{-2})$. Then, for the optimal weights, we get
$$\mathrm{Var}(\hat\mu) = \sum_i w_i^2 \sigma_i^2 = \frac{\sum_i \sigma_i^{-4}\sigma_i^2}{\big(\sum_i \sigma_i^{-2}\big)^2} = \frac{1}{\sum_i \sigma_i^{-2}}.$$
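The optimal-weight result is easy to check with a concrete set of variances. The four variances below are illustrative choices, not from the notes.

```python
import numpy as np

# Sketch of Example 9: weights proportional to 1/sigma_i^2 beat equal
# weights, and the optimal variance equals 1 / sum(1/sigma_i^2).
sigma2 = np.array([1.0, 4.0, 9.0, 16.0])

w_opt = (1 / sigma2) / (1 / sigma2).sum()   # inverse-variance weights
w_eq = np.full(len(sigma2), 1 / len(sigma2))

var_opt = (w_opt ** 2 * sigma2).sum()   # variance of the pooled estimator
var_eq = (w_eq ** 2 * sigma2).sum()
closed_form = 1 / (1 / sigma2).sum()    # the formula from (iii)
```

The design intuition is that noisier observations should count for less; equal weighting lets the high-variance terms inflate the pooled variance.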
Assume now that instead of $X_i$ we observe the biased variables $\tilde{X}_i = X_i + \beta$ for some $\beta \neq 0$. When $\sigma_i^2 = \sigma^2$ we have $\mathrm{Var}(\hat\mu) = \sigma^2/n$, which tends to zero as $n \to \infty$, whereas $\mathrm{bias}(\hat\mu) = \beta$ and $\mathrm{MSE}(\hat\mu) = \sigma^2/n + \beta^2$. Thus in the general case, when bias is present it tends to dominate the variance as $n$ gets larger, which is very unfortunate.
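A quick simulation makes the point vivid. In this sketch (arbitrary seed and parameters) every observation carries a systematic error $\beta = 0.5$; the variance part of the MSE shrinks like $\sigma^2/n$, but the $\beta^2$ part does not.

```python
import numpy as np

# Simulation sketch: MSE(mu_hat) = sigma^2/n + beta^2 when every
# observation is shifted by beta. More data kills the variance term only.
rng = np.random.default_rng(5)
mu, sigma, beta, reps = 0.0, 2.0, 0.5, 5000

def mc_mse(n):
    x = rng.normal(mu + beta, sigma, size=(reps, n))   # biased observations
    return ((x.mean(axis=1) - mu) ** 2).mean()

mse_10 = mc_mse(10)       # theory: 4/10 + 0.25 = 0.65
mse_1000 = mc_mse(1000)   # theory: 4/1000 + 0.25 = 0.254
```

Even with a hundred times more data, the MSE cannot drop below $\beta^2 = 0.25$: collecting more observations is no cure for a systematic error.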
Exercise 20. Let $X_1, \ldots, X_n$ be an independent sample of size $n$ from the uniform distribution on the interval $(0, \theta)$, with the density of a single observation being $f(x \mid \theta) = 1/\theta$ for $0 < x < \theta$ and $0$ otherwise, and consider $\theta > 0$ unknown.

(i) Find the expected value and variance of the estimator $\hat\theta = 2\bar{X}$.

(ii) Find the expected value of the estimator $\tilde\theta = X_{(n)}$, i.e. the largest observation.

(iii) Find an unbiased estimator of the form $\check\theta = c X_{(n)}$ and calculate its variance.

(iv) Compare the mean squared errors of $\hat\theta$ and $\check\theta$.
4.2 Minimum-Variance Unbiased Estimation

Getting a small MSE often involves a tradeoff between variance and bias. For unbiased estimators, the MSE obviously equals the variance, $\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta)$, so no tradeoff can be made. One approach is to restrict ourselves to the subclass of estimators that are unbiased and have minimum variance.
Definition 4 (Minimum-variance unbiased estimator). If an unbiased estimator of $g(\theta)$ has minimum variance among all unbiased estimators of $g(\theta)$, it is called a minimum-variance unbiased estimator (MVUE). $\Box$
We will develop a method of finding the MVUE when it exists. When such an
estimator does not exist we will be able to find a lower bound for the variance of an
unbiased estimator in the class of unbiased estimators, and compare the variance of our
unbiased estimator with this lower bound.