

Mathematical Statistics

Old School

John I. Marden

Department of Statistics

University of Illinois at Urbana-Champaign


©2012 by John I. Marden


Contents

1 Introduction
  1.1 Examples
    1.1.1 Example – Fruit Flies
    1.1.2 Example – Political Interest
    1.1.3 Example – Mouth Size

I Distribution Theory

2 Distributions and Densities
  2.1 Probability models
  2.2 Distribution functions
  2.3 Densities
    2.3.1 Probability density functions
    2.3.2 Probability mass functions
  2.4 Distributions without pdf's or pmf's
    2.4.1 Example: Late start
    2.4.2 Example: Spinner
    2.4.3 Example: Random coin

3 Expected Values
  3.1 Definition
    3.1.1 Indicator functions
  3.2 Means, Variances, and Covariances
    3.2.1 Example
    3.2.2 The variance of linear combinations & affine transformations
  3.3 Other summaries
  3.4 The moment and cumulant generating functions
    3.4.1 Example: Normal
    3.4.2 Example: Gamma
    3.4.3 Example: Binomial and Multinomial
  3.5 Vectors and matrices

4 Marginal Distributions and Independence
  4.1 Marginal distributions
    4.1.1 Example: Marginals of a multinomial
  4.2 Marginal densities
    4.2.1 Example: Ranks
  4.3 Independence
    4.3.1 Example: Independent exponentials
    4.3.2 Spaces and densities
    4.3.3 IID

5 Transformations I
  5.1 Distribution functions
    5.1.1 Example: Sum of uniforms
    5.1.2 Example: Chi-squared on 1 df
    5.1.3 Example: Probability transform
  5.2 Moment generating functions
    5.2.1 Example: Uniform → Exponential
    5.2.2 Example: Sums of independent Gammas
    5.2.3 Linear combinations of independent Normals
    5.2.4 Normalized means
  5.3 Discrete distributions
    5.3.1 Example: Sum of discrete uniforms
    5.3.2 Example: Convolutions
    5.3.3 Example: Sum of two Poissons
    5.3.4 Binomials and Bernoullis

6 Transformations II: Continuous distributions
  6.1 One dimension
  6.2 General case
    6.2.1 Example: Gamma – Beta
    6.2.2 Example: The Dirichlet distribution
    6.2.3 Example: Affine transformations
    6.2.4 Example: Bivariate Normal
    6.2.5 Example: Orthogonal transformations
    6.2.6 Example: Polar coordinates
  6.3 Order statistics
    6.3.1 Approximating the mean and variance

7 Conditional distributions
  7.1 Introduction
  7.2 Examples of conditional distributions
    7.2.1 Simple linear model
    7.2.2 Finite mixture model
    7.2.3 Hierarchical models
    7.2.4 Bayesian models
  7.3 Conditional + Marginal → Joint
    7.3.1 Joint densities
  7.4 The marginal distribution of Y
    7.4.1 Example: Coins and the Beta-Binomial
    7.4.2 Example: Simple normal linear model
    7.4.3 Marginal mean and variance
    7.4.4 Example: Fruit flies, continued
  7.5 The conditional from the joint
    7.5.1 Example: Coin
    7.5.2 Bivariate normal
  7.6 Bayes Theorem: Reversing the conditionals
    7.6.1 Example: AIDS virus
    7.6.2 Example: Beta posterior for the binomial
  7.7 Conditionals and independence
    7.7.1 Example: Independence of residuals and X

8 The Multivariate Normal Distribution
  8.1 Definition
    8.1.1 The spectral decomposition
  8.2 Some properties of the multivariate normal
    8.2.1 Affine transformations
    8.2.2 Marginals
    8.2.3 Independence
  8.3 The pdf
  8.4 The sample mean and variance
  8.5 Chi-squares
    8.5.1 Moments of a chi-squared
    8.5.2 Other covariance matrices
  8.6 Student's t distribution
  8.7 Linear models and the conditional distribution

II Statistical Inference

9 Statistical Models and Inference
  9.1 Statistical models
  9.2 Interpreting probability
  9.3 Approaches to inference

10 Introduction to Estimation
  10.1 Definition of estimator
  10.2 Plug-in methods
  10.3 Least squares
  10.4 The likelihood function
    10.4.1 Maximum likelihood estimation
  10.5 The posterior distribution
  10.6 Bias, variance, and MSE
  10.7 Functions of estimators
    10.7.1 Example: Poisson

11 Likelihood and sufficiency
  11.1 The likelihood principle
    11.1.1 Binomial and negative binomial
  11.2 Sufficiency
    11.2.1 Examples
  11.3 Conditioning on a sufficient statistic
    11.3.1 Example: IID
    11.3.2 Example: Normal mean
  11.4 Rao-Blackwell: Improving an estimator
    11.4.1 Example: Normal probability
    11.4.2 Example: IID

12 UMVUE's
  12.1 Unbiased estimation
  12.2 Completeness: Unique unbiased estimators
    12.2.1 Example: Poisson
    12.2.2 Example: Uniform
  12.3 Uniformly minimum variance estimators
    12.3.1 Example: Poisson
  12.4 Completeness for exponential families
    12.4.1 Examples
  12.5 Fisher's Information and the Cramér-Rao Lower Bound
    12.5.1 Example: Double exponential
    12.5.2 Example: Normal µ²

13 Location Families and Shift Equivariance
  13.1 Introduction
  13.2 Location families
  13.3 Shift-equivariant estimators
    13.3.1 The Pitman estimator
  13.4 Examples
    13.4.1 Exponential
    13.4.2 Double Exponential
  13.5 General invariance and equivariance
    13.5.1 Groups and their actions
    13.5.2 Invariant models
    13.5.3 Induced actions
    13.5.4 Equivariant estimators
    13.5.5 Groups

14 The Decision-theoretic Approach
  14.1 The setup
  14.2 The risk function
  14.3 Admissibility
    14.3.1 Stein's surprising result
  14.4 Bayes procedures
    14.4.1 Bayes procedures and admissibility
  14.5 Minimax procedures
    14.5.1 Example: Binomial

15 Asymptotics: Convergence in probability
  15.1 The set-up
  15.2 Convergence in probability to a constant
    15.2.1 Chebychev's Inequality
  15.3 The law of large numbers; Mapping

16 Asymptotics: Convergence in distribution
  16.1 Converging to a constant random variable
  16.2 Moment generating functions
  16.3 The Central Limit Theorem
    16.3.1 Supersizing
  16.4 Mapping
  16.5 The ∆-method
    16.5.1 The median
    16.5.2 Variance stabilizing transformations
  16.6 The MLE in Exponential Families
  16.7 The multivariate ∆-method

17 Likelihood Estimation
  17.1 The setup: Cramér's conditions
  17.2 Consistency
    17.2.1 Convexity and Jensen's inequality
    17.2.2 A consistent sequence of roots
  17.3 Asymptotic normality
  17.4 Asymptotic efficiency
  17.5 Multivariate parameters

18 Hypothesis Testing
  18.1 Accept/Reject
  18.2 Test functions
  18.3 Simple versus simple
  18.4 The Neyman-Pearson Lemma
    18.4.1 Examples
    18.4.2 Bayesian approach
  18.5 Uniformly most powerful tests
    18.5.1 One-sided exponential family testing problems
    18.5.2 Monotone likelihood ratio
  18.6 Some additional criteria

19 Likelihood-based testing
  19.1 Score tests
    19.1.1 One-sided
    19.1.2 Many-sided
  19.2 The maximum likelihood ratio test
    19.2.1 Examples
    19.2.2 Asymptotic null distribution

20 Nonparametric Tests and Randomization
  20.1 Testing the median
    20.1.1 A randomization distribution
  20.2 Randomization test formalities
    20.2.1 Using p-values
    20.2.2 Using the normal approximation
    20.2.3 When the regular and randomization distributions coincide
    20.2.4 Example: The signed rank test
  20.3 Two-sample tests
    20.3.1 Permutations
    20.3.2 Difference in means
    20.3.3 Proof of Lemma 36
    20.3.4 Wilcoxon two-sample rank test
  20.4 Linear regression → Monotone relationships
    20.4.1 Spearman's ρ
    20.4.2 Kendall's τ
  20.5 Fisher's exact test for independence

Bibliography


Chapter 1

Introduction

Formally, a statistical model is a family of probability distributions on a sample space. In practice, the model is supposed to be an abstraction of some real (scientific) process or experiment. The sample space specifies the set of potential outcomes of the experiment (the data), and the presumption is that one of the members of the family of probability distributions describes the chances of the various outcomes. If the distribution were known, then the model would be a probability model. What makes it a statistical model is that the distribution is unknown, and one wishes to infer something about the distribution, or make some kind of decision, based on observing the data. Choosing the model for a given situation is not always easy. It is important both to be very specific about what is being assumed, and to be well aware that the model is wrong. The hope is that the model forms a basis for inference that comes reasonably close to reality.

We will start with some examples. These are small data sets with fairly simple models, but they give an idea of the type of situations we are considering. Section 2.1 formally presents probability models, and Section 9.1 presents statistical models. Subsequent sections cover some general types of models.

1.1 Examples

1.1.1 Example – Fruit Flies

Arnold [1981] presents an experiment concerning the genetics of Drosophila pseudoobscura, a type of fruit fly. We are looking at a particular locus (place) on a pair of chromosomes. The locus has two possible alleles (values): TL ≡ TreeLine and CU ≡ Cuernavaca. Each individual has two of these, one on each chromosome. The individual's genotype is the pair of alleles it has. Thus the genotype could be (TL,TL), (TL,CU), or (CU,CU). [There is no distinction made between (CU,TL) and (TL,CU).]

The objective is to estimate θ ∈ (0, 1), the proportion of CU in the population. In this experiment, the researchers randomly collected 10 adult males. Unfortunately, one cannot determine the genotype of the adult fly just by looking at him. One can determine the genotype of young flies, though. So the researchers bred each of these ten flies with a (different) female known to be (TL,TL), and analyzed two of the offspring from each mating. The genotypes of these offspring are given in Table 1.1.


Father   Offspring's Genotypes
 1       (TL,TL) & (TL,TL)
 2       (TL,TL) & (TL,CU)
 3       (TL,TL) & (TL,TL)
 4       (TL,TL) & (TL,TL)
 5       (TL,CU) & (TL,CU)
 6       (TL,TL) & (TL,CU)
 7       (TL,CU) & (TL,CU)
 8       (TL,TL) & (TL,TL)
 9       (TL,CU) & (TL,CU)
10       (TL,TL) & (TL,TL)

Table 1.1: The data on the fruit flies

The sample space consists of all possible combinations of 10 sets of two pairs of alleles, each pair being either (TL,TL) or (TL,CU). The probability of these outcomes is governed by the population proportion of CU's, but some assumptions are necessary to specify the probabilities:

1. The ten chosen Fathers are a simple random sample from the population.

2. The chance that a given Father has 0, 1 or 2 CU's in his genotype follows the Hardy-Weinberg laws, which means that the number of CU's is like flipping a coin twice independently, with probability of heads being θ.

3. For a given mating, the two offspring are each equally likely to get either of the Father's two alleles (as well as a TL from the Mother), and what the two offspring get are independent.

Based on these assumptions, the goal would be to figure out the probability distribution, and to find an estimate of θ and its variability, e.g., a 95% confidence interval. One should also try to assess the plausibility of the assumptions, although with a small data set like this one it may be difficult.
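To make the assumptions concrete, here is a small sketch (in Python; not part of the original text) of how assumptions 1–3 translate into a likelihood for θ. It assumes the data enter only through the number of (TL,CU) offspring for each father (0, 1, or 2, read off Table 1.1); the grid search and all names are illustrative choices.

```python
import numpy as np
from math import comb

# Number of (TL,CU) offspring (out of 2) for each of the 10 fathers, from Table 1.1.
cu_counts = [0, 1, 0, 0, 2, 1, 2, 0, 2, 0]

def prob_offspring_pair(j, theta):
    """P[j of the 2 offspring are (TL,CU)] for a randomly sampled father.

    Sum over the father's number k of CU alleles: k ~ Binomial(2, theta) by
    Hardy-Weinberg (assumption 2), and given k, j ~ Binomial(2, k/2) since each
    offspring independently inherits one of the father's two alleles (assumption 3).
    """
    total = 0.0
    for k in range(3):
        p_father = comb(2, k) * theta**k * (1 - theta)**(2 - k)
        p_pair = comb(2, j) * (k / 2)**j * (1 - k / 2)**(2 - j)
        total += p_father * p_pair
    return total

def likelihood(theta):
    out = 1.0
    for j in cu_counts:          # fathers are a simple random sample (assumption 1)
        out *= prob_offspring_pair(j, theta)
    return out

thetas = np.linspace(0.01, 0.99, 99)
like = np.array([likelihood(t) for t in thetas])
print("theta maximizing the likelihood on this grid:", thetas[like.argmax()])
```

Maximizing such a likelihood (here crudely, over a grid) is the maximum likelihood idea taken up in Section 10.4.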

1.1.2 Example – Political Interest

Lazarsfeld, Berelson and Gaudet (1968, The People's Choice) collected some data to determine the relationship between level of education and intention to vote in an election. The variables of interest were

• X = Education: 0 = Some high school, 1 = No high school

• Y = Interest: 0 = Great political interest, 1 = Moderate political interest, 2 = No political interest

• Z = Vote: 0 = Intends to vote, 1 = Does not intend to vote

Here is the table of data:


            Y = 0           Y = 1           Y = 2
            Z = 0   Z = 1   Z = 0   Z = 1   Z = 0   Z = 1
X = 0       490     5       917     69      74      58
X = 1       279     6       602     67      145     100

You would expect X and Z to be dependent, that is, people with more education are more likely to vote. That's not the question. The question is whether education and voting are conditionally independent given interest, that is, once you know someone's level of political interest, knowing their educational level does not help you predict whether they vote.

The model is that the n = 2812 observations are independent and identically distributed, with each having the same set of probabilities of landing in any one of the twelve possible cells of the table. Based on this model, how can the hypothesis of interest be expressed? How can it be tested, given the data?
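Purely as an illustration of what such a test might look like (and not necessarily the approach this book develops), here is a hedged sketch that computes a Pearson chi-square statistic for independence of X and Z within each level of Y and sums the three statistics. The scipy calls and the arrangement of the counts into 2×2 arrays are my own assumptions.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# 2x2 tables of counts (rows X = 0, 1; columns Z = 0, 1), one per level of Y.
tables = {
    0: np.array([[490, 5], [279, 6]]),
    1: np.array([[917, 69], [602, 67]]),
    2: np.array([[74, 58], [145, 100]]),
}

total_stat, total_df = 0.0, 0
for y, tab in tables.items():
    stat, p, df, _ = chi2_contingency(tab, correction=False)
    print(f"Y={y}: chi-square = {stat:.3f}, df = {df}, p = {p:.3f}")
    total_stat += stat
    total_df += df

print("combined chi-square =", round(total_stat, 3), "on", total_df,
      "df; p =", round(1 - chi2.cdf(total_stat, total_df), 3))
```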

1.1.3 Example – Mouth Size

Measurements were made on the size of mouths of 27 kids at four ages: 8, 10, 12, and 14.[1] The measurement is the distance from the "center of the pituitary to the pteryomaxillary fissure"[2] in millimeters. There are 11 girls (Sex=1) and 16 boys (Sex=0). Here is the raw data:

Age8 Age10 Age12 Age14 Sex

1 21.0 20.0 21.5 23.0 1

2 21.0 21.5 24.0 25.5 1

3 20.5 24.0 24.5 26.0 1

4 23.5 24.5 25.0 26.5 1

5 21.5 23.0 22.5 23.5 1

6 20.0 21.0 21.0 22.5 1

7 21.5 22.5 23.0 25.0 1

8 23.0 23.0 23.5 24.0 1

9 20.0 21.0 22.0 21.5 1

10 16.5 19.0 19.0 19.5 1

11 24.5 25.0 28.0 28.0 1

12 26.0 25.0 29.0 31.0 0

13 21.5 22.5 23.0 26.5 0

14 23.0 22.5 24.0 27.5 0

15 25.5 27.5 26.5 27.0 0

16 20.0 23.5 22.5 26.0 0

17 24.5 25.5 27.0 28.5 0

18 22.0 22.0 24.5 26.5 0

19 24.0 21.5 24.5 25.5 0

20 23.0 20.5 31.0 26.0 0

21 27.5 28.0 31.0 31.5 0

22 23.0 23.0 23.5 25.0 0

23 21.5 23.5 24.0 28.0 0

24 17.0 24.5 26.0 29.5 0

25 22.5 25.5 25.5 26.0 0

26 23.0 24.5 26.0 30.0 0

27 22.0 21.5 23.5 25.0 0

[1] Potthoff and Roy (1964), Biometrika, pages 313-326.
[2] Not sure what that is. Is it the same as the pterygomaxillary fissure? You can find pictures of that on the web.

[Figure 1.1: Mouth size growth curves; size in millimeters plotted against age for each kid.]

Figure 1.1 contains a plot of each kid's mouth size over time, where the girls are in red, and the boys are in blue. Generally, the sizes increase over time, and the boys tend to have larger mouths than the girls.

One approach to such growth curves is to assume that in the population, the growth is quadratic, where the coefficients are possibly different for girls and boys. That is, for x = 8, 10, 12, 14, the population average growth curves are, respectively,

f_Girls(x) = α0 + α1 x + α2 x² and f_Boys(x) = β0 + β1 x + β2 x². (1.1)


Now each individual's values are going to deviate from the average for that one's sex. The trick is to model these deviations. Generally, one would expect a person who is larger than average at age 8 will also be larger than average at age 10, etc., so one cannot assume all 27 × 4 deviations are independent. What kind of dependencies are there? Are the variances the same for the various ages? For the two sexes?

Some questions which can be addressed include

• Are cubic terms necessary, rather than just quadratic?

• Are the quadratic terms necessary (is α2 = β2 = 0, so that the curves are straight lines)?

• Are the girls’ and boys’ curves the same (are αj = βj for j = 0, 1, 2)?

• Are the girls' and boys' curves parallel (are α1 = β1 and α2 = β2, but maybe α0 ≠ β0)?


Part I

Distribution Theory


Chapter 2

Distributions and Densities

2.1 Probability models

Starting with the very general, suppose X is a random object. It could be a single variable, a vector, a matrix, or something more complicated, e.g., a function, infinite sequence, or image. The space of X is X, the set of possible values X can take on. Then a probability distribution on X, or on X, is a function P that assigns a value in [0, 1] to subsets of X. For "any" subset A ⊂ X, P[A] is the probability X ∈ A. It can also be written P[X ∈ A]. (The quotes on "any" are to point out that technically, only subsets in a "sigma field" of subsets of X are allowed. We will gloss over that restriction, not because it is unimportant, but because for our purposes we do not get into too much trouble doing so.)

In order for P to be a probability distribution, it has to satisfy some axioms:

1. P[X ] = 1.

2. If A1, A2, . . . are disjoint (Ai ∩ Aj = ∅ for i ≠ j), then P[∪_{i=1}^∞ Ai] = ∑_{i=1}^∞ P[Ai].

The second axiom is meant to cover finite unions as well as infinite ones. Using these axioms, along with the restriction that 0 ≤ P[A] ≤ 1, all the usual properties of probabilities can be derived. Some such follow.

• Complement. The complement of a set A is A^c = X − A, that is, everything that is not in A (but in X). Clearly, A and A^c are disjoint, and their union is everything:

A ∩ A^c = ∅, A ∪ A^c = X,

so

1 = P[X] = P[A ∪ A^c] = P[A] + P[A^c],

which means

P[A^c] = 1 − P[A].

That is, the probability that the object does not land in A is 1 minus the probability that it does land in A.

• P[∅] = 0, because the empty set is the complement of X, which has probability 1.


• Union of two (nondisjoint) sets. If A and B are not disjoint, then it is not necessarily true that P[A ∪ B] = P[A] + P[B]. But A ∪ B can be separated into two disjoint sets: the set A and the part of B not in A, which is [B ∩ A^c]. Then

P[A ∪ B] = P[A] + P[B ∩ A^c]. (2.1)

Now B = (B ∩ A) ∪ (B ∩ A^c), and (B ∩ A) and (B ∩ A^c) are disjoint, so

P[B] = P[B ∩ A] + P[B ∩ A^c] ⇒ P[B ∩ A^c] = P[B] − P[A ∩ B].

Then stick that formula into (2.1), so that

P[A ∪ B] = P[A] + P[B] − P[A ∩ B].

The above definition doesn't help much in specifying a probability distribution. In principle, one would have to give the probability of every possible subset, but luckily there are simplifications.

We will deal primarily with random variables and finite collections of random variables. A random variable has space X ⊂ R, the real line. A collection of p random variables has space X ⊂ R^p, the p-dimensional Euclidean space. The elements are usually arranged in some convenient way, such as in a vector (row or column), matrix, multidimensional array, or triangular array. Mostly, we will have them arranged as a column vector

X = (X1, X2, . . . , Xp)′, (2.2)

or a row vector. Some common ways to specify the probabilities of a collection of p random variables include

1. Distribution functions

2. Densities

3. Moment generating functions (or characteristic functions)

4. Representations

Distribution functions and characteristic functions always exist, moment generating functions do not. The densities we will deal with are only those with respect to Lebesgue measure, or counting measure — which means for us, densities do not always exist. By "representation" we mean the random variables are expressed as a function of some other known random variables.

2.2 Distribution functions

The distribution function for X is the function F : Rp → [0, 1] given by

F(x) = F(x1, . . . , xp) = P[X1 ≤ x1, . . . , Xp ≤ xp ]. (2.3)

Note that F is defined on all of R^p, not just the space X. In principle, given F, one can figure out the probability of all subsets A ⊂ X (although no one would try), which means F uniquely identifies P, and vice versa. If F is a continuous function, then X is termed continuous.

For a single random variable X, the distribution function is F(x) = P[X ≤ x] for x ∈ R. This function satisfies the following properties:

1. F(x) is nondecreasing in x.

2. limx→−∞ F(x) = 0.

3. limx→∞ F(x) = 1.

4. For any x, limy↓x F(y) = F(x).

The fourth property is that F is continuous from the right. It need not be continuous. For example, suppose X is the number of heads (i.e., 0 or 1) in one flip of a fair coin. Then X = {0, 1}, and P[X = 0] = P[X = 1] = 1/2. Then

F(x) = 0      if x < 0
       1/2    if 0 ≤ x < 1
       1      if x ≥ 1.        (2.4)

Now F(1) = 1, and if y ↑ 1, say y = 1 − 1/m for m = 1, 2, 3, . . ., then F(1 − 1/m) = 1/2, which approaches 1/2, not 1. On the other hand, if y ↓ 1, say y = 1 + 1/m, then F(1 + 1/m) = 1, which does approach 1.

A jump in F at a point x means that P[X = x] > 0; in fact, the probability is the height of the jump. If F is continuous at x, then P[X = x] = 0. Figure 2.1 shows a distribution function with jumps at 1 and 6, which means that the probability X equals either of those points is positive, the probability being the height of the gaps (which are 1/4 in this plot). Otherwise, the function is continuous, hence no other value has positive probability. Note also the flat part between 1 and 4, which means that P[1 < X ≤ 4] = 0.

Not only do all distribution functions for random variables satisfy those four properties, but any function F that satisfies those four is a legitimate distribution function.

Similar results hold for finite collections of random variables:

1. F(x1, . . . , xp) is nondecreasing in each xi, holding the others fixed.

2. limx→−∞ F(x, x, . . . , x) = 0.

3. limx→∞ F(x, x, . . . , x) = 1.

4. For any (x1, . . . , xp), lim_{y1↓x1, . . . , yp↓xp} F(y1, . . . , yp) = F(x1, . . . , xp).

2.3 Densities

2.3.1 Probability density functions

A density with respect to Lebesgue measure on R^p, which we simplify to "pdf" for "probability density function," is a function f : X → [0, ∞) such that for any subset A ⊂ X,

P[A] = ∫∫ · · · ∫_A f(x1, x2, . . . , xp) dx1 dx2 . . . dxp. (2.5)

Page 20: Mathematical  Statistics

12 Chapter 2. Distributions and Densities

[Figure 2.1: A distribution function, plotted as F(x) against x.]

If X has a pdf, then it is continuous. In fact, it is differentiable, f being the derivative of F:

f(x1, . . . , xp) = ∂/∂x1 · · · ∂/∂xp F(x1, . . . , xp). (2.6)

There are continuous distributions that do not have pdf's, but they are weird. Any pdf has to satisfy the two properties,

1. f(x1, . . . , xp) ≥ 0 for all (x1, . . . , xp) ∈ X;

2. ∫∫ · · · ∫_X f(x1, x2, . . . , xp) dx1 dx2 . . . dxp = 1.

It is also true that any function f satisfying those two conditions is a pdf of a legitimate probability distribution.

Here is a list of some famous univariate (so that p = 1 and X ⊂ R) distributions with pdf's:

Page 21: Mathematical  Statistics

2.3. Densities 13

Name                          Parameters               Space X    pdf f(x)
Normal: N(µ, σ²)              µ ∈ R, σ² > 0            R          (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²))
Uniform: U(a, b)              −∞ < a < b < ∞           (a, b)     1/(b − a)
Exponential: Exponential(λ)   λ > 0                    (0, ∞)     λ exp(−λx)
Gamma: Gamma(α, λ)            α > 0, λ > 0             (0, ∞)     (λ^α/Γ(α)) exp(−λx) x^(α−1)
Beta: Beta(α, β)              α > 0, β > 0             (0, 1)     (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1)
Cauchy                                                 R          (1/π) · 1/(1 + x²)

The Γ(α) is the Gamma function, presented later. There are many more, such as Student's t, chi-squared, F, . . . . The most famous multivariate distribution is the multivariate normal. We will look at that one in Chapter 8.

Example: A bivariate pdf

Suppose (X, Y) has space

W = (0, 1) × (0, 1) = {(x, y) | 0 < x < 1, 0 < y < 1} (2.7)

and pdf

f (x, y) = c(x + y). (2.8)

The constant c is whatever it needs to be so that the pdf integrates to 1, i.e.,

1 = c ∫_0^1 ∫_0^1 (x + y) dy dx = c ∫_0^1 (x + 1/2) dx = c(1/2 + 1/2) = c. (2.9)

So the pdf is simply f (x, y) = x + y. Some values of the distribution function are

F(0, 0) = 0,

F(1/2, 1/4) = ∫_0^{1/2} ∫_0^{1/4} (x + y) dy dx = ∫_0^{1/2} (x/4 + 1/32) dx = 1/32 + 1/64 = 3/64,

F(1/2, 2) = ∫_0^{1/2} ∫_0^1 (x + y) dy dx = ∫_0^{1/2} (x + 1/2) dx = 1/8 + 1/4 = 3/8,

F(2, 1) = 1. (2.10)


Other probabilities:

P[X + Y ≤ 1/2] = ∫_0^{1/2} ∫_0^{1/2 − x} (x + y) dy dx
               = ∫_0^{1/2} ( x(1/2 − x) + (1/2)(1/2 − x)² ) dx
               = ∫_0^{1/2} ( 1/8 − x²/2 ) dx
               = 1/16 − (1/2) · (1/2)³/3
               = 1/24, (2.11)

and for 0 < y < 1,

P[Y ≤ y] = ∫_0^1 ∫_0^y (x + w) dw dx = ∫_0^1 (xy + y²/2) dx = y/2 + y²/2 = (1/2) y(1 + y), (2.12)

which is the distribution function of Y, at least for 0 < y < 1.
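These integrals are easy to verify numerically. Here is a quick sketch using scipy's double integration (an outside check, not part of the text); the limits follow the regions used in (2.10) and (2.11).

```python
from scipy import integrate

f = lambda y, x: x + y              # the pdf; dblquad integrates over y (inner), then x (outer)

# F(1/2, 1/4) = P[X <= 1/2, Y <= 1/4]
F_half_quarter, _ = integrate.dblquad(f, 0, 0.5, lambda x: 0, lambda x: 0.25)
# P[X + Y <= 1/2], as in (2.11)
p_sum, _ = integrate.dblquad(f, 0, 0.5, lambda x: 0, lambda x: 0.5 - x)

print(F_half_quarter, 3 / 64)       # both 0.046875
print(p_sum, 1 / 24)                # both about 0.041667
```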

2.3.2 Probability mass functions

A discrete random variable is one for which X is a countable (which includes finite) set. Its probability can be given by its probability mass function, which we will call "pmf," f : X → [0, 1] given by

P(x1, . . . , xp) = f (x1, . . . , xp). (2.13)

The pmf gives the probabilities of the individual points.[1] The probability of any subset A is the sum of the probabilities of the individual points in A.

Some univariate discrete distributions:

[1] Measure-theoretically, the pmf is the density with respect to counting measure on X.


Name                          Parameters                        Space X               pmf f(x)
Binomial: Bin(n, p)           n positive integer, 0 ≤ p ≤ 1     0, 1, . . . , n       (n choose x) p^x (1 − p)^(n−x)
Poisson: Poisson(λ)           λ > 0                             0, 1, 2, . . .        exp(−λ) λ^x / x!
Discrete uniform: DU(a, b)    a and b integers, a < b           a, a + 1, . . . , b   1/(b − a + 1)
Geometric: Geometric(p)       0 < p < 1                         0, 1, 2, . . .        p(1 − p)^x

The distribution function of a discrete random variable is purely a jump function, that is, it is flat except for jumps of height f(x) at x for each x ∈ X.

The most famous multivariate discrete distribution is the multinomial, which we look at in Example 3.4.3.

2.4 Distributions without pdf’s or pmf’s

Distributions need not be either discrete or have a density with respect to Lebesgue measure. We present here some simple examples.


2.4.1 Example: Late start

Consider waiting for a train to leave. It will not leave early, but very well may leave late. There is a positive probability, say 10%, it will leave exactly on time. If it does not leave on time, there is a continuous distribution for how late it leaves. Thus it is not totally discrete, but not continuous, either, so it has neither a pdf nor pmf. It does have a distribution function, because everything does. A possible one is

F(x) = 0                            if x < 0
       1/10                         if x = 0
       1 − (9/10) exp(−x/100)       if x > 0,        (2.14)

where x is the number of minutes late. Here is a plot:

[Plot of F(x) in (2.14) against x, for x from −100 to 400 minutes.]

Is this a legitimate distribution function? It is easy to see it is nondecreasing, once one notes that 1 − (9/10) exp(−x/100) > 1/10 if x > 0. The limits are ok as x → ±∞. It is also continuous from the right, where the only tricky spot is lim_{x↓0} F(x), which goes to 1 − (9/10) exp(0) = 1/10 = F(0), so it checks.

One can then find the probabilities of various late times, e.g., it has no chance of leaving early, 10% chance of leaving exactly on time, F(60) = 1 − (9/10) exp(−60/100) ≈ 0.506 chance of being at most one hour late, F(300) = 1 − (9/10) exp(−300/100) ≈ 0.955 chance of being at most five hours late, etc. (Sort of like Amtrak.)
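Those values are easy to reproduce; here is a tiny sketch (not part of the text) of the distribution function (2.14):

```python
from math import exp

def F(x):
    """Distribution function (2.14): minutes late for the train."""
    if x < 0:
        return 0.0
    return 1 - (9 / 10) * exp(-x / 100)   # includes the jump of 1/10 at x = 0

print(F(0))     # 0.1   (the train leaves exactly on time with probability 1/10)
print(F(60))    # about 0.506
print(F(300))   # about 0.955
```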


2.4.2 Example: Spinner

Imagine a spinner whose pointer is one unit in length. It is spun so that it is equally likely to be pointing in any direction. The random quantity is the (x, y) location of the end of the pointer, so that X = {(x, y) ∈ R² | x² + y² = 1}, the circle with radius 1. The distribution of (X, Y) is not discrete, because it can land anywhere on the circle, but it does not have a density with respect to Lebesgue measure on R² because the integral over the circle is the volume above the circle under the pdf, that volume being 0.

But there is a distribution function. The F(x, y) is the arc length of the part of the circle that has x-coordinate less than or equal to x and y-coordinate less than or equal to y, divided by the total arc length (which is 2π):

F(x, y) = arc length({(u, v) | u² + v² = 1, u ≤ x, v ≤ y}) / (2π). (2.15)

Here is a sketch of F:

[Sketch of the surface F(x, y) over the (x, y) plane.]

But there is an easier way to describe the distribution. For any point (x, y) on the circle, one can find the angle with the x-axis of the line connecting (0, 0) and (x, y), θ = Angle(x, y), so that x = cos(θ) and y = sin(θ). For uniqueness' sake, take


θ ∈ [0, 2π). Then (x, y) being uniform on the circle implies that θ is uniform from 0 to 2π. Then the distribution of (X, Y) can be described via

(X, Y) = (cos(θ), sin(θ)), where θ ∼ Uniform[0, 2π). (2.16)

Such a description is called a representation, in that we are representing one set of random variables as a function of another set (which in this case is just the one θ).
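A quick simulation of the representation (2.16) (a sketch, not part of the text; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=100_000)   # theta ~ Uniform[0, 2*pi)
x, y = np.cos(theta), np.sin(theta)               # the representation (2.16)

print(np.allclose(x**2 + y**2, 1))                # every point lies on the unit circle
print(x.mean(), y.mean())                         # both near 0 (compare (3.10) later)
```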

2.4.3 Example: Random coin

Imagine now a two-stage process, where one first chooses a coin out of an infinite collection of coins, then flips the coin n = 10 times independently. The coins have different probabilities of heads x, so that over the population of coins, X ∼ Uniform(0, 1). Let Y be the number of heads among the n flips. Then what is random is (X, Y). This vector is neither discrete nor continuous: X is continuous and Y is discrete. The space is a union of 11 (= n + 1) line segments,

W = {(x, 0) | 0 < x < 1} ∪ {(x, 1) | 0 < x < 1} ∪ · · · ∪ {(x, 10) | 0 < x < 1}. (2.17)

Here is the graph of W:

[Graph of W: eleven horizontal segments, one at each y = 0, 1, . . . , 10, each extending over x ∈ (0, 1).]
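A simulation of the two-stage process (a sketch, not from the text; the Monte Carlo check of the marginal of Y anticipates the Beta-Binomial example of Section 7.4.1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
x = rng.uniform(0, 1, size=200_000)   # X ~ Uniform(0, 1): the chosen coin's heads probability
y = rng.binomial(n, x)                # Y | X = x ~ Binomial(10, x)

# Each (x, y) pair lands on one of the 11 segments of W.
print(np.unique(y))                   # 0, 1, ..., 10
print((y == 10).mean())               # roughly 1/11; the marginal of Y turns out to be uniform
```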


Chapter 3

Expected Values

3.1 Definition

The distribution function F contains all there is to know about the distribution of a random vector, but it is often difficult to take in all at once. Quantities that summarize aspects of the distribution are often helpful, many of which are expected values of functions of X. We define expected value in the pdf and pmf cases. There are many X's that have neither a pmf nor pdf, but even in those cases we can often find the expected value.

Definition 1. Expected Value. Suppose X has pdf f , and g : X → R. If

∫ · · · ∫_X |g(x1, . . . , xp)| f(x1, x2, . . . , xp) dx1 . . . dxp < ∞, (3.1)

then the expected value of g(X), E[g(X)], exists and

E[g(X)] = ∫ · · · ∫_X g(x1, . . . , xp) f(x1, x2, . . . , xp) dx1 . . . dxp. (3.2)

If X has pmf f , and

∑ · · · ∑_{(x1,...,xp)∈X} |g(x1, . . . , xp)| f(x1, x2, . . . , xp) < ∞, (3.3)

then the expected value of g(X), E[g(X)], exists and

E[g(X)] = ∑ · · · ∑_{(x1,...,xp)∈X} g(x1, . . . , xp) f(x1, x2, . . . , xp). (3.4)

The requirement (3.1) that the absolute value of the function must have a finite integral is there to eliminate ambiguous situations. For example, consider the Cauchy distribution with pdf f(x) = (1/π)(1/(1 + x²)) and space R, and take g(x) = x, so we wish to find E[X]. Consider

∫_{−∞}^{∞} |x| f(x) dx = ∫_{−∞}^{∞} (1/π) |x|/(1 + x²) dx = 2 ∫_0^{∞} (1/π) x/(1 + x²) dx. (3.5)


For large |x|, the integrand is on the order of 1/|x|, which does not have a finite integral. More precisely, it is not hard to show that

x/(1 + x²) > 1/(2x) for x > 1. (3.6)

Thus

∫_{−∞}^{∞} (1/π) |x|/(1 + x²) dx > 2 ∫_1^{∞} (1/π) (1/x) dx = (2/π) log(x) |_1^∞ = (2/π) log(∞) = ∞. (3.7)

Thus, according to the definition, we say that "the expected value of the Cauchy does not exist." By the symmetry of the density, it would be natural to expect the expected value to be 0. But, what we have is

E[X] = ∫_{−∞}^0 x f(x) dx + ∫_0^{∞} x f(x) dx = −∞ + ∞ = Undefined. (3.8)

That is, we cannot do the integral, so the expected value is not defined.

One could allow +∞ and −∞ to be legitimate values of the expected value, e.g., say that E[X²] = +∞ for the Cauchy, as long as the value is unambiguous. We are not allowing that possibility formally, but informally will on occasion act as though we do.
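The practical consequence shows up in simulation: running averages of Cauchy draws never settle down, unlike averages from a distribution with a finite mean. A sketch (not part of the text; sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_cauchy(1_000_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

# The running mean keeps jumping around instead of converging.
print(running_mean[[999, 9_999, 99_999, 999_999]])
```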

Expected values cohere in the proper way, that is, if Y is a random vector that is a function of X, say Y = h(X), then for a function g of y,

E[g(Y)] = E[g(h(X))], (3.9)

if the latter exists. This property helps in finding the expected values when representations are used. For example, in the spinner case (2.16),

E[X] = E[cos(θ)] = (1/(2π)) ∫_0^{2π} cos(θ) dθ = 0, (3.10)

where the first expected value has X as the random variable, for which we do not have a pdf, and the second expected value has θ as the random variable, for which we do have a pdf (the Uniform[0, 2π)).

An important feature of expected values is their linearity, which follows by the linearity of integrals and sums:

Lemma 1. For any random variables X, Y, and constant c,

E[cX] = cE[X] and E[X + Y] = E[X] + E[Y], (3.11)

if the expected values exist.

The lemma can be used to show more involved linearities, e.g.,

E[aX + bY + cZ + d] = aE[X] + bE[Y] + cE[Z] + d (3.12)

(since E[d] = d for a constant d), and

E[g(X) + h(X)] = E[g(X)] + E[h(X)]. (3.13)


Warning. Be aware that for non-linear functions, the expected value of a function is NOT the function of the expected value, i.e.,

E[g(X)] ≠ g(E[X]) (3.14)

unless g(x) is linear, or you are lucky. For example,

E[X²] ≠ (E[X])², (3.15)

unless X is a constant. (Which is fortunate, because otherwise all variances would be 0. See equation 3.20 below.)

3.1.1 Indicator functions

An indicator function is one which takes on only the values 0 and 1. It is usually given as I_A for a subset A ⊂ X, where A contains the values for which the function is 1:

I_A(x) = 1 if x ∈ A
         0 if x ∉ A.     (3.16)

These functions give an alternative expression for probabilities in terms of expected values:

E[I_A(X)] = 1 × P[X ∈ A] + 0 × P[X ∉ A] = P[A]. (3.17)

3.2 Means, Variances, and Covariances

Means, variances, and covariances are particular expected values. For a random variable, the mean is just its expected value:

The mean of X = E[X] (often denoted µ). (3.18)

(From now on, we will usually suppress the phrase "if it exists" when writing expected values, but think of it to yourself when reading "E.") The variance is the expected value of the deviation from the mean, squared:

The variance of X = Var[X] = E[(X − E[X])²] (often denoted σ²). (3.19)

The standard deviation is the square root of the variance. It is often a nicer quantity because it is in the same units as X, and measures the "typical" size of the deviation of X from its mean.

A very useful formula for finding variances is

Var[X] = E[X²] − (E[X])², (3.20)

which can be seen, letting µ = E[X], as follows:

E[(X − µ)²] = E[X² − 2Xµ + µ²] = E[X²] − 2E[X]µ + µ² = E[X²] − µ². (3.21)

With two random variables, (X, Y), say, there is in addition the covariance:

The covariance of X and Y = Cov[X, Y] = E[(X− E[X])(Y− E[Y])]. (3.22)


The covariance measures a type of relationship between X and Y. Notice that the expectand is positive when X and Y are both greater than or both less than their respective means, and negative if one is greater and one less. Thus if X and Y tend to go up or down together, the covariance will be positive, while if one goes up when the other goes down, the covariance will be negative. Note also that it is symmetric, Cov[X, Y] = Cov[Y, X], and Cov[X, X] = Var[X].

As for the variance in (3.20), we have the formula

Cov[X, Y] = E[XY]− E[X]E[Y]. (3.23)

The correlation coefficient is a normalization of the covariance, which is generally easier to interpret:

The correlation coefficient of X and Y = Corr[X, Y] = Cov[X, Y] / √(Var[X] Var[Y]). (3.24)

This is a unitless quantity that measures the linear relationship of X and Y. It is bounded by −1 and +1. To see this, we first need the following.

Lemma 2. Cauchy-Schwarz. For random variables (U, V),

E[UV]² ≤ E[U²]E[V²], (3.25)

with equality if and only if

U = 0 or V = βU with probability 1, (3.26)

for β = E[UV]/E[U²].

Here, the phrase "with probability 1" means P[U = 0] = 1 or P[V = βU] = 1. It is a technicality that you can usually ignore. It is there because for continuous variables, for example, you can always change a few points (which have 0 probability) without affecting the distribution.

Proof. The lemma is easy to see if U is always 0, because then E[UV] = E[U²] = 0. Suppose it is not, so that E[U²] > 0. Consider

E[(V − bU)²] = E[V² − 2bUV + b²U²] = E[V²] − 2bE[UV] + b²E[U²]. (3.27)

Because the expectand on the left is nonnegative, so is the expected value for any b. In particular, it is nonnegative for the b that minimizes the expected value, which is easy to find:

∂/∂b E[(V − bU)²] = −2E[UV] + 2bE[U²], (3.28)

and setting that to 0 yields b = β where

β = E[UV]/E[U²]. (3.29)

Then

E[V²] − 2βE[UV] + β²E[U²] = E[V²] − 2 (E[UV]/E[U²]) E[UV] + (E[UV]/E[U²])² E[U²]
                          = E[V²] − E[UV]²/E[U²]
                          ≥ 0, (3.30)

Page 31: Mathematical  Statistics

3.2. Means, Variances, and Covariances 23

from which (3.25) follows.

There is equality in (3.25) if and only if there is equality in (3.30), which means that

E[(V − βU)²] = 0. (3.31)

Because the expectand is nonnegative, its expected value can be 0 if and only if it is 0, i.e.,

(V − βU)² = 0 with probability 1. (3.32)

But that equation implies the second part of (3.26), proving the lemma. □

For variables (X, Y), apply the lemma with U = X− E[X] and V = Y− E[Y]:

E[(X − E[X])(Y − E[Y])]² ≤ E[(X − E[X])²] E[(Y − E[Y])²]
⟺ Cov[X, Y]² ≤ Var[X]Var[Y]. (3.33)

Thus from (3.24),

−1 ≤ Corr[X, Y] ≤ 1. (3.34)

Furthermore, if there is an equality in (3.33), then either X is a constant, or

Y − E[Y] = β(X − E[X]) ⇔ Y = α + βX, (3.35)

where

β = Cov[X, Y]/Var[X] and α = E[Y] − βE[X]. (3.36)

In this case,

Corr[X, Y] = +1 if β > 0
             −1 if β < 0.     (3.37)

Thus the correlation coefficient measures the linearity of the relationship between X and Y, +1 meaning perfectly positively linearly related, −1 meaning perfectly negatively linearly related.

3.2.1 Example

Suppose (X, Y) has pdf f(x, y) = 2 for (x, y) ∈ W = {(x, y) | 0 < x < y < 1}, which is the upper-left triangle of the unit square:


[Plot of the region W: the triangle above the line y = x in the unit square.]

One would expect the correlation to be positive, since the large y's tend to go with larger x's, but the correlation would not be +1, because the space is not simply a line segment. To find the correlation, we need to perform some integrals:

E[X] = ∫_0^1 ∫_0^y x · 2 dx dy = ∫_0^1 y² dy = 1/3,   E[Y] = ∫_0^1 ∫_0^y y · 2 dx dy = 2 ∫_0^1 y² dy = 2/3, (3.38)

E[X²] = ∫_0^1 ∫_0^y x² · 2 dx dy = ∫_0^1 (2y³/3) dy = 1/6,   E[Y²] = ∫_0^1 ∫_0^y y² · 2 dx dy = 2 ∫_0^1 y³ dy = 1/2, (3.39)

and

E[XY] = ∫_0^1 ∫_0^y xy · 2 dx dy = ∫_0^1 y³ dy = 1/4. (3.40)

Then

Var[X] = 1/6 − (1/3)² = 1/18,   Var[Y] = 1/2 − (2/3)² = 1/18,   Cov[X, Y] = 1/4 − (1/3)(2/3) = 1/36, (3.41)

and, finally,

Corr[X, Y] = (1/36)/√((1/18)(1/18)) = 1/2. (3.42)

This value does seem plausible: positive but not too close to 1.
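Those moments can be double-checked by simulation. Sampling uniformly on W by sorting two independent uniforms is my own device (not the book's); a sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(size=(500_000, 2))
x, y = u.min(axis=1), u.max(axis=1)   # (x, y) uniform on {0 < x < y < 1}, so pdf = 2

print(x.mean(), y.mean())             # about 1/3 and 2/3, as in (3.38)
print(np.corrcoef(x, y)[0, 1])        # about 0.5, as in (3.42)
```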

3.2.2 The variance of linear combinations & affine transformations

A linear combination of the variables, X1, . . . , Xp, is a function of the form

b1X1 + · · ·+ bpXp, (3.43)


for constants b1, . . . , bp. An affine transformation just adds a constant:

a + b1X1 + · · ·+ bpXp. (3.44)

Thus they are almost the same, and if you want to add the (constant) variable X0 ≡ 1, you can think of an affine transformation as a linear combination, as one does when setting up a linear regression model with intercept. Here we find formulas for the variance of an affine transformation.

Start with a + bX:

Var[a + bX] = E[(a + bX − E[a + bX])²]
            = E[(a + bX − a − bE[X])²]
            = E[b²(X − E[X])²]
            = b² E[(X − E[X])²]
            = b² Var[X]. (3.45)

The constant a goes away (it does not contribute to the variability), and the constant b is squared. For a linear combination of two variables, the variance involves the two variances, as well as the covariance:

Var[a + bX + cY] = E[((a + bX + cY) − E[a + bX + cY])²]
                 = E[(b(X − E[X]) + c(Y − E[Y]))²]
                 = b² E[(X − E[X])²] + 2bc E[(X − E[X])(Y − E[Y])] + c² E[(Y − E[Y])²]
                 = b² Var[X] + c² Var[Y] + 2bc Cov[X, Y]. (3.46)

With p variables, we have

Var[a + ∑_{i=1}^p b_i X_i] = ∑_{i=1}^p b_i² Var[X_i] + 2 ∑∑_{1≤i<j≤p} b_i b_j Cov[X_i, X_j]. (3.47)

This formula can be made simpler using matrix and vector notation, which we do in the next section.
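In that matrix form, (3.47) becomes Var[a + b′X] = b′Σb, where Σ is the covariance matrix of X. A small numerical illustration with made-up numbers (a sketch, not from the text):

```python
import numpy as np

# A made-up covariance matrix for (X1, X2, X3) and coefficients b1, b2, b3.
Sigma = np.array([[4.0, 1.0, 0.5],
                  [2.0, 2.0, 0.3][:1] + [2.0, 0.3],  # placeholder removed below
                  ][:1])
```

(Correction of the sketch; the intended, runnable version is:)

```python
import numpy as np

# A made-up covariance matrix for (X1, X2, X3) and coefficients b1, b2, b3.
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
b = np.array([2.0, -1.0, 3.0])

# Expand formula (3.47) term by term ...
direct = sum(b[i]**2 * Sigma[i, i] for i in range(3)) \
       + 2 * sum(b[i] * b[j] * Sigma[i, j] for i in range(3) for j in range(i + 1, 3))

# ... and compare with the matrix form b' Sigma b (the constant a drops out either way).
print(direct, b @ Sigma @ b)          # both 27.2
```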

3.3 Other summaries

Means, variances and covariances are often good summaries of distributions; others are mentioned below. These summaries typically do not characterize the distributions. For example, two random variables could have the same means and same variances, but still be quite different: a Binomial(16, 1/2) has mean 8 and variance 4, as does a Normal(8, 4). The distribution function contains all the information about a random variable, or a random sample, as does the probability mass function or probability density function, if such exists, as well as the moment generating function.

Some types of summaries:

• Centers. A measure of center is a number that is "typical" of the entire distribution. The mean and the median are the two most popular, but there are also trimmed means, the tri-mean, the mode, and others.


• Position – Quantiles. A positional measure is one that gives the value that is in a certain relation to the rest of the values. For example, the .25 quantile is the value such that the random variable is (approximately) below the value 25% of the time, and above it 75% of the time. The median is the 1/2 quantile. More precisely, for q ∈ [0, 1], a qth quantile is any value η_q such that

P[X ≤ η_q] ≥ q and P[X ≥ η_q] ≥ 1 − q. (3.48)

A particular quantile may not be unique. It is if the distribution function is strictly increasing at the value.

• Spreads. A measure of spread indicates how spread out the distribution is, that is, how much do the values vary from the center. Famous ones are the standard deviation (and variance); the interquartile range, which is the difference between the (3/4)th quantile and the (1/4)th quantile; the range, which is the maximum minus the minimum; and the MAD, median absolute deviation, E[|X − η|], where η is the median of X.

• Moments. The kth raw moment of a random variable is the expected value of its kth power, where k = 1, 2, . . . . The kth central moment is the expected value of the kth power of its deviation from the mean µ, at least for k > 1:

kth raw moment = µ′_k = E[X^k], k = 1, 2, . . .
kth central moment = µ_k = E[(X − µ)^k], k = 2, 3, . . . . (3.49)

Thus µ′_1 = µ = E[X], µ′_2 = E[X²], and µ_2 = σ² = Var[X] = µ′_2 − (µ′_1)². The first raw moment µ′_1 is µ, the mean. It is not hard, but a bit tedious, to figure out the kth central moment from the first k raw moments, and vice versa. It is not uncommon for given moments not to exist. In particular, if the kth moment does not exist, then neither does any higher moment.

The third central moment is generally a measure of skewness, where symmetric distributions have 0 skewness, and a tail heavier to the right than to the left would have a positive skewness. Usually it is normalized so that it is not dependent on the variance:

Skewness = κ_3 = µ_3/σ³. (3.50)

The fourth central moment is a measure of kurtosis. It, too, is normalized:

Kurtosis = κ_4 = µ_4/σ⁴ − 3. (3.51)

The Normal(0, 1) has µ_4 = 3, so that subtracting 3 in (3.51) means the kurtosis of a normal is 0. It is not particularly easy to figure out what kurtosis means in general, but for nice unimodal densities, it measures "boxiness." A negative kurtosis indicates a density more boxy than the Normal. The Uniform is boxy. A positive kurtosis indicates a pointy middle and heavy tails, such as the double exponential.

• Generating Functions. A generating function is a meta-summary. Typically, derivatives of the generating function yield certain quantities, e.g., the moment generating function generates moments, the cumulant generating function generates cumulants, etc.


3.4 The moment and cumulant generating functions

The moment generating function (mgf for short) of X is a function from R^p → [0, ∞] given by

M_X(t1, . . . , tp) = E[e^{t1 X1 + · · · + tp Xp}]. (3.52)

The mgf does not always exist, that is, often the integral or sum defining the expected value diverges. That is ok, as long as it is finite for t = (t1, . . . , tp) in a neighborhood of 0_p. If that property holds, then the mgf uniquely determines the distribution of X.

Theorem 1. Uniqueness of MGF. If for some ε > 0,

M_X(t) < ∞ and M_X(t) = M_Y(t) for all t such that ‖t‖ ≤ ε, (3.53)

then X and Y have the same distribution.

If one knows complex variables, the characteristic function is superior because it always exists. It is defined as φ_X(t) = E[exp(i(t1 X1 + · · · + tN XN))].

The uniqueness in Theorem 1 is the most useful property of mgf's, but they can also be useful in generating moments, as in (3.57). We next extend the result to vectors.

Lemma 3. Suppose X has mgf such that for some ε > 0,

M_X(t) < ∞ for all t such that ‖t‖ ≤ ε. (3.54)

Then for any nonnegative integers k1, . . . , kp,

E[X1^{k1} X2^{k2} · · · Xp^{kp}] = ∂^{k1 + · · · + kp} / (∂t1^{k1} · · · ∂tp^{kp}) M_X(t) |_{t=0_p}, (3.55)

which is finite.

The quantities E[X1^{k1} X2^{k2} · · · Xp^{kp}] are called mixed moments. The most common one is just E[X1 X2], used in finding the covariance. Notice that this lemma implies that all mixed moments are finite under the condition (3.54). This condition is actually quite strong.

Specializing to a random variable X, the mgf is

M_X(t) = E[e^{tX}]. (3.56)

If it exists for t in a neighborhood of 0, then all moments of X exist, and

∂^k/∂t^k M_X(t) |_{t=0} = E[X^k]. (3.57)

The cumulant generating function is the log of the moment generating function,

cX(t) = log(MX(t)). (3.58)

It generates the cumulants, which are defined by what the cumulant generating function generates, i.e., for a random variable, the kth cumulant is

γ_k = ∂^k/∂t^k c_X(t) |_{t=0}. (3.59)


Cumulants are slightly easier to work with than moments. The first four are

γ_1 = E[X] = µ′_1 = µ
γ_2 = Var[X] = µ_2 = σ²
γ_3 = E[(X − E[X])³] = µ_3
γ_4 = E[(X − E[X])⁴] − 3 Var[X]² = µ_4 − 3µ_2² = µ_4 − 3σ⁴. (3.60)

The skewness (3.50) and kurtosis (3.51) are then simple functions of the cumulants:

Skewness[X] = γ_3/σ³ and Kurtosis[X] = γ_4/σ⁴. (3.61)

3.4.1 Example: Normal

Suppose X ∼ N(µ, σ²) (with σ² > 0). Then its mgf is

M_X(t) = E[e^{tX}]
       = (1/(√(2π) σ)) ∫_{−∞}^{∞} e^{tx} e^{−(x − µ)²/(2σ²)} dx
       = (1/(√(2π) σ)) ∫_{−∞}^{∞} e^{tx − (1/(2σ²))(x² − 2µx + µ²)} dx
       = (1/(√(2π) σ)) ∫_{−∞}^{∞} e^{−(1/(2σ²))(x² − 2(µ + tσ²)x + µ²)} dx. (3.62)

In the exponent, complete the square with respect to the x:

x² − 2(µ + tσ²)x = (x − (µ + tσ²))² − (µ + tσ²)². (3.63)

Then

M_X(t) = e^{(1/(2σ²))((µ + tσ²)² − µ²)} ∫_{−∞}^{∞} (1/(√(2π) σ)) e^{−(1/(2σ²))(x − (µ + tσ²))²} dx. (3.64)

Notice that the integrand is the pdf of a N(µ + tσ², σ²), which means the integral is 1. Expanding the (µ + tσ²)² and simplifying a bit yields

M_X(t) = e^{tµ + σ²t²/2}, t ∈ R. (3.65)

The cumulant generating function is then a simple quadratic:

cX(t) = tµ +1

2σ2t2, (3.66)

and it is easy to see that

c′X(0) = µ, c′′(0) = σ2, c′′′(t) = 0. (3.67)

Thus the mean is µ and variance is σ2 (not surprisingly), and all other cumulants are0. In particular, the skewness and kurtosis are both 0.


3.4.2 Example: Gamma

The Gamma distribution has two parameters: α > 0 is the shape parameter, and λ > 0 is the scale parameter. Its space is X = (0, ∞), and its pdf is c x^{α−1} exp(−λx), where

\frac{1}{c} = \int_0^∞ x^{α−1} exp(−λx) dx
  = \int_0^∞ (u/λ)^{α−1} exp(−u) du/λ   (u = λx)
  = \frac{1}{λ^α} \int_0^∞ u^{α−1} e^{−u} du
  = \frac{1}{λ^α} Γ(α).   (3.68)

The gamma function is defined to be that very integral, i.e., for α > 0,

Γ(α) = \int_0^∞ u^{α−1} e^{−u} du.   (3.69)

It generally has no closed-form value, but there are some convenient facts:

Γ(α + 1) = α Γ(α) for α > 0;
Γ(n) = (n − 1)! for n = 1, 2, ...;
Γ(1/2) = \sqrt{π}.   (3.70)

The pdf for a Gamma(α, λ) is then

f(x | α, λ) = \frac{λ^α}{Γ(α)} x^{α−1} e^{−λx}, x ∈ (0, ∞).   (3.71)

If α = 1, then this distribution is called the Exponential(λ), with pdf

g(x | λ) = λ e^{−λx}, x ∈ X = (0, ∞).   (3.72)

The mgf is

M_X(t) = E[e^{tX}] = \frac{λ^α}{Γ(α)} \int_0^∞ e^{tx} x^{α−1} e^{−λx} dx
  = \frac{λ^α}{Γ(α)} \int_0^∞ x^{α−1} e^{−(λ−t)x} dx.   (3.73)

That integral needs (λ − t) > 0 to be finite, so we need t < λ, which means the mgf is finite for a neighborhood of zero, since λ > 0. Now the integral at the end of (3.73) is the same as that in (3.68), except that λ − t is in place of λ. Thus we have

E[e^{tX}] = \frac{λ^α}{Γ(α)} \frac{Γ(α)}{(λ − t)^α} = \left( \frac{λ}{λ − t} \right)^α, t < λ.   (3.74)


We will use the cumulant generating function c_X(t) = log(M_X(t)) to obtain the mean and variance, because it is slightly easier. Thus

c_X'(t) = \frac{\partial}{\partial t} α(log(λ) − log(λ − t)) = \frac{α}{λ − t} =⇒ E[X] = c_X'(0) = \frac{α}{λ},   (3.75)

and

c_X''(t) = \frac{\partial^2}{\partial t^2} α(log(λ) − log(λ − t)) = \frac{α}{(λ − t)^2} =⇒ Var[X] = c_X''(0) = \frac{α}{λ^2}.   (3.76)

In general, the kth cumulant (3.59) is

γ_k = (k − 1)! \frac{α}{λ^k},   (3.77)

and in particular

Skewness[X] = \frac{2α/λ^3}{α^{3/2}/λ^3} = \frac{2}{\sqrt{α}} and Kurtosis[X] = \frac{6α/λ^4}{α^2/λ^4} = \frac{6}{α}.   (3.78)

Thus the skewness and kurtosis depend on just the shape parameter α. Also, they are positive, but tend to 0 as α increases.
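A quick numerical check of (3.75)–(3.78), not from the text and assuming scipy is available. Note that scipy.stats.gamma is parameterized by the shape α and the scale 1/λ, and that its reported kurtosis is the excess kurtosis, matching (3.51).

    from scipy.stats import gamma

    alpha, lam = 3.0, 2.0
    mean, var, skew, kurt = gamma(a=alpha, scale=1/lam).stats(moments='mvsk')
    print(mean, var)    # alpha/lam = 1.5 and alpha/lam**2 = 0.75, as in (3.75)-(3.76)
    print(skew, kurt)   # 2/sqrt(alpha) ~ 1.155 and 6/alpha = 2.0, as in (3.78)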

3.4.3 Example: Binomial and Multinomial

The Binomial is a model for counting the number of successes in n trials, e.g., the number of heads in ten flips of a coin, where the trials are independent (formally defined in Section 4.3) and have the same probability p of success. As in Section 2.3.2,

X ∼ Binomial(n, p) =⇒ f_X(x) = \binom{n}{x} p^x (1 − p)^{n−x}, x ∈ X = {0, 1, ..., n}.   (3.79)

The fact that the pmf sums to 1 relies on the binomial theorem:

(a + b)^n = \sum_{x=0}^{n} \binom{n}{x} a^x b^{n−x},   (3.80)

with a = p and b = 1− p. This theorem also helps in finding the mgf:

M_X(t) = E[e^{tX}] = \sum_{x=0}^{n} e^{tx} f_X(x)
  = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1 − p)^{n−x}
  = \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1 − p)^{n−x}
  = (pe^t + 1 − p)^n.   (3.81)


It is finite for all t ∈ R, as is the case for any bounded random variable. The first two moments are

E[X] = M_X'(0) = n(pe^t + 1 − p)^{n−1} pe^t |_{t=0} = np,   (3.82)

and

E[X^2] = M_X''(0)
  = [n(n − 1)(pe^t + 1 − p)^{n−2}(pe^t)^2 + n(pe^t + 1 − p)^{n−1} pe^t] |_{t=0}
  = n(n − 1)p^2 + np.   (3.83)

Thus

Var[X] = E[X^2] − E[X]^2 = n(n − 1)p^2 + np − (np)^2 = np − np^2 = np(1 − p).   (3.84)
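A sketch (not from the text, assuming sympy): (3.82)–(3.84) follow from differentiating the Binomial mgf (3.81) symbolically.

    import sympy as sp

    t, n, p = sp.symbols('t n p', positive=True)
    M = (p*sp.exp(t) + 1 - p)**n           # Binomial(n, p) mgf, (3.81)

    EX = sp.diff(M, t).subs(t, 0)           # first moment, (3.82)
    EX2 = sp.diff(M, t, 2).subs(t, 0)       # second moment, (3.83)
    var = sp.simplify(EX2 - EX**2)          # variance, (3.84)
    print(sp.simplify(EX))   # n*p
    print(var)               # n*p*(1 - p), up to how sympy arranges the factors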

(In Section 5.3.4 we will exhibit an easier approach.)

The Multinomial distribution also models the results of n trials, but here there are K possible categories for each trial. E.g., one may roll a die n times, and see whether it is a one, two, ..., or six (so K = 6); or one may randomly choose n people, each of whom is then classified as short, medium, or tall (so K = 3). As for the Binomial, the trials are assumed independent, and the probability of an individual trial coming up in category k is p_k, so that p_1 + ··· + p_K = 1. The random vector is X = (X_1, ..., X_K)′, where X_k is the number of observations from category k. Letting p = (p_1, ..., p_K)′, we have

X ∼ Multinomial(n, p) =⇒ f_X(x) = \binom{n}{x} p_1^{x_1} ··· p_K^{x_K}, x ∈ X,   (3.85)

where the space consists of all possible ways K nonnegative integers can sum to n:

X = {x ∈ R^K | x_k ∈ {0, ..., n} for each k, and x_1 + ··· + x_K = n},   (3.86)

and for x ∈ X,

\binom{n}{x} = \frac{n!}{x_1! ··· x_K!}.   (3.87)

This pmf is related to the multinomial theorem:

(a_1 + ··· + a_K)^n = \sum_{x ∈ X} \binom{n}{x} a_1^{x_1} ··· a_K^{x_K}.   (3.88)

Note that the Binomial is basically a special case of the Multinomial with K = 2:

X ∼ Binomial(n, p) =⇒ (X, n − X)′ ∼ Multinomial(n, (p, 1 − p)′).   (3.89)


Now for the mgf. It is a function of t = (t_1, ..., t_K):

M_X(t) = E[e^{t′X}] = \sum_{x ∈ X} e^{t′x} f_X(x)
  = \sum_{x ∈ X} e^{t_1 x_1} ··· e^{t_K x_K} \binom{n}{x} p_1^{x_1} ··· p_K^{x_K}
  = \sum_{x ∈ X} \binom{n}{x} (p_1 e^{t_1})^{x_1} ··· (p_K e^{t_K})^{x_K}
  = (p_1 e^{t_1} + ··· + p_K e^{t_K})^n < ∞ for all t ∈ R^K.   (3.90)

The mean and variance of each X_k can be found much as for the Binomial. We find that

E[X_k] = np_k and Var[X_k] = np_k(1 − p_k).   (3.91)

For the covariance between X_1 and X_2, we first find

E[X_1 X_2] = \frac{\partial^2}{\partial t_1 \partial t_2} M_X(t) |_{t=0_K}
  = n(n − 1)(p_1 e^{t_1} + ··· + p_K e^{t_K})^{n−2} p_1 e^{t_1} p_2 e^{t_2} |_{t=0_K}
  = n(n − 1) p_1 p_2.   (3.92)

Thus

Cov[X_1, X_2] = n(n − 1) p_1 p_2 − (np_1)(np_2) = −n p_1 p_2.   (3.93)

Similarly, Cov[X_k, X_l] = −n p_k p_l if k ≠ l. It does make sense for the covariance to be negative, since the more there are in category 1, the fewer are available for category 2.
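A simulation sketch (not from the text, assuming numpy): check Cov[X_1, X_2] = −n p_1 p_2 from (3.93) using numpy's multinomial sampler. The parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, np.array([0.5, 0.3, 0.2])
    X = rng.multinomial(n, p, size=200_000)     # each row is one multinomial draw

    print(X.mean(axis=0))            # approximately n*p_k, as in (3.91)
    print(np.cov(X[:, 0], X[:, 1]))  # off-diagonal approximately -n*p1*p2 = -3.0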

3.5 Vectors and matrices

The mean of a vector or matrix of random variables is the corresponding vector or matrix of means. That is, if X is a p × 1 vector, X = (X_1, ..., X_p)′, then

E[X] = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_p] \end{pmatrix}.   (3.94)

Similarly, if X is an n × p matrix, then so is its mean:

E[X] = E \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix} = \begin{pmatrix} E(X_{11}) & E(X_{12}) & \cdots & E(X_{1p}) \\ E(X_{21}) & E(X_{22}) & \cdots & E(X_{2p}) \\ \vdots & \vdots & \ddots & \vdots \\ E(X_{n1}) & E(X_{n2}) & \cdots & E(X_{np}) \end{pmatrix}.   (3.95)

The linearity in Lemma 1 holds for vectors and matrices as well. That is, for fixed q × p matrix B and q × 1 vector a,

E[a + BX] = a + B E[X],   (3.96)


and for matrices A (m × q), B (m × n) and C (p × q),

A + E[BXC] = A + B E[X] C.   (3.97)

These formulas can be proved by writing out the individual elements, and noting that each is a linear combination of the random variables. (The functions a + BX and A + BXC are also called affine transformations, of X and X, respectively.)

A vector X yields p variances, the Var[X_i]'s, but also \binom{p}{2} covariances, the Cov[X_i, X_j]'s. These are usually conveniently arranged in a p × p matrix, the covariance matrix:

Σ = Cov(X) = \begin{pmatrix} Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_p) \\ Cov(X_2, X_1) & Var(X_2) & \cdots & Cov(X_2, X_p) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_p, X_1) & Cov(X_p, X_2) & \cdots & Var(X_p) \end{pmatrix}.   (3.98)

This matrix is symmetric, i.e., Σ′ = Σ, where the prime means transpose. (The covariance matrix of a matrix X of random variables is typically defined by first changing the matrix X into a long vector, then defining Cov[X] to be the covariance matrix of that long vector.) A compact way to define the covariance is

Cov[X] = E[(X − E[X])(X − E[X])′].   (3.99)

For example, if X ∼ Multinomial(n, p) as in (3.85), then (3.91) and (3.92) show that

E[X] = np and Cov[X] = n \begin{pmatrix} p_1 & 0 & \cdots & 0 \\ 0 & p_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_K \end{pmatrix} − n p p′.   (3.100)

A convenient, and important to remember, formula for the covariance of an affine transformation follows.

Lemma 4. For fixed a and B,

Cov[a + BX] = B Cov[X] B′.   (3.101)

Note that this lemma is a matrix version of (3.45).

Proof.

Cov[a + BX] = E[(a + BX − E[a + BX])(a + BX − E[a + BX])′]   by (3.99)
  = E[B(X − E[X])(B(X − E[X]))′]
  = E[B(X − E[X])(X − E[X])′B′]
  = B E[(X − E[X])(X − E[X])′] B′   by (3.97)
  = B Cov[X] B′   again by (3.99). □   (3.102)

This lemma leads to a simple formula for the variance of a + b_1 X_1 + ··· + b_p X_p:

Var[a + b_1 X_1 + ··· + b_p X_p] = b′ Cov[X] b,   (3.103)

because we can write a + b_1 X_1 + ··· + b_p X_p = a + b′X (so a = a and B = b′ in (3.101)). Compare this formula to (3.47).
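A sketch (not from the text, assuming numpy): check Lemma 4, Cov[a + BX] = B Cov[X] B′, by simulation, with an arbitrary made-up covariance matrix and transformation.

    import numpy as np

    rng = np.random.default_rng(1)
    Sigma = np.array([[2.0, 0.5, 0.0],
                      [0.5, 1.0, 0.3],
                      [0.0, 0.3, 1.5]])       # Cov[X], chosen for illustration
    B = np.array([[1.0, -1.0, 0.0],
                  [0.5,  0.5, 2.0]])
    a = np.array([3.0, -2.0])

    X = rng.multivariate_normal(np.zeros(3), Sigma, size=500_000)
    Y = a + X @ B.T                            # rows are a + B x

    print(np.cov(Y, rowvar=False))             # approximately B Sigma B'
    print(B @ Sigma @ B.T)                     # exact, from (3.101)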


Chapter 4

Marginal Distributions and Independence

4.1 Marginal distributions

Given the distribution of a vector of random variables, it is possible in principle to find the distribution of any individual component of the vector, or any subset of components. To illustrate, consider the distribution of the scores (Assignments, Exams) for a statistics class, where each variable has values “Lo" and “Hi":

Assignments ↓      Exams
                Lo       Hi       Sum
Lo              0.3365   0.1682   0.5047
Hi              0.1682   0.3271   0.4953
Sum             0.5047   0.4953   1
   (4.1)

Thus about 33% of the students did low on both assignments and exams, and about 33% did high on both. But notice it is also easy to figure out the percentages of people who did low or high on the individual scores, e.g.,

P[Assignments = Lo] = 0.5047 and (hence) P[Assignments = Hi] = 0.4953.   (4.2)

Coincidentally, the exam scores have the same distribution. These numbers are in the margins of the table (4.1), hence the distribution of Assignments alone, and of Exams alone, are called marginal distributions. The distribution of (Assignments, Exams) together is called the joint distribution.

More generally, given the joint distribution of (the big vector) (X, Y), one can find the marginal distribution of the vector X, and the marginal distribution of the vector Y. (We don't have to take consecutive components of the vector, e.g., given (X_1, X_2, ..., X_5), we could be interested in the marginal distribution of (X_1, X_3, X_4), say.)

Actually, the words joint and marginal can be dropped. The joint distribution of (X, Y) is just the distribution of (X, Y); the marginal distribution of X is just the distribution of X, and the same for Y. These extra terms can be helpful, though, when dealing with different types of distributions in the same breath.


Before showing how to find the marginal distributions from the joint, we should deal with the spaces. Let W be the joint space of (X, Y), and X and Y be the marginal spaces of X and Y, respectively. Then

X = {x | (x, y) ∈ W for some y} and Y = {y | (x, y) ∈ W for some x}.   (4.3)

In Example 3.2.1, the joint space W is the upper left triangle of the unit square, hence the marginal spaces X and Y are both (0, 1).

There are various approaches to finding the marginal distributions from the joint. First, suppose F(x, y) is the distribution function for (X, Y) jointly, and F_X(x) is that for X marginally. Then (assuming X is p × 1 and Y is q × 1),

F_X(x) = P[X_1 ≤ x_1, ..., X_p ≤ x_p]
  = P[X_1 ≤ x_1, ..., X_p ≤ x_p, Y_1 ≤ ∞, ..., Y_q ≤ ∞]
  = F(x_1, ..., x_p, ∞, ..., ∞).   (4.4)

That is, you put ∞ in for the variables you are not interested in, because they are certainly less than infinity.

The mgf is equally easy. Suppose M(t, s) is the mgf for (X, Y) jointly, so that

M(t, s) = E[e^{t′X + s′Y}].   (4.5)

To eliminate the dependence on Y, we now set s to zero, that is, the mgf of X alone is

M_X(t) = E[e^{t′X}] = E[e^{t′X + 0_q′Y}] = M(t, 0_q).   (4.6)

4.1.1 Example: Marginals of a multinomial

Given X ∼ Multinomial(n, p) as in (3.85), one may wish to find the marginal distribution of a single component, e.g., X_1. It should be Binomial, because now for each trial a success is that the observation is in the first category. To show this fact, we find the mgf of X_1 by setting t_2 = ··· = t_K = 0 in (3.90):

M_{X_1}(t) = M_X((t, 0, ..., 0)′) = (p_1 e^t + p_2 + ··· + p_K)^n = (p_1 e^t + 1 − p_1)^n,   (4.7)

which is indeed the mgf of a Binomial as in (3.81). Specifically,

X_1 ∼ Binomial(n, p_1).   (4.8)

4.2 Marginal densities

More challenging, but also more useful, is to find the marginal density from the joint density, assuming it exists. Suppose the joint distribution of the two random variables, (X, Y), has pmf f(x, y), and space W. Then X has a pmf, f_X(x), as well. To find it in terms of f, write

f_X(x) = P[X = x (and Y can be anything)]
  = \sum_{{y | (x,y) ∈ W}} P[X = x, Y = y]
  = \sum_{{y | (x,y) ∈ W}} f(x, y).   (4.9)


That is, you add up all the f(x, y) for that value of x, as in the table (4.1). The same procedure works if X and Y are vectors. The set of y's we are summing over we will call the conditional space of Y given X = x, and denote it Y_x:

Y_x = {y ∈ Y | (x, y) ∈ W}.   (4.10)

For Example 3.2.1, for any x ∈ (0, 1), y ranges from x to 1, hence

Y_x = (x, 1).   (4.11)

In the coin Example 2.4.3, for any probability of heads x, the range of Y is the same, so that Y_x = {0, 1, ..., n} for any x ∈ (0, 1).

Thus in the general discrete case, we have

f_X(x) = \sum_{y ∈ Y_x} f(x, y), x ∈ X.   (4.12)

4.2.1 Example: Ranks

The National Opinion Research Center Amalgam Survey of 1972 asked people to rank three types of areas in which to live: City (over 50,000), Suburb (within 30 miles of a City), and Country (everywhere else). The table below shows the results,1 with respondents categorized by their current residence.

1 Duncan, O. D. and Brody, C. (1982). Analyzing n rankings of three items. In Hauser, R. M., Mechanic, D., Haller, A. O. and Hauser, T. S., eds., Social Structure and Behavior, 269–310. Academic Press: New York.

Ranking                              Residence
(City, Suburb, Country)   City   Suburb   Country   Total
(1, 2, 3)                  210      22        10      242
(1, 3, 2)                   23       4         1       28
(2, 1, 3)                  111      45        14      170
(2, 3, 1)                    8       4         0       12
(3, 1, 2)                  204     299       125      628
(3, 2, 1)                   81     126       152      359
Total                      637     500       302     1439
   (4.13)

That is, a ranking of (1, 2, 3) means that person ranks living in the city best, suburbs next, and country last. There were 242 people in the sample with that ranking, 210 of whom live in the city (so they should be happy), 22 of whom live in the suburbs, and just 10 of whom live in the country.

The random vector here is (X, Y, Z), say, where X represents the rank of City, Y that of Suburb, and Z that of Country. The space consists of the six permutations of 1, 2 and 3:

W = {(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)},   (4.14)

as in the first column of the table. Suppose the total column is our population, so that there are 1439 people all together, and we randomly choose a person from this


population. Then the (joint) distribution of the person's ranking (X, Y, Z) is given by

f(x, y, z) = P[(X, Y, Z) = (x, y, z)] =
  242/1439 if (x, y, z) = (1, 2, 3)
   28/1439 if (x, y, z) = (1, 3, 2)
  170/1439 if (x, y, z) = (2, 1, 3)
   12/1439 if (x, y, z) = (2, 3, 1)
  628/1439 if (x, y, z) = (3, 1, 2)
  359/1439 if (x, y, z) = (3, 2, 1).   (4.15)

This distribution could use some summarizing, e.g., what are the marginal distributions of X, Y, and Z? For each ranking x = 1, 2, 3, we have to add over the possible rankings of Y and Z, so that

f_X(1) = f(1, 2, 3) + f(1, 3, 2) = (242 + 28)/1439 = 0.1876;
f_X(2) = f(2, 1, 3) + f(2, 3, 1) = (170 + 12)/1439 = 0.1265;
f_X(3) = f(3, 1, 2) + f(3, 2, 1) = (628 + 359)/1439 = 0.6859.   (4.16)

Thus City is ranked third over 2/3 of the time. The marginal rankings of Suburb and Country can be obtained similarly. □
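A sketch (not from the text, assuming numpy): the marginal pmf's of X, Y, and Z can be read off the counts in table (4.13) by summing over the other coordinates, as in (4.16).

    import numpy as np

    rankings = [(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
    counts = np.array([242, 28, 170, 12, 628, 359])
    f = counts / counts.sum()                 # joint pmf (4.15)

    for name, coord in zip(("City", "Suburb", "Country"), range(3)):
        marginal = {r: f[[i for i, w in enumerate(rankings) if w[coord] == r]].sum()
                    for r in (1, 2, 3)}
        print(name, {r: round(p, 4) for r, p in marginal.items()})
    # City gives {1: 0.1876, 2: 0.1265, 3: 0.6859}, matching (4.16)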

Again for two variables, suppose now that the pdf is f(x, y). We know that the distribution function is related to the pdf via

F(x, y) = \int_{(−∞,x) ∩ X} \int_{(−∞,y) ∩ Y_u} f(u, v) dv du.   (4.17)

From (4.4), to obtain the distribution function, we set y = ∞, which means in the inside integral, we can remove the “(−∞, y)" part:

F_X(x) = \int_{(−∞,x) ∩ X} \int_{Y_u} f(u, v) dv du.   (4.18)

Then the pdf of X is found by taking the derivative with respect to x for x ∈ X, which here just means stripping away the outer integral and setting u = x (and v = y, if we wish):

f_X(x) = \frac{\partial}{\partial x} F_X(x) = \int_{Y_x} f(x, y) dy.   (4.19)

Thus instead of summing over the y as in (4.12), we integrate. This procedure is often called “integrating out y".

Turning to Example 3.2.1 again, we have f(x, y) = 2 for (x, y) ∈ W. From (4.11), for x ∈ (0, 1),

f_X(x) = \int_{Y_x} f(x, y) dy = \int_x^1 2 dy = 2(1 − x).   (4.20)

With vectors, the process is the same, just add lines under the symbols:

f_X(x) = \int_{Y_x} f(x, y) dy.   (4.21)


4.3 Independence

Much of statistics is geared towards evaluation of relationships between variables: Does smoking cause cancer? Do cell phones? What factors explain the rise in asthma? The absence of a relationship, independence, is also important.

Definition 2. Suppose (X, Y) has joint distribution P, and marginal spaces X and Y, respectively. Then X and Y are independent if

P[X ∈ A and Y ∈ B] = P[X ∈ A] × P[Y ∈ B] for all A ⊂ X and B ⊂ Y.   (4.22)

Also, if (X^{(1)}, ..., X^{(K)}) has distribution P, and the vector X^{(k)} has space X_k, then X^{(1)}, ..., X^{(K)} are (mutually) independent if

P[X^{(1)} ∈ A_1, ..., X^{(K)} ∈ A_K] = P[X^{(1)} ∈ A_1] × ··· × P[X^{(K)} ∈ A_K]
for all A_1 ⊂ X_1, ..., A_K ⊂ X_K.   (4.23)

The basic idea in independence is that what happens with one variable does not affect what happens with another. There are a number of useful equivalences for independence of X and Y. (Those for mutual independence of K vectors hold similarly.)

• Distribution functions. X and Y are independent if and only if

F(x, y) = F_X(x) × F_Y(y) for all x ∈ R^p, y ∈ R^q;   (4.24)

• Products of functions. X and Y are independent if and only if

E[g(X)h(Y)] = E[g(X)] × E[h(Y)]   (4.25)

for all functions g : X → R and h : Y → R whose means exist;

• MGF's. Suppose the marginal mgf's of X and Y are finite for t and s in neighborhoods of zero (respectively in R^p and R^q). Then X and Y are independent if and only if

M(t, s) = M_X(t) M_Y(s)   (4.26)

for all (t, s) in a neighborhood of zero in R^{p+q}.

Remark. The second item can be used to show that independent random variables are uncorrelated, because as in (16.123), Cov[X, Y] = E[XY] − E[X]E[Y], and (4.25) shows that E[XY] = E[X]E[Y] if X and Y are independent. Be aware that the implication does not go the other way, that is, X and Y can have correlation 0 and still not be independent. For example, suppose W = {(0, 1), (0, −1), (1, 0), (−1, 0)}, and P[(X, Y) = (x, y)] = 1/4 for each (x, y) ∈ W. Then it is not hard to show that E[X] = E[Y] = 0, and that E[XY] = 0 (in fact, XY = 0 always), hence Cov[X, Y] = 0. But X and Y are not independent, e.g., take A = {0} and B = {0}. Then

P[X ∈ A and Y ∈ B] = P[X = 0 and Y = 0] = 0 ≠ P[X = 0] P[Y = 0] = \frac{1}{2} × \frac{1}{2}. □   (4.27)
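A small sketch (not from the text): enumerating the four-point distribution in the Remark confirms that Cov[X, Y] = 0 even though X and Y are dependent.

    points = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    prob = 1 / 4

    EX = sum(prob * x for x, y in points)
    EY = sum(prob * y for x, y in points)
    EXY = sum(prob * x * y for x, y in points)
    print(EXY - EX * EY)                             # 0.0, so uncorrelated

    p_joint = sum(prob for x, y in points if x == 0 and y == 0)   # P[X=0, Y=0]
    p_x0 = sum(prob for x, y in points if x == 0)                 # P[X=0]
    p_y0 = sum(prob for x, y in points if y == 0)                 # P[Y=0]
    print(p_joint, p_x0 * p_y0)                      # 0.0 vs 0.25: not independent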


4.3.1 Example: Independent exponentials

Suppose U and V are independent Exponential(1)'s. The mgf of an Exponential(1) is 1/(1 − t) for t < 1. See (3.74), which gives the mgf of Gamma(α, λ) as (λ/(λ − t))^α for t < λ. Thus the mgf of (U, V) is

M_{(U,V)}(t_1, t_2) = M_U(t_1) M_V(t_2) = \frac{1}{1 − t_1} \frac{1}{1 − t_2}, t_1 < 1, t_2 < 1.   (4.28)

Now let

X = U + V and Y = U − V.   (4.29)

Are X and Y independent? What are their marginal distributions? We can start by looking at the mgf:

M_{(X,Y)}(s_1, s_2) = E[e^{s_1 X + s_2 Y}]
  = E[e^{s_1(U+V) + s_2(U−V)}]
  = E[e^{(s_1+s_2)U + (s_1−s_2)V}]
  = M_{(U,V)}(s_1 + s_2, s_1 − s_2)
  = \frac{1}{1 − s_1 − s_2} \frac{1}{1 − s_1 + s_2}.   (4.30)

This mgf is finite if s_1 + s_2 < 1 and s_1 − s_2 < 1, which is a neighborhood of (0, 0). If X and Y are independent, then this mgf must factor into the two individual mgf's. It does not appear to factor. More formally, the marginal mgf's are

M_X(s_1) = M_{(X,Y)}(s_1, 0) = \frac{1}{(1 − s_1)^2},   (4.31)

and

M_Y(s_2) = M_{(X,Y)}(0, s_2) = \frac{1}{(1 − s_2)(1 + s_2)} = \frac{1}{1 − s_2^2}.   (4.32)

Note that the first one is that of a Gamma(2, 1), so that X ∼ Gamma(2, 1). The one for Y may not be recognizable, but it turns out to be the mgf of a double exponential. But

M_{(X,Y)}(s_1, s_2) ≠ M_X(s_1) M_Y(s_2),   (4.33)

hence X and Y are not independent. They are uncorrelated, however:

Cov[X, Y] = Cov[U + V, U − V] = Var[U] − Var[V] − Cov[U, V] + Cov[V, U] = 1 − 1 = 0.   (4.34)

4.3.2 Spaces and densities

Suppose (X, Y) is discrete, with pmf f(x, y) > 0 for (x, y) ∈ W, and f_X and f_Y are the marginal pmf's of X and Y, respectively. Then applying (4.22) to the singleton sets {x} and {y} for x ∈ X and y ∈ Y shows that

P[X = x, Y = y] = P[X = x] × P[Y = y],   (4.35)


which translates to

f(x, y) = f_X(x) f_Y(y).   (4.36)

In particular, this equation shows that if x ∈ X and y ∈ Y, then f(x, y) > 0, hence (x, y) ∈ W. That is, if X and Y are independent, then

W = X × Y,   (4.37)

the “rectangle" created from the marginal spaces. Formally, given sets A ⊂ R^p and B ⊂ R^q, the rectangle A × B is defined to be

A × B = {(x, y) | x ∈ A and y ∈ B} ⊂ R^{p+q}.   (4.38)

The set may not be a rectangle in the usual sense, although it will be if p = q = 1 and A and B are both intervals. Figure 4.1 has some examples. Of course, R^{p+q} is a rectangle itself, being R^p × R^q. The result (4.37) holds in general.

Lemma 5. If X and Y are independent, then the spaces can be taken so that (4.37) holds.

This lemma implies that if the joint space is not a rectangle, then X and Y are not independent. Consider Example 3.2.1, where W = {(x, y) | 0 < x < y < 1}, a triangle. If we take a square below that triangle, such as (0.8, 1) × (0, 0.2), then

P[(X, Y) ∈ (0.8, 1) × (0, 0.2)] = 0 but P[X ∈ (0.8, 1)] P[Y ∈ (0, 0.2)] > 0,   (4.39)

so that X and Y are not independent. The result extends to more variables. Consider Example 4.2.1 on ranks. Here, the marginal spaces are

X = Y = Z = {1, 2, 3},   (4.40)

but

W ≠ X × Y × Z = {(1, 1, 1), (1, 1, 2), (1, 1, 3), ..., (3, 3, 3)},   (4.41)

in particular because W has only 6 elements, while the product space has 3^3 = 27. The factorization in (4.35) is necessary and sufficient for independence in the discrete and continuous cases.

Lemma 6. Suppose X has marginal density f_X(x), and Y has marginal density f_Y(y), and they are both pdf's or both pmf's. Then X and Y are independent if and only if the distribution of (X, Y) is given by density

f(x, y) = f_X(x) f_Y(y),   (4.42)

and space

W = X × Y.   (4.43)

We can simplify a little, that is, as long as the space and joint pdf factor, we have independence.

Lemma 7. Suppose (X, Y) has joint density f(x, y). Then X and Y are independent if and only if the density can be written as

f(x, y) = g(x) h(y)   (4.44)

for some functions g and h, and

W = X × Y.   (4.45)

This lemma is not presuming the g and h are actual densities, although they certainly could be.


[Figure 4.1: Some rectangles. The four panels show the rectangles (0, 1) × (0, 2), ((0, 1) ∪ (2, 3)) × ((0, 1) ∪ (2, 3)), {0, 1, 2} × {1, 2, 3, 4}, and {0, 1, 2} × (2, 4).]


4.3.3 IID

A special case of independence has the vectors with the exact same distribution, as well as being independent, that is, X^{(1)}, ..., X^{(K)} are independent, and all have the same marginal distribution. We say the vectors are iid, meaning “independent and identically distributed." This type of distribution often is used to model random samples, where n individuals are chosen from a (virtually) infinite population, and p variables are recorded on each. Then K = n and the X^{(i)}'s are p × 1 vectors. If the marginal density is f_X for each X^{(i)}, with marginal space X_0, then the joint density of the entire sample is

f(x^{(1)}, ..., x^{(n)}) = f_X(x^{(1)}) ··· f_X(x^{(n)}),   (4.46)

with space

X = X_0 × ··· × X_0 = (X_0)^n.   (4.47)


Chapter 5

Transformations I

A major task of mathematical statistics is finding, or approximating, the distributions of random variables that are functions of other random variables. For example, given X_1, ..., X_n iid observations, find the distribution (exactly or approximately) of the sample mean,

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.   (5.1)

This chapter will address finding the exact distribution; Chapters 15 and 16 consider large-sample approximations. There are many approaches, and which one to use may not always be obvious. The following sections run through a number of possibilities.

5.1 Distribution functions

Suppose the p × 1 vector X has distribution function F_X, and Y = g(X) for some function

g : X −→ Y ⊂ R.   (5.2)

Then the distribution function F_Y of Y is

F_Y(y) = P[Y ≤ y] = P[g(X) ≤ y], y ∈ R.   (5.3)

The final probability is in principle obtainable from the distribution of X, which solves the problem. If Y has a pdf, we can then find that by differentiating.

5.1.1 Example: Sum of uniforms

Suppose X_1 and X_2 are iid Uniform(0, 1), so that f(x_1, x_2) = 1 and X = (0, 1) × (0, 1), the unit square. Let g(x_1, x_2) = x_1 + x_2, i.e., we want to find the distribution of

Y = X_1 + X_2.   (5.4)

As in (5.3), its distribution function is

F_Y(y) = P[Y ≤ y] = P[X_1 + X_2 ≤ y].   (5.5)


[Figure 5.1: The regions {x | x_1 + x_2 ≤ y} in the unit square, for y = 0.8 and y = 1.3.]

Note that it is important in these calculations to distinguish between random variables (the capitalized letters, X_1, X_2, Y), and particular values they could take on (the lower case letters, like y).

The space of Y is Y = (0, 2). Because the pdf of X is 1, the probability of any set A ⊂ X is just its area. Figure 5.1 shows the regions {x | x_1 + x_2 ≤ y} for y = 0.8 and y = 1.3. When 0 < y < 1, this region is a triangle, whose area is easily seen to be y^2/2. When 1 ≤ y < 2, the area is the unit square minus a triangle, 1 − (2 − y)^2/2. Thus

F_Y(y) =
  0                   if y ≤ 0
  y^2/2               if 0 < y < 1
  1 − (2 − y)^2/2     if 1 ≤ y < 2
  1                   if 2 ≤ y.   (5.6)

The pdf is then the derivative, so that for 0 < y < 2,

f_Y(y) =
  y        if 0 < y < 1
  2 − y    if 1 ≤ y < 2,   (5.7)

which is a “tent" distribution, as in Figure 5.2.


[Figure 5.2: Density of Y, the tent-shaped pdf (5.7) on (0, 2).]

5.1.2 Example: Chi-squared on 1 df

If X ∼ N(µ, 1), then Y = X^2 is noncentral chi-squared on 1 degree of freedom with noncentrality parameter ∆ = µ^2, written

Y ∼ χ_1^2(∆).   (5.8)

(In Definition 10, we present the more general chi-squared.) Because X = R, Y = (0, ∞). (Actually, it is [0, ∞), but leaving out one point is ok for continuous distributions.) Letting f_X(x | µ) be the pdf of X, the distribution function of Y is, for y > 0,

F_Y(y | µ) = P[X^2 ≤ y] = P[−\sqrt{y} < X < \sqrt{y}] = \int_{−\sqrt{y}}^{\sqrt{y}} f_X(x | µ) dx.   (5.9)

The derivative will then yield the pdf of Y. We need the chain rule result from calculus:

\frac{\partial}{\partial y} \int_{a(y)}^{b(y)} g(x) dx = g(b(y)) b'(y) − g(a(y)) a'(y).   (5.10)

Applied to (5.9), this gives

f_Y(y | µ) = F_Y'(y) = f_X(\sqrt{y} | µ) \frac{\partial \sqrt{y}}{\partial y} − f_X(−\sqrt{y} | µ) \frac{\partial (−\sqrt{y})}{\partial y}
  = \frac{1}{2\sqrt{y}} ( f_X(\sqrt{y} | µ) + f_X(−\sqrt{y} | µ) ).   (5.11)


Note that this equation is not restricted to f_X being normal. Now the N(µ, 1) density is

f_X(x | µ) = \frac{1}{\sqrt{2π}} e^{−\frac{1}{2}(x−µ)^2} = \frac{1}{\sqrt{2π}} e^{−\frac{1}{2}x^2} e^{−\frac{1}{2}µ^2} e^{xµ},   (5.12)

so that we can write

f_Y(y | µ) = \frac{1}{\sqrt{2πy}} e^{−\frac{1}{2}y} e^{−\frac{1}{2}µ^2} \left( \frac{e^{µ\sqrt{y}} + e^{−µ\sqrt{y}}}{2} \right) = \frac{1}{\sqrt{2πy}} e^{−\frac{1}{2}y} e^{−\frac{1}{2}µ^2} \cosh(µ\sqrt{y}).   (5.13)

(The cosh is the hyperbolic cosine.) The notation in (5.8) implies that the distribution of Y depends on µ through only ∆ = µ^2, that is, the distribution is the same for µ and −µ. In fact, that can be seen to be true because f_Y(y | µ) = f_Y(y | −µ), so we can replace µ with \sqrt{∆} in (5.13).

When ∆ = 0 (i.e., µ = 0), Y is then central chi-squared, written

Y ∼ χ_1^2.   (5.14)

Its pdf is (5.13) without the last two expressions:

f_Y(y | 0) = \frac{1}{\sqrt{2πy}} e^{−\frac{1}{2}y}.   (5.15)

Notice that this pdf is that of a Gamma(1/2, 1/2). (Also, we can see that Γ(1/2) = \sqrt{π}, as mentioned in (3.70).)

5.1.3 Example: Probability transform

Generating random numbers on a computer is a common activity in statistics, e.g., for inference, to randomize subjects to treatments, and to assess the performance of various procedures. It is easy to generate independent Uniform(0, 1) random variables: easy in the sense that many people have worked very hard for many years to develop good methods that are now available in most statistical software. Actually, the numbers are not truly random, but rather pseudo-random, because there is a deterministic algorithm producing them. But you could also question whether randomness exists in the real world anyway. Even flipping a coin is deterministic, if you know all the physics in the flip.

One usually is not satisfied with uniforms, but rather has normals or gammas, or something more complex, in mind. There are many clever ways to create the desired random variables from uniforms (see for example the Box-Mueller transformation in Equation 6.82), but the most basic uses the inverse distribution function. We suppose U ∼ Uniform(0, 1), and wish to generate an X that has given distribution function F. Assume that F is continuous, and strictly increasing for x ∈ X, so that F^{−1}(u) is well-defined for each u ∈ (0, 1). Consider the random variable

W = F^{−1}(U).   (5.16)

Then its distribution function is

F_W(w) = P[W ≤ w] = P[F^{−1}(U) ≤ w] = P[U ≤ F(w)] = F(w),   (5.17)


where the last step follows because U ∼ Uniform(0, 1) and 0 ≤ F(w) ≤ 1. But that equation means that W has the desired distribution function, hence to generate an X, we generate a U and take X = F^{−1}(U).

To illustrate, suppose we wish X ∼ Exponential(λ). The distribution function F of X is zero for x ≤ 0, and

F(x) = \int_0^x λ e^{−λw} dw = 1 − e^{−λx} for x > 0.   (5.18)

Thus for u ∈ (0, 1),

u = F(x) =⇒ x = F^{−1}(u) = −log(1 − u)/λ.   (5.19)
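A sketch (not from the text, assuming numpy): generate Exponential(λ) draws via the probability transform (5.19) and compare the sample moments to the known mean and variance.

    import numpy as np

    rng = np.random.default_rng(3)
    lam = 2.0
    u = rng.uniform(size=1_000_000)
    x = -np.log(1 - u) / lam            # F^{-1}(U), as in (5.19)

    print(x.mean(), 1 / lam)            # both close to 0.5
    print(x.var(), 1 / lam**2)          # both close to 0.25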

One limitation of this method is that F^{−1} is not always computationally simple. Even in the normal case there is no closed form expression. In addition, this method does not work directly for generating multivariate X, because then F is not invertible.

The approach works for non-continuous distributions as well, but care must be taken because u may fall in a gap. For example, suppose X is Bernoulli(1/2). Then for u = 1/2, we have x = 0, since F(0) = 1/2, but no other value of u ∈ (0, 1) has an x with F(x) = u. The fix is for 0 < u < 1/2, to set F^{−1}(u) = 0, and for 1/2 < u < 1, to set F^{−1}(u) = 1, so that half the time F^{−1}(U) is 0, and half the time it is 1. That is, if u is in a gap, set F^{−1}(u) to the value of x for that gap. Mathematically, we define for 0 < u < 1,

F^{−1}(u) = min{x | F(x) ≥ u},   (5.20)

which will work for continuous and noncontinuous F.
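A sketch (not from the text, assuming numpy): the generalized inverse (5.20) for a discrete distribution, here Bernoulli(1/2), implemented with a search over the cumulative probabilities. np.searchsorted returns the smallest index whose cdf value is at least u.

    import numpy as np

    values = np.array([0, 1])                # support of X
    cdf = np.array([0.5, 1.0])               # F(0), F(1) for Bernoulli(1/2)

    def F_inverse(u):
        return values[np.searchsorted(cdf, u)]   # min{x : F(x) >= u}, as in (5.20)

    rng = np.random.default_rng(4)
    u = rng.uniform(size=100_000)
    x = F_inverse(u)
    print(x.mean())                          # close to 0.5, i.e., P[X = 1] = 1/2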

The process can be reversed, which is useful in hypothesis testing, to obtain p-values. That is, suppose X has continuous strictly increasing (on X) distribution function F, and let

U = F(X).   (5.21)

Then the distribution function of U is

F_U(u) = P[F(X) ≤ u] = P[X ≤ F^{−1}(u)] = F(F^{−1}(u)) = u,   (5.22)

which is the distribution function of a Uniform(0, 1), i.e.,

F(X) ∼ Uniform(0, 1).   (5.23)

5.2 Moment generating functions

Instead of finding the distribution function of Y, one could try to find the mgf. By the uniqueness Theorem 1, if you recognize the mgf of Y as being for a particular distribution, then you know Y has that distribution.


5.2.1 Example: Uniform → Exponential

For example, suppose X ∼ Uniform(0, 1), and Y = −log(X). Then the mgf of Y is

M_Y(t) = E[e^{tY}]
  = E[e^{−t log(X)}]
  = \int_0^1 e^{−t log(x)} dx
  = \int_0^1 x^{−t} dx
  = \frac{1}{−t + 1} x^{−t+1} |_{x=0}^{1}
  = \frac{1}{1 − t} if t < 1 (and +∞ if not).   (5.24)

It turns out that this mgf is that of the Gamma in (3.74), where α = λ = 1, which means it is also Exponential(1), as in (3.72). Thus the pdf of Y is

f_Y(y) = e^{−y} for y ∈ Y = (0, ∞).   (5.25)

5.2.2 Example: Sums of independent Gamma’s

The mgf approach is especially useful for the sums of independent random variables. For example, suppose X_1, ..., X_K are independent, with X_k ∼ Gamma(α_k, λ). (They all have the same scale, but are allowed different shapes.) Let Y = X_1 + ··· + X_K. Then its mgf is

M_Y(t) = E[e^{tY}]
  = E[e^{t(X_1 + ··· + X_K)}]
  = E[e^{tX_1} ··· e^{tX_K}]
  = E[e^{tX_1}] × ··· × E[e^{tX_K}]   by independence of the X_k's
  = M_1(t) × ··· × M_K(t)   where M_k(t) is the mgf of X_k
  = \left( \frac{λ}{λ − t} \right)^{α_1} ··· \left( \frac{λ}{λ − t} \right)^{α_K}   for t < λ, by (3.74)
  = \left( \frac{λ}{λ − t} \right)^{α_1 + ··· + α_K}.   (5.26)

But then this mgf is that of a Gamma(α_1 + ··· + α_K, λ). Thus

X_1 + ··· + X_K ∼ Gamma(α_1 + ··· + α_K, λ).   (5.27)
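A simulation sketch (not from the text, assuming numpy and scipy): sums of independent gammas with a common scale should be gamma with the shapes added, as in (5.27). The shapes and rate here are arbitrary.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    alphas, lam = [1.0, 2.5, 0.5], 2.0
    # numpy's gamma sampler uses shape and scale = 1/lambda
    y = sum(rng.gamma(shape=a, scale=1/lam, size=200_000) for a in alphas)

    # compare the simulated sum with the claimed Gamma(sum(alphas), lam)
    ks = stats.kstest(y, stats.gamma(a=sum(alphas), scale=1/lam).cdf)
    print(ks.pvalue)        # typically large: no evidence against (5.27)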

What if the λ's are not equal? Then it is still easy to find the mgf, but it would not be the Gamma mgf, or anything else we have seen so far.

5.2.3 Linear combinations of independent Normals

In (3.65), we have that the mgf of a N(µ, σ^2) is

M(t | µ, σ^2) = e^{tµ + \frac{1}{2}σ^2 t^2}, t ∈ R.   (5.28)


Now suppose X_1, ..., X_K are independent, X_k ∼ N(µ_k, σ_k^2). Consider the affine combination

Y = a + b_1 X_1 + ··· + b_K X_K.   (5.29)

It is straightforward, from Section 3.2.2, to see that

E[Y] = a + b_1 µ_1 + ··· + b_K µ_K and Var[Y] = b_1^2 σ_1^2 + ··· + b_K^2 σ_K^2,   (5.30)

since independence implies that all the covariances are 0. But those equations do not give the entire distribution of Y. We need the mgf:

M_Y(t) = E[e^{tY}]
  = E[e^{t(a + b_1 X_1 + ··· + b_K X_K)}]
  = e^{at} E[e^{(tb_1)X_1}] × ··· × E[e^{(tb_K)X_K}]   by independence
  = e^{at} M(tb_1 | µ_1, σ_1^2) × ··· × M(tb_K | µ_K, σ_K^2)
  = e^{at} e^{tb_1 µ_1 + \frac{1}{2}t^2 b_1^2 σ_1^2} ··· e^{tb_K µ_K + \frac{1}{2}t^2 b_K^2 σ_K^2}
  = e^{t(a + b_1 µ_1 + ··· + b_K µ_K) + \frac{1}{2}t^2(b_1^2 σ_1^2 + ··· + b_K^2 σ_K^2)}
  = M(t | a + b_1 µ_1 + ··· + b_K µ_K, b_1^2 σ_1^2 + ··· + b_K^2 σ_K^2).   (5.31)

Notice that in going from the third to fourth step, we have changed the variable for the mgf's from t to the b_k t's, which is legitimate. That mgf is indeed the mgf of a Normal, with the appropriate mean and variance, i.e.,

Y ∼ N(a + b_1 µ_1 + ··· + b_K µ_K, b_1^2 σ_1^2 + ··· + b_K^2 σ_K^2).   (5.32)

If the X_k's are iid N(µ, σ^2), then (5.32) can be used to show that

\sum X_k ∼ N(Kµ, Kσ^2) and \bar{X} ∼ N(µ, σ^2/K).   (5.33)

5.2.4 Normalized means

The Central Limit Theorem is central to statistics as it justifies using the normal distribution in certain non-normal situations. Here we assume we have X_1, ..., X_n iid with any distribution, as long as the mgf M(t) of X_i exists for t in a neighborhood of 0. The normalized mean is

W_n = \sqrt{n} \frac{\bar{X} − µ}{σ},   (5.34)

where µ = E[X_i] and σ^2 = Var[X_i]. If the X_i are normal, then W_n ∼ N(0, 1). The Central Limit Theorem implies that even if the X_i's are not normal, W_n is “approximately" normal if n is “large." We will look at the cumulant generating function of W_n, and compare it to the normal's.

First, the mgf of W_n:

M_n(t) = E[e^{tW_n}]
  = E[e^{t\sqrt{n}(\bar{X} − µ)/σ}]
  = e^{−t\sqrt{n}µ/σ} E[e^{(t/(\sqrt{n}σ)) \sum X_i}]
  = e^{−t\sqrt{n}µ/σ} \prod E[e^{(t/(\sqrt{n}σ)) X_i}]
  = e^{−t\sqrt{n}µ/σ} M(t/(\sqrt{n}σ))^n.   (5.35)


Letting c(t) be the cumulant generating function of X_i, and c_n(t) be that of W_n, we have that

c_n(t) = −\frac{t\sqrt{n}µ}{σ} + n c\left( \frac{t}{\sqrt{n}σ} \right).   (5.36)

Now imagine taking the derivatives of c_n(t) with respect to t. The first two:

c_n'(t) = −\frac{\sqrt{n}µ}{σ} + n c'\left( \frac{t}{\sqrt{n}σ} \right) \frac{1}{\sqrt{n}σ} ⇒ E[W_n] = c_n'(0) = −\frac{\sqrt{n}µ}{σ} + nµ \frac{1}{\sqrt{n}σ} = 0;

c_n''(t) = n c''\left( \frac{t}{\sqrt{n}σ} \right) \frac{1}{(\sqrt{n}σ)^2} ⇒ Var[W_n] = c_n''(0) = nσ^2 \frac{1}{(\sqrt{n}σ)^2} = 1.   (5.37)

It is easy enough to find the mean and variance of W_n directly, but this approach is helpful for the higher-order cumulants. Note that each derivative brings out another 1/(\sqrt{n}σ) from c(t), hence the kth cumulant of W_n is

c_n^{(k)}(0) = n c^{(k)}(0) \frac{1}{(\sqrt{n}σ)^k},   (5.38)

and letting γ_k be the kth cumulant of X_i, and γ_{n,k} be that of W_n,

γ_{n,k} = \frac{1}{n^{k/2−1}} \frac{1}{σ^k} γ_k.   (5.39)

For the normal, all cumulants are 0 after the first two. Thus the closer the γ_{n,k}'s are to 0, the closer the distribution of W_n is to N(0, 1) in some sense. There are then two factors: The larger n, the closer these cumulants are to 0 for k > 2. Also, the smaller γ_k/σ^k in absolute value, the closer to 0.

cumulants are γk = (k− 1)!, and cn,k = (k− 1)!/nk/2−1. The Poisson(λ) has cumulant

generating function c(t) = λ(et − 1), so all its derivatives are λet, hence γk = λ, and

γn,k =1

nk/2−1

1

σk/2γk =

1

nk/2−1

1

λk/2λ =

1

(nλ)k/2−1. (5.40)

The Double Exponential has pdf (1/2)e^{−|x|} for x ∈ R. Its cumulants are

γ_k = 0 if k is odd, and γ_k = 2(k − 1)! if k is even.   (5.41)

Thus the variance is 2, and

γ_{n,k} = 0 if k is odd, and γ_{n,k} = (k − 1)!/(2n)^{k/2−1} if k is even.   (5.42)


Here is a small table with some values of γ_{n,k}:

γ_{n,k}            Normal   Exp(1)   Pois(1/10)   Pois(1)   Pois(10)   DE
k = 3   n = 1         0      2.000      3.162      1.000      0.316    0.000
        n = 10        0      0.632      1.000      0.316      0.100    0.000
        n = 100       0      0.200      0.316      0.100      0.032    0.000
k = 4   n = 1         0      6.000     10.000      1.000      0.100    3.000
        n = 10        0      0.600      1.000      0.100      0.010    0.300
        n = 100       0      0.060      0.100      0.010      0.001    0.030
k = 5   n = 1         0     24.000     31.623      1.000      0.032    0.000
        n = 10        0      0.759      1.000      0.032      0.001    0.000
        n = 100       0      0.024      0.032      0.001      0.000    0.000
   (5.43)

For each distribution, as n increases, the cumulants do decrease. Also, the exponential is closer to normal than the Poisson(1/10), but the Poisson(1) is closer than the exponential, and the Poisson(10) is even closer. The Double Exponential is symmetric, so its odd cumulants are 0, automatically making it relatively close to normal. Its kurtosis is a bit worse than the Poisson(1), however.

5.3 Discrete distributions

If the X is discrete, then any function Y = g(X) will also be discrete, hence its pmf can be found via

f_Y(y) = P[Y = y] = P[g(X) = y], y ∈ Y.   (5.44)

Of course, that final probability may or may not be easy to find. One situation in which it is easy is when g is a one-to-one and onto function from X to Y, so that there exists an inverse function,

g^{−1} : Y → X ; g(g^{−1}(y)) = y and g^{−1}(g(x)) = x.   (5.45)

Then, with f_X being the pmf of X,

f_Y(y) = P[g(X) = y] = P[X = g^{−1}(y)] = f_X(g^{−1}(y)).   (5.46)

For example, if X ∼ Poisson(λ), and Y = X^2, then g(x) = x^2, hence g^{−1}(y) = \sqrt{y} for y ∈ Y = {0, 1, 4, 9, ...}. The pmf of Y is then

f_Y(y) = f_X(\sqrt{y}) = e^{−λ} \frac{λ^{\sqrt{y}}}{\sqrt{y}!}, y ∈ Y.   (5.47)

Notice that it is important to have the spaces correct. E.g., this g is not one-to-one if the space is R, and the “\sqrt{y}!" makes sense only if y is the square of a nonnegative integer. We consider some more examples.

5.3.1 Example: Sum of discrete uniforms

Suppose X = (X_1, X_2), where X_1 and X_2 are independent, and

X_1 ∼ Uniform{0, 1} and X_2 ∼ Uniform{0, 1, 2}.   (5.48)


We are after the distribution of Y = X_1 + X_2. The space of Y can be seen to be Y = {0, 1, 2, 3}. This function is not one-to-one, e.g., there are two x's that sum to 1: (0, 1) and (1, 0). This is a small enough example that we can just write out all the possibilities:

f_Y(0) = P[X_1 + X_2 = 0] = P[X = (0, 0)] = \frac{1}{2} × \frac{1}{3} = \frac{1}{6}
f_Y(1) = P[X_1 + X_2 = 1] = P[X = (0, 1) or X = (1, 0)] = \frac{1}{2} × \frac{1}{3} + \frac{1}{2} × \frac{1}{3} = \frac{2}{6}
f_Y(2) = P[X_1 + X_2 = 2] = P[X = (0, 2) or X = (1, 1)] = \frac{1}{2} × \frac{1}{3} + \frac{1}{2} × \frac{1}{3} = \frac{2}{6}
f_Y(3) = P[X_1 + X_2 = 3] = P[X = (1, 2)] = \frac{1}{2} × \frac{1}{3} = \frac{1}{6}.   (5.49)

5.3.2 Example: Convolutions

Here we generalize the previous example a bit by assuming X_1 has pmf f_1 and space X_1 = {0, 1, ..., a}, and X_2 has pmf f_2 and space X_2 = {0, 1, ..., b}. Both a and b are positive integers, or either could be +∞. Then Y has space

Y = {0, 1, ..., a + b}.   (5.50)

To find f_Y(y), we need to sum up the probabilities of all (x_1, x_2)'s for which x_1 + x_2 = y. These pairs can be written (x_1, y − x_1), and require x_1 ∈ X_1 as well as y − x_1 ∈ X_2. That is, for fixed y ∈ Y,

0 ≤ x_1 ≤ a and 0 ≤ y − x_1 ≤ b =⇒ max{0, y − b} ≤ x_1 ≤ min{a, y}.   (5.51)

For example, with a = 3 and b = 3, the following table shows which x_1's correspond to each y:

x_1 ↓ ; x_2 →   0   1   2   3
0               0   1   2   3
1               1   2   3   4
2               2   3   4   5
3               3   4   5   6
   (5.52)

Each value of y appears along a diagonal, so that

y = 0 ⇒ x_1 = 0
y = 1 ⇒ x_1 = 0, 1
y = 2 ⇒ x_1 = 0, 1, 2
y = 3 ⇒ x_1 = 0, 1, 2, 3
y = 4 ⇒ x_1 = 1, 2, 3
y = 5 ⇒ x_1 = 2, 3
y = 6 ⇒ x_1 = 3.   (5.53)


Thus in general, for y ∈ Y,

f_Y(y) = P[X_1 + X_2 = y] = \sum_{x_1 = max{0, y−b}}^{min{a, y}} P[X_1 = x_1, X_2 = y − x_1] = \sum_{x_1 = max{0, y−b}}^{min{a, y}} f_1(x_1) f_2(y − x_1).   (5.54)

This formula is often called the convolution of f_1 and f_2. In general, the convolution of two random variables is the distribution of the sum.

To illustrate, suppose X_1 has the pmf of the Y in (5.49), and X_2 is Uniform{0, 1, 2, 3}, so that a = b = 3 and

f_1(0) = f_1(3) = \frac{1}{6}, f_1(1) = f_1(2) = \frac{1}{3}, and f_2(x_2) = \frac{1}{4}, x_2 = 0, 1, 2, 3.   (5.55)

Then Y = {0, 1, ..., 6}, and

f_Y(0) = \sum_{x_1=0}^{0} f_1(x_1) f_2(0 − x_1) = \frac{1}{24}
f_Y(1) = \sum_{x_1=0}^{1} f_1(x_1) f_2(1 − x_1) = \frac{1}{24} + \frac{1}{12} = \frac{3}{24}
f_Y(2) = \sum_{x_1=0}^{2} f_1(x_1) f_2(2 − x_1) = \frac{1}{24} + \frac{1}{12} + \frac{1}{12} = \frac{5}{24}
f_Y(3) = \sum_{x_1=0}^{3} f_1(x_1) f_2(3 − x_1) = \frac{1}{24} + \frac{1}{12} + \frac{1}{12} + \frac{1}{24} = \frac{6}{24}
f_Y(4) = \sum_{x_1=1}^{3} f_1(x_1) f_2(4 − x_1) = \frac{1}{12} + \frac{1}{12} + \frac{1}{24} = \frac{5}{24}
f_Y(5) = \sum_{x_1=2}^{3} f_1(x_1) f_2(5 − x_1) = \frac{1}{12} + \frac{1}{24} = \frac{3}{24}
f_Y(6) = \sum_{x_1=3}^{3} f_1(x_1) f_2(6 − x_1) = \frac{1}{24}.   (5.56)

Check that the f_Y(y)'s do sum to 1.
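A sketch (not from the text, assuming numpy): since the pmf of a sum of independent discrete variables is the convolution of their pmf's, (5.56) can be checked with numpy's convolve.

    import numpy as np

    f1 = np.array([1/6, 1/3, 1/3, 1/6])   # pmf of X1 on {0, 1, 2, 3}, from (5.55)
    f2 = np.array([1/4, 1/4, 1/4, 1/4])   # pmf of X2, Uniform{0, 1, 2, 3}

    fY = np.convolve(f1, f2)              # pmf of Y = X1 + X2 on {0, ..., 6}
    print(fY * 24)                        # [1. 3. 5. 6. 5. 3. 1.], matching (5.56)
    print(fY.sum())                       # 1.0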

5.3.3 Example: Sum of two Poissons

An example for which a = b = ∞ has X_1 and X_2 independent Poisson's, with parameters λ_1 and λ_2, respectively. Then Y = X_1 + X_2 has space Y = {0, 1, ...}, the same


as the spaces of the X_i's. In this case, for fixed y, x_1 = 0, ..., y, hence

f_Y(y) = \sum_{x_1=0}^{y} f_1(x_1) f_2(y − x_1)
  = \sum_{x_1=0}^{y} e^{−λ_1} \frac{λ_1^{x_1}}{x_1!} e^{−λ_2} \frac{λ_2^{y−x_1}}{(y − x_1)!}
  = e^{−λ_1−λ_2} \sum_{x_1=0}^{y} \frac{1}{x_1!(y − x_1)!} λ_1^{x_1} λ_2^{y−x_1}
  = e^{−λ_1−λ_2} \frac{1}{y!} \sum_{x_1=0}^{y} \frac{y!}{x_1!(y − x_1)!} λ_1^{x_1} λ_2^{y−x_1}
  = e^{−λ_1−λ_2} \frac{1}{y!} (λ_1 + λ_2)^y,   (5.57)

the last step using the Binomial Theorem in (3.80). But that last expression is the Poisson pmf, i.e.,

Y ∼ Poisson(λ_1 + λ_2).   (5.58)

This fact can be proven also using mgf’s.

5.3.4 Binomials and Bernoullis

The distribution of a random variable that takes on just the values 0 and 1 is completely specified by giving the probability it is 1. Such a variable is called Bernoulli.

Definition 3. If Z has space {0, 1}, then it is Bernoulli with parameter p = P[Z = 1], written

Z ∼ Bernoulli(p).   (5.59)

The pmf can then be written

f(z) = p^z (1 − p)^{1−z}.   (5.60)

Note that Bernoulli(p) = Binomial(1, p). A better alternative definition1 for the binomial is based on Bernoullis.

Definition 4. If Z_1, ..., Z_n are iid Bernoulli(p), then X = Z_1 + ··· + Z_n is binomial with parameters n and p, written

X ∼ Binomial(n, p).   (5.61)

The Bernoullis represent the individual trials, and a “1" indicates success. The moments, and mgf, of a Bernoulli are easy to find. In fact, because Z^k = Z for k = 1, 2, ...,

E[Z] = E[Z^2] = 0 × (1 − p) + 1 × p = p ⇒ Var[Z] = p − p^2 = p(1 − p),   (5.62)

and

M_Z(t) = E[e^{tZ}] = e^0(1 − p) + e^t p = pe^t + 1 − p.   (5.63)

1Alternative to the definition via the pmf.


Thus for X ∼ Binomial(n, p),

E[X] = n E[Z_i] = np, Var[X] = n Var[Z_i] = np(1 − p),   (5.64)

and

M_X(t) = E[e^{t(Z_1 + ··· + Z_n)}] = E[e^{tZ_1}] ··· E[e^{tZ_n}] = M_Z(t)^n = (pe^t + 1 − p)^n.   (5.65)

This mgf is the same as we found in (3.81), meaning we really are defining the same binomial.


Chapter 6

Transformations II: Continuous distributions

The mgf approach is very convenient, especially when dealing with sums of independent variables, but does have the drawback that one needs to recognize the resulting mgf (although there are ways to find the density from the mgf). Densities are typically most useful, but to find the pdf of a transformation, one cannot simply plug g^{−1}(y) into the pdf of X, as in (5.47). The problem is that for a pdf, f_X(x) is not P[X = x], which is 0. Rather, for a small area A ⊂ X that contains x,

P[X ∈ A] ≈ f_X(x) × Area(A).   (6.1)

Now suppose g : X → Y is one-to-one and onto. Then for y ∈ B ⊂ Y,

P[Y ∈ B] ≈ f_Y(y) × Area(B).   (6.2)

Now

P[Y ∈ B] = P[X ∈ g^{−1}(B)] ≈ f_X(g^{−1}(y)) × Area(g^{−1}(B)),   (6.3)

where

g^{−1}(B) = {x ∈ X | g(x) ∈ B},   (6.4)

hence

f_Y(y) ≈ f_X(g^{−1}(y)) × \frac{Area(g^{−1}(B))}{Area(B)}.   (6.5)

Compare this equation to (5.46). For continuous distributions, we need to take care of the transformation of areas as well as the transformation of y itself. The actual pdf of Y is found by shrinking the B in (6.5) down to the point y.

6.1 One dimension

When X and Y are random variables, the ratio of areas in (6.5) is easy to find. Because g and g^{−1} are one-to-one, they must be either strictly increasing or strictly decreasing.


For y ∈ Y, take ε small enough that B_ε ≡ [y, y + ε) ⊂ Y. Then, since the area of an interval is just its length,

\frac{Area(g^{−1}(B_ε))}{Area(B_ε)} = \frac{|g^{−1}(y + ε) − g^{−1}(y)|}{ε} → \left| \frac{\partial}{\partial y} g^{−1}(y) \right| as ε → 0.   (6.6)

(The absolute value is there in case g is decreasing.) That derivative is called the Jacobian of the transformation g^{−1}. That is,

f_Y(y) = f_X(g^{−1}(y)) |J_{g^{−1}}(y)|, where J_{g^{−1}}(y) = \frac{\partial}{\partial y} g^{−1}(y).   (6.7)

This approach needs a couple of assumptions: y is in the interior of Y, and the derivative of g^{−1}(y) exists.

Reprising Example 5.2.1, where X ∼ Uniform(0, 1) and Y = −log(X), we have Y = (0, ∞), and g^{−1}(y) = e^{−y}. Then

f_X(e^{−y}) = 1 and J_{g^{−1}}(y) = \frac{\partial}{\partial y} e^{−y} = −e^{−y} =⇒ f_Y(y) = 1 × |−e^{−y}| = e^{−y},   (6.8)

which is indeed the answer from (5.25).

6.2 General case

For the general case (for vectors), we need to figure out the ratio of the volumes, which is again given by the Jacobian, but for vectors.

Definition 5. Suppose g : X → Y is one-to-one and onto, where both X and Y are open subsets of R^p, and all the first partial derivatives of g^{−1}(y) exist. Then the Jacobian of the transformation g^{−1} is defined to be

J_{g^{−1}} : Y → R,   (6.9)

J_{g^{−1}}(y) = \begin{vmatrix} \frac{\partial}{\partial y_1} g_1^{−1}(y) & \frac{\partial}{\partial y_2} g_1^{−1}(y) & \cdots & \frac{\partial}{\partial y_p} g_1^{−1}(y) \\ \frac{\partial}{\partial y_1} g_2^{−1}(y) & \frac{\partial}{\partial y_2} g_2^{−1}(y) & \cdots & \frac{\partial}{\partial y_p} g_2^{−1}(y) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial}{\partial y_1} g_p^{−1}(y) & \frac{\partial}{\partial y_2} g_p^{−1}(y) & \cdots & \frac{\partial}{\partial y_p} g_p^{−1}(y) \end{vmatrix},   (6.10)

where g^{−1}(y) = (g_1^{−1}(y), ..., g_p^{−1}(y)), and here the “| · |" represents the determinant.

The next theorem is from advanced calculus.


Theorem 2. Suppose the conditions in Definition 5 hold, and J_{g^{−1}}(y) is continuous and non-zero for y ∈ Y. Then

f_Y(y) = f_X(g^{−1}(y)) × |J_{g^{−1}}(y)|.   (6.11)

Here the “| · |" represents the absolute value.

If you think of g^{−1}(y) as x, you can remember the formula as

f_Y(y) = f_X(x) × \left| \frac{dx}{dy} \right|.   (6.12)

6.2.1 Example: Gamma – Beta

Suppose X_1 and X_2 are independent, with

X_1 ∼ Gamma(α, λ) and X_2 ∼ Gamma(β, λ),   (6.13)

so that they have the same scale but possibly different shapes. We are interested in

Y_1 = \frac{X_1}{X_1 + X_2}.   (6.14)

This variable arises, e.g., in linear regression, where R^2 has that distribution under certain conditions. The function taking (x_1, x_2) to y_1 is not one-to-one. To fix that up, we introduce another variable, Y_2, so that the function to the pair (y_1, y_2) is one-to-one. Then to find the pdf of Y_1, we integrate out Y_2. We will take

Y_2 = X_1 + X_2.   (6.15)

Then

X = (0, ∞) × (0, ∞) and Y = (0, 1) × (0, ∞).   (6.16)

To find g^{−1}, solve the equation y = g(x) for x:

y_1 = \frac{x_1}{x_1 + x_2}, y_2 = x_1 + x_2 =⇒ x_1 = y_1 y_2, x_2 = y_2 − x_1 =⇒ x_1 = y_1 y_2, x_2 = y_2(1 − y_1),   (6.17)

hence

g^{−1}(y_1, y_2) = (y_1 y_2, y_2(1 − y_1)).   (6.18)

Using this inverse, we can see that indeed the function is onto the Y in (6.16), since any y_1 ∈ (0, 1) and y_2 ∈ (0, ∞) will yield, via (6.18), x_1 and x_2 in (0, ∞). For the Jacobian:

J_{g^{−1}}(y) = \begin{vmatrix} \frac{\partial}{\partial y_1} y_1 y_2 & \frac{\partial}{\partial y_2} y_1 y_2 \\ \frac{\partial}{\partial y_1} y_2(1 − y_1) & \frac{\partial}{\partial y_2} y_2(1 − y_1) \end{vmatrix} = \begin{vmatrix} y_2 & y_1 \\ −y_2 & 1 − y_1 \end{vmatrix} = y_2(1 − y_1) + y_1 y_2 = y_2.   (6.19)


Now because the X_i's are independent Gammas, their pdf is

f_X(x_1, x_2) = \frac{λ^α}{Γ(α)} x_1^{α−1} e^{−λx_1} \frac{λ^β}{Γ(β)} x_2^{β−1} e^{−λx_2}.   (6.20)

Then the pdf of Y is

f_Y(y) = f_X(g^{−1}(y)) × |J_{g^{−1}}(y)|
  = f_X(y_1 y_2, y_2(1 − y_1)) × |y_2|
  = \frac{λ^α}{Γ(α)} (y_1 y_2)^{α−1} e^{−λ y_1 y_2} \frac{λ^β}{Γ(β)} (y_2(1 − y_1))^{β−1} e^{−λ y_2(1 − y_1)} × |y_2|
  = \frac{λ^{α+β}}{Γ(α)Γ(β)} y_1^{α−1} (1 − y_1)^{β−1} y_2^{α+β−1} e^{−λ y_2}.   (6.21)

To find the pdf of Y_1, we can integrate out y_2:

f_{Y_1}(y_1) = \int_0^∞ f_Y(y) dy_2.   (6.22)

That certainly is a fine approach. But in this case, note that the joint pdf in (6.21) can be factored into a function of just y_1, and a function of just y_2. That fact, coupled with the fact that the space is a rectangle, means that we automatically have that Y_1 and Y_2 are independent, and also have the forms of their marginal pdf's. The following lemma is more explicit.

Lemma 8. Suppose Y = (Y^{(1)}, Y^{(2)}) has pdf f_Y(y) = c h_1(y^{(1)}) h_2(y^{(2)}) for some constant c and functions (not necessarily pdf's) h_1 and h_2, and has joint space Y = Y_1 × Y_2. Then Y^{(1)} and Y^{(2)} are independent, and have marginal pdf's

f_1(y^{(1)}) = c_1 h_1(y^{(1)}) and f_2(y^{(2)}) = c_2 h_2(y^{(2)}),   (6.23)

respectively, where c = c_1 c_2, and c_1 and c_2 are constants that ensure the pdf's integrate to 1.

Apply this lemma to (6.21), where

c = \frac{λ^{α+β}}{Γ(α)Γ(β)}, h_1(y_1) = y_1^{α−1}(1 − y_1)^{β−1} and h_2(y_2) = y_2^{α+β−1} e^{−λ y_2}.   (6.24)

The lemma shows that Y_1 and Y_2 are independent, and the pdf of Y_2 is c_2 h_2(y_2) for some c_2. But notice that is the Gamma(α + β, λ) pdf, hence we already know that

c_2 = \frac{λ^{α+β}}{Γ(α + β)}.   (6.25)

Then the pdf of Y_1 is c_1 h_1(y_1), where it must be that c_1 = c/c_2, hence

f_1(y_1) = \frac{λ^{α+β}/(Γ(α)Γ(β))}{λ^{α+β}/Γ(α + β)} y_1^{α−1}(1 − y_1)^{β−1} = \frac{Γ(α + β)}{Γ(α)Γ(β)} y_1^{α−1}(1 − y_1)^{β−1}.   (6.26)


That is the Beta(α, β) pdf. Note that we have surreptitiously also proven that

\int_0^1 y_1^{α−1}(1 − y_1)^{β−1} dy_1 = \frac{Γ(α)Γ(β)}{Γ(α + β)} ≡ β(α, β),   (6.27)

which is the beta function.

To summarize, we have shown that X_1/(X_1 + X_2) and X_1 + X_2 are independent (even though they do not really look independent), and that

\frac{X_1}{X_1 + X_2} ∼ Beta(α, β) and X_1 + X_2 ∼ Gamma(α + β, λ).   (6.28)

That last fact we already knew, from Example 5.2.2. Also, notice that the Beta variable does not depend on the scale λ.

Turning to the moments of the Beta, it is easy enough to find them directly by integrating. The mgf is not particularly helpful, as it is a confluent hypergeometric function. But since we know the first two moments of the Gamma, we can use the independence of Y_1 and Y_2 to derive them. We will take λ = 1 for simplicity, which is fine because the distribution of Y_1 does not depend on λ. The mean and variance of a Gamma(α, 1) are both α from (3.75) and (3.76), hence

E[X_1] = α, E[X_1^2] = Var[X_1] + E[X_1]^2 = α + α^2,
E[Y_2] = α + β, E[Y_2^2] = α + β + (α + β)^2.   (6.29)

Then since X_1 = Y_1 Y_2 and Y_1 and Y_2 are independent,

E[X_1] = E[Y_1] E[Y_2] and E[X_1^2] = E[Y_1^2] E[Y_2^2]
⇒ α = E[Y_1](α + β) and α + α^2 = E[Y_1^2](α + β + (α + β)^2)
⇒ E[Y_1] = \frac{α}{α + β} and E[Y_1^2] = \frac{α(α + 1)}{(α + β)(α + β + 1)},   (6.30)

hence

Var[Y_1] = \frac{α(α + 1)}{(α + β)(α + β + 1)} − \left( \frac{α}{α + β} \right)^2 = \frac{αβ}{(α + β)^2(α + β + 1)}.   (6.31)
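A simulation sketch (not from the text, assuming numpy and scipy): with X_1 ∼ Gamma(α, λ) and X_2 ∼ Gamma(β, λ) independent, X_1/(X_1 + X_2) should be Beta(α, β) and uncorrelated with X_1 + X_2, as in (6.28). The parameter values are arbitrary.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    alpha, beta, lam = 2.0, 3.0, 1.5
    x1 = rng.gamma(shape=alpha, scale=1/lam, size=200_000)
    x2 = rng.gamma(shape=beta, scale=1/lam, size=200_000)
    y1, y2 = x1 / (x1 + x2), x1 + x2

    print(stats.kstest(y1, stats.beta(alpha, beta).cdf).pvalue)  # typically large: consistent with Beta
    print(np.corrcoef(y1, y2)[0, 1])                             # near 0 (they are independent)
    print(y1.mean(), alpha / (alpha + beta))                     # both near 0.4, as in (6.30)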

6.2.2 Example: The Dirichlet distribution

The Dirichlet distribution is a multivariate version of the beta. We start with the independent random variables X_1, ..., X_K, where

X_k ∼ Gamma(α_k, 1).   (6.32)

Then the (K − 1) × 1 vector Y defined via

Y_k = \frac{X_k}{X_1 + ··· + X_K}, k = 1, ..., K − 1,   (6.33)

has a Dirichlet distribution, written

Y ∼ Dirichlet(α_1, ..., α_K).   (6.34)


There is also a Y_K, which may come in handy, but because Y_1 + ··· + Y_K = 1, it is redundant. Also, as in the Beta, the definition is the same if the X_k's are Gamma(α_k, λ).

The representation (6.33) makes it easy to find the marginals, i.e.,

Y_k ∼ Beta(α_k, α_1 + ··· + α_K − α_k),   (6.35)

hence the marginal means and variances. Sums of the Y_k's are also beta, e.g., if K = 4, then

Y_1 + Y_3 = \frac{X_1 + X_3}{X_1 + X_2 + X_3 + X_4} ∼ Beta(α_1 + α_3, α_2 + α_4).   (6.36)

The space of Y is

Y = {y ∈ R^{K−1} | 0 < y_k < 1, k = 1, ..., K − 1, and y_1 + ··· + y_{K−1} < 1}.   (6.37)

To find the pdf of Y, we need a one-to-one transformation from X, so we need to append another variable to Y. The easiest choice is

W = X_1 + ··· + X_K,   (6.38)

so that g(x) = (y, w). Then g^{−1} is given by

x_1 = w y_1
x_2 = w y_2
...
x_{K−1} = w y_{K−1}
x_K = w(1 − y_1 − ··· − y_{K−1}).   (6.39)

Left to the reader: The Jacobian is w^{K−1}, and the joint pdf of (Y, W) is

f_{(Y,W)}(y, w) = \left[ \prod_{k=1}^{K} \frac{1}{Γ(α_k)} \right] \left[ \prod_{k=1}^{K−1} y_k^{α_k−1} \right] (1 − y_1 − ··· − y_{K−1})^{α_K−1} w^{α_1+···+α_K−1} e^{−w}.   (6.40)

The joint space of (Y, W) can be shown to be Y × (0, ∞), which, together with the factorization of the density in (6.40), means that Y and W are independent. In addition, W ∼ Gamma(α_1 + ··· + α_K, 1), which can be seen either by looking at the w-part in (6.40), or by noting that it is the sum of independent gammas with the same scale parameter. Either way, we then have that the constant that goes with the w-part is 1/Γ(α_1 + ··· + α_K), hence the pdf of Y must be

f_Y(y) = \frac{Γ(α_1 + ··· + α_K)}{Γ(α_1) ··· Γ(α_K)} y_1^{α_1−1} ··· y_{K−1}^{α_{K−1}−1} (1 − y_1 − ··· − y_{K−1})^{α_K−1}.   (6.41)

Comparing this to the beta pdf in (6.26), we see that the Dirichlet is indeed an extension, and

Beta(α, β) = Dirichlet(α, β).   (6.42)
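A sketch (not from the text, assuming numpy): build a Dirichlet vector from independent gammas as in (6.33) and compare its mean to numpy's built-in Dirichlet sampler. The α's are arbitrary.

    import numpy as np

    rng = np.random.default_rng(7)
    alphas = np.array([2.0, 3.0, 5.0])                      # K = 3
    X = rng.gamma(shape=alphas, size=(200_000, 3))          # X_k ~ Gamma(alpha_k, 1)
    Y = X / X.sum(axis=1, keepdims=True)                    # (Y_1, Y_2, Y_3); Y_3 is redundant

    print(Y.mean(axis=0))                                   # approx alphas / alphas.sum()
    print(rng.dirichlet(alphas, size=200_000).mean(axis=0)) # same, from the built-in sampler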


6.2.3 Example: Affine transformations

Suppose X is p × 1, and for given p × 1 vector a and p × p matrix B, let

Y = g(X) = a + BX.   (6.43)

In order for this transformation to be one-to-one, we need that B is invertible, which we will assume, so that

X = g^{−1}(Y) = B^{−1}(Y − a).   (6.44)

For a matrix C and vector z, with w = Cz, it is not hard to see that

\frac{\partial w_i}{\partial z_j} = C_{ij},   (6.45)

the ijth element of C. Thus from (6.44), since the a part is a constant, the Jacobian is

J_{g^{−1}}(y) = |B^{−1}|.   (6.46)

The invertibility of B ensures that this Jacobian is nonzero and finite.

6.2.4 Example: Bivariate Normal

We apply this result to the normal, where X_1 and X_2 are iid N(0, 1), a is a 2 × 1 vector, and B is a 2 × 2 invertible matrix. Then Y is given as in (6.43). The mean and covariance matrix for X are

E[X] = 0_2 and Cov[X] = I_2,   (6.47)

so that from (3.96) and (3.102),

µ ≡ E[Y] = a and Σ ≡ Cov[Y] = B Cov[X] B′ = BB′.   (6.48)

The space of X, hence of Y, is R^2. To find the pdf of Y, we start with that of X:

f_X(x) = \prod_{i=1}^{2} \frac{1}{\sqrt{2π}} e^{−\frac{1}{2}x_i^2} = \frac{1}{2π} e^{−\frac{1}{2}x′x}.   (6.49)

Then

f_Y(y) = f_X(B^{−1}(y − a)) abs(|B^{−1}|)
  = \frac{1}{2π} e^{−\frac{1}{2}(B^{−1}(y−a))′B^{−1}(y−a)} abs(|B^{−1}|).   (6.50)

Using (6.48), we can write

(B^{−1}(y − a))′ B^{−1}(y − a) = (y − µ)′(BB′)^{−1}(y − µ) = (y − µ)′Σ^{−1}(y − µ),   (6.51)

and using properties of determinants,

abs(|B^{−1}|) = \sqrt{|B^{−1}||B^{−1}|} = \sqrt{(|B||B′|)^{−1}} = \frac{1}{\sqrt{|Σ|}},   (6.52)


hence the pdf of Y can be given as a function of the mean and covariance matrix:

f_Y(y) = \frac{1}{2π} \frac{1}{\sqrt{|Σ|}} e^{−\frac{1}{2}(y−µ)′Σ^{−1}(y−µ)}.   (6.53)

This Y is bivariate normal, with mean µ and covariance matrix Σ, written very much in the same way as the regular Normal,

Y ∼ N(µ, Σ).   (6.54)

In particular, the X here is

X ∼ N(0_2, I_2).   (6.55)

The mgf of a bivariate normal is not hard to find given that of the X. Because X_1 and X_2 are iid N(0, 1), their mgf is

M_X(t) = e^{\frac{1}{2}t_1^2} e^{\frac{1}{2}t_2^2} = e^{\frac{1}{2}‖t‖^2}.   (6.56)

Then with Y = a + BX,

M_Y(t) = E[e^{t′Y}]
  = E[e^{t′(a + BX)}]
  = e^{t′a} E[e^{(B′t)′X}]
  = e^{t′a} M_X(B′t)
  = e^{t′a} e^{\frac{1}{2}‖B′t‖^2}
  = e^{t′µ + \frac{1}{2}t′Σt}   by (6.48),   (6.57)

because ‖B′t‖^2 = (B′t)′B′t = t′BB′t = t′Σt. Compare this mgf to that of the regular normal, in (3.65).

Later, we will deal with the general p × 1 multivariate normal, proceeding exactly the same way as above, except the matrices and vectors have more elements.

6.2.5 Example: Orthogonal transformations

A p × p orthogonal matrix is a matrix Γ such that

Γ′Γ = ΓΓ′ = I_p.   (6.58)

An orthogonal matrix has orthonormal columns (and orthonormal rows), that is, with Γ = (γ_1, ..., γ_p),

‖γ_i‖ = 1 and γ_i′γ_j = 0 if i ≠ j.   (6.59)

An orthogonal transformation of X is then

Y = ΓX   (6.60)

for some orthogonal matrix Γ. This transformation rotates X about zero, that is, the length of Y is the same as that of X, but its orientation is different, because

‖Γz‖^2 = z′Γ′Γz = z′z = ‖z‖^2.   (6.61)

The Jacobian is ±1:

|Γ′Γ| = |I_p| = 1 =⇒ |Γ|^2 = 1.   (6.62)

Thus in the bivariate Normal example above, if X ∼ N(0_2, I_2), then so is Y = ΓX.


6.2.6 Example: Polar coordinates

Suppose X = (X_1, X_2) has pdf f_X(x) for x ∈ X, and consider the transformation to polar coordinates

g(x_1, x_2) = (r, θ) = (‖x‖, Angle(x_1, x_2)), where ‖x‖ = \sqrt{x_1^2 + x_2^2}.   (6.63)

The Angle(x_1, x_2) is taken to be in [0, 2π), and is basically the arctan(x_2/x_1), except that that is not uniquely defined, e.g., if x_1 = x_2 = 1, arctan(1/1) = π/4 or 5π/4. What we really mean is the unique value θ in [0, 2π) for which

x_1 = r cos(θ) and x_2 = r sin(θ).   (6.64)

Another glitch is that θ is not uniquely defined when (x_1, x_2) = (0, 0), since then r = 0 and any θ would work. So we should assume that X does not contain (0, 0). Just to be technical, we also need the space of (r, θ) to be open (from Theorem 2), so we should just go ahead and assume x_2 ≠ 0 for (x_1, x_2) ∈ X. This requirement would not hurt anything because we are dealing with continuous random variables, but for simplicity we will ignore these niceties.

We can find the Jacobian without knowing the pdf or space of X, and the g^{−1} is already given in (6.64), hence

J_{g^{−1}}(r, θ) = \begin{vmatrix} \frac{\partial}{\partial r} r cos(θ) & \frac{\partial}{\partial θ} r cos(θ) \\ \frac{\partial}{\partial r} r sin(θ) & \frac{\partial}{\partial θ} r sin(θ) \end{vmatrix} = \begin{vmatrix} cos(θ) & −r sin(θ) \\ sin(θ) & r cos(θ) \end{vmatrix} = r cos(θ)^2 + r sin(θ)^2 = r.   (6.65)

You may recall this result from calculus, i.e.,

dx_1 dx_2 = r dr dθ.   (6.66)

The pdf of (r, θ) in any case is

f_Y(r, θ) = f_X(r cos(θ), r sin(θ)) r.   (6.67)

We next look at some examples.

Spherical symmetric pdf’s

A random variable X is spherically symmetric if for any orthogonal Γ,

X =D ΓX, (6.68)

meaning X and ΓX have the same distribution. For example, X ∼ N(02, I2) is spher-ically symmetric.


Suppose the 2 × 1 vector X is spherically symmetric and has a pdf f_X(x). Let \mathcal{R} be the space of R ≡ \|X\|. Then the space of X must include all circles with radii in \mathcal{R}, that is,

\mathcal{X} = \cup_{r∈\mathcal{R}} \{x \mid x_1^2 + x_2^2 = r^2\}. (6.69)

In addition, the pdf of X depends on x through only r, that is,

fX(x) = h(‖x‖) (6.70)

for some function h : [0, ∞) → [0, ∞). This h is not itself a pdf. (We can see that if Y = ΓX, then

fY(y) = fX(Γ′y) = h(‖Γ′y‖) = h(‖y‖) = fX(y), (6.71)

i.e., Y has the same pdf as X.) Now with y = (r, \mathrm{Angle}(x_1, x_2)), (6.69) shows that the space of Y is

\mathcal{Y} = \mathcal{R} \times [0, 2\pi), (6.72)

and (6.67) shows that

fY(r, θ) = fX(r cos(θ), r sin(θ))r = h(r)r, (6.73)

because (r\cos(θ))^2 + (r\sin(θ))^2 = r^2.

Now we can appeal to Lemma 8 about independence. This pdf can be written as a function of just r, i.e., h(r)r, times a function of just θ, i.e., the constant function 1, with c = 1. The space is a rectangle. Thus we know that θ has pdf c_2 for θ ∈ [0, 2π). That is,

θ ∼ Uniform[0, 2π), (6.74)

and c2 = 1/(2π). Then c1 = c/c2 = 2π, i.e., the marginal pdf of R is

fR(r) = 2πh(r)r. (6.75)

Applying this approach to X ∼ N(02, I2), we know from (6.49) that

h(r) = \frac{1}{2\pi} e^{-\frac{1}{2}r^2}, (6.76)

hence R has pdf

f_R(r) = r\, e^{-\frac{1}{2}r^2}. (6.77)

The Box-Muller transformation is an approach to generating two random normals from two random uniforms that reverses the above procedure. That is, suppose U_1 and U_2 are independent Uniform(0, 1). Then we can generate θ by setting

θ = 2πU1, (6.78)

and R by setting

R = F−1R (U2), (6.79)

where FR is the distribution function for the pdf in (6.77):

F_R(r) = \int_0^r u\, e^{-\frac{1}{2}u^2}\, du = -e^{-\frac{1}{2}u^2}\Big|_{u=0}^{r} = 1 - e^{-\frac{1}{2}r^2}. (6.80)


Inverting u2 = FR(r) yields

r = F_R^{-1}(u_2) = \sqrt{-2\log(1-u_2)}. (6.81)

Thus, as in (6.64), we set

X_1 = \sqrt{-2\log(1-U_2)}\,\cos(2\pi U_1) \quad \text{and} \quad X_2 = \sqrt{-2\log(1-U_2)}\,\sin(2\pi U_1), (6.82)

which are then independent N(0, 1)'s. (Usually one sees U_2 in place of the 1 − U_2 in the logs, but either way is fine because both are Uniform(0, 1).)
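Here is a short R sketch of the Box-Muller construction just described; the seed and sample size are illustrative.

# Box-Muller: turn two Uniform(0,1)'s into two independent N(0,1)'s, as in (6.82).
set.seed(1)
n  <- 10000
U1 <- runif(n)
U2 <- runif(n)
R     <- sqrt(-2 * log(1 - U2))   # R has pdf (6.77)
theta <- 2 * pi * U1              # theta ~ Uniform[0, 2*pi)
X1 <- R * cos(theta)
X2 <- R * sin(theta)
c(mean(X1), var(X1), mean(X2), var(X2), cor(X1, X2))  # approximately 0, 1, 0, 1, 0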

6.3 Order statistics

The order statistics for a sample x1, . . . , xn are the observations placed in orderfrom smallest to largest. They are usually designated with indices (i), so that theorder statistics are

x(1), x(2), . . . , x(n), (6.83)

where

x_{(1)} = smallest of x_1, . . . , x_n = min{x_1, . . . , x_n}
x_{(2)} = second smallest of x_1, . . . , x_n
  ⋮
x_{(n)} = largest of x_1, . . . , x_n = max{x_1, . . . , x_n}. (6.84)

For example, if the sample is 3.4, 2.5, 1.7, 5.2 then the order statistics are 1.7, 2.5, 3.4,5.2. If two observations have the same value, then that value appears twice, i.e., theorder statistics for 3.4, 1.7, 1.7, 5.2 are 1.7, 1.7, 3.4, 5.2, i.e., x(1) = x(2).

These statistics are useful as descriptive statistics, and in nonparametric inference. For example, estimates of the center of a distribution are often based on order statistics, such as the sample median, or the trimean, which is a linear combination of the two quartiles and the median.

We will deal with X1, . . . , Xn being iid, each with distribution function F, pdf f ,and space X0, so that the space of X = (X1, . . . , Xn) is X = X0 × · · · × X0. Then let

Y = (X(1), . . . , X(n)). (6.85)

We will assume that the Xi’s are distinct, that is, no two have the same value. Thisassumption is fine in the continuous case, because the probability that two observa-tions are equal is zero. In the discrete case, there may indeed be ties, and the analysisbecomes more difficult. Thus the space of Y is

\mathcal{Y} = \{(y_1, . . . , y_n) \mid y_i ∈ \mathcal{X}_0,\; y_1 < y_2 < \cdots < y_n\}. (6.86)

To find the pdf, start with y ∈ Y , and let δ > 0 be small enough so that the

intervals (y_1, y_1 + δ), (y_2, y_2 + δ), . . . , (y_n, y_n + δ) are disjoint. (So take δ less than all the gaps y_{i+1} − y_i.) Then

P[y1 < Y1 < y1 + δ, . . . , yn < Yn < yn + δ]

= P[y1 < X(1) < y1 + δ, . . . , yn < X(n) < yn + δ]. (6.87)


Now the event in the latter probability occurs when any permutation of the xi’s hasone component in the first interval, one in the second, etc. E.g., if n = 3,

P[y1 < X(1) < y1 + δ, y2 < X(2) < y2 + δ, y3 < X(3) < y3 + δ]

= P[y1 < X1 < y1 + δ, y2 < X2 < y2 + δ, y3 < X3 < y3 + δ]

+ P[y1 < X1 < y1 + δ, y2 < X3 < y2 + δ, y3 < X2 < y3 + δ]

+ P[y1 < X2 < y1 + δ, y2 < X1 < y2 + δ, y3 < X3 < y3 + δ]

+ P[y1 < X2 < y1 + δ, y2 < X3 < y2 + δ, y3 < X1 < y3 + δ]

+ P[y1 < X3 < y1 + δ, y2 < X1 < y2 + δ, y3 < X2 < y3 + δ]

+ P[y1 < X3 < y1 + δ, y2 < X2 < y2 + δ, y3 < X1 < y3 + δ]

= 6 P[y1 < X1 < y1 + δ, y2 < X2 < y2 + δ, y3 < X3 < y3 + δ]. (6.88)

The last equation follows because the xi’s are iid, hence the six individual probabili-ties are equal. In general, the number of permutations is n!. Thus, we can write

P[y_1 < Y_1 < y_1 + δ, . . . , y_n < Y_n < y_n + δ] = n!\, P[y_1 < X_1 < y_1 + δ, . . . , y_n < X_n < y_n + δ]
                                                   = n! \prod_{i=1}^n [F(y_i + δ) - F(y_i)]. (6.89)

Dividing by δ^n and letting δ → 0 yields the joint density, which is

f_Y(y) = n! \prod_{i=1}^n f(y_i), \quad y ∈ \mathcal{Y}. (6.90)

Marginal distributions of individual order statistics, or sets of them, can be ob-tained by integrating out the ones that are not desired. The process can be a bit tricky,and one must be careful with the spaces. Instead, we will present a representationthat leads to the marginals as well as other quantities.

We start with U_1, . . . , U_n being iid Uniform(0, 1), so that the pdf (6.90) of the order statistics (U_{(1)}, . . . , U_{(n)}) is simply n!. Consider the first order statistic plus the

gaps between consecutive order statistics:

G1 = U(1), G2 = U(2) −U(1), . . . , Gn = U(n) −U(n−1). (6.91)

These Gi’s are all positive, and they sum to U(n), which has range (0, 1). Thus the

space of G = (G1, . . . , Gn) is

\mathcal{G} = \{g ∈ \mathbb{R}^n \mid 0 < g_i < 1,\; i = 1, . . . , n,\; \text{and}\; g_1 + \cdots + g_n < 1\}. (6.92)

The inverse function to (6.91) is

u(1) = g1, u(2) = g1 + g2, . . . , u(n) = g1 + · · ·+ gn, (6.93)

so that the Jacobian is

\begin{vmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & 1 \end{vmatrix} = 1. (6.94)


Thus the pdf of G is also n!: f_G(g) = n!, g ∈ \mathcal{G}. (6.95)

This pdf is quite simple on its own, but note that it is a special case of the Dirichletin (6.41) with K = n + 1 and all αk = 1. Thus any order statistic is a beta, because itis the sum of the first few gaps, i.e.,

U_{(k)} = G_1 + \cdots + G_k ∼ Beta(k, n - k + 1), (6.96)

analogous to (6.36). In particular, if n is odd, then the median of the observations isU(n+1)/2, hence

Median\{U_1, . . . , U_n\} ∼ Beta\left(\frac{n+1}{2}, \frac{n+1}{2}\right), (6.97)

and by (6.30) and (6.31),

E[Median\{U_1, . . . , U_n\}] = \frac{1}{2} (6.98)

and

Var[Median\{U_1, . . . , U_n\}] = \frac{((n+1)/2)((n+1)/2)}{(n+1)^2(n+2)} = \frac{1}{4(n+2)}. (6.99)

To obtain the pdf of the order statistics of non-uniforms, we can use the probabilitytransform approach as in Example 5.1.3. That is, suppose the Xi’s are iid with (strictlyincreasing) distribution function F and pdf f . We then have that X has the same

distribution as (F−1(U1), . . . , F−1(Un)), where the Ui’s are iid Uni f orm(0, 1). Because

F, hence F−1, is increasing, the order statistics for the Xi’s match those of the Ui’s,that is,

(X(1), . . . , X(n)) =D (F−1(U(1)), . . . , F−1(U(n))). (6.100)

Thus for any particular k, X(k) =D F−1(U(k)). We know that U(k) ∼ Beta(k, n− k + 1),

hence can find the distribution of X(k) using the transformation with h(u) = F−1.

Thus h−1 = F, i.e.,

U(k) = h−1(X(k)) = F(X(k))⇒ Jh−1(x) = |F′(x)| = f (x). (6.101)

The pdf of X_{(k)} is then

f_{X_{(k)}}(x) = f_{U_{(k)}}(F(x))\, f(x) = \frac{\Gamma(n+1)}{\Gamma(k)\Gamma(n-k+1)}\, F(x)^{k-1}(1-F(x))^{n-k} f(x). (6.102)
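Here is a small R sketch checking by simulation that the k-th uniform order statistic has the Beta(k, n − k + 1) distribution of (6.96); n, k, the seed, and the number of replications are illustrative.

# Simulate the k-th order statistic of n Uniform(0,1)'s and compare its
# mean and variance with those of Beta(k, n - k + 1).
set.seed(1)
n <- 10; k <- 3
u_k <- replicate(10000, sort(runif(n))[k])
c(mean(u_k), k / (n + 1))                              # Beta mean k/(n+1)
c(var(u_k), k * (n - k + 1) / ((n + 1)^2 * (n + 2)))   # Beta variance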

6.3.1 Approximating the mean and variance

For most F and k, the pdf in (6.102) is not particularly easy to deal with analytically. If the variance of U_{(k)} is small, one can use the “∆-method” to approximate the mean and variance of X_{(k)}. This method is applied to a function h(U), where h is continuously differentiable at U = µ_U, the mean of U. We start with the one-term Taylor series expansion about U = µ_U:

h(U) = h(µ_U) + (U - µ_U)h'(µ^*), \quad \text{where } µ^* \text{ is between } U \text{ and } µ_U. (6.103)


Now if the variance of U is small, then it is likely that U and µU are fairly close, sothat µ∗ is close to µU , and if h′ is not very variable, h′(µ∗) ≈ h′(µU), so that

h(U) ≈ h(µU) + (U− µU)h′(µU). (6.104)

Because h(µU) and h′(µU) are constants, we have that

E[h(U)] ≈ E[h(µU) + (U− µU)h′(µU)] = h(µU), (6.105)

and

Var[h(U)] ≈ Var[h(µU) + (U − µU)h′(µU)] = (h′(µU))2 Var[U]. (6.106)

Later, in Section 16.5, we’ll give a more formal justification for this procedure.

To apply the ∆-method to an order statistic, take X_{(k)} = F^{-1}(U_{(k)}), i.e., h(u) = F^{-1}(u). Then one can show that

h'(u) = \frac{1}{f(F^{-1}(u))}. (6.107)

From (6.96), µ_U = k/(n+1), so that

E[X_{(k)}] ≈ F^{-1}\left(\frac{k}{n+1}\right), (6.108)

and

Var[X_{(k)}] ≈ \frac{1}{f\big(F^{-1}(\frac{k}{n+1})\big)^2}\, \frac{k(n-k+1)}{(n+1)^2(n+2)}. (6.109)

With n odd, the median then has

E[X_{(k)}] ≈ η \quad \text{and} \quad Var[X_{(k)}] ≈ \frac{1}{f(η)^2}\, \frac{1}{4(n+2)}, (6.110)

where η = F^{-1}(\frac{1}{2}) is the population median of X.

When the distribution of X is symmetric, and the mean exists, then the meanand median are the same. Thus both the sample mean and the sample median areestimates of the mean. In terms of estimation, the better estimate would have thesmaller variance. Here is a table comparing the variance of the sample mean (σ2/n)to the approximate variance of the sample median for some symmetric distributions:

Distribution            f(µ)        Var[X̄]      ≈ Var[Median]    ≈ Var[X̄]/Var[Median]
Normal(µ, 1)            1/√(2π)     1/n          π/(2(n+2))       2/π ≈ 0.6366
Cauchy                  1/π         ∞            π²/(4(n+2))      ∞
Double Exponential      1/2         2/n          1/(n+2)          2
Uniform(µ−1, µ+1)       1/2         1/(3n)       1/(n+2)          1/3
Logistic                1/4         π²/(3n)      4/(n+2)          π²/12 ≈ 0.8225
                                                                  (6.111)

The last column takes n→ ∞. So the mean is better for the Normal and Uniform, butthe median is better for the others, especially the Cauchy. Note that these approxi-mations are really for large n. Remember that for n = 1, the mean and median arethe same.
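A quick R sketch illustrating the first row of the table by simulation; the sample size n, the seed, and the number of replications are illustrative choices.

# For N(0,1) samples, compare the simulated variances of the sample mean and
# sample median with 1/n and the approximation pi/(2(n+2)).
set.seed(1)
n <- 101
sims <- replicate(10000, { x <- rnorm(n); c(mean(x), median(x)) })
c(var(sims[1, ]), 1 / n)               # sample mean:   simulated vs. 1/n
c(var(sims[2, ]), pi / (2 * (n + 2)))  # sample median: simulated vs. approximation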


Chapter 7

Conditional distributions

7.1 Introduction

A two-stage process is described in Example 2.4.3, where one first randomly choosesa coin from a population of coins, then flips it independently n = 10 times. There aretwo random variables in this experiment: X, the probability of heads for the chosencoin, and Y, the total number of heads among the n flips. It is given that X is equallylikely to be any number between 0 and 1, i.e.,

X ∼ Uniform(0, 1). (7.1)

Also, once the coin is chosen, Y is Binomial. If the chosen coin has X = x, then wesay that the conditional distribution of Y given X = x is Binomial(n, x), written

Y | X = x ∼ Binomial(n, x). (7.2)

Together the equations (7.1) and (7.2) describe the distribution of (X, Y).A couple of other distributions may be of interest. First, what is the marginal,

sometimes referred to in this context as unconditional, distribution of Y? It is notBinomial. It is the distribution arising from the entire two-stage procedure, not thatarising given a particular coin. The space is Y = 0, 1, . . . , n, as for the binomial, butthe pmf is different:

fY(y) = P[Y = y] 6= P[Y = y | X = x]. (7.3)

(That last expression is pronounced the probability that Y = y given X = x.)Also, one might wish to interchange the roles of X and Y, and ask for the con-

ditional distribution of X given Y = y for some y. This distribution is of particularinterest in Bayesian inference, as follows. One chooses a coin as before, and thenwishes to know its X. It is flipped ten times, and the number of heads observed,y, is used to guess what the X is. More precisely, one then finds the conditionaldistribution of X given Y = y:

X | Y = y ∼ ??? (7.4)

In Bayesian parlance, the marginal distribution of X in (7.1) is the prior distributionof X, because it is your best guess before seeing the data, and the conditional dis-tribution in (7.4) is the posterior distribution, determined after you have seen thedata.


These ideas extend to random vectors (X, Y). There are five distributions we con-sider, three of which we have seen before:

• Joint. The joint distribution of (X, Y) is the distribution of X and Y takentogether.

• Marginal. The two marginal distributions: that of X alone and that of Y alone.

• Conditional. The two conditional distributions: that of Y given X = x, and thatof X given Y = y.

The next section shows how to find the joint distribution from a conditional and marginal. Further sections look at finding the marginals and reverse conditional, the latter using Bayes Theorem. We end with independence, Y being independent of X if the conditional distribution of Y given X = x does not depend on x.

7.2 Examples of conditional distributions

When considering the conditional distribution of Y given X = x, it may or may notbe that the randomness of X is of interest, depending on the situation. In addition,there is no need for Y and X to be of the same type, e.g., in the coin example, X iscontinuous and Y is discrete. Next we look at some additional examples.

7.2.1 Simple linear model

The relationship of one variable to another is central to many statistical investigations.The simplest is a linear relationship,

Y = α + βX + E, (7.5)

Here, α and β are fixed, Y is the “dependent" variable, and X is the “explanatory"or “independent" variable. The E is error, needed because one does not expect thevariables to be exactly linearly related. Examples include X = Height and Y =Weight, or X = Dosage of a drug and Y some measure of health (cholesterol level,e.g.). The X could be a continuous variable, or an indicator function, e.g., be 0 or 1according to the sex of the subject.

The Normal linear model specifies that

Y | X = x ∼ N(α + βx, σ2e ). (7.6)

In particular,

E[Y | X = x] = α + βx and Var[Y | X = x] = σ2e . (7.7)

(Other models take (7.7) but do not assume normality, or allow Var[Y | X = x] todepend on x.) It may be that X is fixed by the experimenter, for example, the dosagex might be preset; or it may be that the X is truly random, e.g., the height of arandomly chosen person would be random. Often, this randomness of X is ignored,and analysis proceeds conditional on X = x. Other times, the randomness of X isalso incorporated into the analysis, e.g., one might have the marginal

X ∼ N(µX, σ2X). (7.8)


7.2.2 Finite mixture model

The population may consist of a finite number of distinct subpopulations, e.g., in assessing consumer ratings of cookies, there may be a subpopulation of people who like sweetness, and one of people who do not. With K subpopulations, X takes on the values {1, . . . , K}. These values are indices, not necessarily conveying any ordering. For a Normal mixture, the model is

Y | X = k ∼ N(µk, σ2k ). (7.9)

Generally there are no restrictions on the µk’s, but the σ2k ’s may be assumed equal.

Also, K may or may not be known. The marginal distribution for X is usually unre-stricted, i.e.,

fX(k) = pk, k = 1, . . . , K, (7.10)

the only restrictions being the necessary ones, i.e., the pk’s are positive and sum to 1.

7.2.3 Hierarchical models

Many experiments involve first randomly choosing a number of subjects from a population, then measuring a number of random variables on the chosen subjects. For example, one might randomly choose n third-grade classes from a city, then within each class administer a test to m randomly chosen students. Let X_i be the overall ability of class i, and Y_i the average performance on the test of the students chosen from class i. Then a possible hierarchical model is

X_1, . . . , X_n are iid ∼ N(µ, σ^2), and
Y_1, . . . , Y_n \mid X_1 = x_1, . . . , X_n = x_n are independent ∼ (N(x_1, τ^2), . . . , N(x_n, τ^2)). (7.11)

Here, µ and σ^2 are the mean and variance for the entire population of classes, while x_i is the mean for class i. Interest may center on the overall mean, so the city can obtain funding from the state, as well as on the individual classes chosen, so these classes can get special treats from the local school board.

7.2.4 Bayesian models

A statistical model typically depends on an unknown parameter vector θ, and the objective is to estimate the parameters, or some function of them, or test hypotheses about them. The Bayesian approach treats the data X and the parameter θ as both being random, hence having a joint distribution. The frequentist approach considers the parameters to be fixed but unknown. Both approaches use a model for X, which is a set of distributions indexed by the parameter, e.g., the X_i's are iid

N(µ, σ2), where θ = (µ, σ2). The Bayesian approach considers that model conditionalon θ = θ, and would write

X1, . . . , Xn | θ = θ (= (µ, σ2)) ∼ iid N(µ, σ2). (7.12)

Here, the bold θ is the random vector, and the regular θ is the particular value. Thento fully specify the model, the distribution of θ must be given,

θ ∼ π, (7.13)

for some prior distribution π. Once the data x is obtained, inference is based on the posterior distribution of θ | X = x.


7.3 Conditional + Marginal → Joint

We start with the conditional distribution of Y given X = x, and the marginal distri-bution of X, and find the joint distribution of (X, Y). Let X be the (marginal) spaceof X, and for each x ∈ X , let Yx be the conditional space of Y given X = x, that is,the space for the distribution of Y | X = x. Then the (joint) space of (X, Y) is

\mathcal{W} = \{(x, y) \mid x ∈ \mathcal{X}\; \&\; y ∈ \mathcal{Y}_x\}. (7.14)

(In the coin example, \mathcal{X} = (0, 1) and \mathcal{Y}_x = \{0, 1, . . . , n\}, so that in this case the conditional space of Y does not depend on x.)

Now for a function g(x, y), the conditional expectation of g(X, Y) given X = x is

denoted

e_g(x) = E[g(x, Y) \mid X = x], (7.15)

and is defined to be the expected value of the function g(x, Y) where x is fixed and Yhas the conditional distribution Y | X = x. If this conditional distribution has a pdf,say fY|X(y | x), then

e_g(x) = \int_{\mathcal{Y}_x} g(x, y) f_{Y|X}(y \mid x)\, dy. (7.16)

If f_{Y|X} is a pmf, then we have a summation instead of the integral. It is important to

realize that this conditional expectation is a function of x. In the coin example (7.2),with g(x, y) = y,

e_g(x) = E[Y \mid X = x] = E[Binomial(n, x)] = nx. (7.17)

The key to describing the joint distribution is to define the unconditional expectedvalue of g.

Definition 6. Given a function g : W → R,

E[g(X, Y)] = E[eg(X)], (7.18)

where eg(x) = E[g(x, Y) | X = x], if all the expected values exist.

So, to continue with the coin example, with g(x, y) = y,

E[Y] = E[e_g(X)] = E[nX] = n\, E[\mathrm{Uniform}(0, 1)] = \frac{n}{2}. (7.19)

Notice that this unconditional expected value does not depend on x, while the con-ditional expected value does, which is as it should be.

This definition yields the joint distribution P on W by looking at indicator func-tions g, as in Section 3.1.1. That is, take A ⊂ W , so that with P being the jointprobability distribution for (X, Y),

P[A] = E[IA(X, Y)]. (7.20)

Then the definition says that

P[A] = E[e_{I_A}(X)], \quad \text{where} \quad e_{I_A}(x) = E[I_A(x, Y) \mid X = x]. (7.21)

We should check that this P is in fact a legitimate probability distribution, and in turnyields the correct expected values. The latter result is proven in measure theory. That


it is a probability measure (see Section 2.1) is not hard to show using that IW ≡ 1,and if Ai’s are disjoint,

I_{\cup A_i}(x, y) = \sum I_{A_i}(x, y). (7.22)

The eIA(x) is the conditional probability of A given X = x, written

P[A | X = x] = P[(X, Y) ∈ A | X = x] = E[IA(x, Y) | X = x]. (7.23)

7.3.1 Joint densities

More useful is the pdf or pmf of the joint distribution, if it exists. Suppose that X isdiscrete, and the conditional distribution of Y given X = x is discrete for each x. Thepmf’s are

fX(x) = P[X = x] and fY|X(y | x) = P[Y = y | X = x]. (7.24)

Let f be the joint pdf of (X, Y), so that for each (x, y) ∈ W ,

f (x, y) = P[(X, Y) = (x, y)] = E[I(x,y)(X, Y)]. (7.25)

By (7.21),

E[I_{(x,y)}(X, Y)] = E[e_{I_{(x,y)}}(X)] = e_{I_{(x,y)}}(x) f_X(x), (7.26)

where

eI(x,y)(x) = E[I(x,y)(x, Y) | X = x] = 1× P[Y = y | X = x] = fY|X(y | x). (7.27)

Thus, f(x, y) = f_{Y|X}(y \mid x) f_X(x). (7.28)

In fact, this equation should not be especially surprising. It is a special case of thegeneral definition of conditional probability for sets A and B:

P[A \mid B] = \frac{P[A ∩ B]}{P[B]}, \quad \text{if } P[B] ≠ 0. (7.29)

If X has pdf fX(x) and Y | X = x has pdf fY|X(y | x) for each x, then the joint

distribution has a pdf f (x, y), again given by the formula in (7.28). To verify this

formula, for A ⊂ W,

P[A] = E[e_{I_A}(X)] = \int_{\mathcal{X}} e_{I_A}(x) f_X(x)\, dx. (7.30)

Now

e_{I_A}(x) = \int_{\mathcal{Y}_x} I_A(x, y) f_{Y|X}(y \mid x)\, dy = \int_{A_x} f_{Y|X}(y \mid x)\, dy, \quad \text{where } A_x = \{y ∈ \mathcal{Y}_x \mid (x, y) ∈ A\}. (7.31)

Using (7.31) in (7.30) yields

P[A] = \int_{\mathcal{X}} \int_{A_x} f_{Y|X}(y \mid x) f_X(x)\, dy\, dx, (7.32)


which means that the joint pdf is fY|X(y | x) fX(x).

What if X is discrete and Y is continuous, or vice versa? The joint formula (7.28)still works, but we have to interpret what the f (x, y) is. It is not a pmf or a pdf, but

has elements of both. As long as we remember to sum over the discrete parts andintegrate over the continuous parts, there should be no problems.

7.4 The marginal distribution of Y

There is no special trick in obtaining the marginal of Y given the conditional Y | X = xand the marginal of X; just find the joint and integrate out x. Thus

f_Y(y) = \int_{\mathcal{X}_y} f_{Y|X}(y \mid x) f_X(x)\, dx. (7.33)

7.4.1 Example: Coins and the Beta-Binomial

In the coin example,

f_Y(y) = \int_0^1 \binom{n}{y} x^y (1-x)^{n-y}\, dx
       = \binom{n}{y} \frac{\Gamma(y+1)\Gamma(n-y+1)}{\Gamma(n+2)}
       = \frac{n!}{y!(n-y)!}\, \frac{y!(n-y)!}{(n+1)!}
       = \frac{1}{n+1}, \quad y = 0, 1, . . . , n, (7.34)

which is the Discrete Uniform\{0, 1, . . . , n\}. Note that this distribution is not a binomial, and does not depend on x (it had better not!). In fact, this Y is a special case of the following.
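A short R sketch simulating the two-stage coin experiment; the number of replications and the seed are illustrative. The marginal of Y should look approximately discrete uniform on {0, 1, . . . , n}.

# Two-stage experiment: X ~ Uniform(0,1), then Y | X = x ~ Binomial(n, x).
set.seed(1)
n <- 10
x <- runif(100000)               # one coin per replication
y <- rbinom(length(x), n, x)     # flips given the chosen coin
round(table(y) / length(y), 3)   # each value 0, ..., 10 has probability about 1/11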

Definition 7. Suppose

Y | X = x ∼ Binomial(n, x) and X ∼ Beta(α, β). (7.35)

Then the marginal distribution of Y is beta-binomial with parameters α, β and n, written

Y ∼ Beta-Binomial(α, β, n). (7.36)

When α = β = 1, the X is uniform. Otherwise, as above, we can find the marginalpdf to be

f_Y(y) = \binom{n}{y} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(y+\alpha)\Gamma(n-y+\beta)}{\Gamma(n+\alpha+\beta)}, \quad y = 0, 1, . . . , n. (7.37)

7.4.2 Example: Simple normal linear model

For another example, take the linear model in (7.6) and (7.8),

Y \mid X = x ∼ N(α + βx, σ_e^2) \quad \text{and} \quad X ∼ N(µ_X, σ_X^2). (7.38)


We could write out the joint pdf and integrate, but instead we will find the mgf,which we can do in steps because it is an expected value. That is, with g(y) = ety,

MY(t) = E[etY] = E[eg(X)], where eg(x) = E[etY | X = x]. (7.39)

We know the mgf of a normal, which Y | X = x is, hence

e_g(x) = M_{N(α+βx,\, σ_e^2)}(t) = e^{(α+βx)t + σ_e^2 \frac{t^2}{2}}. (7.40)

The expected value of eg(X) can also be written as a normal mgf:

M_Y(t) = E[e_g(X)] = E[e^{(α+βX)t + σ_e^2 \frac{t^2}{2}}]
       = e^{αt + σ_e^2 \frac{t^2}{2}}\, E[e^{(βt)X}]
       = e^{αt + σ_e^2 \frac{t^2}{2}}\, M_{N(µ_X, σ_X^2)}(βt)
       = e^{αt + σ_e^2 \frac{t^2}{2}}\, e^{βtµ_X + σ_X^2 \frac{(βt)^2}{2}}
       = e^{(α+βµ_X)t + (σ_e^2 + σ_X^2β^2)\frac{t^2}{2}}
       = \text{mgf of } N(α + βµ_X,\; σ_e^2 + σ_X^2β^2). (7.41)

That is, marginally,

Y ∼ N(α + βµ_X,\; σ_e^2 + σ_X^2β^2). (7.42)

7.4.3 Marginal mean and variance

The marginal expected value of Y is just the expected value of the conditional ex-pected value:

E[Y] = E[eY(X)], where eY(x) = E[Y | X = x]. (7.43)

The marginal variance is not quite that simple. First, we need to define the conditionalvariance:

vY(x) ≡ Var[Y | X = x] = E[(Y− E[Y|X = x])2 | X = x]. (7.44)

It is the variance of the conditional distribution; just be sure to subtract the conditional

mean before squaring. To find the marginal variance, we use Var[Y] = E[Y^2] − E[Y]^2. Then

E[Y2] = E[eY2(X)], where eY2(x) = E[Y2 | X = x]. (7.45)

The conditional variance works like any variance, i.e.,

Var[Y | X = x] = E[Y2 | X = x]− E[Y | X = x]2, (7.46)

hence e_{Y^2}(x) = E[Y^2 \mid X = x] = v_Y(x) + e_Y(x)^2. (7.47)

Thus

Var[Y] = E[Y2]− E[Y]2

= E[eY2(X)]− (E[eY(X)])2

= E[vY(X)] + E[eY(X)2]− (E[eY(X)])2

= E[vY(X)] + Var[eY(X)]. (7.48)

To summarize:


The unconditional expected value is the expected value of the conditional expectedvalue.

The unconditional variance is the expected value of the conditional variance plusthe variance of the conditional expected value.

(The second sentence is very analogous to what happens in regression, where thetotal sum-of-squares equals the regression sum-of-squares plus the residual sum ofsquares.)

These are handy results. For example, in the beta-binomial (Definition 7), finding the mean and variance using the pdf (7.37) can be challenging. But using the conditional approach is much easier. Because conditionally Y is Binomial(n, x),

eY(x) = nx and vY(x) = nx(1− x). (7.49)

Then because X ∼ Beta(α, β),

E[Y] = n\,E[X] = \frac{nα}{α+β}, (7.50)

and

Var[Y] = E[v_Y(X)] + Var[e_Y(X)]
       = E[nX(1-X)] + Var[nX]
       = n\frac{αβ}{(α+β)(α+β+1)} + n^2\frac{αβ}{(α+β)^2(α+β+1)}
       = \frac{nαβ(α+β+n)}{(α+β)^2(α+β+1)}. (7.51)

These expressions give some insight into the beta-binomial. Like the binomial, thebeta-binomial counts the number of successes in n trials, and has expected value npfor p = α/(α + β). Consider the variances of a binomial and beta-binomial:

Var[Binomial] = np(1-p) \quad \text{and} \quad Var[\text{Beta-Binomial}] = np(1-p)\,\frac{α+β+n}{α+β+1}. (7.52)

Thus the beta-binomial has a larger variance, so it can be used to model situations in which the data are more dispersed than the binomial, e.g., if the n trials are n offspring in the same litter, and success is survival. The larger α + β, the closer the beta-binomial is to the binomial.
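As a check on (7.50) and (7.51), here is a small R simulation; α, β, n, the seed, and the number of replications are illustrative values.

# Simulate the beta-binomial as a two-stage experiment and compare the
# simulated mean and variance with the formulas (7.50) and (7.51).
set.seed(1)
n <- 10; alpha <- 2; beta <- 3
x <- rbeta(100000, alpha, beta)    # X ~ Beta(alpha, beta)
y <- rbinom(length(x), n, x)       # Y | X = x ~ Binomial(n, x)
c(mean(y), n * alpha / (alpha + beta))
c(var(y), n * alpha * beta * (alpha + beta + n) /
          ((alpha + beta)^2 * (alpha + beta + 1)))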

You might wish to check the mean and variance in the normal example 7.4.2.The same procedure works for vectors:

E[Y] = E[eY(X)] where eY(x) = E[Y | X = x], (7.53)

and

Cov[Y] = E[vY(X)] + Cov[eY(X)], where vY(x) = Cov[Y | X = x]. (7.54)

In particular, considering a single covariance,

Cov[Y_1, Y_2] = E[c_{Y_1,Y_2}(X)] + Cov[e_{Y_1}(X), e_{Y_2}(X)], \quad \text{where} \quad c_{Y_1,Y_2}(x) = Cov[Y_1, Y_2 \mid X = x]. (7.55)


7.4.4 Example: Fruit flies, continued

Recall the fruit fly Example 1.1.1 from Chapter 1. There are n = 10 fathers, andeach father has two offspring. Each father’s genotype is either (TL, TL), (TL, CU), or(CU, CU). We can represent the genotype by counting the number of CU’s, letting

Xi be the number (either 0, 1, or 2) of CU’s for the ith father. The mothers are all(TL, TL), so each offspring receives a TL from the mother, and randomly one of the

alleles from the father. Let Yij be the indicator of whether the jth offspring of father ireceives a CU from the father. That is,

Y_{ij} = \begin{cases} 0 & \text{if offspring } j \text{ from father } i \text{ is (TL, TL)} \\ 1 & \text{if offspring } j \text{ from father } i \text{ is (TL, CU)}. \end{cases} (7.56)

Then each “family" has three random variables, (Xi, Yi1, Yi2). We will assume thatthese triples are independent, in fact,

(X1, Y11, Y12), . . . , (Xn, Yn1, Yn2) are iid. (7.57)

To specify the distribution of each triple, we use the assumptions mentioned inthe original Example 1.1.1. The proportion of CU’s in the population is θ, and theassumptions imply that the number of CU’s for father i is

Xi ∼ Binomial(2, θ), (7.58)

because each father in effect randomly chooses two alleles from the population. Next, we specify the conditional distribution of the offspring given the father, i.e., of (Y_{i1}, Y_{i2}) | X_i = x_i. If x_i = 0, then the father is (TL, TL), so the offspring will all receive a TL from the father:

P[(Yi1, Yi2) = (0, 0) | Xi = 0] = 1. (7.59)

Similarly, if x_i = 2, the father is (CU, CU), so the offspring will all receive a CU from the father:

P[(Yi1, Yi2) = (1, 1) | Xi = 2] = 1. (7.60)

Finally, if xi = 1, the father is (TL, CU), which means each offspring has a 50-50chance of receiving a CU from the father:

P[(Y_{i1}, Y_{i2}) = (y_1, y_2) \mid X_i = 1] = \frac{1}{4} \quad \text{for } (y_1, y_2) ∈ \{(0, 0), (0, 1), (1, 0), (1, 1)\}. (7.61)

This conditional distribution can be written more compactly by noting that xi/2 isthe chance that an offspring receives a CU, so that

(Yi1, Yi2) | Xi = xi ∼ iid Bernoulli(xi/2), (7.62)

which using (5.60) yields the conditional pmf

f_{Y|X}(y_{i1}, y_{i2} \mid x_i) = \left(\frac{x_i}{2}\right)^{y_{i1}+y_{i2}} \left(1 - \frac{x_i}{2}\right)^{2-y_{i1}-y_{i2}}. (7.63)

The goal of the experiment is to estimate θ, but without knowing the X_i's. Thus the estimation has to be based on just the Y_{ij}'s. The marginal means are easy to find:

E[Y_{ij}] = E[e_{Y_{ij}}(X_i)], \quad \text{where} \quad e_{Y_{ij}}(x_i) = E[Y_{ij} \mid X_i = x_i] = \frac{x_i}{2}, (7.64)


because conditionally Yij is Bernoulli. Then E[Xi] = 2θ, hence

E[Y_{ij}] = \frac{2θ}{2} = θ. (7.65)

Nice! Then an obvious estimator of θ is the sample mean of all the Yij’s, of whichthere are 2n = 20:

\hat{θ} = \frac{\sum_{i=1}^n \sum_{j=1}^2 Y_{ij}}{2n}. (7.66)

The data are in Table 1.1. To find the estimate, we just count the number of CU’s,which is 8, hence the estimate of θ is 0.4.

What is the variance of the estimate? Are Yi1 and Yi2 unconditionally independent?What is Var[Yij]? What is the marginal pmf of (Yi1, Yi2)?

7.5 The conditional from the joint

Often one has a joint distribution, but is primarily interested in the conditional, e.g.,many experiments involve collecting health data from a population, and interest cen-ters on the conditional distribution of certain outcomes, such as longevity, condi-tional on other variables, such as sex, age, cholesterol level, activity level, etc. InBayesian inference, one can find the joint distribution of the data and the parameter,and from that needs to find the conditional distribution of the parameter given thedata. Measure theory guarantees that for any joint distribution of (X, Y), there ex-ists a conditional distribution of Y | X = x for each x ∈ X . It may not be unique,but any conditional distribution that combines with the marginal of X to yield theoriginal joint distribution is a valid conditional distribution. If densities exists, thenfY|X(y | x) is a valid conditional density if (7.28) holds:

f (x, y) = fY|X(y | x) fX(x) (7.67)

for all (x, y) ∈ W . Thus, given a joint density f , one can integrate out y to obtain the

marginal fX, then define the conditional by

f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{\text{Joint}}{\text{Marginal}} \quad \text{if } f_X(x) > 0. (7.68)

It does not matter how the conditional density is defined when fX(x) = 0, as long asit is a density on Yx, because in reconstructing the joint f in (7.67), it is multiplied by0. Also, the conditional of X given Y = y is the ratio of the joint to the marginal of Y.

This formula works for pdf’s, pmf’s, and the mixed kind.

7.5.1 Example: Coin

In the coin Example 2.4.3, the joint density of (X, Y) is

f(x, y) = \binom{n}{y} x^y (1-x)^{n-y}, (7.69)

and the marginal distribution of Y, the number of heads, is, as in (7.34),

f_Y(y) = \frac{1}{n+1}, \quad y = 0, . . . , n. (7.70)


Thus the conditional posterior distribution of X, the chance of heads, given Y, the number of heads, is

f_{X|Y}(x \mid y) = \frac{f(x, y)}{f_Y(y)} = (n+1)\binom{n}{y} x^y (1-x)^{n-y}
                  = \frac{(n+1)!}{y!(n-y)!}\, x^y (1-x)^{n-y}
                  = \frac{\Gamma(n+2)}{\Gamma(y+1)\Gamma(n-y+1)}\, x^y (1-x)^{n-y}
                  = \text{the Beta}(y+1, n-y+1) \text{ pdf}. (7.71)

For example, if the experiment yields Y = 3 heads, then one’s guess of what the prob-ability of heads is for this particular coin is described by the Beta(4, 8) distribution.So, for example, the best guess could be the posterior mean,

E[X \mid Y = 3] = E[\text{Beta}(4, 8)] = \frac{4}{4+8} = \frac{1}{3}. (7.72)

Note that this is not the sample proportion of heads, 0.3, although it is close. Theposterior mode and posterior median are also reasonable point estimates. A moreinformative quantity might be a probability interval:

P[0.1093 < X < 0.6097 | Y = 3] = 95%. (7.73)

So there is a 95% chance that the chance of heads is somewhere between 0.11 and0.61. (These numbers were found using the qbeta function in R.) It is not a very tightinterval, but there is not much information in just ten flips.
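The posterior summaries above can be reproduced in a couple of lines of R (the 2.5% and 97.5% quantiles give the central 95% probability interval):

# Posterior for Y = 3 heads with a uniform prior: X | Y = 3 ~ Beta(4, 8), as in (7.71).
4 / (4 + 8)                    # posterior mean, 1/3
qbeta(c(0.025, 0.975), 4, 8)   # about (0.109, 0.610), matching (7.73)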

7.5.2 Bivariate normal

If one can recognize the form of a particular density for Y within a joint density, then it becomes unnecessary to explicitly find the marginal of X and divide the joint by the marginal. For example, the N(µ, σ^2) density can be written

φ(z; µ, σ^2) = \frac{1}{\sqrt{2π}\,σ}\, e^{-\frac{1}{2}\frac{(z-µ)^2}{σ^2}} = c(µ, σ^2)\, e^{z\frac{µ}{σ^2} - z^2\frac{1}{2σ^2}}. (7.74)

That is, we factor the pdf into a constant we do not care about at the moment, thatdepends on the fixed parameters, and the important component containing all thez-action.

Now consider the bivariate normal,

\begin{pmatrix} X \\ Y \end{pmatrix} ∼ N(µ, Σ), \quad \text{where} \quad µ = \begin{pmatrix} µ_X \\ µ_Y \end{pmatrix} \quad \text{and} \quad Σ = \begin{pmatrix} σ_X^2 & σ_{XY} \\ σ_{XY} & σ_Y^2 \end{pmatrix}. (7.75)

Assuming Σ is invertible, the joint pdf is (as in (6.53))

f(x, y) = \frac{1}{2π\sqrt{|Σ|}}\, e^{-\frac{1}{2}((x,y)'-µ)'Σ^{-1}((x,y)'-µ)}. (7.76)


The conditional pdf of Y | X = x can be written

f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = c(x, µ, Σ) \times \;?? , (7.77)

where ?? is a function of y that, we hope, is the core of a recognizable density. The c contains the f_X(x), plus other factors, so we do not need to explicitly find the marginal. To try to isolate the parts depending on y, take part of the exponent:

((x, y)' - µ)'Σ^{-1}((x, y)' - µ) = (x, y)Σ^{-1}(x, y)' - 2(x, y)Σ^{-1}µ + µ'Σ^{-1}µ. (7.78)

Recall that

Σ^{-1} = \frac{1}{|Σ|}\begin{pmatrix} σ_Y^2 & -σ_{XY} \\ -σ_{XY} & σ_X^2 \end{pmatrix}. (7.79)

Now the expression in (7.78), considered as a function of y, is quadratic, i.e., a + by +cy2, where a, b, and c depend on µ, Σ and x. We do not care about the a, but the

linear and quadratic terms are important. Some algebra yields

(x, y)Σ^{-1}(x, y)' = -2\frac{xσ_{XY}}{|Σ|}\,y + \frac{σ_X^2}{|Σ|}\,y^2 + \cdots, \quad \text{and}
-2(x, y)Σ^{-1}µ = -2\frac{-µ_Xσ_{XY} + µ_Yσ_X^2}{|Σ|}\,y + \cdots. (7.80)

Then

f_{Y|X}(y \mid x) = c(x, µ, Σ)\, e^{Q}, (7.81)

where

Q = -\frac{1}{2}\left(-2\frac{xσ_{XY}}{|Σ|}\,y + \frac{σ_X^2}{|Σ|}\,y^2 - 2\frac{-µ_Xσ_{XY} + µ_Yσ_X^2}{|Σ|}\,y\right)
  = \frac{σ_{XY}(x-µ_X) + µ_Yσ_X^2}{|Σ|}\,y - \frac{1}{2}\,\frac{σ_X^2}{|Σ|}\,y^2. (7.82)

Now compare the Q in (7.82) to the exponent in (7.74). We see that f_{Y|X}(y | x) is of the same form, with

\frac{1}{σ^2} = \frac{σ_X^2}{|Σ|} \quad \text{and} \quad \frac{µ}{σ^2} = \frac{(x-µ_X)σ_{XY} + µ_Yσ_X^2}{|Σ|}. (7.83)

Thus we just need to solve for µ and σ^2, and will then have that Y | X = x ∼ N(µ, σ^2). To proceed,

σ^2 = \frac{|Σ|}{σ_X^2} = \frac{σ_X^2σ_Y^2 - σ_{XY}^2}{σ_X^2} = σ_Y^2 - \frac{σ_{XY}^2}{σ_X^2}, (7.84)

and

µ = \frac{(x-µ_X)σ_{XY} + µ_Yσ_X^2}{σ_X^2} = µ_Y + \frac{σ_{XY}}{σ_X^2}(x-µ_X). (7.85)

That is,

Y \mid X = x ∼ N\left(µ_Y + \frac{σ_{XY}}{σ_X^2}(x-µ_X),\; σ_Y^2 - \frac{σ_{XY}^2}{σ_X^2}\right). (7.86)


Because we know the normal pdf, we could work backwards to find the c in (7.77),but there is no need to do that.

This is a normal linear model, as in (7.6), where

Y |X = x ∼ N(α + βx, σ2e ). (7.87)

We can make the identifications:

β = \frac{σ_{XY}}{σ_X^2}, \quad α = µ_Y - βµ_X, \quad \text{and} \quad σ_e^2 = σ_Y^2 - \frac{σ_{XY}^2}{σ_X^2}. (7.88)

These equations should be familiar from linear regression.
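A minimal R sketch of the identifications in (7.88), using illustrative (assumed) values for the mean vector and covariance matrix:

# Regression parameters of Y | X = x from an assumed bivariate normal (mu, Sigma).
muX <- 1; muY <- 2
sXX <- 4; sYY <- 3; sXY <- 1.5        # entries of Sigma (illustrative values)
beta    <- sXY / sXX                  # slope
alpha   <- muY - beta * muX           # intercept
sigma2e <- sYY - sXY^2 / sXX          # conditional (error) variance
c(alpha, beta, sigma2e)
# So Y | X = x ~ N(alpha + beta * x, sigma2e) for this choice of parameters.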

7.6 Bayes Theorem: Reversing the conditionals

We have already essentially derived Bayes Theorem, but it is important enough to deserve its own section. The theorem takes the conditional density of Y given X and the marginal distribution of X, and produces the conditional density of X given Y. It uses the formula conditional = joint/marginal, where the marginal of Y is found by integrating the x out of the joint.

Theorem 3 (Bayes Theorem). Suppose Y | X = x has density f_{Y|X}(y | x) and X has marginal density f_X(x). Then

f_{X|Y}(x \mid y) = \frac{f_{Y|X}(y \mid x) f_X(x)}{\int_{\mathcal{X}_y} f_{Y|X}(y \mid x) f_X(x)\, dx}. (7.89)

The integral will be a summation if X is discrete.

Proof. With f(x, y) being the joint density,

f_{X|Y}(x \mid y) = \frac{f(x, y)}{f_Y(y)} = \frac{f_{Y|X}(y \mid x) f_X(x)}{\int_{\mathcal{X}_y} f(x, y)\, dx} = \frac{f_{Y|X}(y \mid x) f_X(x)}{\int_{\mathcal{X}_y} f_{Y|X}(y \mid x) f_X(x)\, dx}. \quad \Box (7.90)

Bayes Theorem is often used with sets. Let A ⊂ \mathcal{X} and B_1, . . . , B_K be a partition of \mathcal{X}, i.e.,

B_i ∩ B_j = ∅ \text{ for } i ≠ j, \quad \text{and} \quad \cup_{k=1}^K B_k = \mathcal{X}. (7.91)

Then

P[B_k \mid A] = \frac{P[A \mid B_k]\, P[B_k]}{\sum_{l=1}^K P[A \mid B_l]\, P[B_l]}. (7.92)


7.6.1 Example: AIDS virus

A common illustration of Bayes Theorem involves testing for some medical condition,e.g., a blood test for the AIDS virus. Suppose the test is 99% accurate. If a randomperson’s test is positive, does that mean the person is 99% sure of having the virus?

Let A+ = “test is positive", A− = “test is negative", Bv =“person has the virus"and Bn = “person does not have the virus." Then we know the conditionals

P[A+ | Bv] = 0.99 and P[A− | Bn] = 0.99, (7.93)

but they are not of interest. We want to know the reverse conditional, P[Bv | A+],the chance of having the virus given the test is positive. There is no way to figurethis probability out without the marginal of B, that is, the marginal chance a randomperson has the virus. Let us say that P[Bv] = 1/10, 000. Now we can use BayesTheorem (7.92)

P[B_v \mid A_+] = \frac{P[A_+ \mid B_v]\, P[B_v]}{P[A_+ \mid B_v]\, P[B_v] + P[A_+ \mid B_n]\, P[B_n]}
               = \frac{0.99 × \frac{1}{10000}}{0.99 × \frac{1}{10000} + 0.01 × \frac{9999}{10000}}
               ≈ 0.0098. (7.94)

Thus the chance of having the virus, given the test is positive, is only about 1/100.That is lower than one might expect, but it is substantially higher than the overallchance of 1/10000. (This example is a bit simplistic in that random people do nottake the test, but more likely people who think they may be at risk.)
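The calculation in (7.94) takes one line of R:

# Bayes theorem for the screening-test example.
sens  <- 0.99        # P[positive | virus]
spec  <- 0.99        # P[negative | no virus]
prior <- 1 / 10000   # P[virus]
sens * prior / (sens * prior + (1 - spec) * (1 - prior))   # about 0.0098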

7.6.2 Example: Beta posterior for the binomial

The coin Example 7.5.1 can be generalized by using a beta in place of the uniform. In aBayesian framework, we suppose the probability of success, θ, has a prior distributionBeta(α, β), and Y | θ = θ is Binomial(n, θ). (So now x has become θ.) The prior issupposed to represent knowledge or belief about what the θ is before seeing the dataY. To find the posterior, or what we are to think after seeing the data Y = y, we needthe conditional distribution of θ given Y = y.

The joint density is

f(θ, y) = f_{Y|θ}(y \mid θ) f_θ(θ) = \binom{n}{y} θ^y (1-θ)^{n-y}\, \frac{1}{B(α, β)}\, θ^{α-1}(1-θ)^{β-1} = c(y, α, β)\, θ^{y+α-1}(1-θ)^{n-y+β-1}. (7.95)

Because we are interested in the pdf of θ, we put everything not depending on θ inthe constant. But the part that does depend on θ is the meat of a Beta(α + y, β + n− y)density, hence that is the posterior:

θ | Y = y ∼ Beta(α + y, β + n− y). (7.96)

7.7 Conditionals and independence

If X and Y are independent, then it makes sense that the distribution of one does notdepend on the value of the other, which is true.


Lemma 9. The random vectors X and Y are independent if and only if the conditional

distribution 1 of Y given X = x does not depend on x.

When there are densities, this result follows directly:

Independence ⟹ f(x, y) = f_X(x) f_Y(y) and \mathcal{W} = \mathcal{X} × \mathcal{Y}
             ⟹ f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = f_Y(y) and \mathcal{Y}_x = \mathcal{Y}, (7.97)

which does not depend on x. The other way, if the conditional distribution of Y given X = x does not depend on x, then f_{Y|X}(y | x) = f_Y(y) and \mathcal{Y}_x = \mathcal{Y}, hence

f(x, y) = f_{Y|X}(y \mid x) f_X(x) = f_Y(y) f_X(x) and \mathcal{W} = \mathcal{X} × \mathcal{Y}. (7.98)

7.7.1 Example: Independence of residuals and X

Suppose (X, Y)′ is bivariate normal as in (7.75),

\begin{pmatrix} X \\ Y \end{pmatrix} ∼ N\left(\begin{pmatrix} µ_X \\ µ_Y \end{pmatrix}, \begin{pmatrix} σ_X^2 & σ_{XY} \\ σ_{XY} & σ_Y^2 \end{pmatrix}\right). (7.99)

We then have that

Y \mid X = x ∼ N(α + βx, σ_e^2), (7.100)

where the parameters are given in (7.88). The residual is Y − α − βX. What is itsconditional distribution? First, for fixed x,

Y− α− βX | X = x =D Y− α− βx | X = x. (7.101)

This equation means that when we are conditioning on X = x, the conditional dis-tribution stays the same if we fix X = x, which follows from the original definitionof conditional expected value in (7.15). We know that subtracting the mean from anormal leaves a normal with mean 0 and the same variance, hence

Y− α− βx | X = x ∼ N(0, σ2e ). (7.102)

But the right-hand side has no x, hence Y − α − βX is independent of X, and has

marginal distribution N(0, σ2e ).

1Measure-theoretically, we technically mean that “a version of the conditional distribution ..."


Chapter 8

The Multivariate Normal Distribution

Almost all data are multivariate, that is, entail more than one variable. There are twogeneral-purpose multivariate models: the multivariate normal for continuous data,and the multinomial for categorical data. There are many specialized multivariatedistributions, but these two are the only ones that are used in all areas of statistics. Wehave seen the bivariate normal in Example 6.2.4, and the multinomial is introducedin Example 3.4.3. This chapter focusses on the multivariate normal.

8.1 Definition

There are not very many commonly used multivariate distributions to model data.The multivariate normal is by far the most common, at least for continuous data.Which is not to say that all data are distributed normally, nor that all techniquesassume such. Rather, one either assumes normality, or makes few assumptions atall and relies on asymptotic results. Some of the nice properties of the multivariatenormal:

• It is completely determined by the means, variances, and covariances.

• Elements are independent if and only if they are uncorrelated.

• Marginals of multivariate normals are multivariate normal.

• An affine transformation of a multivariate normal is multivariate normal.

• Conditionals of a multivariate normal are multivariate normal.

• The sample mean is independent of the sample covariance matrix.

The multivariate normal arises from iid standard normals, that is, iid N(0, 1)’s.Suppose Z = (Z1, . . . , ZM)′ is an M × 1 vector of iid N(0, 1)’s. Because they areindependent, all the covariances are zero, so that

E[Z] = 0M and Cov[Z] = IM. (8.1)

A general multivariate normal distribution can have any (legitimate) mean andcovariance, achieved through the use of affine transformations. Here is the definition.


Definition 8. Let the p× 1 vector Y be defined by

Y = µ + BZ, (8.2)

where B is p × M, µ is p × 1, and Z is an M × 1 vector of iid standard normals. Then Y is

multivariate normal with mean µ and covariance matrix Σ ≡ BB′, written

Y ∼ N(µ, Σ). (8.3)

From (8.1) and (8.2), the usual affine transformation results from Section 3.5 show that E[Y] = µ and Cov[Y] = BB′ = Σ. The definition goes further, implying that

the distribution depends on B only through the BB′. For example, consider the twomatrices

B_1 = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 2 \end{pmatrix} \quad \text{and} \quad B_2 = \begin{pmatrix} 3/\sqrt{5} & 1/\sqrt{5} \\ 0 & \sqrt{5} \end{pmatrix}, (8.4)

so that

B_1B_1' = B_2B_2' = \begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix} ≡ Σ. (8.5)

Thus the definition says that both

Y_1 = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \end{pmatrix} \quad \text{and} \quad Y_2 = \begin{pmatrix} 3/\sqrt{5} & 1/\sqrt{5} \\ 0 & \sqrt{5} \end{pmatrix}\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} (8.6)

are N(02, Σ). They clearly have the same mean and covariance matrix, but it is notobvious they have the exact same distribution, especially as they depend on differentnumbers of Zi’s. To see the distributions are the same, we have to look at the mgf.We have already found the mgf for the bivariate (p = 2) normal in (6.57), and theproof here is the same. The answer is

M_Y(t) = \exp\left(µ't + \frac{1}{2}t'Σt\right), \quad t ∈ \mathbb{R}^p. (8.7)

Thus the distribution of Y depends on just µ and Σ, so that because Y_1 and Y_2 have the same B_iB_i' (and the same mean), they have the same distribution.
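A quick R check that the two matrices in (8.4) really do share the same BB′:

# The two square roots B1 and B2 of Sigma from (8.4)-(8.5).
B1 <- matrix(c(1, 0, 1, 1, 0, 2), nrow = 2)                   # 2 x 3
B2 <- matrix(c(3/sqrt(5), 0, 1/sqrt(5), sqrt(5)), nrow = 2)   # 2 x 2
B1 %*% t(B1)   # both products equal
B2 %*% t(B2)   # matrix(c(2, 1, 1, 5), 2, 2)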

Can µ and Σ be anything, or are there restrictions? Any µ is possible, since it can be anything in R^p. The covariance matrix Σ can be BB′ for any p × M matrix B. Note that M is arbitrary, too. Clearly, Σ must be symmetric, but we already knew that. It must also be nonnegative definite, which we define now.

Definition 9. A symmetric p× p matrix B is nonnegative definite if

c′Bc ≥ 0 for all p× 1 vectors c. (8.8)

Also, B is positive definite if

c′Bc > 0 for all p× 1 vectors c 6= 0. (8.9)


Note that c′BB′c = ‖c′B‖2 ≥ 0, which means that Σ must be nonnegative definite.But from (3.103),

c′Σc = Cov(c′Y) = Var(c′Y) ≥ 0, (8.10)

because all variances are nonnegative. That is, any covariance matrix has to be non-negative definite, not just multivariate normal ones.

So we know that Σ must be symmetric and nonnegative definite. Are there anyother restrictions, or for any symmetric nonnegative definite matrix is there a corre-sponding B? In fact, there are potentially many such square roots B. A nice one is thesymmetric square root available because Σ is symmetric. See the next subsection.

We conclude that the multivariate normal distribution is defined for any (µ, Σ), where µ ∈ R^p and Σ is symmetric and nonnegative definite, i.e., Σ can be any valid covariance matrix.

8.1.1 The spectral decomposition

To derive a symmetric square root, as well as perform other useful tasks, we need the following decomposition.

Theorem 4. The Spectral Decomposition Theorem for Symmetric Matrices. If Ω is asymmetric p × p matrix, then there exists a p × p orthogonal (6.58) matrix Γ and a p× pdiagonal matrix Λ with diagonals λ1 ≥ λ2 ≥ · · · ≥ λp such that

Ω = ΓΛΓ′. (8.11)

Write Γ = (γ_1, . . . , γ_p), so that the γ_i's are the columns of Γ. Then

Ω = \sum_{i=1}^p λ_i γ_i γ_i'. (8.12)

Because the γ_i's are orthogonal, and each has length 1,

Ωγ_i = λ_i γ_i, (8.13)

which means γ_i is an eigenvector of Ω, with corresponding eigenvalue λ_i. The eigenvalues are unique, but the eigenvectors will not all be unless the eigenvalues are distinct. For example, the identity matrix I_p has all eigenvalues 1, and any vector γ with length 1 is an eigenvector.

Here are some handy facts about symmetric Ω and its spectral decomposition.

• Ω is positive definite if and only if all λi’s are positive, and nonnegative definiteif and only if all λi’s are nonnegative.

• The trace and determinant are, respectively,

trace(Ω) = ∑ λi and |Ω| = ∏ λi. (8.14)

(The trace of a square matrix is the sum of its diagonals.)

• Ω is invertible if and only if its eigenvalues are nonzero, in which case itsinverse is

Ω−1 = ΓΛ−1Γ′. (8.15)

Thus the inverse has the same eigenvectors, and eigenvalues 1/λ_i.


• If Ω is nonnegative definite, then with Λ^{1/2} being the diagonal matrix with diagonal elements √λ_i,

Ω^{1/2} = ΓΛ^{1/2}Γ' (8.16)

is a symmetric square root of Ω, that is, it is symmetric and Ω^{1/2}Ω^{1/2} = Ω.

The last item was used in the previous section to guarantee that any covariancematrix has a square root.
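In R, the spectral decomposition is available via eigen(), so a symmetric square root as in (8.16) can be computed directly; the matrix below is an illustrative choice.

# Symmetric square root via the spectral decomposition (8.16).
Omega <- matrix(c(2, 1, 1, 5), 2, 2)      # e.g., the Sigma from (8.5)
ev <- eigen(Omega, symmetric = TRUE)      # Gamma = ev$vectors, Lambda = diag(ev$values)
Omega_half <- ev$vectors %*% diag(sqrt(ev$values)) %*% t(ev$vectors)
Omega_half %*% Omega_half                 # recovers Omega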

8.2 Some properties of the multivariate normal

We will prove the next few properties using the representation (8.2). They can alsobe easily shown using the mgf.

8.2.1 Affine transformations

Affine transformations of multivariate normals are also multivariate normal, becauseany affine transformation of a multivariate normal vector is an affine transformationof an affine transformation of a standard normal vector, and an affine transformationof an affine transformation is also an affine transformation. That is, suppose Y ∼Np(µ, Σ), and W = c + DY for q× p matrix D and q× 1 vector c. Then we know that

for some B with BB′ = Σ, Y = µ + BZ, where Z is a vector of iid standard normals.

Hence

W = c + DY = c + D(µ + BZ) = c + Dµ + DBZ. (8.17)

Then by Definition 8,

W ∼ N(c + Dµ, DBB′D′) = N(c + Dµ, DΣD′). (8.18)

Of course, the mean and covariance result we already knew.

8.2.2 Marginals

Because marginals are special cases of affine transformations, marginals of multivariate normals are also multivariate normal. One needs just to pick off the appropriate means and covariances. So if Y = (Y_1, . . . , Y_5)' is N_5(µ, Σ), and W = (Y_2, Y_5)', then

W ∼ N_2\left(\begin{pmatrix} µ_2 \\ µ_5 \end{pmatrix}, \begin{pmatrix} σ_{22} & σ_{25} \\ σ_{52} & σ_{55} \end{pmatrix}\right). (8.19)

Here, σ_{ij} is the ijth element of Σ, so that σ_{ii} = σ_i^2.

8.2.3 Independence

In Section 4.3, we showed that independence of two random variables means thattheir covariance is 0, but that a covariance of 0 does not imply independence. But,with multivariate normals, it does. That is, if X is a multivariate normal vector, andCov(Xj, Xk) = 0, then Xj and Xk are independent. The next theorem generalizes thisindependence to sets of variables.


Theorem 5. Suppose

X = \begin{pmatrix} Y \\ W \end{pmatrix} (8.20)

is multivariate normal, where Y = (Y_1, . . . , Y_K)' and W = (W_1, . . . , W_L)'. If Cov(Y_k, W_l) = 0 for all k, l, then Y and W are independent.

Proof. For simplicity, we will assume the mean of X is 0_{K+L}. Because the covariances between the Y_k's and W_l's are zero,

Cov(X) = \begin{pmatrix} Cov(Y) & 0 \\ 0 & Cov(W) \end{pmatrix}. (8.21)

(The 0's denote matrices of the appropriate size with all elements zero.) Each of those individual covariance matrices has a square root, hence there are matrices B, K × K, and C, L × L, such that

Cov[Y] = BB' \quad \text{and} \quad Cov[W] = CC'. (8.22)

Thus

Cov[X] = AA' \quad \text{where} \quad A = \begin{pmatrix} B & 0 \\ 0 & C \end{pmatrix}. (8.23)

Then by definition, we know that with Z being a (K+L) × 1 vector of iid standard normals,

\begin{pmatrix} Y \\ W \end{pmatrix} = X = AZ = \begin{pmatrix} B & 0 \\ 0 & C \end{pmatrix}\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} = \begin{pmatrix} BZ_1 \\ CZ_2 \end{pmatrix}, (8.24)

where Z_1 and Z_2 are K × 1 and L × 1, respectively. But that means that

Y = BZ_1 \quad \text{and} \quad W = CZ_2, (8.25)

and Z_1 is independent of Z_2, hence Y is independent of W. ☐

Note that this result can be extended to partitions of X into more than two groups.Especially, if all the covariances between elements of X are 0, then the elements aremutually independent.

8.3 The pdf

The multivariate normal has a pdf only if the covariance is invertible (i.e., positivedefinite). In that case, its pdf is easy to find using the same procedure used to findthe pdf of the bivariate normal in Section 6.2.4. Suppose Y ∼ N(µ, Σ) where Σ is a

p× p positive definite matrix. Let B be the symmetric square root of Σ, which will

also be positive definite. (Why?1) Then

Y = µ + BZ, Z ∼ N(0p, Ip). (8.26)

We know the pdf of Z, because it consists of iid N(0, 1)’s:

f_Z(z) = \prod \frac{1}{\sqrt{2π}} e^{-\frac{1}{2}z_i^2} = \frac{1}{(2π)^{p/2}}\, e^{-\frac{1}{2}\|z\|^2}. (8.27)

1Because the eigenvalues of Σ are positive, and those of B are the square roots, so are also positive.


The inverse of the transformation (8.26) is

Z = B^{-1}(Y - µ), (8.28)

hence the Jacobian is |B^{-1}| = 1/|B|. Then

f_Y(y) = \frac{1}{(2π)^{p/2}}\, \frac{1}{|B|}\, e^{-\frac{1}{2}\|B^{-1}(y-µ)\|^2}. (8.29)

Because Σ = BB' = BB, |Σ| = |B|^2. Also

\|B^{-1}(y-µ)\|^2 = (y-µ)'(BB)^{-1}(y-µ). (8.30)

Thus,

f_Y(y) = \frac{1}{(2π)^{p/2}}\, \frac{1}{\sqrt{|Σ|}}\, e^{-\frac{1}{2}(y-µ)'Σ^{-1}(y-µ)}. (8.31)

8.4 The sample mean and variance

Often one desires a confidence interval for the population mean. Specifically, suppose X_1, . . . , X_n are iid N(µ, σ^2). Then

\frac{\bar{X} - µ}{σ/\sqrt{n}} ∼ N(0, 1), (8.32)

because \bar{X} ∼ N(µ, σ^2/n). Because P[−1.96 < N(0, 1) < 1.96] = 0.95,

P\left[-1.96 < \frac{\bar{X} - µ}{σ/\sqrt{n}} < 1.96\right] = 0.95, (8.33)

or, untangling the equations to get µ in the middle,

P\left[\bar{X} - 1.96\frac{σ}{\sqrt{n}} < µ < \bar{X} + 1.96\frac{σ}{\sqrt{n}}\right] = 0.95. (8.34)

Thus,

\left(\bar{X} - 1.96\frac{σ}{\sqrt{n}},\; \bar{X} + 1.96\frac{σ}{\sqrt{n}}\right) \text{ is a 95\% confidence interval for } µ, (8.35)

at least if σ is known. But what if σ is not known? Then you estimate it, which willchange the distribution, that is,

\frac{\bar{X} - µ}{\hat{σ}/\sqrt{n}} ∼\; ?? (8.36)

The sample variance for a sample x_1, . . . , x_n is

s^2 = \frac{\sum(x_i - \bar{x})^2}{n}, \quad \text{or is it} \quad s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}? (8.37)

Rather than worry about that question now, we will find the joint distribution of

(\bar{X}, U), \quad \text{where} \quad U = \sum(X_i - \bar{X})^2. (8.38)


To start, note that the deviations xi − x are linear functions of the xi’s, as of courseis x, and we know how to deal with linear combinations of normals. That is, lettingX = (X1, . . . , Xn)′, because the elements are iid,

X ∼ N(µ1n, σ2In), (8.39)

where 1n is the n× 1 vector of all 1’s. The mean and deviations can be written

\begin{pmatrix} \bar{X} \\ X_1 - \bar{X} \\ \vdots \\ X_n - \bar{X} \end{pmatrix} = \begin{pmatrix} 0 \\ X \end{pmatrix} - \begin{pmatrix} -1 \\ 1_n \end{pmatrix}\bar{X}. (8.40)

Now X = I_nX and \bar{X} = \frac{1}{n}1_n'X, hence

\begin{pmatrix} \bar{X} \\ X_1 - \bar{X} \\ \vdots \\ X_n - \bar{X} \end{pmatrix} = \left[\begin{pmatrix} 0 \\ I_n \end{pmatrix} - \begin{pmatrix} -\frac{1}{n}1_n' \\ \frac{1}{n}1_n1_n' \end{pmatrix}\right]X = \begin{pmatrix} \frac{1}{n}1_n' \\ H_n \end{pmatrix}X, (8.41)

where

H_n = I_n - \frac{1}{n}1_n1_n' (8.42)

is the n× n centering matrix. It is called the centering matrix because for any n× 1vector a, Hna subtracts the mean of the elements from each element, centering thevalues at 0. Note that if all the elements are the same, centering will set everythingto 0, i.e.,

Hn1n = 0n. (8.43)

Also, if the mean of the elements already is 0, centering does nothing, which inparticular means that Hn(Hna) = Hna, or

HnHn = Hn. (8.44)

Such a matrix is called idempotent. It is not difficult to verify (8.44) directly usingthe definition (8.42) and multiplying things out. In fact, In and (1/n)1n1′n are alsoidempotent.

Back to the task. Equation (8.41) gives explicitly that the mean and deviationsare a linear transformation of a multivariate normal, hence the vector is multivariatenormal. The mean and covariance are

E\begin{pmatrix} \bar{X} \\ X_1 - \bar{X} \\ \vdots \\ X_n - \bar{X} \end{pmatrix} = \begin{pmatrix} \frac{1}{n}1_n' \\ H_n \end{pmatrix}µ1_n = \begin{pmatrix} \frac{1}{n}1_n'1_n \\ H_n1_n \end{pmatrix}µ = \begin{pmatrix} µ \\ 0_n \end{pmatrix} (8.45)


and

Cov\begin{pmatrix} \bar{X} \\ X_1 - \bar{X} \\ \vdots \\ X_n - \bar{X} \end{pmatrix} = \begin{pmatrix} \frac{1}{n}1_n' \\ H_n \end{pmatrix} σ^2I_n \begin{pmatrix} \frac{1}{n}1_n' \\ H_n \end{pmatrix}' = σ^2\begin{pmatrix} \frac{1}{n^2}1_n'1_n & \frac{1}{n}1_n'H_n \\ \frac{1}{n}H_n1_n & H_nH_n \end{pmatrix} = σ^2\begin{pmatrix} \frac{1}{n} & 0_n' \\ 0_n & H_n \end{pmatrix}. (8.46)

Look at the 0_n's: the covariances between \bar{X} and the deviations H_nX are zero, hence by multivariate normality they are independent. Further, we can read off the distributions:

\bar{X} ∼ N\left(µ, \frac{1}{n}σ^2\right) \quad \text{and} \quad H_nX ∼ N(0_n, σ^2H_n). (8.47)

The first we already knew. But because X and HnX are independent, and U =‖HnX‖2 is a function of just HnX,

X and U = ∑(Xi − X)2 are independent. (8.48)

The next section will find the distribution of U.

8.5 Chi-squares

In Section 5.1.2, we defined the chi-squared distribution on one degree of freedom.Here we look at the more general chi-square.

Definition 10. Suppose Z ∼ N(µ, I_ν). Then

W = \|Z\|^2 (8.49)

has the noncentral chi-squared distribution on ν degrees of freedom with noncentrality parameter ∆ = \|µ\|^2, written

W ∼ χ^2_ν(∆). (8.50)

If ∆ = 0, then W has the (central) chi-squared distribution,

W ∼ χ^2_ν. (8.51)

This definition implies that if two µ's have the same length, then the corresponding W's have the same distribution. That is, with ν = 2,

µ = (5, 0), \quad µ = (3, 4), \quad \text{and} \quad µ = \left(\frac{5}{\sqrt{2}}, \frac{5}{\sqrt{2}}\right) (8.52)


all lead to the same ∆ = 25, hence the distributions of their \|Z\|^2's will all be the same, χ^2_2(25). Is that a plausible fact? The key is that if Γ is an orthogonal matrix, then \|Γz\| = \|z\|. Thus Z and ΓZ would lead to the same chi-squared. Take Z ∼ N(µ, I_ν), and let Γ be an orthogonal matrix such that

Γµ = \begin{pmatrix} \|µ\| \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} \sqrt{∆} \\ 0 \\ \vdots \\ 0 \end{pmatrix}. (8.53)

Any orthogonal matrix whose first row is µ'/\|µ\| will work. Then

\|Z\|^2 =_D \|ΓZ\|^2, (8.54)

and the latter clearly depends on just ∆, which shows that the definition is fine.

The definition can be used to show that the sum of independent chi-squareds is also chi-squared.

also chisquared.

Lemma 10. If W_1 and W_2 are independent, with

W_1 ∼ χ^2_{ν_1}(∆_1) \quad \text{and} \quad W_2 ∼ χ^2_{ν_2}(∆_2), \quad \text{then} \quad W_1 + W_2 ∼ χ^2_{ν_1+ν_2}(∆_1 + ∆_2). (8.55)

Proof. Because W_1 and W_2 are independent, they can be represented by W_1 = \|Z_1\|^2 and W_2 = \|Z_2\|^2, where Z_1 and Z_2 are independent, and

Z_1 ∼ N(µ_1, I_{ν_1}) \text{ and } Z_2 ∼ N(µ_2, I_{ν_2}), \quad ∆_1 = \|µ_1\|^2 \text{ and } ∆_2 = \|µ_2\|^2. (8.56)

Letting

Z = \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} ∼ N\left(\begin{pmatrix} µ_1 \\ µ_2 \end{pmatrix}, I_{ν_1+ν_2}\right), (8.57)

we have that

W_1 + W_2 = \|Z\|^2 ∼ χ^2_{ν_1+ν_2}(\|µ_1\|^2 + \|µ_2\|^2) = χ^2_{ν_1+ν_2}(∆_1 + ∆_2). ☐ (8.58)

If both W_i's are central chi-squareds, then their sum is also a central chi-squared.

In fact, from (5.15), we have that a χ^2_1 is Gamma(\frac{1}{2}, \frac{1}{2}), and a χ^2_ν is the sum of ν independent χ^2_1's, hence the sum of independent Gamma(\frac{1}{2}, \frac{1}{2})'s from Example 5.2.2. Thus

χ^2_ν = Gamma\left(\frac{ν}{2}, \frac{1}{2}\right), (8.59)

hence its pdf is

f_ν(y) = \frac{1}{2^{ν/2}Γ(ν/2)}\, y^{ν/2-1}\, e^{-y/2}, \quad y > 0. (8.60)


8.5.1 Moments of a chi-squared

The moments of a central chi-squared are easily obtained from those of a gamma.From (3.75) and (3.76), we have that the mean and variance of a Gamma(α, λ) are α/λand α/λ2, respectively, hence

E[χ2ν] = ν and Var[χ2

ν] = 2ν. (8.61)

Turn to the noncentral chi-squared. Because a χ^2_ν(∆) =_D χ^2_1(∆) + χ^2_{ν-1}, where the latter two are independent,

E[χ^2_ν(∆)] = E[χ^2_1(∆)] + E[χ^2_{ν-1}] \quad \text{and} \quad Var[χ^2_ν(∆)] = Var[χ^2_1(∆)] + Var[χ^2_{ν-1}]. (8.62)

Thus we need to find the moments of a χ^2_1(∆). Now (\sqrt{∆} + X)^2 ∼ χ^2_1(∆), where X ∼ N(0, 1), hence

E[χ^2_1(∆)] = E[(\sqrt{∆} + X)^2] = ∆ + 2\sqrt{∆}\,E[X] + E[X^2] = ∆ + 1. (8.63)

Also,

E[(χ^2_1(∆))^2] = E[(\sqrt{∆} + X)^4] = E[∆^2 + 4∆^{3/2}X + 6∆X^2 + 4\sqrt{∆}X^3 + X^4]. (8.64)

Now E[X] = E[X3] = 0, E[X2] = 1, and E[X4] = 3 (can be found a number of ways),hence

E[(χ21(∆))2] == ∆4 + 6∆ + 3, (8.65)

andVar[(χ2

1(∆))2] = ∆4 + 6∆ + 3− (∆ + 1)2 = 4∆ + 2. (8.66)

Putting (8.62) together with (8.64) and (8.65), we obtain

E[χ2ν(∆)] = ν + ∆ and Var[χ2

ν(∆)] = 2ν + 4∆. (8.67)
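As a quick sanity check of (8.67), here is a minimal Monte Carlo sketch, assuming NumPy is available; the particular µ, giving ∆ = 25, is only an illustration.

import numpy as np

rng = np.random.default_rng(0)
nu = 5
mu = np.array([3.0, 4.0, 0.0, 0.0, 0.0])          # Delta = ||mu||^2 = 25
Delta = np.sum(mu ** 2)

Z = rng.standard_normal((100_000, nu)) + mu        # rows are draws of Z ~ N(mu, I_nu)
W = np.sum(Z ** 2, axis=1)                         # noncentral chi-squared draws

print("simulated mean:", W.mean(), " theory:", nu + Delta)
print("simulated var: ", W.var(),  " theory:", 2 * nu + 4 * Delta)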

8.5.2 Other covariance matrices

What if Z ∼ N(µ, Σ) for general Σ? We can still find a chi-squared distribution, but we first have to transform Z so that the covariance matrix is Iν. We will look at three possibilities, which are actually all special cases of using generalized inverses.

The simplest is if Z ∼ N(µ, σ²Iν). Then

(1/σ)Z ∼ N((1/σ)µ, Iν),  (8.68)

which means

‖Z‖²/σ² ∼ χ²ν(‖µ‖²/σ²), or ‖Z‖² ∼ σ²χ²ν(‖µ‖²/σ²),  (8.69)

a “scaled" chi-squared. Suppose more generally that Σ is ν × ν and invertible. Then it has a symmetric square root, say Σ^{1/2}, which is also invertible. Multiply the vector by that inverse:

Σ^{−1/2}Z ∼ N(Σ^{−1/2}µ, Σ^{−1/2}ΣΣ^{−1/2}) = N(Σ^{−1/2}µ, Iν).  (8.70)


Applying the definition yields

‖Σ^{−1/2}Z‖² ∼ χ²ν(‖Σ^{−1/2}µ‖²), or Z′Σ^{−1}Z ∼ χ²ν(µ′Σ^{−1}µ).  (8.71)

Also,

(Z − µ)′Σ^{−1}(Z − µ) ∼ χ²ν.  (8.72)

Finally, consider the Hₙ in (8.42). It is not positive definite, hence not invertible, because 1ₙ′Hₙ1ₙ = 0. Thus we cannot immediately use (8.70). There is a nice result for symmetric idempotent n × n covariance matrices H. We apply it to HZ, where Z ∼ N(µ, Iₙ), so that

HZ ∼ N(Hµ, H).  (8.73)

We first look at the spectral decomposition, H = ΓΛΓ′, and note that

H = ΓΛΓ′ = HH = ΓΛΓ′ΓΛΓ′ = ΓΛ²Γ′,  (8.74)

so that Λ² = Λ. Because Λ is diagonal, we have

λᵢ² = λᵢ for all i,  (8.75)

which means that each λᵢ must be either 0 or 1. Because the trace of H is the sum of the eigenvalues, there must be trace(H) 1's. These come first in the Λ, hence

Λ = [ Iν  0 ; 0  0 ],  ν = trace(H).  (8.76)

In H, the last n − ν columns of Γ are multiplied by λᵢ = 0, hence with γᵢ being the ith column of Γ,

H = Γ₁Γ₁′,  Γ₁ = (γ₁, …, γν),  n × ν.  (8.77)

Because the columns are orthonormal,

Γ₁′Γ₁ = Iν.  (8.78)

Now look at

Γ₁′HZ = Γ₁′Z ∼ N(Γ₁′Hµ, Γ₁′HΓ₁) = N(Γ₁′µ, Iν)  (8.79)

because

Γ₁′H = Γ₁′Γ₁Γ₁′ = Γ₁′.  (8.80)

The definition of chi-squares can be applied to (8.79):

Z′Γ₁Γ₁′Z ∼ χ²ν(µ′Γ₁Γ₁′µ),  (8.81)

i.e.,

Z′HZ ∼ χ²ν(µ′Hµ),  ν = trace(H).  (8.82)

Note that happily we do not need to explicitly find the eigenvectors γi.

This result is used in linear models, where H is a projection matrix onto a particular subspace, and Z′HZ is the sum of squares for that subspace.
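Here is a small simulation sketch of (8.82), assuming NumPy; it uses the centering matrix Hₙ and an arbitrary µ, and checks the mean and variance against (8.67).

import numpy as np

rng = np.random.default_rng(1)
n = 6
H = np.eye(n) - np.ones((n, n)) / n                # the centering matrix H_n
mu = np.array([1.0, 2.0, 0.0, 0.0, 1.0, 3.0])      # an arbitrary mean vector
nu, Delta = np.trace(H), mu @ H @ mu               # df = n - 1, noncentrality mu'H mu

Z = rng.standard_normal((200_000, n)) + mu         # Z ~ N(mu, I_n)
Q = np.einsum("ij,jk,ik->i", Z, H, Z)              # quadratic forms Z'HZ

print("df =", nu, " Delta =", Delta)
print("simulated mean:", Q.mean(), " theory:", nu + Delta)
print("simulated var: ", Q.var(),  " theory:", 2 * nu + 4 * Delta)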


Sum of squared deviations

Now back to Section 8.4, with

X ∼ N(µ1ₙ, σ²Iₙ),  U = ∑(Xᵢ − X̄)² = X′HₙX.  (8.83)

Here, Hₙµ1ₙ = 0ₙ, and

(1/σ)HₙX ∼ N(0ₙ, Hₙ).  (8.84)

Thus the U is central chi-squared:

(1/σ²)X′HₙX ∼ χ²ν,  ν = trace(Hₙ).  (8.85)

The trace is not difficult:

trace(Hₙ) = trace(Iₙ − (1/n)1ₙ1ₙ′) = trace(Iₙ) − (1/n)n = n − 1.  (8.86)

We can now finish Section 8.4.

Lemma 11. If X₁, …, Xₙ are iid N(µ, σ²), then X̄ and ∑(Xᵢ − X̄)² are independent,

X̄ ∼ N(µ, σ²/n) and ∑(Xᵢ − X̄)² ∼ σ²χ²_{n−1}.  (8.87)

8.6 Student’s t distribution

Here we answer the question of how to find a confidence interval for µ as in (8.35) when σ is unknown. The “pivotal quantity" we use here is

T = (X̄ − µ)/(S/√n),  S² = ∑(Xᵢ − X̄)²/(n − 1).  (8.88)

Its distribution is next.

Definition 11. Suppose Z ∼ N(0, 1) and U ∼ χ²ν, where Z and U are independent. Then

T = Z/√(U/ν)  (8.89)

is Student's t on ν degrees of freedom, written

T ∼ tν.  (8.90)

From Lemma 11, we have the conditions of Definition 11 satisfied by setting

Z = (X̄ − µ)/(σ/√n) and U = ∑(Xᵢ − X̄)²/σ².  (8.91)

Then

T = [(X̄ − µ)/(σ/√n)] / √{ [∑(Xᵢ − X̄)²/σ²] / (n − 1) } = (X̄ − µ)/(S/√n) ∼ t_{n−1}.  (8.92)


A 95% confidence interval for µ is then

X̄ ± t_{n−1,0.025} s/√n,  (8.93)

where t_{ν,α/2} is the cutoff point that satisfies

P[−t_{ν,α/2} < T < t_{ν,α/2}] = 1 − α,  T ∼ tν.  (8.94)
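A brief sketch of (8.93) using SciPy's t quantiles; the data values are made up for illustration.

import numpy as np
from scipy import stats

x = np.array([4.1, 5.3, 6.0, 5.5, 4.8, 6.2, 5.1])   # hypothetical data
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)                    # S^2 divides by n - 1, as in (8.88)
tcut = stats.t.ppf(0.975, df=n - 1)                  # t_{n-1, 0.025}

lo, hi = xbar - tcut * s / np.sqrt(n), xbar + tcut * s / np.sqrt(n)
print(f"95% confidence interval for mu: ({lo:.3f}, {hi:.3f})")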

8.7 Linear models and the conditional distribution

Expanding on the simple linear model in (7.38), we consider the conditional model

Y | X = x ∼ N(α + βx, Σₑ) and X ∼ N(µ_X, Σ_XX),  (8.95)

where Y is q × 1, X is p × 1, and β is a q × p matrix. As in Example 7.7.1,

E = Y − α − βX and X  (8.96)

are independent and multivariate normal, and their joint distribution is also multivariate normal:

(X ; E) ∼ N( (µ_X ; 0_q), [ Σ_XX  0 ; 0  Σₑ ] ).  (8.97)

To find the joint distribution of X and Y, we note that (X′, Y′)′ is an affine transformation of (X′, E′)′, hence is multivariate normal. Specifically,

(X ; Y) = (0_p ; α) + [ I_p  0 ; β  I_q ] (X ; E) ∼ N( (µ_X ; µ_Y), [ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ] ),  (8.98)

where

(µ_X ; µ_Y) = (0_p ; α) + [ I_p  0 ; β  I_q ] (µ_X ; 0_q) = (µ_X ; α + βµ_X)  (8.99)

and

[ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ] = [ I_p  0 ; β  I_q ] [ Σ_XX  0 ; 0  Σₑ ] [ I_p  β′ ; 0  I_q ] = [ Σ_XX  Σ_XXβ′ ; βΣ_XX  Σₑ + βΣ_XXβ′ ].  (8.100)

As we did for the bivariate normal, we invert the above process to find the conditional distribution of Y given X from the joint. First, we solve for α, β, and Σₑ in (8.99) and (8.100):

β = Σ_YXΣ_XX^{−1},  α = µ_Y − βµ_X,  and  Σₑ = Σ_YY − Σ_YXΣ_XX^{−1}Σ_XY.  (8.101)


Lemma 12. Suppose

(X ; Y) ∼ N( (µ_X ; µ_Y), [ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ] ),  (8.102)

where Σ_XX is invertible. Then

Y | X = x ∼ N(α + βx, Σₑ),  (8.103)

where α, β, and Σₑ are given in (8.101).
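A minimal NumPy sketch of Lemma 12: from made-up joint parameters, compute β, α, and Σₑ as in (8.101), then the conditional mean and covariance of (8.103). The specific numbers are illustrative only.

import numpy as np

mu_X, mu_Y = np.array([1.0, 2.0]), np.array([0.5])
S_XX = np.array([[2.0, 0.5], [0.5, 1.0]])            # Sigma_XX
S_XY = np.array([[0.8], [0.3]])                      # Sigma_XY (p x q)
S_YX, S_YY = S_XY.T, np.array([[1.5]])

beta = S_YX @ np.linalg.inv(S_XX)                    # (8.101)
alpha = mu_Y - beta @ mu_X
Sigma_e = S_YY - S_YX @ np.linalg.inv(S_XX) @ S_XY

x = np.array([1.5, 1.0])                             # condition on X = x
print("E[Y | X = x]   =", alpha + beta @ x)          # conditional mean in (8.103)
print("Cov[Y | X = x] =", Sigma_e)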


Part II

Statistical Inference


Chapter 9

Statistical Models and Inference

Most, although not all, of the material so far has been straight probability calculations; that is, we are given a probability distribution, and try to figure out the implications (what X is likely to be, marginals, conditionals, moments, etc.). Statistics generally concerns itself with the reverse problem: observing the data X = x, and then having to guess aspects of the probability distribution that generated x. This “guessing" goes under the general rubric of inference. Four main aspects of inference are

• Estimation: What is the best guess of a particular parameter (vector), or function of the parameter? The estimate may be a point estimate, or a point estimate and measure of accuracy, or an interval or region, e.g., “The mean is in the interval (10.44, 19.77)."

• Hypothesis testing: The question is whether a specific hypothesis, the null hypothesis, about the distribution is true, so that the inference is basically either “yes" or “no", along with an idea of how reliable the conclusion is.

• Prediction: One is interested in predicting a new observation, possibly depending on a covariate. For example, the data may consist of a number of (Xᵢ, Yᵢ) pairs, and a new observation comes along, where we know the x but not the y, and we wish to guess that y. We may be predicting a numerical variable, e.g., return on an investment, or a categorical variable, e.g., the species of a plant.

The boundaries between these notions are not firm. One can consider prediction to be estimation, or classification as an extension of hypothesis testing with more than two hypotheses. Whatever the goal, the first task is to specify the statistical model.

9.1 Statistical models

A probability model consists of a random X in space X and a probability distribution P. A statistical model also has X and space X, but an entire family P of probability distributions on X. By family we mean a set of distributions; the only restriction being that they are all distributions for the same X. Such families can be quite general, e.g.,

X = Rⁿ, P = {P | X₁, …, Xₙ are iid with finite mean and variance}.  (9.1)


This family includes all kinds of distributions (iid normal, gamma, beta, binomial), but not ones with the Xᵢ's correlated, or distributed Cauchy (which has no mean or variance). Another possibility is the family with the Xᵢ's iid with a continuous distribution.

Often, the families are parametrized by a finite-dimensional parameter θ, i.e.,

P = {Pθ | θ ∈ Θ}, where Θ ⊂ R^K.  (9.2)

The Θ is called the parameter space. We are quite familiar with parameters, but for statistical models we must be careful to specify the parameter space as well. For example, suppose X and Y are independent, X ∼ N(µ_X, σ²_X) and Y ∼ N(µ_Y, σ²_Y). Then the following parameter spaces lead to distinctly different models:

Θ₁ = {(µ_X, σ²_X, µ_Y, σ²_Y) ∈ R × (0, ∞) × R × (0, ∞)}
Θ₂ = {(µ_X, σ²_X, µ_Y, σ²_Y) | µ_X ∈ R, µ_Y ∈ R, σ²_X = σ²_Y ∈ (0, ∞)}
Θ₃ = {(µ_X, σ²_X, µ_Y, σ²_Y) | µ_X ∈ R, µ_Y ∈ R, σ²_X = σ²_Y = 1}
Θ₄ = {(µ_X, σ²_X, µ_Y, σ²_Y) | µ_X ∈ R, µ_Y ∈ R, µ_X > µ_Y, σ²_X = σ²_Y ∈ (0, ∞)}.  (9.3)

The first model places no restrictions on the parameters, other than that the variances are positive. The second one demands the two variances be equal. The third sets the variances to 1, which is equivalent to saying that the variances are known to be 1. The last one equates the variances, as well as specifying that the mean of X is larger than that of Y.

A Bayesian model includes a (prior) distribution on P, which in the case of a parametrized model means a distribution on Θ. In fact, the model could include a family of prior distributions, although we will not deal with that case explicitly.

Before we introduce inference, we take a brief look at how probability is interpreted.

9.2 Interpreting probability

In Section 2.1, we defined probability models mathematically, starting with some axioms. Everything else flowed from those axioms. But as with all mathematical objects, they do not in themselves have physical reality. In order to make practical use of the results, we must somehow connect the mathematical objects to the physical world. That is, how is one to interpret P[A]? In the case of games of chance, people generally feel confident that they know what “the chance of heads" or “the chance of a full house" mean. But other probabilities may be less obvious, e.g., “the chance that it rains next Tuesday" or “the chance ____ and ____ get married" (fill in the blanks with any two people). Two popular interpretations are frequency and subjective. Both have many versions, and there are also many other interpretations, but much of this material is beyond the scope of the author. Here are sketches of the two.

Frequency. An experiment is presumed to be repeatable, so that one could conceivably repeat the experiment under the exact same conditions over and over again (i.e., infinitely often). Then the probability of a particular event A, P[A], is the long-run proportion of times it occurs, as the experiment is repeated forever. That is, it is the frequency with which A occurs. This interpretation implies that probability is objective in the sense that it is inherent in the experiment, not a product of one's beliefs. This interpretation works well for games of chance. One can imagine rolling a die or spinning a roulette wheel an “infinite" number of times. Population sampling also fits in well, as one could imagine repeatedly taking a random sample of 100 subjects from a given population. The frequentist interpretation cannot be applied to situations that are not in principle repeatable, such as whether two people will get married, or whether a particular candidate will win an election. One could imagine redoing the world over and over (like in the movie Groundhog Day), but that is fiction.

Much of “classical" statistics is based on frequentist concepts. For example, a 95% confidence interval for the mean µ is a random interval with the property that, if the experiment were repeated over and over, about 95% of the intervals would contain µ. That is, for X ∼ N(µ, 1),

P[X − 1.96 < µ < X + 1.96] = 0.95, hence

(X − 1.96, X + 1.96) is a 95% confidence interval for µ.  (9.4)

Thus if we observe X = 6.7, say,

(4.74, 8.66) is a 95% confidence interval for µ,

BUT P[4.74 < µ < 8.66] ≠ 0.95.  (9.5)

The reason for the last statement is that there is nothing random in the event “4.74 < µ < 8.66". The mean µ is the fixed but unknown population mean, and 4.74 and 8.66 are just numbers. Thus probability does not make sense there. Either µ is between 4.74 and 8.66, or it is not; we just do not know which. The randomness of µ does not come under the frequentist purview (unless of course a further repeatable experiment produces µ, e.g., one first randomly chooses a population to sample from).

Subjective. The subjective approach allows each person to have a different probability, so that for a given person, P[A] is that person's opinion of the probability of A. The only assumption is that each person's probabilities cohere, that is, satisfy the probability axioms. Subjective probability can be applied to any situation. For repeatable experiments, people's subjective probabilities would tend to agree, whereas in other cases, such as the probability a certain team will win a particular game, their probabilities could differ widely. Subjectively, P[4.74 < µ < 8.66] does make sense; one drawback is that each person can have a different value.

Some subjectivists make the assumption that any given person's subjective probabilities can be elicited using a betting paradigm. For example, suppose the event in question is “Pat and Leslie will get married," the choices being “Yes" and “No", and we wish to elicit your probability of the event “Yes". We give you $10, and ask you for a number w, which will be used in two possible bets:

Bet 1 → Win $w if “Yes", Lose $10 if “No",
Bet 2 → Lose $10 if “Yes", Win $(100/w) if “No".  (9.6)

Some dastardly being will decide which of the bets you will take, so the w should be an amount for which you are willing to take either of those two bets. For example, if you choose $w = $5, then you are willing to accept a bet that pays only $5 if they do get married, and loses $10 if they don't; AND you are willing to take a bet that wins $20 if they do not get married, and loses $10 if they do. These numbers suggest you expect they will get married. Suppose p is your subjective probability of “Yes". Then your willingness to take Bet 1 means you expect to not lose money:

Bet 1 → E[Winnings] = p($w) − (1 − p)($10) ≥ 0.  (9.7)


Same with Bet 2:

Bet 2 → E[Winnings] = −p($10) + (1 − p)$(100/w) ≥ 0.  (9.8)

A little algebra translates those two inequalities into

p ≥ 10/(10 + w) and p ≤ (100/w)/(10 + 100/w) = 10/(10 + w),  (9.9)

which of course means that

p = 10/(10 + w).  (9.10)

With $w = $5, your p = 2/3 that they will get married.

The betting approach is then an alternative to the frequency approach. Whether it is practical to elicit an entire probability distribution (i.e., P[A] for all A ⊂ X), and whether the result will satisfy the axioms, is questionable, but the main point is that there is in principle a grounding to a subjective probability.
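As a tiny illustration of (9.10), here is a sketch assuming the $10 stake of (9.6):

def elicited_probability(w):
    """Subjective probability of "Yes" implied by the stake w, per (9.10)."""
    return 10.0 / (10.0 + w)

print(elicited_probability(5))   # 2/3, as in the text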

9.3 Approaches to inference

Paralleling the interpretations of probability are the two main approaches to statistical inference: frequentist and Bayesian. Both aim to make inferences about θ based on observing the data X = x, but they take different tacks.

Frequentist. The frequentist approach assumes that the parameter θ is fixed but unknown (that is, we only know that θ ∈ Θ). An inference is an action, which is a function

δ : X → A,  (9.11)

for some action space A. The action space depends on the type of inference desired. For example, if one wishes to estimate θ, then δ(x) would be the estimate, and A = Θ. Or δ may be a vector containing the estimate as well as an estimate of its standard error, or it may be a two-dimensional vector representing a confidence interval. In hypothesis testing, we often take A = {0, 1}, where 0 means accept the null hypothesis, and 1 means reject it. The properties of a procedure δ, which would describe how good it is, are based on the behavior if the experiment were repeated over and over, with θ fixed. Thus an estimator δ of θ is unbiased if

Eθ [δ(X)] = θ for all θ ∈ Θ. (9.12)

Or a confidence interval procedure δ(x) = (l(x), u(x)) has 95% coverage if

Pθ[l(X) < θ < u(X)] ≥ 0.95 for all θ ∈ Θ.  (9.13)

Understand that the 95% does not refer to your particular interval, but rather to the infinite number of intervals that you imagine arising from repeating the experiment over and over.

Bayesian. The frequentist does not tell you what to think of θ. It just produces a number or numbers, then reassures you by telling you what would happen if you repeated the experiment an infinite number of times. The Bayesian approach, by contrast, tells you what to think. More precisely, given your prior distribution on Θ, which may be your subjective distribution, the Bayes approach tells you how to update your opinion upon observing X = x. The update is of course the posterior, which we know how to find using Bayes Theorem 3. The posterior f_{θ|X}(θ | x) is the inference, or at least all inferences are derived from it. For example, an estimate could be the posterior mean, median, or mode. A 95% probability interval is any interval (l, u) such that

P[l < θ < u | X = x] = 0.95.  (9.14)

A hypothesis test would calculate

P[Null hypothesis is true | X = x],  (9.15)

or, if an accept/reject decision is desired, reject the null hypothesis if the posterior probability of the null is less than 0.01, say.

A drawback to the frequentist approach is that we cannot say what we want, such as the probability a null hypothesis is true, or the probability µ is between two numbers. Bayesians can make such statements, but at the cost of having to come up with a (subjective) prior. The subjectivity means that different people can come to different conclusions from the same data. (Imagine a tobacco company and a consumer advocate analyzing the same smoking data.) Fortunately, there are more or less well-accepted “objective" priors, and especially when the data are strong, different reasonable priors will lead to practically the same posteriors. From an implementation point of view, sometimes frequentist procedures are computationally easier, and sometimes Bayesian procedures are. It may not be philosophically pleasing, but it is not a bad idea to take an opportunistic view and use whichever approach best moves your understanding along.

There are other approaches to inference, such as the likelihood approach, the structural approach, the fiducial approach, and the fuzzy approach. These are all interesting and valuable, but seem a bit iffy to me.

The rest of the course goes more deeply into inference.


Chapter 10

Introduction to Estimation

10.1 Definition of estimator

We assume a model with parameter space Θ, and suppose we wish to estimate some function g of θ,

g : Θ → R.  (10.1)

This function could be θ itself, or one component of θ. For example, if X₁, …, Xₙ are iid N(µ, σ²), where θ = (µ, σ²) ∈ R × (0, ∞), some possible g's are

g(µ, σ²) = µ;
g(µ, σ²) = σ;
g(µ, σ²) = σ/µ = coefficient of variation;
g(µ, σ²) = P[Xᵢ ≤ 10] = Φ((10 − µ)/σ),  (10.2)

where Φ is the distribution function for N(0, 1).

An estimator is a function δ(X),

δ : X → A,  (10.3)

where A is some space, presumably the space of g(θ), but not always. The hope is that δ(X) is close to g(θ). The estimator can be any function of x, but cannot depend on an unknown parameter. Thus with g(µ, σ²) = σ/µ in the above example,

δ(x₁, …, xₙ) = s/x̄,  [s² = ∑(xᵢ − x̄)²/n]  is an estimator,
δ(x₁, …, xₙ) = σ/x̄  is not an estimator.  (10.4)

Any function can be an estimator, but that does not mean it will be a particularly good estimator. There are basically two questions we must address: How does one find reasonable estimators? How do we decide which estimators are good? As may be expected, the answers differ depending on whether one is taking a frequentist approach or a Bayesian approach.

This chapter will introduce several methods of estimation. Section 12.3 in the next chapter delves into a certain optimality for estimators.


10.2 Plug-in methods

Suppose the data consist of X₁, …, Xₙ iid vectors, each with space X₀ and distribution function F. Many parameters, such as the mean or variance, are defined for wide classes of distributions F. The simplest notion of a plug-in estimator is to use the sample version of the population parameter, so that X̄ is an estimator for µ, S², the sample variance, is an estimator for σ², and an estimator of P[Xᵢ ≤ 10] is

δ(x) = #{xᵢ ≤ 10}/n.  (10.5)

Such estimators can be formalized by considering a large family F of distribution functions, which includes at least all the distributions with finite X₀, and functionals on F,

θ : F → R.  (10.6)

That is, the argument for the function is F. The mean, median, variance, P[Xᵢ ≤ 10] = F(10), etc., are all such functionals for appropriate F's. Notice that the α and β of a beta distribution are not such functionals, because they are not defined for non-beta distributions. The θ(F) is then the population parameter θ. The sample version is

θ̂ = θ(Fₙ),  (10.7)

where Fₙ is the empirical distribution function for a given sample x₁, …, xₙ, defined by

Fₙ(x) = #{xᵢ | xᵢ ≤ x}/n  (10.8)

in the univariate case, or

Fₙ(x) = #{xᵢ | xᵢⱼ ≤ xⱼ for all j = 1, …, p}/n  (10.9)

if the data are p-variate:

xᵢ = (xᵢ₁, …, xᵢₚ)′ and x = (x₁, …, xₚ)′ ∈ Rᵖ.  (10.10)

This Fₙ is the distribution function for the random vector X* which has space

X₀* = the distinct values among x₁, …, xₙ  (10.11)

and probabilities

P*[X* = x*] = #{xᵢ | xᵢ = x*}/n,  x* ∈ X₀*.  (10.12)

Thus X* is generated by randomly choosing one of the observations xᵢ. For example, if the sample is 3, 5, 2, 6, 2, 2, then

X₀* = {2, 3, 5, 6}, and P[X* = 2] = 1/2, P[X* = 3] = P[X* = 5] = P[X* = 6] = 1/6.  (10.13)


The Fₙ is an estimate of F, so the estimate (10.7) for θ is found by plugging in Fₙ for F. If θ is the mean, then θ(Fₙ) = x̄, and if θ is the variance, θ(Fₙ) = s² = ∑(xᵢ − x̄)²/n. As in (10.2), the coefficient of variation can be estimated by the sample version, s/x̄.

Other plug-in estimators arise in parametric models, where one has estimates of the parameters and wishes to estimate some function of them, or one has estimates of some functions of the parameters and wishes to estimate the parameters themselves. For example, if the Xᵢ's are iid N(µ, σ²), P[Xᵢ ≤ 10] = Φ((10 − µ)/σ), where Φ is the distribution function of the standard normal. Then instead of (10.5), we could use the estimate

P̂[Xᵢ ≤ 10] = Φ((10 − x̄)/s).  (10.14)

Or suppose the Xᵢ's are iid Beta(α, β), with (α, β) ∈ (0, ∞) × (0, ∞). Then from (6.30) and (6.31), the population mean and variance are

µ = α/(α + β) and σ² = αβ/((α + β)²(α + β + 1)).  (10.15)

Then x̄ and s² are estimates of those functions of α and β, hence the estimates α̂ and β̂ of α and β would be the solutions to

x̄ = α̂/(α̂ + β̂) and s² = α̂β̂/((α̂ + β̂)²(α̂ + β̂ + 1)),  (10.16)

or after some algebra,

α̂ = x̄(x̄(1 − x̄)/s² − 1) and β̂ = (1 − x̄)(x̄(1 − x̄)/s² − 1).  (10.17)

The estimators in (10.17) are special plug-in estimators, called method of moments estimators, because the estimates of the parameters are chosen to match the population moments with their sample versions. The method of moments is not necessarily strictly defined. For example, in the Poisson(λ), the mean and variance are both λ, so that λ̂ could be x̄ or s². Also, one has to choose moments that work. For example, if the data are iid N(0, σ²), and we wish to estimate σ, the mean is useless because one cannot do anything to match 0 = x̄.
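A minimal sketch of (10.17), assuming NumPy; the Beta(2, 5) data are simulated only to check that the estimates land near the true parameters.

import numpy as np

rng = np.random.default_rng(2)
x = rng.beta(2.0, 5.0, size=5000)                    # simulated Beta(2, 5) sample

xbar = x.mean()
s2 = np.mean((x - xbar) ** 2)                        # the book's s^2 divides by n

common = xbar * (1 - xbar) / s2 - 1
alpha_hat = xbar * common                            # (10.17)
beta_hat = (1 - xbar) * common
print("alpha_hat, beta_hat:", alpha_hat, beta_hat)   # should be near (2, 5)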

10.3 Least squares

Another general approach to estimation chooses the estimate of θ so that the observed data are close to their expected values. For example, if one is trying to estimate the mean µ of iid Xᵢ, the estimate µ̂ is chosen so that it is “close" to the Xᵢ's. It cannot be exactly equal to all the observations (unless they are all equal), so some overall measure of closeness is needed. The most popular is least squares, that is, µ̂ is the value a that minimizes

L(a; x₁, …, xₙ) = ∑(xᵢ − a)²  over a.  (10.18)

This L is easily minimized by differentiating with respect to a then setting to 0, to obtain a = x̄; that is, the sample mean is the least squares estimate of µ. Measures other than squared error are useful as well. For example, least absolute deviations


minimizes ∑|xᵢ − a|, the minimizer being the sample median; a measure from support vector machine methods allows a buffer zone:

L(a; x₁, …, xₙ) = ∑(xᵢ − a)² I{|xᵢ−a|<ε},  (10.19)

that is, it counts only those observations less than ε away from a.

Least squares is especially popular in linear models, where (X₁, Y₁), …, (Xₙ, Yₙ) are independent,

E[Yᵢ | Xᵢ = xᵢ] = α + βxᵢ,  Var[Yᵢ] = σₑ².  (10.20)

Then the least squares estimates of α and β are the a and b, respectively, that minimize

L(a, b; (x₁, y₁), …, (xₙ, yₙ)) = ∑(yᵢ − a − bxᵢ)².  (10.21)

Differentiating yields the normal equations,

( ∑yᵢ ; ∑xᵢyᵢ ) = [ n  ∑xᵢ ; ∑xᵢ  ∑xᵢ² ] ( a ; b ),  (10.22)

which have a unique solution if and only if ∑(xᵢ − x̄)² > 0. (Which makes sense, because if all the xᵢ's are equal, it is impossible to estimate the slope of the line through the points (xᵢ, yᵢ).) Inverting the matrix, we have

b = ∑(xᵢ − x̄)(yᵢ − ȳ) / ∑(xᵢ − x̄)²  and  a = ȳ − b x̄.  (10.23)

Note that these estimates are sample analogs of the α and β in (8.101).
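A small sketch, assuming NumPy, that solves the normal equations (10.22) and checks the closed form (10.23) on simulated data; the true α = 1 and β = 2 are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=50)

# Closed form (10.23)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# The same answer from solving the 2x2 normal equations (10.22)
A = np.array([[len(x), x.sum()], [x.sum(), (x ** 2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a2, b2 = np.linalg.solve(A, rhs)

print(a, b)
print(a2, b2)    # agrees with (a, b)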

10.4 The likelihood function

If we know θ, then the density tells us what X is likely to be. In statistics, we do not know θ, but we do observe X = x, and wish to know which values of θ are likely. The analog of the density for the statistical problem is the likelihood function, which is the same as the density, but considered as a function of θ for fixed x. The function is not itself a density (usually), because there is no particular reason to believe that the integral over θ for fixed x is 1. Rather, the likelihood function gives the relative likelihood of various values of θ.

For example, suppose X ∼ Binomial(n, θ), θ ∈ (0, 1), n = 5. The pmf is

f(x | θ) = (n choose x) θ^x (1 − θ)^{n−x},  x = 0, …, n.  (10.24)

The likelihood is

L(θ; x) = cₓ θ^x (1 − θ)^{n−x},  0 < θ < 1.  (10.25)

Here, cₓ is a constant that may depend on x but not on θ. Suppose the observed value of X is x = 2. Then one can find the relative likelihood of different values of θ from L, that is,

“relative likelihood of θ = .4 to θ = .2" = L(.4; 2)/L(.2; 2) = (.4)²(.6)³/((.2)²(.8)³) = 27/16 = 1.6875.  (10.26)


[Figure 10.1: Binomial pmf and likelihood. Left panel: the Binomial(5, θ) pmf f(x | θ) at θ = .3. Right panel: the likelihood L(θ; x) for x = 2, as a function of θ.]

That is, based on seeing x = 2 heads in n = 5 flips of a coin, it is about 1.7 times as likely that the true probability of heads is θ = .4 versus θ = .2. Notice that the “cₓ" in the likelihood (10.25) is irrelevant, because we look at ratios.

Likelihood is not probability, in particular because θ is not necessarily random. Even if θ is random, the likelihood is not the pdf, but rather the part of the pdf that depends on x. That is, suppose π is the prior pdf of θ. Then the posterior pdf is

f(θ | x) = f(x | θ)π(θ) / ∫_Θ f(x | θ*)π(θ*)dθ* = cₓ L(θ; x)π(θ).  (10.27)

(Note: These densities should have subscripts, e.g., f(x | θ) should be f_{X|θ}(x | θ), but I hope leaving them off is not too confusing.) Here, the cₓ takes into account the cₓ in (10.25) as well as the integral in the denominator of (10.27) (which does not depend on θ; it is integrated away). Thus though the likelihood is not a density, it does tell us how to update the prior to obtain the posterior.

10.4.1 Maximum likelihood estimation

If L(θ; x) reveals how likely θ is in light of the data x, it seems reasonable that the most likely θ would be a decent estimate of θ. In fact, it is reasonable, and the resulting estimator is quite popular.

Definition 12. Given the model with likelihood L(θ; x) for θ ∈ Θ, the maximum likelihood estimate (MLE) at observation x is the unique value of θ that maximizes L(θ; x) over θ ∈ Θ, if such a unique value exists. Otherwise, the MLE does not exist at x.

There are times when the likelihood does not technically have a maximum, but there is an obvious limit of the θ's that approach the supremum.


For example, suppose X ∼ Uniform(0, θ). Then the likelihood at x > 0 is

L(θ; x) = (1/θ) I{x<θ}(θ) = { 1/θ if θ > x; 0 if θ ≤ x }.  (10.28)

For example, here is a plot of the likelihood when x = 4.

[Plot: the Uniform likelihood L(θ; x) for x = 4, as a function of θ ∈ (0, 10).]

The highest point occurs at θ = 4, almost. To be precise, there is no maximum, because the graph is not continuous at θ = 4. But we still take the MLE to be θ̂ = 4 in this case, because we could switch the filled-in dot from (4, 0) to (4, 1/4) by taking X ∼ Uniform(0, θ], so that the pdf at x = θ is 1/θ, not 0.

Often, the MLE is found by differentiating the likelihood, or the log of the likelihood (called the loglikelihood). Because the log function is strictly increasing, the same value of θ maximizes the likelihood and the loglikelihood. For example, in the binomial example, from (10.25) the loglikelihood is (dropping the cₓ)

l(θ; x) = log(L(θ; x)) = x log θ + (n − x) log(1 − θ).  (10.29)

Then

l′(θ; x) = x/θ − (n − x)/(1 − θ).  (10.30)

Set that expression to 0 and solve for θ to obtain

θ̂ = x/n.  (10.31)

From Figure 10.1, one can see that the maximum is at x/n = 0.4.

If θ is p × 1, then one must do a p-dimensional maximization. For example, suppose X₁, …, Xₙ are iid N(µ, σ²), (µ, σ²) ∈ R × (0, ∞). Then the loglikelihood is


(dropping the √(2π)'s)

log(L(µ, σ²; x₁, …, xₙ)) = −(1/(2σ²)) ∑(xᵢ − µ)² − (n/2) log(σ²).  (10.32)

We could go ahead and differentiate with respect to µ and σ², obtaining two equations. But notice that µ appears only in the sum of squares part, and we have already, in (10.18), found that µ̂ = x̄. Then for the variance, we need to maximize

log(L(x̄, σ²; x₁, …, xₙ)) = −(1/(2σ²)) ∑(xᵢ − x̄)² − (n/2) log(σ²).  (10.33)

Differentiating with respect to σ²,

(∂/∂σ²) log(L(x̄, σ²; x₁, …, xₙ)) = (1/(2σ⁴)) ∑(xᵢ − x̄)² − (n/2)(1/σ²),  (10.34)

and setting to 0 leads to

σ̂² = ∑(xᵢ − x̄)²/n = s².  (10.35)

So the MLE of (µ, σ²) is (x̄, s²).
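A quick numerical check of (10.35), as a sketch assuming SciPy: minimize the negative loglikelihood from (10.32) and compare with the closed form (x̄, s²).

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(3.0, 2.0, size=200)

def negloglik(par):
    mu, log_sig2 = par                     # parametrize by log(sigma^2) to keep it positive
    sig2 = np.exp(log_sig2)
    return np.sum((x - mu) ** 2) / (2 * sig2) + len(x) / 2 * np.log(sig2)

fit = minimize(negloglik, x0=[0.0, 0.0])
print("numerical MLE:", fit.x[0], np.exp(fit.x[1]))
print("closed form:  ", x.mean(), np.mean((x - x.mean()) ** 2))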

10.5 The posterior distribution

In the Bayesian framework, the inference is the posterior, because the posterior has all the information about the parameter given the data. Estimation of θ then becomes a matter of summarizing the distribution. In Example 7.6.2, if X | θ = θ ∼ Binomial(n, θ), and the prior is θ ∼ Beta(α, β), then

θ | X = x ∼ Beta(α + x, β + n − x).  (10.36)

One estimate is the posterior mean,

E[θ | X = x] = (α + x)/(α + β + n).  (10.37)

Another is the posterior mode, the value with the highest density, which turns out to be

Mode[θ | X = x] = (α + x − 1)/(α + β + n − 2).  (10.38)

Note that both of these estimates are modifications of the MLE, x/n, and are close to the MLE if the parameters α and β are small. One can think of α as the prior number of successes and α + β as the prior sample size; the larger α + β, the more weight the prior has in the posterior.

A 95% probability interval for the parameter could be (l, u) where

P[θ < l | X = x] = P[θ > u | X = x] = 0.025,  (10.39)

or commonly the l and u are chosen so that the interval is shortest, but still with 95% probability.
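A brief sketch of these posterior summaries with SciPy's Beta distribution, using the equal-tailed interval of (10.39); the values α = β = 1, n = 5, x = 2 are just for illustration.

from scipy import stats

alpha, beta, n, x = 1, 1, 5, 2
post = stats.beta(alpha + x, beta + n - x)                 # posterior (10.36)

post_mean = (alpha + x) / (alpha + beta + n)               # (10.37)
post_mode = (alpha + x - 1) / (alpha + beta + n - 2)       # (10.38)
interval = post.ppf([0.025, 0.975])                        # equal-tailed version of (10.39)

print("posterior mean:", post_mean)
print("posterior mode:", post_mode)
print("95% probability interval:", interval)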

Page 126: Mathematical  Statistics

118 Chapter 10. Introduction to Estimation

10.6 Bias, variance, and MSE

Rather than telling one what to think about the parameter, frequentist inference makes a statement, e.g., an estimate, confidence interval, or acceptance/rejection of a null hypothesis. The properties of the statement do not refer to the particular parameter at hand, but rather to how the statement would behave if the experiment were repeated over and over. Thus the “95%" in the 95% confidence interval means that if the same procedure were repeated over and over again in similar experiments, about 95% of the resulting intervals would contain the parameter. This 95% coverage rate is an operating characteristic.

In estimation, any function can be an estimator, but what is a good one? There are many operating characteristics, but here we will present just three: bias, variance, and mean squared error.

Definition 13. The bias of an estimator δ of g(θ) is the function

Bias(θ; δ) = Eθ[δ(X)] − g(θ),  θ ∈ Θ.  (10.40)

The estimator is unbiased if

Eθ[δ(X)] = g(θ) for all θ ∈ Θ.  (10.41)

An unbiased estimator is then one that, on average, will equal the quantity being estimated (the estimand). Generally, bias is not good, but whether we have to worry about absolutely zero bias, or just a little, is questionable. A good estimator will be close to the estimand, and being unbiased does not necessarily mean being close; e.g., half the time the estimator could be extremely low, and half extremely high, but the lows and highs balance out. Thus variance is also a key operating characteristic: a low variance means the estimator is likely to be near its expected value, which is the estimand if the estimator is unbiased. Alternatively, one could look at closeness directly, e.g., in a squared-error sense.

Definition 14. The mean squared error (MSE) of an estimator δ of g(θ) is

MSE(θ; δ) = Eθ[(δ(X) − g(θ))²].  (10.42)

The MSE combines bias and variance in a handy way. Letting e_δ(θ) = Eθ[δ(X)],

MSE(θ; δ) = Eθ[(δ(X) − g(θ))²]
 = Eθ[((δ(X) − e_δ(θ)) − (g(θ) − e_δ(θ)))²]
 = Eθ[(δ(X) − e_δ(θ))²] − 2Eθ[(δ(X) − e_δ(θ))(g(θ) − e_δ(θ))] + Eθ[(g(θ) − e_δ(θ))²]
 = Varθ[δ(X)] + Bias(θ; δ)²,  (10.43)

because g(θ) and e_δ(θ) are constants as far as X is concerned, so the cross-product term has expected value zero. Thus MSE balances bias and variance.

For an example, consider again X ∼ Binomial(n, θ), where we wish to estimate θ ∈ (0, 1). We have the MLE, δ(x) = x/n, and the Bayes posterior means, (10.37), for various α and β. To assess their operating characteristics, we find for the MLE,

Eθ[X/n] = θ and Varθ[X/n] = θ(1 − θ)/n,  (10.44)

and for prior θ ∼ Beta(α, β),

Eθ[(α + X)/(α + β + n)] = (α + nθ)/(α + β + n) and Varθ[(α + X)/(α + β + n)] = nθ(1 − θ)/(n + α + β)².  (10.45)

[Figure 10.2: Some MSE's for binomial estimators, as functions of θ: the MLE and the posterior means for (α, β) = (1, 1), (10, 10), and (10, 3).]

(Note that formally, the MLE is the posterior mean if α = β = 0. But that prior is not legitimate; that is, the parameters have to be positive in the beta.) We can see that the MLE is unbiased1, but the posterior means are biased, no matter what the parameters. On the other hand, the variance of the posterior means is always less than that of the MLE. Figure 10.2 graphs the MSE's for the MLE, and the posterior means for (α, β) = (1, 1), (10, 10), and (10, 3). Note that they are quite different, and that none of them is always best, i.e., lowest. The Bayes estimates tend to be good near the prior means, 1/2, 1/2, and 10/13, respectively. The Bayes estimate with (α, β) = (1, 1) is closest to having a flat, and fairly low, MSE. It is never very bad, though rarely the best.

1It is by no means a general rule that MLE’s are unbiased.


Which estimator to use depends on where you wish the best MSE. (Unless you insist on unbiasedness, in which case you are stuck with x/n.) If you use a beta prior to specify where you think the θ is, then the corresponding posterior mean is the best, as we shall see in the next section.

Chapter 12 goes into unbiasedness more deeply. Chapter 14 presents the formal decision-theoretic approach.
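For reference, here is a short sketch, assuming NumPy and taking n = 5 as in the earlier binomial example, that computes the exact MSE curves behind Figure 10.2 from (10.43)-(10.45).

import numpy as np

n = 5
theta = np.linspace(0.01, 0.99, 99)
mse_mle = theta * (1 - theta) / n                          # unbiased, so MSE = variance

def mse_posterior_mean(a, b):
    mean = (a + n * theta) / (a + b + n)                   # (10.45)
    var = n * theta * (1 - theta) / (n + a + b) ** 2
    return var + (mean - theta) ** 2                       # (10.43)

for (a, b) in [(1, 1), (10, 10), (10, 3)]:
    print(f"alpha={a}, beta={b}: max MSE = {mse_posterior_mean(a, b).max():.4f}")
print(f"MLE: max MSE = {mse_mle.max():.4f}")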

10.7 Functions of estimators

If θ̂ is an estimator of θ, then g(θ̂) is an estimator of g(θ). Do the properties of θ̂ transfer to g(θ̂)? Maybe, maybe not. Although there are exceptions (such as when g is linear for the first two statements), generally:

• If θ̂_U is an unbiased estimator of θ, then g(θ̂_U) is not an unbiased estimator of g(θ);
• If θ̂_B is the Bayes posterior mean of θ, then g(θ̂_B) is not the Bayes posterior mean of g(θ);
• If θ̂_mle is the MLE of θ, then g(θ̂_mle) is the MLE of g(θ).

The basic reason for the first two statements is the following:

Theorem 6. If Y is a random variable with finite mean, and g is a function of y, then

E[g(Y)] ≠ g(E[Y])  (10.46)

unless

• g is linear: g(y) = a + b y for constants a and b;

• Y is essentially constant: P[Y = µ] = 1 for some µ;

• You are lucky.

A simple example is g(x) = x². If E[X²] = (E[X])², then X has variance 0. As long as X is not a constant, E[X²] > (E[X])².

10.7.1 Example: Poisson

Suppose X ∼ Poisson(θ), θ ∈ (0, ∞). Then for estimating θ, θ̂_U = X is unbiased, and happens to be the MLE as well. For a Bayes estimate, take the prior θ ∼ Exponential(1). The likelihood is L(θ; x) = e^{−θ}θ^x, hence the posterior is

π(θ | x) = c L(θ; x)π(θ) = c e^{−θ}θ^x e^{−θ} = c θ^x e^{−2θ},  (10.47)

which is Gamma(α = x + 1, λ = 2). Thus from (3.75), the posterior mean with respect to this prior is

θ̂_B = E[θ | X = x] = (x + 1)/2.  (10.48)


Now consider estimating g(θ) = e^{−θ}, which is Pθ[X = 0]. (So if θ is the average number of telephone calls coming in an hour, e^{−θ} is the chance there are 0 calls in the next hour.) Is g(θ̂_U) unbiased? No:

E[g(θ̂_U)] = E[e^{−X}] = e^{−θ} ∑_{x=0}^∞ e^{−x} θ^x/x! = e^{−θ} ∑_{x=0}^∞ (e^{−1}θ)^x/x! = e^{−θ} e^{e^{−1}θ} = e^{θ(e^{−1}−1)} ≠ e^{−θ}.  (10.49)

There is an unbiased estimator for g, namely I[X = 0].

Turn to the posterior mean of g(θ), which is

ĝ(θ)_B = E[g(θ) | X = x] = ∫₀^∞ g(θ)π(θ | x)dθ
 = (2^{x+1}/Γ(x + 1)) ∫₀^∞ e^{−θ}θ^x e^{−2θ}dθ
 = (2^{x+1}/Γ(x + 1)) ∫₀^∞ θ^x e^{−3θ}dθ
 = (2^{x+1}/Γ(x + 1)) Γ(x + 1)/3^{x+1}
 = (2/3)^{x+1} ≠ e^{−θ̂_B} = e^{−(x+1)/2}.  (10.50)

For the MLE, we have to reparametrize. That is, with τ = g(θ) = e^{−θ}, the likelihood and loglikelihood are

L*(τ; x) = τ(−log(τ))^x and log(L*(τ; x)) = log(τ) + x log(−log(τ)).  (10.51)

Then

(∂/∂τ) log(L*(τ; x)) = 1/τ + x/(τ log(τ)) = 0  ⇒  τ̂_mle = e^{−x} = e^{−θ̂_mle}.  (10.52)

The MLE result follows in general for one-to-one functions because the likelihoods are related via

L(θ; x) = L*(g(θ); x),  (10.53)

hence

L(θ̂; x) > L(θ; x) ⇒ L*(g(θ̂); x) > L*(g(θ); x),  (10.54)

so if θ̂ maximizes L, then g(θ̂) maximizes L*.
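A small numerical sketch of this example, assuming NumPy: the three estimates of e^{−θ} at a single observation x, plus a Monte Carlo check that e^{−X} is biased while I[X = 0] is unbiased. The values x = 3 and θ = 1.5 are arbitrary.

import numpy as np

x = 3
print("exp(-x) (plug in theta_U):    ", np.exp(-x))
print("posterior mean (2/3)^(x+1):   ", (2 / 3) ** (x + 1))
print("exp(-theta_B) = exp(-(x+1)/2):", np.exp(-(x + 1) / 2))

theta = 1.5
rng = np.random.default_rng(5)
X = rng.poisson(theta, size=500_000)
print("E[exp(-X)] ~", np.exp(-X).mean(),
      " theory:", np.exp(theta * (np.exp(-1) - 1)),
      " target exp(-theta):", np.exp(-theta))
print("E[I[X=0]]  ~", (X == 0).mean())      # unbiased for exp(-theta)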


Chapter 11

Likelihood and sufficiency

11.1 The likelihood principle

The likelihood function is introduced in Section 10.4. It is fundamental to Bayesian inference, and extremely useful in frequentist inference. It encapsulates the relationship between θ and the data. In this section we look at likelihood more formally.

Definition 15. Suppose X has density f(x | θ) for θ ∈ Θ. Then a likelihood function for observation X = x is

L(θ; x) = cₓ f(x | θ),  θ ∈ Θ,  (11.1)

where cₓ is any positive constant.

Two likelihood functions are the same if they are proportional.

Definition 16. Suppose the two models, X with density f(x | θ) and Y with density g(y | θ), depend on the same parameter θ with space Θ. Then x and y have the same likelihood if for some positive constants cₓ and c_y in (11.1),

L(θ; x) = L*(θ; y) for all θ ∈ Θ,  (11.2)

where L and L* are their respective likelihoods.

The two models in Definition 16 could very well be the same, in which case x and y are two possible elements of X. As an example, suppose the model has X = (X₁, X₂), where the elements are iid N(µ, 1), and µ ∈ Θ = R. Then the pdf is

f(x | µ) = (1/(2π)) e^{−(1/2)∑(xᵢ−µ)²} = (1/(2π)) e^{−(1/2)(x₁²+x₂²) + (x₁+x₂)µ − µ²}.  (11.3)

Consider two possible observed vectors, x = (1, 2) and y = (−3, 6). These observations have quite different pdf values. Their ratio is

f(1, 2 | µ)/f(−3, 6 | µ) = e²⁰ ≈ 485165195.  (11.4)


This ratio is interesting in two ways: It shows the probability of being near x is almost half a billion times larger than the probability of being near y, and it shows the ratio does not depend on µ. That is, their likelihoods are the same:

L(µ; x) = c_(1,2) e^{3µ−µ²} and L(µ; y) = c_(−3,6) e^{3µ−µ²}.  (11.5)

It does not matter what the constants are. We could just take c_(1,2) = c_(−3,6) = 1, but the important aspect is that the sums of the two observations are both 3, and the sum is the only part of the data that hits µ.

11.1.1 Binomial and negative binomial

The models can be different, but they do need to share the same parameter. For example, suppose we have a coin with probability of heads being θ ∈ (0, 1), and we intend to flip it a number of times independently. Here are two possible experiments:

• Binomial. Flip the coin n = 10 times, and count X, the number of heads, so that X ∼ Binomial(10, θ).

• Negative binomial. Flip the coin until there are 4 heads, and count Y, the number of tails obtained. This Y is called Negative Binomial(4, θ).

The space of the Negative Binomial(k, θ) is Y = {0, 1, 2, …}, and the pmf can be found by noting that to obtain Y = y, there must be exactly k − 1 heads among the first k − 1 + y flips, then the kth head on the (k + y)th flip:

g(y | θ) = [ (k−1+y choose k−1) θ^{k−1}(1 − θ)^y ] × θ = (k−1+y choose k−1) θ^k(1 − θ)^y.  (11.6)

Next, suppose we perform the binomial experiment and obtain X = 4 heads out of 10 flips. The likelihood is the usual binomial one:

L(θ; 4) = (10 choose 4) θ⁴(1 − θ)⁶.  (11.7)

Also, suppose we perform the negative binomial experiment and happen to see Y = 6 tails before the k = 4th head. The likelihood here is

L*(θ; 6) = (9 choose 3) θ⁴(1 − θ)⁶.  (11.8)

The likelihoods are the same. I left the constants there to illustrate that the pmf's are definitely different, but erasing the constants leaves the same θ⁴(1 − θ)⁶. These two likelihoods are based on different random variables. The binomial has a fixed number of flips but could have any number of heads (between 0 and 10), while the negative binomial has a fixed number of heads but could have any number of flips (at least 4). In particular, either experiment with the given outcome would yield the same posterior for θ, and the same MLE for θ. □

The likelihood principle says that if two outcomes (whether they are from the same experiment or not) have the same likelihood, then any inference made about θ based on the outcomes must be the same. Any inference that is not the same under the two scenarios is said to violate the likelihood principle. Bayesian inference does not violate the likelihood principle, nor does maximum likelihood estimation, as long as your inference is just “Here is the estimate ...".

Unbiased estimation does violate the likelihood principle. Keeping with the above example, we know that X/n is an unbiased estimator of θ for the binomial. For Y ∼ Negative Binomial(k, θ), the unbiased estimator is found by ignoring the last flip, because we know that flip is always heads, so it would bias the estimate if used. That is,

E[θ̂*_U] = θ,  θ̂*_U = (k − 1)/(Y + k − 1).  (11.9)

(That was not a proof. One needs to show directly that it is unbiased.)

Now we test out the inference, “The unbiased estimate of θ is ...":

• Binomial(10, θ), with outcome x = 4: “The unbiased estimate of θ is θ̂_U = 4/10 = 2/5."

• Negative Binomial(4, θ), with outcome y = 6: “The unbiased estimate of θ is θ̂*_U = 3/9 = 1/3."

Those two situations have the same likelihood, but different estimates, thus violating the likelihood principle!1 On the other hand, the MLE in both cases would be 2/5.

The problem with unbiasedness is that it depends on the entire density, i.e., on outcomes not observed, so different densities would give different expected values. For that reason, any inference that involves the operating characteristics of the procedure violates the likelihood principle.

Whether one decides to fully accept the likelihood principle or not, it provides an important guide for any kind of inference, as we shall see.

1Don't worry too much. You won't be arrested.
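A short sketch of this subsection, assuming SciPy and NumPy: the Binomial(10, θ) pmf at x = 4 and the Negative Binomial(4, θ) pmf at y = 6 differ only by a constant factor in θ, and a Monte Carlo check of (11.9).

import numpy as np
from scipy import stats

theta_grid = np.linspace(0.1, 0.9, 9)
binom_pmf = stats.binom.pmf(4, 10, theta_grid)
negbin_pmf = stats.nbinom.pmf(6, 4, theta_grid)    # 6 tails before the 4th head
print("ratio, constant in theta:", binom_pmf / negbin_pmf)

# Monte Carlo check that (k - 1)/(Y + k - 1) is unbiased, as in (11.9)
rng = np.random.default_rng(6)
k, theta = 4, 0.3
Y = rng.negative_binomial(k, theta, size=500_000)
print("E[(k-1)/(Y+k-1)] ~", np.mean((k - 1) / (Y + k - 1)), " vs theta =", theta)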

11.2 Sufficiency

Consider again (11.3), or more generally, X₁, …, Xₙ iid N(µ, 1). The likelihood is

L(µ; x₁, …, xₙ) = e^{µ∑xᵢ − (n/2)µ²},  (11.10)

where the exp(−(1/2)∑xᵢ²) part can be dropped as it does not depend on µ. Note that this function depends on the xᵢ's only through their sum; that is, as in (11.5), if x and x* have the same sum, they have the same likelihood. Thus as far as the likelihood goes, all we need to know is the sum of the xᵢ's to make an inference about µ. This sum is a sufficient statistic.

Definition 17. Consider the model with space X and parameter space Θ. A function

s : X → S  (11.11)

is a sufficient statistic if for some function b,

b : S × Θ → [0, ∞),  (11.12)

the likelihood can be written

L(θ; x) = b(s(x), θ).  (11.13)



Thus S = s(X) is a sufficient statistic (it may be a vector) if by knowing S, you know the likelihood, i.e., it is sufficient for performing any inference. That is handy, because you can reduce your data set, which may be large, to possibly just a few statistics without losing any information. More importantly, it turns out that the best inferences depend on just the sufficient statistics.

We next look at some examples.

11.2.1 Examples

First, we note that for any model, the data x is itself sufficient, because the likelihood depends on x through x.

IID

If X₁, …, Xₙ are iid with density f(xᵢ | θ), then no matter what the model, the order statistics (see Section 6.3) are sufficient. To see this fact, write

L(θ; x) = f(x₁ | θ) ··· f(xₙ | θ) = f(x_(1) | θ) ··· f(x_(n) | θ) = b((x_(1), …, x_(n)), θ),  (11.14)

because the order statistics are just the xᵢ's in a particular order.

Normal

If X₁, …, Xₙ are iid N(µ, 1), Θ = R, then we have several candidates for sufficient statistic:

The data itself: s₁(x) = x
The order statistics: s₂(x) = (x_(1), …, x_(n))
The sum: s₃(x) = ∑xᵢ
The mean: s₄(x) = x̄
Partial sums: s₅(x) = (x₁ + x₂, x₃ + x₄ + x₅, x₆) (if n = 6).  (11.15)

An important fact is that any one-to-one function of a sufficient statistic is also sufficient, because knowing one means you know the other, hence you know the likelihood. For example, the mean and sum are one-to-one. Also, note that the dimensions of the sufficient statistics in (11.15) are different (n, n, 1, 1, and 3, respectively). Generally, one prefers the most compact one, in this case either the mean or the sum. Each of those is a function of the others. In fact, they are minimal sufficient.

Definition 18. A statistic s(x) is minimal sufficient if it is sufficient, and given any other sufficient statistic t(x), there is a function h such that s(x) = h(t(x)).

This concept is important, but we will focus on a more restrictive notion, completeness.

Which statistics are sufficient depends crucially on the parameter space. In the above, we assumed the variance known. But suppose X₁, …, Xₙ are iid N(µ, σ²), with (µ, σ²) ∈ Θ ≡ R × (0, ∞). Then the likelihood is

L(µ, σ²; x) = (1/σⁿ) e^{−(1/(2σ²))∑(xᵢ−µ)²} = (1/σⁿ) e^{−(1/(2σ²))∑xᵢ² + (µ/σ²)∑xᵢ − (n/(2σ²))µ²}.  (11.16)

Now we cannot eliminate the ∑xᵢ² part, because it involves σ². Here the sufficient statistic is two-dimensional:

s(x) = (s₁(x), s₂(x)) = (∑xᵢ, ∑xᵢ²),  (11.17)

and the b function is

b(s₁, s₂) = (1/σⁿ) e^{−(1/(2σ²))s₂ + (µ/σ²)s₁ − (n/(2σ²))µ²}.  (11.18)

Uniform

Suppose X₁, …, Xₙ are iid Uniform(0, θ), θ ∈ (0, ∞). The likelihood is

L(θ; x) = ∏ (1/θ) I{0<xᵢ<θ}(θ) = { 1/θⁿ if 0 < xᵢ < θ for all i; 0 if not }.  (11.19)

All the xᵢ's are less than θ if and only if the largest one is, so that we can write

L(θ; x) = { 1/θⁿ if 0 < x_(n) < θ; 0 if not }.  (11.20)

Thus the likelihood depends on x only through the maximum, hence x(n) is sufficient.

Double exponential

Now suppose X₁, …, Xₙ are iid Double Exponential(θ), θ ∈ R, which has pdf f(xᵢ | θ) = (1/2) exp(−|xᵢ − θ|), so that the likelihood is

L(θ; x) = e^{−∑|xᵢ−θ|}.  (11.21)

Because the data are iid, the order statistics are sufficient, but unfortunately, there is not another sufficient statistic with smaller dimension. The absolute value bars cannot be removed. Similar models based on the Cauchy, logistic, and others have the same problem.

Exponential families

In some cases, the sufficient statistic does reduce the dimensionality of the data significantly, such as in the iid normal case, where no matter how large n is, the sufficient statistic is two-dimensional. In other cases, such as the double exponential above, there is no dimensionality reduction, so one must still carry around n values. Exponential families are special families in which there is substantial reduction for iid variables or vectors. Because the likelihood of an iid sample is the product of individual likelihoods, statistics “add up" when they are in an exponent.

The vector X has an exponential family distribution if its density (pdf or pmf) depends on a p × 1 vector θ and can be written

f(x | θ) = a(x) e^{θ₁t₁(x) + ··· + θₚtₚ(x) − ψ(θ)}  (11.22)

for some functions t₁(x), …, tₚ(x), a(x), and ψ(θ). The θ is called the natural parameter, and t(x) = (t₁(x), …, tₚ(x))′ is the vector of natural sufficient statistics. Now if X₁, …, Xₙ are iid vectors with density (11.22), then the joint density is

f(x₁, …, xₙ | θ) = ∏ f(xᵢ | θ) = [∏ a(xᵢ)] e^{θ₁∑ᵢt₁(xᵢ) + ··· + θₚ∑ᵢtₚ(xᵢ) − nψ(θ)},  (11.23)

hence has sufficient statistic

s(x₁, …, xₙ) = (∑ᵢ t₁(xᵢ), …, ∑ᵢ tₚ(xᵢ))′ = ∑ᵢ t(xᵢ),  (11.24)

which has dimension p no matter how large n is.

The natural parameters and sufficient statistics are not necessarily the most “natural" to us. For example, in the normal case of (11.16), the natural parameters can be taken to be

θ₁ = µ/σ² and θ₂ = −1/(2σ²).  (11.25)

The corresponding statistics are

t(xᵢ) = (t₁(xᵢ), t₂(xᵢ))′ = (xᵢ, xᵢ²)′.  (11.26)

There are other choices, e.g., we could switch the negative sign from θ₂ to t₂. Other examples are the Poisson(λ), which has θ = log(λ), and the Binomial(n, p), which has θ = log(p/(1 − p)); also the gamma, beta, multivariate normal, multinomial, and others.
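A tiny sketch, assuming NumPy, of the normal natural sufficient statistics (11.26): two samples with the same (∑xᵢ, ∑xᵢ²) have identical loglikelihoods at every (µ, σ²), even though their values differ.

import numpy as np

def suff_stat(x):
    """Natural sufficient statistics summed over the sample: (sum x_i, sum x_i^2)."""
    return np.array([np.sum(x), np.sum(x ** 2)])

def loglik(x, mu, sig2):
    """Loglikelihood of iid N(mu, sigma^2), dropping the sqrt(2 pi)'s."""
    return -np.sum((x - mu) ** 2) / (2 * sig2) - len(x) / 2 * np.log(sig2)

x1 = np.array([0.0, 0.0, 3.0])
x2 = np.array([-1.0, 2.0, 2.0])                # different values, same sufficient statistics
print(suff_stat(x1), suff_stat(x2))

for (mu, sig2) in [(0.0, 1.0), (2.0, 0.5), (-1.0, 4.0)]:
    print(loglik(x1, mu, sig2) - loglik(x2, mu, sig2))   # 0 at every (mu, sigma^2)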

11.3 Conditioning on a sufficient statistic

The intuitive meaning of a sufficient statistic is that once you know the statistic, nothing else about the data helps in inference about the parameter. For example, in the iid case, once you know the values of the xᵢ's, it does not matter in what order they are listed. This notion can be formalized by finding the conditional distribution of the data given the sufficient statistic, and showing that it does not depend on the parameter.

First, a lemma that makes it easy to find the conditional density, if there is one.

Lemma 13. Suppose X has space X and density f_X, and S = s(X) is a function of X with space S and density f_S. Then the conditional density of X given S, if it exists, is

f_{X|S}(x | s) = f_X(x)/f_S(s) for x ∈ X_s ≡ {x ∈ X | s(x) = s}.  (11.27)

The caveat “if it exists" in the lemma is unnecessary in the discrete case, because the conditional pmf will always exist. But in continuous or mixed cases, the resulting conditional distribution may not have a density with respect to Lebesgue measure. It will have a density with respect to some measure, though, which one would see in a measure theory course.

Proof, at least in the discrete case. Suppose X is discrete. Then by Bayes Theorem,

f_{X|S}(x | s) = f_{S|X}(s | x) f_X(x) / f_S(s),  x ∈ X_s.  (11.28)


Because S is a function of X,

f_{S|X}(s | x) = P[s(X) = s | X = x] = { 1 if s(x) = s; 0 if s(x) ≠ s } = { 1 if x ∈ X_s; 0 if x ∉ X_s }.  (11.29)

Thus the term f_{S|X}(s | x) equals 1 in (11.28), because x ∈ X_s, so we can erase it, yielding (11.27). □

Here is the main result.

Lemma 14. Suppose s(x) is a sufficient statistic for a model with data X and parameter space Θ. Then the conditional distribution X | s(X) = s does not depend on θ.

Before giving a proof, consider the example with X₁, …, Xₙ iid Poisson(θ), θ ∈ (0, ∞). The likelihood is

L(θ; x) = ∏ e^{−θ}θ^{xᵢ} = e^{−nθ}θ^{∑xᵢ}.  (11.30)

We see that s(x) = ∑xᵢ is sufficient. We know that S = ∑Xᵢ is Poisson(nθ), hence

f_S(s) = e^{−nθ}(nθ)^s/s!.  (11.31)

Then by Lemma 13, the conditional pmf of X given S is

f_{X|S}(x | s) = f_X(x)/f_S(s) = (e^{−nθ}θ^{∑xᵢ}/∏xᵢ!) / (e^{−nθ}(nθ)^s/s!) = (s!/∏xᵢ!)(1/n^s),  x ∈ X_s = {x | ∑xᵢ = s}.  (11.32)

That is a multinomial distribution:

X | ∑Xᵢ = s ∼ Multinomialₙ(s, (1/n, …, 1/n)).  (11.33)

But the main point of Lemma 14 is that this distribution does not depend on θ. Thus knowing the sum means the exact values of the xᵢ's do not reveal anything extra about θ.
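A short simulation sketch of (11.33), assuming NumPy: condition iid Poisson draws on their sum and compare with direct multinomial draws; the values n = 3, θ = 2, s = 6 are arbitrary.

import numpy as np

rng = np.random.default_rng(7)
n, theta, s = 3, 2.0, 6

X = rng.poisson(theta, size=(400_000, n))
keep = X[X.sum(axis=1) == s]                    # condition on the sum being s
print("conditional mean given sum = s:", keep.mean(axis=0), " theory:", s / n)

M = rng.multinomial(s, [1 / n] * n, size=len(keep))   # direct Multinomial(s, (1/n,...,1/n))
print("multinomial mean:              ", M.mean(axis=0))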

The key to the result lies in (11.30) and (11.31), where we see that the likelihoods of X and s(X) are the same if ∑xᵢ = s. This fact is a general one.

Lemma 15. Suppose s(X) is a sufficient statistic for the model with data X, and consider the model for S, where S =_D s(X). Then the likelihoods for X = x and S = s are the same if s = s(x).

This result should make sense, because it is saying that the sufficient statistic contains the same information about θ that the full data do. One consequence is that instead of having to work with the full model, one can work with the sufficient statistic's model. For example, instead of working with X₁, …, Xₙ iid N(µ, σ²), you can work with just the two independent random variables X̄ ∼ N(µ, σ²/n) and S² ∼ σ²χ²_{n−1}/n without losing any information.

We prove the lemma in the discrete case.

Proof. Let f_X(x | θ) be the pmf of X, and f_S(s | θ) be that of S. Because s(X) is sufficient, the likelihood for X can be written

L(θ; x) = cₓ f_X(x | θ) = b(s(x), θ)  (11.34)

by Definition 17, for some cₓ and b(s, θ). The pmf of S is

f_S(s | θ) = Pθ[s(X) = s] = ∑_{x∈X_s} f_X(x | θ)  (where X_s = {x ∈ X | s(x) = s})
 = ∑_{x∈X_s} b(s(x), θ)/cₓ
 = b(s, θ) ∑_{x∈X_s} 1/cₓ  (because in the summation, s(x) = s)
 = b(s, θ) d_s,  (11.35)

where d_s is that sum of 1/cₓ's. Thus the likelihood of S can be written

L*(θ; s) = b(s, θ),  (11.36)

the same as L in (11.34). □

A formal proof for the continuous case proceeds by introducing appropriate extra variables Y so that X and (S, Y) are one-to-one, then using Jacobians, then integrating out the Y. We will not do that, but special cases can be done easily, e.g., if X₁, …, Xₙ are iid N(µ, 1), one can directly show that X and s(X) = ∑Xᵢ ∼ N(nµ, n) have the same likelihood.

Now for the proof of Lemma 14, again in the discrete case. Basically, we just repeat the calculations for the Poisson.

Proof. As in the proof of Lemma 15, the pmf's of X and S can be written, respectively,

f_X(x | θ) = L(θ; x)/cₓ and f_S(s | θ) = L*(θ; s) d_s,  (11.37)

where the likelihoods are equal in the sense that

L(θ; x) = L*(θ; s) if x ∈ X_s.  (11.38)

Then by Lemma 13,

f_{X|S}(x | s, θ) = f_X(x | θ)/f_S(s | θ) = [L(θ; x)/cₓ] / [L*(θ; s) d_s] = 1/(cₓ d_s) for x ∈ X_s  (11.39)

by (11.38), which does not depend on θ. □

We end this section with a couple of additional examples that derive the conditional distribution of X given s(X).


11.3.1 Example: IID

Suppose X₁, …, Xₙ are iid continuous random variables with pdf f(xᵢ). No matter what the model, we know that the order statistics are sufficient, so we will suppress the θ. We can proceed as in the proof of Lemma 14. Letting s(x) = (x_(1), …, x_(n)), we know that the joint pdf's of X and S ≡ s(X) are, respectively,

f_X(x) = ∏ f(xᵢ) and f_S(s) = n! ∏ f(x_(i)).  (11.40)

The products of the pdf's are the same, just written in different orders. Thus

P[X = x | s(X) = s] = 1/n! for x ∈ X_s.  (11.41)

The s is a particular set of ordered values, and X_s is the set of all x that have the same values as s, but in any order. To illustrate, suppose that n = 3 and s(X) = s = (1, 2, 7). Then X has a conditional chance of 1/6 of being any x with order statistics (1, 2, 7):

X_(1,2,7) = {(1, 2, 7), (1, 7, 2), (2, 1, 7), (2, 7, 1), (7, 1, 2), (7, 2, 1)}.  (11.42)

The discrete case works as well, although the counting is a little more complicated when there are ties. For example, if s(x) = (1, 3, 3, 4), there are 4!/2! = 12 different orderings.

11.3.2 Example: Normal mean

Suppose X₁, …, Xₙ are iid N(µ, 1), so that X̄ is sufficient (as is ∑Xᵢ). We wish to find the conditional distribution of X given X̄ = x̄. Because we have normality and linear functions, everything is straightforward. First we need the joint distribution of W = (X̄, X′)′, which is a linear transformation of X ∼ N(µ1ₙ, Iₙ), hence multivariate normal. We could figure out the matrix for the transformation, but all we really need are the mean and covariance matrix of W. The mean and covariance of the X part we know, and the mean and variance of X̄ are µ and 1/n, respectively. All that is left is the covariance of the Xᵢ's with X̄, which are all the same, and can be found directly:

Cov[Xᵢ, X̄] = (1/n) ∑_{j=1}^n Cov[Xᵢ, Xⱼ] = 1/n,  (11.43)

because Xᵢ is independent of all the Xⱼ's except for Xᵢ itself, with which its covariance is 1. Thus

W = (X̄ ; X) ∼ N( µ1_{n+1}, [ 1/n  (1/n)1ₙ′ ; (1/n)1ₙ  Iₙ ] ).  (11.44)

For the conditional distribution, we use (8.101) through (8.103). The X there is X̄ here, and the Y there is the X here. Thus

Σ_XX = 1/n, Σ_YX = (1/n)1ₙ, Σ_YY = Iₙ, µ_X = µ, and µ_Y = µ1ₙ,  (11.45)

and

β = (1/n)1ₙ(1/n)^{−1} = 1ₙ and α = µ1ₙ − βµ = µ1ₙ − 1ₙµ = 0ₙ.  (11.46)


Thus the conditional mean is

E[X | X̄ = x̄] = α + β x̄ = x̄ 1n. (11.47)

That result should not be surprising. It says that if you know the sample mean is x̄, you expect on average the observations to be x̄. The conditional covariance is

ΣYY − ΣYX ΣXX^{−1} ΣXY = In − (1/n)1n (1/n)^{−1} (1/n)1n′ = In − (1/n)1n1n′ = Hn, (11.48)

the centering matrix from (8.42). Putting it all together:

X | X̄ = x̄ ∼ N(x̄ 1n, Hn). (11.49)

We can pause and reflect that this distribution is free of µ, as it had better be. Note also that if we subtract the conditional mean, we have

X − X̄ 1n | X̄ = x̄ ∼ N(0n, Hn). (11.50)

That conditional distribution is free of x̄, meaning the vector is independent of X̄. But this is the vector of deviations, and we already knew it is independent of X̄ from (8.46).
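None of the following is in the text, but (11.43)–(11.50) are easy to check by simulation. Here is a minimal Python/numpy sketch (the choices n = 5, µ = 2, and the number of replications are arbitrary) that estimates Cov[Xi, X̄] and the covariance between the deviations and X̄:

import numpy as np

rng = np.random.default_rng(0)
n, mu, reps = 5, 2.0, 200_000

x = rng.normal(mu, 1.0, size=(reps, n))   # each row is an iid N(mu, 1) sample of size n
xbar = x.mean(axis=1)

# Cov[X_i, Xbar] should be 1/n for each i, as in (11.43)
cov_xi_xbar = np.cov(x[:, 0], xbar)[0, 1]

# The deviations X - Xbar*1_n should be uncorrelated with (indeed independent of) Xbar, as in (11.50)
dev = x - xbar[:, None]
cov_dev_xbar = np.cov(dev[:, 0], xbar)[0, 1]

print(cov_xi_xbar, 1 / n)   # both approximately 0.2
print(cov_dev_xbar)         # approximately 0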

11.4 Rao-Blackwell: Improving an estimator

We see that sufficient statistics are nice in that we do not lose anything by restricting to them. In fact, they are more than just convenient: if you base an estimate on more of the data than the sufficient statistic, then you do lose something. For example, suppose X1, . . . , Xn are iid N(µ, 1), µ ∈ R, and we wish to estimate

g(µ) = Pµ[Xi ≤ 10] = Φ(10 − µ), (11.51)

Φ being the distribution function of the standard normal. An unbiased estimator is

δ(x) = #{xi | xi ≤ 10}/n = (1/n) ∑ I_{xi ≤ 10}(xi). (11.52)

Note that this estimator is not a function of just x̄, the sufficient statistic. The claim is that there is then another unbiased estimator which is a function of just x̄ that is better, i.e., has lower variance.

Fortunately, there is a way to find such an estimator besides guessing. We use conditioning as in the previous section, and the results on conditional means and variances in Section 7.4.3. Start with an estimator δ(x) of g(θ), and a sufficient statistic S = s(X). Then consider the conditional expected value of δ:

δ∗(s) = E[δ(X) | s(X) = s]. (11.53)

First, we need to make sure δ∗ is an estimator, which means that it does not depend on the unknown θ. But by Lemma 14, we know that the conditional distribution of X given S does not depend on θ, and δ does not depend on θ because it is an estimator, hence δ∗ is an estimator. If we condition on something not sufficient, then we may not end up with an estimator.


Is δ∗ a good estimator? From (7.43), we know it has the same expected value as δ:

E[δ∗(S) | θ] = E[δ(X) | θ], (11.54)

so that

Bias(θ; δ∗) = Bias(θ; δ). (11.55)

Thus we haven't done worse in terms of bias, and in particular if δ is unbiased, so is δ∗.

Turning to variance, we have the equation (7.48), which translated to our situation here is

Var[δ(X) | θ] = E[v(S) | θ] + Var[δ∗(S) | θ], (11.56)

where

v(s) = Var[δ(X) | s(X) = s, θ]. (11.57)

Whatever v is, it is not negative, hence

Var[δ∗(S) | θ] ≤ Var[δ(X) | θ]. (11.58)

Thus variance-wise, δ∗ is no worse than δ. In fact, δ∗ is strictly better unless v(S) is zero with probability one. But in that case, δ and δ∗ are the same, so that δ itself is a function of just S already.

Finally, if the bias is the same and the variance of δ∗ is lower, then the mean squared error of δ∗ is better. To summarize:

Theorem 7. Rao-Blackwell². If δ is an estimator of g(θ), and s(X) is sufficient, then δ∗(s) given in (11.53) has the same bias as δ, and smaller variance and MSE, unless δ is a function of just s(X), in which case δ and δ∗ are the same.

11.4.1 Example: Normal probability

Consider the estimator δ(x) in (11.52) for g(µ) = Φ(10 − µ) in (11.51) in the normal case. With X̄ being the sufficient statistic, we can find the conditional expected value of δ. We start by finding the conditional expected value of just one of the I_{xi ≤ 10}(xi)'s. It turns out that the conditional expectation is the same for each i, hence the conditional expected value of δ is the same as the conditional expected value of just one. So we are interested in finding

δ∗(x̄) = E[I_{Xi ≤ 10}(Xi) | X̄ = x̄] = P[Xi ≤ 10 | X̄ = x̄]. (11.59)

From (11.49) we have that

Xi | X̄ = x̄ ∼ N(x̄, 1 − 1/n), (11.60)

hence

δ∗(x̄) = P[Xi ≤ 10 | X̄ = x̄]
       = P[N(x̄, 1 − 1/n) ≤ 10]
       = Φ( (10 − x̄)/√(1 − 1/n) )
       = Φ( √(n/(n−1)) (10 − x̄) ). (11.61)

2David Blackwell received his Ph. D. from the Math Department at UIUC.


This estimator is then guaranteed to be unbiased, and to have a lower variance than δ. It would have been difficult to come up with this estimator directly, or even show that it is unbiased, but the original δ is quite straightforward, as is the conditional calculation.
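As a sanity check (not from the text), the following Python sketch simulates both estimators for assumed values n = 10 and µ = 9; both sample averages should be near Φ(10 − µ), with the Rao-Blackwellized version showing the smaller variance:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, mu, reps = 10, 9.0, 100_000       # assumed values, just for illustration
g = norm.cdf(10 - mu)                 # target g(mu) = P[Xi <= 10]

x = rng.normal(mu, 1.0, size=(reps, n))
delta = (x <= 10).mean(axis=1)                                        # the estimator in (11.52)
delta_star = norm.cdf(np.sqrt(n / (n - 1)) * (10 - x.mean(axis=1)))   # the estimator in (11.61)

print(g, delta.mean(), delta_star.mean())   # all close: both estimators are unbiased
print(delta.var(), delta_star.var())        # delta_star has the smaller variance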

11.4.2 Example: IID

In Example 11.3.1, we saw that when the observations are iid, the conditional distribution of X given the order statistics is uniform over all the permutations of the observations. One consequence is that any estimator must be invariant under permutations. For a simple example, consider estimating µ, the mean, with the simple estimator δ(X) = X1. Then with s(x) being the order statistics,

δ∗(s) = E[X1 | s(X) = s] = (1/n) ∑ si = s̄, (11.62)

because X1 is conditionally equally likely to be any of the order statistics. Of course, the mean of the order statistics is the same as x̄, so we have that X̄ is a better estimate than X1, which we knew. The procedure applied to any weighted average will also end up with the mean, e.g.,

E[ (1/2)X1 + (1/3)X2 + (1/6)X4 | s(X) = s ] = (1/2)E[X1 | s(X) = s] + (1/3)E[X2 | s(X) = s] + (1/6)E[X4 | s(X) = s]
                                            = (1/2)s̄ + (1/3)s̄ + (1/6)s̄
                                            = s̄. (11.63)

Turning to the variance σ², because X1 − X2 has mean 0 and variance 2σ², δ(x) = (x1 − x2)²/2 is an unbiased estimator of σ². Conditioning on the order statistics, we obtain the mean of all the (xi − xj)²/2's with i ≠ j:

δ∗(s) = E[ (X1 − X2)²/2 | s(X) = s ] = (1/(n(n−1))) ∑∑_{i ≠ j} (xi − xj)²/2. (11.64)

After some algebra, we can write

δ∗(s(x)) = ∑(xi − x̄)²/(n − 1), (11.65)

the usual unbiased estimate of σ².

The above estimators are special cases of U-statistics, for which there are many nice asymptotic results. They are studied in STAT575. A U-statistic is based on a kernel h(x1, . . . , xd), a function of a subset of the observations. The corresponding U-statistic is the symmetrized version of the kernel, i.e., the conditional expected value,

u(x) = E[h(X1, . . . , Xd) | s(X) = s(x)]
     = (1/(n(n−1) · · · (n−d+1))) ∑ · · · ∑_{i1,...,id distinct} h(x_{i1}, . . . , x_{id}). (11.66)
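For small n, (11.66) can be computed by brute force. The sketch below (the helper u_statistic is written for this note, not taken from the text) averages a kernel over all ordered d-tuples of distinct indices and checks that h(x1, x2) = (x1 − x2)²/2 symmetrizes to the usual sample variance, as in (11.64)–(11.65):

import itertools
import numpy as np

def u_statistic(x, h, d):
    # Average the kernel h over all ordered d-tuples of distinct indices, as in (11.66)
    vals = [h(*x[list(idx)]) for idx in itertools.permutations(range(len(x)), d)]
    return np.mean(vals)

x = np.array([3.1, 0.4, 2.2, 5.0, 1.7])        # arbitrary data
h = lambda a, b: (a - b) ** 2 / 2              # kernel for sigma^2
print(u_statistic(x, h, 2), np.var(x, ddof=1)) # equal, as in (11.65)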


Chapter 12

UMVUE’s

12.1 Unbiased estimation

Section 10.6 introduces the notion of unbiased estimators. Several questions arise when estimating a particular parameter in a particular model:

• Are there any unbiased estimators?

• Is there a unique unbiased estimator?

• If there is no unique one, which (if any) is best?

• How does one find unbiased estimators?

The answer to all of the questions is, "It depends." In particular, there is no general automatic method for finding unbiased estimators, in contrast to maximum likelihood estimators or Bayes estimators. The latter two types may be difficult to calculate, but that is a mathematical or computational problem, not a statistical one.

One method that occasionally works for finding unbiased estimators is to find a power series from the definition. For example, suppose X ∼ Binomial(n, θ), θ ∈ (0, 1). We know X/n is an unbiased estimator for θ, but what about estimating g(θ) = θ(1 − θ), the numerator of the variance? An unbiased δ will satisfy

Eθ[δ(X)] = θ(1− θ) for all θ ∈ (0, 1), (12.1)

or

∑_{x=0}^{n} δ(x) (n choose x) θ^x (1 − θ)^{n−x} = θ(1 − θ). (12.2)

Both sides of equation (12.2) are polynomials in θ, hence for equality to hold for every θ ∈ (0, 1), the coefficients of each power θ^k must match.

First, suppose n = 1, so that we need

δ(0)(1 − θ) + δ(1)θ = θ(1 − θ) = θ − θ². (12.3)

There is no δ that will work, because the coefficient of θ² on the right-hand side is −1, and on the left-hand side is 0. That is, with n = 1, there is no unbiased estimator of θ(1 − θ).


With n = 2, we have

δ(0)(1 − θ)² + 2δ(1)θ(1 − θ) + δ(2)θ² = θ(1 − θ),
i.e., δ(0) + (−2δ(0) + 2δ(1))θ + (δ(0) − 2δ(1) + δ(2))θ² = θ − θ². (12.4)

Matching coefficients of θ^k:

k    Left-hand side           Right-hand side
0    δ(0)                     0
1    −2δ(0) + 2δ(1)           1
2    δ(0) − 2δ(1) + δ(2)      −1
(12.5)

we see that the only solution is

δ(0) = 0, δ(1) = 1/2, δ(2) = 0, (12.6)

which actually is easy to see directly from the first line of (12.4). Thus the δ in (12.6) must be the best unbiased estimator, being the only one. In fact, because for any function δ(x), Eθ[δ(X)] is an nth-degree polynomial in θ, the only functions g(θ) that have unbiased estimators are those that are themselves polynomials of degree n or less. For example, 1/θ and e^θ do not have unbiased estimators.

For n > 2, estimating θ(1 − θ) is more complicated, but in principle will yield a best unbiased estimator. Or we can use the approach of conditioning on a sufficient statistic in Section 11.4, particularly Example 11.4.2. We know that X = Z1 + · · · + Zn, where the Zi's are iid Bernoulli(θ). The estimator δ in (12.6) for n = 2 Bernoulli's can be given as (z1 − z2)²/2. (Check it.) Then conditioning on the order statistics of the Zi's we obtain the same result as in (11.65), i.e., a better unbiased estimator of θ(1 − θ) is

∑(Zi − Z̄)²/(n − 1) = [∑ Zi² − (∑ Zi)²/n]/(n − 1)
                    = [∑ Zi − (∑ Zi)²/n]/(n − 1) (because Zi² = Zi)
                    = [X − X²/n]/(n − 1) (because X = ∑ Zi)
                    = X(1 − X/n)/(n − 1). (12.7)

Is this estimator the only unbiased one (of θ(1 − θ)) for the binomial? Yes, which can be shown using the results in Section 12.2. (Or if we had continued matching coefficients, we would have found it to be unique.)
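Unbiasedness of (12.7) can be confirmed by summing directly against the binomial pmf; the following short Python check (with an arbitrary choice n = 7) is one way to do it:

from math import comb

def expect(delta, n, theta):
    # E_theta[delta(X)] for X ~ Binomial(n, theta), by direct summation over the pmf
    return sum(delta(x) * comb(n, x) * theta**x * (1 - theta)**(n - x)
               for x in range(n + 1))

n = 7
delta = lambda x: x * (1 - x / n) / (n - 1)   # the estimator in (12.7)
for theta in (0.1, 0.3, 0.5, 0.9):
    print(expect(delta, n, theta), theta * (1 - theta))   # match up to rounding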

Another approach is to take a biased estimator, and see if it can be modified to become unbiased. For example, if X1, . . . , Xn are iid Poisson(θ), the MLE of g(θ) = e^{−θ} is

ĝ(θ) = e^{−X̄}. (12.8)


Because S = ∑ Xi ∼ Poisson(nθ),

E[ĝ(θ) | θ] = E[e^{−S/n} | θ]
            = ∑_s e^{−nθ} ((nθ)^s/s!) e^{−s/n}
            = e^{−nθ} ∑_s (nθ e^{−1/n})^s/s!
            = e^{−nθ} e^{nθ e^{−1/n}}
            ≠ e^{−θ}. (12.9)

So the MLE is biased. Now e^{−1/n} ≈ 1 − 1/n, and if we make that substitution in the exponent of E[ĝ(θ)], it would be unbiased. Thus if we modify the S/n a bit we may find an unbiased estimator. Consider

ĝ(θ)_c = e^{−cX̄} (12.10)

for some constant c. Working through (12.9) again but with c/n in place of 1/n, we obtain

E[ĝ(θ)_c | θ] = e^{−nθ} e^{nθ e^{−c/n}}. (12.11)

We want

−n + n e^{−c/n} = −1 ⇒ c = −n log(1 − 1/n) ≡ c∗, (12.12)

hence an unbiased estimator is

ĝ(θ)_{c∗} = (1 − 1/n)^{n x̄}. (12.13)

We have seen that there may or may not be an unbiased estimator. Often, there are many. For example, suppose X1, . . . , Xn are iid Poisson(θ). Then X̄ and ∑(Xi − X̄)²/(n − 1) are both unbiased, being unbiased estimators of the mean and variance, respectively. Weighted averages of the Xi's are also unbiased, e.g.,

X1, (X1 + X2)/2, and (1/3)X1 + (1/2)X2 + (1/6)X3 (12.14)

are all unbiased. Among all the possibilities, is there one that is best, i.e., has the lowest variance? There is not a good general answer, but the rest of this chapter uses sufficiency and the concept of completeness to find unique best unbiased estimators in many situations.

12.2 Completeness: Unique unbiased estimators

We have already answered the question of finding the best unbiased estimator in certain situations without realizing it. We know that any estimator that is not a function of just the sufficient statistic can be improved. Also, if there is only one unbiased estimator that is a function of the sufficient statistic, then it must be the best one that depends on only the sufficient statistic. Furthermore, it must be the best overall, because it is better than any estimator that is not a function of just the sufficient statistic.


The concept of "there being only one unbiased estimator" is called completeness. It is a property attached to a model. Consider the model with random vector T, space T, and parameter space Θ. Suppose there are at least two unbiased estimators of some function g(θ) in this model, say δg(t) and δ∗g(t). That is,

Pθ[δg(T) ≠ δ∗g(T)] > 0 for some θ ∈ Θ, (12.15)

and

Eθ[δg(T)] = g(θ) = Eθ[δ∗g(T)] for all θ ∈ Θ. (12.16)

Then

Eθ[δg(T) − δ∗g(T)] = 0 for all θ ∈ Θ. (12.17)

This δg(t) − δ∗g(t) is an unbiased estimator of 0. Now suppose δh(t) is an unbiased estimator of the function h(θ). Then so is

δ∗h(t) = δh(t) + δg(t) − δ∗g(t), (12.18)

because

Eθ[δ∗h(T)] = Eθ[δh(T)] + Eθ[δg(T) − δ∗g(T)] = h(θ) + 0. (12.19)

That is, if there is more than one unbiased estimator of one function, then there is more than one unbiased estimator of any other function (that has at least one unbiased estimator).

Logically, it follows that if there is only one unbiased estimator of some function, then there is only one (or zero) unbiased estimator of any function. That one function may as well be the zero function.

Definition 19. Suppose for the model given by (T, T, Θ), the only unbiased estimator of 0 is 0 itself. That is, suppose

Eθ[δ(T)] = 0 for all θ ∈ Θ (12.20)

implies that

Pθ[δ(T) = 0] = 1 for all θ ∈ Θ. (12.21)

Then the model is complete.

The important implication follows.

Lemma 16. Suppose the model given by (T, T, Θ) is complete. Then there exists at most one unbiased estimator of any function g(θ).

Illustrating with the Binomial again, suppose X ∼ Binomial(n, θ) with θ ∈ (0, 1). Suppose δ(x) is an unbiased estimator of 0:

Eθ[δ(X)] = 0 for all θ ∈ (0, 1). (12.22)

We know the left-hand side is a polynomial in θ, as is the right-hand side. All the coefficients of θ^i are zero on the right, hence on the left. Write

Eθ[δ(X)] = δ(0)(n choose 0)(1 − θ)^n + δ(1)(n choose 1)θ(1 − θ)^{n−1} (12.23)
         + · · · + δ(n−1)(n choose n−1)θ^{n−1}(1 − θ) + δ(n)(n choose n)θ^n. (12.24)


The coefficient of θ^0, i.e., the constant, arises from just the first term, so is δ(0)(n choose 0). For that to be 0, we have δ(0) = 0. Erasing that first term, we see that the coefficient of θ is δ(1)(n choose 1), hence δ(1) = 0. Continuing, we see that δ(2) = · · · = δ(n) = 0, which means that δ(x) = 0, which means that the only unbiased estimator of 0 is 0 itself. Hence this model is complete.

12.2.1 Example: Poisson

Suppose X1, . . . , Xn are iid Poisson(θ), θ ∈ (0, ∞), with n > 1. Is this model complete? No. Consider δ(x) = x1 − x2:

Eθ[δ(X)] = Eθ[X1] − Eθ[X2] = θ − θ = 0, (12.25)

but

Pθ[δ(X) = 0] = Pθ[X1 = X2] < 1. (12.26)

So 0 is not the only unbiased estimator of 0; X1 − X2 is another. You can come up with an infinite number, in fact.

Now let S = X1 + · · · + Xn, which is a sufficient statistic. The model for S then has space S = {0, 1, 2, . . .} and distribution S ∼ Poisson(nθ) for θ ∈ (0, ∞). Is the model for S complete? Suppose δ∗(s) is an unbiased estimator of 0. Then

Eθ[δ∗(S)] = 0 for all θ ∈ (0, ∞) ⇒ ∑_{s=0}^{∞} δ∗(s) e^{−nθ} (nθ)^s/s! = 0 for all θ ∈ (0, ∞)
                                 ⇒ ∑_{s=0}^{∞} δ∗(s) (n^s/s!) θ^s = 0 for all θ ∈ (0, ∞)
                                 ⇒ δ∗(s) n^s/s! = 0 for all s ∈ S
                                 ⇒ δ∗(s) = 0 for all s ∈ S. (12.27)

Thus the only unbiased estimator of 0 that is a function of S is 0, meaning the model for S is complete.

12.2.2 Example: Uniform

Suppose X1, . . . , Xn are iid Uniform(0, θ), θ ∈ (0, ∞), with n > 1. This model again is not complete. No iid model is complete, in fact, if n > 1. Consider the sufficient statistic S = max{X1, . . . , Xn}. The model for S has space (0, ∞) and pdf

fθ(s) = n s^{n−1}/θ^n if 0 < s < θ, and 0 if not. (12.28)

To see if the model for S is complete, suppose that δ∗ is an unbiased estimator of 0. Then

Eθ[δ∗(S)] = 0 for all θ ∈ (0, ∞) ⇒ ∫_0^θ δ∗(s) s^{n−1} ds / θ^n = 0 for all θ ∈ (0, ∞)
                                 ⇒ ∫_0^θ δ∗(s) s^{n−1} ds = 0 for all θ ∈ (0, ∞)
                (taking d/dθ)    ⇒ δ∗(θ) θ^{n−1} = 0 for (almost) all θ ∈ (0, ∞)
                                 ⇒ δ∗(θ) = 0 for (almost) all θ ∈ (0, ∞). (12.29)


That is, δ∗ must be 0, so that the model for S is complete. [The "(almost)" means that one can deviate from zero for a few values (with total Lebesgue measure 0) without changing the fact that Pθ[δ∗(S) = 0] = 1 for all θ.]

12.3 Uniformly minimum variance estimators

This section culminates the work of this chapter. First, a definition.

Definition 20. In the model with X, X, and parameter space Θ, δ is a uniformly minimum variance unbiased estimator (UMVUE) of the function g(θ) if it is unbiased,

Eθ [δ(X)] = g(θ) for all θ ∈ Θ, (12.30)

has finite variance, and for any other unbiased estimator δ′,

Varθ [δ(X)] ≤ Varθ [δ′(X)] for all θ ∈ Θ. (12.31)

And the theorem.

Theorem 8. Suppose S = s(X) is a sufficient statistic for the model (X, X, Θ), and the model for S is complete. Then if δ∗(s) is an unbiased estimator (depending on s) of the function g(θ), δUMVUE(x) = δ∗(s(x)) is the UMVUE of g(θ).

Proof. Let δ be any unbiased estimator of g(θ), and consider

eδ(s) = E[δ(X) | S = s]. (12.32)

Because S is sufficient, eδ does not depend on θ, so it is an estimator. Furthermore, since

Eθ[eδ(S)] = Eθ[δ(X)] = g(θ) for all θ ∈ Θ, (12.33)

it is unbiased, and because it is a conditional expectation, as in (11.58),

Varθ[eδ(S)] ≤ Varθ[δ(X)] for all θ ∈ Θ. (12.34)

But by completeness of the model for S, there is only one unbiased estimator that is a function of just S, δ∗, hence δ∗(s) = eδ(s) (almost everywhere), and

Varθ[δ∗(S)] ≤ Varθ[δ(X)] for all θ ∈ Θ. (12.35)

This equation holds for any unbiased δ, so that δ∗ is best, or, written as a function of x, δUMVUE(x) = δ∗(s(x)) is best. □

This proof is actually constructive in a sense. If you do not know δ∗, but have any unbiased δ, then you can find the UMVUE by using Rao-Blackwell (Theorem 7), conditioning on the sufficient statistic. Or, if you can by any means find an unbiased estimator that is a function of S, then it is UMVUE.


12.3.1 Example: Poisson

Consider again Example 12.2.1. We have that S = X1 + · · · + Xn is sufficient, and the model for S is complete. Because

Eθ[S/n] = θ, (12.36)

we have that S/n is the UMVUE of θ.

Now let g(θ) = e^{−θ} = Pθ[X1 = 0]. (If Xi is the number of phone calls in one minute, then g(θ) is the chance there are 0 calls in the next minute.) We have from (12.13) an unbiased estimator of g(θ) that is a function of S, hence it is UMVUE. But finding the estimator took a bit of work, and luck. Instead, start with a very simple estimator,

δ(x) = I_{x1=0}(x1), (12.37)

which indicates whether there were 0 calls in the first minute. That estimator is unbiased, but obviously not using all the data. From (11.33) we have that X given S = s is multinomial, hence X1 is binomial:

X1 | S = s ∼ Binomial(s, 1/n). (12.38)

Then

E[δ(X) | s(X) = s] = P[X1 = 0 | S = s] = P[Binomial(s, 1/n) = 0] = (1 − 1/n)^s. (12.39)

That is the UMVUE, and (as must be) it is the same as the estimator in (12.13).
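A quick simulation (not part of the text; n and θ are arbitrary choices) illustrates the comparison between the biased MLE e^{−X̄} and the UMVUE (1 − 1/n)^S:

import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 8, 1.5, 200_000     # assumed values for illustration
g = np.exp(-theta)                    # g(theta) = P[X1 = 0]

x = rng.poisson(theta, size=(reps, n))
s = x.sum(axis=1)
mle = np.exp(-s / n)                  # the biased MLE, (12.8)-(12.9)
umvue = (1 - 1 / n) ** s              # the UMVUE, (12.13) and (12.39)

print(g, mle.mean(), umvue.mean())    # umvue.mean() is close to g; mle.mean() is biased upward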

12.4 Completeness for exponential families

Showing completeness of a model in general is not an easy task. Fortunately, exponential families as in (11.22) usually do provide completeness for the natural sufficient statistic. To present the result, we start assuming the model for the p × 1 vector T has exponential family density with θ ∈ Θ ⊂ R^p as the natural parameter, and T itself as the natural sufficient statistic:

fθ(t) = a(t) e^{θ1 t1 + · · · + θp tp − ψ(θ)}. (12.40)

Lemma 17. In the above model for T, suppose that the parameter space contains a nonempty open p-dimensional rectangle. Then the model is complete.

If T is a lattice, e.g., the set of vectors containing all nonnegative integers, then the lemma can be proven by looking at the power series in the e^{θi}'s. In general, the proof would use uniqueness of Laplace transforms, as would the proof of the uniqueness of moment generating functions.

The requirement on the parameter space guards against exact constraints among the parameters. For example, suppose X1, . . . , Xn are iid N(µ, σ²), where µ is known to be positive and the coefficient of variation, σ/µ, is known to be 10%. The exponent in the pdf, with µ = 10σ, is

(µ/σ²) ∑ xi − (1/(2σ²)) ∑ xi² = (10/σ) ∑ xi − (1/(2σ²)) ∑ xi². (12.41)


The exponential family terms are then

θ = (10/σ, −1/(2σ²)) and T = (∑ Xi, ∑ Xi²). (12.42)

The parameter space, with p = 2, is

Θ = {(10/s, −1/(2s²)) | s > 0}, (12.43)

which does not contain a two-dimensional open rectangle, because θ2 = −θ1²/200. Thus we cannot use the lemma to show completeness.

NOTE. It is important to read the lemma carefully. It does not say that if the requirement is violated, then the model is not complete. (Although realistic counterexamples are hard to come by.) To prove a model is not complete, you must produce a nontrivial unbiased estimator of 0. For example, in (12.42),

Eθ[(T1/n)²] = Eθ[X̄²] = µ² + σ²/n = (10σ)² + σ²/n = (100 + 1/n)σ² (12.44)

and

Eθ[T2/n] = Eθ[Xi²] = µ² + σ² = (10σ)² + σ² = (100 + 1)σ². (12.45)

Then

δ(t1, t2) = (t1/n)²/(100 + 1/n) − (t2/n)/(100 + 1) (12.46)

has expected value 0, but is not zero itself. Thus the model is not complete.

12.4.1 Examples

Suppose p = 1, so that the natural sufficient statistic and parameter are both scalars, T and θ, respectively. Then all that is needed is that T is not constant, and Θ contains an interval (a, b), a < b. The table below has some examples, where X1, . . . , Xn are iid with the given distribution (the parameter space is assumed to be the most general):

Distribution       Sufficient statistic T    Natural parameter θ    Θ
N(µ, 1)            ∑ Xi                      µ                      R
N(0, σ²)           ∑ Xi²                     −1/(2σ²)               (−∞, 0)
Poisson(λ)         ∑ Xi                      log(λ)                 R
Exponential(λ)     ∑ Xi                      −λ                     (−∞, 0)
(12.47)

The parameter spaces all contain open intervals; in fact they are open intervals. Thus the models for the T's are all complete.

For X1, . . . , Xn iid N(µ, σ²), with (µ, σ²) ∈ R × (0, ∞), the exponential family terms are

θ = (µ/σ², −1/(2σ²)) and T = (∑ Xi, ∑ Xi²). (12.48)

Here, without any extra constraints on µ and σ2, we have

Θ = R× (−∞, 0), (12.49)


because for any (a, b) with a ∈ R, b < 0, σ² = −1/(2b) ∈ (0, ∞) and µ = −a/(2b) ∈ R are valid parameter values. Thus the model is complete for (T1, T2). From the UMVUE Theorem 8, we then have that any function of (T1, T2) is the UMVUE for its expected value. For example, X̄ is the UMVUE for µ, and S² = ∑(Xi − X̄)²/(n − 1) is the UMVUE for σ². Also, since E[X̄²] = µ² + σ²/n, the UMVUE for µ² is X̄² − S²/n.

12.5 Fisher’s Information and the Cramér-Rao Lower Bound

If you find a UMVUE, then you know you have the best unbiased estimator. Oftentimes, there is no UMVUE, or there is one but it is too difficult to calculate. Then what? At least it would be informative to have an idea of whether a given estimator is very poor, or just not quite optimal. One approach is to find a lower bound for the variance. The closer an estimator is to the lower bound, the better.

The model for this section has X, X, densities fθ(x), and parameter space Θ = (a, b) ⊂ R. We need various assumptions, which we will postpone detailing until Section 17.1. Mainly, we need that the pdf is always positive (so that Uniform(0, θ)'s are not allowed), has several derivatives with respect to θ, and that certain expected values are finite. Suppose δ is an unbiased estimator of g(θ), so that

Eθ[δ(X)] = ∫_X δ(x) fθ(x) dx = g(θ) for all θ ∈ Θ. (12.50)

(Use a summation if the density is a pmf.) Now take the derivative with respect to θ of both sides. Assuming the following steps are valid, we have

g′(θ) = (∂/∂θ) ∫_X δ(x) fθ(x) dx
      = ∫_X δ(x) (∂/∂θ) fθ(x) dx
      = ∫_X δ(x) [ (∂/∂θ) fθ(x) / fθ(x) ] fθ(x) dx
      = ∫_X δ(x) (∂/∂θ) log(fθ(x)) fθ(x) dx
      = Eθ[ δ(X) (∂/∂θ) log(fθ(X)) ]. (12.51)

(Don't ask how anyone decided to take these steps. That's why these people are famous.)

The final term looks almost like a covariance. It is, because if we put δ(x) = 1 in the steps (12.51), since ∫_X δ(x) fθ(x) dx = 1, and the derivative of 1 is 0,

0 = Eθ[ (∂/∂θ) log(fθ(X)) ]. (12.52)

Then (12.51) and (12.52) show that

g′(θ) = Covθ[ δ(X), (∂/∂θ) log(fθ(X)) ]. (12.53)

Now we can use the correlation inequality, Cov[U, V]² ≤ Var[U] Var[V]:

g′(θ)² ≤ Varθ[δ(X)] Varθ[ (∂/∂θ) log(fθ(X)) ], (12.54)


or

Varθ[δ(X)] ≥ g′(θ)² / Varθ[ (∂/∂θ) log(fθ(X)) ]. (12.55)

This equation is the key. The Cramér-Rao lower bound (CRLB) for unbiased estimators of g(θ) is

CRLBg(θ) = g′(θ)²/I(θ), (12.56)

where I is Fisher's Information in the model,

I(θ) = Varθ[ (∂/∂θ) log(fθ(X)) ]. (12.57)

Under appropriate conditions, among them that I(θ) is positive and finite, for any unbiased estimator δ of g(θ),

Varθ[δ(X)] ≥ g′(θ)²/I(θ). (12.58)

Chapter 17 studies Fisher's information in more depth. Notice that I(θ) is a function just of the model, not of any particular estimator or estimand¹. The larger it is, the better, because that means the variance of the estimator is potentially lower. So information is good.

If an unbiased estimator achieves the CRLB, that is,

Varθ[δ(X)] = g′(θ)²/I(θ) for all θ ∈ Θ, (12.59)

then it is the UMVUE. The converse is not necessarily true. The UMVUE may not achieve the CRLB, which of course means that no unbiased estimator does. There are other more accurate bounds, called Bhattacharya bounds, that the UMVUE may achieve in such cases.

12.5.1 Example: Double exponential

Suppose X1, . . . , Xn are iid Double Exponential(θ), θ ∈ R, so that

fθ(x) = (1/2^n) e^{−∑ |xi − θ|}. (12.60)

This density is not an exponential family one, and if n > 1, the sufficient statistic is the vector of order statistics, which does not have a complete model. Thus the UMVUE approach does not bear fruit. Because Eθ[Xi] = θ, X̄ is an unbiased estimator of θ, with

Varθ[X̄] = Varθ[Xi]/n = 2/n. (12.61)

Is this variance reasonable? We will compare it to the CRLB:

(∂/∂θ) log(fθ(x)) = (∂/∂θ) ( −∑ |xi − θ| ) = ∑ Sign(xi − θ), (12.62)

1That which is being estimated.


because the derivative of |x| is +1 if x > 0 and −1 if x < 0. (We are ignoring the possibility that xi = θ, where the derivative does not exist.) By symmetry of the density around θ, each Sign(Xi − θ) has a probability of 1/2 to be either −1 or +1, so it has mean 0 and variance 1. Thus

I(θ) = Varθ[∑ Sign(Xi − θ)] = n. (12.63)

Here, g(θ) = θ, hence

CRLBg(θ) = g′(θ)²/I(θ) = 1/n. (12.64)

Compare this bound to the variance in (12.61). The variance of X̄ is twice the CRLB, which is not very good. It appears that there should be a better estimator. In fact, there is, which we will see next in Chapter 13, although even that estimator does not achieve the CRLB.
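A small simulation makes the comparison concrete (a sketch with an arbitrary n = 25; the Double Exponential is the Laplace distribution in numpy):

import numpy as np

rng = np.random.default_rng(3)
n, theta, reps = 25, 0.0, 100_000
x = rng.laplace(theta, 1.0, size=(reps, n))

print(np.var(x.mean(axis=1)), 2 / n)    # sample mean: variance approximately 2/n, as in (12.61)
print(np.var(np.median(x, axis=1)))     # sample median: noticeably smaller
print(1 / n)                            # the CRLB (12.64)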

12.5.2 Example: Normal µ2

Suppose X1, . . . , Xn are iid N(µ, 1), µ ∈ R. We know that X̄ is sufficient, and the model for it is complete, hence any unbiased estimator based on the sample mean is UMVUE. In this case, g(µ) = µ², and Eµ[X̄²] = µ² + 1/n, hence

δ(x) = x̄² − 1/n (12.65)

is the UMVUE. To find the variance of the estimator, start by noting that √n X̄ ∼ N(√n µ, 1), so that its square is noncentral χ² (Definition 10),

n X̄² ∼ χ²_1(nµ²). (12.66)

Thus from (8.67),

Varµ[n X̄²] = 2 + 4nµ² (= 2ν + 4∆). (12.67)

Finally,

Varµ[δ(X)] = Varµ[X̄²] = 2/n² + 4µ²/n. (12.68)

For the CRLB, we first need Fisher's Information. Start with

(∂/∂µ) log(fµ(x)) = (∂/∂µ) ( −(1/2) ∑(xi − µ)² ) = ∑(xi − µ). (12.69)

The Xi's are independent with variance 1, so that I(µ) = n. Thus with g(µ) = µ²,

CRLBg(µ) = (2µ)²/n = 4µ²/n. (12.70)

Comparing (12.68) to (12.70), we see that the UMVUE does not achieve the CRLB, which implies that no unbiased estimator will. But note that the variance of the UMVUE is only off by 2/n², so that for large n, it is close to achieving the CRLB. The ratio

CRLBg(µ)/Varµ[δ(X)] = 1/(1 + 1/(2nµ²)) (12.71)

measures how close to ideal the estimator is; the closer to 1, the better. Letting n → ∞, we obtain a measure of asymptotic efficiency, which in this case is 1.


Chapter 13

Location Families and Shift Equivariance

13.1 Introduction

Unbiasedness is one criterion to use for choosing estimators. It is generally a desirable property, and restricting to unbiased estimators eases the search for good estimators. On the other hand, exact unbiasedness is sometimes too restrictive, e.g., there may be no unbiased estimators of a particular quantity, or slightly biased estimators (such as MLE's) are more pleasing in other ways.

Other useful criteria go under the names invariance and equivariance. These are related properties but not identical; even so, they are often conflated. As an example of an equivariant estimator, suppose X1, . . . , Xn are the observed heights in inches of a sample of adults, and X̄ is the sample mean, which turns out to be 67 inches. Now change the scale of these heights to centimeters, so that the data are now Y1 = 2.54X1, . . . , Yn = 2.54Xn. What is the sample mean of the Yi's?

Ȳ = (Y1 + · · · + Yn)/n = (2.54X1 + · · · + 2.54Xn)/n = 2.54 (X1 + · · · + Xn)/n = 2.54 X̄ = 170.18. (13.1)

The idea is that if one multiplies all the data by the same nonzero constant c, the sample mean is also multiplied by c. The sample mean is said to be equivariant under scale transformations. The sample median is also scale equivariant.

An affine transformation is one which changes both location and scale, e.g., the change from degrees centigrade to degrees Fahrenheit. If X1, . . . , Xn are measurements in degrees centigrade, then

Yi = 32 + (9/5)Xi (13.2)

is in degrees Fahrenheit. The sample mean (and median) is again equivariant under this transformation, that is, Ȳ = 32 + (9/5)X̄. A shrinkage estimator here would not be equivariant, e.g., 0.95 × X̄ is not equivariant: 0.95Ȳ = 0.95 × 32 + 0.95(9/5)X̄ ≠ 32 + (9/5)(0.95X̄). An argument for equivariance is that the estimator should not depend on whether you are doing the calculations in the US or in Europe (or any other non-US place).

The technicalities needed in using invariance considerations start with a group of transformations that act on the data. This action induces an action on the distribution of the data. For example, if the data are iid N(µ, σ²), and the observations are multiplied by c, then the distribution becomes iid N(cµ, c²σ²). For invariance considerations to make sense, the model has to be invariant under the group. Then we can talk about invariant and equivariant functions. All these notions are explained in Section 13.5. The present chapter treats the special case of shift invariance as applied to what are termed "location families."

13.2 Location families

A location family of distributions is one for which the only parameter is the center; the shape of the density stays the same. We will deal with the one-dimensional iid case, where the data are X1, . . . , Xn iid, the space for X = (X1, . . . , Xn) is X = R^n, and the density of Xi is f(xi − θ) for some given density f. The parameter space is R. Thus the density of X is

fθ(x) = ∏_{i=1}^n f(xi − θ). (13.3)

This θ may or may not be the mean or median. Examples include the Xi's being N(θ, 1), or Uniform(θ, θ + 1), or Double Exponential(θ) as in (12.60). By contrast, Uniform(0, θ) and Exponential(θ) are not location families since the spread as well as the center is affected by the θ.

A location family model is shift-invariant, because if we add the same constant a to all the Xi's, the resulting random vector has the same model as X. That is, suppose a ∈ R is fixed, and we look at the transformation

X∗ = (X∗1 , . . . , X∗n) = (X1 + a, . . . , Xn + a). (13.4)

The inverse transformation uses Xi = X∗i − a, and the |Jacobian| = 1, so the density of X∗ is

f∗θ(x∗) = ∏_{i=1}^n f(x∗i − a − θ) = ∏_{i=1}^n f(x∗i − θ∗), where θ∗ = θ + a. (13.5)

Thus adding a to everything just shifts everything by a. Note that the space of X∗ is the same as the space of X, R^n, and the space for the parameter θ∗ is the same as that for θ, R:

                   Model                        Model∗
Data               X                            X∗
Sample space       R^n                          R^n
Parameter space    R                            R
Density            ∏_{i=1}^n f(xi − θ)          ∏_{i=1}^n f(x∗i − θ∗)
(13.6)

Thus the two models are the same. Not that X and X∗ are equal, but that the sets of distributions considered for them are the same. This is the sense in which the model is shift-invariant.

Note. If the model includes a prior on θ, then the two models are not the same, because the prior distribution on θ∗ would be different from that on θ.


13.3 Shift-equivariant estimators

Suppose δ(x) is an estimator of θ in a location family model. Because the two models in (13.6) are exactly the same, δ(x∗) would be an estimator of θ∗. Adding a to the xi's adds a to the θ, hence a "reasonable" estimator would behave similarly. That is,

if δ(x) is an estimator of θ, and δ(x∗) = δ((x1 + a, . . . , xn + a)) is an estimator of θ∗ = θ + a, (13.7)

then we should have

δ((x1 + a, . . . , xn + a)) = δ((x1, . . . , xn)) + a. (13.8)

An estimator satisfying (13.8) is called shift-equivariant. Shifting the data by a shifts the estimator by a. Examples of shift-equivariant estimators include x̄, the median, and (Min + Max)/2.

A typical estimator may have its variance, bias, or MSE depending on the parameter. E.g., in the binomial case, the variance of the MLE is p(1 − p)/n. It turns out that in the location family case, a shift-equivariant estimator's variance, bias, and (hence) MSE are all independent of θ. The key is that we can normalize X to have parameter 0:

X1, . . . , Xn iid fθ =⇒ Z1, . . . , Zn iid f0, (13.9)

where

Z1 = X1 − θ, . . . , Zn = Xn − θ. (13.10)

Letting a = −θ in (13.8),

δ(X1, . . . , Xn) = δ(X1 − θ, . . . , Xn − θ) + θ =_D δ(Z1, . . . , Zn) + θ. (13.11)

Then

Eθ[δ(X1, . . . , Xn)] = E0[δ(Z1, . . . , Zn)] + θ, (13.12)

hence

Bias(θ; δ) = Eθ[δ(X1, . . . , Xn)] − θ = E0[δ(Z1, . . . , Zn)], (13.13)

Varθ[δ(X)] = Eθ[(δ(X) − Eθ[δ(X)])²] = E0[(δ(Z) − E0[δ(Z)])²] = Var0[δ(Z)], (13.14)

and

MSE(θ; δ) = Varθ[δ(X)] + Bias²(θ; δ) = Var0[δ(Z)] + E0[δ(Z)]² = E0[δ(Z)²]. (13.15)

Thus when comparing shift-equivariant estimators, it is enough to look at the MSE when θ = 0. The next subsection finds the best such estimator.
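This invariance of the MSE is easy to see numerically. The sketch below (arbitrary choices of family, n, and the two θ values; not from the text) estimates the MSE of two shift-equivariant estimators at θ = 0 and θ = 5:

import numpy as np

rng = np.random.default_rng(4)
n, reps = 11, 100_000

def mse(estimate, theta):
    z = rng.laplace(theta, 1.0, size=(reps, n))   # any location family works; Laplace as an example
    return np.mean((estimate(z) - theta) ** 2)

for est in (lambda x: x.mean(axis=1), lambda x: np.median(x, axis=1)):
    # up to simulation noise, the MSE is the same at theta = 0 and theta = 5, as (13.15) predicts
    print(mse(est, 0.0), mse(est, 5.0))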

13.3.1 The Pitman estimator

By the best shift-equivariant estimator, we mean the equivariant estimator with the lowest mean square error for all θ. It is analogous to the UMVUE, the unbiased estimator with the lowest variance ≡ MSE.

For any biased equivariant estimator, it is fairly easy to find one that is unbiased but has the same variance by just shifting it a bit. For example, suppose that the Xi's are iid Uniform(θ, θ + 1). Let δ(x) = x̄. Then

Eθ[δ(X)] = Eθ[Xi] = θ + 1/2 and Varθ[δ(X)] = Varθ[Xi]/n = 1/(12n), (13.16)


hence the bias is E0[X̄] = 1/2, and

MSEθ[δ(X)] = 1/(12n) + 1/4. (13.17)

But we can de-bias the estimator easily enough by subtracting the bias 1/2, giving δ∗(x) = x̄ − 1/2. Then the bias is 0, but the variance is the same as for δ, hence

MSEθ[δ∗] = 1/(12n) < 1/(12n) + 1/4 = MSEθ[δ]. (13.18)

That works for any shift-equivariant estimator δ, i.e., suppose δ is biased, and b = E0[δ(Z)] is the bias. Then δ∗ = δ − b is also shift-equivariant, with

Var0[δ∗(Z)] = Var0[δ(Z)] and E0[δ∗(Z)] = 0, (13.19)

hence

MSE(θ; δ∗) = Var0[δ(Z)] < Var0[δ(Z)] + E0[δ(Z)]² = MSE(θ; δ). (13.20)

Thus if a shift-equivariant estimator is biased, it can be improved, ergo ...

Lemma 18. If δ is the best shift-equivariant estimator, then it is unbiased.

Notice that this lemma implies that in the normal case, X̄ is the best shift-equivariant estimator, because it is the best unbiased estimator, and it is shift-equivariant. But, of course, in the normal case, X̄ is always the best at everything.

Now consider the general location family case. To find the best estimator, we first characterize the shift-equivariant estimators. We can use a similar trick as we did when finding the UMVUE, when we took a simple unbiased estimator, then found its expected value given the sufficient statistic. Sufficiency is not very interesting, because for most f, there is no complete sufficient statistic. (The normal and uniform are a couple of exceptions.) But we start with an arbitrary shift-equivariant estimator δ, and let "Xn" be another shift-equivariant estimator, and look at the difference:

δ(x1, . . . , xn) − xn = δ(x1 − xn, x2 − xn, . . . , xn−1 − xn, 0). (13.21)

Define the function v on the differences yi = xi − xn,

v : R^{n−1} −→ R
    y −→ v(y) = δ(y1, . . . , yn−1, 0). (13.22)

Then what we have is that for any equivariant δ, there is a v such that

δ(x) = xn + v(x1 − xn, . . . , xn−1 − xn). (13.23)

Now instead of trying to find the best δ, we will look for the best v, then use (13.23) to get the δ. From (13.15), the best v is found by minimizing

E0[δ(Z)²] = E0[(Zn + v(Z1 − Zn, . . . , Zn−1 − Zn))²]. (13.24)

The trick is to condition on the differences Z1 − Zn, . . . , Zn−1 − Zn:

E0[(Zn + v(Z1 − Zn, . . . , Zn−1 − Zn))2] = E0[e(Z1 − Zn, . . . , Zn−1 − Zn)], (13.25)


where

e(y1, . . . , yn−1) = E0[(Zn + v(Z1 − Zn, . . . , Zn−1 − Zn))² | Z1 − Zn = y1, . . . , Zn−1 − Zn = yn−1]
                  = E0[(Zn + v(y1, . . . , yn−1))² | Z1 − Zn = y1, . . . , Zn−1 − Zn = yn−1]. (13.26–13.28)

It is now possible to minimize e for each fixed set of yi's, e.g., by differentiating with respect to v(y1, . . . , yn−1). But we know the minimizer is minus the (conditional) mean of Zn, that is, the best v is

v(y1, . . . , yn−1) = −E[Zn | Z1 − Zn = y1, . . . , Zn−1 − Zn = yn−1]. (13.29)

That is a straightforward calculation. We just need the joint pdf divided by the marginal to get the conditional pdf, then find the expected value. For the joint, let

Y1 = Z1 − Zn, . . . , Yn−1 = Zn−1 − Zn, (13.30)

and find the pdf of (Y1, . . . , Yn−1, Zn). We use the Jacobian approach, so need the inverse function:

z1 = y1 + zn, . . . , zn−1 = yn−1 + zn, zn = zn. (13.31)

Taking the ∂zi/∂yi’s, etc., we have

J = [ 1 0 0 · · · 0 1
      0 1 0 · · · 0 1
      0 0 1 · · · 0 1
      ⋮ ⋮ ⋮  ⋱  ⋮ ⋮
      0 0 0 · · · 1 1
      0 0 0 · · · 0 1 ]. (13.32)

The determinant of J is just 1, then, so the pdf of (Y1, . . . , Yn−1, Zn) is

f∗(y1, . . . , yn−1, zn) = f(zn) ∏_{i=1}^{n−1} f(yi + zn), (13.33)

and the marginal of the Yi's is

f∗Y(y1, . . . , yn−1) = ∫_{−∞}^{∞} f(zn) ∏_{i=1}^{n−1} f(yi + zn) dzn. (13.34)

The conditional pdf:

f∗(zn | y1, . . . , yn−1) = f(zn) ∏_{i=1}^{n−1} f(yi + zn) / ∫_{−∞}^{∞} f(u) ∏_{i=1}^{n−1} f(yi + u) du. (13.35)

The conditional mean is then

E[Zn | Y1 = y1, . . . , Yn−1 = yn−1] = ∫_{−∞}^{∞} zn f(zn) ∏_{i=1}^{n−1} f(yi + zn) dzn / ∫_{−∞}^{∞} f(zn) ∏_{i=1}^{n−1} f(yi + zn) dzn
                                    = −v(y1, . . . , yn−1). (13.36)


The best δ uses that v, so that from (13.23), with yi = xi − xn,

δ(x) = xn − ∫_{−∞}^{∞} zn f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn / ∫_{−∞}^{∞} f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn
     = ∫_{−∞}^{∞} (xn − zn) f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn / ∫_{−∞}^{∞} f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn. (13.37)

Just to make it look better, make the change-of-variables θ = xn − zn, so that zn = xn − θ, and

δ(x) = ∫_{−∞}^{∞} θ f(xn − θ) ∏_{i=1}^{n−1} f(xi − θ) dθ / ∫_{−∞}^{∞} f(xn − θ) ∏_{i=1}^{n−1} f(xi − θ) dθ
     = ∫_{−∞}^{∞} θ ∏_{i=1}^{n} f(xi − θ) dθ / ∫_{−∞}^{∞} ∏_{i=1}^{n} f(xi − θ) dθ. (13.38)

That's the best, called the Pitman Estimator:

Theorem 9. In the location-family model with individual density f, if there is an equivariant estimator with finite MSE, then the best equivariant estimator is given by the Pitman Estimator (13.38).

The final expression in (13.38) is very close to a Bayes posterior mean. With prior π, the posterior mean is

E[θ | X = x] = ∫_{−∞}^{∞} θ fθ(x) π(θ) dθ / ∫_{−∞}^{∞} fθ(x) π(θ) dθ. (13.39)

Thus the Pitman Estimator can be thought of as the posterior mean for prior π(θ) = 1. Unfortunately, that π is not a pdf, because it integrates to +∞. Such a prior is called an improper prior.

13.4 Examples

One thing special about this theorem is that it is constructive, that is, there is a formula telling exactly how to find the best. By contrast, when finding the UMVUE, you have to do some guessing. If you find an unbiased estimator that is a function of a complete sufficient statistic, then you have it, but it may not be easy to find. You might be able to use the power-series approach, or find an easy one then use Rao-Blackwell, but maybe not. A drawback is that the integrals may not be easy to perform analytically, although in practice it would be straightforward to use numerical integration (a short numerical sketch follows). The homework has a couple of examples, the Normal and uniform, in which the integrals are doable. Here are additional examples:
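For instance, here is a rough numerical-integration sketch of (13.38) using scipy (the helper pitman and its integration limits are ad hoc choices for this note, not from the text); for the normal location family the result should agree with x̄:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def pitman(x, f, lo=-20, hi=20):
    # Pitman estimator (13.38) by numerical integration; f is the individual density
    joint = lambda t: np.prod(f(np.asarray(x) - t))
    num, _ = quad(lambda t: t * joint(t), lo, hi)
    den, _ = quad(joint, lo, hi)
    return num / den

x = [1.2, -0.7, 0.3, 2.1]                 # arbitrary data
print(pitman(x, norm.pdf), np.mean(x))    # for the normal location family these agree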

13.4.1 Exponential

Here, f(x) = e^{−x} I[x > 0], i.e., the Exponential(1) pdf. The location family is not Exponential(θ), but a shifted Exponential(1) that starts at θ rather than at 0:

f(x − θ) = e^{−(x−θ)} I[x − θ > 0] = e^{−(x−θ)} if x > θ, and 0 otherwise. (13.40)


Then the pdf of X is

f(x | θ) = ∏_{i=1}^n e^{−(xi − θ)} I[xi > θ] = e^{nθ} e^{−∑ xi} I[min{xi} > θ], (13.41)

because all xi's are greater than θ if and only if the minimum is. Then the Pitman estimator is, from (13.38),

δ(x) = ∫_{−∞}^{∞} θ e^{nθ} e^{−∑ xi} I[min{xi} > θ] dθ / ∫_{−∞}^{∞} e^{nθ} e^{−∑ xi} I[min{xi} > θ] dθ = ∫_{−∞}^{m} θ e^{nθ} dθ / ∫_{−∞}^{m} e^{nθ} dθ, (13.42)

where m = min{xi}. Do the numerator using integration by parts, so that

∫_{−∞}^{m} θ e^{nθ} dθ = (θ/n) e^{nθ} |_{−∞}^{m} − (1/n) ∫_{−∞}^{m} e^{nθ} dθ = (m/n) e^{mn} − (1/n²) e^{mn}, (13.43)

and the denominator is (1/n) e^{mn}, hence

δ(x) = [ (m/n) e^{mn} − (1/n²) e^{mn} ] / [ (1/n) e^{mn} ] = min{xi} − 1/n. (13.44)

Is that estimator unbiased? Yes, because it is the best equivariant estimator. Actually, it is the UMVUE, because min{xi} is a complete sufficient statistic.

13.4.2 Double Exponential

Now

f(x | θ) = ∏_{i=1}^n (1/2) e^{−|xi − θ|} = (1/2^n) e^{−∑ |xi − θ|}, (13.45)

and the Pitman Estimator is

δ(x) = ∫_{−∞}^{∞} θ e^{−∑ |xi − θ|} dθ / ∫_{−∞}^{∞} e^{−∑ |xi − θ|} dθ. (13.46)

These integrals need to be broken up, depending on which (xi − θ)'s are positive and which negative. We'll do the n = 2 case in detail. First note that we get the same result if we use the order statistics, that is,

δ(x) = ∫_{−∞}^{∞} θ e^{−∑ |x(i) − θ|} dθ / ∫_{−∞}^{∞} e^{−∑ |x(i) − θ|} dθ. (13.47)

So with n = 2, we have three regions of integration:

θ < x(1)          ⇒ ∑ |x(i) − θ| = x(1) + x(2) − 2θ;
x(1) < θ < x(2)   ⇒ ∑ |x(i) − θ| = −x(1) + x(2);
x(2) < θ          ⇒ ∑ |x(i) − θ| = −x(1) − x(2) + 2θ. (13.48)


The numerator in (13.47) is

e^{−x(1)−x(2)} ∫_{−∞}^{x(1)} θ e^{2θ} dθ + e^{x(1)−x(2)} ∫_{x(1)}^{x(2)} θ dθ + e^{x(1)+x(2)} ∫_{x(2)}^{∞} θ e^{−2θ} dθ
   = e^{−x(1)−x(2)} (x(1)/2 − 1/4) e^{2x(1)} + e^{x(1)−x(2)} (x(2)² − x(1)²)/2 + e^{x(1)+x(2)} (x(2)/2 + 1/4) e^{−2x(2)}
   = (1/2) e^{x(1)−x(2)} ( x(1) + x(2)² − x(1)² + x(2) ). (13.49)

The denominator is

e^{−x(1)−x(2)} ∫_{−∞}^{x(1)} e^{2θ} dθ + e^{x(1)−x(2)} ∫_{x(1)}^{x(2)} dθ + e^{x(1)+x(2)} ∫_{x(2)}^{∞} e^{−2θ} dθ
   = e^{−x(1)−x(2)} (1/2) e^{2x(1)} + e^{x(1)−x(2)} (x(2) − x(1)) + e^{x(1)+x(2)} (1/2) e^{−2x(2)}
   = e^{x(1)−x(2)} (1 + x(2) − x(1)). (13.50)

Finally,

δ(x) = (1/2) [ x(1) + x(2)² − x(1)² + x(2) ] / [ 1 + x(2) − x(1) ] = (x(1) + x(2))/2, (13.51)

which is a rather long way to calculate the mean! The answer for n = 3 is not the mean. Is it the median? A numerical check is sketched below.
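The question can be explored numerically with the same kind of integration used in the earlier sketch (again an ad hoc helper with arbitrary data values; no claim is made here about the general answer):

import numpy as np
from scipy.integrate import quad

def pitman_laplace(x, lo=-30, hi=30):
    # Pitman estimator (13.46) for the double exponential, by numerical integration
    joint = lambda t: np.exp(-np.sum(np.abs(np.asarray(x) - t)))
    num, _ = quad(lambda t: t * joint(t), lo, hi)
    den, _ = quad(joint, lo, hi)
    return num / den

print(pitman_laplace([0.0, 1.0]), 0.5)       # n = 2: the midrange, as in (13.51)
x3 = [0.0, 1.0, 5.0]
print(pitman_laplace(x3), np.median(x3))     # n = 3: compare the Pitman estimate with the median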

13.5 General invariance and equivariance

This section sketches the general notions of invariance and equivariance, expanding on the shift-invariance in the previous sections.

13.5.1 Groups and their actions

See Section 13.5.5 for a brief review of groups. We are interested in how members of a group act on the data X = (X1, . . . , Xn)′ with space X. That is, given a group G, its action on X is expressed by associating with each g ∈ G a one-to-one and onto function hg,

hg : X −→ X , (13.52)

so that the action of g on x is the transformation hg(x). Some examples:

Scale The group is G = {c ∈ R | c ≠ 0} under multiplication. It acts on the data by multiplying the observations by an element of the group, that is, for c ∈ G, the transformation is given by

hc : R^n −→ R^n
     x −→ hc(x) = (c x1, . . . , c xn)′ = c x. (13.53)

Shift The group is G = R under addition. It acts on the data by adding a constant to each of the observations:

ha : R^n −→ R^n
     x −→ ha(x) = (a + x1, . . . , a + xn)′ = a 1n + x, (13.54)

where 1n = (1, . . . , 1)′, the column vector of all 1’s.


Affine The group is G = {(a, b) ∈ R² | b ≠ 0}. It acts on the data by multiplying the observations by b and adding a:

h(a,b) : R^n −→ R^n
         x −→ h(a,b)(x) = (a + b x1, . . . , a + b xn)′ = a 1n + b x. (13.55)

The temperature transformation in (13.2) is an example of an affine transformation.

The operation for the group is, for (a, b) and (c, d) ∈ G,

(a, b)(c, d) = (a + bc, bd), (13.56)

so that

h(a,b)(c,d)(x) = h(a,b)(h(c,d)(x))
              = h(a,b)(c 1n + d x)
              = a 1n + b(c 1n + d x)
              = (a + bc) 1n + (bd) x
              = h(a+bc,bd)(x). (13.57)

Reflection The group is G = {−1, +1} under multiplication. It acts on the data by multiplication:

hǫ : R^n −→ R^n
     x −→ hǫ(x) = ǫx. (13.58)

Thus, it either leaves the data alone, or it switches all the signs.

Permutation The group is G = {P | P is an n × n permutation matrix} under multiplication. It acts on the data by multiplication:

hP : R^n −→ R^n
     x −→ hP(x) = Px. (13.59)

A permutation matrix has exactly one 1 in each row and one 1 in each column, and the rest of the elements are 0. Then Px just permutes the elements of x. For example,

P = [ 0 1 0 0
      0 0 0 1
      0 0 1 0
      1 0 0 0 ] (13.60)

is a 4 × 4 permutation matrix, and

P (x1, x2, x3, x4)′ = (x2, x4, x3, x1)′. (13.61)


The group actions need to cohere in a proper way. That is, in order for the hg's to be an action, they must satisfy the following:

• hι(x) = x, where ι is the identity.

• hg1g2 (x) = hg1 (hg2 (x)) for any g1, g2 ∈ G .

In the above examples, it is fairly easy to see that the transformations do satisfy these requirements. For example, in the shift case, the identity is 0, and adding 0 to the x leaves it alone. Also, adding a to the xi's, then adding b to the xi + a's, is the same as adding a + b to the xi's. The only example that is a little tricky is the affine one, and (13.57) shows that the second requirement works.

Whether a given group can be used for a given model depends on whether the transformed data remain in the model. This question is addressed next.

13.5.2 Invariant models

Start with the model consisting of random vector X, space X, and set of distributions P.

Definition 21. Suppose G acts on X, and that for any g ∈ G, if the distribution of X is in P, then the distribution of hg(X) is in P. Then the model (X, P) is said to be invariant under the group G.

The definition is not saying that X and hg(X) have the same distribution, but that they have the same type of distribution. Now for some examples.

1. Suppose X ∼ N(µ, 1) where µ ∈ R, so the mean can be anything, but the variance has to be 1. First take G1 = R under addition, so the action is the shift (13.54), with n = 1. Now X∗ = ha(X) = a + X ∼ N(a + µ, 1), which is N(µ∗, 1) with µ∗ = a + µ ∈ R. So the model is invariant under G1.

Next, consider the scale group, G2 = {c ∈ R | c ≠ 0} under multiplication. Then X∗ = hc(X) = cX ∼ N(cµ, c²). This X∗ does not have a distribution in the model for all c, since that model requires the variance to be 1.

But with the reflection group G3 = {−1, +1}, ǫX ∼ N(ǫµ, ǫ²) = N(ǫµ, 1) for ǫ = ±1, which is in the model. So the model is invariant under G3.

2. Now let X ∼ N(µ, σ²) where (µ, σ²) ∈ R × (0, ∞), so the mean and variance can be anything. This model is invariant under all the groups in the previous example. In fact, it is invariant under the affine group, since if (a, b) ∈ G = {(a, b) ∈ R² | b ≠ 0}, then X∗ = a + b X ∼ N(a + bµ, b²σ²) = N(µ∗, σ∗²), and (µ∗, σ∗²) ∈ R × (0, ∞) as required.

3. Suppose X1, . . . , Xn are iid with any location-family distribution, that is, for a given pdf f, the pdf of X is

∏_{i=1}^n f(xi − µ), (13.62)

where µ ∈ R. This model is shift-invariant, since for a ∈ R, X∗ = (a + X1, . . . , a + Xn)′ has pdf

∏_{i=1}^n f(x∗i − µ∗), (13.63)


where µ∗ = a + µ ∈ R.

4. Suppose X1, . . . , Xn are iid with any distribution individually. Then the model is invariant under the permutation group, because any permutation of the Xi's also has iid elements.

5. On the other hand, consider a little ANOVA model, where X1, X2 are N(µ, σ²), and X3, X4 are N(ν, σ²), and the Xi's are independent. The parameter space is (µ, ν) ∈ R². So

X ∼ N( (µ, µ, ν, ν)′, σ²I4 ). (13.64)

This model is not invariant under the permutation group. E.g., take P as in (13.60), so that from (13.61),

PX = (X2, X4, X3, X1)′ ∼ N( (µ, ν, ν, µ)′, σ²I4 ). (13.65)

This mean does not have the right pattern (first two equal, last two equal), hence the distribution of PX is not in the model. (There are permutations which do keep the distribution in the model, but to be invariant, it has to work for every permutation.)

13.5.3 Induced actions

Most of the models we'll consider are parametric, i.e., P = {Pθ | θ ∈ Θ}. If a parametric model is invariant under a group G, then for g ∈ G,

X ∼ Pθ =⇒ X∗ = hg(X) ∼ Pθ∗ (13.66)

for some other parameter θ∗ ∈ Θ. In most of the examples above, that is what happened. For example, consider #2, where X ∼ N(µ, σ²), (µ, σ²) ∈ Θ = R × (0, ∞). The group is the affine group. Then for (a, b) ∈ G, X∗ = a + bX, and (µ∗, σ∗²) = (a + bµ, b²σ²). That is, the group also acts on Θ. It is a different action than that on X. Call the transformations hg, so that

h(a,b) : Θ −→ Θ
         (µ, σ²) −→ h(a,b)(µ, σ²) = (a + bµ, b²σ²). (13.67)

This action is called the induced action on Θ, because it automatically arises once the action on X is defined.

It may be that the induced action is trivial, that is, the parameter does not change. For example, take X ∼ N(µ1n, In), µ ∈ R. This model is invariant under permutations. But for a permutation matrix P, PX ∼ N(P(µ1n), P In P′) = N(µ1n, In), so µ∗ = µ, i.e., hP(µ) = µ.


13.5.4 Equivariant estimators

The first application of invariance will be to estimating the location parameter in location families. The data are as in Example 3 in Section 13.5.2, so that X = (X1, . . . , Xn)′, where the Xi's are iid f(xi − µ), yielding the joint pdf (13.62). The parameter space is R. We wish to find the best shift-equivariant estimator of µ, but first we need to define what that means. Basically, it means the induced action on the estimator is the same as that on the parameter. The following definition is for the general case.

Definition 22. Suppose that the model with space X and parameterized set of distributions {Pθ | θ ∈ Θ} is invariant under the group G. Then an estimator δ of θ with

δ : X −→ Θ (13.68)

is equivariant under G if

δ(hg(x)) = hg(δ(x)) (13.69)

for each g ∈ G.

13.5.5 Groups

Definition 23. A group G is a set of elements equipped with a binary operation such that

Closure If g1, g2 ∈ G then g1 g2 ∈ G .

Associativity g1 (g2 g3) = (g1 g2) g3 for any g1, g2, g3 ∈ G .

Identity There exists an element ι ∈ G , called the identity, such that for every g ∈ G , g ι =ι g = g.

Inverse For each g ∈ G there exists an element g−1 ∈ G , called the inverse of g, such that

g g−1 = g−1 g = ι.

Some examples:

Name          G                              Element    Operation                          Identity    Inverse
Shift         R                              a          +                                  0           −a
Scale         R − {0}                        b          ×                                  1           1/b
Affine        R × [R − {0}]                  (a, b)     (a, b)(c, d) = (a + bc, bd)        *           *
Reflection    {−1, +1}                       ǫ          ×                                  1           ǫ
Permutation   n × n permutation matrices     P          ×                                  In          P′

(* = Exercise for the reader.)


Chapter 14

The Decision-theoretic Approach

14.1 The setup

Assume we have a statistical model, with X, X, and Θ. The decision-theoretic approach to inference supposes an action space A that specifies the possible "actions" we might take. Some typical examples:

Inference                          Action space
Estimating θ                       A = Θ
Hypothesis testing                 A = {Accept H0, Reject H0}
Selecting among several models     A = {Model I, Model II, . . .}
Predicting Xn+1                    A = Space of Xn+1
(14.1)

A decision procedure specifies which action to take for each possible value of the data. In estimation, the decision procedure is the estimator; in hypothesis testing, it is the hypothesis test. Formally, a decision procedure is a function δ(x),

δ : X −→ A. (14.2)

[The above is a nonrandomized decision procedure. A randomized procedure would depend on not just the data x, but also some outside randomization element. We will ignore randomized procedures until Chapter 18 on hypothesis testing.]

A good procedure is one that takes good actions. To measure how good, we need a loss function that specifies a penalty for taking a particular action when a particular distribution obtains. Formally, the loss function L is a function

L : A×Θ −→ [0, ∞). (14.3)

When estimating a function g(θ), common loss functions are squared-error loss,

L(a, θ) = (a− g(θ))2, (14.4)

and absolute-error loss,

L(a, θ) = |a − g(θ)|. (14.5)

In hypothesis testing, a "0−1" loss is common, where you lose 0 by making the correct decision, and lose 1 if you make a mistake.


14.2 The risk function

A frequentist evaluates procedures by their behavior over experiments. In this decision-theoretic framework, the risk function for a particular decision procedure δ is key. The risk is the expected loss, where δ takes the place of the a. It is a function of θ:

R(θ; δ) = Eθ[L(δ(X), g(θ))] = E[L(δ(X), g(θ)) | θ = θ]. (14.6)

The two expectations are the same, but the first is written for frequentists, and the second for Bayesians.

In this chapter, we will concentrate on estimation, where L is squared-error loss (14.4), which means the risk is the mean square error from Definition 14:

R(θ; δ) = Eθ[(δ(X) − g(θ))²] = MSE(θ; δ). (14.7)

The idea is to choose a δ with small risk. The challenge is that usually, there is no procedure that is best for every θ. For example, in Section 10.6, we exhibit several estimators of the binomial parameter. Notice that the risk functions cross, so there is no way to say one procedure is "best." One way to choose is to restrict consideration to a subset of estimators, e.g., unbiased ones as in Chapter 12, or shift-equivariant ones as in Chapter 13. Often, one either does not wish to use such a restriction, or cannot.

In the absence of a uniquely defined best procedure, frequentists have several possible strategies, among them determining admissible procedures, minimax procedures, or Bayes procedures.

14.3 Admissibility

Recall the example in Section 10.6, where X ∼ Binomial(n, θ), n = 5. Figure 10.2, repeated here in Figure 14.1, shows MSE(θ; δ) for four estimators.

The important message in the graph is that none of the four estimators is obviously "best" in terms of MSE. In addition, none of them are discardable in the sense that another estimator is better. That is, for any pair of estimators, sometimes one is better, sometimes the other. A fairly weak criterion is this lack of discardability. Formally, the estimator δ∗ is said to dominate the estimator δ if

R(θ; δ∗) ≤ R(θ; δ) for all θ ∈ Θ, and (14.8)

R(θ; δ∗) < R(θ; δ) for at least one θ ∈ Θ. (14.9)

For example, suppose X and Y are independent, with

X ∼ N(µ, 1) and Y ∼ N(µ, 2). (14.10)

Then two estimators are

δ(x, y) = (x + y)/2 and δ∗(x, y) = (2x + y)/3. (14.11)

Both are unbiased, and hence their risks (using squared-error loss) are their variances:

R(µ; δ) = (1 + 2)/4 = 3/4 and R(µ; δ∗) = (4 + 2)/9 = 2/3. (14.12)


[Figure 14.1: Some MSE's for binomial estimators (redux). The plot shows MSE versus θ ∈ [0, 1] for the four estimators labeled MLE, a = b = 1, a = b = 10, and a = 10, b = 3.]

Thus δ∗ dominates δ. (The second inequality in (14.9) holds for all µ, although it does not have to for domination.) The concept of admissibility is based on lack of domination.
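As a quick numerical check of (14.12), here is a small Monte Carlo sketch (an illustration only, not part of the development; it assumes numpy and takes µ = 1, though the risks do not depend on µ):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, reps = 1.0, 200_000
x = rng.normal(mu, 1.0, reps)             # X ~ N(mu, 1)
y = rng.normal(mu, np.sqrt(2.0), reps)    # Y ~ N(mu, 2)

delta = (x + y) / 2                       # risk should be near 3/4
delta_star = (2 * x + y) / 3              # risk should be near 2/3
print(np.mean((delta - mu) ** 2), np.mean((delta_star - mu) ** 2))
```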

Definition 24. Let D be a set of decision procedures. A δ ∈ D is inadmissible among procedures in D if there is another δ∗ ∈ D that dominates δ. If there is no such δ∗, then δ is admissible among procedures in D.

If D is the set of unbiased estimators, then the UMVUE is admissible in D, and any other unbiased estimator is inadmissible. Similarly, if D is the set of shift-equivariant procedures (assuming that restriction makes sense for the model), then the best shift-equivariant estimator is the only admissible estimator in D. In the Binomial example above, if D consists of the four given estimators, then all four are admissible in D.

The presumption is that one would not want to use an inadmissible procedure, at least if risk is the only consideration. Other considerations, such as interpretability or computational ease, may lead one to use an inadmissible procedure, provided it cannot be dominated by much. Conversely, any admissible procedure is presumed to be at least plausible, although there are some strange ones.


14.3.1 Stein’s surprising result

Admissibility is generally considered a fairly weak criterion. An admissible procedure does not have to be very good everywhere, but just have something going for it. Thus the statistical community was rocked when in 1956, Charles Stein showed that in the multivariate normal case, the usual estimator of the mean could be inadmissible.1 The model has random vector

X ∼ N_p(µ, I_p), (14.13)

with µ ∈ R^p. The objective is to estimate µ with squared-error loss, which in this case is multivariate:

L(a, µ) = ‖a − µ‖² = ∑_{i=1}^p (a_i − µ_i)². (14.14)

The obvious estimator, which is the UMVUE, MLE, best shift-equivariant estimator, etc., is

δ(x) = x. (14.15)

Its risk is

R(µ; δ) = E_µ[∑_{i=1}^p (X_i − µ_i)²] = p, (14.16)

because the X_i's all have variance 1. When p = 1 or 2, then δ is admissible. The surprise is that when p ≥ 3, δ is inadmissible. The most famous estimator that dominates it is the 1961 James-Stein estimator,2

δ_JS(x) = (1 − (p − 2)/‖x‖²) x. (14.17)

It is a shrinkage estimator, because it takes the usual estimator and shrinks it (towards 0 in this case), at least when (p − 2)/‖x‖² < 1. Throughout the sixties and seventies, there was a frenzy of work on various shrinkage estimators. They are still quite popular. The domination result is not restricted to normality; it is quite broad. The general notion of shrinkage is very important in machine learning, where better predictions are found by restraining estimators from becoming too large (also called “regularization").

To find the risk function for the James-Stein estimator when p ≥ 3, start by writing

R(µ; δ_JS) = E_µ[‖δ_JS(X) − µ‖²]
 = E_µ[‖X − µ − ((p − 2)/‖X‖²) X‖²]
 = E_µ[‖X − µ‖²] + E_µ[(p − 2)²/‖X‖²] − 2(p − 2) E_µ[X′(X − µ)/‖X‖²]. (14.18)

The first term we recognize from (14.16) as p. Consider the third term, where

E_µ[X′(X − µ)/‖X‖²] = ∑_{i=1}^p E_µ[X_i(X_i − µ_i)/‖X‖²]. (14.19)

1 Charles Stein (1956), Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pages 187–195.

2 James, W. and Stein, C. (1961), Estimation with quadratic loss, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pages 311–319.


We take each term in the summation separately. The first one can be written

E_µ[X_1(X_1 − µ_1)/‖X‖²] = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ [x_1(x_1 − µ_1)/‖x‖²] φ_{µ_1}(x_1) dx_1 φ_{µ_2}(x_2) dx_2 · · · φ_{µ_p}(x_p) dx_p, (14.20)

where φ is the N(0, 1) pdf, so that

φ_{µ_1}(x_1) = (1/√(2π)) e^{−(x_1−µ_1)²/2}. (14.21)

Looking at the innermost integral, note that by (14.21),

(x_1 − µ_1) φ_{µ_1}(x_1) = −(∂/∂x_1) φ_{µ_1}(x_1), (14.22)

so that

∫_{−∞}^∞ [x_1(x_1 − µ_1)/‖x‖²] φ_{µ_1}(x_1) dx_1 = −∫_{−∞}^∞ [x_1/‖x‖²][(∂/∂x_1) φ_{µ_1}(x_1)] dx_1. (14.23)

Now, with fixed x_2, . . . , x_p, use integration by parts:

u = x_1/‖x‖²;  v = φ_{µ_1}(x_1);
du/dx_1 = 1/‖x‖² − 2x_1²/‖x‖⁴;  dv/dx_1 = (∂/∂x_1) φ_{µ_1}(x_1). (14.24)

It is not hard to show that

uv|_{−∞}^∞ = [x_1/‖x‖²] (1/√(2π)) e^{−(x_1−µ_1)²/2} |_{−∞}^∞ = 0. (14.25)

Then,

∫_{−∞}^∞ [x_1(x_1 − µ_1)/‖x‖²] φ_{µ_1}(x_1) dx_1 = −(uv|_{−∞}^∞ − ∫_{−∞}^∞ v du)
 = ∫_{−∞}^∞ (1/‖x‖² − 2x_1²/‖x‖⁴) φ_{µ_1}(x_1) dx_1. (14.26)

Replacing the innermost integral in (14.20) with (14.26) yields

E_µ[X_1(X_1 − µ_1)/‖X‖²] = E_µ[1/‖X‖² − 2X_1²/‖X‖⁴]. (14.27)

The same calculation works for i = 2, . . . , p, so that from (14.19),

E_µ[X′(X − µ)/‖X‖²] = ∑_{i=1}^p E_µ[1/‖X‖² − 2X_i²/‖X‖⁴] = p E_µ[1/‖X‖²] − 2 E_µ[∑ X_i²/‖X‖⁴] = E_µ[(p − 2)/‖X‖²]. (14.28)


Going back to (14.18),

R(µ; δ_JS) = E_µ[‖X − µ‖²] + E_µ[(p − 2)²/‖X‖²] − 2(p − 2) E_µ[X′(X − µ)/‖X‖²]
 = p + E_µ[(p − 2)²/‖X‖²] − 2(p − 2) E_µ[(p − 2)/‖X‖²]
 = p − E_µ[(p − 2)²/‖X‖²]. (14.29)

That's it! The expected value at the end is positive, so that the risk is less than p. That is,

R(µ; δ_JS) = p − E_µ[(p − 2)²/‖X‖²] < p = R(µ; δ) for all µ ∈ R^p, (14.30)

meaning δ(x) = x is inadmissible.

How much does the James-Stein estimator dominate δ? It shrinks towards zero, so if the true mean is zero, one would expect the James-Stein estimator to be quite good. We can find the risk exactly, because when µ = 0, ‖X‖² ∼ χ²_p, and it is not hard to show (using gammas) that E[1/χ²_p] = 1/(p − 2), hence

R(0; δ_JS) = p − (p − 2)²/(p − 2) = 2. (14.31)

Especially when p is large, this risk is much less than that of δ, which is always p. Even for p = 3, the James-Stein risk is 2/3 of δ's. The farther µ is from 0, the less advantage the James-Stein estimator has. As ‖µ‖ → ∞, with ‖X‖² ∼ χ²_p(‖µ‖²), we have E_µ[1/‖X‖²] → 0, so

R(µ; δ_JS) −→ p = R(µ; δ) as ‖µ‖ → ∞. (14.32)

If rather than having good risk at zero, one has a “prior" idea that the mean is near some fixed µ_0, one can instead shrink towards that vector:

δ∗_JS(x) = (1 − (p − 2)/‖x − µ_0‖²)(x − µ_0) + µ_0. (14.33)

This estimator has the same risk as the regular James-Stein estimator, but with shifted parameter:

R(µ; δ∗_JS) = p − E_µ[(p − 2)²/‖X − µ_0‖²] = p − E_{µ−µ_0}[(p − 2)²/‖X‖²], (14.34)

and has risk of 2 when µ = µ_0.

The James-Stein estimator itself is not admissible. There are many other similar estimators in the literature, some that dominate δ_JS but are not admissible (such as the “positive part" estimator that does not allow the shrinking factor to be negative), and many admissible estimators that dominate δ.


14.4 Bayes procedures

One method for selecting among various δ's is to find one that minimizes the average of the risk, where the average is taken over θ. This averaging needs a probability measure on Θ. From a Bayesian perspective, this distribution is the prior. From a frequentist perspective, it may or may not reflect prior belief, but it should be “reasonable." The estimator that minimizes this average is the Bayes estimator corresponding to the distribution.

Definition 25. For given risk function R(θ; δ) and distribution π on Θ, a Bayes procedure corresponding to π is a procedure δ_π that minimizes the Bayes risk E_π[R(θ; δ)] over δ, i.e.,

Eπ [R(θ; δπ)] ≤ Eπ [R(θ; δ)] for any δ. (14.35)

It might look daunting to minimize over an entire function, but we can reduce the problem to minimizing over a single value by using an iterated expected value. With both X and θ random, the Bayes risk is the expected value of the loss over the joint distribution of (X, θ), hence can be written as the expected value of the conditional expected value of L given X:

E_π[R(θ; δ)] = E[L(δ(X), θ)] = E[L̃(X)], (14.36)

where

L̃(x) = E[L(δ(x), θ) | X = x]. (14.37)

In that final expectation, θ is random, having the posterior distribution given X = x. In (14.37), because x is fixed, δ(x) is just a constant, hence it may not be too difficult to minimize L̃(x) over δ(x). Note that if we find the δ(x) to minimize L̃(x) for each x, then we have also minimized the overall Bayes risk (14.36). Thus a Bayes procedure is δ_π such that

δπ(x) minimizes E[L(δ(x), θ) | X = x] over δ(x) for each x ∈ X . (14.38)

In estimation with squared-error loss, (14.37) becomes

L̃(x) = E[(δ(x) − g(θ))² | X = x]. (14.39)

That expression is minimized with δ(x) being the mean, in this case the conditional (posterior) mean of g:

δπ(x) = E[g(θ) | X = x]. (14.40)

To summarize:

Lemma 19. In estimating g(θ) with squared-error loss, the Bayes procedure corresponding to the prior distribution π is the posterior mean of g(θ).

A Bayesian does not care about x's not observed, hence would immediately go to the conditional equation (14.38), and use the resulting δ_π(x). It is interesting that the decision-theoretic approach appears to bring Bayesians and frequentists together. They do end up with the same procedure, but from different perspectives. The Bayesian is trying to limit expected losses given the data, while the frequentist is trying to limit average expected losses, taking expected values as the experiment is repeated.
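For instance, with X ∼ Binomial(n, θ) and a Beta(α, β) prior, the posterior is Beta(α + x, β + n − x), so Lemma 19 gives the posterior mean (α + x)/(α + β + n) as the Bayes estimator of θ. A sketch (assuming numpy) that approximates the Bayes risks of this estimator and of the MLE by simulating θ from the prior:

```python
import numpy as np

def bayes_estimate(x, n, a=1.0, b=1.0):
    """Posterior mean of theta when X ~ Binomial(n, theta) and theta ~ Beta(a, b)."""
    return (a + x) / (a + b + n)

rng = np.random.default_rng(0)
n, a, b, reps = 5, 1.0, 1.0, 200_000
theta = rng.beta(a, b, reps)          # draw theta from the prior
x = rng.binomial(n, theta)            # then X | theta
print(np.mean((bayes_estimate(x, n, a, b) - theta) ** 2),  # Bayes risk: smaller
      np.mean((x / n - theta) ** 2))                       # MLE's Bayes risk: larger
```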


14.4.1 Bayes procedures and admissibility

A Bayes procedure with respect to a prior π has good behavior averaging over θ. Any procedure that dominates it would also have to have at least as good a Bayes risk, hence would also be Bayes. If there is only one Bayes procedure for a π, then nothing can dominate it. Hence the following.

Lemma 20. Suppose δ_π is the unique Bayes procedure relative to the prior π. Then it is admissible.

Proof. Suppose δ∗ dominates δπ . Then

R(θ; δ∗) ≤ R(θ; δπ) for all θ ∈ Θ, (14.41)

which implies that

Eπ [R(θ; δ∗)] ≤ Eπ [R(θ; δπ)]. (14.42)

But by definition, from (14.35),

Eπ [R(θ; δπ)] ≤ Eπ [R(θ; δ∗)]. (14.43)

Thus δ∗ and δ_π are both Bayes relative to π. But because the Bayes procedure is unique, δ∗ and δ_π must be the same. Thus δ∗ cannot dominate δ_π, and δ_π is admissible. □

Not all admissible procedures are Bayes, but they are at least limits of Bayes procedures in some sense. Exactly which limits are admissible is a bit delicate, though. In any case, at least approximately, one can think of a procedure as being admissible if there is some Bayesian somewhere, or a sequence of Bayesians, who would use it. The book Mathematical Statistics - A Decision Theoretic Approach by Thomas Ferguson is an excellent introduction to these concepts.

14.5 Minimax procedures

Using a Bayes procedure still involves choosing a prior π, even if you are a frequentist and the prior does not really represent your beliefs. One attempt at objectifying the choice of a procedure is, for each procedure, to see what its worst risk is. Then you choose the procedure whose worst risk is best. For example, in Figure 14.1, the maximum risks are

MLE : 0.050
a = 1, b = 1 : 0.026
a = 10, b = 10 : 0.160
a = 10, b = 3 : 0.346 (14.44)

Of these, the posterior mean for prior Beta(1, 1) has the lowest maximum. It is called the minimax procedure.

Definition 26. Let D be a set of decision procedures. A δ ∈ D is minimax among procedures in D if for any other δ∗ ∈ D,

sup_{θ∈Θ} R(θ; δ) ≤ sup_{θ∈Θ} R(θ; δ∗). (14.45)


For estimating the normal mean as in Section 14.3.1, the usual estimator δ(x) = x is minimax, as is any estimator (such as the James-Stein estimator) that dominates it. In particular, we see that minimax procedures do not have to be unique.

Again looking at Figure 14.1, note that the minimax procedure is the flattest. In fact, it is also “maximin" in that it has the worst best risk. It looks as if when trying to limit bad risk everywhere, you give up very good risk somewhere. This idea leads to one method for finding a minimax procedure: A Bayes procedure with flat risk is minimax.

Lemma 21. Suppose δπ is Bayes with respect to the prior π, and the risk is constant,

R(θ; δπ) = c for all θ ∈ Θ. (14.46)

Then δπ is minimax.

Proof. Suppose δπ is not minimax, so that there is a δ∗ such that

sup_{θ∈Θ} R(θ; δ∗) < sup_{θ∈Θ} R(θ; δ_π) = c. (14.47)

But then

E_π[R(θ; δ∗)] < c = E_π[R(θ; δ_π)], (14.48)

meaning that δ_π is not Bayes. Hence we have a contradiction, so that δ_π is minimax. □

14.5.1 Example: Binomial

Suppose X ∼ Binomial(n, θ), θ ∈ (0, 1). Then from (10.45), the MSE for the Bayes estimator using the Beta(α, β) prior can be calculated as

R(θ; δ_{α,β}) = nθ(1 − θ)/(n + α + β)² + (((α + β)θ − α)/(n + α + β))². (14.49)

If we can find an (α, β) so that this risk is constant, then the corresponding estimator is minimax.
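Such a pair exists: with α = β = √n/2, we have (α + β)θ − α = √n(θ − 1/2), so the numerator of (14.49) becomes nθ(1 − θ) + n(θ − 1/2)² = n/4, and the risk is the constant (n/4)/(n + √n)². The resulting estimator δ(x) = (x + √n/2)/(n + √n) is therefore minimax by Lemma 21. A quick numerical check of the constant risk (a sketch assuming numpy):

```python
import numpy as np

n = 5
a = b = np.sqrt(n) / 2
theta = np.linspace(0.01, 0.99, 9)
# risk from (14.49): [n*theta*(1-theta) + ((a+b)*theta - a)^2] / (n + a + b)^2
risk = (n * theta * (1 - theta) + ((a + b) * theta - a) ** 2) / (n + a + b) ** 2
print(risk)                                 # the same value for every theta
print(n / 4 / (n + np.sqrt(n)) ** 2)        # (n/4)/(n + sqrt(n))^2
```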


Chapter 15

Asymptotics: Convergence in probability

Often, inference about a single function of the parameter g(θ) consists of an estimate ĝ(θ) plus an assessment of its variability, which may be in the form of a standard error of the estimate, a confidence or probability interval for g(θ), or a significance level or probability of a particular hypothesized value of g(θ). In any case, we need more than just the estimate.

The standard error of an estimate ĝ(θ) is its standard deviation or, in a slight ambiguity of notation, an estimate of such:

SE[ĝ(θ)] = √Var[ĝ(θ)] or √V̂ar[ĝ(θ)]. (15.1)

Then an approximate 95% confidence interval for g(θ) would be

ĝ(θ) ± 2 SE[ĝ(θ)]. (15.2)

If the estimator is Normal with mean g(θ) and variance SE², then this interval is exact if “2" is changed to “1.96." For example, when estimating the mean µ of a normal when σ² is known, X̄ ± 1.96 σ/√n is an exact 95% confidence interval. In other situations, (15.2) may or may not be a good approximation. Unfortunately, once one moves away from linear estimators from Normal data, it is typically very difficult to figure out exactly the distribution of an estimator, or even what its mean and variance are. We have been in this situation many times, with MLE's, Pitman estimators, even the median.

One way to address such questions is to look at what happens when n is large, or actually, as n → ∞. In many cases, nice asymptotic results are available, and they give surprisingly good approximations even when n is nowhere near ∞.

15.1 The set-up

We assume that we have a sequence of random variables, or random vectors. That is, for each n, we have a random p × 1 vector W_n with space W_n (⊂ R^p) and probability distribution P_n. There need not be any particular relationship between the W_n's for different n's, but in the most common situation we will deal with, W_n is some function of iid X_1, . . . , X_n, so as n → ∞, the function is based on more observations.


For example, Wn may be the sample mean of X1, . . . , Xn, iid N(µ, σ2)’s, so that

W_n = X̄_n = ∑_{i=1}^n X_i / n ∼ N(µ, σ²/n). (15.3)

Or W_n could be a vector,

W_n = (X̄_n, S²_n)′, (15.4)

whose joint distribution we know. In other situations, the parameters may depend on n, e.g.,

W_n ∼ Binomial(n, λ/n) (15.5)

for some λ > 0. (Technically, we would have to set the p to be min{1, λ/n}, but as n → ∞, eventually that would be λ/n.)

The two types of convergence we will consider are convergence in probability to a constant and convergence in distribution to a random vector, which will be defined in the next two sections.

15.2 Convergence in probability to a constant

A sequence of constants a_n approaching the constant c means that as n → ∞, a_n gets arbitrarily close to c; technically, for any ǫ > 0, eventually |a_n − c| < ǫ. That definition does not immediately transfer to random variables. For example, suppose X̄_n is the mean of n iid N(µ, σ²)'s as in (15.3). The law of large numbers says that as n → ∞, X̄_n → µ. But that cannot always be true, since no matter how large n is, the space of X̄_n is R. On the other hand, the probability is high that X̄_n is close to µ. That is, for any ǫ > 0,

P_n[|X̄_n − µ| < ǫ] = P[|N(0, 1)| < √n ǫ/σ] = Φ(√n ǫ/σ) − Φ(−√n ǫ/σ), (15.6)

where Φ is the distribution function of Z ∼ N(0, 1). Now let n → ∞. Because Φ is a distribution function, the first Φ on the right in (15.6) goes to 1, and the second goes to 0, so that

P_n[|X̄_n − µ| < ǫ] −→ 1. (15.7)

Thus X̄_n isn't for sure close to µ, but is with probability 0.9999999999 (assuming n is large enough). Now for the definition.

Definition 27. The sequence of random variables W_n converges in probability to the constant c, written

Wn −→P c, (15.8)

if for every ǫ > 0,

Pn[|Wn − c| < ǫ] −→ 1. (15.9)

If W_n is a sequence of random p × 1 vectors, and c is a p × 1 constant vector, then W_n →P c if for every ǫ > 0,

Pn[‖Wn − c‖ < ǫ] −→ 1. (15.10)


It turns out that W_n →P c if and only if each component W_{ni} →P c_i, where W_n = (W_{n1}, . . . , W_{np})′.

As an example, suppose X_1, . . . , X_n are iid Uniform(0, 1), and let W_n = min{X_1, . . . , X_n}. You would expect that as the number of observations between 0 and 1 increases, the minimum would get pushed down to 0. So the question is whether W_n →P 0. To prove it, take any 1 > ǫ > 0,1 and look at

Pn[|Wn− 0| < ǫ] = Pn[|Wn| < ǫ] = Pn[Wn < ǫ], (15.11)

because W_n is positive. Then, recalling the pdf of the minimum of uniforms in (6.96, with k = 1),

P_n[W_n < ǫ] = ∫_0^ǫ n(1 − z)^{n−1} dz = −(1 − z)^n |_0^ǫ = 1 − (1 − ǫ)^n → 1 as n → ∞. (15.12)

Thus

min{X_1, . . . , X_n} −→P 0. (15.13)
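A quick simulation of (15.13) (a sketch assuming numpy): for each n, the proportion of samples whose minimum lies within ǫ = 0.05 of 0 matches 1 − (1 − ǫ)^n from (15.12) and approaches 1.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, reps = 0.05, 50_000
for n in [10, 50, 200]:
    mins = rng.uniform(0, 1, (reps, n)).min(axis=1)
    print(n, np.mean(mins < eps), 1 - (1 - eps) ** n)   # empirical vs. exact
```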

15.2.1 Chebychev’s Inequality

The examples in (15.6) and (15.12) are unusual in that we can calculate the probabilities exactly. It is more common that some inequalities are used. Chebychev's, for instance, which can be used to show that W_n →P c if E[(W_n − c)²] → 0, comes next.

Lemma 22. Chebychev’s Inequality. For random variable W and ǫ > 0,

P[|W| ≥ ǫ] ≤ E[W²]/ǫ². (15.14)

Proof.

E[W²] = E[W² I{|W| < ǫ}] + E[W² I{|W| ≥ ǫ}]
 ≥ E[W² I{|W| ≥ ǫ}]
 ≥ ǫ² E[I{|W| ≥ ǫ}]
 = ǫ² P[|W| ≥ ǫ]. (15.15)

Then (15.14) follows. QED

Aside 1. A similar proof can be applied to any nondecreasing function φ(w) : [0, ∞) → R to show that

P[|W| ≥ ǫ] ≤ E[φ(|W|)]/φ(ǫ). (15.16)

Chebychev’s Inequality uses φ(w) = w2. The general form is called Markov’s Inequality.

1Why is it ok for us to ignore ǫ ≥ 1?


Now apply Chebychev to the sequence W_n, substituting W = W_n − c, so that for any ǫ > 0,

P_n[|W_n − c| ≥ ǫ] ≤ E[(W_n − c)²]/ǫ². (15.17)

Then

E[(W_n − c)²] → 0 =⇒ P_n[|W_n − c| < ǫ] ≥ 1 − E[(W_n − c)²]/ǫ² → 1. (15.18)

A couple of useful consequences:

1. If δ_n(x_1, . . . , x_n) is an estimate of g(θ) for every n, and if the mean square error goes to 0 as n → ∞, then

δ_n(X_1, . . . , X_n) −→P g(θ). (15.19)

(That is, W_n = δ_n(X_1, . . . , X_n) and c = g(θ) in (15.18).) A sequence of estimators for which (15.19) holds is said to be consistent.

Definition 28. Suppose δ_1, δ_2, . . . is a sequence of estimators of g(θ). (That is, for each n, there is a set of probability measures {P_{nθ} | θ ∈ Θ}.) Then if

δ_n −→P g(θ) (15.20)

for each θ ∈ Θ, the sequence of estimators is consistent.

Notice that this definition assumes the same parameter space for each n. One often leaves out the words “sequence of", and says that “δ_n is a consistent estimator of g(θ)."

2. If X_1, . . . , X_n are iid with mean µ and variance σ² < ∞, then

X̄_n −→P µ. (15.21)

This fact follows from part 1, since the mean square error of X̄_n as an estimator of µ is Var[X̄_n] = σ²/n → 0. Equation (15.21) is called the weak law of large numbers.

Revisiting the uniform example, suppose that X_1, . . . , X_n are iid Uniform(µ, µ + 1), and consider δ_n(x_1, . . . , x_n) = min{x_1, . . . , x_n} to be an estimator of µ. Because δ_n is shift-equivariant, we know that the mean square error is the same for all µ. Thus by (6.96), δ_n ∼ Beta(1, n) if µ = 0, and

MSE(δ_n) = Var_0[δ_n] + E_0[δ_n]² = n/((n + 1)²(n + 2)) + 1/(n + 1)² → 0. (15.22)

Hence min{X_1, . . . , X_n} →P µ, i.e., the minimum is a consistent estimator of µ.

The next section contains some useful techniques for showing convergence in probability.


15.3 The law of large numbers; Mapping

Showing an estimator is consistent usually begins with the law of large numbers. Here is the formal statement.

Lemma 23. Weak law of large numbers (WLLN). If X_1, . . . , X_n are iid with (finite) mean µ, then

X̄_n −→P µ. (15.23)

The difference between this lemma and what we stated in (15.21) is that before, we assumed that the variance was finite. We had to do that to use Chebychev's inequality. It turns out that the result still holds as long as the mean is finite. We will use that result, but not prove it. It can be applied to means of functions of the X_i's. For example, if E[X_i²] < ∞, then

(1/n) ∑_{i=1}^n X_i² −→P E[X_i²] = µ² + σ², (15.24)

because the X_1², . . . , X_n² are iid with mean µ² + σ².

Not only means of functions, but functions of the mean are of interest. For example, if the X_i's are iid Exponential(λ), then the mean is 1/λ, so that

X̄_n −→P 1/λ. (15.25)

But we really want to estimate λ. The method of moments and maximum likelihood estimators are both 1/X̄_n, though not the UMVUE. Does

1/X̄_n −→P λ ? (15.26)

We could find the mean and variance of 1/X̄_n, but more simply we note that if X̄_n is close to 1/λ, 1/X̄_n must be close to λ, because the function 1/w is continuous. Formally, we have the following mapping result.

Lemma 24. If Wn →P c, and g(w) is a function continuous at w = c, then

g(Wn) −→P g(c). (15.27)

Proof. By definition of continuity, for every ǫ > 0, there exists a δ > 0 such that

|w− c| < δ =⇒ |g(w)− g(c)| < ǫ. (15.28)

Thus the event on the right happens at least as often as that on the left, i.e.,

Pn[|Wn− c| < δ] ≤ Pn[|g(Wn)− g(c)| < ǫ]. (15.29)

The definition of →P means that P_n[|W_n − c| < δ] → 1 for any δ > 0, but P_n[|g(W_n) − g(c)| < ǫ] is larger, hence

Pn[|g(Wn)− g(c)| < ǫ] −→ 1, (15.30)

proving (15.27).

Thus the answer to (15.26) is “Yes." This lemma also works for vector W_n, that is, if g(w) is continuous at c, then

Wn −→P c =⇒ g(Wn) −→P g(c). (15.31)


Examples.

1. Suppose X_1, . . . , X_n are iid with exponential family density, where θ is the natural parameter and x is the natural statistic:

f (x; θ) = a(x)eθx−ψ(θ). (15.32)

Then the WLLN shows that

X̄_n −→P E[X_i] = µ(θ) = ψ′(θ). (15.33)

The MLE is the solution to the equation x̄_n = µ(θ) (why?), hence is

θ̂_n = µ^{−1}(x̄_n). (15.34)

The function µ^{−1}(w) is continuous, hence by Lemma 24,

θ̂_n = µ^{−1}(X̄_n) −→P µ^{−1}(µ(θ)) = θ. (15.35)

Thus the MLE is a consistent estimator of θ.

2. Consider regression through the origin, that is, (X_1, Y_1), . . . , (X_n, Y_n) are iid,

E[Y_i | X_i = x_i] = βx_i, Var[Y_i | X_i = x_i] = σ²_e, E[X_i] = µ_X, Var[X_i] = σ²_X > 0. (15.36)

The least squares estimate of β is

β̂_n = ∑_{i=1}^n x_i y_i / ∑_{i=1}^n x_i². (15.37)

Is this a consistent estimator? We know from the WLLN by (15.24) that

(1/n) ∑_{i=1}^n X_i² −→P µ²_X + σ²_X. (15.38)

Also, X_1Y_1, . . . , X_nY_n are iid, and

E[X_iY_i | X_i = x_i] = x_i E[Y_i | X_i = x_i] = βx_i², (15.39)

hence

E[X_iY_i] = E[βX_i²] = β(µ²_X + σ²_X). (15.40)

Thus the WLLN shows that

(1/n) ∑_{i=1}^n X_iY_i −→P β(µ²_X + σ²_X). (15.41)

Now consider

W_n = ( (1/n) ∑_{i=1}^n X_iY_i, (1/n) ∑_{i=1}^n X_i² )′ and c = ( β(µ²_X + σ²_X), µ²_X + σ²_X )′, (15.42)


so that W_n →P c. The function g(w_1, w_2) = w_1/w_2 is continuous at w = c, hence (15.31) shows that

g(W_n) = [(1/n) ∑_{i=1}^n X_iY_i] / [(1/n) ∑_{i=1}^n X_i²] −→P g(c) = β(µ²_X + σ²_X)/(µ²_X + σ²_X), (15.43)

that is,

β̂_n = ∑_{i=1}^n X_iY_i / ∑_{i=1}^n X_i² −→P β. (15.44)

So, yes, the least squares estimator is consistent.
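A simulation sketch of (15.44) (illustrative only; it assumes numpy and a particular data-generating model consistent with (15.36)): β̂_n settles down near the true β as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, mu_x, sd_x, sd_e = 2.0, 1.0, 1.0, 1.0
for n in [50, 500, 5000, 50_000]:
    x = rng.normal(mu_x, sd_x, n)
    y = beta * x + rng.normal(0.0, sd_e, n)    # E[Y|X=x] = beta*x, Var[Y|X=x] = sd_e^2
    print(n, np.sum(x * y) / np.sum(x ** 2))   # least squares slope (15.37)
```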


Chapter 16

Asymptotics: Convergence in distribution

Convergence to a constant is helpful, but generally more information is needed, as for confidence intervals. E.g., if we can say that

(θ̂ − θ)/SE(θ̂) ≈ N(0, 1), (16.1)

then an approximate 95% confidence interval for θ would be

θ̂ ± 2 × SE(θ̂), (16.2)

where the “2" is approximately 1.96. Thus we need to find the approximate distribution of a random variable. In the asymptotic setup, we need the notion of W_n converging to a random variable. It is formalized by looking at the respective distribution function for each possible value, almost. Here is the definition.

Definition 29. Suppose W_n is a sequence of random variables, and W is a random variable. Let F_n be the distribution function of W_n, and F be the distribution function of W. Then W_n converges in distribution to W if

Fn(w) −→ F(w) (16.3)

for every w ∈ R at which F is continuous. This convergence is written

Wn −→D W. (16.4)

For example, go back to X_1, . . . , X_n iid Uniform(0, 1), but now let

W_n = n min{X_1, . . . , X_n}. (16.5)

(The minimum itself goes to 0, but by multiplying by n it may not.) The distribution function of W_n is F_n(w) = 0 if w ≤ 0, and if w > 0 (using calculations as in equation 15.12),

F_n(w) = P[W_n ≤ w] = P[min{X_1, . . . , X_n} ≤ w/n] = 1 − (1 − w/n)^n. (16.6)


Now let n→ ∞. Recalling that (1 + z/n)n → ez,

F_n(w) −→ 1 − e^{−w} if w > 0, and 0 if w ≤ 0. (16.7)

Is the right hand side a distribution function of some random variable? Yes, indeed, it is the F of W ∼ Exponential(1). (This F is continuous, so one can switch the equality from the ≤ to the >.) That is,

n min{X_1, . . . , X_n} −→D Exponential(1). (16.8)
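To see (16.8) numerically, here is a sketch (assuming numpy) comparing the empirical distribution function of n · min{X_1, . . . , X_n} to 1 − e^{−w}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 50_000
w_n = n * rng.uniform(0, 1, (reps, n)).min(axis=1)
for w in [0.5, 1.0, 2.0]:
    print(w, np.mean(w_n <= w), 1 - np.exp(-w))   # empirical F_n(w) vs Exponential(1) cdf
```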

Aside 2. In the definition, the convergence (16.3) does not need to hold at w's for which F is not continuous. This relaxation exists because sometimes the limit of the F_n's will have points at which the function is continuous from the left but not the right, whereas F's need to be continuous from the right. For example, take W_n = Bernoulli(1/2) + 1/n. It seems reasonable that W_n −→D Bernoulli(1/2). Let F_n be the distribution function for W_n:

F_n(w) = 0 if w < 1/n; 1/2 if 1/n ≤ w < 1 + 1/n; 1 if 1 + 1/n ≤ w. (16.9)

Now let n → ∞, so that

F_n(w) −→ 0 if w ≤ 0; 1/2 if 0 < w ≤ 1; 1 if 1 < w. (16.10)

That limit is not a distribution function, though it would be if the ≤'s and <'s were switched, in which case it would be the F for a Bernoulli(1/2). Luckily, the definition allows the limit to be wrong at points of discontinuity, which are 0 and 1 in this example, so we can say that W_n −→D Bernoulli(1/2).

For another example, suppose X_n ∼ Binomial(n, λ/n) for some fixed λ > 0. The distribution function of X_n is

F_n(x) = 0 if x < 0; ∑_{i=0}^{⌊x⌋} f(i; n, λ/n) if 0 ≤ x ≤ n; 1 if n < x, (16.11)

where ⌊x⌋ is the largest integer less than or equal to x, and f(i; n, λ/n) is the Binomial(n, λ/n) pmf. Taking the limit as n → ∞ of F_n requires taking the limit of the f's, so we will do that first. For a positive integer i ≤ n,

f(i; n, λ/n) = [n!/(i!(n − i)!)] (λ/n)^i (1 − λ/n)^{n−i}
 = (λ^i/i!) [n(n − 1) · · · (n − i + 1)/n^i] (1 − λ/n)^n (1 − λ/n)^{−i}. (16.12)

Now i is fixed and n → ∞. Consider the various factors on the right. The first has no n's. The second has i terms on the top, and n^i on the bottom, so can be written

n(n − 1) · · · (n − i + 1)/n^i = (n/n)((n − 1)/n) · · · ((n − i + 1)/n) = 1 · (1 − 1/n) · · · (1 − (i − 1)/n) → 1. (16.13)


The third term goes to e−λ, and the fourth goes to 1. Thus for any positive integer i,

f(i; n, λ/n) −→ e^{−λ} λ^i/i!, (16.14)

which is the pmf of the Poisson(λ). Going back to the F_n in (16.11), note that no matter how large x is, as n → ∞, eventually x < n, so that the third line never comes into play. Thus

F_n(x) −→ 0 if x < 0; ∑_{i=0}^{⌊x⌋} e^{−λ} λ^i/i! if 0 ≤ x. (16.15)

But that is the distribution function of the Poisson(λ), i.e.,

Binomial(n, λ/n) −→D Poisson(λ). (16.16)

16.1 Converging to a constant random variable

It could be that W_n −→D W, where P[W = c] = 1, that is, W is a constant random variable. (Sounds like an oxymoron.) But that looks like W_n is converging to a constant. It is. In fact,

W_n −→D W where P[W = c] = 1 if and only if W_n −→P c. (16.17)

Let F_n(w) be the distribution function of W_n, and F the distribution function of W, so that

F(w) = 0 if w < c; 1 if c ≤ w. (16.18)

For any ǫ > 0,

P[|Wn− c| ≤ ǫ] = P[Wn ≤ c + ǫ]− P[Wn < c− ǫ]. (16.19)

Also,

P[W_n ≤ c − 3ǫ/2] ≤ P[W_n < c − ǫ] ≤ P[W_n ≤ c − ǫ/2], (16.20)

hence, because Fn(w) = P[Wn ≤ w],

Fn(c + ǫ)− Fn(c− ǫ/2) ≤ P[|Wn− c| ≤ ǫ] ≤ Fn(c + ǫ)− Fn(c− 3ǫ/2). (16.21)

We use this equation to show (16.17).

1. First suppose that W_n →D W, so that F_n(w) → F(w) if w ≠ c. Then applying the convergence to w = c + ǫ and c − ǫ/2,

Fn(c + ǫ) → F(c + ǫ) = 1 and

Fn(c− ǫ/2) → F(c− ǫ/2) = 0. (16.22)

Then (16.21) shows that P[|Wn− c| ≤ ǫ]→ 1, proving that Wn →P c.

2. Next, suppose Wn −→P c. Then P[|Wn − c| ≤ ǫ]→ 1, hence from (16.21),

Fn(c + ǫ)− Fn(c− 3ǫ/2) −→ 1. (16.23)

The only way that can happen is for

Fn(c + ǫ)→ 1 and Fn(c− 3ǫ/2)→ 0. (16.24)

But since that holds for any ǫ > 0, for any w < c, F_n(w) → 0, and for any w > c, F_n(w) → 1. That is, F_n(w) → F(w) for w ≠ c, proving that W_n →D W.


16.2 Moment generating functions

It may be difficult to find distribution functions and their limits, but often moment generating functions are easier to work with. Recall that if two random variables have the same moment generating function that is finite in a neighborhood of 0, then they have the same distribution. It is true also of limits, that is, if the mgf's of a sequence of random variables converge to a mgf, then the random variables converge.

Lemma 25. Suppose W_1, W_2, . . . is a sequence of random variables, where M_n(t) is the mgf of W_n; and suppose W is a random variable with mgf M(t). If for some ǫ > 0, M_n(t) < ∞ for all n and M(t) < ∞ for all |t| < ǫ, then

Wn −→D W if and only if Mn(t) −→ M(t) for all |t| < ǫ. (16.25)

Looking again at the binomial example, if Xn ∼ Binomial(n, λ/n), then its mgf is

M_n(t) = ((1 − λ/n) + (λ/n)e^t)^n = (1 + (−λ + λe^t)/n)^n. (16.26)

Letting n → ∞,

M_n(t) −→ e^{−λ+λe^t}, (16.27)

which is the mgf of a Poisson(λ), hence (16.16).

16.3 The Central Limit Theorem

We know that sample means tend to the population mean, if the latter exists. But we can obtain more information with a distributional limit. In the normal case, we know that the sample mean is normal, and with appropriate normalization, it is Normal(0,1). A Central Limit Theorem is one that says a properly normalized sample mean approaches normality even if the original variables do not.

Start with X_1, . . . , X_n iid with mean 0 and variance 1, and mgf M_X(t), which is finite for |t| < ǫ for some ǫ > 0. The variance of X̄_n is 1/n, so to normalize it we multiply by √n:

W_n = √n X̄_n. (16.28)

To find the asymptotic distribution of Wn, we first find its mgf, Mn(t):

M_n(t) = E[e^{tW_n}] = E[e^{t√n X̄_n}]
 = E[e^{(t/√n) ∑ X_i}]
 = E[e^{(t/√n) X_i}]^n
 = M_X(t/√n)^n. (16.29)

Now M_n(t) is finite if M_X(t/√n) is, and M_X(t/√n) < ∞ if |t/√n| < ǫ, which is certainly true if |t| < ǫ. That is, M_n(t) < ∞ if |t| < ǫ.

To find the limit of M_n, first, take logs:

log(M_n(t)) = n log(M_X(t/√n)) = n R_X(t/√n), (16.30)

where R_X(t) = log(M_X(t)). Expand R_X in a Taylor series about t = 0:

R_X(t) = R_X(0) + t R′_X(0) + (t²/2) R″(t∗), t∗ between 0 and t. (16.31)


But R_X(0) = 0, and R′_X(0) = E[X] = 0, by assumption. Thus substituting t/√n for t in (16.31) yields

R_X(t/√n) = (t²/(2n)) R″(t∗_n), t∗_n between 0 and t/√n, (16.32)

hence by (16.30),

log(M_n(t)) = (t²/2) R″(t∗_n). (16.33)

The mgf M_X(t) has all its derivatives as long as |t| < ǫ, which means so does R_X. In particular, R″(t) is continuous at t = 0. As n → ∞, t∗_n gets squeezed between 0 and t/√n, hence t∗_n → 0, and

log(M_n(t)) = (t²/2) R″(t∗_n) → (t²/2) R″(0) = (t²/2) Var[X_i] = t²/2, (16.34)

because we have assumed that Var[Xi] = 1. Finally,

M_n(t) −→ e^{t²/2}, (16.35)

which is the mgf of a N(0,1), i.e.,

√n X̄_n −→D N(0, 1). (16.36)

There are many central limit theorems, depending on various assumptions, but the most basic is the following.

Theorem 10. Central limit theorem. Suppose X_1, X_2, . . . are iid with mean 0 and variance 1. Then (16.36) holds.

What we proved using (16.35) required the mgf be finite in a neighborhood of 0. This theorem does not need mgf's, only that the variance is finite. A slight generalization of the theorem has X_1, X_2, . . . iid with mean µ and variance σ², 0 < σ² < ∞, and concludes that

√n (X̄_n − µ) −→D N(0, σ²). (16.37)
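A sketch of (16.37) in action (assuming numpy): standardized means of Exponential(1) data, which are far from normal individually, already have mean near 0 and variance near σ² = 1 for moderate n.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, reps = 1.0, 50_000                       # Exponential(1): mean 1, variance 1
for n in [5, 30, 200]:
    xbar = rng.exponential(1.0, (reps, n)).mean(axis=1)
    w = np.sqrt(n) * (xbar - mu)             # sqrt(n)(Xbar_n - mu)
    print(n, round(w.mean(), 3), round(w.var(), 3))   # approach 0 and 1
```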

16.3.1 Supersizing

Convergence in distribution immediately translates to multivariate random variables. That is, suppose W_n is a p × 1 random vector with distribution function F_n. Then

Wn −→D W (16.38)

for some p× 1 random vector W with distribution function F if

Fn(w) −→ F(w) (16.39)

for all points w ∈ R^p at which F(w) is continuous.

If M_n(t) is the mgf of W_n, and M(t) is the mgf of W, and these mgf's are all finite for ‖t‖ < ǫ for some ǫ > 0, then

Wn −→D W iff Mn(t) −→ M(t) for all ‖t‖ < ǫ. (16.40)


Now for the central limit theorem. Suppose X_1, X_2, . . . are iid random vectors with mean µ, covariance matrix Σ, and mgf M_X(t) < ∞ for ‖t‖ < ǫ. Let

W_n = √n (X̄_n − µ). (16.41)

Let a be any p × 1 vector, and write

a′W_n = √n (a′X̄_n − a′µ) = √n ((1/n) ∑_{i=1}^n a′X_i − a′µ). (16.42)

Thus a′W_n is the normalized sample mean of the a′X_i's, and the regular central limit theorem 10 (actually, equation 16.37) can be applied, where σ² = Var[a′X_i] = a′Σa:

a′Wn −→D N(0, a′Σa). (16.43)

But then that means the mgf’s converge in (16.43):

E[e^{t(a′W_n)}] −→ e^{(t²/2) a′Σa}, (16.44)

for |t| ‖a‖ < ǫ. Now switch notation so that a = t and t = 1, and we have that

M_n(t) = E[e^{t′W_n}] −→ e^{(1/2) t′Σt}, (16.45)

which holds for any ‖t‖ < ǫ. The right hand side is the mgf of a N(0, Σ), so

W_n = √n (X̄_n − µ) −→D N(0, Σ). (16.46)

In fact, (16.46) holds whenever Σ is finite.

Example. Suppose X_1, X_2, . . . are iid with mean µ and variance σ². One might be interested in the joint distribution of the sample mean and variance, after some normalization. When the data are normal, we know the answer exactly, of course, but what about otherwise? We won't answer that question quite yet, but take a step by looking at the joint distribution of the sample means of the X_i's and the X_i²'s. We will assume that Var[X_i²] < ∞. Then we take

W_n = √n ((1/n) ∑_{i=1}^n (X_i, X_i²)′ − (µ, µ² + σ²)′). (16.47)

Then the central limit theorem says that

W_n −→D N(0_2, Cov[(X_i, X_i²)′]). (16.48)

Look at that covariance. We know Var[Xi ] = σ2. Also,

Var[X_i²] = E[X_i⁴] − E[X_i²]² = κ_4 − (µ² + σ²)², (16.49)

and

Cov[X_i, X_i²] = E[X_i³] − E[X_i]E[X_i²] = κ_3 − µ(µ² + σ²), (16.50)


where for convenience we’ve defined

κ_k = E[X_i^k]. (16.51)

It’s not pretty, but the final answer is

√n ((1/n) ∑_{i=1}^n (X_i, X_i²)′ − (µ, µ² + σ²)′) (16.52)

−→D N(0_2, [ σ², κ_3 − µ(µ² + σ²); κ_3 − µ(µ² + σ²), κ_4 − (µ² + σ²)² ]). (16.53)

16.4 Mapping

Lemma 24 and equation (15.31) show that if W_n →P c then g(W_n) →P g(c) if g is continuous at c. Similar results hold for convergence in distribution, and for combinations of convergences.

Lemma 26. Some mapping lemmas

1. If W_n →D W, and g : R → R is continuous at all the points in W, the space of W, then

g(W_n) −→D g(W). (16.54)

2. Similarly, for multivariate W_n (p × 1) and g (q × 1): If W_n →D W, and g : R^p → R^q is continuous at all the points in W, then

g(W_n) −→D g(W). (16.55)

3. The next results constitute what is usually called Slutsky's Theorem, or sometimes Cramér's Theorem.

Suppose that

Zn −→P c and Wn −→D W. (16.56)

Then

Z_n + W_n −→D c + W, Z_nW_n −→D cW, and, if c ≠ 0, W_n/Z_n −→D W/c. (16.57)

4. Generalizing Slutsky, if Z_n −→P c and W_n −→D W, and g : R × R → R is continuous at {c} × W, then

g(Z_n, W_n) −→D g(c, W). (16.58)

5. Finally, the multivariate version of #4. If Z_n −→P c (p_1 × 1) and W_n −→D W (p_2 × 1), and g : R^{p_1} × R^{p_2} → R^q is continuous at {c} × W, then

g(Zn, Wn) −→D g(c, W). (16.59)


The other four all follow from #2, because convergence in probability is the same as convergence in distribution to a constant random variable. Also, #2 is just the multivariate version of #1, so basically they are all the same. They appear in various places in various forms, though. The idea is that as long as the function is continuous, the limit of the function is the function of the limit.

Together with the law of large numbers and central limit theorem, these mapping results can prove a huge number of useful approximations. The t-statistic is one example. Let X_1, . . . , X_n be iid with mean µ and variance σ² ∈ (0, ∞). Student's t statistic is defined as

T_n = √n (X̄_n − µ)/S_n, where S²_n = ∑_{i=1}^n (X_i − X̄_n)²/(n − 1). (16.60)

We know that if the data are normal, this T_n ∼ t_{n−1} exactly, but if the data are not normal, who knows what the distribution is. We can find the limit, though, using Slutsky. Take

Z_n = S_n and W_n = √n (X̄_n − µ). (16.61)

Then

Z_n −→P σ and W_n −→D N(0, σ²) (16.62)

from the homework and the central limit theorem, respectively. Because σ² > 0, the final component of (16.57) shows that

T_n = W_n/Z_n −→D N(0, σ²)/σ = N(0, 1). (16.63)

Thus for large n, T_n is approximately N(0, 1) even if the data are not normal. Thus

X̄_n ± 2 S_n/√n (16.64)

is an approximate 95% confidence interval for µ. Notice that this result doesn't say anything about small n; especially, it doesn't say that the t is better than the z when the data are not normal. Other studies have shown that the t is fairly robust, so it can be used at least when the data are approximately normal. Actually, heavy tails for the X_i's mean light tails for the T_n, so the z might be better than the t in that case.
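A sketch of the Slutsky argument at work (assuming numpy): for exponential data, the coverage of the interval (16.64) approaches 95% as n grows, even though T_n is not t-distributed.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, reps = 1.0, 20_000                    # Exponential(1) has mean 1
for n in [10, 50, 500]:
    x = rng.exponential(1.0, (reps, n))
    xbar = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)             # S_n with divisor n - 1, as in (16.60)
    cover = np.mean(np.abs(xbar - mu) <= 2 * s / np.sqrt(n))
    print(n, cover)                       # approaches roughly 0.95
```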

Regression through the origin.

Recall the example in (15.36) and (15.38). We know β̂_n is a consistent estimator of β, but what about its asymptotic distribution? That is,

√n (β̂_n − β) −→D ?? (16.65)


We need to do some manipulation to get it into a form where we can use the central limit theorem, etc. To that end,

√n (β̂_n − β) = √n (∑_{i=1}^n X_iY_i / ∑_{i=1}^n X_i² − β)
 = √n (∑_{i=1}^n X_iY_i − β ∑_{i=1}^n X_i²) / ∑_{i=1}^n X_i²
 = √n ∑_{i=1}^n (X_iY_i − βX_i²) / ∑_{i=1}^n X_i²
 = [√n ∑_{i=1}^n (X_iY_i − βX_i²)/n] / [∑_{i=1}^n X_i²/n]. (16.66)

The numerator in the last expression contains the sample mean of the (X_iY_i − βX_i²)'s. Conditionally, from (15.36),

E[X_iY_i − βX_i² | X_i = x_i] = x_i βx_i − βx_i² = 0, Var[X_iY_i − βX_i² | X_i = x_i] = x_i² σ²_e, (16.67)

so that unconditionally,

E[X_iY_i − βX_i²] = 0, Var[X_iY_i − βX_i²] = E[X_i² σ²_e] + Var[0] = σ²_e (σ²_X + µ²_X). (16.68)

Thus the central limit theorem shows that

√n ∑_{i=1}^n (X_iY_i − βX_i²)/n −→D N(0, σ²_e (σ²_X + µ²_X)). (16.69)

We already know from (15.38) that ∑ X_i²/n →P σ²_X + µ²_X, hence by Slutsky (16.57),

√n (β̂_n − β) −→D N(0, σ²_e (σ²_X + µ²_X))/(σ²_X + µ²_X) = N(0, σ²_e/(σ²_X + µ²_X)). (16.70)

16.5 The ∆-method

Recall the use of the ∆-method to obtain the approximate variance of a function of a random variable, as in (6.106). Here we give a formal version of it.

Lemma 27. ∆-method. Suppose

√n (Xn − µ) −→D W, (16.71)

and the function g : R→ R has a continuous derivative at µ. Then

√n (g(Xn)− g(µ)) −→D g′(µ) W. (16.72)

Proof. As in (6.103), we have that

g(Xn) = g(µ) + (Xn − µ)g′(µ∗n), µ∗n is between Xn and µ, (16.73)


hence

√n (g(X_n) − g(µ)) = √n (X_n − µ) g′(µ∗_n). (16.74)

We wish to show that g′(µ∗_n) →P g′(µ), but first need to show that X_n →P µ. Now

X_n − µ = [√n (X_n − µ)] × (1/√n) −→D W × 0 (because 1/√n −→P 0) = 0. (16.75)

That is, X_n − µ →P 0, hence X_n →P µ. Because µ∗_n is trapped between X_n and µ, µ∗_n →P µ, which by continuity of g′ means that

g′(µ∗n) −→P g′(µ). (16.76)

Applying Slutsky (16.57) to (16.74), by (16.71) and (16.76),

√n (Xn − µ) g′(µ∗n) −→D W g′(µ), (16.77)

which is (16.72). □

Usually, the limiting W is normal, so that under the conditions on g, we have that

√n (X_n − µ) −→D N(0, σ²) =⇒ √n (g(X_n) − g(µ)) −→D N(0, g′(µ)²σ²). (16.78)

16.5.1 The median

In Section 6.3.1, we used the ∆-method to estimate the variance of order statistics. Here we apply Lemma 27 to the sample median. We have X_1, . . . , X_n iid with distribution function F. Let η be the median, so that F(η) = 1/2, and assume that the density f is positive and continuous at η. For simplicity we take n odd, n = 2k − 1 for k a positive integer, so that X_(k), the kth order statistic, is the median. From (6.100), X_(k) has the same distribution as F^{−1}(U_(k)), where U_(k) is the median of a sample of n iid Uniform(0,1)'s. Also, from (6.97),

U_(k) ∼ Beta((n + 1)/2, (n + 1)/2) = Beta(k, k). (16.79)

The objective is to find the asymptotic distribution of X_(k), or actually

√n (X_(k) − η) as n → ∞ (i.e., k → ∞). (16.80)

We start with the asymptotic distribution of U(k). It can be shown that

√k (U_(k) − 1/2) −→ N(0, 1/8). (16.81)

Thus for a function g, with continuous derivative at 1/2, the ∆-method (16.78) shows that

√k (g(U_(k)) − g(1/2)) −→ N(0, (1/8) g′(1/2)²). (16.82)


Let g(u) = F−1(u). Then

g(U_(k)) = F^{−1}(U_(k)) =_D X_(k) and g(1/2) = F^{−1}(1/2) = η. (16.83)

We also use the fact that

g′(u) = 1/F′(F^{−1}(u)) =⇒ g′(1/2) = 1/f(η). (16.84)

Thus making the substitutions in (16.82), and recalling that n = 2k − 1, we obtain for (16.80) that

√n (X_(k) − η) = √((2k − 1)/k) √k (X_(k) − η) → √2 N(0, 1/(8 f(η)²)), (16.85)

hence

√n (X_(k) − η) −→ N(0, 1/(4 f(η)²)). (16.86)

(Compare the asymptotic variance here to the approximation in (6.99).)
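A sketch checking (16.86) numerically (assuming numpy): for N(0,1) data, η = 0 and f(η)² = 1/(2π), so the asymptotic variance of √n (X_(k) − η) is π/2 ≈ 1.571.

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 10_000
for n in [11, 101, 1001]:                  # odd n: the median is the middle order statistic
    med = np.median(rng.normal(0.0, 1.0, (reps, n)), axis=1)
    print(n, round(n * med.var(), 3), round(np.pi / 2, 3))   # should approach pi/2
```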

16.5.2 Variance stabilizing transformations

Often, the variance of an estimator depends on the value of the parameter being estimated. For example, if X_n ∼ Binomial(n, p), then with p̂_n = X_n/n,

Var[p̂_n] = p(1 − p)/n. (16.87)

In regression situations, for instance, one usually desires the dependent Y_i's to have the same variance for each i, so that if these Y_i's are binomial, or Poisson, the variance will not be constant. However, taking a function of the variables may achieve approximately equal variances. Such a function is called a variance stabilizing transformation. Formally, if θ̂_n is an estimator of θ, then we wish to find a g such that

√n (g(θ̂_n) − g(θ)) −→D N(0, 1). (16.88)

The “1" for the variance is arbitrary. The important thing is that it does not depend on θ.

In the binomial example, the variance stabilizing g would satisfy

√n (g(p̂_n) − g(p)) −→D N(0, 1). (16.89)

We know that

√n (p̂_n − p) −→D N(0, p(1 − p)), (16.90)

and by the ∆-method,

√n (g(p̂_n) − g(p)) −→D N(0, g′(p)² p(1 − p)). (16.91)


What should g be so that the variance is 1? We need to solve

g′(p) = 1/√(p(1 − p)), (16.92)

so that

g(p) = ∫_0^p 1/√(y(1 − y)) dy. (16.93)

First, let u = √y, so that y = u² and dy = 2u du, and

g(p) = ∫_0^{√p} [1/(u√(1 − u²))] 2u du = 2 ∫_0^{√p} 1/√(1 − u²) du. (16.94)

The integral is arcsin(u), which means the variance stabilizing transformation is

g(p) = 2 arcsin(√p). (16.95)

Note that adding a constant to g won't change the derivative. The approximation suggested by (16.89) is then

2 arcsin(√p̂_n) ≈ N(2 arcsin(√p), 1/n). (16.96)

An approximate 95% confidence interval for 2 arcsin(√p) is

2 arcsin(√p̂_n) ± 2/√n. (16.97)

That interval can be inverted to obtain the interval for p, that is, apply g^{−1}(u) = sin(u/2)² to both ends:

p ∈ ( sin(arcsin(√p̂_n) − 1/√n)², sin(arcsin(√p̂_n) + 1/√n)² ). (16.98)

This interval may be slightly better than the usual approximate interval,

p̂_n ± 2 √(p̂_n(1 − p̂_n)/n). (16.99)

16.6 The MLE in Exponential Families

Consider a regular one-dimensional exponential family model, where θ is the natural parameter and x is the natural statistic, i.e., X_1, . . . , X_n are iid with pdf

f (x | θ) = a(x)eθx−ψ(θ). (16.100)

We know that the maximum likelihood estimator of θ is

θ̂_n = µ^{−1}(X̄_n), where µ(θ) = E_θ[X_i] = ψ′(θ). (16.101)

Also, by the central limit theorem,

√n (X̄_n − µ(θ)) −→D N(0, I_1(θ)), (16.102)


where I1 is the Fisher information in one observation:

I1(θ) = µ′(θ) = ψ′′(θ) = Varθ [Xi]. (16.103)

The asymptotic distribution of θ̂_n can then be found using the ∆-method with g(w) = µ^{−1}(w). As in (6.107),

g′(w) = 1/µ′(µ^{−1}(w)), (16.104)

so (16.102) implies

√n (µ^{−1}(X̄_n) − µ^{−1}(µ(θ))) −→D N(0, [1/µ′(µ^{−1}(µ(θ)))²] I_1(θ)). (16.105)

But,

θ̂_n = µ^{−1}(X̄_n), µ^{−1}(µ(θ)) = θ, (16.106)

and

[1/µ′(µ^{−1}(µ(θ)))²] I_1(θ) = [1/µ′(θ)²] I_1(θ) = [1/I_1(θ)²] I_1(θ) = 1/I_1(θ). (16.107)

That is,

√n (θ̂_n − θ) −→D N(0, 1/I_1(θ)), (16.108)

so that for large n,

θ̂_n ≈ N(θ, 1/(n I_1(θ))). (16.109)

Notice that, at least asymptotically, the MLE achieves the Cramér-Rao lower bound. Not that the MLE is unbiased, but it at least is consistent.

One limitation of the result (16.109) is that the asymptotic variance depends on the unknown parameter. But we can consistently estimate the parameter, and I_1(θ) is a continuous function, hence

I_1(θ̂_n) −→P I_1(θ), (16.110)

so that Slutsky can be used to show that from (16.108),

√(n I_1(θ̂_n)) (θ̂_n − θ) −→D N(0, 1), (16.111)

and an approximate 95% confidence interval for θ is

θ̂_n ± 2/√(n I_1(θ̂_n)). (16.112)

What about functions g(θ)? We know the MLE of g(θ) is g(θ̂_n), and if g′ is continuous,

√n (g(θ̂_n) − g(θ)) −→D N(0, g′(θ)²/I_1(θ)). (16.113)

Notice that this variance is also the CRLB for unbiased estimates of g(θ), at least after dividing by n.


16.7 The multivariate ∆-method

It might be that we have a function of several random variables to deal with, or more generally we have several functions of several variables. For example, we might be interested in the mean and variance simultaneously, as in (15.4). So we start with a sequence of p × 1 random vectors, whose asymptotic distribution is multivariate normal:

√n (X̄_n − µ) −→D N(0_p, Σ), (16.114)

and a function g : R^p → R^q. We cannot just take the derivative of g, since there are pq of them. What we need is the entire matrix of derivatives, just as for finding the Jacobian. Letting

g(w) = (g_1(w), g_2(w), . . . , g_q(w))′, w = (w_1, w_2, . . . , w_p)′, (16.115)

define the p× q matrix

D(w) = [ ∂g_1(w)/∂w_1, ∂g_2(w)/∂w_1, · · · , ∂g_q(w)/∂w_1;
         ∂g_1(w)/∂w_2, ∂g_2(w)/∂w_2, · · · , ∂g_q(w)/∂w_2;
         · · · ;
         ∂g_1(w)/∂w_p, ∂g_2(w)/∂w_p, · · · , ∂g_q(w)/∂w_p ]. (16.116)

Lemma 28. Multivariate ∆-method. Suppose (16.114) holds, and D(w) in (16.116) is continuous at w = µ. Then

√n (g(X̄_n) − g(µ)) −→D N(0_q, D(µ)′ΣD(µ)). (16.117)

The Σ is p × p and D is p × q, so that the covariance in (16.117) is q × q, as it should be. Some examples follow.

Mean and variance

Go back to the example that ended with (16.53): X_1, . . . , X_n are iid with mean µ, variance σ², E[X_i³] = κ_3 and E[X_i⁴] = κ_4. Ultimately, we wish the asymptotic distribution of X̄_n and S²_n, so we start with that of (∑ X_i, ∑ X_i²), so that

X̄_n = (1/n) ∑_{i=1}^n (X_i, X_i²)′, µ = (µ, µ² + σ²)′, (16.118)

and

Σ = [ σ², κ_3 − µ(µ² + σ²); κ_3 − µ(µ² + σ²), κ_4 − (µ² + σ²)² ]. (16.119)

With S²_n = ∑ X_i²/n − X̄_n²,

(X̄_n, S²_n)′ = g(X̄_n, ∑ X_i²/n) = (g_1(X̄_n, ∑ X_i²/n), g_2(X̄_n, ∑ X_i²/n))′, (16.120)


where

g_1(w_1, w_2) = w_1 and g_2(w_1, w_2) = w_2 − w_1². (16.121)

Then

D(w) = [ ∂w_1/∂w_1, ∂(w_2 − w_1²)/∂w_1; ∂w_1/∂w_2, ∂(w_2 − w_1²)/∂w_2 ] = [ 1, −2w_1; 0, 1 ]. (16.122)

Also,

g(µ) = (µ, σ² + µ² − µ²)′ = (µ, σ²)′ and D(µ) = [ 1, −2µ; 0, 1 ], (16.123)

and

D(µ)′ΣD(µ) = [ 1, 0; −2µ, 1 ] [ σ², κ_3 − µ(µ² + σ²); κ_3 − µ(µ² + σ²), κ_4 − (µ² + σ²)² ] [ 1, −2µ; 0, 1 ]
 = [ σ², κ_3 − µ³ − 3µσ²; κ_3 − µ³ − 3µσ², κ_4 − σ⁴ − 4µκ_3 + 3µ⁴ + 6µ²σ² ].

Yikes! Notice in particular that the sample mean and variance are not necessarily asymptotically independent.

Before we go on, let's assume the data are normal. In that case,

κ_3 = µ³ + 3µσ² and κ_4 = µ⁴ + 6µ²σ² + 3σ⁴, (16.124)

(left to the reader), and, magically,

D(µ)′ΣD(µ) = [ σ², 0; 0, 2σ⁴ ], (16.125)

hence

√n ((X̄_n, S²_n)′ − (µ, σ²)′) −→D N(0_2, [ σ², 0; 0, 2σ⁴ ]). (16.126)

Actually, that is not surprising, since we know the variance of X̄_n is σ²/n, and that of S²_n is the variance of a χ²_{n−1} times (σ²/n)², which is 2(n − 1)σ⁴/n². Multiplying those by n and letting n → ∞ yields the diagonals σ² and 2σ⁴. Also, the mean and variance are independent, so their covariance is 0.

From these, we can find the coefficient of variation, or the noise-to-signal ratio,

cv = σ/µ and its sample version ĉv = S_n/X̄_n. (16.127)

Then ĉv = h(X̄_n, S²_n) where h(w_1, w_2) = √w_2/w_1. The derivatives here are

D_h(w) = ( −√w_2/w_1², 1/(2√w_2 w_1) )′ =⇒ D_h(µ, σ²) = ( −σ/µ², 1/(2σµ) )′. (16.128)

Then, assuming that µ ≠ 0,

D′_h Σ D_h = ( −σ/µ², 1/(2σµ) ) [ σ², 0; 0, 2σ⁴ ] ( −σ/µ², 1/(2σµ) )′
 = σ⁴/µ⁴ + (1/2) σ²/µ². (16.129)


The asymptotic distribution is then

√n (ĉv − cv) −→D N(0, σ⁴/µ⁴ + (1/2) σ²/µ²) = N(0, cv²(cv² + 1/2)). (16.130)

For example, data on n = 102 female students' heights had a mean of 65.56 and a standard deviation of 2.75, so the ĉv = 2.75/65.56 = 0.0419. We can find a confidence interval by estimating the variance in (16.130) in the obvious way:

ĉv ± 2 |ĉv| √(0.5 + ĉv²)/√n = 0.0419 ± 2 × 0.0029 = (0.0361, 0.0477). (16.131)

For the men, the mean is 71.25 and the sd is 2.94, so their ĉv is 0.0413. That's practically the same as for the women. The men's standard error of ĉv is 0.0037 (their n = 64), so a confidence interval for the difference between the women and men is

0.0419 − 0.0413 ± 2√(0.0029² + 0.0037²) = 0.0006 ± 0.0094. (16.132)

Clearly 0 is in that interval, so there does not appear to be any difference between the coefficients of variation.
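The arithmetic behind (16.131) and (16.132), as a sketch (assuming numpy; the numbers are those quoted in the text):

```python
import numpy as np

def cv_and_se(mean, sd, n):
    """Sample coefficient of variation and its standard error from (16.130)."""
    cv = sd / mean
    se = abs(cv) * np.sqrt(0.5 + cv ** 2) / np.sqrt(n)
    return cv, se

cv_w, se_w = cv_and_se(65.56, 2.75, 102)   # women: about 0.0419 and 0.0029
cv_m, se_m = cv_and_se(71.25, 2.94, 64)    # men:   about 0.0413 and 0.0037
diff = cv_w - cv_m
se_diff = np.sqrt(se_w ** 2 + se_m ** 2)
print((diff - 2 * se_diff, diff + 2 * se_diff))  # an interval containing 0
```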

Correlation coefficient

Consider the bivariate normal sample,

(X_1, Y_1)′, . . . , (X_n, Y_n)′ are iid ∼ N(0_2, [ 1, ρ; ρ, 1 ]). (16.133)

The sample correlation coefficient in this case (we don't need to subtract the means) is

r_n = ∑_{i=1}^n X_iY_i / √(∑_{i=1}^n X_i² ∑_{i=1}^n Y_i²). (16.134)

From the homework, we know that r_n →P ρ. What about the asymptotic distribution? Notice that r_n can be written as a function of three sample means,

r_n = g(∑_{i=1}^n X_iY_i/n, ∑_{i=1}^n X_i²/n, ∑_{i=1}^n Y_i²/n) where g(w_1, w_2, w_3) = w_1/√(w_2 w_3). (16.135)

First, apply the central limit theorem to the three means:

√n ( (∑_{i=1}^n X_iY_i/n, ∑_{i=1}^n X_i²/n, ∑_{i=1}^n Y_i²/n)′ − µ ) −→D N(0_3, Σ). (16.136)

Now E[X_iY_i] = ρ and E[X_i²] = E[Y_i²] = 1, hence

µ = (ρ, 1, 1)′. (16.137)


The covariance is a little more involved. The homework shows Var[X_iY_i] = 1 + ρ². Because X_i and Y_i are N(0,1), their squares are χ²_1, whose variance is 2. Next,

Cov[X_iY_i, X_i²] = E[X_i³Y_i] − E[X_iY_i]E[X_i²] = E[E[X_i³Y_i | X_i]] − ρ
 = E[X_i³ ρX_i] − ρ
 = 3ρ − ρ = 2ρ, (16.138)

because E[X_i⁴] = Var[X_i²] + E[X_i²]² = 2 + 1 = 3. By symmetry, Cov[X_iY_i, Y_i²] = 2ρ, too.

Finally,

Cov[Y_i², X_i²] = E[X_i²Y_i²] − E[Y_i²]E[X_i²] = E[E[X_i²Y_i² | X_i]] − 1
 = E[X_i²(ρ²X_i² + 1 − ρ²)] − 1
 = 2ρ². (16.139)

Putting those together we have that

Σ = Cov[(X_iY_i, X_i², Y_i²)′] = [ 1 + ρ², 2ρ, 2ρ; 2ρ, 2, 2ρ²; 2ρ, 2ρ², 2 ]. (16.140)

Now for the derivatives of g(w) = w_1/√(w_2 w_3):

D(w) = ( 1/√(w_2 w_3), −(1/2) w_1/(w_2^{3/2} √w_3), −(1/2) w_1/(w_3^{3/2} √w_2) )′ =⇒ D(µ) = D(ρ, 1, 1) = ( 1, −ρ/2, −ρ/2 )′. (16.141)

The ∆-method applied to (16.136) shows that

√n (rn − ρ) −→D N(0, (1− ρ2)2), (16.142)

where that asymptotic variance is found by just writing out

D(µ)′ΣD(µ) = ( 1, −ρ/2, −ρ/2 ) [ 1 + ρ², 2ρ, 2ρ; 2ρ, 2, 2ρ²; 2ρ, 2ρ², 2 ] ( 1, −ρ/2, −ρ/2 )′ = (1 − ρ²)². (16.143)

Linear functions

If A is a q × p matrix, then it is easy to get the asymptotic distribution of AX̄_n:

√n (X̄_n − µ) −→D N(0_p, Σ) =⇒ √n (AX̄_n − Aµ) −→D N(0_q, AΣA′), (16.144)

because for the function g(w) = Aw, D(w) = A′.


Chapter 17

Likelihood Estimation

The challenge when trying to estimate a parameter is to first find an estimator, and second to find a good one, and third to find its standard error, and fourth to find its distribution. We have found many nice estimators, but faced with a new situation, what does one do? Fortunately, there is a solution to all four challenges, at least when n is large (and certain conditions hold): the maximum likelihood estimator. We have already seen this result in the regular one-dimensional exponential family case (Section 16.6), where

√n (θ̂_n − θ) −→D N(0, 1/I_1(θ)). (17.1)

Plus we have that

I_1(θ̂_n) −→P I_1(θ), (17.2)

hence

√(n I_1(θ̂_n)) (θ̂_n − θ) −→D N(0, 1). (17.3)

Thus the MLE is asymptotically normal, consistent, asymptotically efficient, and we can estimate the standard error effectively. What happens outside of exponential families?

17.1 The setup: Cramér’s conditions

We need a number of technical assumptions to hold. They easily hold in exponential families, but for other pdf's they may or may not be easy to verify. We start with X_1, . . . , X_n iid, each with space X_0 and pdf f(x | θ), θ ∈ Θ = (a, b). First, we need that the space of X_i is the same for each θ, which is satisfied if

f (x | θ) > 0 for all x ∈ X0, θ ∈ Θ. (17.4)

This requirement eliminates the Uniform(0, θ), for example. Which is not to say that the MLE is bad in this case, but that the asymptotic normality, etc., does not hold. We also need that

∂f(x | θ)/∂θ, ∂²f(x | θ)/∂θ², ∂³f(x | θ)/∂θ³ exist for all x ∈ X_0, θ ∈ Θ. (17.5)


In order for the score and information functions to exist and behave correctly, assume that for any θ ∈ Θ,

∫_X ∂f(x | θ)/∂θ dx = (∂/∂θ) ∫_X f(x | θ) dx (= 0) (17.6)

and

∫_X ∂²f(x | θ)/∂θ² dx = (∂²/∂θ²) ∫_X f(x | θ) dx (= 0). (17.7)

(Replace the integrals with sums for the discrete case.) Denote the loglikelihood for one observation by l:

l(θ; x) = log(f(x | θ)). (17.8)

Define the score function by

S(θ; x) = l′(θ; x), (17.9)

and the Fisher information in one observation by

I_1(θ) = E_θ[S(θ; X)²]. (17.10)

Assume that

0 < I_1(θ) < ∞ for all θ ∈ Θ. (17.11)

We have the following, which was used in deriving the Cramér-Rao lower bound.

Lemma 29. If (17.4), (17.6), (17.7), and (17.10) hold, then

Eθ [S(θ; X)] = 0, hence Varθ [S(θ, X)] = I1(θ), (17.12)

and

I_1(θ) = −E_θ[l″(θ; X)] = −E_θ[S′(θ; X)]. (17.13)

Proof. First, by definition of the score function (17.9),

E_θ[S(θ; X)] = E_θ[(∂/∂θ) log(f(X | θ))]
 = ∫_X [∂f(x | θ)/∂θ / f(x | θ)] f(x | θ) dx
 = ∫_X ∂f(x | θ)/∂θ dx
 = 0 by (17.6). (17.14)

Next, write

E_θ[l″(θ; X)] = E_θ[(∂²/∂θ²) log(f(X | θ))]
 = E_θ[ ∂²f(X | θ)/∂θ² / f(X | θ) − (∂f(X | θ)/∂θ / f(X | θ))² ]
 = ∫_{X_0} [∂²f(x | θ)/∂θ² / f(x | θ)] f(x | θ) dx − E_θ[S(θ; X)²]
 = (∂²/∂θ²) ∫_{X_0} f(x | θ) dx − E_θ[S(θ; X)²]
 = 0 − I_1(θ), (17.15)


which proves (17.13). □
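As a concrete check of Lemma 29, here is a sketch (assuming numpy) for the Exponential with rate θ, where l(θ; x) = log θ − θx, S(θ; x) = 1/θ − x, and I_1(θ) = 1/θ²:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, reps = 2.0, 500_000
x = rng.exponential(1 / theta, reps)     # f(x | theta) = theta * exp(-theta * x)

score = 1 / theta - x                    # S(theta; x) = d/dtheta [log(theta) - theta*x]
l2 = -1 / theta ** 2                     # l''(theta; x), constant in x for this model
print(round(score.mean(), 4))            # approximately 0, as in (17.12)
print(round(np.mean(score ** 2), 4), -l2, 1 / theta ** 2)   # all approximately I_1(theta)
```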

One more technical assumption we need is that for each θ ∈ Θ (which will take the role of the “true" value of the parameter), there exists an ǫ > 0 and a function M(x) such that

|l ′′′(t; x)| ≤ M(x) for θ − ǫ < t < θ + ǫ, and Eθ [M(X)] < ∞. (17.16)

17.2 Consistency

First we address the question of whether the MLE is a consistent estimator of θ. The short answer is “Yes," although things can get sticky if there are multiple maxima. But before we get to the results, there are some mathematical prerequisites to deal with.

17.2.1 Convexity and Jensen’s inequality

Definition 30. Convexity. A function g : X_0 → R, X_0 ⊂ R, is convex if for each x_0 ∈ X_0, there exist α_0 and β_0 such that

g(x_0) = α_0 + β_0 x_0, (17.17)

and

g(x) ≥ α_0 + β_0 x for all x ∈ X_0. (17.18)

The function is strictly convex if (17.17) holds, and

g(x) > α_0 + β_0 x for all x ∈ X_0, x ≠ x_0. (17.19)

The definition basically means that the tangent to g at any point lies below the function. If g″(x) exists for all x, then g is convex if and only if g″(x) ≥ 0 for all x, and it is strictly convex if g″(x) > 0 for all x. Notice that the line need not be unique. For example, g(x) = |x| is convex, but when x_0 = 0, any line through (0, 0) with slope between ±1 will lie below g.

By the same token, any line segment connecting two points on the curve lies above the curve, as in the next lemma.

Lemma 30. If g is convex, x, y ∈ X0, and 0 < ǫ < 1, then

ǫg(x) + (1− ǫ)g(y) ≥ g(ǫx + (1− ǫ)y). (17.20)

If g is strictly convex, then

ǫg(x) + (1− ǫ)g(y) > g(ǫx + (1− ǫ)y) for x 6= y. (17.21)

Rather than prove this lemma, we will prove the more general result for randomvariables.

Lemma 31. Jensen's inequality. Suppose that X is a random variable with space X0, and that E[X] exists. If the function g is convex, then

E[g(X)] ≥ g(E[X]), (17.22)

where E[g(X)] may be +∞. Furthermore, if g is strictly convex and X is not constant,

E[g(X)] > g(E[X]). (17.23)


Proof. We'll prove it just in the strictly convex case, when X is not constant. The other case is easier. Apply Definition 30 with x0 = E[X], so that

g(E[X]) = α0 + β0E[X], and g(x) > α0 + β0x for all x ≠ E[X]. (17.24)

But then
E[g(X)] > E[α0 + β0X] = α0 + β0E[X] = g(E[X]), (17.25)

because there is a positive probability X ≠ E[X]. 2

Why does Lemma 31 imply Lemma 30?1

A mnemonic device for which way the inequality goes is to think of the convex function g(x) = x². Jensen implies that

E[X²] ≥ E[X]², (17.26)

but that is the same as saying that Var[X] ≥ 0. Also, Var[X] > 0 unless X is a constant.

Convexity is also defined for x being a p × 1 vector, so that X0 ⊂ Rp, in which case the line α0 + β0x in Definition 30 becomes a hyperplane α0 + β0′x. Jensen's inequality follows as well, where we just turn X into a vector.
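For readers who like to see this numerically, here is a small Monte Carlo sketch of Jensen's inequality with the strictly convex g(x) = x², as in (17.26); the exponential distribution used is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)  # any non-constant X with a mean

# Jensen with g(x) = x^2: E[g(X)] >= g(E[X]), strictly here since X is not constant.
lhs = np.mean(x**2)         # estimate of E[X^2]
rhs = np.mean(x)**2         # estimate of (E[X])^2
print(lhs, rhs, lhs > rhs)  # the gap is Var[X] > 0, as in (17.26)
```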

17.2.2 A consistent sequence of roots

The loglikelihood function for the data is

ln(θ; x1, . . . , xn) = ∑_{i=1}^n log( f (xi | θ)) = ∑_{i=1}^n l(θ; xi). (17.27)

For now, we assume that the space does not depend on θ (17.4), and that the first derivative in (17.5) is continuous. Also, for each n and x1, . . . , xn, there exists a unique solution to l′n(θ; x1, . . . , xn) = 0:

l′n(θn; x1, . . . , xn) = 0, θn ∈ Θ. (17.28)

Note that this θn is a function of x1, . . . , xn. It is also generally the maximum likelihood estimate, although it is possible it is a local minimum or an inflection point rather than the maximum.

Now suppose θ is the true parameter, and take ε > 0. Look at the difference, divided by n, of the loglikelihoods at θ and θ + ε:

(1/n)(ln(θ; x1, . . . , xn) − ln(θ + ε; x1, . . . , xn)) = (1/n) ∑_{i=1}^n log( f (xi | θ)/ f (xi | θ + ε) )
    = (1/n) ∑_{i=1}^n − log( f (xi | θ + ε)/ f (xi | θ) ). (17.29)

The final expression is the mean of iid random variables, hence by the WLLN it converges in probability to the expected value of the summand:

(1/n)(ln(θ; x1, . . . , xn) − ln(θ + ε; x1, . . . , xn)) →P Eθ[ − log( f (X | θ + ε)/ f (X | θ) ) ]. (17.30)

1Take X to be the random variable with P[X = x] = ε and P[X = y] = 1 − ε.


Now apply Jensen's inequality, Lemma 31, to that expected value, with g(x) = − log(x), and the random variable being f (X | θ + ε)/ f (X | θ). This g is strictly convex, and the random variable is not constant because the parameters are different, hence

Eθ[ − log( f (X | θ + ε)/ f (X | θ) ) ] > − log( Eθ[ f (X | θ + ε)/ f (X | θ) ] )
    = − log( ∫_X0 ( f (x | θ + ε)/ f (x | θ) ) f (x | θ) dx )
    = − log( ∫_X0 f (x | θ + ε) dx )
    = − log(1)
    = 0. (17.31)

The same result holds for θ − ε, hence

(1/n)(ln(θ; x1, . . . , xn) − ln(θ + ε; x1, . . . , xn)) →P a > 0 and
(1/n)(ln(θ; x1, . . . , xn) − ln(θ − ε; x1, . . . , xn)) →P b > 0. (17.32)

These equations mean that eventually, the likelihood at θ is higher than that at θ ± ε; precisely, dropping the Xi's for convenience:

Pθ[ln(θ) > ln(θ + ε) and ln(θ) > ln(θ − ε)] → 1. (17.33)

Note that if ln(θ) > ln(θ + ε) and ln(θ) > ln(θ − ε), then between θ − ε and θ + ε, the likelihood goes up then comes down again. Because the derivative is continuous, somewhere between θ ± ε the derivative must be 0. By assumption, that point is the unique root θn. It is also the maximum. Which means that

ln(θ) > ln(θ + ε) and ln(θ) > ln(θ − ε) =⇒ θ − ε < θn < θ + ε. (17.34)

By (17.33), the left hand side of (17.34) has probability going to 1, hence

P[|θn − θ| < ε] → 1 =⇒ θn →P θ, (17.35)

and the MLE is consistent.

The requirement that there is a unique root (17.28) for all n and every set of xi's is too strong. The main problem is that sometimes the likelihood is not maximized in the interior of Θ = (a, b), but at a or b. For example, in the Binomial case, if the number of successes is 0, then the MLE of p would be 0, which is not in (0, 1). Thus in the next theorem, we need only that there is a unique root with probability going to 1.

Theorem 11. Suppose that

Pθ [l′n(t; X1, . . . , Xn) has a unique root θn ∈ Θ] −→ 1. (17.36)

Then
θn →P θ. (17.37)


Technically, if there is not a unique root, you can choose θn to be whatever you want, but typically it would be either one of a number of roots, or one of the limiting values a and b. Equation (17.36) does not always hold. For example, in the Cauchy location-family case, the number of roots goes in distribution to 1 + Poisson(1/π) (Reeds, 1985)², so there is always a good chance of two or more roots. But it will be true that if you pick the right root, it will be consistent.
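A quick simulation can illustrate the consistency result. The sketch below assumes an Exponential model with rate θ (my choice, not an example from this section), for which the likelihood equation has the unique root θn = 1/X̄n; sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 2.0   # true rate; f(x | theta) = theta * exp(-theta * x)

# The likelihood equation l'_n(theta) = n/theta - sum(x) = 0 has the unique
# root hat(theta)_n = 1/xbar_n, which should converge in probability to theta.
for n in [10, 100, 1000, 10000]:
    x = rng.exponential(scale=1/theta, size=n)
    print(n, 1 / x.mean())
```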

17.3 Asymptotic normality

To find the asymptotic distribution of the MLE, we first expand the derivative of the loglikelihood, evaluated at θn, around the true θ:

l′n(θn) = l′n(θ) + (θn − θ) l′′n(θ) + ((θn − θ)²/2) l′′′n(θ*n), θ*n between θ and θn. (17.38)

If θn is a root of l ′n as in (17.28), then

0 = l′n(θ) + (θn − θ) l′′n(θ) + ((θn − θ)²/2) l′′′n(θ*n)
=⇒ (θn − θ)( l′′n(θ) + ((θn − θ)/2) l′′′n(θ*n) ) = −l′n(θ)
=⇒ √n (θn − θ) = − [ √n (1/n) l′n(θ) ] / [ (1/n) l′′n(θ) + (θn − θ)(1/n) l′′′n(θ*n)/2 ]. (17.39)

The task is then to find the limits of the three terms on the right: the numerator and the two summands in the denominator.

Theorem 12. Asymptotic normality of the MLE (Cramér). Suppose that the assumptions in Section 17.1 hold, i.e., (17.4), (17.5), (17.6), (17.7), (17.11), and (17.16). Also, suppose that θn is a consistent sequence of roots of (17.28), that is, l′n(θn) = 0 and θn →P θ, where θ is the true parameter. Then

√n (θn − θ) →D N(0, 1/I1(θ)). (17.40)

Proof. First,

(1/n) l′n(θ) = (1/n) ∑_{i=1}^n l′(θ; xi) = (1/n) ∑_{i=1}^n S(θ; xi). (17.41)

The scores S(θ; Xi) are iid with mean 0 and variance I1(θ) as in (17.12), so that the central limit theorem implies that

√n (1/n) l′n(θ) →D N(0, I1(θ)). (17.42)

Second,

(1/n) l′′n(θ) = (1/n) ∑_{i=1}^n l′′(θ; xi). (17.43)

2Reeds (1985), Annals of Statistics, pages 775–784.


The l ′′(θ; Xi)’s are iid with mean −I1(θ) by (17.13), hence the WLLN shows that

(1/n) l′′n(θ) →P −I1(θ). (17.44)

Third, consider the M(xi) from assumption (17.16). By the WLLN,

(1/n) ∑_{i=1}^n M(Xi) →P Eθ[M(X)] < ∞, (17.45)

and we have assumed that θn →P θ, hence

(θn − θ) (1/n) ∑_{i=1}^n M(Xi) →P 0. (17.46)

Thus for any δ > 0,

P[ |θn − θ| < δ and |(θn − θ)(1/n) ∑_{i=1}^n M(Xi)| < δ ] → 1. (17.47)

Now take δ < ε, where ε is from the assumption (17.16). Then

|θn − θ| < δ =⇒ |θ*n − θ| < δ
            =⇒ |l′′′(θ*n; xi)| ≤ M(xi) by (17.16)
            =⇒ |(1/n) l′′′n(θ*n)| ≤ (1/n) ∑_{i=1}^n M(xi). (17.48)

Thus

|θn − θ| < δ and |(θn − θ)(1/n) ∑_{i=1}^n M(Xi)| < δ =⇒ |(θn − θ)(1/n) l′′′n(θ*n)| < δ, (17.49)

and (17.47) shows that

P[ |(θn − θ)(1/n) l′′′n(θ*n)| < δ ] → 1. (17.50)

That is,

(θn − θ)(1/n) l′′′n(θ*n) →P 0. (17.51)

Putting together (17.42), (17.44), and (17.51),

√n (θn − θ) = − [ √n (1/n) l′n(θ) ] / [ (1/n) l′′n(θ) + (θn − θ)(1/n) l′′′n(θ*n)/2 ]
           →D − N(0, I1(θ)) / ( −I1(θ) + 0 )
           = N(0, 1/I1(θ)), (17.52)

which proves the theorem (17.40). 2


Note. The assumption that we have a consistent sequence of roots can be relaxed to the condition (17.36), that is, θn has to be a root of l′n only with high probability:

P[l′n(θn) = 0 and θn ∈ Θ] → 1. (17.53)

As in the exponential family case, if I1(θ) is continuous, I1(θn) →P I1(θ), so that (17.3) holds here, too. It may be that I1(θ) is annoying to calculate. One can instead use the observed Fisher Information, which is defined to be

I1(θ; x1, . . . , xn) = −(1/n) l′′n(θ). (17.54)

The advantage is that the second derivative itself is used, and the expected value of it does not need to be calculated. Using θn yields a consistent estimate of I1(θ):

I1(θn; x1, . . . , xn) = −(1/n) l′′n(θ) − (θn − θ)(1/n) l′′′n(θ*n)
                     →P I1(θ) + 0, (17.55)

by (17.44) and (17.51). It is thus legitimate to use, for large n, either of the following as approximate 95% confidence intervals:

θn ± 2/√(n I1(θn)) (17.56)

or

( θn ± 2/√(n I1(θn; x1, . . . , xn)) ) = ( θn ± 2/√(−l′′n(θn)) ). (17.57)

Example: Fruit flies. Go back to the fruit fly example in HW#5 of STAT 410. We used Newton-Raphson to calculate the MLE, iterating the equation

θ(i) = θ(i−1) − l′n(θ(i−1)) / l′′n(θ(i−1)). (17.58)

Starting with the guess θ(1) = 0.5, the iterations proceeded as follows:

i    θ(i)        l′n(θ(i))    l′′n(θ(i))
1    0.5         −5.333333    −51.55556
2    0.396552     0.038564    −54.50161
3    0.397259     0.000024    −54.43377
4    0.397260     0           −54.43372
                                           (17.59)

So θMLE = 0.3973, but notice that we automatically have the observed Fisher Information at the MLE as well,

I1(θn; y1, . . . , y10) = (1/10) 54.4338. (17.60)

An approximate 95% confidence interval for θ is then

(0.3973 ± 2/√54.4338) = (0.3973 ± 2 × 0.1355) = (0.1263, 0.6683). (17.61)
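Since the fruit fly likelihood itself is not reproduced here, the following minimal sketch runs the same Newton-Raphson iteration (17.58) and observed-information interval (17.57) on a hypothetical iid Geometric(θ) sample; the data values are made up for illustration.

```python
import numpy as np

# Hypothetical data: iid Geometric(theta) counts, f(x | theta) = theta (1 - theta)^x,
# so l'_n(theta) = n/theta - sum(x)/(1-theta) and l''_n(theta) = -n/theta^2 - sum(x)/(1-theta)^2.
x = np.array([0, 2, 1, 0, 3, 1, 0, 0, 2, 2])
n, s = len(x), x.sum()

lp = lambda t: n / t - s / (1 - t)              # l'_n(theta)
lpp = lambda t: -n / t**2 - s / (1 - t)**2      # l''_n(theta)

theta = 0.5                                     # starting guess
for i in range(6):
    theta = theta - lp(theta) / lpp(theta)      # Newton-Raphson step, as in (17.58)

se = 1 / np.sqrt(-lpp(theta))                   # observed-information standard error, as in (17.57)
print(theta, (theta - 2 * se, theta + 2 * se))  # MLE (here n/(n + sum(x))) and approximate 95% CI
```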


17.4 Asymptotic efficiency

We do not expect the MLE to be unbiased or to have a variance that achieves the Cramér-Rao Lower Bound. In fact, it may be that the mean or variance of the MLE does not exist. For example, the MLE for 1/λ in the Poisson case is 1/X̄n, which does not have a finite mean because there is a positive probability that X̄n = 0. But under the given conditions, if n is large, the MLE is close in distribution to a random variable that is unbiased and that does achieve the CRLB. Recall that we defined efficiency of an unbiased estimator by the ratio of the CRLB to the variance of the estimator. A parallel asymptotic notion, the asymptotic efficiency, looks at the ratio of 1/I1(θ) to the variance in the asymptotic distribution of appropriate estimators.

A sequence δn is a consistent and asymptotically normal sequence of estimators of g(θ) if

δn →P g(θ) and √n (δn − g(θ)) →D N(0, σ²g(θ)) (17.62)

for some σ²g(θ). That is, it is consistent and asymptotically normal. The asymptotic normality implies the consistency, because g(θ) is subtracted from the estimator in the second convergence.

Theorem 13. Suppose the conditions in Section 17.1 hold. If the sequence δn is a consistent and asymptotically normal estimator of g(θ), and g′ is continuous, then

σ²g(θ) ≥ g′(θ)²/I1(θ) (17.63)

for all θ ∈ Θ except perhaps for a set of Lebesgue measure 0.

That coda about “Lebesgue measure 0" is there because it is possible to trick up the estimator so that it is “superefficient" at a few points. If σ²g(θ) is continuous in θ, then you can ignore that bit. Also, the conditions need not be quite as strict as in Section 17.1 in that the part about the third derivative in (17.5) can be dropped, and (17.16) can be changed to be about the second derivative.

Definition 31. If the conditions above hold, then the asymptotic efficiency of the sequence δn is

AsympEffθ(δn) = g′(θ)² / ( I1(θ) σ²g(θ) ). (17.64)

If the asymptotic efficiency is 1, then the sequence is said to be asymptotically efficient.

A couple of immediate consequences follow, presuming the conditions hold.

1. The maximum likelihood estimator is asymptotically efficient, because σ²(θ) = 1/I1(θ).

2. If θn is an asymptotically efficient estimator of θ, then g(θn) is an asymptotically efficient estimator of g(θ) by the ∆-method.

Example. In the location family case, Mallard problem 3 of HW#3 found the asymptotic efficiencies (which do not depend on θ) of the mean and median in some cases:


f                    AsympEff(Mean)   AsympEff(Median)
Normal               1                0.6366
Logistic             0.9119           0.75
Cauchy               0                0.8106
Double Exponential   0.5              1
                                                         (17.65)

The assumptions actually do not hold for the Double Exponential density, but the asymptotic efficiency calculations are still valid. For these cases, the MLE is asymptotically efficient; in the Normal case the MLE is the mean, and in the Double Exponential case the MLE is the median. Also, although we do not prove it, the Pitman Estimator is asymptotically efficient, which should make sense because its variance is the lowest possible. It is a matter of showing that it is asymptotically normal.

Lehmann (1991)³ in Table 4.4 has more calculations for the asymptotic efficiencies of some trimmed means. Here are some of the values, where α is the trimming proportion (the mean is α = 0 and the median is α = 1/2):

f ↓ ; α →            0      1/8    1/4    3/8    1/2
Normal               1.00   0.94   0.84   0.74   0.64
Logistic             0.91   0.99   0.95   0.86   0.75
t5                   0.80   0.99   0.96   0.88   0.77
t3                   0.50   0.96   0.98   0.92   0.81
Cauchy               0.00   0.50   0.79   0.88   0.81
Double Exponential   0.50   0.70   0.82   0.91   1.00
                                                         (17.66)

This table can help you choose what trimming amount you would want to use, depending on what you think your f might be. You can see that between the mean and a small amount of trimming (1/8), the efficiencies of most distributions go up substantially, while the Normal's goes down only a small amount. With 25% trimming, all have at least a 79% efficiency.
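A rough Monte Carlo version of such comparisons can be run as below. This is only a sketch, with arbitrary sample size and replication count, comparing n times the variance of the mean and of the 25%-trimmed mean under two of the distributions in (17.66).

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(3)
n, reps = 100, 20_000

for name, sampler in [("Normal", rng.standard_normal),
                      ("Double Exponential", lambda size: rng.laplace(size=size))]:
    x = sampler(size=(reps, n))
    v_mean = n * np.var(x.mean(axis=1))                   # n * Var of the mean
    v_trim = n * np.var(trim_mean(x, 0.25, axis=1))       # n * Var of the 25%-trimmed mean
    # Under the normal the mean should have the smaller variance;
    # under the double exponential the trimmed mean should win.
    print(name, round(v_mean, 3), round(v_trim, 3))
```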

17.5 Multivariate parameters

The work so far assumed that θ was one-dimensional (although the data could be multidimensional). Everything follows for multidimensional parameters θ, with some extended definitions. Now assume that Θ ⊂ RK, and that Θ is open. The derivative of the loglikelihood is now K-dimensional,

∇ln(θ) = ∑_{i=1}^n ∇l(θ; xi), (17.67)

where

∇l(θ; x) = ( ∂ log( f (x | θ))/∂θ1, . . . , ∂ log( f (x | θ))/∂θK )′. (17.68)

3Theory of Point Estimation, Wadsworth. This is the first edition of the Lehmann-Casella book.


The MLE then satisfies the equations

∇ln(θn) = 0. (17.69)

The score function is also a K× 1 vector, in fact,

S(θ; x) = ∇l(θ; x), (17.70)

and the Fisher information in one observation is a K× K matrix,

I1(θ) = Covθ[S(θ; X)] (17.71)

      = −Eθ[ ( ∂² log( f (X | θ))/∂θj∂θk )_{j,k = 1, . . . , K} ], (17.72)

the negative expected value of the K × K matrix whose (j, k) element is the mixed second partial derivative ∂² log( f (X | θ))/∂θj∂θk.

I won't detail all the assumptions, but they are basically the same as before, except that they apply to all the partial and mixed partial derivatives. The equation (17.11),

0 < I1(θ) < ∞ for all θ ∈ Θ, (17.73)

means that I1(θ) is positive definite, and all its elements are finite. The two main results are next.

1. If θn is a consistent sequence of roots of the derivative of the loglikelihood, then

√n (θn − θ) →D N(0, I1(θ)−1). (17.74)

2. If δn is a consistent and asymptotically normal sequence of estimators of g(θ), where the partial derivatives of g are continuous, and

√n (δn − g(θ)) →D N(0, σ²g(θ)), (17.75)

then
σ²g(θ) ≥ Dg(θ)′ I1(θ)−1 Dg(θ) (17.76)

for all θ ∈ Θ (except possibly for a few), where Dg is the K × 1 vector of partial derivatives ∂g(θ)/∂θi as in the multivariate ∆-method in (16.116).

The lower bound in (17.76) is the variance in the asymptotic distribution of √n (g(θn) − g(θ)), where θn is the MLE of θ. Which is to say that the MLE of g(θ) is asymptotically efficient.

Example: Normal

Suppose X1, . . . , Xn are iid N(µ, σ²). The MLE of θ = (µ, σ²)′ is θ = (X̄n, S²n)′, where the S²n is divided by n. To find the information matrix, start with

l(µ, σ²; x) = −(1/2)(x − µ)²/σ² − (1/2) log(σ²). (17.77)


Then

S(µ, σ²; x) = ( (x − µ)/σ², (1/2)(x − µ)²/σ⁴ − 1/(2σ²) )′. (17.78)

(For fun you can check that E[S] = 02.) To illustrate, we'll use two methods for finding the information matrix. First,

I1(θ) = Covθ[S(µ, σ²; X)]

      = ( E[((X − µ)/σ²)²]         E[(1/2)(X − µ)³/σ⁶]
          E[(1/2)(X − µ)³/σ⁶]      Var[(1/2)(X − µ)²/σ⁴] )

      = ( E[Z²/σ²]           E[(1/2)Z³/σ³]
          E[(1/2)Z³/σ³]      Var[(1/2)Z²/σ²] )        (Z = (X − µ)/σ ∼ N(0, 1))

      = ( 1/σ²    0
          0       (1/2)(1/σ⁴) ).

Using the second derivatives,

I1(θ) = −E[ ( −1/σ²            −(X − µ)/σ⁴
              −(X − µ)/σ⁴      −(X − µ)²/σ⁶ + 1/(2σ⁴) ) ]

      = −( −1/σ²    0
            0       −σ²/σ⁶ + 1/(2σ⁴) )

      = ( 1/σ²    0
          0       (1/2)(1/σ⁴) ).

Good, they both gave the same answer! Fortunately, the information is easy to invert, hence

√n ( (X̄n, S²n)′ − (µ, σ²)′ ) →D N( 02, ( σ²   0
                                          0    2σ⁴ ) ), (17.79)

which is what we found in (16.126) using the ∆-method.
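The limit (17.79) can also be checked by simulation. The following minimal sketch, with arbitrary parameter values, estimates the covariance matrix of √n ((X̄n, S²n) − (µ, σ²)).

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2, n, reps = 1.0, 4.0, 400, 20_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1)                  # divisor n, the MLE of sigma^2

# Each row of z is sqrt(n) * ((xbar, s2) - (mu, sigma^2)); its covariance should be
# close to diag(sigma^2, 2 sigma^4) = diag(4, 32), as in (17.79).
z = np.sqrt(n) * np.column_stack([xbar - mu, s2 - sigma2])
print(np.cov(z, rowvar=False))
```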


Chapter 18

Hypothesis Testing

Estimation addresses the question, “What is θ?" Hypothesis testing addresses questions like, “Is θ = 0?" Confidence intervals do both: an interval gives a range of plausible values, and if you wonder whether θ = 0 is plausible, you just check whether 0 is in the interval. But hypothesis testing also addresses broader questions for which confidence intervals may be clumsy. Some types of questions for hypothesis testing:

• Is a particular drug more effective than a placebo?

• Are cancer and smoking related?

• Is the relationship between amount of fertilizer and yield linear?

• Is the distribution of income the same among men and women?

• In a regression setting, are the errors independent? Normal? Homoskedastic?

• Does God exist? 1

The main feature of hypothesis testing problems is that there are two competing models under consideration, the null hypothesis model and the alternative hypothesis model. The random variable (vector) X and space X are the same in both models, but the sets of distributions are different, being denoted P0 and PA for the null and alternative, respectively, where P0 ∩ PA = ∅. If P is the probability distribution for X, then the hypotheses are written

H0 : P ∈ P0 versus HA : P ∈ PA. (18.1)

Often both models will be parametric:

P0 = {Pθ | θ ∈ Θ0} and PA = {Pθ | θ ∈ ΘA}, with Θ0, ΘA ⊂ Θ, Θ0 ∩ ΘA = ∅, (18.2)

for some overall parameter space Θ. It is not unusual, but also not required, that ΘA = Θ − Θ0. In a parametric setting, the hypotheses are written

H0 : θ ∈ Θ0 versus HA : θ ∈ ΘA. (18.3)

1The chance is 67%, according to http://education.guardian.co.uk/higher/sciences/story/0,12243,1164894,00.html


Some examples:

1. Suppose nD people are given a drug, and nP are given a placebo, and in each group the number of people who improve is counted. The data could be modeled as XD ∼ Binomial(nD, pD) and XP ∼ Binomial(nP, pP), XD and XP independent. The question is whether the drug worked, hence we would test

H0 : pD = pP versus HA : pD > pP. (18.4)

More formally, these would be written as in (18.3) as

H0 : (pD, pP) ∈ {(uD, uP) | 0 < uD = uP < 1} versus (18.5)
HA : (pD, pP) ∈ {(uD, uP) | 0 < uP < uD < 1}. (18.6)

The overall parameter space is (0, 1) × (0, 1).

2. A simple linear regression model has Yi = α + βxi + Ei, i = 1, . . . , n, where the xi's are fixed and known, and the Ei's are iid. Often, it is assumed the errors are also N(0, σ²e). Is that assumption plausible? One can test

H0 : E1, . . . , En iid N(0, σ²e), σ²e > 0 versus HA : E1, . . . , En iid F, F ≠ N(0, σ²e). (18.7)

To write this more formally, we have the parameters (α, β, F), where F is the common distribution function of the Ei's. Then

Θ0 = R² × {N(0, σ²e) | σ²e > 0}, (18.8)

and
ΘA = R² × {all distribution functions F} − Θ0, (18.9)

so that we test

H0 : (α, β, F) ∈ Θ0 versus HA : (α, β, F) ∈ ΘA. (18.10)

The alternative model is semi-parametric in that part of the parameter, α and β, lies in a finite-dimensional space, and part, the F, lies in an infinite-dimensional space.

3. Another goodness-of-fit testing problem for regression asks whether the mean is linear. That is, the model is Yi = g(xi) + Ei, E1, . . . , En iid N(0, σ²e), and the null hypothesis is that g(x) is a linear function:

H0 : g(xi) = α + βxi, (α, β) ∈ R² versus
HA : g ∈ {h : R → R | h′′(x) is continuous} − {α + βx | (α, β) ∈ R²}. (18.11)

More formally, we have that

Yi ∼ N(g(xi), σ²e), i = 1, . . . , n,
Θ0 = {h | h(x) = α + βx, (α, β) ∈ R²} × (0, ∞),
ΘA = {h : R → R | h′′(x) is continuous} × (0, ∞) − Θ0,

and
H0 : (g, σ²e) ∈ Θ0 versus HA : (g, σ²e) ∈ ΘA. (18.12)

The alternative is again semi-parametric.


Mathematically, there is no particular reason to designate one of the hypotheses null and the other alternative. In practice, the null hypothesis tends to be the one that represents the status quo, or that nothing unusual is happening, or that everything is ok, or that the new isn't any better than the old, or that the defendant is innocent. The null also tends to be simpler (lower-dimensional), although that is not necessarily the case, e.g., one could test H0 : pD ≤ pP versus HA : pD > pP in (18.4).

There are several approaches to testing the hypotheses, e.g.:

1. Accept/Reject or fixed α or Neyman-Pearson. This approach is action-oriented: Based on the data x, you either accept or reject the null hypothesis. One drawback is that whatever decision you make, there is not an obvious way to assess the accuracy of your decision. (In estimation, you would hope to have an estimate and its standard error.) Often the size or level is cited as the measure of accuracy, which is the chance of a false positive, i.e., rejecting the null when the null is true. It is not a very good measure. It really gives the long-run chance of making an error if one kept doing hypothesis testing with the same α. That is, it refers to the infinitely many tests you haven't done yet, not the one you are analyzing.

2. Evidence against the null. Rather than make a decision, this approach produces some number measuring support, or lack of it, for the null based on x. Possible such measures include

a) p-value. Given the data x, the p-value is the chance, if the null hypothesis were true, that one would see X as extreme as, or more extreme than, x. The smaller, the more evidence against the null. The definition is slightly convoluted, and many believe that the p-value is the chance that the null hypothesis is true. Not so!

b) Bayes factor or likelihood ratio. When both hypotheses are simple, i.e., single points, say θ0 and θA, respectively, then, given x, the ratio of the likelihoods is

LR(x) = fθA(x) / fθ0(x). (18.13)

The larger the LR is, the more likely the alternative is relative to the null, hence it measures evidence against the null. When the hypotheses are not simple, the likelihoods are averaged before taking the ratio.

c) Probability the null is true. One needs a prior distribution on Θ0 ∪ ΘA, but given that, say π, it is straightforward to find the probability of the null:

P[H0 | X = x] = ∫_Θ0 f (x | θ)π(θ)dθ / ∫_{Θ0∪ΘA} f (x | θ)π(θ)dθ. (18.14)

This measure of evidence is the most direct and easily interpreted, but is quite sensitive to the prior distribution.

None of these approaches is entirely satisfactory, unless you really do have a prior you believe in. Notice that any of the evidentiary methods can be turned into an action, e.g., rejecting the null if the p-value is less than something, or the likelihood ratio is larger than something, or the probability of the null is less than something.

Now to the details of implementation.


18.1 Accept/Reject

There is a great deal of terminology associated with the accept/reject paradigm, but the basics are fairly simple. Start with a test statistic T(x), which is a function T : X → R that measures in some sense the difference between the data x and the null hypothesis. For example, suppose the data are X1, . . . , Xn iid N(µ, σ²), and we wish to test

H0 : µ = µ0, σ² > 0 versus HA : µ ≠ µ0, σ² > 0 (18.15)

for some fixed known µ0. The “σ² > 0" in both hypotheses is the technical way of saying, “σ² is unknown." The test statistic should measure the closeness of x̄ to µ0. For example, the usual two-sided t-statistic,

T(x) = |x̄ − µ0| / (s/√n). (18.16)

The absolute value is there because sample means far above and far below µ0 both favor the alternative. This T is not the only plausible test statistic. The sign test takes the point of view that if the null hypothesis were true, then about half the observations would be above µ0, and about half below. Thus if the proportion below µ0 is too far from 1/2, we would worry about the null. The two-sided sign test statistic is

S(x) = | #{xi < µ0}/n − 1/2 |. (18.17)
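For concreteness, here is a small sketch computing the two statistics (18.16) and (18.17) for a made-up sample and µ0 = 0; nothing about the data is from the text.

```python
import numpy as np

x = np.array([0.8, -0.3, 1.2, 0.5, -0.1, 0.9, 1.4, 0.2, -0.6, 0.7])
mu0, n = 0.0, len(x)

T = abs(x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))   # two-sided t statistic (18.16)
S = abs(np.mean(x < mu0) - 0.5)                          # two-sided sign statistic (18.17)
print(T, S)
```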

Next, choose a cutoff point c which represents how large the test statistic can be before rejecting the null hypothesis. That is, the test

    Rejects the null if T(x) > c
    Accepts the null if T(x) ≤ c. (18.18)

The bigger the c, the harder it is to reject the null.

In practice, one usually doesn't want to reject the null unless there is substantial evidence against it. The situation is similar to the courts in a criminal trial. The defendant is “presumed innocent until proven guilty." That is, the jury imagines the defendant is innocent, then considers the evidence, and only if the evidence piles up so that the jury believes the defendant is “guilty beyond reasonable doubt" does it convict. The accept/reject approach to hypothesis testing parallels the courts with the following connections:

Courts                           Testing
Defendant innocent               Null hypothesis true
Evidence                         Data
Defendant declared guilty        Reject the null
Defendant declared not guilty    Accept the null
                                                        (18.19)

Notice that the jury does not say that the defendant is innocent, but either guilty or not guilty. “Not guilty" is a way to say that there is not enough evidence to convict, not to say the jury is confident the defendant is innocent. Similarly, in hypothesis testing accepting the hypothesis does not mean one is confident it is true. Rather, it may be true, or just that there is not enough evidence to reject it.

“Reasonable doubt" is quantified by looking at the level of the test:


Definition 32. A hypothesis test (18.18) has level α if

Pθ[T(X) > c] ≤ α for all θ ∈ Θ0. (18.20)

That is, the chance of rejecting the null when it is true is no larger than α. (So the chance of convicting an innocent person is less than α.) Note that a test with level 0.05 also has level 0.10. A related concept is the size of a test, which is the smallest α for which it is level α.

It is good to have a small α, because that means one is unlikely to reject the null erroneously. Of course, you could take c = ∞, and never reject, so that α = 0, and you would never reject the null erroneously. But if the null were false, you'd never reject, either, which is bad. That is, there are two types of error to balance. These errors are called, rather colorlessly, Type I and Type II errors:

Truth ↓ \ Action    Accept H0       Reject H0
H0                  OK              Type I error
HA                  Type II error   OK
                                                   (18.21)

It would have been better if the terminology had been in line with medical and other usage, where a false positive is rejecting when the null is true, e.g., saying you have cancer when you don't, and a false negative, e.g., saying everything is ok when it is not.

Anyway, the probability of Type I error is bounded above by the level. Conventionally, one chooses the level α, then finds the cutoff point c to yield that level. Then the probability of Type II error is whatever it is. Traditionally, more emphasis is on the power than the probability of Type II error, where Powerθ = 1 − Pθ[Type II Error] when θ ∈ ΘA. Power is good.

Example. Consider the sign statistic in (18.17), where the cutoff point is c = 1/5, so the test rejects the null hypothesis if

S(x) = | #{xi < µ0}/n − 1/2 | > 1/5. (18.22)

To find the level and power, we need the distribution of S. Now

Y ≡ #{Xi < µ0} ∼ Binomial(n, P[Xi < µ0]) (18.23)

and
P[Xi < µ0] = P[N(0, 1) < (µ0 − µ)/σ] = Φ[(µ0 − µ)/σ], (18.24)

hence
Y ∼ Binomial(n, Φ[(µ0 − µ)/σ]). (18.25)

Suppose n = 25. For the level, we assume the null hypothesis in (18.15) is true, so that µ = µ0. Then Y ∼ Binomial(25, 1/2). Note that the distribution of Y, hence S, is the same for all σ²'s, as long as µ = µ0. Thus

α = P[ |Y/25 − 1/2| > 1/5 ] = P[ |Y − 12.5| > 5 ] = P[Y ≤ 7 or Y ≥ 18] = 2(0.02164) = 0.0433. (18.26)


The power depends on µ, µ0 and σ; actually, on the scaled difference between µ and µ0:

Y ∼ Binomial(n, Φ(∆)), ∆ = (µ0 − µ)/σ. (18.27)

The ∆ is a measure of how far away from the null the (µ, σ²) is. Some powers are calculated below, as well as the level again.

∆       P[Reject H0]
−1.5    0.9998    Power
−1.0    0.9655    Power
−0.5    0.4745    Power
 0.0    0.0433    Level
 0.5    0.4745    Power
 1.0    0.9655    Power
 1.5    0.9998    Power
                             (18.28)

Note that the power is symmetric in that ∆ and −∆ have the same power. Here is a plot, often called a power curve, although when ∆ = 0, it plots the level α:

[Figure: power curve of the sign test, P[Reject H0] plotted against ∆ (title “Power of Sign Test").]

Lowering the cutoff point c will increase the power (good) and level (bad), while increasing c will decrease power (bad) but also the level (good). So there is a tradeoff. 2
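The level and power calculations in (18.26) and (18.28) are easy to reproduce from the Binomial(25, Φ(∆)) distribution of Y; below is a minimal sketch using scipy.

```python
from scipy.stats import binom, norm

# P[Reject H0] for the sign test (18.22) with n = 25 and cutoff c = 1/5:
# reject when |Y - 12.5| > 5, i.e. Y <= 7 or Y >= 18, with Y ~ Binomial(25, Phi(Delta)).
n = 25
for delta in [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]:
    p = norm.cdf(delta)
    reject = binom.cdf(7, n, p) + binom.sf(17, n, p)   # P[Y <= 7] + P[Y >= 18]
    print(delta, round(reject, 4))                     # matches (18.28); 0.0433 at Delta = 0
```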

The next goal is to find a “good" test, that is, one with low level and high power. The Neyman-Pearson approach is to specify α and try to find the level α test with the best power. There may not be one best test for all θ ∈ ΘA, which makes the problem harder. It is analogous to trying to find the unbiased estimator with the lowest variance. The next section introduces test functions and randomized tests, both of which are mainly to make the theory flow better. In practice, tests as in (18.18) are used.


18.2 Test functions

Suppose in the sign test example (18.22), we want α = 0.05. The test we used, with c = 1/5, does have level α, since the probability of rejecting under the null, the size, is 0.0433, which is less than α. But we can obtain better power if we increase the size to 0.05. Thus we want to find c so that, from (18.26),

P[ |Y − 12.5| > 25c ] = 0.05, Y ∼ Binomial(25, 1/2). (18.29)

Let’s decrease c a bit. If c = 0.9/5, then

P[|Y− 12.5| > 25× 0.9/5] = P[Y ≤ 7 or Y ≥ 18] = 2(0.02164) = 0.0433, (18.30)

the same as for c = 1/5, but if we make c even a tiny bit smaller than 0.9/5, say c = 0.9/5 − ε for tiny ε > 0, then

P[ |Y − 12.5| > 25 × (0.9/5 − ε) ] = P[Y ≤ 8 or Y ≥ 17] = 2(0.05388) = 0.1078. (18.31)

But that is too large, which means we cannot choose a c to give exactly the α we want. Unless we use a randomized test, which means that if the test statistic equals the cutoff point, then we reject the null with a probability that may be strictly between 0 and 1. Thus the test (18.18) is generalized slightly to

the test
    Rejects the null if T(x) > c
    Rejects the null with probability γ(x) if T(x) = c
    Accepts the null if T(x) < c, (18.32)

where γ(x) is a function with 0 ≤ γ(x) ≤ 1.

For the sign test, we'll take c = 0.9/5, and let γ be a constant, so that

the test
    Rejects the null if |Y/n − 1/2| > 0.9/5
    Rejects the null with probability γ if |Y/n − 1/2| = 0.9/5
    Accepts the null if |Y/n − 1/2| < 0.9/5. (18.33)

Then

Pθ[The test rejects H0] = Pθ[ |Y/n − 1/2| > 0.9/5 ] + γ Pθ[ |Y/n − 1/2| = 0.9/5 ]
                        = Pθ[Y ≤ 7 or Y ≥ 18] + γ Pθ[Y = 8 or Y = 17]. (18.34)

To find γ, we do the calculations under the null, so that Y ∼ Binomial(25, 1/2):

PH0[The test rejects H0] = 0.0433 + γ 0.0645. (18.35)

Taking γ = 0 yields the size 0.0433 test, and taking γ = 1, we have the size 0.1078 test from (18.31). To get 0.05, solve

0.05 = 0.0433 + γ 0.0645 =⇒ γ = 0.1042. (18.36)


Which means that if the sign statistic is greater than 0.9/5, you reject, and if the sign statistic is equal to 0.9/5, you flip a biased coin with probability of heads being 0.1042, and reject if it comes up heads.

∆       P[Reject H0]   P[Reject H0]
        γ = 0          γ = 0.1042
−1.5    0.9998         0.9999
−1.0    0.9655         0.9679
−0.5    0.4745         0.4920
 0.0    0.0433         0.0500
 0.5    0.4745         0.4920
 1.0    0.9655         0.9679
 1.5    0.9998         0.9999
                                     (18.37)

So this randomized test is also level α, but has slightly better power.

Randomized tests are nice theoretically, but make sure you don't talk about them outside of this course. Especially to applied statisticians. They'll blow their top. Which makes sense, because after they do all the work of conducting an experiment to test whether their drug works, they don't want the FDA to flip a coin to make the decision.
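The following sketch solves for γ as in (18.36) and recomputes the power column of (18.37); the helper function name is mine, not the text's.

```python
from scipy.stats import binom, norm

n, alpha = 25, 0.05

def reject_prob(p, gamma):
    # Test (18.33): reject outright if Y <= 7 or Y >= 18, reject with prob. gamma if Y = 8 or 17.
    outright = binom.cdf(7, n, p) + binom.sf(17, n, p)
    boundary = binom.pmf(8, n, p) + binom.pmf(17, n, p)
    return outright + gamma * boundary

# Choose gamma so the size is exactly alpha, as in (18.36).
gamma = (alpha - reject_prob(0.5, 0.0)) / (reject_prob(0.5, 1.0) - reject_prob(0.5, 0.0))
print(round(gamma, 4))                                             # about 0.104

for delta in [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]:
    print(delta, round(reject_prob(norm.cdf(delta), gamma), 4))    # power column of (18.37)
```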

A convenient way to express a (possibly) randomized test is to use “1" to mean reject and “0" to mean accept, and keep the γ for the randomized part. That is, for test (18.32), define the function φ by

φ(x) = 1 if T(x) > c
       γ(x) if T(x) = c
       0 if T(x) < c. (18.38)

Now the power and level are just expected values of the φ because

Pθ [The test rejects H0] = Eθ [φ(X)]. (18.39)

Definition 33. A test function is a function φ : X → [0, 1], with the interpretation that the test rejects H0 with probability φ(x) if x is observed.

Now to find good tests.

18.3 Simple versus simple

We will start simple, where each hypothesis has exactly one distribution, so that we test

H0 : θ = θ0 versus HA : θ = θA, (18.40)

where θ0 and θA are two different parameter values.

For example, suppose X ∼ Binomial(4, p), and the hypotheses are

H0 : p = 1/2 versus HA : p = 3/5. (18.41)

Let α = 0.35, so that the objective is to find a test φ that

Maximizes E3/5[φ(X)] subject to E1/2[φ(X)] ≤ α = 0.35, (18.42)


that is, maximizes the power subject to being of level α. What should we look for? The table below gives the likelihoods under the null and alternative:

x    f1/2(x)   f3/5(x)
0    0.0625    0.0256
1    0.2500    0.1536
2    0.3750    0.3456
3    0.2500    0.3456
4    0.0625    0.1296
                          (18.43)

One could focus on the null hypothesis. If the null is true, then it is more likely that X = 2, or 1 or 3, than 0 or 4. Consider rejecting the null when f1/2(x) is small. We would reject when X = 0 or 4, but that gives probability of rejection of only 0.125. To get up to 0.35, we would need to reject at least part of the time if X = 1 or 3. The test is then

φ1(x) = 1 if x = 0, 4
        0.45 if x = 1, 3
        0 if x = 2. (18.44)

The γ = 0.45 gives the level α. Thus

E1/2[φ1(X)] = 0.125 + 0.45 × 0.5 = 0.35 and
Power = E3/5[φ1(X)] = 0.0256 + 0.1296 + 0.45 × (0.1536 + 0.3456) = 0.3798. (18.45)

Alternatively, one could look at the alternative density, and reject the null if it is likely that the alternative is true. The values X = 2 and 3 are most likely, and add to more than 0.35, so one could contemplate the test

φ2(x) = 0.56 if x = 2, 3
        0 if x = 0, 1, 4. (18.46)

Then

E1/2[φ2(X)] = 0.35 and

Power = E3/5[φ2(X)] = 0.3871. (18.47)

This second test has a slightly better power than the first, 0.3871 to 0.3798, but neither is much higher than the level α = 0.35. Is that the best we can do?

Bang for the buck. Imagine you have some bookshelves you wish to fill up as cheaply as possible, e.g., to use as props in a play. You do not care about the quality of the books, just their widths (in inches) and prices (in dollars). You have $3.50, and five books to choose from:

Book   Cost    Width
0      0.625   0.256
1      2.50    1.536
2      3.75    3.456
3      2.50    3.456
4      0.625   1.296
                         (18.48)

You are allowed to split the books lengthwise, and pay proportionately. You want to maximize the total number of inches for your $3.50. Then, for example, book 4 is a better deal than book 0, because they cost the same but book 4 is wider. Also, book 3 is more attractive than book 2, because they are the same width but book 3 is cheaper. Which is better between books 3 and 4? Book 4 is cheaper by the inch: It costs about 48¢ per inch, while book 3 is about 73¢ per inch. This suggests the strategy should be to buy the books that give you the most inches per dollar:

Book   Cost    Width   Inches/Dollar
0      0.625   0.256   0.4096
1      2.50    1.536   0.6144
2      3.75    3.456   0.9216
3      2.50    3.456   1.3824
4      0.625   1.296   2.0736
                                   (18.49)

Let us definitely buy book 4, and book 3. That costs us $3.125, and gives us 1.296 + 3.456 = 4.752 inches. We still have 37.5¢ left, with which we can buy a tenth of book 2, giving us another 0.3456 inches, totaling 5.0976 inches. 2

Returning to the hypothesis testing problem, we can think of having α to spend, and we wish to spend it where we get the most “bang for the buck." Here, bang is power. The key is to look at the likelihood ratio of the densities:

LR(x) = f3/5(x) / f1/2(x), (18.50)

i.e.,

x    f1/2(x)   f3/5(x)   LR(x)
0    0.0625    0.0256    0.4096
1    0.2500    0.1536    0.6144
2    0.3750    0.3456    0.9216
3    0.2500    0.3456    1.3824
4    0.0625    0.1296    2.0736
                                    (18.51)

If LR(x) is large, then the alternative is much more likely than the null is. If LR(x) is small, the null is more likely. One uses the likelihood ratio as the statistic, and finds the right cutoff point. The likelihood ratio test is then

φLR(x) = 1 if LR(x) > c
         γ if LR(x) = c
         0 if LR(x) < c. (18.52)

Looking at the table (18.51), we see that taking c = LR(2) works, because we reject if x = 3 or 4, and use up only 0.3125 of our α. Then the rest we put on x = 2, with γ = 1/10, since we have 0.35 − 0.3125 = 0.0375 left and f1/2(2) = 0.375. Thus the test is

φLR(x) = { 1 if LR(x) > 0.9216; 0.1 if LR(x) = 0.9216; 0 if LR(x) < 0.9216 }
       = { 1 if x > 2; 0.1 if x = 2; 0 if x < 2 }. (18.53)

The last expression is easier to deal with, and valid since LR(x) is a strictly increasing function of x. Then the power and level are

E1/2[φLR(X)] = 0.35 and

Power = E3/5[φLR(X)] = 0.1296 + 0.3456 + 0.1× 0.3456 = 0.50976. (18.54)


This is the same as for the books: The power is identified with the number of inches. Also, the likelihood ratio test is much better than φ1 (18.44) and φ2 (18.46), with powers around 0.38. Is it the best? Yes, that is the Neyman-Pearson Lemma.
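The “bang for the buck" construction can be coded directly: sort the sample points by LR(x), reject outright until α is spent, and randomize at the boundary point. This is only a sketch for the Binomial(4, p) problem of (18.41)–(18.54).

```python
from scipy.stats import binom

alpha = 0.35
p0, pA, n = 0.5, 0.6, 4

xs = list(range(n + 1))
f0 = {x: binom.pmf(x, n, p0) for x in xs}
fA = {x: binom.pmf(x, n, pA) for x in xs}
lr = {x: fA[x] / f0[x] for x in xs}

# Reject outright at the largest LR values until alpha is used up,
# then randomize at the boundary value of x.
phi, spent, power = {}, 0.0, 0.0
for x in sorted(xs, key=lambda x: lr[x], reverse=True):
    if spent + f0[x] <= alpha:
        phi[x] = 1.0
        spent += f0[x]
    else:
        phi[x] = (alpha - spent) / f0[x]   # gamma at the boundary (0 once alpha is spent)
        spent = alpha
    power += phi[x] * fA[x]

print(phi)     # rejects at x = 3, 4 and with probability 0.1 at x = 2, as in (18.53)
print(power)   # about 0.50976, as in (18.54)
```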

18.4 The Neyman-Pearson Lemma

Let X be the random variable or vector with density f, and f0 and fA be two possible densities for X. We are interested in testing

H0 : f = f0 versus HA : f = fA. (18.55)

For given α ∈ [0, 1], we wish to find a test function φ that

Maximizes EA[φ(X)] subject to E0[φ(X)] ≤ α. (18.56)

A test function ψ has Neyman-Pearson form if for some constant c ∈ [0, ∞] and function γ(x) ∈ [0, 1],

ψ(x) = 1 if LR(x) > c
       γ(x) if LR(x) = c
       0 if LR(x) < c, (18.57)

with the caveat that
if c = ∞ then γ(x) = 1 for all x. (18.58)

Here,

LR(x) = fA(x)/ f0(x) ∈ [0, ∞] (18.59)

is defined unless fA(x) = f0(x) = 0, in which case ψ is unspecified. Notice that LR and c are allowed to take on the value ∞.

Lemma 32. Neyman-Pearson Lemma. Any test ψ of Neyman-Pearson form (18.57, 18.58) for which E0[ψ(X)] = α satisfies conditions (18.56).

So basically, the likelihood ratio test is best. One can take the γ(x) to be a constant, but sometimes it is convenient to have it depend on x. Before getting to the proof, consider some special cases, of mainly theoretical interest.

• α = 0. If there is no chance of rejecting when the null is true, then one must always accept if f0(x) > 0, and it always makes sense to reject when f0(x) = 0. Such actions invoke the caveat (18.58), that is, when fA(x) > 0,

ψ(x) = { 1 if LR(x) = ∞; 0 if LR(x) < ∞ } = { 1 if f0(x) = 0; 0 if f0(x) > 0 }. (18.60)

• α = 1. This one is silly from a practical point of view, but if you do not care about rejecting when the null is true, then you should always reject, i.e., take φ(x) = 1.

• Power = 1. If you want to be sure to reject if the alternative is true, then φ(x) = 1 when fA(x) > 0, so take the test (18.57) with c = 0. Of course, you may not be able to achieve your desired α.


Proof of Lemma 32. If α = 0, then the above discussion shows that taking c = ∞, γ(x) = 1 (18.58) is best. For α ∈ (0, 1], suppose ψ satisfies (18.57) for some c and γ(x) with E0[ψ(X)] = α, and φ is any other test function with E0[φ(X)] ≤ α. Look at

EA[ψ(X) − φ(X)] − c E0[ψ(X) − φ(X)] = ∫_X (ψ(x) − φ(x)) fA(x)dx − c ∫_X (ψ(x) − φ(x)) f0(x)dx
    = ∫_X (ψ(x) − φ(x)) LR(x) f0(x)dx − c ∫_X (ψ(x) − φ(x)) f0(x)dx
    = ∫_X (ψ(x) − φ(x))(LR(x) − c) f0(x)dx
    ≥ 0. (18.61)

The final inequality holds because ψ = 1 if LR − c > 0, and ψ = 0 if LR − c < 0, so that the final integrand is always nonnegative. Thus

EA[ψ(X) − φ(X)] ≥ c E0[ψ(X) − φ(X)] ≥ 0, (18.62)

because E0[ψ(X)] = α ≥ E0[φ(X)]. Thus EA[ψ(X)] ≥ EA[φ(X)], i.e., any other level α test has lower or equal power. 2

There are a couple of addenda to the lemma which we will not prove here. First, for any α, there is a test of Neyman-Pearson form. Second, if the φ in the proof is not essentially of Neyman-Pearson form, then the power of ψ is strictly better than that of φ.

18.4.1 Examples

If f0(x) > 0 and fA(x) > 0 for all x ∈ X, then it is straightforward (though maybe not easy) to find the Neyman-Pearson test. It can get tricky if one or the other density is 0 at times.

Normal means. Suppose µ0 and µA are fixed, µA > µ0, and X ∼ N(µ, 1). We wish to test

H0 : µ = µ0 versus HA : µ = µA (18.63)

with α = 0.05. Here,

LR(x) = e^{−(1/2)(x − µA)²} / e^{−(1/2)(x − µ0)²} = e^{x(µA − µ0) − (1/2)(µA² − µ0²)}. (18.64)

Because µA > µ0, LR(x) is strictly increasing in x, so LR(x) > c is equivalent to x > c* for some c*. For level 0.05, we know that c* = 1.645 + µ0, so the test must reject when LR(x) > LR(c*), i.e.,

ψ(x) = 1 if e^{x(µA − µ0) − (1/2)(µA² − µ0²)} > c
       0 if e^{x(µA − µ0) − (1/2)(µA² − µ0²)} ≤ c,      c = e^{(1.645 + µ0)(µA − µ0) − (1/2)(µA² − µ0²)}. (18.65)

(18.65)

Page 227: Mathematical  Statistics

18.4. The Neyman-Pearson Lemma 219

We have taken the γ = 0; the probability LR(X) = c is 0, so it doesn’t matter whathappens then. That expression is unnecessarily complicated. In fact, to find c wealready simplified the test, that is

LR(x) > c i f f x− µ0 > 1.645, (18.66)

hence

ψ(x) =

1 i f x− µ0 > 1.6450 i f x− µ0 ≤ 1.645

. (18.67)

That is, we really do not care about c, as long as we have the ψ.

Double Exponential versus Normal. Now suppose f0 is the Double Exponential pdfand fA is the N(0, 1) pdf, and α = 0.1. Then

LR(x) =

1√2π

e−12 x2

12 e−|x|

=

√2

πe|x|−

12 x2

. (18.68)

Now LR(x) > c if and only if

|x| − (1/2)x² > c* = log(c) + (1/2) log(π/2) (18.69)

if and only if (completing the square)

(|x| − 1)² < c** = 1 − 2c* ⇐⇒ ||x| − 1| < c*** = √c**. (18.70)

We need to find the constant c*** so that

P0[ ||X| − 1| < c*** ] = 0.10, X ∼ Double Exponential. (18.71)

For a smallish c***, using the Double Exponential pdf,

P0[ ||X| − 1| < c*** ] = P0[ −1 − c*** < X < −1 + c*** or 1 − c*** < X < 1 + c*** ]
                       = 2 P0[ −1 − c*** < X < −1 + c*** ]
                       = e^{−(1 − c***)} − e^{−(1 + c***)}. (18.72)

Setting that equal to 0.10, we find c*** = 0.1355. The picture shows a horizontal line at 0.1355. The rejection region consists of the x's for which the graph of ||x| − 1| is below the line.


[Figure: graph of ||x| − 1| against x, with a horizontal line at 0.1355; the rejection region is where the curve lies below the line.]

The power is

PA[ ||X| − 1| < 0.1355 ] = PA[ −1 − 0.1355 < X < −1 + 0.1355 or 1 − 0.1355 < X < 1 + 0.1355 ]
                         = 2(Φ(1.1355) − Φ(0.8645))
                         = 0.1311. (18.73)

Not very powerful, but at least it is larger than α! Of course, it is not surprising that it is hard to distinguish the normal from the double exponential with just one observation.
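The constant c*** and the power (18.73) can be recomputed numerically. Here is a short sketch using a root finder; it is only intended to reproduce the two numbers above.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Solve e^{-(1-c)} - e^{-(1+c)} = 0.10 for c = c*** as in (18.72),
# then compute the power (18.73) under the N(0,1) alternative.
g = lambda c: np.exp(-(1 - c)) - np.exp(-(1 + c)) - 0.10
c = brentq(g, 0.0, 1.0)
power = 2 * (norm.cdf(1 + c) - norm.cdf(1 - c))
print(round(c, 4), round(power, 4))   # about 0.1355 and 0.1311
```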

Uniform versus Uniform I. Suppose X ∼ Uniform(0, θ), and the question is whether θ = 1 or 2. We could then test

H0 : θ = 1 versus HA : θ = 2. (18.74)

The likelihood ratio is

LR(x) = fA(x)/ f0(x) = (1/2) I[0 < x < 2] / I[0 < x < 1] = { 1/2 if 0 < x < 1; ∞ if 1 ≤ x < 2 }. (18.75)

No matter what, you would reject the null if 1 ≤ x < 2, because it is impossible to observe an x in that region under the Uniform(0, 1).

First try α = 0. Usually, that would mean never reject, so power would be 0 as well, but here it is not that bad. We invoke (18.58), that is, take c = ∞ and γ(x) = 1:

ψ(x) = { 1 if LR(x) = ∞; 0 if LR(x) < ∞ } = { 1 if 1 ≤ x < 2; 0 if 0 < x < 1 }. (18.76)


Then

α = P[1 ≤ U(0, 1) < 2] = 0 and Power = P[1 ≤ U(0, 2) < 2] = 1/2. (18.77)

What if α = 0.1? Then the Neyman-Pearson test would take c = 1/2:

ψ(x) = { 1 if LR(x) > 1/2; γ(x) if LR(x) = 1/2; 0 if LR(x) < 1/2 } = { 1 if 1 ≤ x < 2; γ(x) if 0 < x < 1 }, (18.78)

because LR cannot be less than 1/2. Notice that

E0[ψ(X)] = E0[γ(X)] = ∫_0^1 γ(x)dx, (18.79)

so that any γ that integrates to α works. Some examples:

γ(x) = 0.1

γ(x) = I[0 < x < 0.1]

γ(x) = I[0.9 < x < 1]

γ(x) = 0.2 x. (18.80)

No matter which you choose, the power is the same:

Power = EA[ψ(X)] = (1/2) ∫_0^1 γ(x)dx + (1/2) ∫_1^2 dx = (1/2)α + 1/2 = 0.55. (18.81)

Uniform versus Uniform II. Now switch the null and alternative in (18.74), keeping X ∼ Uniform(0, θ):

H0 : θ = 2 versus HA : θ = 1. (18.82)

Then the likelihood ratio is

LR(x) = fA(x)/ f0(x) = I[0 < x < 1] / ( (1/2) I[0 < x < 2] ) = { 2 if 0 < x < 1; 0 if 1 ≤ x < 2 }. (18.83)

For level α = 0.1, we have to take c = 2:

ψ(x) = { γ(x) if LR(x) = 2; 0 if LR(x) < 2 } = { γ(x) if 0 < x < 1; 0 if 1 ≤ x < 2 }. (18.84)

Then any γ(x) with

(1/2) ∫_0^1 γ(x)dx = α =⇒ ∫_0^1 γ(x)dx = 0.2 (18.85)

will work. And that is the power,

Power = ∫_0^1 γ(x)dx = 0.2. (18.86)


18.4.2 Bayesian approach

The Neyman-Pearson approach is all about action: Either accept the null or reject it. As in the courts, it does not suggest the degree to which the null is plausible or not. By contrast, the Bayes approach produces the probabilities the null and alternative are true, given the data and a prior distribution. That is, we start with the prior:

P[H0] (= P[H0 is true]) = π0, P[HA] = πA, (18.87)

where π0 + πA = 1. Where do these probabilities come from? Presumably, from a reasoned consideration of all that is known prior to seeing the data. Or, one may try to be fair and take π0 = πA = 1/2. The densities are then the conditional distributions of X given the hypotheses are true:

X | H0 ∼ f0 and X | HA ∼ fA. (18.88)

Bayes Theorem gives the posterior probabilities:

P[H0 | X = x] = π0 f0(x) / (π0 f0(x) + πA fA(x)), P[HA | X = x] = πA fA(x) / (π0 f0(x) + πA fA(x)). (18.89)

Dividing numerators and denominators by f0(x) (assuming it is not 0), we have

P[H0 | X = x] = π0 / (π0 + πA LR(x)), P[HA | X = x] = πA LR(x) / (π0 + πA LR(x)), (18.90)

showing that these posteriors depend on the data only through the likelihood ratio. That is, the posterior does not violate the likelihood principle. Hypothesis tests do violate the likelihood principle: Whether they reject depends on the c, which is calculated from f0, not the likelihood.

Odds are actually more convenient here, where the odds of an event A are

Odds[A] = P[A] / (1 − P[A]), hence P[A] = Odds[A] / (1 + Odds[A]). (18.91)

The prior odds in favor of HA are then πA/π0, and the posterior odds are

Odds[HA | X = x] = P[HA | X = x] / P[H0 | X = x] = (πA/π0) LR(x) = Odds[HA] × LR(x). (18.92)

That is,
Posterior odds = (Prior odds) × (Likelihood ratio), (18.93)

which neatly separates the contribution to the posterior of the prior and the data. If a decision is needed, one would choose a cutoff point k, say, and reject the null if the posterior odds exceed k, and maybe randomize if they equal k. But this test is the same as the Neyman-Pearson test (18.57) with c = kπ0/πA. The difference is that k would not be chosen to achieve a certain level, but rather on decision-theoretic grounds, that is, depending on the relative costs of making a Type I or Type II error.


For example, take the example in (18.41), where X ∼ Binomial(4, p), and the hypotheses are

H0 : p = 1/2 versus HA : p = 3/5. (18.94)

If the prior odds are even, i.e., π0 = πA = 1/2, so that the prior odds are 1, then the posterior odds are equal to the likelihood ratio, giving the following:

x    Odds[HA | X = x] = LR(x)    P[HA | X = x]
0    0.4096                      0.2906
1    0.6144                      0.3806
2    0.9216                      0.4796
3    1.3824                      0.5803
4    2.0736                      0.6746
                                                (18.95)

Thus if you see X = 2 heads, the posterior probability that p = 3/5 is about 48%, and if X = 4, it is about 67%. None of these probabilities is overwhelmingly in favor of one hypothesis or the other.
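Table (18.95) comes directly from (18.90) with even prior odds; the following few lines reproduce it.

```python
from scipy.stats import binom

# Posterior odds and probabilities (18.90)/(18.95) for X ~ Binomial(4, p),
# H0: p = 1/2 vs HA: p = 3/5, with even prior odds pi_0 = pi_A = 1/2.
for x in range(5):
    lr = binom.pmf(x, 4, 0.6) / binom.pmf(x, 4, 0.5)
    post_odds = 1.0 * lr                        # prior odds (= 1) times LR, as in (18.92)
    print(x, round(lr, 4), round(post_odds / (1 + post_odds), 4))
```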

18.5 Uniformly most powerful tests

Simple versus simple is too simple. Some testing problems are not so simple, and yet do have a best test. Here is the formal definition.

Definition 34. The test function ψ is a uniformly most powerful (UMP) level α test for testing H0 : θ ∈ Θ0 versus HA : θ ∈ ΘA if it is level α and

Eθ [ψ(X)] ≥ Eθ [φ(X)] for all θ ∈ ΘA (18.96)

for any other level α test φ.

Often, one-sided tests do have a UMP test, while two-sided tests do not. For example, suppose X ∼ N(µ, 1), and we test whether µ = 0. In a one-sided testing problem, the alternative is one of µ > 0 or µ < 0, say

H0 : µ = 0 versus H(1)A : µ > 0. (18.97)

The corresponding two-sided testing problem is

H0 : µ = 0 versus H(2)A : µ ≠ 0. (18.98)

Level α = 0.05 tests for these are, respectively,

φ(1)(x) = { 1 if x > 1.645; 0 if x ≤ 1.645 } and φ(2)(x) = { 1 if |x| > 1.96; 0 if |x| ≤ 1.96 }. (18.99)

Their powers are

Eµ[φ(1)(X)] = P[N(µ, 1) > 1.645] = Φ(µ− 1.645) and

Eµ[φ(2)(X)] = P[|N(µ, 1)| > 1.96] = Φ(µ− 1.96) + Φ(−µ− 1.96). (18.100)

See the graph:


[Figure: power functions Eµ[φ(1)(X)] and Eµ[φ(2)(X)] plotted against µ, with curves labeled Phi1 and Phi2.]

For the one-sided problem, the power is good for µ > 0, but bad (below α) for µ < 0. But the alternative is just µ > 0, so it does not matter what φ(1) does when µ < 0. For the two-sided test, the power is fairly good on both sides of µ = 0, but it is not quite as good as the one-sided test when µ > 0. The other line in the graph is the one-sided test for alternative µ < 0, which mirrors φ(1), rejecting when x < −1.645.

The following are true, and to be proved:

• For the one-sided problem (18.97), the test φ(1) is the UMP level α = 0.05 test.

• For the two-sided problem (18.98), there is no UMP level α test. Test φ(1) is better on one side (µ > 0), and the other one-sided test is better on the other side. None of the three tests is always best.

We start with the null being simple and the alternative being composite (which just means not simple). The way to prove a test is UMP level α is to show that it is level α, and that it is of Neyman-Pearson form for each simple versus simple subproblem derived from the big problem. That is, suppose we are testing

H0 : θ = θ0 versus HA : θ ∈ ΘA. (18.101)

A simple versus simple subproblem takes a specific value from the alternative, so that for a given θA ∈ ΘA, we consider

H0 : θ = θ0 versus H(θA)A : θ = θA. (18.102)

Theorem 14. Suppose that for testing problem (18.101), ψ satisfies

Eθ0[ψ(X)] = α (18.103)


and that for each θA ∈ ΘA,

ψ(x) = 1 if LR(x; θA) > c(θA)
       γ(x) if LR(x; θA) = c(θA)
       0 if LR(x; θA) < c(θA),      for some constant c(θA), (18.104)

where

LR(x; θA) = fθA(x) / fθ0(x). (18.105)

Then ψ is a UMP level α test for (18.101).

Proof. Suppose φ is another level α test. Then for the subproblem (18.102), ψ has at least as high power, i.e.,

EθA[ψ(X)] ≥ EθA[φ(X)]. (18.106)

But that inequality is true for any θA ∈ ΘA, hence ψ is UMP level α. 2

The difficulty is to find a test ψ which is Neyman-Pearson for all θA. Consider the example with X ∼ N(µ, 1) and hypotheses

H0 : µ = 0 versus H(1)A : µ > 0, (18.107)

and take α = 0.05. For fixed µA > 0, the Neyman-Pearson test is

ψ(x) = { 1 if LR(x; µA) > c(µA); 0 if LR(x; µA) ≤ c(µA) }
     = { 1 if e^{−(1/2)((x − µA)² − x²)} > c(µA); 0 if e^{−(1/2)((x − µA)² − x²)} ≤ c(µA) }
     = { 1 if xµA > log(c(µA)) + (1/2)µA²; 0 if xµA ≤ log(c(µA)) + (1/2)µA² }
     = { 1 if x > (log(c(µA)) + (1/2)µA²)/µA; 0 if x ≤ (log(c(µA)) + (1/2)µA²)/µA }, (18.108)

where the last step is valid because we know µA > 0. That messy constant is chosen so that the level is 0.05, which we know must be

(log(c(µA)) + (1/2)µA²)/µA = 1.645. (18.109)

The key point is that (18.109) is true for any µA > 0, that is,

ψ(x) = { 1 if x > 1.645; 0 if x ≤ 1.645 } (18.110)

is true for any µA. Thus ψ is indeed UMP. Note that the constant c(µA) is different for each µA, but the test ψ is the same.

Why is there no UMP test for the two-sided problem,

H0 : µ = 0 versus H(2)A : µ ≠ 0? (18.111)


The best test at alternative µA > 0 is (18.110), but the best test at alternative µA < 0 is

ψ(x) = { 1 if LR(x; µA) > c(µA); 0 if LR(x; µA) ≤ c(µA) }
     = { 1 if xµA > log(c(µA)) + (1/2)µA²; 0 if xµA ≤ log(c(µA)) + (1/2)µA² }
     = { 1 if x < (log(c(µA)) + (1/2)µA²)/µA; 0 if x ≥ (log(c(µA)) + (1/2)µA²)/µA }
     = { 1 if x < −1.645; 0 if x ≥ −1.645 }, (18.112)

which is different from test (18.110). That is, there is no test that is best at both positive and negative values of µA, so there is no UMP test.

18.5.1 One-sided exponential family testing problems

The normal example above can be extended to general exponential families. The key to the existence of a UMP test is that LR(x; θA) is increasing in some function of the data no matter what the alternative. That is, suppose X1, . . . , Xn are iid with a one-dimensional exponential family density

f (x | θ) = a(x) eθΣT(xi)−nψ(θ). (18.113)

A one-sided testing problem is

H0 : θ = θ0 versus HA : θ > θ0. (18.114)

Then for fixed alternative θA > θ0,

LR(x; θA) = f (x | θA) / f (x | θ0) = e^{(θA − θ0) Σ T(xi) − n(ψ(θA) − ψ(θ0))}. (18.115)

Then the best test at the alternative θA is

ψ(x) = { 1 if LR(x; θA) > c(θA); γ if LR(x; θA) = c(θA); 0 if LR(x; θA) < c(θA) }
     = { 1 if (θA − θ0) ∑ T(xi) > log(c(θA)) + n(ψ(θA) − ψ(θ0)); γ if equal; 0 if less }
     = { 1 if ∑ T(xi) > (log(c(θA)) + n(ψ(θA) − ψ(θ0)))/(θA − θ0); γ if equal; 0 if less }
     = { 1 if ∑ T(xi) > c; γ if ∑ T(xi) = c; 0 if ∑ T(xi) < c }. (18.116)

Then c and γ are chosen to give the right level, but they are the same for any alternative θA > θ0. Thus the test (18.116) is UMP level α.

If the alternative were θ < θ0, then the same reasoning would work, but the inequalities would switch. For a two-sided alternative, there would not be a UMP test.


18.5.2 Monotone likelihood ratio

A generalization of exponential families that still guarantees UMP tests is the class of families with monotone likelihood ratio. Non-exponential-family examples include the noncentral χ² and F distributions.

Definition 35. A family of densities f (x | θ), θ ∈ Θ ⊂ R, has monotone likelihood ratio (MLR) with respect to parameter θ and statistic T(x) if for any θ′ < θ,

f (x | θ) / f (x | θ′) (18.117)

is a function of just T(x), and is nondecreasing in T(x). If the ratio is strictly increasing in T(x), then the family has strict monotone likelihood ratio.

Note in particular that this T(x) is a sufficient statistic. It is fairly easy to see that one-dimensional exponential families have MLR. The general idea of MLR is that in some sense, as θ gets bigger, T(X) gets bigger. Two lemmas follow, without proofs.

Lemma 33. If the family f (x | θ) has MLR with respect to θ and T(x), then for any nondecreasing function g(w),

Eθ [g(T(X))] is nondecreasing in θ. (18.118)

If the family has strict MLR, and g is strictly increasing, then the expected value in (18.118) is strictly increasing.

Lemma 34. Suppose the family f (x | θ) has MLR with respect to θ and T(x), and we are testing

H0 : θ = θ0 versus HA : θ > θ0 (18.119)

for some level α. Then the test

ψ(x) = 1 if ∑ T(xi) > c
       γ if ∑ T(xi) = c
       0 if ∑ T(xi) < c, (18.120)

where c and γ are chosen to achieve level α, is UMP level α.

In the situation in Lemma 34, the power function Eθ[ψ(X)] is increasing because of Lemma 33, since ψ is a nondecreasing function of ∑ T(xi). In fact, MLR also can be used to show that the test (18.120) is UMP level α for testing

H0 : θ ≤ θ0 versus HA : θ > θ0 (18.121)

18.6 Some additional criteria

It is quite special to actually have a UMP test. Similar to the estimation case, where we may restrict to unbiased estimators or equivariant estimators, in testing we may wish to find UMP tests among those satisfying certain other criteria. Here are a few, pertaining to the general testing problem

H0 : θ ∈ Θ0 versus HA : θ ∈ ΘA. (18.122)


1. Unbiased. A test ψ is unbiased if

EθA[ψ(X)] ≥ Eθ0[ψ(X)] for any θ0 ∈ Θ0 and θA ∈ ΘA. (18.123)

This means that you are more likely to reject when you should than when you shouldn't. Note that in the two-sided Normal mean problem (18.98), the one-sided tests such as φ(1) in (18.99) are not unbiased, because their power falls below α on one side of 0. The two-sided test φ(2) in (18.99) is unbiased, and is in fact the UMP unbiased level α = 0.05 test.

2. Similar. A test ψ is similar if its rejection probability equals the level for all null values, i.e.,

Eθ[ψ(X)] = α for all θ ∈ Θ0. (18.124)

For example, the t-test in (18.16) and the sign test (18.17) are both similar for

testing µ = µ0 in the Normal case where σ2 is unknown, as in (18.15). That

is because the distributions of those two statistics do not depend on σ2 whenµ = µ0. The t-test turns out to be the UMP similar level α test.

3. Invariant. A testing problem is invariant under a group G if both the nulland alternative models are invariant under the group. A test ψ is invariant ifψ(hg(x)) = ψ(x) for all g ∈ G , where hg is the group’s action on X . (RecallChapter 13.5.)


Chapter 19

Likelihood-based testing

The previous chapter concentrates on fairly simple testing situations, primarily one-dimensional. When the parameter spaces are more complex, it is difficult if not impossible to find optimal procedures. But many good procedures still are based on the likelihood ratio. For the general testing problem

H0 : θ ∈ Θ0 versus HA : θ ∈ ΘA, (19.1)

consider the likelihood ratio between two fixed parameters,

f (x | θA) / f (x | θ0)   for some θ0 ∈ Θ0 and θA ∈ ΘA. (19.2)

How to extend this ratio to all θ’s? Here are three popular approaches:

• Score: Differentiate. If the null hypothesis is simple, then the ratio, or actually its log, can be expanded about θA = θ0:

log( f (x | θA) / f (x | θ0) ) = l(θA; x) − l(θ0; x) ≈ (θA − θ0)′ S(θ0; x). (19.3)

That test statistic is approximately the best statistic for alternative θA when θA is very close to θ0.

• Maximize the likelihoods. This approach picks the best value of θ from each model. That is, θ̂A is the MLE from the alternative model, and θ̂0 is the MLE from the null model. Then the test statistic is

2 log( sup_{θA∈ΘA} f (x | θA) / sup_{θ0∈Θ0} f (x | θ0) ) = 2 log( f (x | θ̂A) / f (x | θ̂0) ) = 2 (l(θ̂A; x) − l(θ̂0; x)). (19.4)

The "2" is just a detail to make the null distribution easier to approximate by a χ².

• Average: The Bayes factor. Here, one has probability densities ρA over ΘA and ρ0 over Θ0, and looks at the ratio of the averages of the likelihoods:

Bayes Factor = ∫_{ΘA} f (x | θA) ρA(θA) dθA / ∫_{Θ0} f (x | θ0) ρ0(θ0) dθ0. (19.5)

This can take the place of LR in (18.93):

Posterior odds = (Prior odds)× (Bayes Factor). (19.6)

We will look at the first two approaches in more detail in the next two sections. One nice property of these test statistics is that, under certain conditions, it is easy to approximate the null distributions if n is large.

19.1 Score tests

19.1.1 One-sided

Suppose X1, . . . , Xn are iid with density f (xi | θ), where θ ∈ R. We start with the simple versus one-sided testing problem,

H0 : θ = θ0 versus HA : θ > θ0. (19.7)

For fixed θA > θ0, the test which rejects the null when (θA − θ0) Sn(θ0; x) > c is then the same as that which rejects when Sn(θ0; x) > c*. (Sn is the score for all n observations, and Sn(θ; x) = ∑_{i=1}^n S1(θ; xi), where S1(θ; xi) is the score for just xi.) Under the null hypothesis, we know that

Eθ0[S1(θ0; Xi)] = 0 and Varθ0[S1(θ0; Xi)] = I1(θ0), (19.8)

hence by the central limit theorem,

Tn(x) = Sn(θ0; x) / √(n I1(θ0)) −→D N(0, 1) under the null hypothesis. (19.9)

Then the one-sided score test

Rejects the null hypothesis when Tn(x) > zα, (19.10)

which is approximately level α.

Example. Suppose the Xi’s are iid Cauchy(θ, 1), so that

f (xi | θ) =1

π

1

1 + (xi − θ)2. (19.11)

We wish to testH0 : θ = 0 versus HA : θ > 0, (19.12)

so that θ0 = 0. The score at θ = 0 and information in one observation are, respectively,

S1(0; xi) =2xi

1 + x2i

and I1(0) =1

2. (19.13)


Then the test statistic is

Tn(x) = Sn(0; x) / √(n I1(0)) = ∑_{i=1}^n ( 2xi/(1 + xi²) ) / √(n/2) = (2√2/√n) ∑_{i=1}^n xi/(1 + xi²), (19.14)

and for an approximate 0.05 level, the cutoff point is z0.05 = 1.645. If the information is difficult to calculate, which it is not in this example, then you can use the observed information at θ0 instead. □
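A small computational sketch (assumed, not part of the text) of this score test; the data vector passed in below is hypothetical.

import numpy as np
from scipy.stats import norm

def cauchy_score_test(x, theta0=0.0, alpha=0.05):
    # one-sided score test of H0: theta = theta0 vs HA: theta > theta0
    # for iid Cauchy(theta, 1) data, following (19.13)-(19.14)
    z = np.asarray(x, dtype=float) - theta0
    score = np.sum(2 * z / (1 + z**2))        # S_n(theta0; x)
    t_n = score / np.sqrt(len(z) / 2)         # divide by sqrt(n I_1), I_1 = 1/2
    return t_n, t_n > norm.ppf(1 - alpha)

t_n, reject = cauchy_score_test([0.3, 2.1, -0.8, 5.4, 1.2, 0.7, -1.5, 3.3])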

19.1.2 Many-sided

Now consider a more general problem, where θ ∈ Θ ⊂ R^K, and the null, θ0, is in the interior of Θ. The testing problem is

H0 : θ = θ0 versus HA : θ ∈ Θ − {θ0}. (19.15)

For a fixed value of the alternative θA, under the null we have that

Eθ0[(θA − θ0)′ Sn(θ0; X)] = 0 and Varθ0[(θA − θ0)′ Sn(θ0; X)] = n (θA − θ0)′ I1(θ0) (θA − θ0), (19.16)

because the covariance matrix of the score is the Fisher Information matrix. Then the score statistic for θA is

Tn(x; θA) = (θA − θ0)′ Sn(θ0; x) / √( n (θA − θ0)′ I1(θ0) (θA − θ0) ). (19.17)

For any particular θA, that statistic is approximately N(0, 1) under the null. But we do not want to have to specify an alternative. Because our test statistic is supposed to measure the distance from the null, we could take the θA that maximizes Tn(x; θA), that is, the statistic is

Tn(x) = max_{θA ≠ θ0} Tn(x; θA). (19.18)

To find that maximum, we can simplify by taking

a = I1(θ0)^{1/2} (θA − θ0) / √( (θA − θ0)′ I1(θ0) (θA − θ0) ), so that ‖a‖ = 1. (19.19)

We are assuming that I1 is invertible, and by I1^{1/2} we mean a matrix square root of I1. We can then write

Tn(x) = max_{a : ‖a‖=1} a′b, where b = (1/√n) I1(θ0)^{−1/2} Sn(θ0; x). (19.20)

The Cauchy-Schwarz inequality says that

(a′b)² ≤ ‖a‖² ‖b‖² = ‖b‖², (19.21)


because ‖a‖ = 1. Can that upper bound be achieved? Yes, at least if b ≠ 0K, by taking a = b/‖b‖ (which does have length 1), so that

(a′b)² = ((b/‖b‖)′b)² = (b′b)²/‖b‖² = ‖b‖⁴/‖b‖² = ‖b‖². (19.22)

The bound (19.21) is true for b = 0K, too, of course. Going back to (19.20), then, we obtain

Tn²(x) = (1/n) Sn(θ0; x)′ I1(θ0)^{−1} Sn(θ0; x). (19.23)

This Tn² is the official score statistic for this problem. Note that the dependence on θA is gone.

We still need a cutoff point. The multivariate central limit theorem says that under the null hypothesis,

(1/√n) Sn(θ0; X) ≈ N(0K, I1(θ0)), (19.24)

hence

(1/√n) I1(θ0)^{−1/2} Sn(θ0; X) ≈ N(0K, IK), (19.25)

where this IK is the identity matrix. Thus Tn² is the sum of squares of approximately iid N(0, 1)'s, so that

Tn²(x) ≈ χ²_K, (19.26)

and the approximate level α score test rejects the null hypothesis if

Tn²(x) > χ²_{K,α}. (19.27)

Example. Suppose X ∼ Multinomial3(n, (p1, p2, p3)), and we wish to test that the probabilities are equal:

H0 : p1 = p2 = 1/3 versus HA : (p1, p2) ≠ (1/3, 1/3). (19.28)

We've left out p3 because it is a function of the other two. Leaving it in would cause the information matrix to be noninvertible.

The loglikelihood in one Multinoulli Z = (Z1, Z2, Z3) is

l(p1, p2; z) = z1 log(p1) + z2 log(p2) + z3 log(1 − p1 − p2). (19.29)

The score at the null is then

S1(1/3, 1/3; z) = ( ∂l(p1, p2; z)/∂p1 , ∂l(p1, p2; z)/∂p2 )′ |_{p1=p2=1/3}
= ( z1/p1 − z3/(1 − p1 − p2) , z2/p2 − z3/(1 − p1 − p2) )′ |_{p1=p2=1/3}
= 3 ( z1 − z3 , z2 − z3 )′. (19.30)


The information is the covariance matrix of the score. To find that, we first have under the null that

Cov0[Z] = (1/9) [ 2 −1 −1 ; −1 2 −1 ; −1 −1 2 ]. (19.31)

Because

S1(1/3, 1/3; Z) = 3 [ 1 0 −1 ; 0 1 −1 ] Z, (19.32)

the information is

I1(1/3, 1/3) = Cov[S1(1/3, 1/3; Z)] (19.33)
= (3²/9) [ 1 0 −1 ; 0 1 −1 ] [ 2 −1 −1 ; −1 2 −1 ; −1 −1 2 ] [ 1 0 ; 0 1 ; −1 −1 ] = [ 6 3 ; 3 6 ]. (19.34)

The score for all n observations is the same as for one, but with the Xi's instead of the Zi's. Thus

Tn²(x) = (1/n) ( 3 (X1 − X3, X2 − X3)′ )′ I1(1/3, 1/3)^{−1} ( 3 (X1 − X3, X2 − X3)′ )
= (1/n) (X1 − X3, X2 − X3) [ 2 −1 ; −1 2 ] (X1 − X3, X2 − X3)′
= (1/n) ( 2(X1 − X3)² + 2(X2 − X3)² − 2(X1 − X3)(X2 − X3) )
= (1/n) ( (X1 − X3)² + (X2 − X3)² + (X1 − X2)² ). (19.35)

The cutoff point is χ²_{2,α}, because there are K = 2 parameters. The statistic looks reasonable, because the Xi's would tend to be different if their pi's were. Also, it may not look like it, but this Tn² is the same as the Pearson χ² statistic for these hypotheses,

X² = ∑_{i=1}^3 (Xi − n/3)²/(n/3) = ∑_{i=1}^3 (Obsi − Expi)²/Expi, (19.36)

where Obsi is the observed count xi, and Expi is the expected count under the null, which here is n/3 for each i.
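A quick numerical check (assumed, not from the text) that the score statistic (19.35) and Pearson's X² (19.36) agree; the counts in x are hypothetical.

import numpy as np
from scipy.stats import chi2

x = np.array([30, 42, 28])                      # hypothetical counts, n = 100
n = x.sum()

t2 = ((x[0] - x[2])**2 + (x[1] - x[2])**2 + (x[0] - x[1])**2) / n   # (19.35)
x2 = np.sum((x - n/3)**2 / (n/3))                                   # (19.36)
print(t2, x2, chi2.ppf(0.95, df=2))             # t2 == x2; compare with the cutoff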

19.2 The maximum likelihood ratio test

Even more general than the score test is the maximum likelihood ratio test (MLRT) (or sometimes just LRT), which uses the likelihood ratio statistic, but substitutes the MLE for the parameter value in each density. The statistic is

Λ(x) = sup_{θA∈ΘA} f (x | θA) / sup_{θ0∈Θ0} f (x | θ0) = f (x | θ̂A) / f (x | θ̂0), (19.37)

where θ̂0 is the MLE of θ under the null model, and θ̂A is the MLE of θ under the alternative model. Notice that in the simple versus simple situation, Λ(x) = LR(x). This statistic may or may not lead to a good test. In many situations it is quite reasonable, in the same way that the MLE is often a good estimator. Also, under appropriate conditions, it is easy to find the cutoff point to obtain the approximate level α:

Under H0, 2 log(Λ(X)) −→D χ²_ν, ν = dimension(ΘA) − dimension(Θ0). (19.38)

First we will show some examples, then formalize the above result.

19.2.1 Examples

Example 1. Suppose X1, . . . , Xn are iid N(µ, σ²), and we wish to test whether µ = 0, with σ² unknown:

H0 : µ = 0, σ² > 0 versus HA : µ ≠ 0, σ² > 0. (19.39)

Here, θ = (µ, σ²), so we need to find the MLE under the two models.

• Null. Here, Θ0 = {(0, σ²) | σ² > 0}. Thus the MLE of µ is µ̂0 = 0, since it is the only possibility. For σ², we then maximize

(1/σ^n) e^{−∑ xi²/(2σ²)}, (19.40)

which by the usual calculations yields

σ̂0² = ∑ xi²/n, hence θ̂0 = (0, ∑ xi²/n). (19.41)

• Alternative. Now ΘA = {(µ, σ²) | µ ≠ 0, σ² > 0}, which from long ago we know yields MLE's

θ̂A = (x̄, s²), where s² = σ̂A² = ∑(xi − x̄)²/n. (19.42)

Notice that not only is the MLE of µ different in the two models, but so is the MLE of σ². (Which should be reasonable, because if you know the mean is 0, you do not have to use x̄.) Next, stick those estimates into the likelihood ratio, and see what happens (the √(2π)'s cancel):

2π’s cancel):

Λ(x) = f (x | x̄, s²) / f (x | 0, ∑ xi²/n)
= [ (1/s^n) e^{−∑(xi−x̄)²/(2s²)} ] / [ (1/(∑ xi²/n)^{n/2}) e^{−∑ xi²/(2 ∑ xi²/n)} ]
= ( (∑ xi²/n) / s² )^{n/2} ( e^{−n/2}/e^{−n/2} )
= ( ∑ xi² / ∑(xi − x̄)² )^{n/2}. (19.43)

Thus the MLRT rejects H0 when Λ(x) > c (19.44)

for some cutoff c. To find c, we need the distribution of Λ(X) under the null, which does not look obvious. But we can start to fool around. There are a couple of ways to go. First,

Λ(x) > c ⇐⇒ ∑(xi − x̄)² / ∑ xi² < c* = c^{−2/n}
⇐⇒ ∑(xi − x̄)² / ( ∑(xi − x̄)² + n x̄² ) < c*
⇐⇒ ( ∑(xi − x̄)²/σ² ) / ( ∑(xi − x̄)²/σ² + n x̄²/σ² ) < c*. (19.45)

We know that ∑(Xi − X̄)² and X̄ are independent, and

∑(Xi − X̄)²/σ² ∼ χ²_{n−1} and n X̄²/σ² ∼ χ²_1 (since µ = 0). (19.46)

But χ²_ν is Γ(ν/2, 1/2). Thus, under the null hypothesis,

( ∑(Xi − X̄)²/σ² ) / ( ∑(Xi − X̄)²/σ² + n X̄²/σ² ) ∼ Beta( (n − 1)/2 , 1/2 ), (19.47)

and the c* can be found from the Beta distribution. Or, if you do not have a Beta table handy,

Λ(x) > c ⇐⇒ ( ∑(xi − x̄)² + n x̄² ) / ∑(xi − x̄)² > c* = c^{2/n}
⇐⇒ n x̄² / ∑(xi − x̄)² > c** = c* − 1
⇐⇒ (√n x̄)² / ( ∑(xi − x̄)²/(n − 1) ) > c*** = (n − 1) c**
⇐⇒ |Tn(x)| > c**** = √c***, (19.48)


where Tn is the t-statistic,

Tn(x) = √n x̄ / √( ∑(xi − x̄)²/(n − 1) ). (19.49)

Thus the cutoff is c**** = t_{n−1,α/2}. Which is to say, the MLRT in this case is the usual two-sided t-test.
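A short sketch (assumed, not from the text) illustrating that Λ in (19.43) is a monotone function of |Tn| in (19.49), so the two order data sets identically; the sample x is hypothetical.

import numpy as np

x = np.array([1.2, -0.4, 0.8, 2.1, 0.3, 1.7, -0.2, 0.9])   # hypothetical data
n = len(x)

lam = (np.sum(x**2) / np.sum((x - x.mean())**2))**(n / 2)                    # (19.43)
t_n = np.sqrt(n) * x.mean() / np.sqrt(np.sum((x - x.mean())**2) / (n - 1))   # (19.49)
# since sum(x^2) = sum((x - xbar)^2) + n xbar^2, Lambda = (1 + T_n^2/(n-1))^(n/2)
print(lam, (1 + t_n**2 / (n - 1))**(n / 2))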

Example 2. In (19.28), we tested p = (p1, p2, p3) = (1/3, 1/3, 1/3) for X ∼ Multinomial3(n, p). The likelihood is

L(p1, p2, p3; x) = p1^{x1} p2^{x2} p3^{x3}, (19.50)

ignoring the factorial terms. To find the MLRT, we again need to find two MLE's. Because the null hypothesis is simple, the MLE for the null is just p̂0 = (1/3, 1/3, 1/3). For the alternative, the p is unrestricted, in which case the MLE is p̂A = (x1/n, x2/n, x3/n). Thus

Λ(x) = ( p̂A1^{x1} p̂A2^{x2} p̂A3^{x3} ) / ( p̂01^{x1} p̂02^{x2} p̂03^{x3} )
= ( (x1/n)^{x1} (x2/n)^{x2} (x3/n)^{x3} ) / ( (1/3)^{x1} (1/3)^{x2} (1/3)^{x3} )
= (3/n)^n x1^{x1} x2^{x2} x3^{x3}. (19.51)

Notice that this statistic is quite a bit different from the score statistic in (19.35). What is the distribution of Λ(X) under the null? We will address that question in the next subsection.

19.2.2 Asymptotic null distribution

Similar to the way that, under conditions, the MLE is asymptotically (multivariate) normal, 2 log(Λ(X)) is asymptotically χ² under the null. We will not present the proof of the general result, but sketch it when the null is simple. We have X1, . . . , Xn iid, each with density f*(xi | θ), θ ∈ Θ ⊂ R^K, and make the assumptions in Section 17.1 used for likelihood estimation. Consider testing

H0 : θ = θ0 versus HA : θ ∈ Θ − {θ0} (19.52)

for fixed θ0 ∈ Θ. One of the assumptions is that Θ is an open set, so that θ0 is in the interior of Θ. In particular, this assumption rules out one-sided tests. The MLE under the null is thus the fixed θ̂0 = θ0, and the MLE under the alternative, θ̂A, is the usual MLE.

From (19.37), with

ln(θ; x) = log( f (x | θ)) = ∑_{i=1}^n log( f*(xi | θ)), (19.53)

we have that

2 log(Λ(x)) = 2 (ln(θ̂A; x) − ln(θ0; x)). (19.54)


Expand ln(θ0; x) around θ̂A:

ln(θ0; x) ≈ ln(θ̂A; x) + (θ0 − θ̂A)′ Sn(θ̂A; x) + (1/2)(θ0 − θ̂A)′ Hn(θ̂A; x)(θ0 − θ̂A), (19.55)

where Hn is the K × K second derivative matrix, with (j, k) element

[Hn(θ; x)]_{jk} = ∂² ln(θ; x) / ∂θj ∂θk, j, k = 1, . . . , K. (19.56)

Then −Hn/n is the observed Fisher Information matrix. Looking at (19.55), we have that Sn(θ̂A; x) = 0K, because the score is the first derivative of the log likelihood and θ̂A maximizes it in the interior of Θ, so that by (19.54),

2 log(Λ(x)) ≈ −(θ0 − θ̂A)′ Hn(θ̂A; x)(θ0 − θ̂A)
= −(√n (θ0 − θ̂A))′ (Hn(θ̂A; x)/n) (√n (θ0 − θ̂A)). (19.57)

Now suppose H0 is true, so that θ0 is the true value of the parameter. Then as in (17.74),

√n (θ̂A − θ0) −→D N(0K, I1(θ0)^{−1}), (19.58)

and as in (17.55),

(1/n) Hn(θ̂A; x) −→P −I1(θ0). (19.59)

Thus

2 log(Λ(X)) −→D Z′ I1(θ0) Z ∼ χ²_K, where Z ∼ N(0K, I1(θ0)^{−1}). (19.60)

Now recall the multinomial example with Λ(x) in (19.51). The θ = (p1, p2), even though there are three pi's, because they sum to 1. Thus K = 2. We can check whether the parameterization is OK by looking at the Fisher Information, because by assumption (17.73) it must be invertible. If we take θ = (p1, p2, p3), then the information will not be invertible.

The Θ = {(p1, p2) | 0 < p1, 0 < p2, p1 + p2 < 1}, which is indeed open, and the null θ0 = (1/3, 1/3) is in the interior. Thus, under the null,

2 log(Λ(x)) = 2n log(3/n) + 2 ∑ xi log(xi) −→D χ²_2. (19.61)

A perhaps more familiar expression is

2 log(Λ(x)) = 2 ∑ xi log(xi/(n/3)) = 2 ∑ Obsi log(Obsi/Expi), (19.62)

where as before, Obsi is the observed count xi, and Expi is the expected count under the null. This statistic is called the likelihood ratio χ², not unreasonably, which is a competitor to the Pearson χ² in (19.36).
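A tiny sketch (assumed, not from the text) computing (19.62) for hypothetical counts and comparing it with the χ²_2 cutoff.

import numpy as np
from scipy.stats import chi2

x = np.array([30, 42, 28])                   # hypothetical counts
exp = x.sum() / 3
g2 = 2 * np.sum(x * np.log(x / exp))         # 2 log Lambda, as in (19.62)
print(g2, g2 > chi2.ppf(0.95, df=2))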


Composite null

Again with Θ ⊂ R^K, where Θ is open, we consider the null hypothesis that sets part of θ to zero. That is, partition

θ = (θ(1), θ(2))′, θ(1) is K1 × 1, θ(2) is K2 × 1, K1 + K2 = K. (19.63)

The problem is to test

H0 : θ(1) = 0_{K1} versus HA : θ(1) ≠ 0_{K1}, (19.64)

with θ(2) unspecified. More precisely,

H0 : θ ∈ Θ0 = {θ ∈ Θ | θ(1) = 0_{K1}} versus HA : θ ∈ ΘA = Θ − Θ0. (19.65)

The parameters in θ(2) are called nuisance parameters, because they are not of primary interest, but still need to be dealt with. Without them, we would have a nice simple null. The main result follows.

Theorem 15. If the Cramér assumptions in Section 17.1 hold, then under the null, the MLRT statistic Λ for problem (19.65) has

2 log(Λ(X)) −→D χ²_{K1}. (19.66)

This theorem takes care of the simple null case as well, where K2 = 0.

Example 1. Setting some θi's to zero may seem to be an overly restrictive type of null, but many testing problems can be reparameterized into that form. For example, suppose X1, . . . , Xn are iid N(µX, σ²), and Y1, . . . , Yn are iid N(µY, σ²), and the Xi's and Yi's are independent. We wish to test µX = µY, with σ² unknown. Then the hypotheses are

H0 : µX = µY, σ² > 0 versus HA : µX ≠ µY, σ² > 0. (19.67)

To put these in the form (19.65), take the one-to-one reparameterization

θ = (µX − µY, µX + µY, σ²)′; θ(1) = µX − µY and θ(2) = (µX + µY, σ²)′. (19.68)

Then

Θ0 = {0} × R × (0, ∞) and Θ = R² × (0, ∞). (19.69)

Here, K1 = 1, and there are K2 = 2 nuisance parameters, µX + µY and σ². Thus the asymptotic χ² has 1 degree of freedom.

Example 2. Consider X ∼ Multinomial4(n, p), where p = (p1, p2, p3, p4)′, arranged as a 2 × 2 contingency table, where the factors are performance on Homework and Exams:

HW ↓ ; Exams →    lo    hi
lo                X1    X2
hi                X3    X4          (19.70)

The null hypothesis we want to test is that performance on Exams and HW are independent. One approach is to find a confidence interval for the log odds ratio γ = log(p1 p4/(p2 p3)), and see whether 0 is in the interval. Here we will find the MLRT. Let q1 = P[HW = lo] and r1 = P[Exams = lo], etc., giving the table

HW ↓ ; Exams →    lo              hi
lo                p1              p2              q1 = p1 + p2
hi                p3              p4              q2 = p3 + p4
                  r1 = p1 + p3    r2 = p2 + p4    1          (19.71)

The null hypothesis is then

H0 : p1 = q1r1, p2 = q1r2, p3 = q2r1, p4 = q2r2. (19.72)

But we have to reparameterize so that the null sets something to 0. First, we know that we can drop p4, say. Then consider

(p1, p2, p3) ←→ (p1, q1, r1) ←→ (p1 − q1r1, q1, r1) ≡ θ. (19.73)

Check that these correspondences are indeed one-to-one. (E.g., p2 = q1 − p1, etc.) Also, check that p1 − q1r1 = 0 implies independence as in (19.72). Then

θ(1) = p1 − q1r1 and θ(2) = (q1, r1)′. (19.74)

So, in particular, K1 = 1.

Now for the MLRT. The likelihood is

L(p; x) = p1^{x1} p2^{x2} p3^{x3} p4^{x4}. (19.75)

Under the null as in (19.72),

L(p; x) = (q1r1)^{x1} (q1(1 − r1))^{x2} ((1 − q1)r1)^{x3} ((1 − q1)(1 − r1))^{x4}
= q1^{x1+x2} (1 − q1)^{x3+x4} r1^{x1+x3} (1 − r1)^{x2+x4}. (19.76)

Note that that looks like two binomial likelihoods multiplied together, so we know what the MLE's are: q̂1 = (x1 + x2)/n and r̂1 = (x1 + x3)/n, and under the null the constraint θ(1) = p1 − q1r1 = 0 then holds automatically. Under the alternative, we know that p̂i = xi/n for each i. Then from those we can find the MLE of θ, but actually we do not need to, because all we need is the maximized likelihood. We now have

Λ(x) = ( (x1/n)^{x1} (x2/n)^{x2} (x3/n)^{x3} (x4/n)^{x4} ) / ( (q̂1r̂1)^{x1} (q̂1r̂2)^{x2} (q̂2r̂1)^{x3} (q̂2r̂2)^{x4} ). (19.77)

Shifting the n’s to the denominator, we have

2 log(Λ(x)) = 2 (x1 log

(x1

nq1r1

)+ x2 log

(x2

nq1r2

)

+ x3 log

(x3

nq2r1

)+ x4 log

(x4

nq2r2

))

= ∑ Obsi log

(Obsi

Expi

), (19.78)

Page 248: Mathematical  Statistics

240 Chapter 19. Likelihood-based testing

where Expi is the MLE of the expected value of Xi under the null. Under the null,

this statistic is asymptotically χ21.

One lesson is that it is fine to use different parameterizations for the two hypothe-ses when finding the respective MLE’s. Expressing the hypotheses in terms of the θis mainly to make sure the theorem can be used, and to find the K1.
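A sketch (assumed, not from the text) of (19.78) for a 2 × 2 table; the counts below are hypothetical.

import numpy as np
from scipy.stats import chi2

counts = np.array([[38, 16],
                   [16, 37]])                      # hypothetical 2x2 counts
n = counts.sum()
q_hat = counts.sum(axis=1) / n                     # row proportions
r_hat = counts.sum(axis=0) / n                     # column proportions
exp = n * np.outer(q_hat, r_hat)                   # Exp_i = n q_hat r_hat

g2 = 2 * np.sum(counts * np.log(counts / exp))     # 2 log Lambda, (19.78)
print(g2, g2 > chi2.ppf(0.95, df=1))               # asymptotically chi^2_1 under H0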


Chapter 20

Nonparametric Tests and Randomization

There are two steps in testing a pair of hypotheses: finding a test statistic with good power, and finding its distribution under the null hypothesis. The previous chapter showed that good test statistics tend to arise from the likelihood ratio. Finding the null distribution depended either on getting lucky, by being able to express the statistic as a random variable with a familiar distribution, or on using asymptotic normality or χ²-ity if certain assumptions held. This chapter takes a more expansive approach, in not necessarily being tied to parametric models, or at least wishing to be robust against departures from the model, and in dealing with test statistics that may not have an easy null distribution, even approximately.

To illustrate, recall the example (18.15) with X1, . . . , Xn iid N(µ, σ²), where we test whether µ = µ0. Two possible statistics were the (absolute) t-statistic and sign statistic, respectively

Tn(x) = |x̄ − µ0| / (s/√n) and Sn(x) = | #{xi < µ0}/n − 1/2 |. (20.1)

It turns out that the t-test has better power than the sign test. (Both statistics are scale-invariant, in the sense that multiplying the xi − µ0's by the same non-zero constant does not change the statistic. Then it can be shown that the t-test is the UMP level α invariant test. But we won't go into that, as interesting as it is.)

There are several ways our view of this testing problem can be expanded:

• Robustness. The distribution of the Xi's may not be exactly normal, but something close, or at least normal but with the possibility of outliers. It still would make sense to test whether the mean is zero, but what happens to the two statistics? From the central limit theorem and Slutsky, we know that as long as the Xi's are iid and have finite variance, Tn(X) →D |N(0, 1)|, so that at least an approximate level α test is to reject when Tn > 1.96. The sign test is also valid, as long as µ0 is the mean and the median, since Yn = #{xi < µ0} would be Bin(n, 1/2) anytime the Xi's are iid (and continuous) with median µ0. Under those conditions, the same cutoff point would work whether normality held or not. Whether the power of the tests would be adequate is another question.

• Nonparametric. Rather than assume normality or near-normality, one can instead change the problem to a totally nonparametric one. For example, suppose X1, . . . , Xn are iid, Xi ∼ F, where

F ∈ F = {F | F is a continuous distribution function with finite variance that is symmetric about its median η}. (20.2)

That symmetry means P[X − η > c] = P[X − η < −c], i.e., 1 − F(c + η) = F(−c + η) for any c. That set of distribution functions is called nonparametric because it is not indexed by a finite-dimensional parameter θ. That doesn't mean there aren't parameters, e.g., means and variances, etc. Notice that those F's include the normal, double exponential, and logistic, but not the Cauchy (because the Cauchy does not have finite variance) nor the exponential, which is not symmetric. Then the testing problem would be

H0 : F ∈ F with η = η0 versus HA : F ∈ F with η ≠ η0 (20.3)

for some fixed η0. Notice the F is a nuisance parameter (or nuisance non-parameter, if you wish). This problem is testing whether the median is a specific value, but by the conditions on F, the median and the mean are the same. If we removed the symmetry condition, then we would not be testing whether the mean is η0. Sign and rank tests are often used for such nonparametric hypotheses, because their distribution does not depend on F.

• Randomization. We may wish to use a statistic whose distribution under the null is difficult to find or even approximate well, whether because of the statistic itself or because of the lack of specificity in the null. Instead of finding the distribution of the test statistic, we look at the statistic as the data is randomly tweaked in a way consistent with the null hypothesis. The randomization distribution of the statistic T(x) is the result of this tweaking.

In the next few sections, we will look at some specific testing problems and their nonparametric counterparts, showing how to find the randomization distribution of any test statistic, and exhibiting some popular sign and rank tests.

20.1 Testing the median

As discussed above, the nonparametric analogs to hypotheses on the mean are hypotheses on the median. Thus to repeat, we assume that X1, . . . , Xn are iid, Xi ∼ F ∈ F, where F is the set of continuous distribution functions that have finite variance and are symmetric about their median η, as in (20.2). We test

H0 : F ∈ F with η = 0 versus HA : F ∈ F with η ≠ 0. (20.4)

(What follows is easy to extend to tests of η = η0: just subtract η0 from all the Xi's.) The two-sided t-test is valid for this problem, although it is not similar, that is, the null distribution of T depends on exactly which F holds. The sign test depends on the signs of the data, Sign(x1), . . . , Sign(xn), where for any z ∈ R,

Sign(z) = +1 if z > 0, 0 if z = 0, −1 if z < 0. (20.5)


The two-sided sign statistic is then

S(x) = | ∑_{i=1}^n Sign(xi) |. (20.6)

If none of the xi's is exactly 0, then this statistic is equivalent to that in (18.17). The exact distribution is the same as that of |2Y − n| where Y ∼ Binomial(n, 1/2). It is also easy to approximate, either by using the normal approximation to the binomial, or noting that under the null,

E0[Sign(Xi)] = 0 and Var0[Sign(Xi)] = E0[Sign(Xi)²] = 1. (20.7)

Then an approximate level α test rejects the null when

S(x)/√n > zα/2. (20.8)

As mentioned in the introduction to this chapter, the t-test is also valid, the statistic being approximately N(0, 1) for large n. Which has better power? It depends on what the F is. If it is Normal, the t-test is best. If it is Double Exponential, the sign test is best. In fact, considering Pitman efficiency, which we will not go into (STAT 575 does) but is analogous to efficiency in estimation, the relative Pitman efficiency of the t-test to the sign test is the same as the asymptotic efficiency of the mean to the median in estimating a location parameter.

20.1.1 A randomization distribution

One could also think of other tests, for example, the two-sided "median" test, which rejects H0 when

M(x) ≡ |Median{x1, . . . , xn}| > c. (20.9)

The problem here is that without knowing the F, we cannot even use the normal approximation to the distribution, because that involves 1/(4 f(0)²). There are ways to estimate the density, but that adds an extra layer of complication. Instead, we can use the randomization distribution of M(x). The idea is to randomly change the xi's in a way consistent with the null hypothesis, and see what happens to the test statistic. More precisely, find a group G under which the distribution of X is unchanged for any F in the null. That is, where hg is the action, if the null is true,

hg(X) =D X for any g ∈ G. (20.10)

For this testing problem, the group is the set of ±1's:

G = {(ǫ1, . . . , ǫn) | ǫi = ±1}, (20.11)

and the action is to multiply the xi's by the ǫi's:

h(ǫ1,...,ǫn)(x1, . . . , xn) = (ǫ1x1, . . . , ǫnxn). (20.12)

This is just a fancy way of changing the signs on some of the xi's. Then the requirement (20.10) is true since Xi and −Xi have the same distribution under the null, i.e., when the median is 0. Notice it is not true under the alternative.


Now switch from a fixed g and random X to the fixed data x and random G. That is, suppose G is uniformly distributed over G, that is,

P[G = g] = 1/#G, g ∈ G. (20.13)

For our G, there are 2^n elements, and what we have is that ǫ1, . . . , ǫn are iid Uniform{−1, +1}, i.e.,

P[ǫi = +1] = P[ǫi = −1] = 1/2. (20.14)

Look at the random variable M(hG(x)) defined by

M(hG(x)) = M(ǫ1x1, . . . , ǫnxn), (20.15)

where the ǫi's are random. The distribution of M(hG(x)) is the randomization distribution of M(x). The cutoff point we use is the c(x) that gives the desired α:

P[M(hG(x)) > c(x) | X = x] ≤ (but ≈) α. (20.16)

(Or we could randomize where M = c to get exact α.) The conditional "X = x" is there just to emphasize that it is not the x that is random, but the ǫi's.

To illustrate, suppose n = 5 and the data are

x1 = 24.3, x2 = 10.9, x3 = −4.6, x4 = 14.6, x5 = 10.0, =⇒ M(x) = 10.9. (20.17)

Take α = 0.25 (which is a bit big, but n is only 5).


Here are the possible randomizations of the data, and the values of M(hG(x)) and the |Mean|:

 ǫ1x1    ǫ2x2    ǫ3x3    ǫ4x4    ǫ5x5    M(hg(x))   |Mean|
  24.3    10.9    −4.6    14.6    10      10.9       11.04
  24.3    10.9    −4.6    14.6   −10      10.9        7.04
  24.3    10.9    −4.6   −14.6    10      10          5.2
  24.3    10.9    −4.6   −14.6   −10       4.6        1.2
  24.3    10.9     4.6    14.6    10      10.9       12.88
  24.3    10.9     4.6    14.6   −10      10.9        8.88
  24.3    10.9     4.6   −14.6    10      10          7.04
  24.3    10.9     4.6   −14.6   −10       4.6        3.04
  24.3   −10.9    −4.6    14.6    10      10          6.68
  24.3   −10.9    −4.6    14.6   −10       4.6        2.68
  24.3   −10.9    −4.6   −14.6    10       4.6        0.84
  24.3   −10.9    −4.6   −14.6   −10      10          3.16
  24.3   −10.9     4.6    14.6    10      10          8.52
  24.3   −10.9     4.6    14.6   −10       4.6        4.52
  24.3   −10.9     4.6   −14.6    10       4.6        2.68
  24.3   −10.9     4.6   −14.6   −10      10          1.32
 −24.3    10.9    −4.6    14.6    10      10          1.32
 −24.3    10.9    −4.6    14.6   −10       4.6        2.68
 −24.3    10.9    −4.6   −14.6    10       4.6        4.52
 −24.3    10.9    −4.6   −14.6   −10      10          8.52
 −24.3    10.9     4.6    14.6    10      10          3.16
 −24.3    10.9     4.6    14.6   −10       4.6        0.84
 −24.3    10.9     4.6   −14.6    10       4.6        2.68
 −24.3    10.9     4.6   −14.6   −10      10          6.68
 −24.3   −10.9    −4.6    14.6    10       4.6        3.04
 −24.3   −10.9    −4.6    14.6   −10      10          7.04
 −24.3   −10.9    −4.6   −14.6    10      10.9        8.88
 −24.3   −10.9    −4.6   −14.6   −10      10.9       12.88
 −24.3   −10.9     4.6    14.6    10       4.6        1.2
 −24.3   −10.9     4.6    14.6   −10      10          5.2
 −24.3   −10.9     4.6   −14.6    10      10.9        7.04
 −24.3   −10.9     4.6   −14.6   −10      10.9       11.04
                                                      (20.18)

The randomization distribution of M can then be gathered from the column of M(hG(x))'s:

P[M(hG(x)) = 4.6 | X = x] = 12/32 = 3/8,
P[M(hG(x)) = 10 | X = x] = 12/32 = 3/8,
P[M(hG(x)) = 10.9 | X = x] = 8/32 = 1/4. (20.19)

To obtain α = 0.25, we can take the cutoff point somewhere between 10 and 10.9, say 10.5. Notice that this cutoff point depends on the data; different data would yield a different randomization distribution. The test then rejects the null if

M(x) > c(x) = 10.5.


The actual M(x) = 10.9, so in this case we do reject.

We could also obtain the randomization distribution of |x̄|, which is given in the last column of (20.18). Here are the values in order:

0.84 0.84 1.20 1.20 1.32 1.32 2.68 2.68 2.68 2.68

3.04 3.04 3.16 3.16 4.52 4.52 5.20 5.20 6.68 6.68

7.04 7.04 7.04 7.04 8.52 8.52 8.88 8.88 11.04 11.04

12.88 12.88

If we choose the cutoff as 8, say, then 8/32 = 0.25 of those values exceed 8. That is, we can reject the null when |x̄| > 8. From the data, x̄ = 11.04, so we again reject.

When n is large, it may be tedious to find the randomization distribution exactly, because there are 2^n elements in G. But we can easily use computer simulations.
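For instance, here is a small sketch (assumed, not from the text) that enumerates all 2^5 sign flips for the data in (20.17) and recovers the randomization distribution (20.19); for larger n one would sample the ǫ's instead.

from itertools import product
from statistics import median
from collections import Counter

x = [24.3, 10.9, -4.6, 14.6, 10.0]
values = [abs(median([e * xi for e, xi in zip(eps, x)]))
          for eps in product([-1, 1], repeat=len(x))]
dist = Counter(values)
print({v: c / len(values) for v, c in sorted(dist.items())})
# {4.6: 0.375, 10.0: 0.375, 10.9: 0.25}, matching (20.19)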

Example. Each of n = 16 tires was subject to measurement of tread wear by two methods, one based on weight loss and one on groove wear. See http://lib.stat.cmu.edu/DASL/Datafiles/differencetestdat.html. Thus the data are (x1, y1), . . . , (xn, yn), with the x's and y's representing the two measurements. We assume the tires (i.e., the (xi, yi)'s) are iid, and the null hypothesis is that the two measurement methods have the same distribution. (We are not assuming Xi is independent of Yi.) Under the null, Zi = Xi − Yi has a distribution symmetric about 0, so we can test the null using the statistic M(z) = |Median(z1, . . . , zn)| again. Here are the data:

          X      Y      Z
  1      45.9   35.7   10.2
  2      41.9   39.2    2.7
  3      37.5   31.1    6.4
  4      33.4   28.1    5.3
  5      31.0   24.0    7.0
  6      30.5   28.7    1.8
  7      30.9   25.9    5.0
  8      31.9   23.3    8.6
  9      30.4   23.1    7.3
 10      27.3   23.7    3.6
 11      20.4   20.9   −0.5
 12      24.5   16.1    8.4
 13      20.9   19.9    1.0
 14      18.9   15.2    3.7
 15      13.7   11.5    2.2
 16      11.4   11.2    0.2
                               (20.20)

Then M(z) = 4.35. Is that significant? Rather than find the exact randomization distribution, I found 10000 samples from it, that is, 10000 times the zi's were multiplied by random ±1's, and the resulting M(hG(z))'s were calculated. A histogram of the results:


[Figure: histogram of the 10000 randomization values M*(x), titled "Randomization values"; the horizontal axis runs from 0 to 4 and the vertical axis gives the frequency.]

There is not much beyond the observed M(z) = 4.35. We can count how many are greater than or equal to the observed value; there are 36. That is,

P[M(hG(z)) ≥ M(z)] = P[M(hG(z)) ≥ 4.35] ≈ 36/10000 = 0.0036. (20.21)

That is the randomization p-value. It is less than 0.05 by quite a bit. Thus we can easily reject the null hypothesis for α = 0.05. (That is, the cutoff point for α = 0.05 would be smaller than 4.35.)
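A Monte Carlo sketch (assumed, not from the text) of this calculation, using the zi's from (20.20); the seed and number of replications are arbitrary.

import numpy as np

z = np.array([10.2, 2.7, 6.4, 5.3, 7.0, 1.8, 5.0, 8.6,
              7.3, 3.6, -0.5, 8.4, 1.0, 3.7, 2.2, 0.2])
rng = np.random.default_rng(0)

m_obs = abs(np.median(z))                                    # 4.35
m_star = np.array([abs(np.median(rng.choice([-1, 1], size=len(z)) * z))
                   for _ in range(10000)])
print((m_star >= m_obs).mean())                              # randomization p-value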

20.2 Randomization test formalities

The advantage of randomization tests is that the exact distribution of M(hG(z)) is, at least in principle, obtainable. You do not need to know unknown nuisance parameters or anything. But are the resulting tests valid? What does valid mean?

The formal setup is as in the previous section. We have the data X with space X and set of null distributions F0. (It may or may not be parameterized.) We also have a finite group G that acts on the data such that if X ∼ F for F ∈ F0,

hg(X) =D X for any g ∈ G, (20.22)


where hg(x) is the action. Let G be a random element of the group,

G ∼ Uniform(G) =⇒ P[G = g] = 1/#G, g ∈ G. (20.23)

(This definition can be extended to G being compact.)

Definition 36. Let T(x) be a given test statistic, and α a given level. Suppose c(x) is a function of x that satisfies

1. For each x ∈ X,
P[T(hG(x)) > c(x) | X = x] ≤ α, (20.24)

and

2. The function c is G-invariant, i.e.,
c(hg(x)) = c(x) for all g ∈ G, x ∈ X. (20.25)

Then the randomization test rejects the null hypothesis if

T(x) > c(x). (20.26)

Part 2 of the definition is virtually automatic, because the randomization distribution of T(x) is the same as that for T(hg(x)), so the cutoff points for data x and hg(x) can be taken to be the same. For example, for the "±1" group action in (20.12), the data x = (24.3, 10.9, −4.6, 14.6, 10.0) in (20.17) yields the same randomization distribution in (20.18) as does the data (−24.3, 10.9, −4.6, 14.6, −10.0), or the data (24.3, 10.9, 4.6, 14.6, 10.0), or the data (−24.3, −10.9, −4.6, −14.6, −10.0), or any other data set that has the same absolute values, but possibly different signs.

The next result shows that the randomization test really does have the correct level.

Lemma 35. Given the above setup, for the test in Definition 36,

PF[T(X) > c(X)] ≤ α if X ∼ F for F ∈ F0. (20.27)

Proof. Look at the random vector

hG(X), (20.28)

where X and G are independent, and X ∼ F for F ∈ F0. Then

(hG(X) | G = g) =D hg(X) =D X, (20.29)

where the last equation is the condition (20.22). That is, the conditional distribution of hG(X) given G = g is the same for each g, hence the unconditional distribution is also the same:

hG(X) =D X. (20.30)

Thus,

PF[T(X) > c(X)] = PF[T(hG(X)) > c(hG(X))]
= PF[T(hG(X)) > c(X)] (by part 2 of Definition 36)
= EF[p(X)], (20.31)

where

p(x) = P[T(hG(x)) > c(x) | X = x]. (20.32)

But by Part 1 of the definition, (20.24), p(x) ≤ α for all x. Thus by (20.31), (20.27) holds. □

We end this section with some remarks.


20.2.1 Using p-values

Often in practice, rather than trying to find c(x), it is easier to find the randomization p-value, which for given data x is

p-value(x) = P[T(hG(x)) ≥ T(x) | X = x], (20.33)

as in (20.21). The p-value is often called the "observed significance level," because it is basically the level you would obtain if you used the observed test statistic as the cutoff point. The randomization test then rejects the null hypothesis if

p-value(x) ≤ α. (20.34)

This procedure works because if p-value(x) ≤ α, then for the level α test, it must be that c(x) < T(x).

20.2.2 Using the normal approximation

It may be that the Central Limit Theorem can be used to approximate the randomization distribution of the test statistic. For example, consider the statistic T(x) = ∑ xi, and the group G and action as in (20.12). Then

T(hG(x)) = ∑ ǫixi. (20.35)

The ǫi's are iid, with P[ǫi = −1] = P[ǫi = +1] = 1/2, hence it is easy to see that as in (20.7),

E[ǫi] = 0 and Var[ǫi] = 1, (20.36)

hence

E[T(hG(x))] = ∑ E[ǫi] xi = 0 and Var[T(hG(x))] = ∑ Var[ǫi] xi² = ∑ xi². (20.37)

At least if n is fairly large and the xi's are not too strange,

( T(hG(x)) − E[T(hG(x))] ) / √Var[T(hG(x))] = ∑ ǫixi / √(∑ xi²) ≈ N(0, 1). (20.38)

An approximate randomization test then uses the normalized version of the test statistic and the normal cutoff, that is, the approximate test rejects if

∑ xi / √(∑ xi²) > zα. (20.39)

This normalized statistic is close to the t-statistic:

∑ xi / √(∑ xi²) = √n x̄ / √(∑ xi²/n) ≈ √n x̄ / √(∑(xi − x̄)²/(n − 1)) = t, (20.40)

at least if x̄ ≈ 0. But the justification of the cutoff point in (20.39) is based on the randomness of the G, not the randomness of the Xi's. In fact, R. A. Fisher used the randomization justification to bolster the use of the normal-theory tests. (He also invented randomization tests.)


20.2.3 When the regular and randomization distributions coincide

It can happen that the randomization distribution of a statistic does not depend on the data x. That is, there is some random variable T* such that for any x,

(T(hG(X)) | X = x) =D T*. (20.41)

The conditional distribution of T(hG(X)) given X = x is independent of x, which means that the unconditional distribution is the same as the conditional distribution:

T(hG(X)) =D T*. (20.42)

But by (20.29), hG(X) and X have the same distribution (under the null), hence

T(X) =D T*. (20.43)

Thus the randomization distribution and the regular distribution are the same, and the randomization cutoff point is the same as the regular cutoff point.

For example, suppose under the null hypothesis that X1, . . . , Xn are iid continuous and symmetric about 0, and consider the sign statistic

T(x) = ∑ Sign(xi). (20.44)

With the group G being the ±1’s,

T(hG(x)) = ∑ Sign(ǫixi). (20.45)

The Sign(ǫixi)’s are iid Uni f orm−1, +1, no matter what the xi’s are (as long asnone are 0). Thus for any x,

T(hG(x)) =D T∗ = 2Y− n, (20.46)

where Y ∼ Binoimal(n, 1/2). But we also know that T(X) = ∑ Sign(Xi) has the samedistribution:

T(X) =D 2Y− n. (20.47)

That is, the regular and randomization tests coincide.

20.2.4 Example: The signed rank test

Go back to the example with X1, . . . , Xn iid, Xi ∼ F ∈ F, where F is the set of continuous distribution functions that have finite variance and are symmetric about their median η, as in (20.2). Now consider the one-sided testing problem,

H0 : F ∈ F with η = 0 versus HA : F ∈ F with η > 0. (20.48)

One drawback to the sign test is that it takes into account only which side of 0 each observation is on, not how far above or below. It could be that half the data are around −1 and half around 100. The sign statistic would be about 0, yet the data itself would suggest the median is around 50. One modification is to weight the signs with the ranks of the absolute values. That is, let

(r1, . . . , rn) = (rank(|x1|), . . . , rank(|xn|)), (20.49)

where the smallest |xi| has rank 1, the second smallest has rank 2, ..., and the largest has rank n. (Since F is continuous, we can ignore the possibility of ties.)

For example, with the data from (20.17),

x1 = 24.3, x2 = 10.9, x3 = −4.6, x4 = 14.6, x5 = 10.0, (20.50)

the absolute values are

|x1| = 24.3, |x2| = 10.9, |x3| = 4.6, |x4| = 14.6, |x5| = 10.0, (20.51)

and the ranks of the absolute values are

r1 = 5, r2 = 3, r3 = 1, r4 = 4, r5 = 2. (20.52)

The signed rank statistic is defined as

U(x) = ∑ ri Sign(xi). (20.53)

Using the same ±1 group, the randomization statistic is

U(hG(x)) = ∑ ri Sign(ǫixi), (20.54)

because |ǫixi| = |xi|, hence the ranks are invariant under the group. Notice that this statistic is just the integers 1, . . . , n multiplied by random ±1's:

U(hG(x)) =D ∑ i ǫi. (20.55)

In particular, the distribution is independent of x, which means that the randomiza-tion distribution and the regular null distribution are the same (as in Section 20.2.3.)It is easy enough to write out the distribution if n is small, as in (20.18), but with1, 2, 3, 4, 5 for the data. It is also easy to simulate. We will take the normalizationroute from Section 20.2.2. The ǫi’s are iid Uni f orm−1, +1, hence as in (20.36). theyhave means 0 and variances 1, and as in (20.37),

E[∑ i ǫi] = 0 and Var[∑ i ǫi] = ∑ i2 =n(n + 1)(2n + 1)

6. (20.56)

Thus the test using the normalized statistic rejects the null hypothesis when

∑ ri Sign(xi)√n(n + 1)(2n + 1)/6

> zα. (20.57)

This test is both the randomization test and the regular test.For example, look at the tire wear data in (20.20). Applying the signed rank test

to the Zi’s column, note that only one, the −0.5, is negative. It has the second lowestabsolute value, so that U is the sum of 1, 3, . . . , 16 minus 2, which equals 132. Thevariance in (20.56) is 1496, hence the normalized statistic is

132√1496

= 3.413. (20.58)

which exceeds 1.645, so we would reject the null hypothesis at level 0.05. It looks likethe two measurements do differ.
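A sketch (assumed, not from the text) of (20.53) and (20.57) applied to the zi's in (20.20).

import numpy as np
from scipy.stats import rankdata, norm

z = np.array([10.2, 2.7, 6.4, 5.3, 7.0, 1.8, 5.0, 8.6,
              7.3, 3.6, -0.5, 8.4, 1.0, 3.7, 2.2, 0.2])
n = len(z)

u = np.sum(rankdata(np.abs(z)) * np.sign(z))          # signed rank statistic (20.53)
z_stat = u / np.sqrt(n * (n + 1) * (2 * n + 1) / 6)   # normalized as in (20.57)
print(u, z_stat, z_stat > norm.ppf(0.95))             # 132.0, about 3.41, True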


20.3 Two-sample tests

The Normal-based two-sample testing situation has data X1, . . . , Xn iid N(µX, σ²) and Y1, . . . , Ym iid N(µY, σ²), where the Xi's and Yi's are independent, and one tests

H0 : µX = µY, σ² > 0 versus HA : µX > µY, σ² > 0. (20.59)

(Or, one could have a two-sided alternative.) There are various nonparametric analogs of this problem. The one we will deal with has

X1, . . . , Xn iid ∼ FX; Y1, . . . , Ym iid ∼ FY, (20.60)

where the Xi's and Yi's are independent, and FX and FY are continuous. The null hypothesis is that the distribution functions are equal, and the alternative is that FX is stochastically larger than FY:

Definition 37. The distribution function F is stochastically larger than the distribution function G, written

F >st G, (20.61)

if

F(c) ≤ G(c) for all c ∈ R, and

F(c) < G(c) for some c ∈ R. (20.62)

It looks like the inequality is going the wrong way, but the idea is that if X ∼ F and Y ∼ G, then F being stochastically larger than G means that the X tends to be larger than the Y, or

P[X > c] > P[Y > c], which implies that 1 − F(c) > 1 − G(c) =⇒ F(c) < G(c). (20.63)

One can also say that "X is stochastically larger than Y."

For example, if X and Y are both from the same location family, but X's parameter is larger than Y's, then X is stochastically larger than Y. The same is true if both are from a family with monotone likelihood ratio, with X's parameter the larger one.

Back to the testing problem. With the data in (20.60), the hypotheses are

H0 : FX = FY, continuous versus HA : FX >st FY, FX and FY continuous. (20.64)

We reject the null if the xi's are too much larger than the yi's in some sense. Some possible test statistics:

Difference in means : A(x, y) = x̄ − ȳ
Difference in medians : M(x, y) = Median{x1, . . . , xn} − Median{y1, . . . , ym}
Wilcoxon W : W(x, y) = r̄X − r̄Y
Mann-Whitney U : U(x, y) = ∑_{i=1}^n ∑_{j=1}^m Sign(xi − yj)
(20.65)


For the Wilcoxon statistic, we first find the ranks of the data, considered as one big sample of n + m observations:

r1 = rank(x1), r2 = rank(x2), . . . , rn = rank(xn),
rn+1 = rank(y1), rn+2 = rank(y2), . . . , rn+m = rank(ym). (20.66)

Then r̄X is the mean of the ranks of the xi's, etc., i.e.,

r̄X = (1/n) ∑_{i=1}^n ri, r̄Y = (1/m) ∑_{i=n+1}^{n+m} ri. (20.67)

The Wilcoxon statistic is the same as the difference in means, except with the actual observations replaced by their ranks. The Mann-Whitney statistic counts +1 if an xi is larger than a yj, and −1 if the reverse. Thus the larger the xi's, the more +1's in the summation.

In order to find the randomization distributions of these statistics, we need to first decide how to randomize. The ±1 group is not relevant here.

20.3.1 Permutations

If the null hypothesis is true, then the combined sample of n + m observations, X1, . . . , Xn, Y1, . . . , Ym, is an iid sample. Thus the distribution of the combined sample does not depend on the order of the observations. E.g., if n = 2, m = 3, the following are true under the null hypothesis:

(X1, X2, Y1, Y2, Y3) =D (X2, Y3, Y2, X1, Y1) =D (Y3, Y2, Y1, X2, X1). (20.68)

Note that (20.68) is not true under the alternative. The randomization is thus a reordering of the entire sample, which means the group is Pn+m, the group of (n + m) × (n + m) permutation matrices. If we let z be the (n + m) × 1 vector of xi's and yi's,

z = (x1, . . . , xn, y1, . . . , ym)′, (20.69)

then the group action for Γ ∈ Pn+m is

hΓ(z) = Γz. (20.70)

Thus for any statistic T(z), the randomization distribution is the distribution of

T(hΓ(z)) = T(Γz), Γ ∼ Uniform(Pn+m). (20.71)

The group has (n + m)! elements, so for small n + m it is not too difficult to list out the distribution. But (n + m)! gets large quite quickly, so simulations or normal approximations are usually useful. (Actually, because the order within the x and y groups is irrelevant, there are really only (n+m choose n) distinct values of T(Γz) to worry about.)

The difference-in-medians test statistic appears to be best handled with simulations; I do not see an obvious normal approximation. The other three statistics are easier to deal with, so we will deal with those.


20.3.2 Difference in means

Let

a = (1/n, . . . , 1/n, −1/m, . . . , −1/m)′ = ( (1/n) 1n′ , −(1/m) 1m′ )′. (20.72)

Then

A(z) = a′z = a′(x1, . . . , xn, y1, . . . , ym)′ = x̄ − ȳ = A(x, y). (20.73)

The randomization distribution of the statistic A(x, y) is thus the distribution of

A(Γz) = a′Γz. (20.74)

If n and m are not small, we can use a normal approximation. In order to find the mean and variance of A(Γz), it would be helpful to first find the distribution of the vector Γz.

Lemma 36. Suppose v is a fixed N × 1 vector, and Γ ∼ Uniform(PN). Then

E[Γv] = v̄ 1N and Cov[Γv] = s²_v HN, (20.75)

where

s²_v = ∑(vi − v̄)²/(N − 1) and HN = IN − (1/N) 1N 1N′, (20.76)

the N × N centering matrix.

The proof is in the next section, Section 20.3.3. The lemma shows that

E[Γz] = z̄ 1_{n+m}, where z̄ = (n x̄ + m ȳ)/(n + m) (20.77)

is the pooled mean of both xi's and yi's, and

Cov[Γz] = s²_z H_{n+m}, (20.78)

where s²_z is the sample variance of the combined xi's and yi's. (It is not the pooled variance in the usual sense, because that would subtract the respective means from the observations in the two groups.) Then because A(Γz) = a′Γz, we easily have

E[A(Γz)] = z̄ a′1_{n+m} = 0, (20.79)

because the sum of the elements of a is n × (1/n) − m × (1/m) = 0. Also,

Var[A(Γz)] = Var[a′Γz]
= a′ Cov[Γz] a
= s²_z (a′a − 0) (again because a′1_{n+m} = 0)
= s²_z ( n × 1/n² + m × 1/m² )
= s²_z ( 1/n + 1/m ). (20.80)


Thus using the normalized randomization statistic, we reject H0 when

( x̄ − ȳ ) / ( s_z √(1/n + 1/m) ) > zα. (20.81)

If s²_z were replaced by the usual pooled variance, then this statistic would be exactly the two-sample t statistic.

20.3.3 Proof of Lemma 36

Let W = Γv. By symmetry, the elements of W all have the same mean, so that E[W] = a 1N for some a. Now

1N′ E[W] = a 1N′ 1N = N a, (20.82)

and

1N′ E[W] = E[1N′ W] = E[1N′ Γv] = 1N′ v = ∑ vi, (20.83)

because 1N′ Γ = 1N′ for any permutation matrix. Thus N a = ∑ vi, or a = v̄.

Similarly, the Var[Wi]'s are all equal, and the covariances Cov(Wi, Wj) are all equal (if i ≠ j). Thus for some b and c,

Cov[W] = b IN + c 1N 1N′. (20.84)

Take the trace (the sum of the variances) of both sides:

trace(Cov[W]) = trace(E[(W − v̄ 1N)(W − v̄ 1N)′])
= trace(E[(Γv − v̄ 1N)(Γv − v̄ 1N)′])
= trace(E[Γ (v − v̄ 1N)(v − v̄ 1N)′ Γ′])
= E[trace((v − v̄ 1N)(v − v̄ 1N)′ Γ′Γ)]
= trace((v − v̄ 1N)(v − v̄ 1N)′)
= ∑(vi − v̄)², (20.85)

and

trace(Cov[W]) = trace(b IN + c 1N 1N′) = N(b + c), (20.86)

hence

b + c = ∑(vi − v̄)²/N. (20.87)

Next, note that the sum of the Wi's is a constant, ∑ vi, hence its variance is 0. That is,

Var[1N′ W] = Var[1N′ Γv] = Var[1N′ v] = 0, (20.88)

but also

0 = Var[1N′ W] = 1N′ Cov[W] 1N = 1N′ (b IN + c 1N 1N′) 1N = N b + N² c. (20.89)

Thus

b + N c = 0. (20.90)

Solving for b and c in (20.87) and (20.90), we have c = −b/N, hence

b = ∑(vi − v̄)²/(N − 1) = s²_v and c = −(1/N) s²_v. (20.91)


Thus from (20.84),

Cov[W] = s²_v IN − (1/N) s²_v 1N 1N′ = s²_v (IN − (1/N) 1N 1N′) = s²_v HN, (20.92)

as in (20.76). □

20.3.4 Wilcoxon two-sample rank test

The randomization distribution of the Wilcoxon two-sample test statistic, W in (20.65), works basically the same as for the difference-in-means statistic, but the values are ranks rather than the original variables. That is, we can write

W(z) = W(x, y) = r̄X − r̄Y = a′r, (20.93)

where r is the (n + m) × 1 vector of ranks from (20.66),

r = (r1, . . . , rn, rn+1, . . . , rn+m)′ = (rank(x1), . . . , rank(xn), rank(y1), . . . , rank(ym))′. (20.94)

The randomization distribution is that of W(Γz) = a′Γr. Let N = n + m be the total sample size. Then notice that r contains the integers 1, . . . , N in some order (assuming no ties among the data). Because Γ is a random permutation, the distribution of Γr does not depend on the order of the ri's, so that

Γr =D Γ (1, . . . , N)′ ≡ Γ 1:N. (20.95)

In particular, as in Section 20.2.3, the randomization distribution of W(Γz) does not depend on z, hence the randomization distribution is the same as the regular null distribution of W(Z).

For the normalized statistic, from Lemma 36,

E[Γ r] = E[Γ 1:N] = ( ∑_{i=1}^N i / N ) 1N = ((N + 1)/2) 1N, (20.96)

and

Cov[Γ r] = Cov[Γ 1:N] = ( ∑_{i=1}^N (i − (N + 1)/2)² / (N − 1) ) HN = ( N(N + 1)/12 ) HN. (20.97)

Then with the a from (20.72), as in (20.79) and (20.80),

E[W(Γz)] = a′ ( ((N + 1)/2) 1N ) = 0 and
Var[W(Γz)] = a′ ( (N(N + 1)/12) HN ) a = (N(N + 1)/12) (1/n + 1/m). (20.98)


The normalized statistic is then

( r̄X − r̄Y ) / √( (N(N + 1)/12)(1/n + 1/m) ). (20.99)

Mann-Whitney two-sample statistic

It is not immediately obvious, but the Wilcoxon W and Mann-Whitney U statistics in (20.65) are equivalent. Specifically,

W(z) = (1/2)(1/n + 1/m) U(z) = ( N/(2nm) ) U(z). (20.100)

Thus they lead to the same test, and this test is often called the Mann-Whitney/Wilcoxon two-sample test.
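A sketch (assumed, not from the text) of the normalized Wilcoxon statistic (20.99); the samples x and y are hypothetical.

import numpy as np
from scipy.stats import rankdata, norm

x = np.array([5.2, 6.1, 4.8, 7.0, 5.9, 6.4])
y = np.array([4.1, 5.0, 3.9, 4.6, 5.3])
n, m = len(x), len(y)
N = n + m

r = rankdata(np.concatenate([x, y]))        # combined ranks, as in (20.66)
w = r[:n].mean() - r[n:].mean()             # Wilcoxon W = mean rank difference
z_stat = w / np.sqrt(N * (N + 1) / 12 * (1/n + 1/m))
print(z_stat, z_stat > norm.ppf(0.95))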

20.4 Linear regression →Monotone relationships

Consider the linear regression model,

yi = α + βxi + ei. (20.101)

Typically one wishes to test whether β = 0, which is (under the usual assumptions)equivalent to testing whether the Xi’s are independent of the Yi’s. If the ei’s are iid

N(0, σ2), then t-tests and F tests are available. A nonparametric one-sided analogof this situation assumes that (X1, Y1), . . . , (Xn, Yn) are iid pairs of variables withcontinuous distributions, and the hypotheses are

H0 : Xi is independent of Yi, i = 1, . . . , N versus HA : X′i s and Y′i s are positively related.(20.102)

The alternative is not very specific, but the idea is larger xi’s tend to go with largeryi’s. The linear model in (20.101) is a special case if β > 0. The alternative in-cludes nonlinear, but increasing, relationships, and makes no assumption about theei’s (other than they are iid and continuous).

20.4.1 Spearman’s ρ

A natural test statistic in the linear regression model, especially with normal errors, is the sample correlation coefficient,

R(x, y) = ∑_{i=1}^n (xi − x̄)(yi − ȳ) / √( ∑_{i=1}^n (xi − x̄)² ∑_{i=1}^n (yi − ȳ)² ) = x′Hn y / ( ‖Hn x‖ ‖Hn y‖ ). (20.103)

(Recall that Hn x subtracts x̄ from all the elements of x.) A robustified version uses the rank transform idea, replacing the xi's with their ranks, and the yi's with theirs. In this case, the xi's are ranked within themselves, as are the yi's. (Unlike in the two-sample case, where they were combined.) Then we have two vectors of ranks,

rx = ( rank(x1 among x1, . . . , xn), . . . , rank(xn among x1, . . . , xn) )′,
ry = ( rank(y1 among y1, . . . , yn), . . . , rank(yn among y1, . . . , yn) )′. (20.104)


Spearman’s ρ is then defined to be the correlation coefficient of the ranks:

S(x, y) =rx′Hnry

‖Hnrx‖‖Hnry‖ . (20.105)

Under the null hypothesis, the xi’s are independent of the yi’s, hence permutingthe yi’s among themselves does not change the distribution. Neither does permutingthe xi’s among themselves. That is, if Γ1 and Γ2 are any n× n permutation matrices,and the null hypothesis is true, then

(X, Y) =D (X, Γ2Y) =D (Γ1X, Y) =D (Γ1X, Γ2Y). (20.106)

Saying it another way, we can match up the xi’s with the yi’s in any arrangement.We will just permute the yi’s, so that the randomization distribution of any statisticT(x, y) is the distribution of T(x, Γy). For Spearman’s ρ, we have

S(x, Γy) =rx′HnΓry

‖Hnrx‖‖HnΓry‖ , (20.107)

because permuting the yi’s has the effect of similarly permuting their ranks. Now,assuming no ties, both rx and ry are permutations of the integers 1, . . . , n, and is Γy,

hence

‖Hnrx‖2 = ‖HnΓry‖2 = ‖Hn 1:n‖ = ∑(i− (n + 1)/2)2 =n(n2− 1)

12, (20.108)

the last equation being from (20.97). Thus the denominator of Spearman’s ρ does not

depend on the Γ. Also, Γy =D Γ 1:n, hence

S(x, Γy) =D(1:n)′HnΓ 1:n

n(n2− 1)/12. (20.109)

Once again, the randomization distribution does not depend on the data, hence the randomization and regular distributions coincide.

To normalize the statistic, we first find the expected value and covariance of Γ 1:n, which we did in (20.96) and (20.97):

E[Γ 1:n] = ((n + 1)/2) 1n and Cov[Γ 1:n] = (n(n + 1)/12) Hn. (20.110)

Then

E[Hn Γ 1:n] = ((n + 1)/2) Hn 1n = 0n and
Cov[Hn Γ 1:n] = (n(n + 1)/12) Hn Hn Hn′ = (n(n + 1)/12) Hn. (20.111)

Finally, the numerator of S has

E[(1:n)′ Hn Γ 1:n] = 0 (20.112)

and

Var[(1:n)′ Hn Γ 1:n] = (n(n + 1)/12) (1:n)′ Hn 1:n = (n(n + 1)/12) · (n(n² − 1)/12), (20.113)


hence

E[S(x, Γy)] = 0 and Var[S(x, Γy)] = ( n(n + 1)/12 ) / ( n(n² − 1)/12 ) = 1/(n − 1). (20.114)

Thus the approximate level α test rejects the null hypothesis when

√(n − 1) S(x, y) > zα. (20.115)
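A sketch (assumed, not from the text) of the test (20.115); the paired data below are hypothetical.

import numpy as np
from scipy.stats import rankdata, norm

x = np.array([1.0, 2.3, 0.7, 3.1, 2.8, 4.0, 3.5, 5.2])
y = np.array([0.8, 1.9, 1.1, 2.7, 2.2, 3.9, 3.1, 4.8])
n = len(x)

rx, ry = rankdata(x), rankdata(y)
s = np.corrcoef(rx, ry)[0, 1]               # Spearman's rho, (20.105)
print(np.sqrt(n - 1) * s > norm.ppf(0.95))  # approximate level 0.05 test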

20.4.2 Kendall’s τ

An alternative to Spearman’s ρ is Kendall’s τ, which is similar to a correlation coeffi-cient in that it goes from −1 to +1, where close to +1 means there is a strong positiverelationship between the xi’s and yi’s, and −1 means a strong negative relationship.If they are independent, Kendall’s τ should be about 0. This statistic is based uponthe signs of the slopes comparing pairs of points (xi, yi) and (xj, yj). Specifically, itlooks at the quantities

Sign(yi − yj)Sign(xi − xj). (20.116)

This value is +1 if the line segment connecting (xi, yi) and (xj , yj) is positive, and −1is the slope is negative. (Why?) We look at that quantity for each pair of points, andadd them up:

U(x, y) = ∑ ∑1≤i<j≤nSign(yi − yj)Sign(xi − xj), (20.117)

which is the number of positive slopes minus the number of negative slopes. Thusif the xi’s and yi’s are independent, about half should be positive and half negative,yielding a U around 0. If all slopes are positive, then the xi’s and yi’s are perfectlypositively related, hence U(x, y) = (n

2) (there being (n2) pairs). Similarly, U(x, y) =

−(n2) is the xi’s and yi’s are perfectly negatively related. So Kendall’s τ is normalized

to be between ±1:

K(x, y) =∑ ∑1≤i<j≤nSign(yi − yj)Sign(xi − xj)

n(n− 1)/2. (20.118)

Under the null hypothesis of independence, if the distributions of the variables are continuous, it can be shown that

E[U(X, Y)] = 0 and Var[U(X, Y)] = (n choose 2) + (2/3)(n choose 3), (20.119)

hence

E[K(X, Y)] = 0 and Var[K(X, Y)] = (2/9) (2n + 5)/(n(n − 1)). (20.120)

20.5 Fisher’s exact test for independence

Now we turn to testing independence in a 2× 2 contingency table. That is, we have

(X11, X12, X21, X22) ∼ Multinomial(n,(p11, p12, p21, p22)), (20.121)


where we think of the Xij’s arranged in a table:

            Column 1   Column 2   Sum
Row 1       X11        X12        X1·
Row 2       X21        X22        X2·
Sum         X·1        X·2        n        (20.122)

The null hypothesis is that the rows are independent of the columns, that is, if we let

r1 = p11 + p12,
r2 = p21 + p22,
c1 = p11 + p21,
c2 = p12 + p22, (20.123)

so that r1 + r2 = 1 and c1 + c2 = 1, then

H0 : pij = ricj for all i, j versus HA : pij ≠ ricj for some i, j. (20.124)

This null hypothesis is not simple, so the distribution of any statistic will likely depend on unknown parameters, even under the null hypothesis. Fisher's idea was to look at the conditional distribution given the marginals, which are the Xi·'s and X·j's. If you know the marginals, then by knowing just X11 one can figure out all the other elements. Thus what we want is the conditional distribution of X11 given the marginals:

X11 | X1· = b, X·1 = m. (20.125)

(We left out the other two marginals, since we know X2· = n − X1· and X·2 = n − X·1.) First, we need the conditional space of X11. It is easiest to start by writing out the table with these conditionings:

            Column 1    Column 2            Sum
Row 1       x11         b − x11             b
Row 2       m − x11     n − m − b + x11     n − b
Sum         m           n − m               n        (20.126)

Then each cell must be nonnegative (of course), so the constraints on x11 are

x11 ≥ 0, b − x11 ≥ 0, m − x11 ≥ 0, n − b − m + x11 ≥ 0, (20.127)

which translates to

max{0, m + b − n} ≤ x11 ≤ min{b, m}. (20.128)

The joint pmf of the Xij’s under the null hypothesis is

f (x; r1, r2, c1, c2) =

(n

x

)(r1c1)

x11(r1c2)x12 (r2c1)

x21(r2c2)x22

=

(n

x

)rx1·

1 rx2·2 cx·1

1 cx·22 . (20.129)


Notice that the marginals are sufficient, which means the conditional distribution of X11 given the marginals does not depend on the parameters. So we can ignore that part of the pmf, and look at just the factorials part. Then

P[X11 = x11 | X1· = b, X·1 = m] = (n choose x11, b−x11, m−x11, n−b−m+x11) / ∑_w (n choose w, b−w, m−w, n−b−m+w), (20.130)

where the sum in the denominator is over w in the conditional space max{0, m + b − n} ≤ w ≤ min{b, m}. We rewrite the numerator:

(n choose x11, b−x11, m−x11, n−b−m+x11) (20.131)
= n! / ( x11! (b − x11)! (m − x11)! (n − b − m + x11)! )
= [ n! / (b!(n − b)!) ] [ b! / (x11!(b − x11)!) ] [ (n − b)! / ((m − x11)!(n − b − m + x11)!) ]
= (n choose b)(b choose x11)((n − b) choose (m − x11)). (20.132)

A combinatorial identity is

∑_w (b choose w)((n − b) choose (m − w)) = (n choose m). (20.133)

Thus

P[X11 = x11 | X1· = b, X·1 = m] = (b choose x11)((n − b) choose (m − x11)) / (n choose m), (20.134)

which is the Hypergeometric pmf with parameters n, b, m.

Finally, Fisher's exact test uses X11 as the test statistic, and the Hypergeometric distribution to find the cutoff point or p-value. For example, for the alternative that the rows and columns are positively associated, which means X11 (and X22) should be relatively large, the (conditional) p-value is

P[Hypergeometric(n, b, m) ≥ x11]. (20.135)

Example. Consider the example in which 107 STAT 100 students were categorized according to homework and exam scores:

HW ↓ ; Exams →   lo       hi           Sum
lo               38       16           b = 54
hi               16       37           n − b = 53
Sum              m = 54   n − m = 53   n = 107        (20.136)

(It is just a coincidence that b = m.) The range of X11 conditional on the marginals is 1, . . . , 54, and the Hypergeometric pmf is

f (w; 107, 54, 54) = (54 choose w)(53 choose 54 − w) / (107 choose 54). (20.137)


The one-sided p-value (for testing positive association) is then

P[Hypergeometric(107, 54, 54) ≥ 38] = ∑_{w=38}^{54} f (w; 107, 54, 54) = 0.00003. (20.138)

That is a tiny p-value, so we reject the null hypothesis: people who do well on the homework tend to do well on the exams.
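A sketch (assumed, not from the text) reproducing (20.138) with the hypergeometric distribution in (20.134).

from scipy.stats import hypergeom

n, b, m, x11 = 107, 54, 54, 38
# hypergeom(M=n, n=b, N=m) has pmf C(b, w) C(n - b, m - w) / C(n, m)
p_value = hypergeom.sf(x11 - 1, n, b, m)    # P[X11 >= 38]
print(p_value)                               # about 3e-05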
