

MSB (ST217: Mathematical Statistics B)

Aim

To review, expand & apply the ideas from MSA. In particular, MSA mainly studied one unknown quantity at a time. In MSB we'll study interrelationships.

Lectures & Classes

Monday 12–1, R0.21
Wednesday 10–11, R0.21
Thursday 1–2, PLT

Examples classes will begin in week 3.

Style

• Lectures will be supplemented (NOT replaced!!) with printed notes. Please take care of these notes—duplicates may not be readily available.

• I shall teach mainly by posing problems (both theoretical and applied) and working through them.

Contents

1. Overview of MSA.

2. Bivariate & Multivariate Probability Distributions. Joint distributions, conditional distributions, marginal distributions; conditional expectation. The χ², t, F and multivariate Normal distributions and their interrelationships.

3. Inference for Multiparameter Models. Likelihood, frequentist and Bayesian inference, prediction and decision-making. Comparison between various approaches. Point and interval estimation. Classical simple and composite hypothesis testing, likelihood ratio tests, asymptotic results.

4. Linear Statistical Models. Linear regression, multiple regression & analysis of variance models. Model choice, model checking and residuals.

5. Further Topics (time permitting). Nonlinear models, problems & paradoxes, etc.

Books

The books recommended for MSA are also useful for MSB. Excellent books on mathematical statistics are:

1. 'Statistical Inference' by George Casella & Roger L. Berger [C&B], Duxbury Press (1990),
2. 'Probability and Statistics' by Morris DeGroot, Addison-Wesley (2nd edition 1989).

A good book discussing the application and interpretation of statistical methods is 'Introduction to the Practice of Statistics' by Moore & McCabe [M&M], Freeman (3rd edition 1998). Many of the data sets considered below come from the 'Handbook of Small Data Sets' [HSDS] by Hand et al., Chapman & Hall, London (1994).

There are many other useful references on mathematical statistics available in the library, including books by Hogg & Craig [H&C], Lindgren, Mood, Graybill & Boes [MG&B], and Rice.

These notes are copyright © 1998, 1999, 2000, 2001 by J. E. H. Shaw.


Chapter 1

Overview of MSA

1.1 Basic Ideas

1.1.1 What is ‘Statistics’?

Statistics may be defined as:

'The study of how information should be employed to reflect on, and give guidance for action in, a practical situation involving uncertainty.' [italics by JEHS]

Vic Barnett, Comparative Statistical Inference

Figure 1.1: A practical situation involving uncertainty


1.1.2 Statistical Modelling

The emphasis of modern statistics is on modelling the patterns and interrelationships in the existing data, and then applying the chosen model(s) to predict future data.

Typically there is a measurable response (for example, reduction Y in patient's blood pressure) that is thought to be related to explanatory variables xj (for example, treatment applied, dose, patient's age, weight, etc.). We seek a formula that relates the observed responses to the corresponding explanatory variables, and that can be used to predict future responses in terms of their corresponding explanatory variables:

    Observed Response = Fitted Value + Residual,
    Future Response = Predicted Value + Error.

Here the fitted values should take account of all the consistent patterns in the data, and the residuals represent the remaining random variation.

1.1.3 Prediction and Decision-Making

Always remember that the main aim in modelling as above is to predict (for example) the effects of different medical treatments, and hence to decide which treatment to use, and in what circumstances.

The fundamental assumption is that the future data will be in some sense similar to existing data. The ideas of exchangeability and conditional independence are crucial.

The following notation is useful:

X ⊥⊥ Y      'X is independent of Y', i.e. Y gives you no information about X,

X ⊥⊥ Y | Z  'X is conditionally independent of Y given Z', i.e. if you know the value taken by the RV Z, then Y gives you no further information about X.

Most methods of statistical inference proceed indirectly from what we know (the observed data and any other relevant information) to what we really want to know (future, as yet unobserved, data), by assuming that the random variation in the observed data can be thought of as a sample from an underlying population, and learning about the properties of this population.

1.1.4 Known and Unknown Features of a Statistical Problem

A statistic is a property of a sample, whereas a parameter is a property of a population. Often it's natural to estimate a parameter θ (such as the population mean µ) by the corresponding property of the sample (here the sample mean X̄). Note that θ may be a vector or more complicated object.

Unobserved quantities are treated mathematically as random variables. Potentially observable quantities are usually denoted by capital letters (Xi, X̄, Y etc.). Once the data have been observed, the values taken by these random variables are known (Xi = xi, X̄ = x̄ etc.). Unobservable or hypothetical quantities are usually denoted by Greek letters (θ, µ, σ² etc.), and estimators are often denoted by putting a hat on the corresponding symbol (θ̂, µ̂, σ̂² etc.).

Nearly all statistics books use the above style of notation, so it will be adopted in these notes. However, sometimes I shall wish to distinguish carefully between knowns and unknowns, and shall denote all unknowns by capitals. Thus Θ represents an unknown parameter vector, and θ represents a particular assumed value of Θ. This is especially useful when considering probability distributions for parameters; one can then write fΘ(θ) and Pr(Θ = θ) by exact analogy with fX(x) and Pr(X = x).

The set of possible values for a RV X is called its sample space ΩX. Similarly the parameter space ΩΘ is the set of possible values for the parameter Θ.


1.1.5 Likelihood

In general, we can infer properties θ of the population by comparing how compatible the various possible values of θ are with the observed data. This motivates the idea of likelihood (equivalently, log-likelihood or support). We need a probability model for the data, in which the probability distribution of the random variation is a member of a (realistic but mathematically tractable) family of probability distributions, indexed by a parameter θ.

Likelihood-based approaches have both advantages and disadvantages:

Advantages:
1. Unified theory (many practical problems can be tackled in essentially the same way).
2. We often get simple sufficient statistics (hence we can summarise a huge data set by a few simple properties).
3. The CLT suggests likelihood methods work well when there's loads of data.

Corresponding disadvantages:
1. Is the theory directly relevant? (Is likelihood alone enough? And how do we balance realism and tractability?)
2. If the probability model is wrong, then results can be misleading (e.g. if one assumes a Normal distribution when the true distribution is Cauchy).
3. One seldom has loads of data!

1.1.6 Where Will We Go from Here?

• MSA provided the mathematical toolbox (e.g. probability theory and the idea of random variables) for studying random variation.

• MSB will add to this toolbox and study interrelationships between (random) variables.

• We shall also consider some important general forms for the fitted/predicted values, in particular linear models and their generalizations.

1.2 Sampling Distributions

Statistical analysis involves calculating various statistics from the data, for example the maximum likelihood estimator (MLE) θ̂ for θ. We want to understand the properties of these statistics; hence the importance of the central limit theorem (CLT) & its generalizations, and of studying the probability distributions of transformed random variables.

If we have a formula for a summary statistic S, e.g. S = ∑ Xi/n = X̄, and are prepared to make certain assumptions about the original random variables Xi, then we can say things about the probability distribution of S.

The probability distribution of a statistic S, i.e. the pattern of values S would take if it were calculated in successive samples similar to the one we actually have, is called its sampling distribution.
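As a quick numerical illustration (not part of the original notes), the sampling distribution of a statistic such as X̄ can be approximated by drawing many samples from an assumed population; the Exponential population and sample size below are invented purely for the example.

```python
import numpy as np

# Illustrative sketch: approximate the sampling distribution of S = X-bar
# by drawing many samples of size n from an assumed population
# (Exponential with mean 2, chosen arbitrarily).
rng = np.random.default_rng(0)
n, n_samples = 25, 10_000
samples = rng.exponential(scale=2.0, size=(n_samples, n))
xbar = samples.mean(axis=1)              # one X-bar per simulated sample

print("mean of X-bar:", xbar.mean())     # close to mu = 2
print("var of X-bar :", xbar.var())      # close to sigma^2 / n = 4/25
```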


1.2.1 Typical Assumptions

1. Standard Assumption (IID RVs):

Xi are IID (independent and identically distributed) with (unknown) mean µ and variance σ².

This implies

(a) E[X̄] = E[Xi] = µ, and

(b) Var[X̄] = (1/n) Var[Xi] = σ²/n.

(c) If we define the standardised random variables

    Zn = (X̄ − µ) / √(σ²/n),

then as n → ∞, the distribution of Zn tends to the standard Normal N(0, 1) distribution.

2. Additional Assumption (Normality):

The Xi are IID Normal: Xi ∼ N(µ, σ²).

This implies that X̄ ∼ N(µ, σ²/n).

1.2.2 Further Uses of Sampling Distributions

We can also

• compare various plausible estimators (e.g. to estimate the centre of symmetry of a supposedly symmetric distribution we might use the sample mean, median, or something more exotic),

• obtain interval estimates for unknown quantities (e.g. 95% confidence intervals, HPD intervals, support intervals),

• test hypotheses about unknown quantities.

Comments

1. Note the importance of expectations of (possibly transformed) random variables:

    E[X] = µ                (measure of location)
    E[(X − µ)²] = σ²        (measure of scale)
    E[exp(sX)]  = moment generating function
    E[exp(itX)] = characteristic function

2. We must always consider whether the assumptions made are reasonable, both from general considerations (e.g.: is independence reasonable? Is the assumption of identical distributions reasonable? Is it reasonable to assume that the data follow a Poisson distribution? etc.) and with reference to the observed set of data (e.g. are there any 'outliers'—unreasonably extreme values—or unexpected patterns?)

3. Likelihood and other methods suggest estimators for unknown quantities of interest (parameters etc.)under certain specified assumptions.

Even if these assumptions are invalid (and in practice they always will be to some extent!) we may still want to use summary statistics as estimators of properties of the underlying population. Therefore

(a) We'll want to investigate the properties of estimators under various relaxed assumptions, for example partially specified models that use only the first and second moments of the unknown quantities.

(b) It's useful if the calculated statistics (e.g. MLEs) have an intuitive interpretation (like 'sample mean' or 'sample variance').


1.3 (Revision?) Problems

1. First-year students attending a statistics course were asked to carry out the following procedure:

Toss two coins, without showing anyone else the results.

If the first coin showed 'Heads' then answer the following question: "Did the second coin show 'Heads'? (Yes or No)"

If the first coin showed 'Tails' then answer the following question: "Have you ever watched a complete episode of 'Teletubbies'? (Yes or No)"

The following results were recorded:

              Yes   No
    Males      84   48
    Females    23   24

For each sex, and for both sexes combined, estimate the proportion who have watched a complete episode of 'Teletubbies'.

Using a chi-squared test, or otherwise, test whether the proportions differ between the sexes.

Discuss the assumptions you have made in carrying out your analysis.

2. Let X and Y be IID RVs with a standard Normal N(0, 1) distribution, and define Z = X/Y .

(a) Write down the lower quartile, median and upper quartile of Z, i.e. the points z25, z50 & z75 such that Pr(Z < zk) = k/100.

(b) Show that Z has a Cauchy distribution, with PDF 1/(π(z² + 1)).

HINT : consider the transformation Z = X/Y and W = |Y |.

3. Let X1, . . . , Xn be mutually independent RVs, with respective MGFs (moment generating functions) MX1(t), . . . , MXn(t), and let a1, . . . , an and b1, . . . , bn be fixed constants.

Show that the MGF of Z = (a1X1 + b1) + (a2X2 + b2) + · · · + (anXn + bn) is

    MZ(t) = exp(t ∑ bi) MX1(a1t) × · · · × MXn(ant).

Hence or otherwise show that any linear combination of independent Normal RVs is itself Normally distributed.

4. A workman has to move a rectangular stone block a short distance, but doesn't want to strain himself. He rapidly estimates:

• height of block = 10 cm, with standard deviation 1 cm.

• width of block = 20 cm, with standard deviation 3 cm.

• length of block = 25 cm, with standard deviation 4 cm.

• density of block = 4.0 g/cc, with standard deviation 0.5 g/cc.

Assuming these estimates are mutually independent, calculate his estimates of the volume V (cc) and total weight W (kg) of the block, and their standard deviations.

The workman fears that he might hurt his back if W ≥ 30. Using Chebyshev's inequality, give an upper bound for his probability Pr(W ≥ 30).
[Chebyshev's inequality states that if X has mean µ & variance σ², then Pr(|X − µ| ≥ c) ≤ σ²/c²; see MSA.]

What is the workman's value for Pr(W > 30) under the additional assumption that W is Normally distributed? Compare this value with the bound found earlier.

How reasonable are the independence and Normality assumptions used in the above analysis?


5. Calculate the MLE of the centre of symmetry θ, given IID RVs X1, X2, . . . , Xn, where the common PDF fX(x) of the Xi is

(a) Normal (or Gaussian):

    fX(x|θ, σ) = (1/(√(2π) σ)) exp(−½ ((x − θ)/σ)²)

(b) Laplacian (or Double Exponential):

    fX(x|θ, σ) = (1/(2σ)) exp(−|x − θ|/σ)

(c) Uniform (or Rectangular):

    fX(x|θ) = 1 if θ − ½ < x < θ + ½, and 0 otherwise.

Do you consider these MLEs to be intuitively reasonable?

6. Calculate E[X], E[X²], E[X³] and E[X⁴] under each of the following assumptions:

(a) X ∼ Poi(λ), i.e. X has PMF (probability mass function)

    Pr(X = x|λ) = λ^x exp(−λ)/x!,   x = 0, 1, 2, . . .

(b) X ∼ Exp(β), i.e. X has PDF (probability density function)

    fX(x|β) = βe^{−βx} if x > 0, and 0 otherwise.

(c) X ∼ N(µ, σ²), i.e. X has PDF

    fX(x|µ, σ) = (1/(√(2π) σ)) exp(−½ ((x − µ)/σ)²)

7. Describe briefly how, and under what circumstances, you might approximate

(a) a binomial distribution by a Normal distribution,

(b) a binomial distribution by a Poisson distribution,

(c) a Poisson distribution by a Normal distribution.

Suppose X ∼ Bin(100, 0.1), Y ∼ Poi(10), and Z ∼ N(10, 3²). Calculate, or look up in tables,

    (i) Pr(X ≥ 6),    (ii) Pr(Y ≥ 6),    (iii) Pr(Z > 5.5),
    (iv) Pr(X > 16),  (v) Pr(Y > 16),    (vi) Pr(Z > 16.5),

and comment on the accuracy of the approximations here.

8. The t distribution with n degrees of freedom, denoted tn or t(n), has the PDF

    f(t) = [Γ(½(n + 1)) / (Γ(½n) √(nπ))] (1 + t²/n)^{−(n+1)/2},   −∞ < t < ∞,

and the F distribution with m and n degrees of freedom, denoted Fm,n or F(m, n), has PDF

    f(x) = [Γ(½(m + n)) / (Γ(½m) Γ(½n))] m^{m/2} n^{n/2} x^{(m/2)−1} / (mx + n)^{(m+n)/2},   0 < x < ∞,

with f(x) = 0 for x ≤ 0.

Show that if T ∼ tn and X ∼ Fm,n, then T² and X⁻¹ both have F distributions.


9. Table 1.1 shows the estimated total resident population (thousands) of England and Wales at 30 June 1993:

    Age       Persons      Males    Females
    < 1         669.6      343.1      326.5
    1–14      9,268.0    4,756.9    4,511.1
    15–44    21,875.0   11,115.6   10,759.4
    45–64    11,435.8    5,676.6    5,759.2
    65–74     4,595.9    2,081.7    2,514.2
    ≥ 75      3,594.9    1,224.5    2,370.4
    Total    51,439.2   25,198.4   26,240.8

Table 1.1: Estimated resident population of England & Wales, mid 1993, by sex and age-group (simplified from Table 1 of the 1993 mortality tables)

Table 1.2, also extracted from the published 1993 Mortality Statistics, shows the number of deaths in 1993 among the resident population of England and Wales, categorised by sex, age-group and underlying cause of death.

Assume that the rates observed in Tables 1.1 and 1.2 hold exactly, and suppose that an individual I is chosen at random from the population. Define the random variables S (sex), A (age group), D (death) and C (cause) as follows:

S = 0 if I is male, 1 if I is female,

A = 1 if I is under 1 year old, 2 if I is aged 1–14, 3 if I is aged 15–44,
    4 if I is aged 45–64, 5 if I is aged 65–74, 6 if I is 75 years old or over,

D = 0 if I survives the year, 1 if I dies,

C = cause of death (0–17).

For example,

Pr(S=0) = 25198.4/51439.2,

Pr(S=0 & A=6) = 1224.5/51439.2,

Pr(D=0 | S=0 & A=6) = 1 − 138.239/1224.5,

Pr(C=8 | S=0 & A=6) = 28.645/1224.5, etc.

(a) Calculate Pr(D=1 | S=0), and Pr(D=1 | S=0 & A=a) for a = 1, 2, 3, 4, 5, 6.
    Also calculate Pr(S=0 | D=1), and Pr(S=0 | D=1 & A=a) for a = 1, 2, 3, 4, 5, 6.
    If you were an actuary, and were asked by a non-expert "is the death rate for males higher or lower than that for females?", how would you respond based on the above calculations? Justify your answer.

(b) Similarly, explain how you would respond to the questions
    i. "is the death rate from neoplasms higher for males or for females?"
    ii. "is the death rate from mental disorders higher for males or for females?"
    iii. "is the death rate from diseases of the circulatory system higher for males or for females?"
    iv. "is the death rate from diseases of the respiratory system higher for males or for females?"


Cause of death                                      Sex  All ages      <1    1–14   15–44   45–64   65–74     ≥75
 0 Deaths below 28 days (no cause specified)         M      1,603   1,603       −       −       −       −       −
                                                     F      1,192   1,192       −       −       −       −       −
 1 Infectious & parasitic diseases                   M      1,954      60      79     565     390     346     514
                                                     F      1,452      46      44     169     193     283     717
 2 Neoplasms                                         M     74,480      16     195   2,000  16,372  25,644  30,253
                                                     F     67,966       8     138   2,551  15,026  19,141  31,102
 3 Endocrine, nutritional & metabolic diseases       M      3,515      28      43     208     639     959   1,638
   and immunity disorders                            F      4,403      17      37     153     474     901   2,821
 4 Diseases of blood and blood-forming organs        M        897       5      12      62     106     204     508
                                                     F      1,084       3      14      28      73     163     803
 5 Mental disorders                                  M      2,530       −       8     281     169     334   1,738
                                                     F      5,189       −       1      83      99     297   4,709
 6 Diseases of the nervous system and sense organs   M      4,403      59     136     530     675     890   2,113
                                                     F      4,717      42     118     313     546     809   2,889
 7 Diseases of the circulatory system                M    123,717      41      66   1,997  20,682  37,195  63,736
                                                     F    134,439      44      45     834   7,783  23,185 102,548
 8 Diseases of the respiratory system                M     41,802      86      79     608   3,157   9,227  28,645
                                                     F     49,068      59      74     322   2,145   6,602  39,866
 9 Diseases of the digestive system                  M      7,848      10      27     511   1,706   2,058   3,536
                                                     F     10,574      20      14     298   1,193   1,921   7,128
10 Diseases of the genitourinary system              M      3,008       4       6      57     215     676   2,050
                                                     F      3,710       4       7      55     219     535   2,890
11 Complications of pregnancy, childbirth            M          −       −       −       −       −       −       −
   and the puerperium                                F         27       −       −      27       −       −       −
12 Diseases of the skin and subcutaneous tissue      M        269       1       1       7      22      62     176
                                                     F        748       −       −      15      30      80     623
13 Diseases of the musculoskeletal system            M        785       1       5      28     106     151     494
   and connective tissue                             F      2,639       −       5      43     173     385   2,033
14 Congenital anomalies                              M        660     131     114     158     118      58      81
                                                     F        675     136     116     133     101      87     102
15 Certain conditions originating in the             M        186      93       8      13      18      16      38
   perinatal period                                  F        114      60       5       3       4      10      32
16 Signs, symptoms and ill-defined conditions        M      1,642     238      17     126     111      72   1,078
                                                     F      5,146     171      17      50      53      75   4,780
17 External causes of injury and poisoning           M      9,859      34     311   4,749   2,183     941   1,641
                                                     F      5,869      30     162   1,240     882     731   2,824
Total                                                M    279,158   2,410   1,107  11,900  46,669  78,833 138,239
                                                     F    299,012   1,832     797   6,317  28,994  55,205 205,867

Table 1.2: Deaths in England & Wales, 1993, by underlying cause, sex and age-group (extracted from Table 2 of the 1993 mortality tables)


(c) Now treat the data in Tables 1.1 & 1.2 as subject to statistical fluctuations. One can still estimate

    psac = Pr(S=s & A=a & C=c),   p·ac = Pr(A=a & C=c),   ps·· = Pr(S=s),   etc.

from the data, for example p0,·,14 = 660/25198400 = 2.62 × 10⁻⁵. Similarly estimate p1,·,14 and p·,a,14 for a = 1, . . . , 6. Using a chi-squared test or otherwise, investigate whether the relative risk of death from a congenital anomaly between males and females is the same at all ages, i.e. whether it is reasonable to assume that

    ps,a,14 = ps,·,14 × p·,a,14.

10. Data were collected on litter size and sex ratios for a large number of litters of piglets. The following table gives the data for all litters of size between four and twelve:

    Number               Litter size
    of males    4    5    6    7    8    9   10   11   12
       0        1    2    3    0    1    0    0    0    0
       1       14   20   16   21    8    2    7    1    0
       2       23   41   53   63   37   23    8    3    1
       3       14   35   78  117   81   72   19   15    8
       4        1   14   53  104  162  101   79   15    4
       5             4   18   46   77   83   82   33    9
       6                  0   21   30   46   48   13   18
       7                       2    5   12   24   12   11
       8                            1    7   10    8   15
       9                                 0    0    1    4
      10                                      0    1    0
      11                                           0    0
      12                                                0
    Total      53  116  221  374  402  346  277  102   70

(a) Discuss briefly what sort of probability distributions it might be reasonable to assume for the total size N of a litter, and for the number M of males in a litter of size N = n.

(b) Suppose now that the litter size N follows a Poisson distribution with mean λ. Write down an expression for Pr(N = n | 4 ≤ N ≤ 12). Hence or otherwise give an expression for the log-likelihood ℓ(λ; . . .) given the above table of data.

(c) Evaluate ℓ(λ; . . .) at λ = 7.5, 8 and 8.5. By fitting a quadratic to these values, provide point and interval estimates of λ.

(d) Using a chi-squared test or otherwise, check how well your model fits the data.

(e) Comment on the following argument: 'Provided λ isn't too small, we could approximate the Poisson distribution Poi(λ) by the Normal distribution N(λ, λ). This is symmetric, so we may simply estimate the mean λ by the mode of the data (8 in our case). The standard deviation is therefore nearly 3, and so we would expect the counts at litter size 8 ± 3 to be nearly 60% of the count at 8 (note that for a standard Normal, φ(1)/φ(0) = exp(−0.5) ≈ 0.6). Since there are far fewer litters of size 5 & 11 than this, the Poisson distribution must be a poor fit.'

Data from HSDS, set 176

Education is what survives when what has been learnt has been forgotten.
Burrhus Frederic Skinner


Chapter 2

Bivariate & Multivariate Distributions

MSA largely concerned IID (independent & identically distributed) random variables.

However in practice we are usually most interested in several random variables simultaneously, and their interrelationships. Therefore we need to consider the probability distributions of random vectors, i.e. the joint distribution of the individual random variables.

Bivariate Examples

A. (X1, X2), the number of male & female pigs in a litter.

B. (X, Y ), the systolic and diastolic blood pressure of an individual.

C. (X, Y ), the age and height of an individual.

D. (X, Y ), the height and weight of an individual.

E. (µ̂, σ̂²), the estimated common mean and variance of n IID random variables X1, . . . , Xn.

F. (Θ, X) where Θ ∼ U(0, 1) and X|Θ ∼ Bin(n, Θ), i.e.

    fΘ(θ) = 1 if 0 < θ < 1, and 0 otherwise,

    fX(x | Θ = θ) = (n choose x) θ^x (1 − θ)^{n−x},   x = 0, 1, . . . , n.

Definition 2.1 (Bivariate CDF)
The joint cumulative distribution function of 2 RVs X & Y is the function

    FX,Y(x, y) = Pr(X ≤ x & Y ≤ y),   (x, y) ∈ R².   (2.1)

Comments

1. The joint cumulative distribution function (or joint CDF) may also be called the 'joint distribution function' or 'joint DF'.

2. If there’s no ambiguity, then we may simply write F (x, y) for FX,Y (x, y).


2.1 Discrete Bivariate Distributions

If RVs X & Y are discrete, then they have a discrete joint distribution and a probability mass function (PMF) that, similarly to the univariate case, is usually written fX,Y(x, y) or more simply f(x, y):

Definition 2.2 (Bivariate PMF)
The joint probability mass function of discrete RVs X and Y is

    f(x, y) = Pr(X = x & Y = y).

Exercise 2.1
Suppose that the numbers X1 and X2 of male and female piglets follow independent Poisson distributions with means λ1 & λ2 respectively. Find the joint PMF.

Exercise 2.2
Now assume the model N ∼ Poi(λ), (X1 | N) ∼ Bin(N, θ), i.e. the total number N of piglets follows a Poisson distribution, and, conditional on N = n, X1 has a Bin(n, θ) distribution (in particular θ = 0.5 if the sexes are equally likely). Again find the joint PMF.

Exercise 2.3
Verify that the two models given in Exercises 2.1 & 2.2 give identical fitted values, and are therefore in practice indistinguishable.

2.1.1 Manipulation

A discrete RV has a countable sample space, which without loss of generality can be represented as N = {0, 1, 2, . . .}. Values of a discrete joint distribution f(x, y) can therefore be tabulated:

                  Y
             0     1     2    . . .
        0   f00   f01   f02   . . .
   X    1   f10   f11   f12   . . .
        :    :     :     :

and the probability of any event E obtained by simple summation:

    Pr((X, Y) ∈ E) = ∑_{(xi, yi) ∈ E} f(xi, yi).
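For instance (an illustrative sketch, not taken from the notes), a tabulated joint PMF can be held as a matrix and Pr((X, Y) ∈ E) computed by summing the relevant cells; the probabilities below are invented.

```python
import numpy as np

# Rows index x = 0..2, columns index y = 0..2; entries are f(x, y).
# The numbers are made up purely for illustration and sum to 1.
f = np.array([[0.10, 0.05, 0.05],
              [0.20, 0.15, 0.05],
              [0.10, 0.20, 0.10]])
assert np.isclose(f.sum(), 1.0)

# Pr((X, Y) in E) for the event E = {X + Y <= 2}, by simple summation.
x, y = np.indices(f.shape)
prob_E = f[x + y <= 2].sum()
print("Pr(X + Y <= 2) =", prob_E)
```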

Exercise 2.4Continuing Exercise 2.2, find the PMF of X1, and hence identify the distribution of X1.

Exercise 2.5
The RV Q is defined on the rational numbers in [0, 1] by Q = X/Y, where f(x, y) = (1 − α)α^{y−1}/(y + 1), 0 < α < 1, y = 1, 2, . . ., x = 0, 1, . . . , y.

Show that Pr(Q = 0) = (α − 1)(α + log(1 − α))/α².


2.2 Continuous Bivariate Distributions

Definition 2.3 (Continuous bivariate distribution)
Random variables X & Y have a continuous joint distribution if there exists a function f from R² to [0, ∞) such that

    Pr((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy   for all A ⊆ R².   (2.2)

Definition 2.4 (Bivariate PDF)
The function f(x, y) defined by Equation 2.2 is called the joint probability density function of X & Y.

Comments

1. f(x, y) may be written more explicitly as fX,Y (x, y).

2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

3. f(x, y) is not unique; it could be arbitrarily redefined at a countable set of points (xi, yi) (more generally, on any 'set of measure zero') without changing the value of ∫∫_A f(x, y) dx dy for any set A.

4. f(x, y) ≥ 0 at all continuity points (x, y) ∈ R².
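As a rough numerical illustration of Equation 2.2 (a sketch: both the density and the event below are invented for this example), Pr((X, Y) ∈ A) can be approximated by summing f over a fine grid.

```python
import numpy as np

# Toy joint PDF on the unit square: f(x, y) = x + y (it integrates to 1 there).
def f(x, y):
    return np.where((0 < x) & (x < 1) & (0 < y) & (y < 1), x + y, 0.0)

# Approximate Pr(X > Y) by a Riemann sum over a fine grid.
m = 1000
xs = (np.arange(m) + 0.5) / m
X, Y = np.meshgrid(xs, xs)
cell_area = (1 / m) ** 2
prob = (f(X, Y) * (X > Y)).sum() * cell_area
print("Pr(X > Y) approx:", prob)   # exact value is 1/2 by symmetry
```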

Examples

1. As in Example E above, we will want to know properties of the joint distribution of (µ̂, σ̂²), the MLEs of µ and σ² respectively, given X1, . . . , Xn IID ∼ N(µ, σ²).

2. In the situation of Example B above, where X is the systolic blood pressure and Y the diastolic blood pressure of an individual, it might be reasonable to assume that

    X ∼ N(µS, σS²),
    Y | X ∼ N(α + βX, σD²),

and hence obtain fX,Y(x, y) = fX(x) fY|X(y|x).

Comment

As in Exercise 2.2, a family of multivariate distributions is most easily built up hierarchically using simple univariate distributions and conditional distributions like that of Y | X. Conditional distributions are considered formally in Section 2.4.
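The hierarchical construction is also convenient for simulation. The sketch below draws (X, Y) by first drawing X and then Y | X, in the spirit of the blood-pressure example above; all numerical values are invented.

```python
import numpy as np

# Hierarchical sketch of Example B:
#   X ~ N(mu_S, sigma_S^2),  Y | X ~ N(alpha + beta*X, sigma_D^2).
# Every numerical value here is invented for illustration only.
rng = np.random.default_rng(1)
mu_S, sigma_S = 130.0, 15.0
alpha, beta, sigma_D = 10.0, 0.5, 8.0

x = rng.normal(mu_S, sigma_S, size=100_000)
y = rng.normal(alpha + beta * x, sigma_D)    # one Y drawn conditionally on each X

print("sample mean of Y :", y.mean())        # roughly alpha + beta*mu_S = 75
print("sample corr(X, Y):", np.corrcoef(x, y)[0, 1])
```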

2.2.1 Visualising and Displaying a Continuous Joint Distribution

A continuous bivariate distribution can be represented by a contour or other plot of its joint PDF (Fig. 2.1).

Comments

1. The joint distribution of X and Y may be neither discrete nor continuous, for example:

• Either X or Y may have both continuous and discrete components,
• One of X and Y may have a continuous distribution, the other discrete (like Example F above).

2. Higher dimensional joint distributions are obviously much more difficult to interpret and to represent graphically, with or without computer help.


Figure 2.1: Contour and perspective plots of a bivariate distribution

2.3 Marginal Distributions

Given a joint CDF FX,Y(x, y), the distributions defined by the CDFs FX(x) = lim_{y→∞} FX,Y(x, y) and FY(y) = lim_{x→∞} FX,Y(x, y) are called the marginal distributions of X and Y respectively:

Definition 2.5 (Marginal CDF, PMF and PDF—bivariate case)
FX(x) = lim_{y→∞} FX,Y(x, y) is the marginal CDF of X.

If X has a discrete distribution, then fX(x) = Pr(X = x) is the marginal PMF of X.

If X has a continuous distribution, then fX(x) = (d/dx) FX(x) is the marginal PDF of X.

Marginal CDFs and PDFs of Y, and of other RVs for higher-dimensional joint distributions, are defined similarly.

Exercise 2.6
Suppose that you are given a bag containing five coins:

1 double-tailed, 1 with Pr(head) = 1/4, 2 fair, 1 double-headed.

You pick one coin at random (each with probability 1/5), then toss it twice.

By finding the joint distribution of Θ = Pr(head) and X = number of heads, or otherwise, calculate the distribution of the number of heads obtained.

Comments

1. If you've tabulated Pr(Θ = θ & X = x), then it's simple to find fΘ(θ) and fX(x) by writing the row sums and column sums in the margins of the table of Pr(Θ = θ & X = x)—hence the name 'marginal distribution'.

2. Although the most satisfactory general definition of marginal distributions is in terms of their CDFs, in practice it's usually easiest to work with PMFs or PDFs, as in the sketch below.
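To see the 'margins' picture concretely, here is a small sketch using an invented variant of the coin set-up (three equally likely coins with Pr(head) = 0, 1/2, 1, rather than the five coins of Exercise 2.6, so as not to give that exercise away): the joint table of Θ and X is built, and its row and column sums are the marginal PMFs.

```python
from math import comb
import numpy as np

# Invented variant for illustration: three equally likely coins with
# Pr(head) = 0, 1/2, 1; the chosen coin is tossed twice.
thetas = np.array([0.0, 0.5, 1.0])
prior = np.array([1/3, 1/3, 1/3])
x_vals = range(3)   # X = number of heads in two tosses

# joint[i, x] = Pr(Theta = thetas[i] & X = x)
binom_pmf = np.array([[comb(2, x) * t**x * (1 - t)**(2 - x) for x in x_vals]
                      for t in thetas])
joint = prior[:, None] * binom_pmf

print("joint table:\n", joint)
print("row sums    (marginal PMF of Theta):", joint.sum(axis=1))
print("column sums (marginal PMF of X)    :", joint.sum(axis=0))
```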

2.4 Conditional Distributions

2.4.1 Discrete Case

If X and Y are discrete RVs then, by definition,

    Pr(Y=y | X=x) = Pr(X=x & Y=y) / Pr(X=x).   (2.3)


In other words (or, more accurately, in other symbols):

Definition 2.6 (Conditional PMF—bivariate case)If X and Y have a discrete joint distribution with PMF fX,Y (x, y), then the conditional PMF fY |Xof Y given X = x is

fY |X(y |x) =fX,Y (x, y)

fX(x)(2.4)

where fX(x) =∑

y fX,Y (x, y) is the marginal PMF of X.

Exercise 2.7Continuing Exercise 2.6, what are the conditional distributions of [X |Θ = 1/4] and [Θ|X = 0]?

2.4.2 Continuous Case

Now suppose that X and Y have a continuous joint distribution. If we observe X = x, then we will want to know the conditional CDF FY|X(y | X = x). But we CAN'T use Equation 2.3 directly, which would entail dividing by zero. Therefore, by analogy with Equation 2.4, we adopt the following definition:

Definition 2.7 (Conditional PDF—bivariate case)
If X and Y have a continuous joint distribution with PDF fX,Y(x, y), then the conditional PDF fY|X of Y given that X = x is

    fY|X(y|x) = fX,Y(x, y) / fX(x),   (2.5)

defined for all x ∈ R such that fX(x) > 0.

2.4.3 Independence

Recall that two RVs X and Y are independent (X ⊥⊥ Y) if, for any two sets A, B ⊆ R,

    Pr(X ∈ A & Y ∈ B) = Pr(X ∈ A) Pr(Y ∈ B).   (2.6)

Exercise 2.8
Show that X and Y are independent according to Formula 2.6 if and only if

    FX,Y(x, y) = FX(x) FY(y),   −∞ < x, y < ∞,   (2.7)

or equivalently if and only if

    fX,Y(x, y) = fX(x) fY(y),   −∞ < x, y < ∞,   (2.8)

(where the functions f are interpreted as PMFs or PDFs in the discrete or continuous case respectively).‖


2.5 Problems

1. Let the function f(x, y) be defined by

    f(x, y) = 6xy² if 0 < x < 1 and 0 < y < 1, and 0 otherwise.

(a) Show that f(x, y) is a probability density function.

(b) If X and Y have the joint PDF f(x, y) above, show that Pr(X + Y ≥ 1) = 9/10.

(c) Find the marginal PDF fX(x) of X.

(d) Show that Pr(0.5 < X < 0.75) = 5/16.

2. Suppose that the random vector (X, Y) takes values in the region A = {(x, y) | 0 ≤ x ≤ 2, 0 ≤ y ≤ 2}, and that its CDF within A is given by FX,Y(x, y) = xy(x + y)/16.

(a) Find FX,Y (x, y) for values of (X, Y ) outside A.

(b) Find the marginal CDF FX(x) of X.

(c) Find the joint PDF fX,Y (x, y).

3. Suppose that X and Y are RVs with joint PDF

    f(x, y) = cx²y if x² ≤ y ≤ 1, and 0 otherwise.

(a) Find the value of c.

(b) Find Pr(X ≥ Y ).

(c) Find the marginal PDFs fX(x) & fY (y)

4. For each of the following joint PDFs f of X and Y, determine the constant c, find the marginal PDFs of X and Y, and determine whether or not X and Y are independent.

(a) f(x, y) = c e^{−(x+2y)} for x, y ≥ 0, and 0 otherwise.

(b) f(x, y) = cy²/2 for 0 ≤ x ≤ 2 and 0 ≤ y ≤ 1, and 0 otherwise.

(c) f(x, y) = cx e^{−y} for 0 ≤ x ≤ 1 and 0 ≤ y < ∞, and 0 otherwise.

(d) f(x, y) = cxy for x, y ≥ 0 and x + y ≤ 1, and 0 otherwise.

5. Suppose that X and Y are continuous RVs with joint PDF f(x, y) = e^{−y} on 0 < x < y < ∞.

(a) Find Pr(X + Y ≥ 1) [HINT: write this as 1 − Pr(X + Y < 1)].

(b) Find the marginal distribution of X.

(c) Find the conditional distribution of Y given that X = x.


6. Assume that X and Y are random variables each taking values in [0, 1]. For each of the following CDFs, show that the marginal distributions of X and Y are both uniform U(0, 1), and determine the conditional CDF FX|Y(x | Y = 0.5) in each case:

(a) F (x, y) = xy,

(b) F (x, y) = min(x, y),

(c) F(x, y) = 0 if x + y < 1, and x + y − 1 if x + y ≥ 1.

7. Suppose that Θ is a random variable uniformly distributed on (0, 1), i.e. Θ ∼ U(0, 1), and that, once Θ = θ has been observed, the random variable X is drawn from a binomial distribution [X|θ] ∼ Bin(2, θ).

(a) Find the joint CDF F (θ, x).

(b) How might you display the joint distribution of Θ and X graphically?

(c) What (as simply as you can express them) are the marginal CDFs F1(θ) of Θ and F2(x) of X?

8. Suppose that X and Y are two RVs having a continuous joint distribution. Show that X and Y are independent if and only if fX|Y(x|y) = fX(x) for each value of y such that fY(y) > 0, and for all x.

9. Suppose that X ∼ U (0, 1) and [Y |X = x] ∼ U (0, x). Find the marginal PDFs of X and Y .

2.6 Multivariate Distributions

2.6.1 Introduction

Given a random vector X = (X1, X2, . . . , Xn)T, the joint distribution of the random variables X1, X2, . . . , Xn is called a multivariate distribution.

Definition 2.8 (Joint CDF)
The joint cumulative distribution function of RVs X1, X2, . . . , Xn is the function

    FX(x1, x2, . . . , xn) = Pr(Xk ≤ xk ∀ k = 1, 2, . . . , n).   (2.9)

Comments

1. Formula 2.9 can be written succinctly as FX(x) = Pr(X ≤ x), in an ‘obvious’ vector notation.

2. FX(x) can be called simply the CDF of the random vector X.

3. Properties of FX are similar to the bivariate case. Unfortunately the notation is messier, particularly for the things we're generally most interested in for statistical inference, such as

(a) marginal distributions of unknown quantities and vectors,

(b) conditional distributions of unknown quantities and vectors, given what we know.

4. It's often simpler to blur the distinction between row and column vectors, i.e. to let X denote either (X1, X2, . . . , Xn) or (X1, X2, . . . , Xn)T, depending on context.


Definition 2.9 (Discrete multivariate distribution)
The RV X ∈ Rⁿ has a discrete distribution if it can take only a countable number of possible values.

Definition 2.10 (Multivariate PMF)
If X has a discrete distribution, then its probability mass function (PMF) is

    f(x) = Pr(X = x),   x ∈ Rⁿ   (2.10)

[i.e. the RVs X1, . . . , Xn have joint PMF f(x1, . . . , xn) = Pr(X1 = x1 & · · · & Xn = xn)].

Definition 2.11 (Continuous multivariate distribution)
The RV X = (X1, X2, . . . , Xn) has a continuous distribution if there is a nonnegative function f(x), where x = (x1, x2, . . . , xn), such that for any subset A ⊂ Rⁿ,

    Pr((X1, X2, . . . , Xn) ∈ A) = ∫ · · · ∫_A f(x1, x2, . . . , xn) dx1 dx2 . . . dxn.   (2.11)

Definition 2.12 (Multivariate PDF)
The function f in 2.11 is the (joint) probability density function of X.

Comments

1. Without loss of generality, if X is discrete, then we can take its possible values to be Nⁿ (i.e. each coordinate Xi of X is a nonnegative integer).

2. Equation 2.11 could be written simply as

    Pr(X ∈ A) = ∫_A f(x) dx.   (2.12)

3. As usual, f(·) may be written more explicitly fX(·), etc.

4. By the fundamental theorem of calculus,

    fX(x1, . . . , xn) = ∂ⁿ FX(x1, . . . , xn) / (∂x1 · · · ∂xn)   (2.13)

at all points (x1, . . . , xn) where this derivative exists (i.e. fX(x) = ∂ⁿFX(x)/∂x).

5. Mixed distributions (neither continuous nor discrete) can be handled using appropriate combinations of summation and integration.

2.6.2 Useful Notation for Marginal & Conditional Distributions

We'll sometimes adopt the following notation from DeGroot, particularly when the components Xi of X are in some way similar, as in the multivariate Normal distribution (see later).

F (x) denotes the CDF of X = (X1, X2, . . . , Xn) at x = (x1, x2, . . . , xn),

f(x) denotes the corresponding joint PMF (discrete case) or PDF (continuous case),

fj(xj) denotes the marginal PMF (PDF) of Xj (integrating over x1 . . . xj−1, xj+1 . . . xn),

fjk(xj, xk) denotes the marginal joint PDF of Xj & Xk (integrating over the remaining xi's),

gj(xj | x1, . . . , xj−1, xj+1, . . . , xn) denotes the conditional PMF (PDF) of Xj given Xi = xi, i ≠ j,

Fj(xj) denotes the marginal CDF of Xj,

Gjk denotes the conditional CDF of (Xj, Xk) given the values xi of all Xi, i ≠ j, k, etc.


2.7 Expectation

2.7.1 Introduction

The following are important definitions and properties involving expectations, variances and covariances:

Var(X) = E[(X − µ)²], where µ = EX
       = E[X²] − µ²,

E[aX + b] = a EX + b, where a and b are constants,

E[(aX + b)²] = a²E[X²] + 2ab EX + b²,

Var(aX + b) = a² Var(X),

E[X1X2] = (EX1)(EX2) if X1 ⊥⊥ X2,

Cov(X1, X2) = E[(X1 − µ1)(X2 − µ2)] = E[X1X2] − µ1µ2,

SD(X) = √Var(X),

corr(X1, X2) = ρ(X1, X2) = Cov(X1, X2) / (SD(X1) SD(X2)).
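These identities are easy to sanity-check by simulation; the sketch below uses arbitrary distributions purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x1 = rng.gamma(shape=2.0, scale=1.5, size=n)   # arbitrary non-Normal choice
x2 = rng.normal(0.0, 1.0, size=n)              # independent of x1
a, b = 3.0, -2.0

print("Var(aX+b) vs a^2 Var(X):", np.var(a * x1 + b), a**2 * np.var(x1))
print("E[X1 X2]  vs E[X1]E[X2]:", np.mean(x1 * x2), x1.mean() * x2.mean())
cov = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))
print("corr via Cov/(SD SD)   :", cov / (x1.std() * x2.std()))   # near 0 here
```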

Note that the definition of expectation applies directly in the multivariate case:

Definition 2.13 (Multivariate expectation)

E[h(X)] = ∑_x h(x) f(x) if X is discrete, and

E[h(X)] = ∫_{Rⁿ} h(x) f(x) dx if X is continuous.

For example, if X = (X1, X2, X3) has a continuous distribution, then

    E[X1] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 f(x1, x2, x3) dx1 dx2 dx3.

Exercise 2.9
Let X and Y be independent continuous RVs. Prove that, for arbitrary functions g(·) and h(·),

    E[g(X) h(Y)] = (E g(X)) (E h(Y)).

Exercise 2.10
Let X, Y and Z have independent Poisson distributions with means λ, µ, ν respectively. Find E[X²YZ].

Exercise 2.11
[Cauchy–Schwarz] By considering E[(tX − Y)²], or otherwise, prove the Cauchy–Schwarz inequality for expectations, i.e. for any two RVs X and Y with finite second moments, (E(XY))² ≤ E(X²) E(Y²), with equality if and only if Pr(Y = cX) = 1 for some constant c.

Hence or otherwise prove that the correlation ρX,Y between X and Y satisfies |ρX,Y| ≤ 1.

Under what circumstances does ρX,Y = 1?‖


2.8 Approximate Moments of Transformed Distributions

The moments of a transformed RV g(X) can often be well approximated via a Taylor series:

Exercise 2.12
[Delta method] Let X1, X2, . . . , Xn be independent, each with mean µ and variance σ², and let g(·) be a function with a continuous derivative g′(·).

By considering a Taylor series expansion involving

    Zn = (X̄ − µ) / √(σ²/n),

show that

    E[g(X̄)] = g(µ) + O(n⁻¹),   (2.14)

    Var[g(X̄)] = n⁻¹ σ² g′(µ)² + O(n⁻³/²).   (2.15)

Comments

1. There is similarly a multivariate delta method, outside the scope of this course.

2. Important uses of expansions like the delta method include identifying useful transformations g(·), for example to remove skewness or, when Var(X) is a function of µ, to make Var[g(X̄)] (approximately) independent of µ.

3. A useful transformation g(·) is sometimes in practice applied to the original RVs, on the (often reasonable) assumption that the properties of (∑ g(Xi))/n will be similar to those of g((∑ Xi)/n).
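A quick numerical check of Equations 2.14 and 2.15 (an illustrative sketch; the choice g(x) = log x and the Exponential data are arbitrary):

```python
import numpy as np

# Delta-method sketch: compare the simulated mean/variance of g(X-bar)
# with g(mu) and sigma^2 g'(mu)^2 / n, for g(x) = log(x) and Exp data.
rng = np.random.default_rng(3)
mu, sigma2, n = 2.0, 4.0, 50            # Exp(scale=2): mean 2, variance 4
xbar = rng.exponential(scale=mu, size=(200_000, n)).mean(axis=1)
g_xbar = np.log(xbar)

print("E[g(Xbar)]  simulated:", g_xbar.mean(), "  delta approx:", np.log(mu))
print("Var[g(Xbar)] simulated:", g_xbar.var(),
      "  delta approx:", sigma2 * (1 / mu) ** 2 / n)   # g'(mu) = 1/mu
```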

Exercise 2.13
[Variance stabilising transformations]

Suppose that X1, X2, . . . , Xn are IID and that the (common) variance of each Xi is a function of the (common) mean µ = EXi.

Show that the variance of g(X̄) is approximately constant if

    g′(µ) = 1/√(Var(µ)).

If X ∼ Poi(µ), show that Y = √X has approximately constant variance.‖
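The square-root transformation for Poisson counts can be checked by simulation (a sketch; the means below are chosen arbitrarily): Var(X) grows with µ, while Var(√X) stays near 1/4.

```python
import numpy as np

# For X ~ Poi(mu), Var(X) = mu, while Var(sqrt(X)) should be roughly
# constant (about 1/4) once mu is not too small.
rng = np.random.default_rng(4)
for mu in [2, 5, 10, 20, 50]:
    x = rng.poisson(mu, size=500_000)
    print(f"mu = {mu:3d}   Var(X) = {x.var():7.2f}   Var(sqrt(X)) = {np.sqrt(x).var():.3f}")
```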


2.9 Problems

1. The discrete random vector (X1, X2, X3) has the following PMF:

            (X1 = 1)                        (X1 = 2)
                X3                               X3
            1     2     3                   1     2     3
       1   .02   .03   .05             1   .08   .04   .03
  X2   2   .04   .06   .10        X2   2   .12   .11   .07
       3   .02   .03   .05             3   .05   .05   .05

(a) Calculate the marginal PMFs: f1(x1), f2(x2), f3(x3) and f12(x1, x2).

(b) Are X1 and X2 independent?

(c) What are the conditional PMFs: g1(x1 | X2 = 1, X3 = 3), g2(x2 | X1 = 1, X3 = 3), g3(x3 | X1 = 1, X2 = 3), and g12(x1, x2 | X3 = 3)?

2. The RVs A, B, C etc. count the number of times the corresponding letter appears when a word is chosen at random from the following list (each being chosen with probability 1/16):

    MASCARA, MASK, MERCY, MONSTER,
    MOVIE, PREY, REPLICA, REPTILES,
    RITE, SEAT, SNAKE, SOMBRE,
    SQUID, TENDER, TIME, TROUT.

(a) Complete the following table of the joint distribution of E, M and R:

            E = 0                E = 1                E = 2
          M=0    M=1           M=0    M=1           M=0    M=1
    R=0   1/16   1/16    R=0   2/16           R=0
    R=1                  R=1                  R=1

(b) Calculate all three bivariate marginal distributions, and hence find which of the following statements are true:

    (a) E ⊥⊥ M,   (b) E ⊥⊥ R,   (c) M ⊥⊥ R.

(c) Similarly discover which of the following statements are true:

    (d) M ⊥⊥ R | E=0,   (e) M ⊥⊥ R | E=1,   (f) M ⊥⊥ R | E=2,
    (g) M ⊥⊥ R | E,     (h) E ⊥⊥ R | M,     (i) E ⊥⊥ M | R.

3. Find variance stabilizing transformations for

(a) the exponential distribution,

(b) the binomial distribution.

4. Let Z ∼ N(0, 1) and define the RV X by

    Pr(X = −√3) = 1/6,   Pr(X = 0) = 4/6,   Pr(X = +√3) = 1/6.

(a) Show that X has the same mean and variance as Z, and that X² has the same mean and variance as Z².

(b) Suppose the RV Y has mean µ and variance σ². Compare the delta method for estimating the mean and variance of the RV T = g(Y) with the alternative estimates µ(T) ≈ E(g(µ + σX)), Var(T) ≈ Var(g(µ + σX)). [Try a few simple distributions for Y and transformations g(·).]


2.10 Conditional Expectation

2.10.1 Introduction

A common practical problem arises when X1 and X2 aren't independent, we observe X2 = x2, and we want to know the mean of the resulting conditional distribution of X1.

Definition 2.14 (Conditional expectation)
The conditional expectation of X1 given X2 is denoted E[X1 | X2]. If X2 = x2 then

    E[X1 | x2] = ∫_{−∞}^{∞} x1 g1(x1 | x2) dx1   (continuous case)   (2.16)

    E[X1 | x2] = ∑_{x1} x1 g1(x1 | x2)           (discrete case)     (2.17)

where g1(x1 | x2) is the conditional PDF or PMF respectively.

Comment

Note that before X2 is known to take the value x2, E[X1 | X2] is itself a random variable, being a function of the RV X2. We'll be interested in the distribution of the RV E[X1 | X2], and (for example) comparing it with the unconditional expectation EX1. The following is an important result:

Theorem 2.1 (Marginal expectation)
For any two RVs X1 & X2,

    E[E[X1 | X2]] = EX1.   (2.18)

Exercise 2.14
Prove Equation 2.18 (i) for continuous RVs X1 and X2, (ii) for discrete RVs X1 and X2.

Exercise 2.15
Suppose that the RV X has a uniform distribution, X ∼ U(0, 1), and that, once X = x has been observed, the conditional distribution of Y is [Y | X = x] ∼ U(x, 1).

Find E[Y | x] and hence, or otherwise, show that EY = 3/4.‖

Exercise 2.16
Suppose that Θ ∼ U(0, 1) and (X|Θ) ∼ Bin(2, Θ).

Find E[X |Θ] and hence or otherwise show that EX = 1.‖

2.10.2 Conditional Expectations of Functions of RVs

By extending Theorem 2.1, we can relate the conditional and marginal expectations of functions of RVs (in particular, their variances).

Theorem 2.2 (Marginal expectation of a transformed RV)
For any RVs X1 & X2, and for any function h(·),

    E[E[h(X1) | X2]] = E[h(X1)].   (2.19)

22

Page 23: (ST217: Mathematical Statistics B)

Exercise 2.17
Prove Equation 2.19 (i) for discrete RVs X1 and X2, (ii) for continuous RVs X1 and X2.

An important consequence of Equation 2.19 is the following theorem relating marginal variance to conditional variance and conditional expectation:

Theorem 2.3 (Marginal variance)
For any RVs X1 & X2,

    Var(X1) = E[Var(X1 | X2)] + Var(E[X1 | X2]).   (2.20)

Comments

1. Equation 2.20 is easiest to remember in English:

    'marginal variance = expectation of conditional variance + variance of conditional expectation'.

2. A useful interpretation of Equation 2.20 is:

    Var(X1) = average random variation inherent in X1 even if X2 were known
            + random variation due to not knowing X2 and hence not knowing EX1,

i.e. the uncertainty involved in predicting the value x1 taken by a random variable X1 splits into two components. One component is the unavoidable uncertainty due to random variation in X1, but the other can be reduced by observing quantities (here the value x2 of X2) related to X1.
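Theorem 2.3 can also be checked numerically. The following sketch uses an invented hierarchical model (X2 ∼ Exp(1) and X1 | X2 ∼ N(X2, 1), not an example from the notes), for which both sides of Equation 2.20 should be close to 2.

```python
import numpy as np

# Hierarchical sketch: X2 ~ Exp(1), X1 | X2 ~ N(X2, 1).
# Here Var(X1) = E[Var(X1|X2)] + Var(E[X1|X2]) = 1 + 1 = 2.
rng = np.random.default_rng(5)
x2 = rng.exponential(1.0, size=1_000_000)
x1 = rng.normal(loc=x2, scale=1.0)

e_cond_var = 1.0            # Var(X1 | X2) = 1 for every value of X2
var_cond_mean = x2.var()    # Var(E[X1 | X2]) = Var(X2), estimated from the draws

print("Var(X1) simulated             :", x1.var())
print("E[Var(X1|X2)] + Var(E[X1|X2]) :", e_cond_var + var_cond_mean)
```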

Exercise 2.18
[Proof of Theorem 2.3] Expand E[Var(X1 | X2)] and Var(E[X1 | X2]).

Hence show that Var(X1) = E[Var(X1 | X2)] + Var(E[X1 | X2]).

Exercise 2.19
Continuing Exercise 2.16, in which Θ ∼ U(0, 1), (X|Θ) ∼ Bin(2, Θ), and E[X | Θ] = 2Θ, find Var(E[X | Θ]) and E[Var(X | Θ)]. Hence or otherwise show that Var X = 2/3, and comment on the effect on the uncertainty in X of observing Θ.‖


2.11 Problems

1. Two fair coins are tossed independently. Let A1, A2 and A3 be the following events:

A1 = 'coin 1 comes down heads',
A2 = 'coin 2 comes down heads',
A3 = 'results of both tosses are the same'.

(a) Show that A1, A2 and A3 are pairwise independent (i.e. A1 ⊥⊥ A2, A1 ⊥⊥ A3 and A2 ⊥⊥ A3) but not mutually independent.

(b) Hence or otherwise construct three random variables X1, X2, X3 such that E[X3 | X1 = x1] and E[X3 | X2 = x2] are constant, but E[X3 | X1 = x1 & X2 = x2] isn't.

2. Construct three random variables X1, X2, X3 with continuous distributions such that X1 ⊥⊥ X2, X1 ⊥⊥ X3 and X2 ⊥⊥ X3, but any two Xi's determine the remaining one.

3. (a) Show that for any random variables X and Y,

    i. E[Y] = E[E[Y | X]],
    ii. Var[Y] = E[Var[Y | X]] + Var[E[Y | X]].

(b) Suppose that the random variables Xi and Pi, i = 1, . . . , n, have the following distributions:

    Xi = 1 with probability Pi, and 0 with probability 1 − Pi,

    Pi IID∼ Beta(α, β),

i.e. Pi has density

    f(p) = [Γ(α + β) / (Γ(α) Γ(β))] p^{α−1} (1 − p)^{β−1}

with mean µ and variance σ² given by

    µ = E[Pi] = α/(α + β),   σ² = Var[Pi] = αβ / ((α + β)²(α + β + 1)),

and Xi has a Bernoulli(Pi) distribution.

Find

    i. E[X1 | P1],
    ii. Var[X1 | P1],
    iii. Var[E[X1 | P1]], and
    iv. E[Var[X1 | P1]].

Hence find E[Y] where Y = ∑_{i=1}^{n} Xi, and show that Var[Y] = nαβ/(α + β)².

(c) Express E[Y] and Var[Y] in terms of µ and σ², and comment on the result.

From Warwick ST217 exam 1998

4. Suppose that the number N of bye-elections occurring in Government-held seats over a 12-month period follows a Poisson distribution with mean 10.

Suppose also that, independently for each such bye-election, the probability that the Government hold onto the seat is 1/4. The number X of seats retained in the N bye-elections therefore follows a binomial distribution:

    [X|N] ∼ Bin(N, 0.25).

(a) What are E[N ], Var[N ], E[X|N ] and Var[X|N ]?

(b) What are E[X] and Var[X]?

(c) What is the distribution of X? [HINT : try using generating functions—see MSA]


5. (a) For continuous random variables X and Y , define

i. the marginal density fX(x) of X,
ii. the conditional density fY|X(y|x) of Y given X = x,
iii. the conditional expectation E[Y|X] of Y given X, and
iv. the conditional variance Var[Y|X] of Y given X.

(b) Show that

    i. E[g(Y)] = E[E[g(Y)|X]], for an arbitrary function g(·), and
    ii. Var[Y] = E[Var[Y|X]] + Var[E[Y|X]].

(c) Suppose that the random variables X and Y have a continuous joint distribution, with PDF f(x, y), means µX & µY respectively, variances σX² & σY² respectively, and correlation ρ. Also suppose the conditional mean of Y given X = x is a linear function of x:

    E[Y|x] = β0 + β1x.

Show that

    i. ∫_{−∞}^{∞} y f(x, y) dy = (β0 + β1x) fX(x),
    ii. µY = β0 + β1µX, and
    iii. ρσXσY + µXµY = β0µX + β1(σX² + µX²).

(Hint: use the fact that E[XY] = E[E[XY|X]].)

(d) Hence or otherwise express β0 and β1 in terms of µX, µY, σX, σY & ρ, and write down (or derive) the maximum likelihood estimates of β0 & β1 under the assumption that the data (x1, y1), . . . , (xn, yn) are i.i.d. observations from a bivariate Normal distribution.

From Warwick ST217 exam 1997

6. For discrete random variables X and Y , define:

(i) The conditional expectation of Y given X, E[Y |X], and

(ii) The conditional variance of Y given X, Var[Y |X].

Show that

(iii) E[Y] = E[E[Y|X]], and

(iv) Var[Y] = E[Var[Y|X]] + Var[E[Y|X]].

(v) Show also that if E[Y|X] = β0 + β1X for some constants β0 and β1, then

    E[XY] = β0E[X] + β1E[X²].

The random variable X denotes the number of leaves on a certain plant at noon on Monday, Y denotes the number of greenfly on the plant at noon on Tuesday, and Z denotes the number of ladybirds on the plant at noon on Wednesday.

Suppose that, given X = x, Y has a Poisson distribution with mean µx. If X has a Poisson distribution with mean λ, show that

    E[Y] = λµ   and   Var[Y] = λµ(1 + µ)

(you may assume that for a Poisson distribution the mean and variance are equal).

Suppose further that, given Y = y, Z has a Poisson distribution with mean νy. Find E[Z], Var[Z], and the correlation between X and Z.

From Warwick ST217 exam 1996


7. Using the relationship

    E[E[h(X1) | X2]] = E[h(X1)],

where

    h(x1) = (x1 − E[X1 | x2] + E[X1 | x2] − EX1)²,

prove that

    Var(X1) = E[Var(X1 | X2)] + Var(E[X1 | X2])

for any two random variables X1 & X2.

8. Prove that, for any three RVs X, Y and Z for which the various expectations exist,

(a) X and Y − E(Y|X) are uncorrelated,

(b) Var(Y − E(Y|X)) = E(Var(Y|X)),

(c) if X and Y are uncorrelated then E(Cov(X, Y|Z)) = −Cov(E(X|Z), E(Y|Z)),

(d) Cov(Z, E(Y|Z)) = Cov(Z, Y).

In scientific thought we adopt the simplest theory which will explain all the facts under consideration and enable us to predict new facts of the same kind. The catch in this criterion lies in the word 'simplest'. It is really an aesthetic canon such as we find implicit in our criticisms of poetry or painting.

J. B. S. Haldane

All models are wrong, some models are useful.
G. E. P. Box

A child of five would understand this. Send somebody to fetch a child of five.
Groucho Marx


Chapter 3

The Multivariate Normal Distribution

3.1 Motivation

A Normally distributed RV

    X ∼ N(µ, σ²)

has PDF

    f(x; µ, σ²) = constant × exp[−½ (x − µ)²/σ²]   (3.1)

where

    µ is the mean of X,
    σ² is the variance of X, and
    'constant' is there to make f integrate to 1.

The Normal distribution is important because, by the CLT, as n → ∞, the CDF of an MLE such as θ̂ = ∑ Xi/n or θ̂ = ∑ (Xi − ∑ Xj/n)²/n tends uniformly (under reasonable conditions) to the CDF of a Normal RV with the appropriate mean and variance; i.e. the log-likelihood tends to a quadratic in θ.

Similarly it can be shown that, for a model with parameter vector θ = (θ1, . . . , θp)T, under reasonable conditions the log-likelihood will tend to a quadratic in (θ1, . . . , θp).

Therefore, by analogy with Equation 3.1, we will want to define a distribution with PDF

    f(x; µ, V) = constant × exp[−½ (x − µ)T V⁻¹ (x − µ)]   (3.2)

where

    µ is a (p × 1) matrix or column vector,
    V is a (p × p) matrix, and
    'constant' is again there to make f integrate to 1.


As an example of a PDF of this form, if X1, X2, . . . , Xp IID∼ N(0, 1), then

    f(x) = f1(x1) × f2(x2) × · · · × fp(xp)   by independence
         = (1/(2π)^{p/2}) exp(−½ ∑ xi²)
         = (1/(2π)^{p/2}) exp(−½ xT x).   (3.3)

Definition 3.1 (Multivariate standard Normal)
The distribution with PDF

    f(z) = f(z1, z2, . . . , zp) = (1/(2π)^{p/2}) exp(−½ zT z)

is called the multivariate standard Normal distribution.

The statement 'Z has a multivariate standard Normal distribution' is often written

    Z ∼ N(0, I),   Z ∼ MVN(0, I),   Z ∼ Np(0, I),   or   Z ∼ MVNp(0, I),

and the CDF and PDF of Z are often written Φ(z) and φ(z), or Φp(z) and φp(z), respectively.

In the more general case, where the component RVs X1, X2, . . . , Xp in Equation 3.2 aren't independent, we need an expression for the constant term.

3.2 Digression: Transforming a Random Vector

Exercise 3.1
Suppose that the RVs Z1, Z2, . . . , Zn have a continuous joint distribution, with joint PDF fZ(z). Consider a 1–1 transformation (i.e. a bijection between the corresponding sample spaces) to new RVs X1, X2, . . . , Xn. What is the PDF fX(x) of the transformed RVs?

Solution: Because the transformation is 1–1 we can invert it and write

    Z = u(X),

i.e. a given point (z1, . . . , zn) transforms to (x1, . . . , xn), where

    z1 = u1(x1, . . . , xn),
    z2 = u2(x1, . . . , xn),
    ...
    zn = un(x1, . . . , xn).   (3.4)

Now assume that each function ui(·) is continuous and differentiable. Then we can form the following matrix:

    ∂u/∂x = [ ∂u1/∂x1   ∂u1/∂x2   . . .   ∂u1/∂xn
              ∂u2/∂x1   ∂u2/∂x2   . . .   ∂u2/∂xn
                ...        ...     . . .     ...
              ∂un/∂x1   ∂un/∂x2   . . .   ∂un/∂xn ]   (3.5)

and its determinant J, which is called the Jacobian of the transformation u [i.e. of the joint transformation (u1, . . . , un)].

Then it can be shown that

    fX(x) = |J| × fZ(z)

at all points in the 'sample space' (i.e. set of possible values) of X.‖


Figure 3.1: Bivariate parameter transformation. An infinitesimal δ1 × δ2 rectangle at z has probability content δ1 δ2 fZ(z); under u⁻¹ it maps to an infinitesimal parallelogram at x = u⁻¹(z) with area δ1 δ2 / |J|, so the density there is |J| × fZ(z).

3.3 The Bivariate Normal Distribution

Suppose that Z1 and Z2 are IID with N(0, 1) distributions, i.e. (as in Equation 3.3):

    fZ(z1, z2) = (1/(2π)) exp(−½ (z1² + z2²)).

Now let µ1, µ2 ∈ (−∞, ∞), σ1, σ2 ∈ (0, ∞) & ρ ∈ (−1, 1), and define (as in DeGroot §5.12):

    X1 = σ1 Z1 + µ1,
    X2 = σ2 (ρ Z1 + √(1 − ρ²) Z2) + µ2.   (3.6)

Then the Jacobian of the transformation from Z to X is given by

    J = det [ σ1                 0
              ρ σ2     √(1 − ρ²) σ2 ] = √(1 − ρ²) σ1 σ2.

Therefore the Jacobian of the inverse transformation from X to Z is 1/(√(1 − ρ²) σ1 σ2), and the PDF of X is given by Equations 3.7 & 3.8 below.

Definition 3.2 (Bivariate Normal Distribution)
The continuous bivariate distribution with PDF

    fX(x) = (1/|J|) fZ(z) = [1 / (2π √(1 − ρ²) σ1 σ2)] exp( −Q / (2(1 − ρ²)) ),   (3.7)

where

    Q = ((x1 − µ1)/σ1)² − 2ρ ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)²,   (3.8)

is called the bivariate Normal distribution.
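A direct way to get a feel for Definition 3.2 is to simulate from the construction in Equation 3.6 and check the implied moments; the sketch below does this with invented parameter values.

```python
import numpy as np

# Generate (X1, X2) via Equation 3.6 and check the implied moments.
rng = np.random.default_rng(6)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.7   # invented values
z1, z2 = rng.normal(size=(2, 500_000))

x1 = s1 * z1 + mu1
x2 = s2 * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu2

print("means:", x1.mean(), x2.mean())          # approx mu1, mu2
print("sds  :", x1.std(), x2.std())            # approx sigma1, sigma2
print("corr :", np.corrcoef(x1, x2)[0, 1])     # approx rho
```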


Exercise 3.2
If the RV X = (X1, X2) has PDF given by Equations 3.7 & 3.8, then show by substituting

    v = (x2 − µ2)/σ2   followed by   w = (v − ρ(x1 − µ1)/σ1) / √(1 − ρ²),

or otherwise, that X1 ∼ N(µ1, σ1²).

Hence or otherwise show that the conditional distribution of X1 given X2 = x2 is Normal with mean µ1 + (ρσ1/σ2)(x2 − µ2) and variance σ1²(1 − ρ²).

Comments

1. It's easy to show (problem 2 of Section 3.4 below) that EXi = µi, Var Xi = σi² and corr(X1, X2) = ρ. This suggests that we will be able to write

    X = (X1, X2)T ∼ MVN(µ, V), where

    µ = (µ1, µ2)T is the 'mean vector' of X, and

    V = ( σ1²       ρ σ1 σ2
          ρ σ1 σ2   σ2²     )   is the 'variance-covariance matrix' of X.

2. The 'level curves' (i.e. contours in 2-d) of the bivariate Normal PDF are given by Q = constant in formula 3.8; these are ellipses provided the discriminant is negative:

    (ρ/(σ1σ2))² − (1/σ1²)(1/σ2²) = (ρ² − 1)/(σ1² σ2²) < 0.

This holds as we are only considering 'nonsingular' bivariate Normal distributions with ρ ≠ ±1.

3. PLEASE MAKE NO ATTEMPT TO MEMORISE FORMULAE 3.7 & 3.8!!

Exercise 3.3
Show that the inverse of the variance-covariance matrix V = ( σ1²  ρσ1σ2 ; ρσ1σ2  σ2² ) is

    V−1 = 1/(1 − ρ²) × (  1/σ1²       −ρ/(σ1σ2) )
                        ( −ρ/(σ1σ2)    1/σ2²    ).

3.4 Problems

1. Suppose that the RVs X1, X2, . . . , Xn have a continuous joint distribution with PDF fX(x), and that the RVs Y1, Y2, . . . , Yn are defined by Y = AX, where the (n × n) matrix A is nonsingular. Show that the joint density of the Yis is given by

       fY(y) = (1/|det A|) fX(A−1 y)   for y ∈ Rn.

   Hence or otherwise show carefully that if X1 and X2 are independent RVs with PDFs f1 and f2 respectively, then the PDF of Y = X1 + X2 is given by

       fY(y) = ∫ f1(y − z) f2(z) dz   over −∞ < z < ∞,   for −∞ < y < ∞,

   or equivalently by

       fY(y) = ∫ f1(z) f2(y − z) dz   over −∞ < z < ∞,   for −∞ < y < ∞.

   If Xi ∼ Exp(1) (IID), i = 1, 2, then what is the distribution of X1 + X2?

2. Suppose that Z1 and Z2 are i.i.d. random variables with standard Normal N(0, 1) distributions. Define the random vector (X1, X2) by:

       X1 = µ1 + σ1 Z1,   X2 = µ2 + σ2 [ ρ Z1 + √(1 − ρ²) Z2 ],

   where σ1, σ2 > 0 and −1 ≤ ρ ≤ 1.

   (a) Show that E[X1] = µ1, E[X2] = µ2, Var[X1] = σ1², Var[X2] = σ2², and corr[X1, X2] = ρ.
   (b) Find E[X2|X1] and Var[X2|X1].
   (c) Derive the joint PDF f(x1, x2).
   (d) Find the distribution of [X2|X1]. Hence or otherwise show that two r.v.s with a joint bivariate Normal distribution are independent if and only if they are uncorrelated.
   (e) Now suppose that σ1 = σ2. Show that the RVs Y1 = X1 + X2 and Y2 = X1 − X2 are independent.

3. Suppose that X and Y have the joint density

       fX,Y(x, y) = 1/(2π σX σY √(1 − ρ²)) × exp( −1/[2(1 − ρ²)] × [ ((x − µX)/σX)² − 2ρ ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)² ] ).

   (a) Show by substituting u = (x − µX)/σX and v = (y − µY)/σY, followed by w = (u − ρv)/√(1 − ρ²), or otherwise, that fX,Y does indeed integrate to 1.

   (b) Show that the 'joint MGF' MX,Y(s, t) = E[exp(sX + tY)] is given by

           MX,Y(s, t) = exp[ µX s + µY t + (σX² s² + 2ρσXσY st + σY² t²)/2 ].

   (c) Show that

           ∂MX,Y/∂s |s,t=0 = µX,   ∂²MX,Y/∂s² |s,t=0 = µX² + σX²,   and   ∂²MX,Y/∂s∂t |s,t=0 = µXµY + ρσXσY.

   (d) Guess the formula for the MGF MX(s) of X, where X ∼ MVN(µ, V).

4. Suppose that (X1, X2) have a bivariate Normal distribution. Show that any linear combination Y = a0 + a1X1 + a2X2 has a univariate Normal distribution.

3.5 The Multivariate Normal Distribution

Definition 3.3 (Multivariate Normal distribution)
Let µ = (µ1, µ2, . . . , µp) be a p-vector, and let V be a symmetric positive-definite (p × p) matrix. Then the multivariate probability density defined by

    fX(x; µ, V) = 1/√( (2π)^p |V| ) × exp[ −(x − µ)T V−1 (x − µ)/2 ]        (3.9)

is called a multivariate Normal PDF with mean vector µ and variance-covariance matrix V.

Comments

1. Expression 3.9 is a natural generalisation of the univariate Normal density, with V taking the role of σ² in the exponent, and its determinant |V| taking the role of σ² in the 'normalising constant' that makes the whole thing integrate to 1. Many of the properties of the MVN distribution are guessable from properties of the univariate Normal distribution—in particular, it's helpful to think of 3.9 as 'exponential of a quadratic'.

2. The statement 'X = (X1, X2, . . . , Xp) has a multivariate Normal distribution with mean vector µ and variance-covariance matrix V' may be written

       X ∼ N(µ, V),   X ∼ MVN(µ, V),   X ∼ Np(µ, V),   or   X ∼ MVNp(µ, V).

3. The mean vector µ is sometimes called just the mean, and the variance-covariance matrix V is sometimes called the dispersion matrix, or simply the variance matrix or covariance matrix.

4. µ = EX (or equivalently, componentwise, EXi = µi, i = 1, 2, . . . , p). This fact should be obvious from the name 'mean vector', and can be proved in various ways, e.g. by differentiating a multivariate generalization of the MGF, or simply by symmetry.

5. V = E( (X − µ)(X − µ)T ) = E(XXT) − µµT, i.e.

       E(XXT) − µµT
         = E ( X1²     X1X2    . . .   X1Xp )  −  ( µ1²     µ1µ2    . . .   µ1µp )
             ( X2X1    X2²     . . .   X2Xp )     ( µ2µ1    µ2²     . . .   µ2µp )
             (  ...     ...    . . .    ...  )     (  ...     ...    . . .    ...  )
             ( XpX1    XpX2    . . .   Xp²  )     ( µpµ1    µpµ2    . . .   µp²  )

         = ( E(X1²) − µ1²       E(X1X2) − µ1µ2    . . .   E(X1Xp) − µ1µp )
           ( E(X2X1) − µ2µ1     E(X2²) − µ2²      . . .   E(X2Xp) − µ2µp )
           (       ...                ...         . . .         ...       )
           ( E(XpX1) − µpµ1     E(XpX2) − µpµ2    . . .   E(Xp²) − µp²   )

         = V = ( v11   v12   . . .   v1p )
               ( v21   v22   . . .   v2p )
               ( ...   ...   . . .   ... )
               ( vp1   vp2   . . .   vpp ) ,   say,

   from which it follows that

       V = ( σ1²        ρ12σ1σ2    . . .   ρ1pσ1σp )
           ( ρ12σ1σ2    σ2²        . . .   ρ2pσ2σp )
           (   ...         ...     . . .     ...    )
           ( ρ1pσ1σp    ρ2pσ2σp    . . .   σp²     ) ,        (3.10)

where σi is the standard deviation of Xi and ρij is the correlation between Xi and Xj. Again these results can be proved using a multivariate generalization of the MGF.

6. The p-dimensional MVNp(µ, V) distribution can therefore be parametrised by

       p means µi,   p variances σi²,   and   p(p − 1)/2 correlations ρij

   NB—a total of p(p + 3)/2 parameters.

7. Given n random vectors Xi = (Xi1, Xi2, . . . , Xip) ∼ MVN(µ, V) (IID), i = 1, 2, . . . , n, a set of minimal sufficient statistics for the unknown parameters is given by:

       ∑i Xij          (j = 1, . . . , p),
       ∑i Xij²         (j = 1, . . . , p),
       and ∑i Xij Xik  (j = 2, . . . , p;  k = 1, . . . , j − 1),        (3.11)

   where each sum runs over i = 1, . . . , n, and MLEs for µ and V are given by:

       µ̂j = (1/n) ∑i Xij,        (3.12)
       σ̂j² = (1/n) ∑i (Xij − µ̂j)²,        (3.13)
       ρ̂jk = [ (1/n) ∑i (Xij − µ̂j)(Xik − µ̂k) ] / (σ̂j σ̂k),        (3.14)

   or, in matrix notation,

       µ̂ = (1/n) ∑i Xi,        (3.15)
       V̂ = (1/n) ∑i (Xi − µ̂)(Xi − µ̂)T        (3.16)
         = (1/n) ∑i Xi XiT − µ̂ µ̂T.        (3.17)

   (A small numerical illustration of Equations 3.15–3.17 appears after these comments.)

8. The fact that V is positive-definite implies various (messy!) constraints on the correlations ρij .

9. Surfaces of constant density form concentric (hyper-)ellipsoids (concentric hyper-spheres in the case of the standard MVN distribution). In particular, the contours of a bivariate Normal density form concentric ellipses (or concentric circles for the standard bivariate Normal).

10. It can be proved that all conditional and marginal distributions of a MVN are themselves MVN. The proof of this important fact is quite straightforward, quite tedious, and mercifully omitted from this course.
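The following Python sketch (an illustration, not from the notes) computes the MLEs of Equations 3.15 & 3.16 from simulated MVN data; the 'true' µ and V are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    mu_true = np.array([0.0, 2.0, -1.0])
    V_true = np.array([[1.0, 0.5, 0.2],
                       [0.5, 2.0, 0.3],
                       [0.2, 0.3, 1.5]])
    X = rng.multivariate_normal(mu_true, V_true, size=500)   # (n x p) data matrix

    n = X.shape[0]
    mu_hat = X.mean(axis=0)                                   # Equation 3.15
    centred = X - mu_hat
    V_hat = centred.T @ centred / n                           # Equation 3.16 (divisor n, not n-1)

    print(mu_hat)
    print(V_hat)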


3.6 Distributions Related to the MVN

Because of the CLT, the MVN distribution is important throughout statistics. For example, the joint distribution of the MLEs θ̂1, θ̂2, . . . , θ̂p of unknown parameters θ1, θ2, . . . , θp will under reasonable conditions tend to a MVN as the size of the sample from which θ̂ = (θ̂1, θ̂2, . . . , θ̂p)T was calculated increases.

Therefore various distributions arising from the MVN by transformation are also important.

Throughout this Section we shall usually denote independent standard Normal RVs by Zi, i.e.:

    Zi ∼ N(0, 1) (IID), i = 1, 2, . . . ,   i.e. Z = (Z1, Z2, . . . , Zn)T ∼ MVN(0, I).

Exercise 3.4
Show that if a is a constant (n × 1) column vector, B is a constant nonsingular (n × n) matrix, and Z = (Z1, Z2, . . . , Zn)T is a random n-vector with a MVN(0, I) distribution, then Y = a + BZ ∼ MVN(a, BBT).‖

3.6.1 The Chi-squared Distribution

Definition 3.4 (Chi-squared Distribution)
If Zi ∼ N(0, 1) (IID) for i = 1, 2, . . . , n, then the distribution of

    X = Z1² + Z2² + · · · + Zn²

is called a Chi-squared distribution on n degrees of freedom, and we write X ∼ χ²_n.

Comments

1. In particular, if Z ∼ N(0, 1), then Z² ∼ χ²_1.

2. The above construction of the χ²_n distribution shows that if X ∼ χ²_m, Y ∼ χ²_n, and X ⊥⊥ Y, then (X + Y) ∼ χ²_{m+n}.

   This summation property accounts for the importance and usefulness of the χ² distribution: essentially a squared length is split into two orthogonal components, as in Pythagoras' theorem.

3. If X ∼ χ²_n, then the (unmemorable) density of X can be shown to be

       fX(x) = [ 1/(2^{n/2} Γ(n/2)) ] x^{(n/2)−1} e^{−x/2}   for x > 0,        (3.18)

   with fX(x) = 0 for x ≤ 0.

   Comparing this with the definition of a Gamma distribution (MSA) shows that a Chi-squared distribution on n degrees of freedom is just a Gamma distribution with α = n/2 and β = 1/2 (in the usual parametrisation).

4. It can be shown that if X ∼ χ²_n then EX = n and VarX = 2n.

   Note that this implies that E[X/n] = 1 and Var[X/n] = 2/n.

5. The χ² distributions are positively skewed—for example, χ²_2 is just an exponential distribution with mean 2. However, because of the CLT, the χ²_n distribution tends (slowly!) to Normality as n → ∞.

6. The PDF 3.18 cannot be integrated analytically except for the special case n = 2. Therefore the CDFs of χ²_n distributions for various n are given in standard Statistical Tables.

Figure 3.2: Chi-squared distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points (which for N(0, 1) are at −2, −1, 0, 1, 2).
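A small simulation (not from the notes) illustrates Definition 3.4 and Comment 4; the choice n = 5 is arbitrary.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 5                                         # degrees of freedom (arbitrary choice)

    # Definition 3.4: sum of n squared IID standard Normals.
    Z = rng.standard_normal((100_000, n))
    X = (Z ** 2).sum(axis=1)

    print(X.mean(), X.var())                      # should be close to n and 2n (Comment 4)

    # Compare a tail probability with the chi-squared survival function.
    print((X > 11.07).mean(), stats.chi2.sf(11.07, df=n))   # both near 0.05 for n = 5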

3.6.2 Student’s t Distribution

Definition 3.5 (t Distribution)
If Z ∼ N(0, 1), Y ∼ χ²_n and Y ⊥⊥ Z, then the distribution of

    X = Z / √(Y/n)

is called a (Student's) t distribution on n degrees of freedom, and we write X ∼ t_n.

Comments

1. The shape of the t distribution is like that of a Normal, but with heavier tails (since there is variability in the denominator of t as well as in the Normally-distributed numerator Z).

   However, as n → ∞, the denominator becomes more and more concentrated around 1, so (loosely speaking!) 't_n → N(0, 1) as n → ∞'.

2. The (highly unmemorable) PDF of X ∼ t_n can be shown to be

       fX(x) = [ Γ((n + 1)/2) / (√(nπ) Γ(n/2)) ] (1 + x²/n)^{−(n+1)/2}   for −∞ < x < ∞.        (3.19)

Figure 3.3: t distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points.

3. The t distribution on 1 degree of freedom is also called the Cauchy distribution—note that it arises as the distribution of Z1/Z2 where Zi ∼ N(0, 1) (IID).

   The Cauchy distribution is infamous for not having a mean. More generally, only the first n − 1 moments of the t_n distribution exist.

4. Note that if Xi ∼ N(0, σ²) (IID), then the RV

       T = X1 / √( ∑_{i=2}^{n} Xi² / (n − 1) )

   has a t_{n−1} distribution, and is a measure of the length of X1 compared to the root mean square length of the other Xis.

   i.e. if X has a spherical MVN(0, σ²I) distribution, then we would expect T not to be too large. This is, in effect, how the t distribution usually arises in practice.

5. The PDF 3.19 cannot be integrated analytically in general (exception: n = 1 d.f.). The CDF must be looked up in Statistical Tables or approximated using a computer.
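The construction in Definition 3.5 is easy to check by simulation (illustration only, not from the notes); n = 4 is an arbitrary choice.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n = 4                                       # degrees of freedom (arbitrary)

    # Definition 3.5: T = Z / sqrt(Y/n) with Z ~ N(0,1), Y ~ chi-squared_n, independent.
    Z = rng.standard_normal(200_000)
    Y = rng.chisquare(df=n, size=200_000)
    T = Z / np.sqrt(Y / n)

    # Compare an upper-tail probability with scipy's t distribution.
    print((T > 2.0).mean(), stats.t.sf(2.0, df=n))   # should be close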


3.6.3 Snedecor’s F Distribution

Definition 3.6 (F Distribution)
If Y ∼ χ²_m, Z ∼ χ²_n and Y ⊥⊥ Z, then the distribution of

    X = (Y/m) / (Z/n)

is called an F distribution on m & n degrees of freedom, and we write X ∼ F_{m,n}.

Figure 3.4: F distributions for selected d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points.

Comments

1. Note that the numerator Y/m and denominator Z/n of X both have mean 1. Therefore, provided both m and n are large, X will usually take values around 1.

2. If X ∼ F_{m,n}, then the (extraordinarily unmemorable) density of X can be shown to be

       fX(x) = [ Γ((m + n)/2) m^{m/2} n^{n/2} / (Γ(m/2) Γ(n/2)) ] × x^{(m/2)−1} / (mx + n)^{(m+n)/2}   for x > 0,        (3.20)

   with fX(x) = 0 for x ≤ 0.
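Definition 3.6 can likewise be checked by simulation (illustration only, with arbitrary degrees of freedom).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    m, n = 3, 8                                  # degrees of freedom (arbitrary)

    # Definition 3.6: X = (Y/m)/(Z/n) with independent chi-squared numerator and denominator.
    Y = rng.chisquare(df=m, size=200_000)
    Z = rng.chisquare(df=n, size=200_000)
    X = (Y / m) / (Z / n)

    print((X > 3.0).mean(), stats.f.sf(3.0, m, n))   # tail probabilities should agree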


3.7 Problems

1. Let Z ∼ N(0, 1) & Y = Z², and let φ(·) & Φ(·) denote the PDF & CDF respectively of the standard Normal N(0, 1) distribution.

   (a) Show that FY(y) = Φ(√y) − Φ(−√y).

   (b) Express fY(y) in terms of φ(√y).

   (c) Hence show that

           fY(y) = (1/√(2π)) y^{−1/2} e^{−y/2}   for y > 0.

   (d) Find the MGF of Y.

2. Using Formula 3.18 for the PDF of the χ² distribution, show that if X ∼ χ²_n then the MGF of X is

       MX(t) = (1 − 2t)^{−n/2}.        (3.21)

   Deduce that if X ∼ χ²_m & Y ∼ χ²_n with X ⊥⊥ Y, then (X + Y) ∼ χ²_{m+n}.

3. Given Z1, Z2 ∼ N(0, 1) (IID), what is the probability that the point (Z1, Z2) lies

   (a) in the square { (z1, z2) | −1 < z1 < 1 & −1 < z2 < 1 },
   (b) in the circle { (z1, z2) | (z1² + z2²) < 1 }?

4. Let Z1, Z2, . . . be independent random variables, each with mean 0 and variance 1, and let µi, σi and ρij be constants with −1 ≤ ρij ≤ 1. Let

       Y1 = Z1,
       Y2 = ρ12 Z1 + √(1 − ρ12²) Z2,

   and define Xi = µi + σi Yi, i = 1, 2.

   (a) Show that E[Xi] = µi, Var[Xi] = σi² (i = 1, 2), and that ρ12 is the correlation between X1 and X2.

   (b) Find constants c0, c1, c2 and c3 such that

           Y3 = c0 + c1 Z1 + c2 Z2 + c3 Z3

       has mean 0, variance 1, and correlations ρ13 & ρ23 with Y1 and Y2 respectively.

   (c) Hence show that the random vector Z = (Z1, Z2, Z3)T with zero mean vector and identity variance-covariance matrix can be transformed to give a random vector X = (X1, X2, X3)T with specified first and second moments, subject to constraints on the correlations corr[Xi, Xj] = ρij including

           ρ12² + ρ13² + ρ23² ∈ [0, 1 + 2ρ12ρ13ρ23].

   (d) What can you say about the distribution of X when Z has a standard trivariate Normal distribution and ρ12² + ρ13² + ρ23² is at one of the extremes of its allowable range (i.e. 0 or 1 + 2ρ12ρ13ρ23)?

From Warwick ST217 exam 2001

5. Let Z = (Z1, Z2, . . . , Zm+n)T ∼ MVN_{m+n}(0, I).

   (a) Describe the distribution of Y = Z / √( ∑_{i=1}^{m+n} Zi² ).

   (b) Show that the RV X = ( n ∑_{i=1}^{m} Yi² ) / ( m ∑_{i=m+1}^{m+n} Yi² ) has an F_{m,n} distribution.

   (c) Hence show that if Y = (Y1, Y2, . . . , Ym+n)T has any continuous spherically symmetric distribution centred at the origin, then X = ( n ∑_{i=1}^{m} Yi² ) / ( m ∑_{i=m+1}^{m+n} Yi² ) has an F_{m,n} distribution.

6. Suppose that X has a χ²_n distribution with PDF given by Formula 3.18. Find the mean, mode & variance of X, and an approximate variance-stabilising transformation.

7. Suppose that Yi are independent RVs with Poisson distributions: Yi ∼ Poi(λi), i = 1, . . . , k.

   (a) Assuming that λi is large, what is the approximate distribution of Zi = (Yi − λi)/√λi?

   (b) Hence or otherwise show that if all the λis are large, then the RV X = ∑_{i=1}^{k} (Yi − λi)²/λi has approximately a χ²_k distribution.

8. Suppose that the RVs Oi have independent Poisson distributions: Oi ∼ Poi(npi), i = 1, . . . , k, where ∑_{i=1}^{k} pi = 1.

   (a) Find EOi and VarOi. Hence or otherwise show that E[Oi − npi] = 0 and Var[Oi − npi] = npi.

   (b) Define the RV N by N = ∑_{i=1}^{k} Oi. What is the distribution of N?

   (c) Define the RVs Ei = Npi, i = 1, . . . , k. Show that EEi = npi and VarEi = npi².

   (d) By writing E[O1E1] = p1( E[O1²] + E[O1 ∑_{i=2}^{k} Oi] ), or otherwise, show that Cov(O1, E1) = np1².

   (e) Deduce that the RV (Oi − Ei) has mean 0 and variance npi(1 − pi) for i = 1, . . . , k.

9. (a) Define a multivariate standard Normal distribution N(0, I), where I denotes the identity matrix. Given Z = (Z1, Z2, . . . , Zn)T ∼ N(0, I), write down functions of Z (i.e. transformed random variables) having

        i. a chi-squared distribution on (n − 1) degrees of freedom, and
        ii. a t distribution on (n − 1) degrees of freedom.

   (b) Let Z = (Z1, Z2, . . . , Zn)T have a multivariate standard Normal distribution, and let Z̄ = ∑_{i=1}^{n} Zi/n.

       Also let A = (aij) be an n × n orthogonal matrix, i.e. AAT = I, and define the random vector Y = (Y1, Y2, . . . , Yn)T by Y = AZ. Quoting any properties of probability distributions that you require, show the following:

        i. Show that ∑_{i=1}^{n} Yi² = ∑_{i=1}^{n} Zi².
        ii. Show that Y ∼ N(0, I).
        iii. Show that for suitable choices of ki, i = 1, . . . , n (where ki > 0 for all i), the following matrix A is orthogonal, and find ki:

             A = ( k1      −k1       0        . . .      0              0          )
                 ( k2       k2     −2k2       . . .      0              0          )
                 ( ...      ...      ...      . . .      ...            ...        )
                 ( kn−2    kn−2     kn−2      . . .   −(n − 2)kn−2      0          )
                 ( kn−1    kn−1     kn−1      . . .     kn−1        −(n − 1)kn−1   )
                 ( kn       kn       kn       . . .      kn             kn         ).

        iv. With the above definition of A, show that ∑_{i=1}^{n−1} Yi² = ∑_{i=1}^{n} (Zi − Z̄)² and that Yn = √n Z̄.
        v. Hence show that the RVs Z̄ and ∑_{i=1}^{n} (Zi − Z̄)² are independent and have N(0, 1/n) and χ²_{n−1} distributions respectively.
        vi. Hence or otherwise show that if X1, X2, . . . , Xn ∼ N(µ, σ²) (IID), and X̄ = ∑_{i=1}^{n} Xi/n, then the random variable

                T = X̄ / √( [1/(n(n − 1))] ∑_{i=1}^{n} (Xi − X̄)² )

            has a t distribution on n − 1 degrees of freedom.

From Warwick ST217 exam 2000

10. Let z(m, n, P) denote the P% point of the F_{m,n} distribution. Without looking in statistical tables, what can you say about the relationships between the following values:

    (a) z(2, 2, 50) and z(20, 20, 50),    (b) z(2, 20, 50) and z(20, 2, 50),
    (c) z(2, 20, 16) and z(20, 2, 84),    (d) z(20, 20, 2.5) and z(20, 20, 97.5)?

11. Suppose that Zi ∼ N(0, 1) (IID), i = 1, 2, . . . What is the distribution of the following RVs?

    (a) X1 = Z1 + Z2 − Z3

    (b) X2 = (Z1 + Z2) / (Z1 − Z2)

    (c) X3 = (Z1 − Z2)² / (Z1 + Z2)²

    (d) X4 = [ (Z1 + Z2)² + (Z1 − Z2)² ] / 2

    (e) X5 = 2Z1 / √( Z2² + Z3² + Z4² + Z5² )

    (f) X6 = (Z1 + Z2 + Z3) / √( Z4² + Z5² + Z6² )

    (g) X7 = 3(Z1 + Z2 + Z3 + Z4)² / [ (Z1 + Z2 − Z3 − Z4)² + (Z1 − Z2 + Z3 − Z4)² + (Z1 − Z2 − Z3 + Z4)² ]

    (h) X8 = 2Z1² + (Z2 + Z3)²

12. For each of the RVs Xi defined in the previous question, use Statistical Tables to find ci (i = 1, . . . , 8) such that Pr(Xi > ci) = 0.95.

13. Show that the PDFs of the t and F distributions (definitions 3.5 & 3.6) are indeed given by formulae 3.19 & 3.20.

14. (a) Define the Standard Multivariate Normal distribution MVN(0, I).

    (b) Given Z = (Z1, Z2, . . . , Zm+n)T ∼ MVN(0, I), write down transformed random variables X(Z), T(Z) and Y(Z) with the following distributions:

        i. X ∼ χ²_n,
        ii. T ∼ t_n,
        iii. Y ∼ F_{m,n}.

    (c) Given that the PDF of X ∼ χ²_n is

            fX(x) = [ 1/(2^{n/2} Γ(n/2)) ] x^{(n/2)−1} e^{−x/2}   for x > 0,

        and fX(x) = 0 elsewhere, show that

        i. E[X] = n,
        ii. E[X²] = n² + 2n, and
        iii. E[1/X] = 1/(n − 2) (provided n > 2).

    (d) Hence or otherwise find

        i. the variance σ²X of X ∼ χ²_n,
        ii. the mean µY of Y ∼ F_{m,n} and
        iii. the mean µT and variance σ²T of T ∼ t_n,

        stating under what conditions σ²X, µY, µT and σ²T exist.

From Warwick ST217 exam 1998

Theory is often just practice with the hard bits left out.
        J. M. Robson

Get a bunch of those 3-D glasses and wear them at the same time. Use enough to get it up to a good, say, 10- or 12-D.
        Rod Schmidt

The Normal . . . is the Ordinary made beautiful; it is also the Average made lethal.
        Peter Shaffer

Symmetry, as wide or as narrow as you define its meaning, is one idea by which man through the ages has tried to comprehend and create order, beauty and perfection.
        Hermann Weyl


Chapter 4

Inference for Multiparameter Models

4.1 Introduction: General Concepts

4.1.1 Modelling

Given a random vector X = (X1, X2, . . . , Xp), we can describe the joint distribution of the Xis by the CDF FX(x) or, usually more conveniently, by the PMF or PDF fX(x).

Interrelationships between the Xis can be described using

1. marginals Fi(xi), fi(xi), Fij(xi, xj), etc.,

2. conditionals Gi(xi | xj, j ≠ i), gi(xi | xj, j ≠ i), Gij(xi, xj | xk, k ≠ i, j), etc.,

3. conditional expectations E[Xi |Xj ], Var[Xi |Xj ], etc.

Often FX(x) is assumed to lie in a family of probability distributions:

    F = { F(x|θ) | θ ∈ ΩΘ }        (4.1)

where ΩΘ is the 'parameter space'.

The process of formulating, choosing within, & checking the reasonableness of, such families F, is called statistical modelling (or probability modelling, or just modelling).

Exercise 4.1
The data-set in Table 4.1, plotted in Figure 1.1 (page 2), shows patients' blood pressures before and after treatment. Suggest some reasonable models for the data.

4.1.2 Data

In practice, we typically have a set of data in which d variables are measured on each of n 'cases' (or 'individuals' or 'units'):

    D  =            var.1   var.2   · · ·   var.d
          case.1     x11     x12    · · ·    x1d
          case.2     x21     x22    · · ·    x2d
            ...      ...     ...    . . .    ...
          case.n     xn1     xn2    · · ·    xnd        (4.2)

    Patient        Systolic                   Diastolic
    Number     before  after  change      before  after  change
       1         210    201     -9          130    125     -5
       2         169    165     -4          122    121     -1
       3         187    166    -21          124    121     -3
       4         160    157     -3          104    106      2
       5         167    147    -20          112    101    -11
       6         176    145    -31          101     85    -16
       7         185    168    -17          121     98    -23
       8         206    180    -26          124    105    -19
       9         173    147    -26          115    103    -12
      10         146    136    -10          102     98     -4
      11         174    151    -23           98     90     -8
      12         201    168    -33          119     98    -21
      13         198    179    -19          106    110      4
      14         148    129    -19          107    103     -4
      15         154    131    -23          100     82    -18

Table 4.1: Supine systolic and diastolic blood pressures of 15 patients with moderate hypertension (high blood pressure), immediately before and 2 hours after taking 25mg of the drug captopril.

Data from HSDS, set 72

Definition 4.1 (Data Matrix)
A set of data D arranged in the form of 4.2 is called a data matrix or a cases-by-variables array.

The data-set D is assumed to be a representative sample (of size n) from an underlying population of potential cases. This population may be actual, e.g. the resident population of England & Wales at noon on June 30th 1993, or purely theoretical/hypothetical, e.g. MVN(µ, V).

Exercise 4.2
Table 4.2 presents data on ten asthmatic subjects, each tested with 4 drugs. Describe various ways that the data might be set out as a data matrix for analysis by a statistical computing package.

4.1.3 Statistical Inference

Statistical inference is the art/science of using the sample to learn about the population (and hence, implicitly, about future samples).

Typically we use statistics (properties of the sample) to learn about parameters (properties of the population).

This activity might be:

1. Part of analysing a formal probability model,

   e.g. calculating the MLEs θ̂ of θ, after making an assumption as in Expression 4.1, or

2. Purely to summarise the data as a part of ‘data analysis’ (Section 4.2),

For example, given X1, X2, . . . , Xn ∼ FX (IID, FX unknown), the statistics

    S1 = (1/n) ∑ Xi = X̄,   S2 = (1/n) ∑ (Xi − X̄)²,   S3 = (1/n) ∑ (Xi − X̄)³

                                       Patient number
    Drug   Time          1     2     3     4     5     6     7     8     9    10
    P      −5 mins     0.0   2.3   2.4   1.9   1.6   4.8   0.6   2.7   0.9   1.3
           +15 mins    3.8   9.2   5.4   3.3   4.2  15.1   1.3   6.7   4.2   3.1
    C      −5 mins     0.5   1.0   2.0   1.1   2.1   6.8   0.6   3.1   1.5   3.0
           +15 mins    2.0   5.3   7.5   6.4   4.1   9.1   0.6  14.8   2.4   2.3
    D      −5 mins     0.8   2.3   0.8   0.8   1.2   9.6   1.1   9.7   0.8   4.9
           +15 mins    2.4   4.8   2.4   1.9   1.2  12.5   1.7  12.5   4.3   8.1
    K      −5 mins     0.2   1.7   2.2   0.1   1.7   9.2   0.6  12.7   1.1   2.8
           +15 mins    0.4   3.4   2.0   1.3   3.4   6.7   1.1  12.5   2.7   5.7

Table 4.2: NCF (Neutrophil Chemotactic Factor) of ten individuals, each tested with 4 drugs: P (Placebo), C (Clemastine), D (DSCG), K (Ketotifen). On a given day, an individual was administered the chosen drug, and his NCF measured 5 minutes before, and 15 minutes after, being given a 'challenge' of allergen.

Data from Dr. R. Morgan of Bart’s Hospital

provide measures of location, scale and skewness. Note that here we're implicitly estimating the corresponding population quantities

    µX = EX,   E[(X − µX)²],   E[(X − µX)³],

and using these as measures of population location, scale and skewness. Without a formal probability model, it can be hard to judge whether these or some other measures may be most appropriate.
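(For concreteness, a short Python sketch—illustrative only, with an arbitrary simulated sample—computes S1, S2 and S3 as defined above.)

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.exponential(scale=2.0, size=200)    # an arbitrary illustrative sample

    n = len(x)
    S1 = x.mean()                               # location
    S2 = ((x - S1) ** 2).mean()                 # scale (divisor n, the 'plug-in' version)
    S3 = ((x - S1) ** 3).mean()                 # skewness (third central moment)

    print(S1, S2, S3)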

In both cases, the CLT & its generalisations (to higher dimensions and to 'near-independence') show that, under reasonable conditions, the joint distribution of the statistics of interest, such as θ̂ or (S1, S2, S3), is approximately MVN. This approximation improves if

1. the sample size n →∞, and/or

2. the joint distribution of the random variables being summed (e.g. the original random vectors X1, X2, . . . , Xn) is itself close to MVN.

QUESTIONS: How should we interpret this? How should we try to link probability models to reality?

4.2 Data Analysis

Data analysis is the art of summarising data while attempting to avoid probability theory.

For example, you can calculate summary statistics such as means, medians, modes, ranges, standard deviations etc., thus summarising in a few numbers the main features of a possibly huge data-set. In particular, the (0%, 25%, 50%, 75%, 100%) points of the data distribution (i.e. minimum, lower quartile, median, upper quartile and maximum) form the five-number summary, and the inter-quartile range (IQR = upper quartile − lower quartile) is a measure of spread, covering the 'middle 50%' of the data.

These summaries can be formalised as follows

Definition 4.2 (Order statistics)
Given RVs X1, X2, . . . , Xn, one can order them and denote the smallest of the Xis by X(1), the second smallest by X(2), etc. Then X(k) is called the kth order statistic.

Thus X(1), X(2), . . . , X(n) are a permutation of X1, X2, . . . , Xn, and x(n), the observed value of X(n), denotes the largest observed value in a sample of size n.

Given ordered data x(1) ≤ x(2) ≤ · · · ≤ x(n), one can define:

Definition 4.3 (Sample median)

    xM = x((n+1)/2)                        if n is odd,
    xM = [ x(n/2) + x(n/2 + 1) ] / 2       if n is even.

We can always write xM = x(n/2 + 1/2), provided we adopt the following convention:

1. If the number in brackets is exactly half-way between two integers, then take the average of the two corresponding order statistics.

2. Otherwise round the bracketed subscript to the nearest integer, and take the corresponding order statistic.

Similarly the quartiles etc. can be formally defined as follows:

Definition 4.4 (Sample lower quartile)

    xL = x(n/4 + 1/2),

Definition 4.5 (Sample upper quartile)

    xU = x(3n/4 + 1/2),

Definition 4.6 (100p th sample percentile)

    x100p% = x(np + 1/2),

Definition 4.7 (Five number summary)

    ( x(1), xL, xM, xU, x(n) ).

4.3 Classical Inference

4.3.1 Introduction

In ‘classical statistical inference’, the typical procedure is:

1. Choose a family F of models indexed by θ (formula 4.1).

2. Assume temporarily that the true distribution lies in F, i.e. data D ∼ F(d|θ) for some true but unknown parameter vector θ ∈ ΩΘ.

3. Compare possible models according to some criterion of compatibility between the model & the data (equivalently, between the population & the sample).

4. Assess the chosen model(s), and go back to step (1) or (2) if the model proves inadequate.


Comments

1. Step 1 is a compromise between

(a) what we believe is the true underlying mechanism that produced the data, and

(b) what we can do mathematically.

If in doubt, keep it simple.

2. Step 2, by assuming a true θ exists, implicitly interprets probability as a property of Nature

e.g. a 'fair' coin is assumed to have an intrinsic property: if you toss it n times, then the proportion of 'heads' tends to 1/2 as n → ∞.

Thus probability represents a ‘long-run relative frequency’.

3. Most statistical computer packages currently use the classical approach, and we'll mainly be using classical inference in MSB.

4. There are many possible criteria at step 3. For example, hypothesis-testing and likelihood approaches are both discussed briefly below.

4.3.2 Point Estimation (Univariate)

Given RVs X = (X1, X2, . . . , Xn), a point estimator for an unknown parameter Θ ∈ ΩΘ is simply a function Θ̂(X) taking values in the parameter space ΩΘ. Once data X = x are obtained, one can calculate the corresponding point estimate θ̂ = Θ̂(x).

There are many plausible criteria for Θ̂ to be considered a 'good' estimator of Θ. For example:

1. Mean Squared Error

   One would like the mean squared error (MSE) of Θ̂ to be small whatever the true value θ of Θ, where

       MSE(Θ̂) = E[ (Θ̂ − θ)² ].        (4.3)

   In particular, an estimator Θ̂ has minimum mean squared error if

       MSE(Θ̂) = min over Θ̂′ of MSE(Θ̂′).

2. Unbiasedness

   Definition 4.8 (Bias)
   The bias of an estimator Θ̂ is

       Bias(Θ̂) = E[ Θ̂ − θ | Θ = θ ].        (4.4)

   Exercise 4.3
   Show that MSE(Θ̂) = Var(Θ̂) + (Bias Θ̂)².

   Definition 4.9 (Unbiasedness)
   An estimator Θ̂ for a parameter Θ is called unbiased if E[Θ̂ | Θ = θ] = θ for all possible true values θ of Θ.

   Example
   Given a random sample X1, X2, . . . , Xn, i.e. Xi ∼ FX(x) (IID), where FX is a member of some family F of probability distributions,

   (a) X̄ = ∑_{i=1}^{n} Xi/n is an unbiased estimate of the mean µX = EX of FX.

   (b) More generally, any statistic of the form ∑_{i=1}^{n} wi Xi, where ∑_{i=1}^{n} wi = 1, is an unbiased estimate of µX.

   (c) σ̂1² = ∑_{i=1}^{n} (Xi − X̄)²/(n − 1) is an unbiased estimate of the variance σX² of FX, but

   (d) σ̂2² = ∑_{i=1}^{n} (Xi − X̄)²/n is NOT an unbiased estimate of the variance σX² of FX.

   (A small simulation illustrating (c) & (d) appears after this list of criteria.)

3. Efficiency & Minimum Variance Unbiased Estimation

   Given two unbiased estimators Θ̂1 & Θ̂2 for a parameter Θ, the efficiency of Θ̂1 relative to Θ̂2 is defined by

       Eff(Θ̂1, Θ̂2) = Var(Θ̂1) / Var(Θ̂2).        (4.5)

   Definition 4.10 (MVUE)
   The Minimum Variance Unbiased Estimator of a parameter Θ is the unbiased estimator Θ̂, out of all possible unbiased estimators, that has minimum variance.

   Example
   Given Xi ∼ FX(x) ∈ F (IID), the family of all probability distributions with finite mean & variance, it can be shown that

   (a) X̄ is the MVUE of the mean µX = EX of FX, and

   (b) ∑_{i=1}^{n} (Xi − X̄)²/(n − 1) is the MVUE of the variance σX² of FX.

Note that there are major problems with using MVUE as a criterion for estimation:

   (a) The MVUE may not exist (e.g. in general there is no unbiased estimator for the underlying standard deviation σX of X).

   (b) The MVUE may exist but be nonsensical (see Problems).

   (c) Even if the MVUE exists and appears reasonable, other (biased) estimators may be better by other criteria, for example by having smaller mean squared error, which is much more important in practice than being unbiased.

4. Consistency

   Definition 4.11 (Consistency)
   A sequence of estimators Θ̂1, Θ̂2, . . . is consistent for Θ ∈ ΩΘ if, for all ε > 0 and for all θ ∈ ΩΘ,

       lim as n → ∞ of Pr( |Θ̂n − θ| > ε | Θ = θ ) = 0.

5. Sufficiency
   Θ̂(X1, . . . , Xn) is sufficient for Θ if the conditional distribution of (X1, . . . , Xn) given Θ = θ & Θ̂ = θ̂ does not depend on θ. See MSA.

6. Maximum likelihood
   See MSA.

7. Invariance
   See Casella & Berger, page 300.

8. The 'plug in' property
   If θ is a specified property of the CDF F(x), then θ̂ is the corresponding property of the empirical CDF

       F̂(x) = (1/n) × (number of Xi ≤ x).        (4.6)

   For example (assuming the named quantities exist):

   (a) the sample mean θ̂ = x̄ = ∑ xi/n is the plug-in estimate of the population mean θ = EX,

   (b) the sample median is the plug-in estimate of the population median F−1(0.5).
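The promised simulation for unbiasedness (illustrative only; the 'true' variance and sample size are arbitrary) compares the two variance estimators σ̂1² and σ̂2² over many repeated samples.

    import numpy as np

    rng = np.random.default_rng(6)
    n, reps = 10, 50_000
    sigma2_true = 4.0                          # arbitrary illustrative value

    X = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=(reps, n))
    xbar = X.mean(axis=1, keepdims=True)
    ss = ((X - xbar) ** 2).sum(axis=1)

    print(np.mean(ss / (n - 1)))   # close to 4.0: divisor n-1 is unbiased
    print(np.mean(ss / n))         # close to 4.0 * (n-1)/n = 3.6: divisor n is biased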

4.3.3 Hypothesis Testing (Introduction)

In this approach you

1. Choose a statistic T that has a known distribution F0(t) if the true parameter value is θ = θ0 (for some particular parameter value θ0 of interest). The statistic T should provide a measure of the discrepancy of the data D from what would be reasonable if θ = θ0.

2. Test the hypothesis 'θ = θ0' using the tail probabilities of F0.

An example is the 'Chi-squared' statistic used in MSA. Hypothesis testing will be covered in more detail in chapter 5.

Some problems with the standard hypothesis testing approach are:

1. In practice, we don't really believe that θ = θ0 is 'true' and all other possible values of θ are 'false'; instead we just wish to adopt 'θ = θ0' as a convenient assumption, because it's as good as, and simpler than, other models.

2. If we really do want to make a decision [e.g. to give drug 'A' or drug 'B' to a particular patient], then we should weigh up the possible consequences.

3. It's hard to create appropriate hypothesis tests in complex situations, such as to test whether or not θ lies in a particular subset Ω0 of the parameter space Ω.

Unfortunately, real life is a complex situation.

4.3.4 Likelihood Methods

Use the likelihood function

    L(θ; D) = Pr(D|θ)                    (discrete case)
            = (constant) × f(D|θ)        (continuous case),        (4.7)

or equivalently the log-likelihood or 'support' function

    ℓ(θ; D) = log L(θ; D)        (4.8)

as a measure of the compatibility between data D and parameter θ.

In particular, the MLE corresponds to the particular Fθ̂ ∈ F that is most compatible with the data D.

Likelihood underlies the most useful general approaches to statistics:

1. It can handle several parameters simultaneously.

2. The CLT implies that in many cases the log-likelihood will be approximately quadratic in θ (at least near the MLE).

This makes both theory and numerical computation easier.
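In practice the log-likelihood is often maximised numerically. The following Python sketch (illustrative only; the data and the N(µ, σ²) model are arbitrary choices, not from the notes) maximises ℓ(θ; D) for a Normal sample.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(7)
    data = rng.normal(loc=3.0, scale=2.0, size=50)       # illustrative sample

    def neg_loglik(params):
        """Negative log-likelihood for a N(mu, sigma^2) model."""
        mu, log_sigma = params                            # unconstrained parametrisation
        sigma = np.exp(log_sigma)
        return -np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                       - (data - mu) ** 2 / (2 * sigma ** 2))

    fit = minimize(neg_loglik, x0=[0.0, 0.0])
    mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
    print(mu_hat, sigma_hat)     # close to the sample mean and (divisor-n) standard deviation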


However, there are difficulties with basing inference solely on likelihood:

1. How should we handle ‘nuisance parameters’ (i.e. components θi that we’re not interested in)?

   Note that it makes no sense to integrate over values of θi to get a 'marginal likelihood' for the other θjs, since L(θ; d) is NOT a probability density or probability function—we would get a different marginal likelihood if we reparametrised, say by θi ↦ log θi.

2. A more fundamental problem is that likelihood takes no account of how far-fetched the model might be ('high likelihood' does NOT mean 'likely'!)

This suggests that in practice we may wish to incorporate information not contained in the likelihood:

1. Prior information/Expert opinion: Are there external reasons for doubting some values of θ more than others?

2. For decision-making: How relatively important are the possible consequences of our inferences? [e.g. an innocent person is punished / a murderer walks free].

4.4 Problems

1. How might the mortality data in Tables 1.1 and 1.2 (pages 8 & 9) be set out as a data matrix?

2. Suppose that θ̂(X1, . . . , Xn) is unbiased. Show that θ̂ is consistent iff lim as n → ∞ of Var( θ̂(X1, . . . , Xn) ) = 0.

3. Given Xi ∼ FX(x) (IID), where FX is a member of some family F of probability distributions, show that

   (a) Any statistic of the form ∑_{i=1}^{n} wi Xi, where ∑_{i=1}^{n} wi = 1, is an unbiased estimate of µX = EX,

   (b) The mean X̄ = ∑_{i=1}^{n} Xi/n is the unique UMVUE of this form,

   (c) σ̂² = ∑_{i=1}^{n} (Xi − X̄)²/(n − 1) is an unbiased estimate of the variance σX² of FX.

4. The numbers of mistakes made each lecture by a certain lecturer follow independent Poisson distributions, each with mean λ > 0.

   You decide to attend the Monday lecture, note the number of mistakes X, and use X to estimate the probability p that there will be no mistakes in the remaining two lectures that week.

   (a) Show that p = exp(−2λ).

   (b) Show that the only unbiased estimator of p (and hence, trivially, the MVUE), is

           p̂ = 1 if X is even,   p̂ = −1 if X is odd.

   (c) What is the maximum likelihood estimator of p?

   (d) Discuss (briefly) the relative merits of the MLE and the MVUE in this case.

5. Let T be an unbiased estimator for g(θ), let S be a sufficient statistic for θ, and let φ(S) = E[T |S].

Prove the Rao-Blackwell theorem:

φ(S) is also an unbiased estimator of g(θ), and Var[φ(S)|θ] ≤ Var[T |θ], for all θ,

and interpret this result.

50

Page 51: (ST217: Mathematical Statistics B)

6. (a) Explain what is meant by an unbiased estimator for an unknown parameter θ.

(b) Show, using moment generating functions or otherwise, that if X1 & X2 have independent Poisson distributions with means λ1 & λ2 respectively, then their sum (X1 + X2) follows a Poisson distribution with mean (λ1 + λ2).

(c) A particular sports game comprises four 'quarters', each lasting 15 minutes, and a statistician attending the game wishes to predict the probability p that no further goals will be scored before full time.

    The statistician assumes that the numbers Xk of goals scored in the kth quarter follow independent Poisson distributions, each with (unknown) mean λ, so that

        Pr(Xk = x) = (λ^x / x!) e^{−λ}        (k = 1, 2, 3, 4;  x = 0, 1, 2, . . .).

    Suppose that the statistician makes his prediction halfway through the match (i.e. after observing X1 = x1 & X2 = x2). Show that an unbiased estimator of p is

        T = 1 if (x1 + x2) = 0,   T = 0 otherwise.

(d) Suppose the statistician also made a prediction after 15 minutes. Show that in this case the ONLY unbiased estimator of p given X1 = x1 is

        T = 2^{x1} if x1 is even,   T = −2^{x1} if x1 is odd.

(e) What are the maximum likelihood estimators of p after 15 and after 30 minutes?

(f) Briefly compare the advantages of maximum likelihood and unbiased estimation for this situation.

From Warwick ST217 exam 1997

7. (a) Explain what is meant by a minimum variance unbiased estimator (MVUE).

(b) Let X and Y be random variables. Write down (without proof) expressions relating E[Y] and Var[Y] to the conditional moments E[Y|X] and Var[Y|X].

(c) Let S be a sufficient statistic for a parameter θ, let T be an unbiased estimator for τ(θ), and define W = E[T|S]. Show that

    i. W is an unbiased estimator for τ(θ), and
    ii. Var[W] ≤ Var[T] for all θ.

    Deduce that a MVUE, if one exists, must be a function of a sufficient statistic.

(d) Let X1, X2, . . . , Xn be IID Bernoulli random variables, i.e.

        Pr(Xi = 1) = θ,   Pr(Xi = 0) = 1 − θ,   i = 1, 2, . . . , n.

    i. Show that S = ∑_{i=1}^{n} Xi is a sufficient statistic for θ.
    ii. Define T by

            T = 1 if X1 = 1 and X2 = 0,   T = 0 otherwise.

        What is E[T]?
    iii. Find E[T|S], and hence show that S(n − S)/(n − 1) is an MVUE of Var[S] = nθ(1 − θ).

From Warwick ST217 exam 1999

8. Given Xi ∼ Poi(θ) (IID), compare the following possible estimators for θ in terms of unbiasedness, consistency, relative efficiency, etc.

       θ̂1 = X̄ = (1/n) ∑_{k=1}^{n} Xk,
       θ̂2 = (1/n) ( 100 + ∑_{k=1}^{n} Xk ),
       θ̂3 = (X2 − X1)²/2,
       θ̂4 = (1/n) ∑_{k=1}^{n} (Xk − X̄)²,
       θ̂5 = (1/(n − 1)) ∑_{k=1}^{n} (Xk − X̄)²,
       θ̂6 = (θ̂1 + θ̂5)/2,
       θ̂7 = median(X1, X2, . . . , Xn),
       θ̂8 = mode(X1, X2, . . . , Xn),
       θ̂9 = [ 2/(n(n + 1)) ] ∑_{k=1}^{n} k Xk,
       θ̂10 = (1/(n − 1)) ∑_{k=2}^{n} Xk.

9. [Light relief] Discuss the following possible defence submission at a murder trial:

   'The supposed DNA match placing the defendant at the scene of the crime would have arisen with even higher probability if the defendant had a secret identical twin [the more people with that DNA, the more chances of getting a match at the crime scene].

   'Now assume that my client has been cloned θ times, θ ∈ {0, 1, . . . , n} for some n > 0. Clearly the larger the value of θ, the higher the probability of obtaining the observed DNA results [every increase in θ means another clone who might have been at the scene of the crime].

   'Therefore the MLE of θ is n.

   'But then, even assuming somebody with my client's DNA committed this terrible crime, the probability that it was my client is only 1/(n + 1) (under reasonable assumptions).

   'Therefore you cannot say that my client is, beyond a reasonable doubt, guilty.

   'The defence rests.'

4.5 Bayesian Inference

4.5.1 Introduction

Classical inference regards probability as a property of physical objects (e.g. a ‘fair coin’).

An alternative interpretation uses probability to represent an individual's (lack of) understanding of an uncertain situation.

Examples

1. 'I have no reason to suspect that "heads" or "tails" are more likely. Therefore, by symmetry, my current probability for this particular coin's coming down "heads" is 1/2.'

2. 'I doubt the accused has any previously-unknown identical siblings. I'd bet 100,000 to 1 against' (i.e. if θ is the number of identical siblings, then my probability for θ > 0 is 1/100001).

Different people, with different knowledge, can legitimately have different probabilities for real-world events (therefore it's good discipline to say 'my probability for. . . ' rather than 'the probability of. . . ').

As you learn, your probabilities can be continually updated using Bayes' theorem, i.e.

    Pr(A|B) = Pr(B|A) × Pr(A) / Pr(B)        (4.9)

(assuming Pr(B) is positive, and using the fact that Pr(A & B) = Pr(A|B) Pr(B) = Pr(B|A) Pr(A)).

The Bayesian approach to statistical inference treats all uncertainty via probability, as follows:

1. You have a probability model for the data, with PMF p(D|Θ).

2. Your prior PMF for Θ (i.e. your PMF for Θ based on a combination of expert opinion, previous experience, and your own prejudice), is p(θ).

3. Then Bayes' theorem says

       p(θ|D) = p(D|θ) p(θ) / p(D)

   or, since once the data have been obtained p(D) is a constant,

       p(θ|D) ∝ p(D|θ) p(θ) ∝ L(θ; D) p(θ),

   i.e. 'posterior probability' ∝ 'likelihood' × 'prior'.        (4.10)

Formula 4.10 also applies in the continuous case, in which case p(·) represents a PDF.
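As a worked illustration of Formula 4.10 (a hypothetical example, not from the notes), the following Python sketch updates a Beta prior for a Binomial success probability; the prior parameters and data are arbitrary choices.

    import numpy as np
    from scipy import stats

    # Hypothetical example: Binomial data with a Beta prior on the success probability theta.
    a_prior, b_prior = 2.0, 2.0          # prior Beta(2, 2), chosen for illustration
    n_trials, successes = 20, 14         # illustrative data

    # Conjugacy: the posterior is Beta(a + successes, b + failures).
    a_post = a_prior + successes
    b_post = b_prior + (n_trials - successes)

    theta = np.linspace(0.01, 0.99, 99)
    prior = stats.beta.pdf(theta, a_prior, b_prior)
    lik = stats.binom.pmf(successes, n_trials, theta)
    post_unnorm = lik * prior                             # 'likelihood x prior'

    # Normalising the grid version reproduces the conjugate Beta posterior (up to grid error).
    post_grid = post_unnorm / (post_unnorm.sum() * (theta[1] - theta[0]))
    post_exact = stats.beta.pdf(theta, a_post, b_post)
    print(np.max(np.abs(post_grid - post_exact)))         # small numerical discrepancy only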

Comments

1. Further applications to decision theory are given in the third year course ST301.

2. Note that if θ = (θ1, θ2, . . . , θp), then p(θ|D) is a p-dimensional function, and may prove difficult to manipulate, summarise or visualise.

3. Treating all uncertainty via probability has the advantage that one-off events (e.g. management decisions, or the results of horse races) can be handled. However, it's not at all obvious that all uncertainty can be treated via probability!

4. As with Classical inference, a Bayesian analysis of a problem should involve checking whether the assumptions underlying p(D|θ) and p(θ) are reasonable, and rethinking & reanalysing the model if necessary.

Exercise 4.4
Describe the Bayesian approach to statistical inference, denoting the data by x, the prior by fΘ(θ), and the likelihood by L(θ; x) = fX|Θ(x|θ).

4.6 Nonparametric Methods

Standard Classical and Bayesian methods make strong assumptions, e.g. Xi ∼ F(x|θ) (IID) for some θ ∈ Ω.

Assumptions of independence are critical (what aspects of the problem provide information about other aspects?)

Assumptions about the form of probability distributions are often less important, at least provided the sample size n is large. However, there are exceptions to this:

1. It might be that the probability distribution encountered in practice is fundamentally different from the form assumed in our model. For example, some probability distributions are so 'heavy-tailed' that their means don't exist (e.g. the Cauchy distribution with f(x) = 1/[π(1 + x²)], x ∈ R).

2. Some data may be recorded incorrectly, or there may be a few atypically large/small data values ('outliers'), etc.

3. In any case, what if n is small and the CLT can’t be invoked?

'Nonparametric' methods don't assume that the actual probability distribution F(·|θ) lies in a particular parametric family F; instead they make more general assumptions, for example

1. ‘F (x) is symmetric about some unknown value Θ’.

Note that this may be a reasonable assumption even if EX doesn’t exist.

Θ is the (unknown) median of the population, i.e. Pr(X < Θ) = Pr(X > Θ).

Therefore one could estimate Θ by the median of the data (though better methods may exist).

2. 'F(x, y) is such that if (Xi, Yi) ∼ F (IID), (i = 1, 2), then Pr(Y1 < Y2 | X1 < X2) = 1/2'.

This is a nonparametric version of the statement ‘X & Y are uncorrelated’.

Many statistical methods involve estimating means, as we'll see in the rest of the course (t-tests, linear regression, many MLEs etc.)

Corresponding nonparametric methods typically involve medians—or equivalently, various probabilities.

Exercise 4.5
Suppose that X has a continuous distribution. Show that a test of the statement 'median of X is θ0' is equivalent to a test of the statement 'Pr(X < θ0) = 1/2'.

If Xi are IID, what is the distribution of R = (number of Xi < θT), where θT is the true value of θ?‖

Other nonparametric methods involve ranking the data Xi: replacing the smallest Xi by 1, the next smallest by 2, etc. Classical statistical methods can then be applied to the ranks. Note that the effect of outliers will be reduced.

Example

Given data (Xi, Yi), i = 1, . . . , n from a continuous bivariate distribution, 'Spearman's rank correlation' (often written ρS) can be calculated as follows:

1. replace the Xi values by their ranks Ri,

2. similarly replace the Yi values by their ranks Si,

3. calculate the usual ('product-moment' or 'Pearson's') correlation between the Ris and Sis.
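These three steps translate directly into code. The Python sketch below (illustrative data only, not from the notes) computes ρS by ranking and then taking Pearson's correlation, and checks it against a library routine.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    x = rng.normal(size=30)
    y = x + rng.normal(scale=0.8, size=30)       # illustrative paired data

    # Steps 1-3: rank each variable (ties get averaged ranks), then Pearson's correlation.
    r = stats.rankdata(x)
    s = stats.rankdata(y)
    rho_s = np.corrcoef(r, s)[0, 1]

    print(rho_s, stats.spearmanr(x, y)[0])       # the two should agree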


Comments

1. If the distribution of the original RVs is not continuous, then some data values may be repeated ('tied ranks'). Repeated Xis are given averaged ranks (for example, if there are two Xi with the smallest value, then they are each given rank 1.5 = (1 + 2)/2).

2. If X ⊥⊥ Y, so the 'true' ρS is zero, then the distribution of the calculated ρS is easily approximated (using the standard formulae for ∑_{i=1}^{n} i^k).

3. ‘Easily approximated’ does not necessarily mean ‘well approximated’ !

4. Most books give another formula for ρS, which is equivalent unless there are tied ranks, but which obscures the relationship with the standard product-moment correlation

       ρ = ∑ (xi − x̄)(yi − ȳ) / √[ ∑ (xi − x̄)² ∑ (yi − ȳ)² ].

5. Other, perhaps better, types of nonparametric correlation have been defined (‘Kendall’s τ ’).

4.7 Graphical Methods

A vital part of data analysis is to plot the data using bar-charts, histograms, scatter diagrams etc. Plotting the data is important no matter what further formal statistical methods will be used:

1. It enables you to ‘get a feel for’ the data,

2. It helps you look for patterns and anomalies,

3. It helps in checking assumptions (such as independence, linearity or Normality).

Many useful plots can be easily churned out using a computer, though sometimes you have to devise original plots to display the data in the most appropriate way.

Exercise 4.6
The following table shows 66 measurements on the speed of light, made by S. Newcomb in 1882. Values are the times in nanoseconds (ns), less 24,800 ns, for light to travel from his laboratory to a mirror and back. Values are to be read row-by-row, thus the first two observations are 24,828 ns and 24,826 ns.

    28  26  33  24  34 -44  27  16  40  -2
    29  22  24  21  25  30  23  29  31  19
    24  20  36  32  36  28  25  21  28  29
    37  25  28  26  30  32  36  26  30  22
    36  23  27  27  28  27  31  27  26  33
    26  32  32  24  39  28  24  25  32  25
    29  27  28  29  16  23

Produce a histogram, a Normal probability plot and a time plot of Newcomb's data. Decide which (if any) observations to ignore, and produce a normal probability plot of the remaining reduced data set. Finally compare the mean of this reduced data set with (i) the mean and (ii) the 10% trimmed mean of the original data.

Solution: Plots are shown in Figure 4.1. There are clearly 2 large outliers, but the time plot also suggests that the 6th to 10th observations are unusually variable, and that the last two observations are atypically low (both being lower than the previous 20 observations).

The Normal probability plot is calculated by calculating y(i) (the sorted data) and zi as follows, and plotting y(i) against zi.

     i    y(i)    xi = (i − 0.5)/(n + 1)    zi = Φ−1(xi)
     1    −44            0.0075                −2.434
     2     −2            0.0224                −2.007
     3     16            0.0373                −1.783
     4     16            0.0522                −1.624
    ...    ...             ...                   ...
    65     39            0.9776                 2.007
    66     40            0.9925                 2.434

Omitting the first 10 and the last 2 recorded observations leaves a data-set where the Normality and independence assumptions are much more reasonable—see plot (d) of Figure 4.1.

Location estimates are (i) 26.2, (ii) 27.4, (iii) 27.9. The trimmed mean is reasonably close to the mean of observations 11–64.
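(These location estimates are easy to reproduce. The Python sketch below uses the 66 values tabulated above; the exact trimming convention used by the library may differ slightly from the one used in the notes, so the trimmed mean should be close to, rather than identical to, the value quoted.)

    import numpy as np
    from scipy import stats

    # Newcomb's 66 recorded values (times less 24,800 ns), read row by row.
    newcomb = np.array([
        28, 26, 33, 24, 34, -44, 27, 16, 40, -2,
        29, 22, 24, 21, 25, 30, 23, 29, 31, 19,
        24, 20, 36, 32, 36, 28, 25, 21, 28, 29,
        37, 25, 28, 26, 30, 32, 36, 26, 30, 22,
        36, 23, 27, 27, 28, 27, 31, 27, 26, 33,
        26, 32, 32, 24, 39, 28, 24, 25, 32, 25,
        29, 27, 28, 29, 16, 23])

    print(newcomb.mean())                         # mean of all 66 observations
    print(stats.trim_mean(newcomb, 0.10))         # 10% trimmed mean
    print(newcomb[10:64].mean())                  # mean after dropping the first 10 and last 2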

Figure 4.1: Plots of Newcomb's data: (a) histogram, (b) Normal probability plot, (c) time plot, (d) Normal probability plot of data after excluding the first 10 and last 2 observations.

4.8 Bootstrapping

'Bootstrap' methods have become increasingly used over the past few years. They address the general question:

'What are the properties of the calculated statistics (e.g. MLEs θ̂) given that the underlying distributional assumptions may be false (and, in reality, will be false)?'

Bootstrapping uses the observed data directly as an estimate of the underlying population, then uses 'plug-in' estimation, and typically involves computer simulation.

Several other computer-intensive approaches to statistical inference have also become very popular recently.
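A minimal bootstrap sketch (illustrative only, not from the notes) for the mean change in diastolic blood pressure, using the values from Table 4.1:

    import numpy as np

    rng = np.random.default_rng(9)

    # Changes in diastolic blood pressure (after - before) from Table 4.1.
    delta_d = np.array([-5, -1, -3, 2, -11, -16, -23, -19, -12, -4, -8, -21, 4, -4, -18])

    # Nonparametric bootstrap: resample the data with replacement and recompute the statistic.
    B = 10_000
    boot_means = np.array([rng.choice(delta_d, size=len(delta_d), replace=True).mean()
                           for _ in range(B)])

    print(delta_d.mean())                              # observed mean change
    print(boot_means.std())                            # bootstrap standard error of the mean
    print(np.percentile(boot_means, [2.5, 97.5]))      # simple percentile interval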

4.9 Problems

1. [Light relief] Discuss the following quote:

'As a statistician, I want to use mathematics to help deal with practical uncertainty. The natural mathematical way to handle uncertainty is via probability.

'About the simplest practical probability statement I can think of is "The probability that a fair coin, tossed at random, will come down 'heads' is 1/2".

'Now try to define "fair coin", "at random" and "probability 1/2" without using subjective probability or circular definitions.

‘Summary: if a practical probability statement is not subjective, then it must be tautologous, ill-defined, or useless.

'Of course, for balance, some of the time I teach subjective methods, and some of the time I teach useless methods :-).'

Ewart Shaw (Internet posting 13–Aug–1993).

2. (a) Plot the captopril data (Table 4.1), and suggest what sort of models seem reasonable.

(b) Roughly estimate from your graph(s) the effect of captopril (C) on systolic and diastolic blood pressure (SBP & DBP).

(c) Suggest a single summary measure (SBP, DBP or a combination of the two) to quantify the effect of treatment.

(d) Do you think a transformation of the data would be appropriate?

(e) Comment on the number of parameters in your model(s).

(f) Calculate ρS and ρ between ∆S, the change (after−before) in SBP, and ∆D, the change (after−before) in DBP. Suggest some advantages and disadvantages in using ρS and ρ here.

(g) Calculate some further summary statistics such as means, variances, correlations and five-number summaries, and comment on how useful they are as summaries of the data.

(h) Are there any problems in using the data to estimate the effect of captopril? What further information would be useful?

(i) What advantages/disadvantages would there be in using bootstrapping here, i.e. using the discrete distribution that assigns probability 1/15 to each of the 15 points x1 = (210, 201, 130, 125), x2 = (169, 165, 122, 121), . . . , x15 = (154, 131, 100, 82) as an estimate of the underlying population, and working out the properties of ρS, ρ, etc. based on that assumption?

Chapter 5

Hypothesis Testing

5.1 Introduction

A hypothesis is a claim about the real world; statisticians will be interested in hypotheses like:

1. ‘The probabilities of a male panda or a female panda being born are equal’,

2. 'The number of flying bombs falling on a given area of London during World War II follows a Poisson distribution',

3. ‘The mean systolic blood pressure of 35-year-old men is no higher than that of 40-year-old women’,

4. 'The mean value of Y = log(systolic blood pressure) is independent of X = age' (i.e. E[Y|X = x] = constant).

These hypotheses can be translated into statements about parameters within a probability model:

1. ‘p1 = p2’,

2. 'N ∼ Poi(λ) for some λ > 0', i.e.: pn = Pr(N = n) = λ^n exp(−λ)/n! (within the general probability model pn ≥ 0 for all n = 0, 1, . . .; ∑ pn = 1),

3. ‘θ1 ≤ θ2’ and

4. ‘β1 = 0’ (assuming the linear model E[Y |x] = β0 + β1x).

Definition 5.1 (Hypothesis test)
A hypothesis test is a procedure for deciding whether to accept a particular hypothesis as a reasonable simplifying assumption, or to reject it as unreasonable in the light of the data.

Definition 5.2 (Null hypothesis)
The null hypothesis H0 is the simplifying assumption we are considering making.

Definition 5.3 (Alternative hypothesis)
The alternative hypothesis H1 is the alternative explanation(s) we are considering for the data.

Definition 5.4 (Type I error)
A type I error is made if H0 is rejected when H0 is true.

Definition 5.5 (Type II error)
A type II error is made if H0 is accepted when H0 is false.


Comments

1. In the first example above (pandas) the null hypothesis is H0 : p1 = p2.

2. The alternative hypothesis in the first example would usually be H1 : p1 ≠ p2, though it could also be (for example)

(a) H1 : p1 < p2,

(b) H1 : p1 > p2, or

(c) H1 : p1 − p2 = δ for some specified δ ≠ 0.

5.2 Simple Hypothesis Tests

The simplest type of hypothesis testing occurs when the probability distribution giving rise to the data is specified completely under the null and alternative hypotheses.

Definition 5.6 (Simple hypotheses)
A simple hypothesis is of the form Hk : θ = θk, i.e. the probability distribution of the data is specified completely.

Definition 5.7 (Composite hypotheses)
A composite hypothesis is of the form Hk : θ ∈ Ωk, i.e. the parameter θ lies in a specified subset Ωk of the parameter space ΩΘ.

Definition 5.8 (Simple hypothesis test)
A simple hypothesis test tests a simple null hypothesis H0 : θ = θ0 against a simple alternative H1 : θ = θ1, where θ parametrises the distribution of our experimental random variables X = (X1, X2, . . . , Xn).

There may be many seemingly sensible approaches to testing a given hypothesis. A reasonable criterion for choosing between them is to attempt to minimise the chance of making a mistake: incorrectly rejecting a true null hypothesis, or incorrectly accepting a false null hypothesis.

Definition 5.9 (Size)
A test of size α is one which rejects the null hypothesis H0 : θ = θ0 in favour of the alternative H1 : θ = θ1 iff

    X ∈ Cα   where   Pr(X ∈ Cα | θ = θ0) = α

for some subset Cα of the sample space S of X.

Definition 5.10 (Critical region)
The set Cα in Definition 5.9 is called the critical region or rejection region of the test.

Definition 5.11 (Power & power function)
The power function of a test with critical region Cα is the function

    β(θ) = Pr(X ∈ Cα | θ),

and the power is β = β(θ1), i.e. the probability that we reject H0 in favour of H1 when H1 is true.

A hypothesis test typically uses a test statistic T(X), whose distribution is known under H0, and such that extreme values of T(X) are more compatible with H1 than H0.

Many useful hypothesis tests have the following form:

Definition 5.12 (Simple likelihood ratio test)
A simple likelihood ratio test (SLRT) of H0 : θ = θ0 against H1 : θ = θ1 rejects H0 iff

    X ∈ C*α = { x | L(θ0; x)/L(θ1; x) ≤ Aα }

where L(θ; x) is the likelihood of θ given the data x, and the number Aα is chosen so that the size of the test is α.

Exercise 5.1
Suppose that X1, X2, . . . , Xn ∼ N(θ, 1) (IID). Show that the likelihood ratio for testing H0 : θ = 0 against H1 : θ = 1 can be written

    λ(x) = exp[ n(x̄ − 1/2) ].

Hence show that the corresponding SLRT of size α rejects H0 when the test statistic T(X) = X̄ satisfies T > Φ−1(1 − α)/√n.

Comments

1. For a simple hypothesis test, both H0 and H1 are 'point hypotheses', each specifying a particular value for the parameter θ rather than a region of the parameter space.

2. The size α is the probability of rejecting H0 when H0 is in fact true; clearly we want α to be small (α = 0.05, say).

3. Clearly for a fixed size α of test, the larger the power β of a test the better.

   However, there is an inevitable trade-off between small size and high power (as in a jury trial: the more careful one is not to convict an innocent defendant, the more likely one is to free a guilty one by mistake).

4. In practice, no hypothesis will be precisely true, so the whole foundation of classical hypothesis testing seems suspect!

5. Regarding likelihood as a measure of compatibility between data and model, an SLRT compares the compatibility of θ0 and θ1 with the observed data x, and accepts H0 iff the ratio is sufficiently large.

6. One reason for the importance of likelihood ratio tests is the following theorem, which shows that out of all tests of a given size, an SLRT (if one exists) is 'best' in a certain sense.

Theorem 5.1 (The Neyman-Pearson lemma)
Given random variables X1, X2, . . . , Xn, with joint density f(x|θ), the simple likelihood ratio test of a fixed size α for testing H0 : θ = θ0 against H1 : θ = θ1 is at least as powerful as any other test of the same size.

Exercise 5.2 [Proof of Theorem 5.1]
Prove the Neyman-Pearson lemma.

Solution: Fix the size of the test to be α. Let A be a positive constant and C0 a subset of the sample space satisfying

1. Pr(X ∈ C0 | θ = θ0) = α,

2. X ∈ C0 if and only if L(θ0; x)/L(θ1; x) = f(x|θ0)/f(x|θ1) ≤ A.

Suppose that there exists another test of size α, defined by the critical region C1, i.e.

[Figure 5.1: Proof of Neyman-Pearson lemma — a diagram of the sample space ΩX showing the two overlapping critical regions C0 and C1, their intersection B1, and the remaining pieces B2 and B3 defined below.]

Reject H0 iff x ∈ C1, where Pr(x ∈ C1|θ = θ0) = α.

Let B1 = C0 ∩ C1, B2 = C0 ∩ C1ᶜ, B3 = C0ᶜ ∩ C1.

Note that B1 ∪ B2 = C0, B1 ∪ B3 = C1, and B1, B2 & B3 are disjoint.

Let the power of the likelihood ratio test be I0 = Pr(X ∈ C0 | θ = θ1), and the power of the other test be I1 = Pr(X ∈ C1 | θ = θ1). We want to show that I0 − I1 ≥ 0.

But

    I0 − I1 = ∫ over C0 of f(x|θ1) dx − ∫ over C1 of f(x|θ1) dx
            = ∫ over B1∪B2 of f(x|θ1) dx − ∫ over B1∪B3 of f(x|θ1) dx
            = ∫ over B2 of f(x|θ1) dx − ∫ over B3 of f(x|θ1) dx.

Also B2 ⊆ C0, so f(x|θ1) ≥ A⁻¹ f(x|θ0) for x ∈ B2; similarly B3 ⊆ C0ᶜ, so f(x|θ1) ≤ A⁻¹ f(x|θ0) for x ∈ B3.

Therefore

    I0 − I1 ≥ A⁻¹ [ ∫ over B2 of f(x|θ0) dx − ∫ over B3 of f(x|θ0) dx ]
            = A⁻¹ [ ∫ over C0 of f(x|θ0) dx − ∫ over C1 of f(x|θ0) dx ]
            = A⁻¹ [α − α] = 0

as required.‖

5.3 Simple Null, Composite Alternative

Suppose that we wish to test the simple null hypothesis H0 : θ = θ0 against the composite alternative hypothesis H1 : θ ∈ Ω1.

The easiest way to investigate this is to imagine the collection of simple hypothesis tests with null hypothesis H0 : θ = θ0 and alternative H1 : θ = θ1, where θ1 ∈ Ω1. Then, for any given θ1, an SLRT is the most powerful test for a given size α. The only problem would be if different values of θ1 result in different SLRTs.


Definition 5.13 (UMP Tests)
A hypothesis test is called a uniformly most powerful test of H0 : θ = θ0 against H1 : θ = θ1, θ1 ∈ Ω1, if

1. There exists a critical region Cα corresponding to a test of size α not depending on θ1,

2. For all values of θ1 ∈ Ω1, the critical region Cα defines a most powerful test of H0 : θ = θ0 against H1 : θ = θ1.

Exercise 5.3
Suppose that X1, X2, . . . , Xn IID∼ N(0, σ²).

1. Find the UMP test of H0 : σ² = 1 against H1 : σ² > 1.

2. Find the UMP test of H0 : σ² = 1 against H1 : σ² < 1.

3. Show that no UMP test of H0 : σ² = 1 against H1 : σ² ≠ 1 exists.‖

Comments

1. If a UMP test exists, then it is clearly the appropriate test to use.

2. Often UMP tests don’t exist!

3. A UMP test involves the data only via a likelihood ratio, so is a function of the sufficient statistics.

4. The critical region Cα therefore often has a simple form, and is usually easily found once the distribution of the sufficient statistics has been determined (hence the importance of the χ², t and F distributions).

5. The above three examples illustrate how important the form of the alternative hypothesis is. The first two are one-sided alternatives, whereas H1 : σ² ≠ 1 is a two-sided alternative hypothesis, since σ² could lie on either side of 1.

5.4 Composite Hypothesis Tests

The most general situation we’ll consider is where the parameter space Ω is divided into two subsets: Ω = Ω0 ∪ Ω1, where Ω0 ∩ Ω1 = ∅, and the hypotheses are H0 : θ ∈ Ω0, H1 : θ ∈ Ω1.

For example, one may want to test the null hypothesis that the data come from an exponential distribution against the alternative that the data come from a more general gamma distribution. Note that here, as in many other cases, dim(Ω0) < dim(Ω1) = dim(Ω).

One possible approach to this situation is to regard the maximum possible likelihood over θ ∈ Ωi as a measure of compatibility between the data and the hypothesis Hi (i = 0, 1). It’s therefore convenient to define the following:

θ̂ is the MLE of θ over the whole parameter space Ω,
θ̂0 is the MLE of θ over Ω0, i.e. under the null hypothesis H0, and
θ̂1 is the MLE of θ over Ω1, i.e. under the alternative hypothesis H1.

Note that θ̂ must therefore be the same as either θ̂0 or θ̂1, since Ω = Ω0 ∪ Ω1.

One might consider using the likelihood ratio criterion L(θ̂1; x)/L(θ̂0; x), by direct analogy with the SLRT. However, it’s generally easier to use the equivalent ratio L(θ̂; x)/L(θ̂0; x):


Definition 5.14 (Likelihood Ratio Test (LRT))
A likelihood ratio test rejects H0 : θ ∈ Ω0 in favour of the alternative H1 : θ ∈ Ω1 = Ω \ Ω0 iff

    λ(x) = L(θ̂; x)/L(θ̂0; x) ≥ λ,    (5.1)

where θ̂ is the MLE of θ over the whole parameter space Ω, θ̂0 is the MLE of θ over Ω0, and the value λ is fixed so that

    sup_{θ∈Ω0} Pr(λ(X) ≥ λ | θ) = α,

where α, the size of the test, is some chosen value.

Equivalently, the test criterion uses the log LRT statistic:

    r(x) = ℓ(θ̂; x) − ℓ(θ̂0; x) ≥ λ′,    (5.2)

where ℓ(θ; x) = log L(θ; x), and λ′ is chosen to give chosen size α = sup_{θ∈Ω0} Pr(r(X) ≥ λ′ | θ).

Comments

1. The size α is typically chosen by convention to be 0.05 or 0.01.

2. Note that high values of the test statistic λ(x), or equivalently of r(x), are taken as evidence against the null hypothesis H0.

3. The test given in Definition 5.14 is sometimes referred to as a generalized likelihood ratio test, and the statistic in Equation 5.1 as a generalized likelihood ratio test statistic.

4. Equation 5.2 is often easier to work with than Equation 5.1—see the exercises and problems.

Exercise 5.4 [Paired t-test]
Suppose that X1, X2, . . . , Xn IID∼ N(µ, σ²), and let X̄ = ΣXi/n, S² = Σ(Xi − X̄)²/(n − 1).

What is the distribution of T = X̄/(S/√n)?

Is the test based on rejecting H0 : µ = 0 for large T a likelihood ratio test?

Assuming that the observed differences in diastolic blood pressure (after − before) are IID and Normally distributed with mean δD, use the captopril data (4.1) to test the null hypothesis H0 : δD = 0 against the alternative hypothesis H1 : δD ≠ 0.

Comment: this procedure is called the paired t test.‖

Exercise 5.5 [Two sample t-test]
Suppose X1, X2, . . . , Xm IID∼ N(µX, σ²) and Y1, Y2, . . . , Yn IID∼ N(µY, σ²).

1. Derive the LRT for testing H0 : µX = µY versus H1 : µX ≠ µY.

2. Show that the LRT can be based on the test statistic

    T = (X̄ − Ȳ) / ( Sp √(1/m + 1/n) ),    (5.3)

where

    Sp² = [ Σ_{i=1}^m (Xi − X̄)² + Σ_{i=1}^n (Yi − Ȳ)² ] / (m + n − 2).    (5.4)

3. Show that, under H0, T ∼ t_{m+n−2}.


4. Two groups of female rats were placed on diets with high and low protein content, and the gain in weight (grammes) between the 28th and 84th days of age was measured for each rat, with the following results:

    High protein diet:  134 146 104 119 124 161 107  83 113 129  97 123
    Low protein diet:    70 118 101  85 107 132  94

Using the test statistic T above, test the null hypothesis that the mean weight gain is the same under both diets.

Comment: this is called the two sample t-test, and Sp² is the pooled estimate of variance.
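As a check on the arithmetic (a sketch, not part of the original notes, assuming SciPy is available), the statistic T of Equation 5.3 can be computed for the rat data either directly or with scipy.stats.ttest_ind:

    import numpy as np
    from scipy import stats

    high = np.array([134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123])
    low = np.array([70, 118, 101, 85, 107, 132, 94])

    # Pooled variance Sp^2 and the statistic T of Equation 5.3
    sp2 = (np.sum((high - high.mean())**2) + np.sum((low - low.mean())**2)) \
          / (len(high) + len(low) - 2)
    T = (high.mean() - low.mean()) / np.sqrt(sp2 * (1/len(high) + 1/len(low)))
    df = len(high) + len(low) - 2
    print(T, df, 2 * stats.t.sf(abs(T), df))          # hand-rolled version

    print(stats.ttest_ind(high, low, equal_var=True)) # same result from SciPy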

Exercise 5.6 [F-test]
Suppose X1, X2, . . . , Xm IID∼ N(µX, σX²) and Y1, Y2, . . . , Yn IID∼ N(µY, σY²), where µX, µY, σX and σY are all unknown.

Suppose we wish to test the hypothesis H0 : σX² = σY² against the alternative H1 : σX² ≠ σY².

1. Let SX² = Σ_{i=1}^m (Xi − X̄)² and SY² = Σ_{i=1}^n (Yi − Ȳ)².

What are the distributions of SX²/σX² and SY²/σY²?

2. Under H0, what is the distribution of the statistic

    V = [ SX²/(m − 1) ] / [ SY²/(n − 1) ] ?

3. Taking values of V much larger or smaller than 1 as evidence against H0, and given data with m = 16, n = 16, Σxi = 84, Σyi = 18, Σxi² = 563, Σyi² = 72, test the null hypothesis H0.

Comment: with the alternative hypothesis H1 : σX² > σY², the above procedure is called an F test.‖
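For part 3, the statistic V can be computed from the summary statistics alone. A minimal sketch (not part of the original notes; it assumes SciPy is available):

    from scipy import stats

    m = n = 16
    sum_x, sum_y, sum_x2, sum_y2 = 84, 18, 563, 72

    SX2 = sum_x2 - sum_x**2 / m      # corrected sum of squares for the X sample
    SY2 = sum_y2 - sum_y**2 / n      # corrected sum of squares for the Y sample
    V = (SX2 / (m - 1)) / (SY2 / (n - 1))

    # Two-sided p-value: values of V much larger OR smaller than 1 count against H0
    p_two_sided = 2 * min(stats.f.sf(V, m - 1, n - 1), stats.f.cdf(V, m - 1, n - 1))
    print(V, p_two_sided)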

Even in simple cases like this, the null distribution of the log likelihood ratio test statistic r(x) (5.2) can be difficult or impossible to find analytically. Fortunately, there is a very powerful and very general theorem that gives the approximate distribution of r(x):

Theorem 5.2 (Wald's Theorem)
Let X1, X2, . . . , Xn IID∼ f(x|θ) where θ ∈ Ω, and let r(x) denote the log likelihood ratio test statistic

    r(x) = ℓ(θ̂; x) − ℓ(θ̂0; x),

where θ̂ is the MLE of θ over Ω and θ̂0 is the MLE of θ over Ω0 ⊂ Ω.
Then under reasonable conditions on the PDF (or PMF) f(·|·), the distribution of 2r(x) converges to a χ² distribution on dim(Ω) − dim(Ω0) degrees of freedom as n → ∞.

Comments

1. A proof is beyond the scope of this course, but may be found in e.g. Kendall & Stuart, 'The Advanced Theory of Statistics', Vol. II.

2. Wald's theorem implies that, provided the sample size is large, you only need tables of the χ² distribution to find the critical regions for a wide range of hypothesis tests.
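A small simulation makes the approximation concrete. The sketch below (not part of the original notes; it assumes NumPy and SciPy, and the exponential model and n = 50 are illustrative choices) tests H0 : θ = 1 against an arbitrary θ in the model f(x|θ) = θe^{-θx}, for which θ̂ = 1/x̄ and ℓ(θ) = n log θ − θ Σx:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, reps, theta0 = 50, 20_000, 1.0

    x = rng.exponential(1 / theta0, size=(reps, n))   # data generated under H0
    theta_hat = 1 / x.mean(axis=1)
    sum_x = x.sum(axis=1)
    two_r = 2 * (n * np.log(theta_hat / theta0) + (theta0 - theta_hat) * sum_x)

    # Compare the simulated upper tail with the chi-squared(1) approximation
    print("P(2r > 3.841) ~", (two_r > stats.chi2.ppf(0.95, df=1)).mean())  # near 0.05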


Another important theorem, see Problem 3.7.9, page 39, is the following:

Theorem 5.3 (Sample Mean and Variance of Xi IID∼ N(µ, σ²))
Let X1, X2, . . . , Xn IID∼ N(µ, σ²). Then

1. X̄ = ΣXi/n and Y = Σ(Xi − X̄)² are independent RVs,

2. X̄ has a N(µ, σ²/n) distribution,

3. Y/σ² has a χ²_{n−1} distribution.

Exercise 5.7
Suppose X1, X2, . . . , Xn IID∼ N(θ, 1), with hypotheses H0 : θ = 0 and H1 : θ arbitrary.

Show that 2r(x) = n x̄², and hence that Wald's theorem holds exactly in this case.‖

Exercise 5.8
Suppose now that Xi ∼ N(θi, 1), i = 1, . . . , n are independent, with null hypothesis H0 : θi = θ ∀i and alternative hypothesis H1 : θi arbitrary.

Show that 2r(x) = Σ_{i=1}^n (xi − x̄)², and hence (quoting any other theorems you need) that Wald's theorem again holds exactly.

5.5 Problems

1. Suppose that X ∼ Bin(n, p). Under the null hypothesis H0 : p = p0, what are EX and VarX?

Show that if n is large and p0 is not too close to 0 or 1, then

    (X/n − p0) / √( p0(1 − p0)/n ) ∼ N(0, 1) approximately.

Out of 1000 tosses of a given coin, 560 were heads and 440 were tails. Is it reasonable to assume that the coin is fair? Justify your answer.
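The Normal-approximation statistic for the coin data is easily computed; a minimal sketch (not part of the original notes, assuming SciPy is available):

    from math import sqrt
    from scipy import stats

    n, x, p0 = 1000, 560, 0.5
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    print(z, 2 * stats.norm.sf(abs(z)))   # z is about 3.79, so 'fair coin' looks implausible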

2. Out of 370 new-born babies at a Hospital, 197 were male and 173 female.

Test the null hypothesis H0 : p < 1/2 versus H1 : p ≥ 1/2, where p denotes the probability that a baby born at the Hospital will be male.

Discuss any assumptions you make.

3. X is a single observation whose density is given by

    f(x) = { (1 + θ) x^θ   if 0 < x < 1,
           { 0             otherwise.

Find the most powerful size α test of H0 : θ = 0 against H1 : θ = 1.

Is there a U.M.P. test of H0 : θ ≤ 0 against H1 : θ > 0? If so, what is it?

4. Suppose X1, X2, . . . , Xn IID∼ N(µ, σ²) with null hypothesis H0 : σ² = 1 and alternative H1 : σ² is arbitrary. Show that the LRT will reject H0 for large values of the test statistic 2r(x) = n(v − 1 − log v), where v = Σ_{i=1}^n (xi − x̄)²/n.


5. Let X1, . . . , Xn be independent each with density

    f(x) = { λ x^{-2} e^{-λ/x}   if x > 0,
           { 0                   otherwise,

where λ is an unknown parameter.

(a) Show that the UMP test of H0 : λ = 1/2 against H1 : λ > 1/2 is of the form:
    'reject H0 if Σ_{i=1}^n Xi^{-1} ≤ A*', where A* is chosen to fix the size of the test.

(b) Find the distribution of Σ_{i=1}^n Xi^{-1} under the null & alternative hypotheses.

(c) You observe values 0.59, 0.36, 0.71, 0.86, 0.13, 0.01, 3.17, 1.18, 3.28, 0.49 for X1, . . . , X10.
    Test H0 against H1, & comment on the test in the light of any assumptions made.

6. (a) Define the size and power of a hypothesis test of a simple null hypothesis H0 : θ = θ0 against a simple alternative hypothesis H1 : θ = θ1.

(b) State and prove the Neyman-Pearson Lemma for continuous random variables X1, . . . , Xn when testing the null hypothesis H0 : θ = θ0 against the alternative H1 : θ = θ1.

(c) Assume that a particular bus service runs at regular intervals of θ minutes, but that you do not know θ. Assume also that the times you find you have to wait for a bus on n occasions, X1, . . . , Xn, are independent and identically distributed with density

    f(x|θ) = { θ^{-1}   if 0 ≤ x ≤ θ,
             { 0        otherwise.

i. Discuss briefly when the above assumptions would be reasonable in practice.
ii. Find the likelihood L(θ; x) for θ given the data (X1, . . . , Xn) = x = (x1, . . . , xn).
iii. Find the most powerful test of size α of the hypothesis H0 : θ = θ0 = 20 against the alternative H1 : θ = θ1 > 20.

From Warwick ST217 exam 1997

7. The following problem is quoted verbatim from Osborn (1979), 'Statistical Exercises in Medical Research':

A study of immunoglobulin levels in mycetoma patients in the Sudan involved 22 patients to be compared to 22 normal individuals. The levels of IgG recorded for the 22 mycetoma patients are shown below. The mean level for the normal individuals was calculated to be 1,477 mg/100ml before the data for this group was lost overboard from a punt on the river Nile. Use the data below to estimate the within group variance and hence perform a 't' test to investigate the significance of the difference between the mean levels of IgG in mycetoma patients and normals.

    IgG levels (mg/100ml) in 22 mycetoma patients

    1,047  1,135  1,350  1,122  1,345
    1,377  1,375    804  1,062  1,204
    1,210  1,067  1,032  1,002  1,053
    1,103    907    960    960    936
    1,270  1,230

Osborn (1979) 4.6.16

8. Let X1, X2, . . . , Xn IID∼ Exp(θ), i.e. f(x|θ) = θe^{-θx} for θ ∈ (0, ∞).

Show that a likelihood ratio test for H0 : θ ≤ θ0 versus H1 : θ > θ0 has the form:

    'Reject H0 iff θ0 x̄ < k, where k is given by α = ∫_0^{nk} (1/Γ(n)) z^{n−1} e^{−z} dz'.

Show that a test of this form is UMP for testing H0 : θ = θ0 versus H1 : θ > θ0.


9. (a) Define the size and power function of a hypothesis test procedure.

(b) State and prove the Neyman-Pearson lemma in the case of a test statistic that has a continuous distribution.

(c) Let X1, X2, . . . , Xn IID∼ N(µ, σ²), where σ² is known. Find the likelihood ratio

    fX(x|µ1) / fX(x|µ0)

and hence show that the most powerful test of size α for testing the null hypothesis H0 : µ = µ0 against the alternative H1 : µ = µ1, for some µ1 < µ0, has the form:

    'Reject H0 if X̄ < µ0 + σ Φ^{-1}(α)/√n ',

where X̄ = Σ_{i=1}^n Xi/n is the sample mean, and Φ^{-1}(α) is the 100α% point of the standard Normal N(0, 1) distribution.

(d) Define a uniformly most powerful (UMP) test, and show that the above test is UMP for testing H0 : µ = µ0 against H1 : µ < µ0.

(e) What is the UMP test of H0 : µ = µ0 against H1 : µ > µ0?

(f) Deduce that no UMP test of size α exists for testing H0 : µ = µ0 against H1 : µ ≠ µ0.

(g) What test would you choose to test H0 : µ = µ0 against H1 : µ ≠ µ0, and why?

From Warwick ST217 exam 1999

10. A group of clinicians wish to study survival after heart attack, by classifying new heart attack patients according to

(a) whether they survive at least 7 days after admission, and

(b) whether they currently smoke 10 or more cigarettes per day.

From previous experience, the clinicians predict that after N days the observed counts

                  Survive   Die
    Smoker          R1       R2
    Non-smoker      R3       R4

will follow independent Poisson distributions with means

                  Survive   Die
    Smoker          Nr1      Nr2
    Non-smoker      Nr3      Nr4

The clinicians intend to estimate the population log-odds ratio ℓ = log(r1 r4 / r2 r3) by the sample value L = log(R1 R4 / R2 R3), and they wish to choose N to give a probability 1 − β of being able to reject the hypothesis H0 : ℓ = 0 at the 100α% significance level, when the true value of ℓ is ℓ0 > 0.

Using the formula Var(f(X)) ≈ (f′(EX))² Var(X), show that L has approximate variance

    1/(N r1) + 1/(N r2) + 1/(N r3) + 1/(N r4),

and hence, assuming a Normal approximation to the distribution of L, that the required number of days is roughly

    N = (1/ℓ0²) (1/r1 + 1/r2 + 1/r3 + 1/r4) ( Φ^{-1}(α/2) + Φ^{-1}(β) )²,

where Φ is the standard Normal cumulative distribution function.

Comment critically on the clinicians' method for choosing N.

From Warwick ST332 exam 1988


11. (a) Define the size and power of a hypothesis test, and explain what is meant by a simple likelihood ratio test and by a uniformly most powerful test.

(b) Let X1, X2, . . . , Xn be independent random variables, each having a Poisson distribution with mean λ. Find the likelihood ratio test for testing H0 : λ = λ0 against H1 : λ = λ1, where λ1 > λ0. Show also that this test is uniformly most powerful.

(c) Twenty-five leaves were selected at random from each of six similar apple trees. The number of adult female European red mites on each was counted, with the following results:

    No. of mites    0   1   2   3   4   5   6   7
    Frequency      70  38  17  10   9   3   2   1

Assuming that the numbers of mites per leaf follow IID Poisson distributions, and using a Normal approximation to the Poisson distribution, carry out a test of size 0.05 of the null hypothesis H0 that the mean number of mites per leaf is 1.0, against the alternative H1 that it is greater than 1.0.

Discuss briefly whether the assumptions you have made in testing H0 appear reasonable here.

From Warwick ST217 exam 2000
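For part (c) of the problem above, under H0 the total count over the 150 leaves is Poisson with mean 150, so a Normal approximation gives a simple one-sided test. A minimal sketch (not part of the original notes; it assumes NumPy and SciPy are available):

    import numpy as np
    from scipy import stats

    counts = np.array([0, 1, 2, 3, 4, 5, 6, 7])
    freqs = np.array([70, 38, 17, 10, 9, 3, 2, 1])

    n = freqs.sum()                    # 150 leaves
    total = (counts * freqs).sum()     # total number of mites observed

    z = (total - n * 1.0) / np.sqrt(n * 1.0)    # Normal approximation, H0: mean = 1.0
    print(z, z > stats.norm.ppf(0.95))          # one-sided size-0.05 test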

12. Hypothesis test procedures can be inverted to produce confidence intervals or more generally confidence regions. Thus, given a size α test of the null hypothesis H0 : θ = θ0, the set of all values θ0 that would NOT be rejected forms a '100(1 − α)% confidence interval for θ'.

An amateur statistician argues as follows:

Suppose something starts at time t0 and ends at time t1. Then at time t ∈ (t0, t1), the ratio r of its remaining lifetime (t1 − t) to its current age (t − t0), i.e.

    r(t) = (t1 − t) / (t − t0),

is clearly a monotonic decreasing function of t. Also it is easy to check that r = 39 after (1/40)th of the total lifetime, and that r = 1/39 after (39/40)th of the total lifetime. Therefore, for 95% of something's existence, its remaining lifetime lies in the interval

    ( (t − t0)/39, 39(t − t0) ),

where t is the time under consideration, and t0 is the time the thing came into existence.

The statistician is also an amateur theologian, and firmly believes that the World came into existence 6006 years ago. Using his pet procedure outlined above, he says he is '95% confident that the World will end sometime between 154 years hence, and 234234 years hence'.

His friend, also an amateur statistician, says she has an even more general procedure to produce confidence intervals:

In any situation I simply roll an icosahedral (fair 20-sided) die. If the die shows '13' then I quote the empty set ∅ as a 95% confidence interval, otherwise I quote the whole real line R.

She rolls the die, which comes up 13. She therefore says she is '95% confident that the World ended before it even began (although presumably no-one has noticed yet).'

Discuss.


5.6 The Multinomial Distribution and χ2 Tests

5.6.1 Multinomial Data

Definition 5.15 (Multinomial Distribution)
The multinomial distribution Mn(n, θ) is a probability distribution on points y = (y1, y2, . . . , yk), where yi ∈ {0, 1, 2, . . .}, i = 1, 2, . . . , k, and Σ_{i=1}^k yi = n, with PMF

    f(y1, y2, . . . , yk) = [ n! / (y1! y2! · · · yk!) ] ∏_{i=1}^k θi^{yi}    (5.5)

where θi > 0 for i = 1, . . . , k, and Σ_{i=1}^k θi = 1.

Comments

1. The multinomial distribution arises when one has n independent observations, each classified in one of k ways (e.g. 'eye colour' classified as 'Brown', 'Blue' or 'Other'; here k = 3).

Let θi denote the probability that any given observation lies in category number i, and let Yi denote the number of observations falling in category i. Then the random vector Y = (Y1, Y2, . . . , Yk) has a Mn(n, θ) distribution.

2. A binomial distribution is the special case k = 2, and is usually parametrised by p = θ1 (so θ2 = 1−p).

Exercise 5.9
By partial differentiation of the likelihood function, show that the MLEs θ̂i of the parameters θi of the Mn(n, θ) satisfy the equations

    yi/θ̂i − yk/(1 − Σ_{j=1}^{k−1} θ̂j) = 0,    (i = 1, . . . , k − 1)

and hence that θ̂i = yi/n for i = 1, . . . , k.‖

5.6.2 Chi-Squared Tests

Suppose one wishes to test the null hypothesis H0 that, in the multinomial distribution 5.5, θ is some function θ(φ) of another parameter φ. The alternative hypothesis H1 is that θ is arbitrary.

Exercise 5.10
Suppose H0 is that X1, X2, . . . , Xn IID∼ Bin(3, φ). Let Yi (for i = 1, 2, 3, 4) denote the number of observations Xj taking value i − 1. What is the null distribution of Y = (Y1, Y2, Y3, Y4)?

The log likelihood ratio test statistic r(X) is given by

    r(X) = Σ_{i=1}^k Yi log θ̂i − Σ_{i=1}^k Yi log θi(φ̂)    (5.6)

where θ̂i = yi/n for i = 1, . . . , k.

By Wald's theorem, under H0, 2r(X) has approximately a χ² distribution:

    2 Σ_{i=1}^k Yi [ log θ̂i − log θi(φ̂) ] ∼ χ²_{k1−k0}    (5.7)

where


θ̂i = Yi/n,
k0 is the dimension of the parameter φ, and
k1 = k − 1 is the dimension of θ under the constraint Σ_{i=1}^k θi = 1.

Comments

1. In Exercise 5.10, k = 4, k0 = 1, k1 = 3, and φ̂ is the MLE of φ under H0 (for the Bin(3, φ) model, φ̂ = X̄/3, where X̄ = Σ_{i=1}^4 (i − 1)Yi / n).

We would reject H0, that the sample comes from a Bin(3, φ) distribution for some φ, if 2r(x) is greater than the 95% point of the χ²_2 distribution, where r(x) is given in Formula 5.6.

2. It is straightforward to check, using a Taylor series expansion of the log function, that provided EYi is large ∀ i,

    2 Σ_{i=1}^k Yi [ log θ̂i − log θi(φ̂) ] ≈ Σ_{i=1}^k (Yi − µi)² / µi,    (5.8)

where µi = n θi(φ̂) is the expected number of individuals (under H0) in the ith category.

Definition 5.16 (Chi-squared Goodness of Fit Statistic)

    X² = Σ_{i=1}^k (oi − ei)² / ei,    (5.9)

where oi is the observed count in the ith category and ei is the corresponding expected count under the null hypothesis, is called the χ² goodness-of-fit statistic.

Comments

1. Under H0, X² has approximately a χ² distribution with number of degrees of freedom being (number of categories) − 1 − (number of parameters estimated under H0).

This approximation works well provided all the expected counts are reasonably large (say all are at least 5).

2. This χ2 test was suggested by Karl Pearson before the theory of hypothesis testing was fully developed.


5.7 Problems

1. In a genetic experiment, peas were classified according to their shape ('round' or 'angular') and colour ('yellow' or 'green'). Out of 556 peas, 315 were round+yellow, 108 were round+green, 101 were angular+yellow and 32 were angular+green.

Test the null hypothesis that the probabilities of these four types are 9/16, 3/16, 3/16 and 1/16 respectively.
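A minimal sketch of the corresponding calculation (not part of the original notes; it assumes SciPy is available), using the goodness-of-fit statistic of Definition 5.16 with 4 − 1 = 3 degrees of freedom:

    import numpy as np
    from scipy import stats

    observed = np.array([315, 108, 101, 32])
    expected = 556 * np.array([9, 3, 3, 1]) / 16

    X2 = ((observed - expected)**2 / expected).sum()
    print(X2, stats.chi2.sf(X2, df=3))

    print(stats.chisquare(observed, expected))   # same result from SciPy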

2. A sample of 300 people was selected from a population, and classified into blood type (O/A/B/AB, and Rhesus positive/negative), as shown in the following table:

                    O    A    B   AB
    Rh positive    82   89   54   19
    Rh negative    13   27    7    9

The null hypothesis H0 is that being Rhesus negative is independent of whether an individual's blood group is O, A, B or AB. Estimate the probabilities under H0 of falling into each of the 8 categories, and hence test the hypothesis H0.
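In practice such an independence test can be carried out in one line; a sketch (not part of the original notes, assuming SciPy) using scipy.stats.chi2_contingency, which estimates the expected counts under H0 and forms the χ² statistic of Definition 5.16:

    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[82, 89, 54, 19],
                      [13, 27,  7,  9]])
    X2, p, df, expected = chi2_contingency(table, correction=False)
    print(X2, df, p)
    print(expected)    # estimated expected counts in the 8 cells under H0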

3. The random variables X1, X2, . . . , Xn are IID with Pr(Xi = j) = pj for j = 1, 2, 3, 4, where Σ pj = 1 and pj > 0 for each j = 1, 2, 3, 4.

Interest centres on the hypothesis H0 that p1 = p2 and simultaneously p3 = p4.

(a) Define the following terms

    i. a hypothesis test,
    ii. simple and composite hypotheses, and
    iii. a likelihood ratio test.

(b) Letting θ = (p1, p2, p3, p4), X = (X1, . . . , Xn)^T with observed values x = (x1, . . . , xn)^T, and letting yj denote the number of x1, x2, . . . , xn equal to j, what is the likelihood L(θ|x)?

(c) Assume the usual regularity conditions, i.e. that the distribution of −2 log Λ(x), where Λ(x) is the likelihood ratio, tends to χ²_ν as the sample size n → ∞. What are the dimension of the parameter space Ωθ and the number of degrees of freedom ν of the asymptotic chi-squared distribution?

(d) By partial differentiation of the log-likelihood, or otherwise, show that the maximum likelihood estimator of pj is yj/n.

(e) Hence show that the asymptotic test statistic for H0 : p1 = p2 and p3 = p4 is

    −2 log Λ(x) = 2 Σ_{j=1}^4 yj log(yj/mj),

where m1 = m2 = (y1 + y2)/2 and m3 = m4 = (y3 + y4)/2.

(f) In a hospital casualty unit, the numbers of limb fractures seen over a certain period of time are:

              Side
            Left   Right
    Arm       46      49
    Leg       22      32

Using the test developed above, test the hypothesis that limb fractures are equally likely to occur on the right side as on the left side.

Discuss briefly whether the assumptions underlying the test appear reasonable here.

From Warwick ST217 exam 1998
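For part (f) of the problem above, the asymptotic statistic of part (e) can be evaluated directly. A minimal sketch (not part of the original notes; it assumes NumPy and SciPy, and orders the counts as arm-left, arm-right, leg-left, leg-right):

    import numpy as np
    from scipy import stats

    y = np.array([46, 49, 22, 32])     # arm-left, arm-right, leg-left, leg-right
    m = np.array([(y[0] + y[1]) / 2, (y[0] + y[1]) / 2,
                  (y[2] + y[3]) / 2, (y[2] + y[3]) / 2])

    stat = 2 * np.sum(y * np.log(y / m))
    df = 3 - 1                          # dim of theta is 3; dim under H0 is 1
    print(stat, stats.chi2.sf(stat, df))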


Prudens quaestio dimidium scientiae.
Half of science is asking the right questions.
    Roger Bacon

We all learn by experience, and your lesson this time is that you should never lose sight of the alternative.
    Sir Arthur Conan Doyle

One forms provisional theories and then waits for time or fuller knowledge to explode them.
    Sir Arthur Conan Doyle

What used to be called prejudice is now called a null hypothesis.
    A. W. F. Edwards

The conventional view serves to protect us from the painful job of thinking.
    John Kenneth Galbraith

Science must begin with myths, and with the criticism of myths.
    Sir Karl Raimund Popper


Chapter 6

Linear Statistical Models

6.1 Introduction

Definition 6.1 (Response Variable)
A response variable is a random variable Y whose value we wish to predict.

Definition 6.2 (Explanatory Variable)An explanatory variable is a random variable X whose values can be used to predict Y .

Definition 6.3 (Linear Model)
A linear model is a prediction function for Y in terms of the values x1, x2, . . . , xk of X1, X2, . . . , Xk of the form

    E[Y | x1, x2, . . . , xk] = β0 + β1 x1 + β2 x2 + · · · + βk xk.    (6.1)

Thus if Y1, Y2, . . . , Yn are the responses for cases 1, 2, . . . , n, and xij is the value of Xj (j = 1, . . . , k) for case i, then

    E[Y | X] = Xβ    (6.2)

where

    Y = (Y1, Y2, . . . , Yn)^T is the vector of responses,

    X = (xij), where xi0 = 1 for i = 1, . . . , n, is the matrix of explanatory variables, and

    β = (β0, β1, . . . , βk)^T is the (unknown) parameter vector.


Examples

Consider the captopril data (page 44), and let

    X1 = Diastolic BP before treatment,   X2 = Systolic BP before treatment,
    X3 = Diastolic BP after treatment,    X4 = Systolic BP after treatment,
    Z1 = 2X1 + X2,                        Z2 = 2X3 + X4.

Some possible linear models of interest are:

1. Response Y = X4,

(a) explanatory variable X2 (this is a 'simple linear regression model', with just 1 explanatory variable),

(b) explanatory variable X3

(c) explanatory variables X1 and X2 (a ‘multiple regression model ’).

2. Response Y = Z2,

(a) explanatory variable Z1

(b) explanatory variables Z1 and Z1² (a 'quadratic regression model').

Note how new explanatory variables may be obtained by transforming and/or combining old ones.

3. Looking just at the interrelationship between SBP and DBP at a given time:

(a) response Y = X2, explanatory variable X1,

(b) response Y = X1, explanatory variable X2,

(c) response Y = X4, explanatory variable X3, etc.

Comments

1. A linear relationship is the simplest possible relationship between response variables and explanatory variables, so linear models are easy to understand, interpret and also to check for plausibility.

2. One can (in theory) approximate an arbitrarily complicated relationship by a linear model; for example quadratic regression can obviously be extended to 'polynomial regression'

    E[Y | x] = β0 + β1 x + β2 x² + · · · + βm x^m.

3. Linear models have nice links with

• geometry,

• linear algebra,

• conditional expectations and variances,

• the Normal distribution.

4. Distributional assumptions (if any!) will typically be made ONLY about the response variable Y, NOT about the explanatory variables.

Therefore the model makes sense even if the Xi are chosen nonrandomly ('designed experiments').

5. The response variable Y is sometimes called the 'dependent variable', and the explanatory variables are sometimes called 'predictor variables', 'regressor variables', or (very misleadingly) 'independent variables'.


6.2 Simple Linear Regression

Definition 6.4
A simple linear regression model is a linear model with one response variable Y and one explanatory variable X, i.e. a model of the form

    E[Y | x1] = β0 + β1 x1.    (6.3)

Typically in practice we have n data points (xi, yi) for i = 1, . . . , n, and we want to predict a future response Y from the corresponding observed value x of X.

Often there’s a natural candidate for which variable should be treated as the response:

1. X may precede Y in time, for example

(a) X is BP before treatment and Y is BP after treatment, or

(b) X is number of hours revision and Y is exam mark;

2. X may be in some way more fundamental, for example

(a) X is age and Y is height or

(b) X is height and Y is weight;

3. X may be easier or cheaper to observe, so we hope in future to estimate Y without measuring it.

In simple linear regression we don't know β0 or β1, but need to estimate them in order to predict Y by Ŷ = β̂0 + β̂1 x.

To make accurate predictions we require the prediction error

    Y − Ŷ = Y − (β̂0 + β̂1 x)

to be small.

This suggests that, given data (xi, yi) for i = 1, . . . , n, we should fit β0 and β1 by simultaneously making all the vertical deviations of the observed data points from the fitted line y = β̂0 + β̂1 x small.

The easiest way to do this is to minimise the sum of squared deviations Σ(yi − ŷi)², i.e. to use the 'least squares' criterion.

6.3 Method of Least Squares

For simple linear regression,

    yi = β0 + β1 xi    (i = 1, . . . , n).    (6.4)

Therefore to estimate β0 and β1 by least squares, we need to minimise

    Q = Σ_{i=1}^n [ yi − (β0 + β1 xi) ]².    (6.5)

Exercise 6.1
Show that Q in equation 6.5 is minimised at values β̂0 and β̂1 satisfying the simultaneous equations

    β̂0 n   + β̂1 Σxi  = Σyi,
    β̂0 Σxi + β̂1 Σxi² = Σxi yi,    (6.6)


and hence that

    β̂1 = ( Σxi yi − n x̄ ȳ ) / ( Σxi² − n x̄² ),    (6.7)

    β̂0 = ȳ − β̂1 x̄.    (6.8)‖
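Formulae 6.7 and 6.8 are easy to compute by hand or by machine. A minimal sketch (not part of the original notes; it assumes NumPy, and the x and y values are purely illustrative), checked against numpy.polyfit:

    import numpy as np

    def least_squares(x, y):
        """Return (beta0_hat, beta1_hat) for the simple linear regression of y on x."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
        b0 = y.mean() - b1 * x.mean()
        return b0, b1

    x = [1, 2, 3, 7, 28]                  # illustrative x values
    y = [12.7, 23.7, 26.2, 33.2, 40.0]    # illustrative responses
    print(least_squares(x, y))
    print(np.polyfit(x, y, 1))            # same line: polyfit returns [b1, b0]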

Comments

1. Forming ∂²Q/∂β0², ∂²Q/∂β1² and ∂²Q/∂β0∂β1 verifies that Q is minimised at β = β̂.

2. Equations 6.6 are called the 'normal equations' for β̂0 and β̂1 ('normal' as in 'perpendicular' rather than as in 'standard' or as in 'Normal distribution').

3. ŷ = β̂0 + β̂1 x is called the 'least squares fit' to the data.

4. From equations 6.7 and 6.8, the least squares fitted line passes through (x̄, ȳ), the centroid of the data points.

5. Concentrate on understanding and remembering the method for finding β̂, rather than on memorising the formulae 6.7 and 6.8 for β̂0 and β̂1.

6. Geometrical interpretation

We have a vector y = (y1, y2, . . . , yn)^T of observed responses, i.e. a point in n-dimensional space, together with a surface S representing possible joint predicted values under the model (for simple linear regression, it's the 2-dimensional surface β0 + β1 x for real values of β0 and β1).

Minimising Σ(yi − ŷi)² is equivalent to dropping a perpendicular from the point y to the surface S; the perpendicular hits the surface at ŷ. Thus we are literally finding the model closest to the data.

6.4 Problems

1. Show that the expression Σxi yi − n x̄ ȳ occurring in the formula for β̂1 could also be written as Σ(xi − x̄)(yi − ȳ), Σ(xi − x̄)yi, or Σxi(yi − ȳ).

2. Show that the 'residual sum of squares', Σ_{i=1}^n (yi − ŷi)², satisfies the following identity:

    Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi)² = Σ_{i=1}^n (yi − ȳ)² − β̂1 Σ_{i=1}^n (xi − x̄)(yi − ȳ).

3. For the captopril data, find the least squares lines

(a) to predict SBP before captopril from DBP before captopril,

(b) to predict SBP after captopril from DBP after captopril,

(c) to predict DBP before captopril from SBP before captopril.

Compare these three lines.

Discuss whether it is sensible to combine the before and after measurements in order to obtain a better prediction of SBP at a given time from DBP measured at that time.

4. Illustrate the geometrical interpretation of least squares (see above comments) in the following two cases

(a) model E[Y |x] = β0 + β1x with 3 data points (x1, y1), (x2, y2) and (x3, y3),

(b) model E[Y |x] = βx with 2 data points (x1, y1) and (x2, y2).

What does Pythagoras’ theorem tell us in the second case?


6.5 The Normal Linear Model (NLM)

6.5.1 Introduction

Definition 6.5 (NLM)
Given n response RVs Yi (i = 1, 2, . . . , n), with corresponding values of explanatory variables xi^T, the NLM makes the following assumptions:

1. (Conditional) Independence

The Yi are mutually independent given the xi^T.

2. Linearity

The expected value of the response variable is linearly related to the unknown parameters β:

    EYi = xi^T β.

3. Normality

The random variation Yi | xi is Normally distributed.

4. Homoscedasticity (Equal Variances)

i.e. Yi | xi ∼ N(xi^T β, σ²).

6.5.2 Matrix Formulation of NLM

The NLM for responses y = (y1, y2, . . . , yn)T can be recast as follows

1. E[Y] = Xβ for some parameter vector β = (β1, β2, . . . , βp)T ,

2. ε = Y − E[Y] ∼ MVN(0, σ2I), where I is the (n× n) identity matrix.

It can be shown that the least squares estimates of β are given by solving the simultaneous linear equations

    X^T y = X^T X β̂    (6.9)

(the normal equations), with solution (assuming that X^T X is nonsingular)

    β̂ = (X^T X)^{-1} X^T y.    (6.10)

Comments

1. Note that, by formula 6.10, each estimator β̂j is a linear combination of the Yi.

Therefore under the NLM, β̂ has a MVN distribution.

2. Even if the Normality assumption doesn't hold, the CLT implies that, provided the number n of cases is large, the distribution of the estimator β̂ will still be approximately MVN.

3. The most important assumption is independence, since it's relatively easy to modify the standard NLM to account for

• nonlinearity: transform the data, or include e.g. xij² as an explanatory variable,

• unequal variances ('heteroscedasticity'): e.g. transform from yi − ŷi to zi = (yi − ŷi)/σi,

• non-Normality: transform, or simply get more data!


4. In the general formulation the constant term β0 is omitted, though in practice the first column of the matrix X will often contain 1's and the corresponding parameter β1 will be the 'constant term'.

5. The corresponding fitted values are ŷ = Xβ̂, and the vector of residuals is r = y − ŷ, i.e. ri = yi − ŷi, where ŷi = xi^T β̂ = Σ_{j=1}^p xij β̂j.

Definition 6.6 (RSS)
The residual sum of squares (RSS) in the fitted NLM is

    s² = Σ_{i=1}^n (yi − ŷi)² = (y − Xβ̂)^T (y − Xβ̂).    (6.11)

Important Fact about the RSS

Considering the RSS s² to be the observed value of a corresponding RV S², it can be shown that

• S²/σ² ∼ χ²_{n−p},

• S² is independent of β̂.

Exercise 6.2
1. Show that the log-likelihood function for the NLM is

    (constant) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)^T (y − Xβ).    (6.12)

2. Show that the maximum likelihood estimate of β is identical to the least squares estimate.

What is the distribution of β̂?

3. Show that the MLE σ̂² of σ² is

    σ̂² = s²/n.    (6.13)

What are the mean and variance of σ̂²?

4. Show that an unbiased estimator of σ² is given by the formula

    Residual Sum of Squares / Residual Degrees of Freedom

6.5.3 Examples of the NLM

1. Simple Linear Regression (again)

    Yi = β0 + β1 xi + εi,    (6.14)

where εi IID∼ N(0, σ²).

2. Two-sample t-test

    y = (x1, x2, . . . , xm, y1, . . . , yn)^T,

    X = the (m + n) × 2 matrix whose first m rows are (1, 0) and whose last n rows are (0, 1),

    β = (β0, β1)^T,    (6.15)

and we're interested in the hypothesis H0 : (β0 − β1) = 0.


3. Paired t-test

Some quantity Y is measured on each of n individuals under 2 different conditions (e.g. drugs A and B), and we want to test whether the mean of Y can be assumed equal in both circumstances.

    y = (y11, y21, . . . , yn1, y12, y22, . . . , yn2)^T,

    X = the 2n × (n + 1) matrix whose first n rows are (ei^T, 0) and whose last n rows are (ei^T, 1), where ei is the ith unit vector of length n (so the last column is 0 for the first condition and 1 for the second),

    β = (α1, α2, . . . , αn, δ)^T,    (6.16)

where δ is the difference between the expected responses under the two conditions, and the αi are 'nuisance parameters' representing the overall level of response for the ith individual.

The null hypothesis is H0 : δ = 0.

4. Multiple Regression (example thereof)

Y = SBP after captopril, x1 = SBP before captopril, x2 = DBP before captopril,

    y = (201, 165, 166, 157, 147, 145, 168, 180, 147, 136, 151, 168, 179, 129, 131)^T,

    X = [ 1  210  130 ]
        [ 1  169  122 ]
        [ 1  187  124 ]
        [ 1  160  104 ]
        [ 1  167  112 ]
        [ 1  176  101 ]
        [ 1  185  121 ]
        [ 1  206  124 ]
        [ 1  173  115 ]
        [ 1  146  102 ]
        [ 1  174   98 ]
        [ 1  201  119 ]
        [ 1  198  106 ]
        [ 1  148  107 ]
        [ 1  154  100 ],

    β = (β0, β1, β2)^T,    (6.17)

where (roughly speaking) β1 represents the increase in EY per unit increase in SBP before captopril (x1), allowing for the fact that EY also depends partly on DBP before captopril (x2), and β2 has a similar interpretation in terms of the effect of x2 allowing for x1.

In all the above examples, it's straightforward to calculate β̂ = (X^T X)^{-1} X^T y, and also (for example) to calculate the sampling distribution of β̂i under the null hypothesis H0 : βi = 0.

Exercise 6.3
Verify the following calculations from the data given in 6.17 above:

    X^T X = [   15     2654     1685 ]
            [ 2654   475502   300137 ]
            [ 1685   300137   190817 ],

    X^T y = (2370, 424523, 268373)^T,

    (X^T X)^{-1} = [  8.563      −0.009165   −0.06120   ]
                   [ −0.009165    0.0003026  −0.0003951 ]
                   [ −0.06120    −0.0003951   0.001167  ],

    β̂ = (−20.7, 0.724, 0.450)^T.
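These calculations are easily checked by machine. A minimal sketch (not part of the original notes; it assumes NumPy is available), using the captopril design matrix and response from (6.17):

    import numpy as np

    y = np.array([201, 165, 166, 157, 147, 145, 168, 180, 147, 136, 151, 168, 179, 129, 131])
    x1 = np.array([210, 169, 187, 160, 167, 176, 185, 206, 173, 146, 174, 201, 198, 148, 154])
    x2 = np.array([130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100])
    X = np.column_stack([np.ones(15), x1, x2])

    print(X.T @ X)
    print(X.T @ y)
    print(np.linalg.inv(X.T @ X))
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares / ML estimate
    print(beta_hat)                                # approximately (-20.7, 0.724, 0.450)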


6.6 Checking Assumptions of the NLM

Clearly it's very important in practice to check that your assumptions seem reasonable; there are various ways to do this.

6.6.1 Formal hypothesis testing

χ² tests are not very powerful, but are simple and general: count the number of data points satisfying various (exhaustive & mutually exclusive) conditions, and compare with the expected counts under your assumptions.

Other tests, for example to test for Normality, have been devised. However, a general problem with statistical tests is that they don't usually suggest what to do if your null hypothesis is rejected.

Exercise 6.4How might you use a χ2 test to check whether SBP after captopril is independent of SBP before captopril?

Exercise 6.5
A possible test for linearity in the simple Normal linear regression model (i.e. the NLM with just one explanatory variable x) is to fit the quadratic NLM

    EY = β0 + β1 x + β2 x²    (6.18)

and test the null hypothesis H0 : β2 = 0.

Suppose that Y is SBP and x is dose of drug, and that you have rejected the above null hypothesis. Comment on the advisability of using Formula 6.18 for predicting Y given x.

6.6.2 Graphical Methods and Residuals

If all the assumptions of the NLM are valid, then the residuals

    ri = yi − ŷi = yi − xi^T β̂    (6.19)

should resemble observations on IID Normal random variables.

Therefore plots of ri against ANYTHING should be patternless

SEE LECTURE

Comments

1. Before fitting a formal statistical model (including e.g. performing a t-test), you should plot the data, particularly the response variable against each explanatory variable.

2. After fitting a model, produce several residual plots. The computer is your friend!

3. Note that it's the residual plots that are most informative. For example, the NLM DOESN'T assume that the Yi are Normally distributed about µY, but DOES assume that each Yi is Normally distributed about E[Yi | xi].

i.e. it's the conditional distributions, not the marginal distributions, that are important.
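A minimal sketch of such a residual plot (not part of the original notes; it assumes NumPy and Matplotlib, and the data are simulated from a true NLM purely to show the kind of 'patternless' scatter one hopes to see):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    x = np.linspace(0, 10, 50)
    y = 1.0 + 2.0 * x + rng.normal(0, 1, 50)   # data generated from a true NLM

    b1, b0 = np.polyfit(x, y, 1)               # least squares fit
    fitted = b0 + b1 * x
    residuals = y - fitted

    plt.scatter(fitted, residuals)             # residuals vs fitted values
    plt.axhline(0, linestyle="--")
    plt.xlabel("fitted value"); plt.ylabel("residual")
    plt.show()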


6.7 Problems

1. Show that the following is an equivalent formulation of the two-sample t-test to that given above in Formulae 6.15:

    Y = (x1, x2, . . . , xm, y1, . . . , yn)^T,

    X = the (m + n) × 2 matrix whose first m rows are (1, 0) and whose last n rows are (1, 1),

    β = (β0, β1)^T,    (6.20)

with null hypothesis H0 : β1 = 0.

2. Independent samples of 10 U.S. men aged 25–34 years, and 15 U.S. men aged 45–54 years were taken. Their heights (in inches) were as follows:

(a) Age 25–34:
    73.3 64.8 72.1 68.9 68.7 70.4 66.8 70.7 74.4 71.8

(b) Age 45–54:
    73.2 68.5 62.4 65.5 71.3 69.5 74.5 70.6 69.3 67.1 64.7 73.0 66.7 68.1 64.3

Use a two-sample t-test to test the hypothesis that the population means of the two age-groups are equal (the 90%, 95%, 97.5%, and 99% points of the t23 distribution are 1.319, 1.714, 2.069 and 2.500 respectively).

Comment on whether the underlying assumptions of the two-sample t-test appear reasonable for this set of data.

Comment also on whether the data can be used to suggest that the population of the U.S. has (or hasn't) tended to get taller over the last 20 years.

3. Verify that the least squares estimates in simple linear regression,

    β̂1 = ( Σxi yi − n x̄ ȳ ) / ( Σxi² − n x̄² ),    β̂0 = ȳ − β̂1 x̄,

are a special case of the general formula β̂ = (X^T X)^{-1} X^T y.

4. The following data-set shows average January minimum temperature in degrees Fahrenheit (y), together with Latitude (x1) and Longitude (x2) for 28 US cities. Plot y against x1, and comment on what this plot suggests about the reasonableness of the various assumptions underlying the NLM for predicting y from x1 and x2.

     y    x1     x2      y    x1     x2      y    x1     x2
    44   31.2   88.5    38   32.9   86.8    35   33.6  112.5
    31   35.4   92.8    47   34.3  118.7    42   38.4  123.0
    15   40.7  105.3    22   41.7   73.4    26   40.5   76.3
    30   39.7   77.5    45   31.0   82.3    65   25.0   82.0
    58   26.3   80.7    37   33.9   85.0    22   43.7  117.1
    19   42.3   88.0    21   39.8   86.9    11   41.8   93.6
    22   38.1   97.6    27   39.0   86.5    45   30.8   90.2
    12   44.2   70.5    25   39.7   77.3    23   42.7   71.4
    21   43.1   83.9     2   45.9   93.9    24   39.3   90.5
     8   47.1  112.4

Data from HSDS, set 262


5. (a) Assuming the model

    E[Y | x] = β0 + β1 x,
    Var[Y | x] = σ² independently of x,

derive formulae for the least squares estimates β̂0 and β̂1 from data (xi, yi), i = 1, . . . , n. What advantages are gained if the corresponding random variables Yi | xi can be assumed to be independently Normally distributed?

(b) The following table shows the tensile strength (y) of different batches of cement after being 'cured' (dried) for various lengths of time x: 3 batches were cured for 1 day, 3 for 2 days, 5 for 3 days, etc. The batch means and standard deviations (s.d.) are also given.

    Curing time        Tensile strength
    (days) x           (kg/cm²) y                       mean    s.d.
    1                  13.0 13.3 11.8                   12.7    0.8
    2                  21.9 24.5 24.7                   23.7    1.6
    3                  29.8 28.0 24.1 24.1 26.2         26.5    2.5
    7                  32.4 30.4 34.5 33.1 35.7         33.2    2.0
    28                 41.8 42.6 40.3 35.7 37.3         40.0    3.0

Plot y against x and discuss briefly how reasonable seem each of the following assumptions:

(i) linearity: E[Yi | xi] = β0 + β1 xi for some constants β0 and β1.
(ii) independence: the Yi are mutually independent given the xi.

If conditional independence (ii) is assumed true, then how reasonable here are the further assumptions:

(iii) homoscedasticity: Var[Yi | xi] = σ² for all i = 1, . . . , n,
(iv) Normality: the random variables Yi are each Normally distributed.

Say briefly whether you consider any of the above assumptions (i)–(iv) would be more plausible following

(A) transforming from y to y′ = loge(y), and/or
(B) transforming x in an appropriate way.

NOTE: you do not need to carry out numerical calculations such as finding the least-squares fit explicitly.

From Warwick ST217 exam 2000


6. To monitor an industrial process for converting ammonia to nitric acid, the percentage of ammonia lost (y) was measured on each of 21 consecutive days, together with explanatory variables representing air flow (x1), cooling water temperature (x2) and acid concentration (x3). The data, together with the residuals after fitting the model y = 3.614 + 0.072 x1 + 0.130 x2 − 0.152 x3, are given in the following table:

    Day    y    Air Flow (x1)    Water Temp. (x2)    Acid Conc. (x3)    Resid.

1 4.2 80 27 58.9 0.323

2 3.7 80 27 58.8 −0.192

3 3.7 75 25 59.0 0.456

4 2.8 62 24 58.7 0.570

5 1.8 62 22 58.7 −0.171

6 1.8 62 23 58.7 −0.301

7 1.9 62 24 59.3 −0.239

8 2.0 62 24 59.3 −0.139

9 1.5 58 23 58.7 −0.314

10 1.4 58 18 58.0 0.127

11 1.4 58 18 58.9 0.264

12 1.3 58 17 58.8 0.278

13 1.1 58 18 58.2 −0.143

14 1.2 58 19 59.3 −0.005

15 0.8 50 18 58.9 0.236

16 0.7 50 18 58.6 0.091

17 0.8 50 19 57.2 −0.152

18 0.8 50 19 57.9 −0.046

19 0.9 50 20 58.0 −0.060

20 1.5 56 20 58.2 0.141

21 1.5 70 20 59.1 −0.724

Some residual plots are shown on the next page (Fig. 6.1).

(a) Discuss whether the pattern of residuals casts doubt on any of the assumptions underlying the Normal Linear Model (NLM). Describe any further plots or calculations that you think would help you assess whether the fitted NLM is appropriate here.


(b) Various suggestions could be made for improving the model, such as

i. transforming the response (e.g. to log y or to y/x1),
ii. transforming some or all of the explanatory variables,
iii. deleting outliers,
iv. including quadratic or even higher-order terms (e.g. x2²),
v. including interaction terms (e.g. x1 x3),
vi. carrying out a nonparametric analysis of the data,
vii. applying a bootstrap procedure,
viii. fitting a nonlinear model.

Outline the merits and disadvantages of each of these suggestions here. What would be your next step in analysing this data-set?

Figure 6.1: Residual plots

From Warwick ST217 exam 1999


7. Table 6.1, originally from Narula & Wellington (1977), shows data on selling prices of 28 houses in Erie, Pennsylvania, together with explanatory variables that could be used to predict the selling price. The variables are:

    X1 = current taxes (local, school and county) ÷ 100,
    X2 = number of bathrooms,
    X3 = lot size ÷ 1000 (square feet),
    X4 = living space ÷ 1000 (square feet),
    X5 = number of garage spaces,
    X6 = number of rooms,
    X7 = number of bedrooms,
    X8 = age of house (years),
    X9 = number of fireplaces,
    Y  = actual sale price ÷ 1000 (dollars).

Find a function of X1–X9 that predicts Y reasonably accurately (such functions are used to fix property taxes, which should be based on the current market value of each property).

       X1    X2      X3      X4    X5   X6  X7  X8  X9     Y
    4.9176  1.0   3.4720  0.9980  1.0    7   4  42   0  25.9
    5.0208  1.0   3.5310  1.5000  2.0    7   4  62   0  29.5
    4.5429  1.0   2.2750  1.1750  1.0    6   3  40   0  27.9
    4.5573  1.0   4.0500  1.2320  1.0    6   3  54   0  25.9
    5.0597  1.0   4.4550  1.1210  1.0    6   3  42   0  29.9
    3.8910  1.0   4.4550  0.9880  1.0    6   3  56   0  29.9
    5.8980  1.0   5.8500  1.2400  1.0    7   3  51   1  30.9
    5.6039  1.0   9.5200  1.5010  0.0    6   3  32   0  28.9
   15.4202  2.5   9.8000  3.4200  2.0   10   5  42   1  84.9
   14.4598  2.5  12.8000  3.0000  2.0    9   5  14   1  82.9
    5.8282  1.0   6.4350  1.2250  2.0    6   3  32   0  35.9
    5.3003  1.0   4.9883  1.5520  1.0    6   3  30   0  31.5
    6.2712  1.0   5.5200  0.9750  1.0    5   2  30   0  31.0
    5.9592  1.0   6.6660  1.1210  2.0    6   3  32   0  30.9
    5.0500  1.0   5.0000  1.0200  0.0    5   2  46   1  30.0
    8.2464  1.5   5.1500  1.6640  2.0    8   4  50   0  36.9
    6.6969  1.5   6.9020  1.4880  1.5    7   3  22   1  41.9
    7.7841  1.5   7.1020  1.3760  1.0    6   3  17   0  40.5
    9.0384  1.0   7.8000  1.5000  1.5    7   3  23   0  43.9
    5.9894  1.0   5.5200  1.2560  2.0    6   3  40   1  37.5
    7.5422  1.5   4.0000  1.6900  1.0    6   3  22   0  37.9
    8.7951  1.5   9.8900  1.8200  2.0    8   4  50   1  44.5
    6.0931  1.5   6.7265  1.6520  1.0    6   3  44   0  37.9
    8.3607  1.5   9.1500  1.7770  2.0    8   4  48   1  38.9
    8.1400  1.0   8.0000  1.5040  2.0    7   3   3   0  36.9
    9.1416  1.5   7.3262  1.8310  1.5    8   4  31   0  45.8
   12.0000  1.5   5.0000  1.2000  2.0    6   3  30   1  41.0

Table 6.1: House price data

Weisberg (1980)


8. The number of 'hits' recorded on J.E.H.Shaw's WWW homepage in late 1999 are given below. 'Local' means the homepage was accessed from within Warwick University, 'Remote' means it was accessed from outside. Data for the week beginning 7–Nov–1999 were unavailable. Note that there was an exam on Wednesday 8–Dec–1999 for the course ST104, taught by J.E.H.Shaw.

    Week              Number of Hits
    Beginning      Local   Remote   Total
    26 Sept            0      182     182
     3 Oct            35      253     288
    10 Oct           901      315    1216
    17 Oct           641      443    1084
    24 Oct          1549      525    2074
    31 Oct           823      344    1167
     7 Nov             —        —       —
    14 Nov          1136      383    1519
    21 Nov          2114      584    2698
    28 Nov          2097      536    2633
     5 Dec          3732      461    4193
    12 Dec             5      352     357
    19 Dec             0      296     296

(a) Fit a linear least-squares regression line to predict the number of remote hits (Y) in a week from the observed number x of local hits.

(b) Calculate the residuals and plot them against date. Does the plot give any evidence that the interrelationship between X and Y changes over time?

(c) Using both general considerations and residual plots, comment on how reasonable here are the assumptions underlying the simple Normal linear regression model, and suggest possible ways to improve the prediction of Y.

9. The following table shows the assets x (billions of dollars) and net income y (millions of dollars) for the 20 largest US banks in 1973.

    Bank    x      y     Bank    x      y     Bank    x     y     Bank   x     y
     1    49.0  218.8     6    14.2   63.6    11    11.6  42.9    16    6.7  42.7
     2    42.3  265.6     7    13.5   96.9    12     9.5  32.4    17    6.0  28.9
     3    36.3  170.9     8    13.4   60.9    13     9.4  68.3    18    4.6  40.7
     4    16.4   85.9     9    13.2  144.2    14     7.5  48.6    19    3.8  13.8
     5    14.9   88.1    10    11.8   53.6    15     7.2  32.2    20    3.4  22.2

(a) Plot income (y) against assets (x), and also log(income) against log(assets).

(b) Verify that the least squares fit regression lines are

    fit 1: y = 4.987 x + 7.57,
    fit 2: log(y) = 0.963 log(x) + 1.782 (Note: logs to base e),

and show the fitted lines on your plots.

(c) Produce Normal probability plots of the residuals from each fit.

(d) Which (if either) of these models would you use to describe the relationship between total assets and net income? Why?

(e) Bank number 19 (the Franklin National Bank) failed in 1974, and was the largest ever US bank to fail. Identify the point representing this bank on each of your plots, and discuss briefly whether, from the data presented, one might have expected beforehand that the Franklin National Bank was in trouble.
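The two fits quoted in part (b) can be checked quickly; a minimal sketch (not part of the original notes; it assumes NumPy, with the banks entered in the order 1–20):

    import numpy as np

    x = np.array([49.0, 42.3, 36.3, 16.4, 14.9, 14.2, 13.5, 13.4, 13.2, 11.8,
                  11.6, 9.5, 9.4, 7.5, 7.2, 6.7, 6.0, 4.6, 3.8, 3.4])
    y = np.array([218.8, 265.6, 170.9, 85.9, 88.1, 63.6, 96.9, 60.9, 144.2, 53.6,
                  42.9, 32.4, 68.3, 48.6, 32.2, 42.7, 28.9, 40.7, 13.8, 22.2])

    print(np.polyfit(x, y, 1))                  # fit 1: slope, intercept (about 4.99, 7.6)
    print(np.polyfit(np.log(x), np.log(y), 1))  # fit 2: slope, intercept (about 0.96, 1.78)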


10. The following data show the blood alcohol levels (mg/100ml) at post mortem for traffic accident victims. Blood samples in each case were taken from the leg (A) and from the heart (B). Do these results indicate that blood alcohol levels differ systematically between samples from the leg and the heart?

    Case    A     B     Case    A     B
     1     44    44      11   265   277
     2    265   269      12    27    39
     3    250   256      13    68    84
     4    153   154      14   230   228
     5     88    83      15   180   187
     6    180   185      16   149   155
     7     35    36      17   286   290
     8    494   502      18    72    80
     9    249   249      19    39    50
    10    204   208      20   272   290

Osborn (1979) 4.6.5
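Since the leg and heart samples are paired within each case, a paired t-test (cf. Exercise 5.4) is the natural analysis. A minimal sketch (not part of the original notes; it assumes SciPy is available):

    import numpy as np
    from scipy import stats

    leg =   np.array([44, 265, 250, 153, 88, 180, 35, 494, 249, 204,
                      265, 27, 68, 230, 180, 149, 286, 72, 39, 272])
    heart = np.array([44, 269, 256, 154, 83, 185, 36, 502, 249, 208,
                      277, 39, 84, 228, 187, 155, 290, 80, 50, 290])

    # t statistic and two-sided p-value for H0: mean difference = 0
    print(stats.ttest_rel(heart, leg))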

11. (a) Assume the linear model

    E[Y | X] = Xβ,
    Var[Y | X] = σ² In,

where In denotes the n × n identity matrix, and X^T X is nonsingular. By writing Y − Xβ = (Y − Xβ̂) + X(β̂ − β), or otherwise, show that for this model, the residual sum of squares

    (Y − Xβ)^T (Y − Xβ)

is minimised at β = β̂ = (X^T X)^{-1} X^T Y.

(b) Show that E[β̂] = β and that Var[β̂] = σ² (X^T X)^{-1}.

(c) Let A = X(X^T X)^{-1} X^T. Show that A and In − A are both idempotent, i.e. AA = A and (In − A)(In − A) = In − A.

(d) For the particular case of a Normal linear model, find the joint distribution of the fitted values Ŷ = Xβ̂, and show that Y − Ŷ is independent of Ŷ. Quote carefully any properties of the Normal distribution you use.

(e) For the simple linear regression model (EYi = β0 + β1 xi), write down the corresponding matrix X and vector Y, find (X^T X)^{-1}, and hence find the least squares estimates β̂0 and β̂1 and their variances.

From Warwick ST217 exam 2001


6.8 The Analysis of Variance (ANOVA)

6.8.1 One-Way Analysis of Variance: Introduction

This is a generalization of the two-sample t-test to p > 2 groups.

Suppose there are observations yij (j = 1, 2, . . . , ni) in the ith group (i = 1, 2, . . . , p), and let n = n1 + n2 + · · · + np denote the total number of observations.

Denote the corresponding RVs by Yij , and assume that Yij ∼ N (βi, σ2) independently.

Traditionally the main aim has been to test the null hypothesis

    H0 : β1 = β2 = . . . = βp,   i.e. β = β0 = (β0, β0, . . . , β0)^T.

The idea is to fit the MLEs β̂ and β̂0 and apply a likelihood ratio test, i.e. test whether the ratio

    (change in RSS) / RSS = (squared distance from ŷ to ŷ0) / (squared distance from y to ŷ)

(where ŷ and ŷ0 are the corresponding fitted values) is larger than would be expected by chance.

A useful notation for group means etc. uses overbars and '+' suffixes as follows:

    ȳi+ = (1/ni) Σ_{j=1}^{ni} yij,    ȳ++ = (1/n) Σ_{i=1}^p Σ_{j=1}^{ni} yij  ( = (1/n) Σ_{i=1}^p ni ȳi+ ),    etc.

The underlying models fit naturally in the NLM framework:

Definition 6.7 (One-Way ANOVA)The one-way ANOVA model is a NLM of the form

Y ∼ MVN(Xβ, σ2I),

where

Y =

Y1

Y2

...Yn

, X =

1 0 0 · · · 01 0 0 · · · 0...

......

. . ....

1 0 0 · · · 00 1 0 · · · 0...

......

. . ....

0 1 0 · · · 00 0 1 · · · 0...

......

......

0 0 1 · · · 0...

......

......

0 0 0 · · · 1...

......

. . ....

0 0 0 · · · 1

, β =

β1

β2

...βp

, (6.21)

where X has n1 rows of the first type, . . . np rows of the last type, and n1 + n2 + · · ·+ np = n.

Exercise 6.6
Show that for one-way ANOVA, X^T X = diag(n1, n2, . . . , np), and hence β̂ = (Ȳ1+, Ȳ2+, . . . , Ȳp+)^T.


6.8.2 One-Way Analysis of Variance: ANOVA Table

Let

    β0 = E[Ȳ++] = (1/n) Σ_{i=1}^p Σ_{j=1}^{ni} EYij = (1/n) Σ_{i=1}^p ni βi,

    αi = βi − β0    (i = 1, 2, . . . , p).

Typically the p groups correspond to p different treatments, and αi is then called the ith treatment effect.

We're interested in the hypotheses

    H0 : αi = 0 (i = 1, 2, . . . , p),
    H1 : the αi are arbitrary.

Note that

1. Ȳ++ is the MLE of the common mean β0 under H0,

2. Ȳi+ is the MLE of β0 + αi = βi, i.e. the mean response given the ith treatment.

Hence the fitted values under H0 and H1 are given by Ȳ++ and Ȳi+ respectively.

If we also include the 'null model' that all the βi are zero, then the possible models of interest are:

    Model                             # params    DF       RSS
    βi = 0 ∀i,  i.e. ŷij = 0             0        n        Σ_{i,j} yij²                (1)
    βi = β0 ∀i, i.e. ŷij = ȳ++           1        n − 1    Σ_{i,j} (yij − ȳ++)²        (2)
    βi arbitrary, i.e. ŷij = ȳi+         p        n − p    Σ_{i,j} (yij − ȳi+)²        (3)

The calculations needed to test H0, involving the RSS formulae given above, can be conveniently presented in an 'ANOVA table':

    Source of      Degrees of      Sum of squares                      Mean square
    variation      freedom (DF)    (SS)                                (MS) = SS/DF
    Overall mean   1               (1)−(2) = n ȳ++²                    n ȳ++²
    Treatment      p − 1           (2)−(3) = Σ_i ni (ȳi+ − ȳ++)²       Σ_i ni (ȳi+ − ȳ++)² / (p − 1)
    Residual       n − p           (3) = Σ_{i,j} (yij − ȳi+)²          Σ_{i,j} (yij − ȳi+)² / (n − p)
    Total          n               (1) = Σ_{i,j} yij²

Finally, calculate the 'F ratio'

    F = Treatment MS / Residual MS = [ Treatment SS/(p − 1) ] / [ Residual SS/(n − p) ]    (6.22)

which, under H0, has an F distribution on (p − 1) and (n − p) d.f. Large values of F are evidence against H0.

Note: DON'T try too hard to remember formulae for sums of squares in an ANOVA table.

Instead THINK OF THE MODELS BEING FITTED. The 'lack of fit' of each model is given by the corresponding RSS, & the formulae for the differences in RSS simplify.
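As a worked illustration (a sketch, not part of the original notes; it assumes NumPy and SciPy, and reuses the first three curing-time groups of the cement data from Problem 5 of Section 6.7 purely as example data), the ANOVA table entries and F ratio can be built directly from the group means:

    import numpy as np
    from scipy import stats

    groups = [np.array([13.0, 13.3, 11.8]),
              np.array([21.9, 24.5, 24.7]),
              np.array([29.8, 28.0, 24.1, 24.1, 26.2])]

    n = sum(len(g) for g in groups)
    p = len(groups)
    grand_mean = np.concatenate(groups).mean()

    treatment_ss = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)
    residual_ss = sum(((g - g.mean())**2).sum() for g in groups)

    F = (treatment_ss / (p - 1)) / (residual_ss / (n - p))
    print(F, stats.f.sf(F, p - 1, n - p))
    print(stats.f_oneway(*groups))    # same F ratio and p-value from SciPy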


6.9 Problems

1. Show that the formulae for sums of squares in one-way ANOVA simplify:

    Σ_{i=1}^p ni (Ȳi+ − Ȳ++)² = Σ_{i=1}^p ni Ȳi+² − n Ȳ++²,

    Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳi+)² = Σ_{i=1}^p Σ_{j=1}^{ni} Yij² − Σ_{i=1}^p ni Ȳi+².

2. (a) Define the Normal Linear Model, and describe briefly how each of its assumptions may be informally checked by plotting residuals.

(b) The following data summarise the number of days survived by mice inoculated with three strains of typhoid (31 mice with '9D', 60 mice with '11C' and 133 mice with 'DSCI').

    Days to       Numbers of Mice Inoculated with. . .
    Death          9D    11C   DSCI   Total
     2              6      1      3      10
     3              4      3      5      12
     4              9      3      5      17
     5              8      6      8      22
     6              3      6     19      28
     7              1     14     23      38
     8                    11     22      33
     9                     4     14      18
    10                     6     14      20
    11                     2      7       9
    12                     3      8      11
    13                     1      4       5
    14                            1       1

    Total          31     60    133     224
    ΣXi           125    442   1037    1604
    ΣXi²          561   3602   8961   13124

(Xi is the survival time of the ith mouse in the given group.)

Without carrying out any calculations, discuss briefly how reasonable seem the assumptions underlying a one-way ANOVA on the data, and whether a transformation of the data may be appropriate.

(c) Carry out a one-way ANOVA on the untransformed data. What do you conclude about the responses to the three strains of typhoid?

From Warwick ST217 exam 1997
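For part (c) of the problem above, everything needed for the ANOVA table is already in the summary rows (group sizes, sums and sums of squares). A minimal sketch (not part of the original notes; it assumes SciPy is available):

    from scipy import stats

    n_i =    [31, 60, 133]
    sum_x =  [125, 442, 1037]
    sum_x2 = [561, 3602, 8961]

    n = sum(n_i)
    grand_sum = sum(sum_x)

    treatment_ss = sum(s**2 / m for s, m in zip(sum_x, n_i)) - grand_sum**2 / n
    total_ss = sum(sum_x2) - grand_sum**2 / n
    residual_ss = total_ss - treatment_ss

    p = len(n_i)
    F = (treatment_ss / (p - 1)) / (residual_ss / (n - p))
    print(F, stats.f.sf(F, p - 1, n - p))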

3. The amount of nitrogen-bound bovine serum albumin produced by three groups of mice was measured. The groups were: normal mice treated with a placebo (i.e. an inert substance), alloxan-diabetic mice treated with a placebo, and alloxan-diabetic mice treated with insulin. The resulting data are shown in the following table:


    Normal        Alloxan-diabetic   Alloxan-diabetic
    + placebo     + placebo          + insulin

    156           391                 82
    282            46                100
    197           469                 98
    297            86                150
    116           174                243
    127           133                 68
    119            13                228
     29           499                131
    253           168                 73
    122            62                 18
    349           127                 20
    110           276                100
    143           176                 72
     64           146                133
     26           108                 46
    586           276                 40
    122            50                 46
    455            73                 34
    655            44
     14

(a) Produce appropriate graphical display(s) and numerical summaries of these data, and comment on what can be learnt from these.

(b) Carry out a one-way analysis of variance on the three groups. You may feel it necessary to transform the data first.

Data from HSDS, set 304

4. The following table shows measurements of the steady-state haemoglobin levels for patients with different types of sickle-cell anaemia ('HB SS', 'HB S/β-thalassaemia' and 'HB SC'). Construct an ANOVA table and hence test whether the steady-state haemoglobin levels differ between the three types.

    HB SS    HB S/β-thalassaemia    HB SC
     7.2           8.1              10.7
     7.7           9.2              11.3
     8.0          10.0              11.5
     8.1          10.4              11.6
     8.3          10.6              11.7
     8.4          10.9              11.8
     8.4          11.1              12.0
     8.5          11.9              12.1
     8.6          12.0              12.3
     8.7          12.1              12.6
     9.1                            12.6
     9.1                            13.3
     9.1                            13.3
     9.8                            13.8
    10.1                            13.9
    10.3

Data from HSDS, set 310
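A minimal sketch of the requested comparison (not part of the original notes; it assumes SciPy, and simply types in the three columns of the table above):

    from scipy import stats

    hb_ss = [7.2, 7.7, 8.0, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7,
             9.1, 9.1, 9.1, 9.8, 10.1, 10.3]
    hb_sbeta = [8.1, 9.2, 10.0, 10.4, 10.6, 10.9, 11.1, 11.9, 12.0, 12.1]
    hb_sc = [10.7, 11.3, 11.5, 11.6, 11.7, 11.8, 12.0, 12.1, 12.3, 12.6,
             12.6, 13.3, 13.3, 13.8, 13.9]

    print(stats.f_oneway(hb_ss, hb_sbeta, hb_sc))   # one-way ANOVA F ratio and p-value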


5. The data in Table 6.2, collected by Brian Everitt, are described in HSDS as being the 'weights, in kg, of young girls receiving three different treatments for anorexia over a fixed period of time with the control group receiving the standard treatment'.

(a) Using a one-way ANOVA on the weight gains, compare the three methods of treatment.

(b) Plot the data so as to clarify the effects of the three treatments, and discuss whether the above formal analysis was appropriate.

    Cognitive behavioural
    treatment              Control               Family therapy
    Weight                 Weight                Weight
    before   after         before   after        before   after

     80.5     82.2          80.7     80.2         83.8     95.2
     84.9     85.6          89.4     80.1         83.3     94.3
     81.5     81.4          91.8     86.4         86.0     91.5
     82.6     81.9          74.0     86.3         82.5     91.9
     79.9     76.4          78.1     76.1         86.7    100.3
     88.7    103.6          88.3     78.1         79.6     76.7
     94.9     98.4          87.3     75.1         76.9     76.8
     76.3     93.4          75.1     86.7         94.2    101.6
     81.0     73.4          80.6     73.5         73.4     94.9
     80.5     82.1          78.4     84.6         80.5     75.2
     85.0     96.7          77.6     77.4         81.6     77.8
     89.2     95.3          88.7     79.5         82.1     95.5
     81.3     82.4          81.3     89.6         77.6     90.7
     76.5     72.5          78.1     81.4         83.5     92.5
     70.0     90.9          70.5     81.8         89.9     93.8
     80.4     71.3          77.3     77.3         86.0     91.7
     83.3     85.4          85.2     84.2         87.3     98.0
     83.0     81.6          86.0     75.4
     87.7     89.1          84.1     79.5
     84.2     83.9          79.7     73.0
     86.4     82.7          85.5     88.3
     76.5     75.7          84.4     84.7
     80.2     82.6          79.6     81.4
     87.8    100.4          77.5     81.2
     83.3     85.2          72.3     88.2
     79.7     83.6          89.0     78.8
     84.5     84.6
     80.8     96.2
     87.4     86.7

Table 6.2: Anorexia data

Data from HSDS, set 285


6. The following data come from a study of pollution in inland waterways. In each of seven localities, five pike were caught and the log concentration of copper in their livers measured.

    Locality                   Log concentration of copper (ppm)
    1. Windermere              0.187   0.836   0.704   0.938   0.124
    2. Grassmere               0.449   0.769   0.301   0.045   0.846
    3. River Stour             0.628   0.193   0.810   0.000   0.855
    4. Wimbourne St Giles      0.412   0.286   0.497   0.417   0.337
    5. River Avon              0.243   0.258  -0.276  -0.538   0.041
    6. River Leam              0.134   0.281   0.529   0.305   0.459
    7. River Kennett           0.471   0.371   0.297   0.691   0.535

(a) The data are plotted in Figure 6.2. Discuss briefly what the plot suggests about the relative copper pollution in the various localities.

Figure 6.2: Concentration of copper in pike livers

(b) Carry out a one-way analysis of variance to test for differences between localities. Do the results of the formal analysis agree with your subjective impressions from Figure 6.2?


6.10 Two-Way Analysis of Variance

Here there are two factors (e.g. two treatments, or patient number and treatment given) that can be varied independently.

Factor A has I ‘levels’ 1, 2, . . . , I, and factor B has J ‘levels’ 1, 2, . . . , J. For example:

(a) A is patient number 1, 2, . . . , I, every patient receiving each treatment j = 1, 2, . . . , J in turn;
(b) A is treatment number 1, 2, . . . , I, and B is one of J possible supplementary treatments.

Data can be conveniently tabulated:

                          Factor B
                  1      2     . . .     J

             1   Y11    Y12    . . .    Y1J
             2   Y21    Y22    . . .    Y2J
Factor A     3   Y31    Y32    . . .    Y3J
             .    .      .     . . .     .
             .    .      .               .
             I   YI1    YI2    . . .    YIJ

i.e. there is precisely one observation Yij at each (i, j) combination of factor levels.

Again assume the NLM with

    E[Yij] = θi + φj   for i = 1, . . . , I and j = 1, . . . , J,

i.e.

    Yij ∼ N (θi + φj, σ²) independently.                                          (6.23)

A problem here is that one could transform θi ↦ θi + c and φj ↦ φj − c for each i and j, where c is arbitrary. Therefore for identifiability one needs to impose some (arbitrary) constraints.

The simplest and most symmetrical reformulation for the two-way ANOVA model is

    Yij ∼ N (µ + αi + βj, σ²),   where ∑i αi = 0 and ∑j βj = 0.                   (6.24)

Exercise 6.7
What is the matrix formulation of the model (6.24)?

Particular models of interest within the framework of Formulae (6.24) are:

(1) Yij ∼ N (0, σ²),
    RSS = ∑i,j Yij²,  DF = n = IJ.

(2) Yij ∼ N (µ, σ²),
    RSS = ∑i,j (Yij − Ȳ++)²,  DF = n − 1 = IJ − 1.

(3) Yij ∼ N (µ + αi, σ²),
    Ŷij = µ̂ + α̂i = Ȳi+.
    Therefore RSS = ∑i,j (Yij − Ȳi+)²,  DF = n − I = I(J − 1).


(4) Yij ∼ N (µ + βj, σ²),
    Ŷij = µ̂ + β̂j = Ȳ+j.
    Therefore RSS = ∑i,j (Yij − Ȳ+j)²,  DF = n − J = (I − 1)J.

(5) Yij ∼ N (µ + αi + βj, σ²),
    Ŷij = µ̂ + α̂i + β̂j = Ȳi+ + Ȳ+j − Ȳ++.
    Therefore RSS = ∑i,j (Yij − Ȳi+ − Ȳ+j + Ȳ++)²,  DF = n − I − J + 1 = (I − 1)(J − 1).

Again, we can form an ANOVA table summarising the independent ‘sources of variation’.
The degrees of freedom are the differences between the DFs associated with the various models.
The sums of squares are the differences between the SSs associated with the various models.

Source of              Degrees of        Sum of squares     Mean square
variation              freedom (DF)      (SS)               (MS)

Overall mean           1                 (1)−(2)
Effect of Factor A     I−1               (2)−(3)            ((2)−(3)) / (I−1)
Effect of Factor B     J−1               (2)−(4)            ((2)−(4)) / (J−1)
Residuals              (I−1)(J−1)        (5)                (5) / ((I−1)(J−1))

Total                  IJ = n            (1)

Table 6.3: Two-way ANOVA table

Comments

1. DeGroot gives a more general version.

2. As with one-way ANOVA, one can test H0 : αi = 0, i = 1, . . . , I, by comparing

       [(SS due to A)/(I − 1)] / [(Residual SS)/((I − 1)(J − 1))]

   with the 95% point of F(I−1),(I−1)(J−1).

3. Similarly one can test H0 : βj = 0, j = 1, . . . , J, by comparing

       [(SS due to B)/(J − 1)] / [(Residual SS)/((I − 1)(J − 1))]

   with the 95% point of F(J−1),(I−1)(J−1).

4. The above two F tests use completely separate aspects of the data (row sums of the Yij table, column sums of the Yij table).

5. The case J = 2 is equivalent to the paired t-test (Exercise 5.4).

6. As for one-way ANOVA, the formulae for sums of squares simplify:

‘sum over each observation the squared difference between the fitted values under the two models being considered’.

The residual SS is then most easily obtained by subtraction.

See problem 6.11.1
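
As a purely illustrative sketch (not part of the original notes), the following Python fragment computes the sums of squares and F ratios of Table 6.3 for a small invented I × J table; the data values are made up, and numpy and scipy are assumed to be available.

    import numpy as np
    from scipy import stats

    # Invented I x J table: rows are levels of factor A, columns levels of factor B.
    Y = np.array([[12.1, 13.4, 11.8, 12.9],
                  [14.2, 15.1, 13.9, 14.8],
                  [11.5, 12.8, 11.2, 12.4]])
    I, J = Y.shape

    grand_mean = Y.mean()        # Y-bar ++
    row_means = Y.mean(axis=1)   # Y-bar i+
    col_means = Y.mean(axis=0)   # Y-bar +j

    # Sums of squares of Table 6.3, i.e. differences of the RSS of models (2)-(5)
    SS_A = J * np.sum((row_means - grand_mean) ** 2)   # (2) - (3)
    SS_B = I * np.sum((col_means - grand_mean) ** 2)   # (2) - (4)
    SS_res = np.sum((Y - row_means[:, None] - col_means[None, :] + grand_mean) ** 2)  # (5)

    df_res = (I - 1) * (J - 1)
    F_A = (SS_A / (I - 1)) / (SS_res / df_res)
    F_B = (SS_B / (J - 1)) / (SS_res / df_res)

    # Compare each F ratio with the 95% point of the relevant F distribution
    print("Factor A:", F_A, stats.f.ppf(0.95, I - 1, df_res))
    print("Factor B:", F_B, stats.f.ppf(0.95, J - 1, df_res))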


6.11 Problems

1. For the two-way analysis of variance (Table 6.3, page 97), find simplified formulae for the sums of squares analogous to those found for the one-way ANOVA (exercise 6.9.1).

2. Three pertussis vaccines were tested on each of ten days. The following table shows estimates of the log doses of vaccine (in millions of organisms) required to protect 50% of mice against a subsequent infection with pertussis organisms.

                     Vaccine
Day        A        B        C      Total

 1       2.64     2.93     2.93     8.50
 2       2.00     2.52     2.56     7.08
 3       3.04     3.05     3.35     9.44
 4       2.07     2.97     2.55     7.59
 5       2.54     2.44     2.45     7.43
 6       2.76     3.18     3.25     9.19
 7       2.03     2.30     2.17     6.50
 8       2.20     2.56     2.18     6.94
 9       2.38     2.99     2.74     8.11
10       2.42     3.20     3.14     8.76

Total   24.08    28.14    27.32    79.54

Test the statistical significance of the differences between days and between vaccines.

Osborn (1979) 8.1.2

3. (a) Explain what is meant by the Normal Linear Model (NLM), and show how the two-way analysis of variance may be formulated in this way.

(b) The following table gives the average UK cereal yield (tonnes per hectare) from 1994 to 1998, together with the row, column, and overall totals.

                   1994     1995     1996     1997     1998     Total

Wheat              7.35     7.70     8.15     7.38     7.56     38.14
Barley             5.37     5.73     6.14     5.76     5.29     28.29
Oats               5.50     5.52     6.14     5.78     6.00     28.94
Other cereal       5.65     5.52     5.86     5.52     5.04     27.59

Total             23.87    24.47    26.29    24.44    23.89    122.96

Calculate the fitted yields and residuals for Wheat in each of the five years

i. under the NLM assuming no column effect, and
ii. under the NLM assuming that row & column effects are additive.

(c) Describe briefly how to test the null hypothesis that there is no column effect (i.e. no consistent change in yield from year to year). You do not need to carry out the numerical calculations.

(d) A nonparametric test of the above hypothesis may be carried out as follows: rank the data for each row from lowest to highest (thus for Wheat the values 7.35, 7.70, 8.15, 7.38 and 7.56 are replaced by 1, 4, 5, 2 and 3 respectively), then sum the four ranks for each year, and finally carry out a one-way analysis of variance on the five sums of ranks.
Comment on the advantages and disadvantages of applying this procedure, rather than the standard two-way ANOVA, to the above data.

From Warwick ST217 exam 2001


4. The following table gives the estimated hospital waiting lists (000s) by month & region, throughout the years 2000 & 2001.

Month   Year     NY      T      E      L      SE     SW     WM     NW

  1     2000   137.6  105.5  121.6  173.3  192.6  111.0   99.4  177.6
  2     2000   132.0  103.2  118.7  167.9  190.4  107.7   95.8  172.2
  3     2000   125.1   98.7  111.9  162.3  184.0  100.7   89.6  164.8
  4     2000   129.5   99.5  114.3  163.5  186.9  101.5   92.1  166.5
  5     2000   129.9   99.5  114.6  163.2  186.4  100.6   92.4  166.2
  6     2000   128.9   99.9  114.4  163.3  183.7  100.2   91.9  165.5
  7     2000   127.9   99.0  113.6  160.8  183.9   99.6   90.7  164.9
  8     2000   126.7   98.9  113.2  159.6  183.4  100.1   90.2  165.6
  9     2000   124.6   98.2  111.9  158.1  183.8   99.6   91.0  164.6
 10     2000   123.4   97.1  112.0  156.0  183.4   98.8   91.1  163.1
 11     2000   121.0   97.2  111.7  155.7  183.4   98.9   92.1  161.2
 12     2000   121.3   99.0  112.7  158.0  188.2   99.4   92.8  162.9
  1     2001   122.4   97.7  113.3  159.5  188.8   99.8   93.4  164.0
  2     2001   121.6   96.9  113.0  159.2  186.7  100.1   92.1  163.3
  3     2001   119.6   95.3  109.7  156.3  181.1   97.1   87.2  160.5
  4     2001   122.6   96.5  110.7  158.9  184.1   99.3   88.8  162.7
  5     2001   124.0   97.2  111.0  160.0  185.9  100.3   90.1  164.4
  6     2001   124.2   98.0  111.8  160.6  187.2   99.7   90.8  165.6
  7     2001   123.2   98.9  111.4  160.9  187.4  100.2   91.0  165.5
  8     2001   123.2   99.6  111.9  161.4  185.7  100.0   91.6  166.2
  9     2001   123.0   99.4  111.6  159.1  185.2  100.2   91.0  165.8
 10     2001   124.1   99.0  111.8  156.7  184.6  101.1   90.7  165.5
 11     2001   123.0   99.6  113.2  155.4  183.7  102.0   89.8  164.7
 12     2001   124.3  100.7  115.8  159.1  186.7  103.6   92.1  168.0

Key

NY  Northern & Yorkshire     T   Trent           E   Eastern
L   London                   SE  South East      SW  South West
WM  West Midlands            NW  North West

[Data extracted from archived Press Releases at http://tap.ccta.gov.uk/doh/intpress.nsf]

Fit a two-way ANOVA model, possibly after transforming the data, and address (briefly) the following questions:

(a) Does the pattern of change in waiting lists differ across the regions?

(b) Is there a simple (but not misleading) description of the overall change in waiting lists over the two years?

(c) Predict the values for the eight regions in March 2002 (to the nearest 100, as in the Table).

(d) The set of figures for March 2001 was the latest available at the time of the General Election in May 2001. A cynical acquaintance suggests to you that the March 2001 waiting lists were ‘unusually good’. What do you think?


5. Table 4.2, page 45, presented data on the preventive effect of four different drugs on allergic response in ten patients.

A simple way to analyse the data is via a two-way ANOVA on a suitable measure of patient response, such as the increase in √NCF, which is tabulated below (for example, 1.95 = √3.8 − √0.0 and 1.52 = √9.2 − √2.3).

                                    Patient number
Drug      1      2      3      4      5      6      7      8      9     10

P       1.95   1.52   0.77   0.44   0.78   1.69   0.37   0.95   1.10   0.62
C       0.71   1.30   1.32   1.48   0.58   0.41   0.00   2.09   0.32  −0.22
D       0.65   0.67   0.65   0.48   0.00   0.44   0.26   0.42   1.18   0.63
K       0.19   0.54  −0.07   0.82   0.54  −0.44   0.27  −0.03   0.59   0.71

(a) Test the statistical significance of the differences between drugs and between patients.

(b) Plot the original data (Table 4.2) in a way that would help you assess whether the assumptions underlying the above two-way ANOVA are reasonable.

(c) Comment on the analysis you have made, suggesting possible improvements where appropriate. You do NOT need to carry out any further complicated calculations.

6. Table 6.4 shows purported IQ scores of identical twins, one raised in a foster home (Y), and the other raised by natural parents (X). The data are also categorised according to the social class of the natural parents (upper, middle, low). The data come from Burt (1966), and are also available in Weisberg (1980).

        upper class               middle class              lower class
Case     Y      X          Case     Y      X          Case     Y      X

  1     82     82            8     71     78           14     63     68
  2     80     90            9     75     79           15     77     73
  3     88     91           10     93     82           16     86     81
  4    108    115           11     95     97           17     83     85
  5    116    115           12     88    100           18     93     87
  6    117    129           13    111    107           19     97     87
  7    132    131                                      20     87     93
                                                       21     94     94
                                                       22     96     95
                                                       23    112     97
                                                       24    113     97
                                                       25    106    103
                                                       26    107    106
                                                       27     98    111

Table 6.4: Burt’s twin IQ data

(a) Plot the data.

(b) Fit simple linear regression models to predict Y from X within each social class.

(c) Fit parallel lines predicting Y from X within each social class (i.e. fit regression models with the same slope in each of the three classes, but possibly different intercepts).

(d) Produce an ANOVA table and an F-test to test whether the parallelism assumption is reasonable. Comment on the calculated F ratio.


For we know in part, and we prophesy in part. But when that which is perfect is come, then that which is in part shall be done away.

1 Corinthians 13:9–10

Everything should be made as simple as possible, but not simpler.

Albert Einstein

A theory is a good theory if it satisfies two requirements: it must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations.

Stephen William Hawking

The purpose of models is not to fit the data but to sharpen the question.

Samuel Karlin

Science may be described as the art of systematic oversimplification.

Sir Karl Raimund Popper


This page intentionally left blank (except for this sentence).


Chapter 7

Further Topics

7.1 Generalisations of the Linear Model

You can generalise the systematic part of the linear model, i.e. the formula for E[Y |x], and/or the random part, i.e. the distribution of Y − E[Y |x].

7.1.1 Nonlinear Models

These are models of the form

    E[Y |x] = g(x, β)                                                              (7.1)

where Y is the response, x is a vector of explanatory variables, β = (β1, . . . , βp)T is a parameter vector, and the function g is nonlinear in the βi’s.

Examples

1. Asymptotic regression:

       Yi = α − βγ^xi + εi   (i = 1, 2, . . . , n),   εi ∼ N (0, σ²) independently.

   There are four parameters to be estimated: β = (α, β, γ, σ²)T.

   Assuming that 0 < γ < 1, we have:

   (a) E[Y |x] is monotonic increasing in x,
   (b) E[Y |x = 0] = α − β,
   (c) as x → ∞, E[Y |x] → α.

This ‘asymptotic regression’ model might be appropriate, for example, if

   (a) x = age of an animal, y = height or weight, or

   (b) x = time spent training, y = height jumped (for n people of similar build).

2. The ‘Michaelis-Menten’ equation in enzyme kinetics:

       E[Y |x] = β1 x / (β2 + x),

   with various possible distributional assumptions, the simplest of which is

       [Y |x] ∼ N (β1 x / (β2 + x), σ²).
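
As a small illustrative sketch (not from the notes), the Python snippet below simply evaluates the two mean functions above at a few x values; the parameter values are invented for illustration and numpy is assumed to be available.

    import numpy as np

    def asymptotic_mean(x, alpha, beta, gamma):
        # E[Y|x] = alpha - beta * gamma**x; with 0 < gamma < 1 this rises from
        # alpha - beta at x = 0 towards the asymptote alpha as x grows.
        return alpha - beta * gamma ** x

    def michaelis_menten_mean(x, b1, b2):
        # E[Y|x] = b1 * x / (b2 + x); rises from 0 towards the asymptote b1.
        return b1 * x / (b2 + x)

    x = np.array([0.0, 1.0, 2.0, 5.0, 10.0, 50.0])
    print(asymptotic_mean(x, alpha=100.0, beta=60.0, gamma=0.7))   # invented parameters
    print(michaelis_menten_mean(x, b1=200.0, b2=2.0))              # invented parameters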


Comments

1. Nonlinear models can be fitted, in principle, by maximum likelihood.

2. In practice one needs computers and iteration.

3. Even if the random variation is assumed to be Normal, the likelihood may have a very non-Normal shape.
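
To illustrate comments 1 and 2, here is a minimal sketch (not part of the notes) of fitting the Michaelis-Menten model by nonlinear least squares, which coincides with maximum likelihood under the Normal-error assumption; scipy is assumed to be available, the data values are invented, and the optimiser iterates from a starting guess.

    import numpy as np
    from scipy.optimize import curve_fit

    def michaelis_menten(x, b1, b2):
        # E[Y|x] = b1 * x / (b2 + x)
        return b1 * x / (b2 + x)

    # Invented concentrations x and responses y, for illustration only.
    x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.10])
    y = np.array([76.0, 97.0, 123.0, 159.0, 191.0, 207.0])

    # Under Normal errors, least squares = maximum likelihood for (b1, b2);
    # curve_fit iterates from the starting guess p0.
    (b1_hat, b2_hat), cov = curve_fit(michaelis_menten, x, y, p0=[200.0, 0.1])
    print(b1_hat, b2_hat)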

7.1.2 Generalised Linear Models

Definition 7.1 (GLM)
A generalized linear model (GLM) has a random part and a systematic part:

Random Part

1. The ith response Yi has a probability distribution with mean µi.

2. The distributions are all of the same form (e.g. all Normal with variance σ², or all Poisson, etc.)

3. The Yi’s are independent.

Systematic Part

g(µi) = xiT β = β1 xi1 + β2 xi2 + · · · + βp xip,   where

1. xi = (xi1, . . . , xip)T is a vector of explanatory variables,

2. β = (β1, . . . , βp)T is a parameter vector, and

3. g(·) is a monotonic function called the link function.

Comments

1. If Yi ∼ N (µi, σ²) and g(·) is the identity function, then we have the NLM.

2. Other GLMs typically must have their parameters estimated by maximising the likelihood numerically (iteratively in a computer).

3. The principles behind fitting GLMs are similar to those for fitting NLMs.

Example: ‘logistic regression’

1. Random part: binary response

e.g.

    Yi | xi = 1 if individual i survived,
              0 if individual i died

(and all the Yi’s are conditionally independent given the corresponding xi’s).

Note that µi = E[Yi | xi] is here the probability of surviving given explanatory variables xi, and is usually written pi or πi.

2. Systematic part:

    g(πi) = log( πi / (1 − πi) ).
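
As a hedged sketch of how such a model could be fitted numerically (this is not part of the notes; the data are invented and numpy/scipy are assumed to be available), one can maximise the Bernoulli log-likelihood directly:

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_lik(beta, X, y):
        # Negative log-likelihood for independent Bernoulli responses with
        # logit link: pi_i = exp(x_i' beta) / (1 + exp(x_i' beta)).
        eta = X @ beta
        return -np.sum(y * eta - np.logaddexp(0.0, eta))  # -sum[ y*eta - log(1 + e^eta) ]

    # Invented data: an intercept plus one explanatory variable.
    x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
    y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
    X = np.column_stack([np.ones_like(x), x])

    fit = minimize(neg_log_lik, x0=np.zeros(2), args=(X, y), method="BFGS")
    print(fit.x)                                  # estimated (intercept, slope)
    print(1.0 / (1.0 + np.exp(-(X @ fit.x))))     # fitted probabilities pi_i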


Exercise 7.1
Show that under the logistic regression model, if n patients have identical explanatory variables x say, then

1. Each of these n patients has probability of survival given by

       π = exp(xT β) / (1 + exp(xT β)),

2. The number R surviving out of n has expected value nπ and variance nπ(1 − π). ‖

7.2 Simpson’s Paradox

Simpson’s paradox occurs when there are three RVs X, Y and Z, such that the conditional distributions [X, Y | Z] show a relationship between [X | Z] and [Y | Z], but the marginal distribution [X, Y] apparently shows a very different relationship between X and Y. For example,

1. X (Y) = male (female) death rate, Z = age,

2. X (Y) = male (female) admission rate to University, Z = admission rate for the student’s chosen course.
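
A tiny numerical sketch (with invented counts, not real admissions data, and not part of the original notes) may make the phenomenon concrete: within each group defined by Z the women's admission rate is the higher one, yet the marginal rates point the other way.

    # Invented (admitted, rejected) counts for two courses Z = 1, 2 -- illustration only.
    men   = {1: (80, 20), 2: (5, 45)}
    women = {1: (18, 2),  2: (12, 88)}

    def rate(groups):
        # Overall admission rate for a collection of (admitted, rejected) pairs.
        admitted = sum(a for a, _ in groups)
        total = sum(a + r for a, r in groups)
        return admitted / total

    for z in (1, 2):
        print(z, rate([men[z]]), rate([women[z]]))   # conditional on Z: women's rate is higher

    print(rate(men.values()), rate(women.values()))  # marginally: men's rate is higher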

7.3 Problems

1. (a) Explain what is meant by

   i. the Normal linear model,
   ii. simple linear regression, and
   iii. nonlinear regression.

(b) For simple linear regression applied to data (xi, yi), i = 1, . . . , n, show that the maximum likelihood estimators β̂0 and β̂1 of the intercept β0 and slope β1 satisfy the simultaneous equations

       β̂0 n + β̂1 ∑ xi = ∑ yi

   and

       β̂0 ∑ xi + β̂1 ∑ xi² = ∑ xi yi,

   where all sums run over i = 1, . . . , n. Hence find β̂0 and β̂1.

(c) The following table shows Y, the survival time (weeks) of leukaemia patients and x, the corresponding log of initial white blood cell count.

  x      Y        x      Y        x      Y

3.36     65     4.00    121     4.54     22
2.88    156     4.23      4     5.00      1
3.63    100     3.73     39     5.00      1
3.41    134     3.85    143     4.72      5
3.78     16     3.97     56     5.00     65
4.02    108     4.51     26

Plot the data and, without carrying out any calculations, discuss how reasonable the assumptions underlying simple linear regression are in this case.

From Warwick ST217 exam 1998


2. Because of concerns about sex discrimination, a study was carried out by the Graduate Division at the University of California, Berkeley. In fall 1973, there were 8,442 male applications and 4,321 female applications to graduate school. It was found that about 44% of the men and 35% of the women were admitted.

When the data were investigated further, it was found that just 6 of the more than 100 majors accounted for over one-third of the total number of applicants. The data for these six majors (which Berkeley forbids identifying by name) are summarized in the table below.

                      Men                         Women

            Number of      Percent      Number of      Percent
Major       applicants     admitted     applicants     admitted

A              825            62           108            82
B              560            63            25            68
C              325            37           593            34
D              417            33           375            35
E              191            28           393            24
F              373             6           341             7

Discuss the possibility of sex discrimination in admission, with particular reference to explanatory variables, conditional probability, independence and Simpson’s paradox.

Data from Freedman et al. (1991), page 17

3. (a) At a party, the POTAS¹ of your dreams approaches you, and says by way of introduction:

Hi—I’m working on a study of human pheromones, and need some statistical help. Can you explain to me what’s meant by ‘logistic regression’, and why the idea’s important?

Give a brief verbal explanation of logistic regression, without (i) using any formulae, (ii) saying anything that’s technically incorrect, (iii) boring the other person senseless and ruining a potentially beautiful friendship, (iv) otherwise embarrassing yourself.

(b) Repeat the exercise, replacing logistic regression successively with:

Bayesian inference, conditional expectation, likelihood,
a multinomial distribution, multiple regression, the Neyman-Pearson lemma,
nuisance parameters, one-way ANOVA, order statistics,
the Poisson distribution, a linear model, size & power,
statistical independence, a t-test.

(c) Suddenly, a somewhat inebriated student (SIS) appears and interrupts your rather impressive explanation with the following exchange:

SIS: Think of a number from 1 to 10.
POTASOYD: Erm—seven?
SIS: Wrong. Get your clothes off.

You then watch aghast while he starts introducing himself in the same way to everyone in the room. As a statistician, you of course note down the numbers xi he is given, namely

7, 2, 3, 1, 5, 2, 10, 10, 7, 3, 9, 1, 2, 2, 7, 10, 5, 8, 5, 7, 3, 10, 6, 1, 5, 3, 2, 7, 8, 5, 7.

His response yi is ‘Wrong’ in each case, and you formulate the hypotheses

    H0 : yi = ‘Wrong’ irrespective of xi,

    H1 : yi = ‘Right’ if xi = x0 and yi = ‘Wrong’ if xi ≠ x0, for some x0 ∈ {1, 2, . . . , 10}.

How might you test the null hypothesis H0 against the alternative H1?

¹ Person Of The Appropriate Sex


4. (i) Explain what is meant by:

(a) a generalised linear model,
(b) a nonlinear model.

(ii) Discuss the models you would most likely consider for the following data sets:

(a) Data on the age, sex, and weight of 100 people who suffered a heart attack (for the first time), and whether or not they were still alive two years later.

(b) Data on the age, sex and weight of 100 salmon in a fish farm.

From Warwick ST217 exam 1996

I have yet to see any problem, however complicated, which, when you looked at it the right way, did not become still more complicated.

Poul Anderson

The manipulation of statistical formulas is no substitute for knowing what one is doing.

Hubert M. Blalock, Jr.

A judicious man uses statistics, not to get knowledge, but to save himself from having ignorance foisted upon him.

Thomas Carlyle

The best material model of a cat is another, or preferably the same, cat.

A. Rosenblueth & Norbert Wiener

A little inaccuracy sometimes saves tons of explanation.

Saki (Hector Hugh Munro)

karma police arrest this man he talks in maths he buzzes like a fridge he’s like a detuned radio.

Thom Yorke

Better is the end of a thing than the beginning thereof.

Ecclesiastes 7:8


Bibliography

[1] V. Barnett. Comparative Statistical Inference. John Wiley and Sons, New York, second edition, 1982.

[2] C. Burt. The genetic determination of differences in intelligence: A study of monozygotic twins reared together and apart. Brit. J. Psych., 57:137–153, 1966.

[3] G. Casella and R. L. Berger. Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA, 1990.

[4] G. Casella and R. L. Berger. Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA, second edition, 2001.

[5] M. H. DeGroot. Probability and Statistics. Addison-Wesley, Reading, Mass., second edition, 1989.

[6] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap, volume 57 of Monographs on Statistics and Applied Probability. Chapman and Hall, New York, 1993.

[7] D. Freedman, R. Pisani, R. Purves, and A. Adhikari. Statistics. W. W. Norton, New York, second edition, 1991.

[8] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, 1996.

[9] D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski, editors. A Handbook of Small Data Sets. Chapman and Hall, London, 1994.

[10] R. V. Hogg and A. T. Craig. Introduction to Mathematical Statistics. MacMillan, New York, 1970.

[11] B. W. Lindgren. Statistical Theory. Chapman and Hall, London, fourth edition, 1994.

[12] A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the Theory of Statistics. McGraw-Hill, New York, third edition, 1974.

[13] D. S. Moore and G. S. McCabe. Introduction to the Practice of Statistics. W. H. Freeman & Company Limited, Oxford, UK, third edition, 1998.

[14] S. C. Narula and J. F. Wellington. Prediction, linear regression and minimum sum of relative errors. Technometrics, 19:185–190, 1977.

[15] O.P.C.S. 1993 Mortality Statistics, volume 20 of DH2. Her Majesty’s Stationery Office, London, 1995.

[16] J. F. Osborn. Statistical Exercises in Medical Research. Blackwell Scientific Publications, Oxford, UK, 1979.

[17] J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth, Pacific Grove, CA, second edition, 1995.

[18] P. Sprent. Data Driven Statistical Methods. Chapman and Hall, London, 1998.

[19] S. Weisberg. Applied Linear Regression. John Wiley and Sons, New York, 1980.
