
Math 35110, Autumn term 2005

Linear models

by Peter Green (University of Bristol, [email protected]).

• systematic treatment

– model formulation in various ways

– least squares estimation

– optimality of least squares

– connection with maximum likelihood

– model adequacy

• applications, including to regression and factorial experiments

• demonstrations and practical work using R

© University of Bristol, 2005

1

Motivation: data sets and basic ideas

• structure and relationships

• response and explanatory variables

• quantitative and qualitative (categorical, factor) variables

• statistical modelling

• experiment and observation

• causation

• estimation, confidence intervals, testing

• prediction

Linear models play a central role in statistics

• in practice – a major part of the basic toolkit

• pedagogically – a pattern for other techniques

2


Relation to other units

This unit leads on from level 2 Statistics 2 in some respects, and from the linear regression and t-tests in level 1 Statistics.

It leads on to level 3 units:

• Experimental design (weeks 7–12)

• Generalised linear models (weeks 13–20)

It was given by Stuart Coles in 2002/03, and by me in 2003/04 and 2004/05, to a similar syllabus.

It was given as half of the 20cp Linear models and experimental design unit for several years up to 2001/02 (so exam papers for that unit are relevant).

3

1. Model formulation

Linear models can be formulated or specified in various ways. It is important to be able to translate models fluently from one specification to another.

Matrix formulation

Responses are known linear functions of unknown parameters, plus an error term. In matrix/vector notation:

$Y = X\beta + \varepsilon$

where $X$ is $n \times p$ and $\beta$ is $p \times 1$, with $n \geq p$.

The response vector $Y$ is assumed known (observed); the model or design matrix $X$ is also known, comprising observed explanatory variables or experimental settings. The parameter vector $\beta$ is the focus of our interest, whether it is to be estimated, or some other inference carried out on it.

4


$Y = X\beta + \varepsilon,$

or, spelling it out:

$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

Quite often, the model includes a constant or intercept term, which we may refer to as column 0:

$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

Here, $X$ is $n \times (p+1)$.

5

Matrix notation is convenient for developing the general theory, but specific models are usually specified either in ordinary algebraic notation or in the mnemonic model expression notation, devised by Wilkinson and Rogers, that is used in many computer systems, including R.

Let us illustrate this with examples.

a. Simple linear regression

Algebraic notation:

$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$

Matrix formulation: $Y = X\beta + \varepsilon$ as usual, with

$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$

Model expression: Y ~ x (or, more commonly, we would use self-explanatory names for the variables, e.g. Cholesterol ~ Age). Note that the intercept is not mentioned, but is included by default.

6


b. Multiple linear regression

Algebraic notation:

$y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \quad i = 1, 2, \ldots, n$

Matrix formulation:

$X = \begin{pmatrix} 1 & x_{11} & x_{12} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \end{pmatrix}$

Model expression: (e.g.) Abrasion ~ Hardness + Tensile

c. Linear regression without an intercept

Algebraic notation:

$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \quad i = 1, 2, \ldots, n$

Matrix formulation:

$X = \begin{pmatrix} x_{11} & x_{12} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$

7

Model expression: (e.g.) Abrasion ~ -1 + Hardness + Tensile (the -1 removes the default intercept)

d. Linear regression with functions and combinations of explanatory variables

Algebraic notation:

$y_i = \alpha + \beta_1 x_{i1} + \beta_2 \sin(x_{i2}) + \beta_3\, x_{i1}\sin(x_{i2}) + \varepsilon_i$

Matrix formulation:

$X = \begin{pmatrix} 1 & x_{11} & \sin(x_{12}) & x_{11}\sin(x_{12}) \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & \sin(x_{n2}) & x_{n1}\sin(x_{n2}) \end{pmatrix}, \qquad \beta = \begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}$

Model expression: (e.g.) Abrasion ~ Hardness + sin(Tensile) + Hardness:sin(Tensile)

8


e. Factor variables

Algebraic notation:

$y_i = \beta_{t_i} + \varepsilon_i, \quad i = 1, 2, \ldots, n$

where $t_i \in \{1, 2, \ldots, g\}$ is an observed factor (that is, a qualitative variable). Matrix formulation:

$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_g \end{pmatrix} + \varepsilon$

(here the observations are shown sorted by factor level, with a single 1 in each row marking the level of that observation).

The columns of $X$ are often called dummy variables; they are not observed numbers, but indicators of which component of $\beta$ enters into the formula for that observation.

Model expression: (e.g.)

lifetime ~ make

9

Such models are also often expressed in a double-subscript notation:

$y_{ij} = \beta_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, g, \quad j = 1, 2, \ldots, n_i$

where $\sum_i n_i = n$. This is the same model in a different notation; $Y$ is not a matrix, but a vector:

$Y = (y_{11}, \ldots, y_{1n_1}, y_{21}, \ldots, y_{2n_2}, \ldots, y_{g1}, \ldots, y_{gn_g})^T$

f. Two factors: row+column model

Algebraic notation (with double subscripts):

$y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}, \quad i = 1, 2, \ldots, r, \quad j = 1, 2, \ldots, c$

10


Matrix formulation:

$\begin{pmatrix} y_{11} \\ \vdots \\ y_{1c} \\ \vdots \\ y_{r1} \\ \vdots \\ y_{rc} \end{pmatrix} = \begin{pmatrix} \mathbf{1} & \mathbf{1} & \mathbf{0} & \cdots & \mathbf{0} & I \\ \mathbf{1} & \mathbf{0} & \mathbf{1} & \cdots & \mathbf{0} & I \\ \vdots & & & & & \vdots \\ \mathbf{1} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1} & I \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \vdots \\ \alpha_r \\ \beta_1 \\ \vdots \\ \beta_c \end{pmatrix} + \varepsilon$

In $X$, each $\mathbf{1}$ and $\mathbf{0}$ is a $c$-vector of 1's and 0's, respectively, and each $I$ is the $c \times c$ identity matrix. There are $r$ rows of these blocks, so $X$ is $rc \times (1 + r + c)$.

Note that we return to this set-up later, and use a slightly different notation.

Model expression: (e.g.) Yield ~ Block + Variety

11

g. Regression, with factor variables too

Algebraic notation:

$y_i = \alpha_{t_i} + \beta x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$

where $t_i \in \{1, 2, \ldots, g\}$ is an observed factor, and $x_i$ an ordinary numerical variable. Matrix formulation:

$X = \begin{pmatrix} 1 & 0 & \cdots & 0 & x_1 \\ 0 & 1 & \cdots & 0 & x_2 \\ \vdots & & & & \vdots \\ 1 & 0 & \cdots & 0 & x_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_g \\ \beta \end{pmatrix}$

(assuming, in this display, that $t_1 = 1$, $t_2 = 2$ and $t_n = 1$).

Model expression: (e.g.) lifetime ~ make + speed

In double-subscript notation:

$y_{ij} = \alpha_i + \beta x_{ij} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, g, \quad j = 1, 2, \ldots, n_i$

Note that this specifies several parallel regression lines.

12


h. Regression, with factor variables and interaction

Algebraic notation:

$y_i = \alpha_{t_i} + \beta_{t_i} x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$

where $t_i \in \{1, 2, \ldots, g\}$ is an observed factor.

Model expression: (e.g.) lifetime ~ make*speed or, equivalently, lifetime ~ make+speed+make:speed

In double-subscript notation:

$y_{ij} = \alpha_i + \beta_i x_{ij} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, g, \quad j = 1, 2, \ldots, n_i$

Note that this specifies several separate regression lines: they need no longer be parallel.

13

Alternative parameterisations

There are always several ways to parameterise a model, and when interpreting parameters or their estimates, it is important to bear the parameterisation in mind.

For example

$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$

and

$y_i = \alpha' + \beta(x_i - \bar{x}) + \varepsilon_i, \quad i = 1, 2, \ldots, n$

are the same model, where $\alpha' = \alpha + \beta\bar{x}$. While $\beta$ is the gradient in both models, the "intercepts" $\alpha$ and $\alpha'$ are different numbers with different meanings.

This issue is especially important with factor variables. For example in case (e) above, with two levels of the factor $t$ as in the lathe example, we might have chosen the two group means to be $\mu$ and $\mu + \delta$, so that $\delta$ is the difference in mean lifetime between the two makes of lathe. The $X$ matrix would then have rows of the form

$(1\;\;0)$ and $(1\;\;1)$ instead of $(1\;\;0)$ and $(0\;\;1)$

Mathematically, two $X$ matrices represent the same model if and only if they have the same column spaces.

14


Summary

• Note the general patterns, so that you can put richer models together using these ideas in a modular way.

• A numerical explanatory variable contributes one column to $X$; a factor contributes as many columns as there are levels of the factor (illustrated in the R sketch below).

• We use the + symbol to assemble sets of columns together, corresponding to the successive terms in the model.

• Learn the interpretation of interaction (symbolised by :) between two factors, and between a factor and a numerical variable. A*B is short for A+B+A:B

• The individual components of $\beta$ might correspond to other Greek letters in the algebraic specification of the model.

• The individual components of $\beta$ and $Y$ might have multiple subscripts in the algebraic specification of the model.
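As a quick illustration of how these specifications become design matrices (this sketch is not from the notes; the small data frame and the variable names make and speed are invented, echoing the lathe example), model.matrix() in R displays the columns that a factor, a numerical variable, and their interaction contribute to X:

# a small artificial data set: 'make' is a factor, 'speed' is numerical
dat <- data.frame(make  = factor(c("A", "A", "B", "B", "C", "C")),
                  speed = c(1.0, 1.5, 0.8, 1.2, 1.1, 0.9))

model.matrix(~ make - 1, data = dat)          # one dummy column per level, cf. case (e)
model.matrix(~ make + speed - 1, data = dat)  # parallel lines, cf. case (g)
model.matrix(~ make * speed, data = dat)      # expands to make + speed + make:speed, cf. case (h)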

15

2. Least squares estimation

Fitting a linear model means estimating the regression coefficient parameter $\beta$ – we usually do this using the principle of least squares. An advantage of this principle is that it makes sense without having to assume a statistical model for the errors $\varepsilon$.

The idea is to choose that value of $\beta$, say $\hat\beta$, such that the error sum of squares $S(\beta)$ is minimised, where

$S(\beta) = \sum_{i=1}^{n} e_i^2 = (Y - X\beta)^T(Y - X\beta)$

where $e_i = y_i - \sum_j x_{ij}\beta_j$.

We can find a compact general expression for the solution to this minimisation problem, if we make a simplifying assumption.

16


Until further notice we assume:

The matrix $X$ is of full rank $p$. This is equivalent to making any of these assertions (recalling that $n \geq p$):

• The rank of $X$ is $p$

• The columns of $X$ are linearly independent

• $X^TX$ is nonsingular

(Note that this is a sensible assumption, since if it was not true, $X\beta_1 = X\beta_2$ would not imply $\beta_1 = \beta_2$, so that you could not expect to choose between $\beta_1$ and $\beta_2$ using data $Y = X\beta + \varepsilon$. In this case, we say $\beta$ is not identifiable.)

Which of the cases (a) to (h) in Section 1 satisfy this assumption?

17

Let $\hat\beta$ satisfy

$X^TX\hat\beta = X^TY$ (1)

We shall prove that any such $\hat\beta$ minimises $S(\beta)$.

Note that $S(\beta) = (Y - X\beta)^T(Y - X\beta)$. Let $\delta$ be any $p$-vector. Then

$S(\hat\beta + \delta) = (Y - X\hat\beta - X\delta)^T(Y - X\hat\beta - X\delta)$

$= S(\hat\beta) - 2\delta^T(X^TY - X^TX\hat\beta) + (X\delta)^T(X\delta)$

Simplifying, the middle term on the right is

$2\delta^T(X^TY - X^TX\hat\beta) = 0$ by assumption (1), for all $\delta$.

The last term on the right is the sum of squares of the elements of $X\delta$, so is non-negative, and is 0 if and only if $X\delta = 0$. Since $X$ is full rank, this is true if and only if $\delta = 0$.

Thus, $S(\hat\beta + \delta) \geq S(\hat\beta)$, with equality if and only if $\delta = 0$. So any $\hat\beta$ satisfying (1) minimises $S(\beta)$.

Since $X^TX$ is nonsingular, the only solution is $\hat\beta = (X^TX)^{-1}X^TY$; this is therefore the unique least squares estimator.
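As a numerical check of this formula (a sketch, not part of the notes; the data are simulated and the names are arbitrary), the normal equations can be solved directly and compared with the output of lm():

set.seed(1)
n  <- 20
x1 <- runif(n); x2 <- runif(n)
y  <- 2 + 3*x1 - 1.5*x2 + rnorm(n, sd = 0.5)

X <- cbind(1, x1, x2)                      # design matrix, including the intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations X'X b = X'y
beta.hat
coef(lm(y ~ x1 + x2))                      # agrees (up to rounding)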

18


Fitted values and residuals

Having obtained the estimates $\hat\beta$, the predicted or fitted values of the response variable are obtained by substitution:

$\hat{Y} = X\hat\beta = X(X^TX)^{-1}X^TY = HY$

where $H = X(X^TX)^{-1}X^T$ is called the hat matrix (it 'puts the hat' on $Y$). Similarly, the vector of residuals is the difference

$e = Y - \hat{Y} = Y - HY = (I - H)Y$

where $I = I_n$ is the order $n$ identity matrix. Note that both the fitted values and the residuals are, like the least squares estimates $\hat\beta$, linear functions of $Y$.

The residual sum of squares is

$e^Te = Y^T(I - H)Y$

since, as you can show easily:

• $H$ is symmetric: $H^T = H$

• $H$ is idempotent: $H^2 = H$, and so:

• $(I - H)^T(I - H) = I - H$
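These quantities are easy to compute and check numerically (a self-contained sketch with simulated data, not from the notes):

set.seed(1)
n <- 20
x <- runif(n)
y <- 1 + 2*x + rnorm(n, sd = 0.3)
X <- cbind(1, x)

H     <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
y.hat <- H %*% y                            # fitted values, H Y
e     <- (diag(n) - H) %*% y                # residuals, (I - H) Y
RSS   <- sum(e^2)

all.equal(H, t(H))                          # symmetric
all.equal(H %*% H, H)                       # idempotent
all.equal(RSS, as.numeric(t(y) %*% (diag(n) - H) %*% y))   # RSS = Y'(I - H)Y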

19

Fitting linear models in R

The R command for fitting a linear model is lm(); the only compulsory argument is the formula of the model to be fitted – in the model expression syntax we saw in Section 1. For example,

lm(Cholesterol~Age)

The variables used in the formula may be either (i) in the current workspace as ordinary variables, (ii) in a data frame that has been previously attached using the attach() command, e.g. attach(lipid), or (iii) in a data frame specified as the data argument of lm(), e.g.

lm(Cholesterol~Age, data=lipid)

It produces brief output: the least squares estimates.

20


More comprehensive output is obtained by assigning the output of lm() to a variable, with a name of your choosing, e.g.

fita <- lm(Cholesterol~Age)

and then processing the result.

The output from lm() is a list with named components, e.g. fita$residuals; you can see all the names with, e.g.,

names(fita)

You can look at the values of these components, as usual, by typing the name, e.g. fita$coefficients gives the least squares estimates.
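Putting these pieces together (a sketch assuming the lipid data frame with columns Cholesterol and Age used in the examples above; the object name fita is as in the notes):

attach(lipid)                  # or pass data = lipid to lm() instead
fita <- lm(Cholesterol ~ Age)  # fit the simple linear regression
fita                           # brief output: the coefficient estimates
names(fita)                    # components of the fitted-model object
fita$coefficients              # the least squares estimates
fita$residuals[1:5]            # the first few residuals
detach(lipid)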

21

3. Statistical performance

In this section, we start to make statistical assumptions about our model, but only about means, variances and covariances, not full probability distributions.

Remarkably, this is enough to demonstrate a particular kind of optimality of least squares estimators, in the form of a famous result known as the Gauss-Markov theorem.

22


Mean and variance assumptions

In our linear model

$Y = X\beta + \varepsilon$

from now on, we assume that

• $E(\varepsilon_i) = 0$ for all $i$ (or, equivalently, $E(y_i) = x_i^T\beta$ for all $i$): that is, the observations are unbiased;

• $\mathrm{var}(\varepsilon_i) = \sigma^2$ for all $i$ (or, equivalently, $\mathrm{var}(y_i) = \sigma^2$ for all $i$): that is, the observations have equal variance;

• $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$ (or, equivalently, $\mathrm{cov}(y_i, y_j) = 0$ for all $i \neq j$): that is, the observations are uncorrelated.

The 2nd and 3rd items are the same as saying that $\mathrm{var}(\varepsilon) = \mathrm{var}(Y) = \sigma^2 I$.

23

Mean and variance of the least squares estimator

We find that

$E(\hat\beta) = \beta$ and $\mathrm{var}(\hat\beta) = \sigma^2(X^TX)^{-1}$

Proof
We will be using the results that

for any random $n$-vector $Z$, and any constant $m \times n$ matrix $A$, $E(AZ) = A\,E(Z)$ and $\mathrm{var}(AZ) = A\,\mathrm{var}(Z)\,A^T$.

Recall that under the full-rank assumption, $\hat\beta = (X^TX)^{-1}X^TY$. Then

$E(\hat\beta) = (X^TX)^{-1}X^T E(Y) = (X^TX)^{-1}X^TX\beta = \beta$

and

$\mathrm{var}(\hat\beta) = \mathrm{var}\{(X^TX)^{-1}X^TY\} = (X^TX)^{-1}X^T\,\mathrm{var}(Y)\,X(X^TX)^{-1} = (X^TX)^{-1}X^T(\sigma^2 I)X(X^TX)^{-1} = \sigma^2(X^TX)^{-1}$
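A small simulation sketch (not from the notes; the 'true' parameter values are made up) illustrating both results – over repeated samples the empirical mean and covariance of the least squares estimates should be close to $\beta$ and $\sigma^2(X^TX)^{-1}$:

set.seed(2)
n <- 30; sigma <- 0.5
x <- runif(n)
X <- cbind(1, x)
beta <- c(2, -1)                             # 'true' values, for illustration only

B <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  as.vector(solve(t(X) %*% X, t(X) %*% y))   # least squares estimate for this sample
})
rowMeans(B)                                  # approximately beta (unbiasedness)
var(t(B))                                    # approximately sigma^2 (X'X)^{-1}
sigma^2 * solve(t(X) %*% X)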

24


These results allow us to say that least squares estimates are unbiased, and to write down their standard errors

$\mathrm{se}(\hat\beta_j) = \sqrt{\mathrm{var}(\hat\beta_j)} = \sigma\sqrt{[(X^TX)^{-1}]_{jj}}$

Confidence intervals, etc., based on this will be in the next section.

But even without the further assumptions made there, we can claim a remarkable optimality for least squares estimation; the variance of the estimator is the smallest possible (among linear unbiased estimators).

Gauss-Markov theorem

Suppose that $E(Y) = X\beta$ and $\mathrm{var}(Y) = \sigma^2 I$. Let $a$ be a fixed $p$-vector. Then $a^T\hat\beta$ is an unbiased estimator of $a^T\beta$, and has variance no larger than any other estimator that is linear (in $Y$) and unbiased.

Proof

$E(a^T\hat\beta) = a^T E(\hat\beta) = a^T\beta$

25

Also, $a^T\hat\beta = a^T(X^TX)^{-1}X^TY$ is clearly linear in $Y$. Let

$c^TY$ be any other linear unbiased estimator. Then

$E(c^TY) = c^TX\beta = a^T\beta$

holds for all $\beta$, that is, $c$ must satisfy $X^Tc = a$. So

$\mathrm{var}(a^T\hat\beta) = a^T\,\mathrm{var}(\hat\beta)\,a = \sigma^2 a^T(X^TX)^{-1}a = \sigma^2 c^TX(X^TX)^{-1}X^Tc$

Meanwhile,

$\mathrm{var}(c^TY) = \sigma^2 c^Tc$

so

$\mathrm{var}(c^TY) - \mathrm{var}(a^T\hat\beta) = \sigma^2\{c^Tc - c^TX(X^TX)^{-1}X^Tc\} = \sigma^2 c^T(I - H)c = \sigma^2\{(I - H)c\}^T\{(I - H)c\}$

which is the sum of squares of the elements of a vector, so is $\geq 0$, as required.

26


Estimating $\sigma^2$

The least squares principle does not itself tell us how to estimate $\sigma^2$. However, we do now have a basis for doing so. Since $\sigma^2 = \mathrm{var}(\varepsilon_i)$ and $E(\varepsilon_i) = 0$, we would expect the average of the squares of the residuals $e_i = y_i - x_i^T\hat\beta$ to be about $\sigma^2$.

In fact (from page 19) the residual sum of squares is $e^Te = Y^T(I - H)Y$, and using the result that

for any random vector $Z$ with $E(Z) = \mu$ and $\mathrm{var}(Z) = V$, and any constant matrix $A$,

$E(Z^TAZ) = \mu^TA\mu + \mathrm{tr}(AV)$

we have

$E(e^Te) = E\{Y^T(I - H)Y\} = \beta^TX^T(I - H)X\beta + \mathrm{tr}\{(I - H)\sigma^2 I\} = \sigma^2\,\mathrm{tr}(I - H)$

since $(I - H)X = 0$ (check!). But

$\mathrm{tr}(I - H) = n - \mathrm{tr}\{X(X^TX)^{-1}X^T\} = n - \mathrm{tr}\{(X^TX)^{-1}X^TX\} = n - \mathrm{tr}(I_p) = n - p$

using the fact that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ for all conformable matrices $A$, $B$.

27

Thus,

$\hat\sigma^2 = \frac{e^Te}{n - p}$

("RSS/df") is an unbiased estimator of $\sigma^2$, which we call "the least squares estimator" of $\sigma^2$.

Standardised residuals

Although the errors $\varepsilon_i$ are assumed to have equal variance $\sigma^2$, their estimates, the residuals $e_i$, do not. In fact,

$\mathrm{var}(e) = \mathrm{var}\{(I - H)Y\} = (I - H)\,\mathrm{var}(Y)\,(I - H)^T = \sigma^2(I - H)$

so $\mathrm{var}(e_i) = \sigma^2(1 - h_{ii})$, where $h_{ii}$ are the diagonal elements of $H$.

Therefore we define standardised residuals as

$e_i^* = e_i/(\hat\sigma\sqrt{1 - h_{ii}}), \quad i = 1, 2, \ldots, n$. These have (approximately) equal variances.
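In R (a sketch with simulated data, not from the notes; hatvalues() and rstandard() are standard functions in the stats package):

set.seed(3)
n <- 25
x <- runif(n)
y <- 1 + 2*x + rnorm(n, sd = 0.4)
fit <- lm(y ~ x)

p      <- length(coef(fit))
e      <- residuals(fit)
sigma2 <- sum(e^2) / (n - p)            # RSS/df, the least squares estimator of sigma^2
h      <- hatvalues(fit)                # diagonal elements h_ii of the hat matrix
std.e  <- e / sqrt(sigma2 * (1 - h))    # standardised residuals

all.equal(std.e, rstandard(fit))        # agrees with R's built-in version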

28


Further results from lm()

A more complete printed summary can be obtained by typing, e.g.,

summary(fita)

This includes estimates and their standard errors, and some statistics about residuals and about the fit.

Four diagnostic plots are produced if you type, e.g.,

plot(fita)

(If you have first typed par(mfrow=c(2,2)) they will be displayed as a 2 × 2 array.)

29

Diagnostic plots

The four plots are

1. Fitted values vs. residuals: a scatter plot of $(\hat y_i, e_i)$
(a pattern indicates that there is systematic under-fitting: do you need to fit other explanatory variables?)

2. Normal Q-Q plot: a Q-Q plot of the standardised residuals
(departures from a straight line suggest that the errors are not normally distributed)

3. Scale-Location plot: a scatter plot of $(\hat y_i, \sqrt{|e_i^*|})$
(a more sensitive version of the fitted value/residual plot: does the variance of the errors vary?)

4. Cook's distance plot: a plot of Cook's distance $D_i$ against observation number $i$
(Cook's distance is a measure of how much the overall fit would change if observation $i$ was deleted). In R versions 2.2.0 or later, the 4th plot is different from this one, by default. To get this one, use plot(fita,which=c(1,2,3,5)).

30


Prediction

We often fit a linear model in order to make predictions of the response variable for various future choices of the explanatory variable(s). The function predict() is provided for this.

For example,

predict(fita, data.frame(Age=c(23,27)))

computes the expected values of Cholesterol for Age equal to 23 and 27.
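predict() will also return interval estimates (a sketch, not in the notes; interval and se.fit are standard arguments of predict.lm, and the intervals themselves are derived in Section 4):

predict(fita, data.frame(Age=c(23,27)), interval = "confidence")  # CIs for the mean response
predict(fita, data.frame(Age=c(23,27)), interval = "prediction")  # wider: intervals for a new observation
predict(fita, data.frame(Age=c(23,27)), se.fit = TRUE)            # standard errors of the fitted means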

31

4. Normal theory assumptions

Now for the first time, we make assumptions about the full probability distribution of our responses. We assume more, and we get more – we can derive inferential procedures like confidence intervals for parameters, make probabilistic predictions about future observations, and test hypotheses about parameter values and about model adequacy.

We will also find an intimate connection between least squares and maximum likelihood.

Assumption

In addition to the assumptions of Section 3, we now assume that the $y_i$ are independently normally distributed. That is, $y_i \sim N(x_i^T\beta, \sigma^2)$, independently, or in brief, $Y \sim N(X\beta, \sigma^2 I)$.

(Here $x_i^T$ is the $i$th row of $X$; so $E(y_i) = x_i^T\beta$.)

32


Least squares and maximum likelihood

Since the observations are independent, the likelihood is just the product of their density functions, so

$L = L(\beta, \sigma^2; Y) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\right\}$

Thus the log-likelihood is

$\ell = \ell(\beta, \sigma^2; Y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}S(\beta)$

where

$S(\beta) = (Y - X\beta)^T(Y - X\beta)$

is the usual sum of squares.

Hence for any fixed $\sigma^2$, maximising the likelihood over $\beta$ corresponds exactly to minimising the sum of squares of the residuals. That is, for $\beta$, least squares estimation and maximum likelihood estimation are the same thing. (Note that the normal distribution assumption is essential for this conclusion.)

33

Since the least squares estimates do not involve $\sigma^2$, you get the same estimators on simultaneously maximising over $\beta$ and $\sigma^2$.

This connection provides a powerful additional justification for using least squares estimators.

Maximum likelihood estimation for $\sigma^2$

Differentiating the log-likelihood with respect to $\sigma^2$, and setting to 0, we get

$-\frac{n}{2\sigma^2} + \frac{S(\hat\beta)}{2\sigma^4} = 0$

so we immediately obtain the maximum likelihood estimator of $\sigma^2$ as $S(\hat\beta)/n = e^Te/n$. Note that this is different from the least squares estimator $\hat\sigma^2$, which has the divisor $n - p$; in practice we always use the latter, the least squares estimator, which is unbiased.

34


MLE, linear models, and other error distributions

If the errors $\varepsilon_i$ were assumed to be i.i.d. but with a non-normal distribution, then the maximum likelihood estimator of $\beta$ would turn out to be different.

For example, suppose that the "double-exponential" or Laplace distribution was assumed:

$f(\varepsilon) = \frac{1}{2\phi}\exp\{-|\varepsilon|/\phi\}$

where $\phi$ is a (scale) parameter.

Then the argument on page 33 is easily modified, and we get

$\ell = \ell(\beta, \phi; Y) = -n\log(2\phi) - \frac{1}{\phi}\sum_{i=1}^{n}|y_i - x_i^T\beta|$

and so the MLE of $\beta$ is that value minimising

$\sum_{i=1}^{n}|y_i - x_i^T\beta|$

sometimes called the $L_1$ estimator (least squares is '$L_2$').
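For illustration only (not part of the notes), the $L_1$ estimate can be computed numerically with a general-purpose optimiser; dedicated routines exist (e.g. rq() in the quantreg package), but a minimal sketch is:

set.seed(4)
n <- 50
x <- runif(n)
y <- 1 + 2*x + rnorm(n)
X <- cbind(1, x)

l1.loss <- function(b) sum(abs(y - X %*% b))   # sum of absolute residuals

fit.l1 <- optim(coef(lm(y ~ x)), l1.loss)      # start the search from the L2 estimate
fit.l1$par                                     # L1 (Laplace-MLE) coefficient estimates
coef(lm(y ~ x))                                # least squares (L2) estimates, for comparison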

35

Joint distribution of $\hat\beta$ and $\hat\sigma^2$

Various inferential procedures can be derived from the following result (which we will see how to prove later in the course):

If $Y \sim N(X\beta, \sigma^2 I)$ then

(a) $\hat\beta \sim N(\beta, \sigma^2(X^TX)^{-1})$

(b) $(n - p)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-p}$

(c) $\hat\beta$ and $\hat\sigma^2$ are independent.

Note that we already knew the mean and variance in (a) and the mean in (b) – they did not require normality.

Corollary

For any fixed $p$-vector $a$, we have

$\frac{a^T\hat\beta - a^T\beta}{\hat\sigma\sqrt{a^T(X^TX)^{-1}a}} \sim t_{n-p}$

36


Definition of multivariate normal distribution

A random $k$-vector $Z$ has the multivariate normal distribution

$Z \sim N_k(\mu, V)$

if for every constant $k$-vector $a$, $a^TZ$ has the ordinary normal distribution $N(a^T\mu, a^TVa)$.

If $Z \sim N_k(\mu, V)$ then $E(Z) = \mu$ and $\mathrm{var}(Z) = V$.

$Z \sim N_k(\mu, \sigma^2 I)$ if and only if $z_i \sim N(\mu_i, \sigma^2)$ for all $i$, independently.

If $Z \sim N_k(\mu, V)$ and $V$ is non-singular, then $Z$ has joint density function

$f(z) = (2\pi)^{-k/2}|V|^{-1/2}\exp\{-\tfrac{1}{2}(z - \mu)^TV^{-1}(z - \mu)\}$

Example

$\hat\beta \sim N_p(\beta, \sigma^2(X^TX)^{-1})$ is the same as saying that

$a^T\hat\beta \sim N(a^T\beta, \sigma^2 a^T(X^TX)^{-1}a)$ for all fixed $p$-vectors $a$.

37

Definition of $t$ distribution

If $Z$ has a standard normal distribution, $V$ has a $\chi^2_\nu$ distribution ($\chi^2$ with $\nu$ degrees of freedom), and $Z$ and $V$ are independent, then $Z/\sqrt{V/\nu}$ has a $t$ distribution with $\nu$ degrees of freedom.

Proof of Corollary

From (a), $a^T\hat\beta \sim N(a^T\beta, \sigma^2 a^T(X^TX)^{-1}a)$, so

$\frac{a^T\hat\beta - a^T\beta}{\sigma\sqrt{a^T(X^TX)^{-1}a}} \sim N(0, 1)$ (2)

From (b), $(n - p)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-p}$, and this is independent of (2) by (c), so we have our result, by the definition of the $t$ distribution.

Thus, for example, a $100(1 - \alpha)\%$ confidence interval for $a^T\beta$ is given by

$(a^T\hat\beta - t_{n-p,\alpha/2}\,\widehat{\mathrm{se}},\;\; a^T\hat\beta + t_{n-p,\alpha/2}\,\widehat{\mathrm{se}})$

where $\widehat{\mathrm{se}} = \hat\sigma\sqrt{a^T(X^TX)^{-1}a}$.

38


Example 1: confidence interval

A $100(1 - \alpha)\%$ confidence interval for $\beta_j$ is given by

$(\hat\beta_j - t_{n-p,\alpha/2}\,\widehat{\mathrm{se}},\;\; \hat\beta_j + t_{n-p,\alpha/2}\,\widehat{\mathrm{se}})$

where now $\widehat{\mathrm{se}} = \hat\sigma\sqrt{[(X^TX)^{-1}]_{jj}}$.

Example 2: testing a hypothesis

You can reject the hypothesis that $\beta_j = 0$ at level $\alpha$, against a two-sided alternative, if

$|\hat\beta_j| \,/\, \hat\sigma\sqrt{[(X^TX)^{-1}]_{jj}} > t_{n-p,\alpha/2}$

Example 3: prediction

A future observation with explanatory variables $x^+$ will be $y^+ = (x^+)^T\beta + \varepsilon^+$; this has least squares estimate $(x^+)^T\hat\beta$. The error $y^+ - (x^+)^T\hat\beta$ has variance $\sigma^2\{1 + (x^+)^T(X^TX)^{-1}x^+\}$. A $100(1 - \alpha)\%$ prediction interval for $y^+$ is

$((x^+)^T\hat\beta - t_{n-p,\alpha/2}\,\widehat{\mathrm{se}},\;\; (x^+)^T\hat\beta + t_{n-p,\alpha/2}\,\widehat{\mathrm{se}})$

where this time $\widehat{\mathrm{se}} = \hat\sigma\sqrt{1 + (x^+)^T(X^TX)^{-1}x^+}$.
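These are the intervals that confint() and predict(..., interval=...) compute in R. A sketch of the hand calculation for Example 1, with a simulated fit (the object names are made up):

set.seed(5)
x <- runif(30); y <- 1 + 2*x + rnorm(30, sd = 0.5)
fit <- lm(y ~ x)

alpha <- 0.05
n <- length(y); p <- length(coef(fit))
bhat  <- coef(fit)["x"]
se    <- summary(fit)$coefficients["x", "Std. Error"]   # sigma.hat * sqrt([(X'X)^-1]_jj)
tcrit <- qt(1 - alpha/2, df = n - p)

c(bhat - tcrit*se, bhat + tcrit*se)   # 95% CI computed 'by hand'
confint(fit, "x", level = 0.95)       # agrees with R's built-in version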

39

5. Model choice in linear models

We have so far regarded the matrix $X$, containing numerical explanatory variables, and 0/1 indicators for factor levels, as fixed and given. In practice, very often, the choice of which explanatory variables to include is at the discretion of the analyst. As we have seen, it is easy enough in a system like R to make several different choices, and fit them all. But what criteria should be used to choose between these models?

We wish to avoid

• omitting variables that are needed – that would incur bias in estimation and prediction

• including variables that have no real effect – that is a waste of expense, and would lead to inflated variances of estimates and predictions

The main thing we need is a formal mechanism for determining whether an individual variable, or group of variables, can be dropped from a linear model, without an undue effect on the performance of the model in terms of explaining the response. This will allow us to make pairwise comparisons between models.

40


We already have a method for testing whether a single variable (component of $\beta$) needs to be included – see Example 2 on page 39. If we set $a$ to be the $j$th unit vector, so that $a^T\beta = \beta_j$, and consider $H_0: \beta_j = 0$, then we see that you can reject that hypothesis at level $\alpha$, against a two-sided alternative, if

$|\hat\beta_j| \,/\, \hat\sigma\sqrt{[(X^TX)^{-1}]_{jj}} > t_{n-p,\alpha/2}$

However, this procedure doesn't cover the case where the possible exclusion of several components of $\beta$ is being considered, since the test for each component assumes the inclusion of all other components.

Analysis of Variance

This is a general term for procedures that give the generalisation of the $t$ test that we require.

The basic idea is to decompose the variability among the components of $Y$, measured by sums of squares, into terms attributable to different components of $\beta$. Tests are based on comparisons between these sums of squares.

41

The basic decomposition is obtained from the results on page 19:

$\hat{Y} = HY$ and $e = (I - H)Y$, with $H$ symmetric and idempotent; thus:

$Y = \hat{Y} + e = HY + (I - H)Y$

Post- and pre-multiplying by $Y^T$:

$Y^TY = Y^THY + Y^T(I - H)Y = \hat{Y}^T\hat{Y} + e^Te$

since $H^T(I - H) = 0$. In symbols,

$SS_T = SS_R + SS_E$

where $SS_T$, $SS_R$ and $SS_E$ are called the total, regression, and residual (or error) sums of squares, respectively. (This is really Pythagoras' theorem!)

Because of the special properties of $H$, each of these terms can be expressed in many equivalent ways.

What we have done is decompose the variation in $Y$ into a term explained by the regression, and an unexplained, or residual, term.

42


If $SS_R$ is large relative to $SS_E$, we intuitively conclude that the regression is doing a good job; now we develop a formal test to assess this.

We know (slides 27/8) that $E(SS_E) = (n - p)\sigma^2$. Meanwhile,

$E(SS_R) = E(Y^THY) = \beta^TX^THX\beta + \mathrm{tr}(H\,\sigma^2 I) = \beta^TX^TX\beta + p\sigma^2$

So if $\beta = 0$, $SS_R/p$ is another unbiased estimator of $\sigma^2$, and

$F = \frac{SS_R/p}{SS_E/(n - p)}$

should be close to 1; if $\beta \neq 0$ it will tend to be larger.

43

Definition of $F$ distribution

If $U \sim \chi^2_{\nu_1}$ and $V \sim \chi^2_{\nu_2}$ are independent, then

$\frac{U/\nu_1}{V/\nu_2} \sim F_{\nu_1,\nu_2}$

The $F$ test

If we assume, as usual, that $Y \sim N(X\beta, \sigma^2 I)$ and that $X$ has full rank $p$, then

(a) if $\beta = 0$, $SS_R \sim \sigma^2\chi^2_p$

(b) always, $SS_E \sim \sigma^2\chi^2_{n-p}$

(c) $SS_R$ and $SS_E$ are independent.

It follows by definition of the $F$ distribution that if $\beta = 0$,

$F = \frac{SS_R/p}{SS_E/(n - p)} \sim F_{p,n-p}$

(Actually, regarding (a), in general $SS_R = \hat{Y}^T\hat{Y} = Y^THY$, and if $\beta \neq 0$ then $SS_R/\sigma^2$ has what we call a non-central $\chi^2_p$ distribution.)

44


The resulting procedure of calculating the sums of squares and other terms to perform this F test is usually summarised as an ANOVA table:

Source                 SS       df        MS                F
Regression on $X$      $SS_R$   $p$       $SS_R/p$          $F$
Residual               $SS_E$   $n - p$   $SS_E/(n - p)$
Total, uncorrected     $SS_T$   $n$

45

Correcting for the mean

More often than not, variation in $Y$ is of interest measured from the mean of $Y$; $SS_T$ and $SS_R$ are then modified accordingly:

$SS_R^c = SS_R - n\bar{y}^2$

$SS_T^c = SS_T - n\bar{y}^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2$

The ANOVA table is modified to:

Source                SS          df         MS                  F
Regression on $X$     $SS_R^c$    $p - 1$    $SS_R^c/(p - 1)$    $F$
Residual              $SS_E$      $n - p$    $SS_E/(n - p)$
Total, corrected      $SS_T^c$    $n - 1$

where now $F \sim F_{p-1,n-p}$. This $F$ ratio is appropriate for testing the hypothesis that $\beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$ in the model

$y_i = \sum_{j=0}^{p-1}\beta_j x_{ij} + \varepsilon_i, \quad i = 1, 2, \ldots, n$

where $x_{i0} = 1$ for all $i$, so that $\beta_0$ is the intercept. We reject that hypothesis at level $\alpha$ if $F > F_{p-1,n-p,\alpha}$.

46


Significance of subsets of variables

Having fitted, say, $x_1, x_2, \ldots, x_{p-1}$, were the last few of them really necessary? (In practice, we may re-order variables before posing this question. We suppose that $x_0 \equiv 1$: the model always includes an intercept.)

We answer this question by comparing the fits of two models using a significance test. The models are:

The Full model: $Y = X\beta + \varepsilon$

The Reduced model: $Y = X_1\beta_1 + \varepsilon$

where we have partitioned $X = (X_1\;\;X_2)$ and

$\beta = \begin{pmatrix}\beta_1 \\ \beta_2\end{pmatrix}$, where $\beta_1$ is $q \times 1$, $\beta_2$ is $(p - q) \times 1$, $X_1$ is $n \times q$ and $X_2$ is $n \times (p - q)$. [NB – be careful, $\beta_1$ here is still a vector, the first $q$ components of $\beta$, not just the first component.]

The question: are the last $p - q$ explanatory variables really necessary? becomes now: can we accept the hypothesis $H_0: \beta_2 = 0$?

47

We fit both models by least squares, and obtain least squares estimates $\hat\beta$ and $\hat\beta_1$, and residual sums of squares $SS_E = S(\hat\beta)$ and $SS_E^{(1)} = S(\hat\beta_1)$. Obviously $SS_E^{(1)} \geq SS_E$ – how much larger it is says how strongly it was worth including the explanatory variables in $X_2$. This is the basis of the test – we need to find the distribution of this difference under $H_0$.

Suppose the full model is true; then

(a) if $H_0$ is true, $SS_E^{(1)} - SS_E \sim \sigma^2\chi^2_{p-q}$

(b) always, $SS_E \sim \sigma^2\chi^2_{n-p}$

(c) $SS_E$ and $SS_E^{(1)} - SS_E$ are independent.

Then under $H_0$,

$F = \frac{(SS_E^{(1)} - SS_E)/(p - q)}{SS_E/(n - p)} \sim F_{p-q,\,n-p}$

and it otherwise tends to be bigger, so we reject $H_0$ at the significance level $\alpha$ if $F > F_{p-q,\,n-p,\,\alpha}$.
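In R this full-versus-reduced comparison is exactly what anova() does when given two nested fits (a self-contained sketch with invented variables; here x2 and x3 have no real effect):

set.seed(6)
dat <- data.frame(x1 = runif(40), x2 = runif(40), x3 = runif(40))
dat$y <- 1 + 2*dat$x1 + rnorm(40, sd = 0.5)

fit.reduced <- lm(y ~ x1, data = dat)             # reduced model
fit.full    <- lm(y ~ x1 + x2 + x3, data = dat)   # full model

anova(fit.reduced, fit.full)   # F test of H0: the coefficients of x2 and x3 are zero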

48


The main part of the ANOVA table becomes:

Source                        SS                       df
Regression on $X_1$           $SS_{R1}^c$              $q - 1$
Due to $X_2$ after $X_1$      $SS_R^c - SS_{R1}^c$     $p - q$
Regression on $X$             $SS_R^c$                 $p - 1$
Residual                      $SS_E$                   $n - p$
Total, corrected              $SS_T^c$                 $n - 1$

using the fact that $SS_T^c = SS_R^c + SS_E = SS_{R1}^c + SS_E^{(1)}$, so that $SS_E^{(1)} - SS_E = SS_R^c - SS_{R1}^c$.

If we need to do the computations by hand, usually it is easiest to compute $SS_T^c$, $SS_{R1}^c$ and $SS_R^c$ (as sums of squares of the responses, and of the fitted values under each model, all 3 being corrected by subtracting $n\bar{y}^2$), then the other sums of squares by subtraction.

That is,

$SS_T^c = \sum_i y_i^2 - n\bar{y}^2$

and

$SS_{R1}^c = \sum_i (\hat{y}_i^{(1)})^2 - n\bar{y}^2, \qquad SS_R^c = \sum_i \hat{y}_i^2 - n\bar{y}^2$

where $\hat{y}_i^{(1)}$ and $\hat{y}_i$ are the fitted values from the reduced and full models.

49

The anova() function in R does all the work for you. Note that it displays the regression sum of squares for the model with just the 1st term, then the extra attributable to each additional term, and finally the residual sum of squares. That is, numbering the models in order of the terms included as $1, 2, \ldots, m$, the displayed sums of squares are

$SS_{R(1)},\; SS_{R(2)} - SS_{R(1)},\; \ldots,\; SS_{R(m)} - SS_{R(m-1)},\; SS_E$

The $F$ tests performed by the anova() function thus relate to the successive inclusion of each term sequentially.

The order in which terms are included is therefore important. The only exception to this is when the terms in the linear model are orthogonal, that is $X_j^TX_k = 0$, where $X_j$ and $X_k$ are the blocks of columns of $X$ corresponding to terms $j$ and $k$, for every pair $j \neq k$.
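A short sketch (simulated data, invented for illustration, with deliberately correlated explanatory variables) showing that the sequential sums of squares reported by anova() depend on the order of the terms when the columns are not orthogonal:

set.seed(7)
n  <- 60
x1 <- rnorm(n)
x2 <- 0.7*x1 + rnorm(n)            # correlated with x1
y  <- 1 + x1 + 0.5*x2 + rnorm(n)

anova(lm(y ~ x1 + x2))   # SS for x1, then the extra SS for x2 after x1
anova(lm(y ~ x2 + x1))   # a different split of the same total regression SS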

50


Sketch proof of $\chi^2$ and $F$ results

The results quoted in slides 36, 44 and 48, and many other results used for testing hypotheses in normal linear models, follow from the following proposition.

Suppose $Y \sim N(\mu, \sigma^2 I_n)$, and that $A_1, A_2, \ldots, A_k$ are symmetric real matrices such that

• $\sum_i A_i = I$

• $A_iA_j = 0$ for all $i \neq j$.

These are the same as requiring that $\sum_i A_iY = Y$ and that $(A_iY)^T(A_jY) = 0$ for all $i \neq j$. It follows that $A_i^2 = A_i$ for all $i$. Then

• $A_iY$, $i = 1, 2, \ldots, k$, are independent

• $A_iY \sim N(A_i\mu, \sigma^2A_i)$

• $E(Y^TA_iY) = \mu^TA_i\mu + \sigma^2\nu_i$, where $\nu_i = \mathrm{tr}(A_i) = \mathrm{rank}(A_i)$.

• if $A_i\mu = 0$, then $Y^TA_iY \sim \sigma^2\chi^2_{\nu_i}$

This can be proved using 1st year Linear Algebra, but we will omit the proof.

51

Corollaries

Distribution of $\hat\beta$ and $\hat\sigma^2$. Let $k = 2$, $A_1 = H$, $A_2 = I - H$; we find $\nu_1 = p$ and $\nu_2 = n - p$. Also note that $(X^TX)^{-1}X^TA_1$ simplifies to $(X^TX)^{-1}X^T$, so $\hat\beta$ is indeed a function of $A_1Y$. Details left as exercise. This also proves the results on page 44.

Full vs. Reduced model $F$ test. Let $k = 3$,

$A_1 = X_1(X_1^TX_1)^{-1}X_1^T$, $A_2 = X(X^TX)^{-1}X^T - X_1(X_1^TX_1)^{-1}X_1^T$,

$A_3 = I - X(X^TX)^{-1}X^T$; we find $\nu_1 = q$, $\nu_2 = p - q$ and $\nu_3 = n - p$. We can simplify: $Y^TA_2Y = SS_E^{(1)} - SS_E$ and $Y^TA_3Y = SS_E$. Details left as exercise.

Correcting for the mean. This just corresponds to taking one of the terms, say $A_1$, to be $n^{-1}\mathbf{1}\mathbf{1}^T$, where $\mathbf{1}$ is the $n$-vector of 1's. The corresponding $\nu_1 = 1$.

52