
Applying Generalized Linear Models

    James K. Lindsey

    Springer


    Preface

Generalized linear models provide a unified approach to many of the most common statistical procedures used in applied statistics. They have applications in disciplines as widely varied as agriculture, demography, ecology, economics, education, engineering, environmental studies and pollution, geography, geology, history, medicine, political science, psychology, and sociology, all of which are represented in this text.

In the years since the term was first introduced by Nelder and Wedderburn in 1972, generalized linear models have slowly become well known and widely used. Nevertheless, introductory statistics textbooks, and courses, still most often concentrate on the normal linear model, just as they did in the 1950s, as if nothing had happened in statistics in between. For students who will only receive one statistics course in their career, this is especially disastrous, because they will have a very restricted view of the possible utility of statistics in their chosen field of work. The present text, being fairly advanced, is not meant to fill that gap; see, rather, Lindsey (1995a).

Thus, throughout much of the history of statistics, statistical modelling centred around this normal linear model. Books on this subject abound. More recently, log linear and logistic models for discrete, categorical data have become common under the impetus of applications in the social sciences and medicine. A third area, models for survival data, also became a growth industry, although not always so closely related to generalized linear models. In contrast, relatively few books on generalized linear models, as such, are available. Perhaps the explanation is that normal and discrete, as well as survival, data continue to be the major fields of application. Thus, many students, even in relatively advanced statistics courses, do not have


an overview whereby they can see that these three areas, linear normal, categorical, and survival models, have much in common. Filling this gap is one goal of this book.

The introduction of the idea of generalized linear models in the early 1970s had a major impact on the way applied statistics is carried out. In the beginning, their use was primarily restricted to fairly advanced statisticians because the only explanatory material and software available were addressed to them. Anyone who used the first versions of GLIM will never forget the manual, which began with pages of statistical formulae before actually showing what the program was meant to do or how to use it.

One had to wait up to twenty years for generalized linear modelling procedures to be made more widely available in computer packages such as Genstat, Lisp-Stat, R, S-Plus, or SAS. Ironically, this is at a time when such an approach is decidedly outdated, not in the sense that it is no longer useful, but in its limiting restrictions as compared to the statistical models that are needed and possible with modern computing power. What are now required, and feasible, are nonlinear models with dependence structures among observations. However, a unified approach to such models is only slowly developing, and the accompanying software has yet to be put forth. The reader will find some hints in the last chapter of this book.

One of the most important accomplishments of generalized linear models has been to promote the central role of the likelihood function in inference. Many statistical techniques are proposed in the journals every year without the user being able to judge which are really suitable for a given data set. Most ad hoc measures, such as mean squared error, distinctly favour the symmetry and constant variance of the normal distribution. However, statistical models, which by definition provide a means of calculating the probability of the observed data, can be directly compared and judged: a model is preferable, or more likely, if it makes the observed data more probable (Lindsey, 1996b). This direct likelihood inference approach will be used throughout, although some aspects of competing methods are outlined in an appendix.

    A number of central themes run through the book:

• the vast majority of statistical problems can be formulated, in a unified way, as regression models;

• any statistical models, for the same data, can be compared (whether nested or not) directly through the likelihood function, perhaps with the aid of some model selection criterion such as the AIC;

• almost all phenomena are dynamic (stochastic) processes and, with modern computing power, appropriate models should be constructed;

• many so-called semi- and nonparametric models (although not nonparametric inference procedures) are ordinary (often saturated)


generalized linear models involving factor variables;

• for inferences, one must condition on the observed data, as with the likelihood function.

Several important and well-known books on generalized linear models are available (Aitkin et al., 1989; McCullagh and Nelder, 1989; Dobson, 1990; Fahrmeir and Tutz, 1994); the present book is intended to be complementary to them.

For this text, the reader is assumed to have knowledge of basic statistical principles, whether from a Bayesian, frequentist, or direct likelihood point of view, and to be familiar at least with the analysis of the simpler normal linear models, regression and ANOVA. The last chapter requires a considerably higher level of sophistication than the others.

This is a book about statistical modelling, not statistical inference. The idea is to show the unity of many of the commonly used models. In such a text, space is not available to provide complete detailed coverage of each specific area, whether categorical data, survival, or classical linear models. The reader will not become an expert in time series or spatial analysis by reading this book! The intention is rather to provide a taste of these different areas, and of their unity. Some of the most important specialized books available in each of these fields are indicated at the end of each chapter.

For the examples, every effort has been made to provide as much background information as possible. However, because they come from such a wide variety of fields, it is not feasible in most cases to develop prior theoretical models to which confirmatory methods, such as testing, could be applied. Instead, analyses primarily concern exploratory inference involving model selection, as is typical of practice in most areas of applied statistics. In this way, the reader will be able to discover many direct comparisons of the application of the various members of the generalized linear model family.

Chapter 1 introduces the generalized linear model in some detail. The necessary background in inference procedures is relegated to Appendices A and B, which are oriented towards the unifying role of the likelihood function and include details on the appropriate diagnostics for model checking. Simple log linear and logistic models are used, in Chapter 2, to introduce the first major application of generalized linear models. These log linear models are shown, in turn, in Chapter 3, to encompass generalized linear models as a special case, so that we come full circle. More general regression techniques are developed, through applications to growth curves, in Chapter 4. In Chapter 5, some methods of handling dependent data are described through the application of conditional regression models to longitudinal data. Another major area of application of generalized linear models is to survival, and duration, data, covered in Chapters 6 and 7, followed by spatial models in Chapter 8. Normal linear models are briefly reviewed in Chapter 9, with special reference to model checking by comparing them to


nonlinear and non-normal models. (Experienced statisticians may consider this chapter to be simpler than the others; in fact, this only reflects their greater familiarity with the subject.) Finally, the unifying methods of dynamic generalized linear models for dependent data are presented in Chapter 10, the most difficult in the text.

The two-dimensional plots were drawn with MultiPlot, for which I thank Alan Baxter, and the three-dimensional ones with Maple. I would also like to thank all of the contributors of data sets; they are individually cited with each table.

Students in the masters program in biostatistics at Limburgs University have provided many comments and suggestions throughout the years that I have taught this course there. Special thanks go to all the members of the Department of Statistics and Measurement Theory at Groningen University who created the environment for an enjoyable and profitable stay as Visiting Professor while I prepared the first draft of this text. Philippe Lambert, Patrick Lindsey, and four referees provided useful comments that helped to improve the text.

Diepenbeek
December, 1996

J.K.L.


    Contents

Preface

1 Generalized Linear Modelling
  1.1 Statistical Modelling
    1.1.1 A Motivating Example
    1.1.2 History
    1.1.3 Data Generating Mechanisms and Models
    1.1.4 Distributions
    1.1.5 Regression Models
  1.2 Exponential Dispersion Models
    1.2.1 Exponential Family
    1.2.2 Exponential Dispersion Family
    1.2.3 Mean and Variance
  1.3 Linear Structure
    1.3.1 Possible Models
    1.3.2 Notation for Model Formulae
    1.3.3 Aliasing
  1.4 Three Components of a GLM
    1.4.1 Response Distribution or "Error Structure"
    1.4.2 Linear Predictor
    1.4.3 Link Function
  1.5 Possible Models
    1.5.1 Standard Models
    1.5.2 Extensions


  1.6 Inference
  1.7 Exercises

2 Discrete Data
  2.1 Log Linear Models
    2.1.1 Simple Models
    2.1.2 Poisson Representation
  2.2 Models of Change
    2.2.1 Mover–Stayer Model
    2.2.2 Symmetry
    2.2.3 Diagonal Symmetry
    2.2.4 Long-term Dependence
    2.2.5 Explanatory Variables
  2.3 Overdispersion
    2.3.1 Heterogeneity Factor
    2.3.2 Random Effects
    2.3.3 Rasch Model
  2.4 Exercises

3 Fitting and Comparing Probability Distributions
  3.1 Fitting Distributions
    3.1.1 Poisson Regression Models
    3.1.2 Exponential Family
  3.2 Setting Up the Model
    3.2.1 Likelihood Function for Grouped Data
    3.2.2 Comparing Models
  3.3 Special Cases
    3.3.1 Truncated Distributions
    3.3.2 Overdispersion
    3.3.3 Mixture Distributions
    3.3.4 Multivariate Distributions
  3.4 Exercises

4 Growth Curves
  4.1 Exponential Growth Curves
    4.1.1 Continuous Response
    4.1.2 Count Data
  4.2 Logistic Growth Curve
  4.3 Gomperz Growth Curve
  4.4 More Complex Models
  4.5 Exercises

5 Time Series
  5.1 Poisson Processes
    5.1.1 Point Processes


    5.1.2 Homogeneous Processes
    5.1.3 Nonhomogeneous Processes
    5.1.4 Birth Processes
  5.2 Markov Processes
    5.2.1 Autoregression
    5.2.2 Other Distributions
    5.2.3 Markov Chains
  5.3 Repeated Measurements
  5.4 Exercises

6 Survival Data
  6.1 General Concepts
    6.1.1 Skewed Distributions
    6.1.2 Censoring
    6.1.3 Probability Functions
  6.2 "Nonparametric" Estimation
  6.3 Parametric Models
    6.3.1 Proportional Hazards Models
    6.3.2 Poisson Representation
    6.3.3 Exponential Distribution
    6.3.4 Weibull Distribution
  6.4 "Semiparametric" Models
    6.4.1 Piecewise Exponential Distribution
    6.4.2 Cox Model
  6.5 Exercises

7 Event Histories
  7.1 Event Histories and Survival Distributions
  7.2 Counting Processes
  7.3 Modelling Event Histories
    7.3.1 Censoring
    7.3.2 Time Dependence
  7.4 Generalizations
    7.4.1 Geometric Process
    7.4.2 Gamma Process
  7.5 Exercises

8 Spatial Data
  8.1 Spatial Interaction
    8.1.1 Directional Dependence
    8.1.2 Clustering
    8.1.3 One Cluster Centre
    8.1.4 Association
  8.2 Spatial Patterns
    8.2.1 Response Contours


    8.2.2 Distribution About a Point
  8.3 Exercises

9 Normal Models
  9.1 Linear Regression
  9.2 Analysis of Variance
  9.3 Nonlinear Regression
    9.3.1 Empirical Models
    9.3.2 Theoretical Models
  9.4 Exercises

10 Dynamic Models
  10.1 Dynamic Generalized Linear Models
    10.1.1 Components of the Model
    10.1.2 Special Cases
    10.1.3 Filtering and Prediction
  10.2 Normal Models
    10.2.1 Linear Models
    10.2.2 Nonlinear Curves
  10.3 Count Data
  10.4 Positive Response Data
  10.5 Continuous Time Nonlinear Models

Appendices

A Inference
  A.1 Direct Likelihood Inference
    A.1.1 Likelihood Function
    A.1.2 Maximum Likelihood Estimate
    A.1.3 Parameter Precision
    A.1.4 Model Selection
    A.1.5 Goodness of Fit
  A.2 Frequentist Decision-making
    A.2.1 Distribution of the Deviance Statistic
    A.2.2 Analysis of Deviance
    A.2.3 Estimation of the Scale Parameter
  A.3 Bayesian Decision-making
    A.3.1 Bayes' Formula
    A.3.2 Conjugate Distributions

B Diagnostics
  B.1 Model Checking
  B.2 Residuals
    B.2.1 Hat Matrix
    B.2.2 Kinds of Residuals


    B.2.3 Residual Plots
  B.3 Isolated Departures
    B.3.1 Outliers
    B.3.2 Influence and Leverage
  B.4 Systematic Departures

References

Index


1 Generalized Linear Modelling

    1.1 Statistical Modelling

Models are abstract, simplified representations of reality, often used both in science and in technology. No one should believe that a model could be true, although much of theoretical statistical inference is based on just this assumption. Models may be deterministic or probabilistic. In the former case, outcomes are precisely defined, whereas, in the latter, they involve variability due to unknown random factors. Models with a probabilistic component are called statistical models.

The most important class, and the one with which we are concerned here, contains the generalized linear models. They are so called because they generalize the classical linear models based on the normal distribution. As we shall soon see, this generalization has two aspects: in addition to the linear regression part of the classical models, these models can involve a variety of distributions selected from a special family, the exponential dispersion models, and they can involve transformations of the mean, through what is called a link function (Section 1.4.3), linking the regression part to the mean of one of these distributions.

    1.1.1 A Motivating Example

Altman (1991, p. 199) provides counts of T4 cells/mm³ in blood samples from 20 patients in remission from Hodgkin's disease and 20 other patients in remission from disseminated malignancies, as shown in Table 1.1.


TABLE 1.1. T4 cells/mm³ in blood samples from 20 patients in remission from Hodgkin's disease and 20 patients in remission from disseminated malignancies (Altman, 1991, p. 199).

Hodgkin's    Non-Hodgkin's
disease      disease
  396           375
  568           375
 1212           752
  171           208
  554           151
 1104           116
  257           736
  435           192
  295           315
  397          1252
  288           675
 1004           700
  431           440
  795           771
 1621           688
 1378           426
  902           410
  958           979
 1283           377
 2415           503

We wish to determine if there is a difference in cell counts between the two diseases. To do this, we should first define exactly what we mean by a difference. For example, are we simply looking for a difference in mean counts, or a difference in their variability, or even a difference in the overall form of the distributions of counts?

A simple naive approach to modelling the difference would be to look at the difference in estimated means and to make inferences using the estimated standard deviation. Such a procedure implicitly assumes a normal distribution. It implies that we are only interested in differences of means and that we assume that the variability and normal distributional form are identical in the two groups. The resulting Student t value for no difference in means is 2.11.

Because these are counts, a more sophisticated method might be to assume a Poisson distribution of the counts within each group (see Chapter 2). Here, as we shall see later, it is more natural to use differences in logarithms of the means, so that we are looking at the difference between the means, themselves, through a ratio instead of by subtraction. However, this


TABLE 1.2. Comparison of models, based on various distributional assumptions, for no difference and difference between diseases, for the T4 cell count data of Table 1.1.

Model               AIC (no difference)  AIC (difference)  Difference in −2 log(L)  Estimate/s.e.
Normal                    608.8               606.4                  4.4                 2.11
Log normal                590.1               588.6                  3.5                 1.88
Gamma                     591.3               588.0                  5.3                 2.14
Inverse Gaussian          590.0               588.2                  3.8                 1.82
Poisson                 11652.0             10294.0               1360.0                36.40
Negative binomial         589.2               586.0                  5.2                 2.36

model also carries the additional assumption that the variability will be different between the two groups if the mean is, because the variance of a Poisson distribution is equal to its mean. Now, the asymptotic Student t value for no difference in means, and hence in variances, is 36.40, quite different from the previous one.

Still a third approach would be to take logarithms before calculating the means and standard deviation in the first approach, thus, in fact, fitting a log normal model. In the Poisson model, we looked at the difference in log mean, whereas now we have the difference in mean logarithms. Here, it is much more difficult to transform back to a direct statement about the difference between the means themselves. As well, although the variance of the log count is assumed to be the same in the two groups, that of the count itself will not be identical. This procedure gives a Student t value of 1.88, yielding a still different conclusion.

A statistician only equipped with classical inference techniques has little means of judging which of these models best fits the data. For example, study of residual plots helps little here because none of the models (except the Poisson) show obvious discrepancies. With the direct likelihood approach used in this book, we can consider the Akaike (1973) information criterion (AIC), for which small values are to be preferred (see Section A.1.4). Here, it can be applied to these models, as well as some other members of the generalized linear model family.

The results for this problem are presented in Table 1.2. We see, as might be expected with such large counts, that the Poisson model fits very poorly. The other count model, that allows for overdispersion (Section 2.3), the negative binomial (the only one that is not a generalized linear model), fits best, whereas the gamma is second. By the AIC criterion, a difference between the two diseases is indicated for all distributions.
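As a concrete illustration, a comparison like that of Table 1.2 can be carried out in R with standard tools. This is only a sketch, not the analysis originally used for the table: the data are typed in from Table 1.1, MASS::glm.nb supplies the negative binomial, a log link is assumed for the gamma and inverse Gaussian fits, and R's AIC conventions (counting estimated dispersion and shape parameters) can shift the absolute values slightly from those printed above.

    # Refit the models of Table 1.2 and compare AICs (a sketch; see text).
    library(MASS)   # for glm.nb, the negative binomial fit

    hodgkin    <- c(396, 568, 1212, 171, 554, 1104, 257, 435, 295, 397,
                    288, 1004, 431, 795, 1621, 1378, 902, 958, 1283, 2415)
    nonhodgkin <- c(375, 375, 752, 208, 151, 116, 736, 192, 315, 1252,
                    675, 700, 440, 771, 688, 426, 410, 979, 377, 503)
    t4 <- data.frame(count   = c(hodgkin, nonhodgkin),
                     disease = factor(rep(c("Hodgkin", "non-Hodgkin"), each = 20)))

    fits <- list(
      normal    = glm(count ~ disease, family = gaussian, data = t4),
      lognormal = lm(log(count) ~ disease, data = t4),
      gamma     = glm(count ~ disease, family = Gamma(link = "log"), data = t4),
      invgauss  = glm(count ~ disease,
                      family = inverse.gaussian(link = "log"), data = t4),
      poisson   = glm(count ~ disease, family = poisson, data = t4),
      negbin    = glm.nb(count ~ disease, data = t4)
    )
    aic <- sapply(fits, AIC)
    # The log normal was fitted on the log scale; add the Jacobian term
    # 2 * sum(log y) to make its AIC comparable on the scale of the counts.
    aic["lognormal"] <- aic["lognormal"] + 2 * sum(log(t4$count))
    round(sort(aic), 1)

The ordering of the models, with the Poisson far worse than the rest, is the point of interest; exact values depend on the AIC convention used.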

Consider now what would happen if we apply a significance test at the 5% level. This might either be a log likelihood ratio test based on the difference


in minus two log likelihood, as given in the second last column of Table 1.2, or a Wald test based on the ratio of the estimate to the standard error, in the last column of the table. Here, the conclusions about group difference vary depending on which distribution we choose. Which test is correct? Fundamentally, only one can be: that which we hypothesized before obtaining the data (if we did). If, by whatever means, we choose a model, based on the data, and then test for a difference between the two groups, the P-value has no meaning because it does not take into account the uncertainty in the model choice.

After this digression, let us finally draw our conclusions from our model selection procedure. The choice of the negative binomial distribution indicates heterogeneity among the patients within a group: the mean cell counts are not the same for all patients. The estimated difference in log mean for our best fitting model, the negative binomial, is −0.455 with standard error 0.193, indicating lower counts for non-Hodgkin's disease patients. The ratio of means is then estimated to be exp(−0.455) = 0.634.

Thus, we see that the conclusions drawn from a set of data depend very much on the assumptions made. Standard naive methods can be very misleading. The modelling and inference approach to be presented here provides a reasonably wide set of possible assumptions, as we see from this example, assumptions that can be compared and checked with the data.

    1.1.2 History

The developments leading to the general overview of statistical modelling, known as generalized linear models, extend over more than a century. This history can be traced very briefly as follows (adapted from McCullagh and Nelder, 1989, pp. 8–17):

• multiple linear regression: a normal distribution with the identity link (Legendre, Gauss: early nineteenth century);

• analysis of variance (ANOVA) for designed experiments: a normal distribution with the identity link (Fisher: 1920s–1935);

• likelihood function: a general approach to inference about any statistical model (Fisher, 1922);

• dilution assays: a binomial distribution with the complementary log log link (Fisher, 1922);

• exponential family: a class of distributions with sufficient statistics for the parameters (Fisher, 1934);

• probit analysis: a binomial distribution with the probit link (Bliss, 1935);


• logit for proportions: a binomial distribution with the logit link (Berkson, 1944; Dyke and Patterson, 1952);

• item analysis: a Bernoulli distribution with the logit link (Rasch, 1960);

• log linear models for counts: a Poisson distribution with the log link (Birch, 1963);

• regression models for survival data: an exponential distribution with the reciprocal or the log link (Feigl and Zelen, 1965; Zippin and Armitage, 1966; Glasser, 1967);

• inverse polynomials: a gamma distribution with the reciprocal link (Nelder, 1966).

Thus, it had been known since the time of Fisher (1934) that many of the commonly used distributions were members of one family, which he called the exponential family. By the end of the 1960s, the time was ripe for a synthesis of these various models (Lindsey, 1971). In 1972, Nelder and Wedderburn went the step further in unifying the theory of statistical modelling and, in particular, regression models, publishing their article on generalized linear models (GLM). They showed

• how many of the most common linear regression models of classical statistics, listed above, were in fact members of one family and could be treated in the same way,

• that the maximum likelihood estimates for all of these models could be obtained using the same algorithm, iterated weighted least squares (IWLS; see Section A.1.2).

Both elements were equally important in the subsequent history of this approach. Thus, all of the models listed in the history above have a distribution in the exponential dispersion family (Jørgensen, 1987), a generalization of the exponential family, with some transformation of the mean, the link function, being related linearly to the explanatory variables.

Shortly thereafter, the first version of an interactive statistical computer package called GLIM (Generalized Linear Interactive Modelling) appeared, allowing statisticians easily to fit the whole range of models. GLIM produces very minimal output and, in particular, only differences of log likelihoods, what its developers called deviances, for inference. Thus, GLIM

• displaced the monopoly of models based on the normal distribution by making analysis of a larger class of appropriate models possible by any statistician,

• had a major impact on the growing recognition of the likelihood function as central to all statistical inference,


• allowed experimental development of many new models and uses for which it was never originally imagined.

However, one should now realize the major constraints of this approach, a technology of the 1970s:

1. the linear component is retained;

2. distributions are restricted to the exponential dispersion family;

3. responses must be independent.

Modern computer power can allow us to overcome these constraints, although appropriate software is slow in appearing.

    1.1.3 Data Generating Mechanisms and Models

In statistical modelling, we are interested in discovering what we can learn about systematic patterns from empirical data containing a random component. We suppose that some complex data generating mechanism has produced the observations and wish to describe it by some simpler, but still realistic, model that highlights the specific aspects of interest. Thus, by definition, models are never true in any sense.

Generally, in a model, we distinguish between systematic and random variability, where the former describes the patterns of the phenomenon in which we are particularly interested. Thus, the distinction between the two depends on the particular questions being asked. Random variability can be described by a probability distribution, perhaps multivariate, whereas the systematic part generally involves a regression model, most often, but not necessarily (Lindsey, 1974b), a function of the mean parameter. We shall explore these two aspects in more detail in the next two subsections.

    1.1.4 Distributions

    Random Component

In the very simplest cases, we observe some response variable on a number of independent units under conditions that we assume homogeneous in all aspects of interest. Due to some stochastic data generating mechanism that we imagine might have produced these responses, certain ones will appear more frequently than others. Our model, then, is some probability distribution, hopefully corresponding in pertinent ways to this mechanism, and one that we expect might represent adequately the frequencies with which the various possible responses are observed.

The hypothesized data generating mechanism, and the corresponding candidate statistical models to describe it, are scientific or technical constructs.


The latter are used to gain insight into the process under study, but are generally vast simplifications of reality. In a more descriptive context, we are just smoothing the random irregularities in the data, in this way attempting to detect patterns in them.

A probability distribution will usually have one or more unknown parameters that can be estimated from the data, allowing it to be fitted to them. Most often, one parameter will represent the average response, or some transformation of it. This determines the location of the distribution on the axis of the responses. If there are other parameters, they will describe, in various ways, the variability or dispersion of the responses. They determine the shape of the distribution, although the mean parameter will usually also play an important role in this, the form almost always changing with the size of the mean.

    Types of Response Variables

Responses may generally be classified into three broad types:

    1. measurements that can take any real value, positive or negative;

    2. measurements that can take only positive values;

    3. records of the frequency of occurrence of one or more kinds of events.

    Let us consider them in turn.

    Continuous Responses

The first type of response is well known, because elementary statistics courses concentrate on the simpler normal theory models: simple linear regression and analysis of variance (ANOVA). However, such responses are probably the rarest of the three types actually encountered in practice. Response variables that have positive probability for negative values are rather difficult to find, making such models generally unrealistic, except as rough approximations. Thus, such introductory courses are missing the mark. Nevertheless, such models are attractive to mathematicians because they have certain nice mathematical properties. But, for this very reason, the characteristics of these models are unrepresentative and quite misleading when one tries to generalize to other models, even in the same family.

    Positive Responses

When responses are measurements, they most often can only take positive values (length, area, volume, weight, time, and so on). The distribution of the responses will most often be skewed, especially if many of these values tend to be relatively close to zero.

One type of positive response of special interest is the measurement of duration time to some event: survival, illness, repair, unemployment, and


so on. Because the length of time during which observations can be made is usually limited, an additional problem may present itself here: the response time may not be completely observed; it may be censored. If the event has not yet occurred, we only know that the duration is at least as long as the observation time.

    Events

Many responses are simple records of the occurrence of events. We are often interested in the intensity with which the events occur on each unit. If only one type of event is being recorded, the data will often take the form of counts: the number of times the event has occurred to a given unit (usually, at least implicitly, within some fixed interval of time). If more than one type of response event is possible, we have categorical data, with one category corresponding to each event type. If several such events are being recorded on each unit, we may still have counts, but now as many types on each unit as there are categories (some may be zero counts).

The categories may simply be nominal, or they may be ordered in some way. If only one event is recorded on each unit, similar events may be aggregated across units to form frequencies in a contingency table. When explanatory variables distinguish among several events on the same unit, the situation becomes even more complex.

Duration time responses are very closely connected to event responses, because times are measured between events. Thus, as we shall see, many of the models for these two types of responses are closely related.

    1.1.5 Regression Models

Most situations where statistical modelling is required are more complex than can be described simply by a probability distribution, as just outlined. Circumstances are not homogeneous; instead, we are interested in how the responses change under different conditions. The latter may be described by explanatory variables. The model must have a systematic component.

Most often, for mathematical convenience rather than modelling realism, only certain simplifying assumptions are envisaged:

• responses are independent of each other;

• the mean response changes with the conditions, but the functional shape of the distribution remains fundamentally unchanged;

• the mean response, or some transformation of it, changes in some linear way as the conditions change.

Thus, as in the introductory example, we find ourselves in some sort of general linear regression situation. We would like to be able to choose from


among the available probability distributions that which is most appropriate, instead of being forced to rely only on the classical normal distribution.

FIGURE 1.1. A simple linear regression. (The vertical axis gives both the observed y_i and its mean, μ_i; the horizontal axis is x_i.)

Consider a simple linear regression plot, as shown in Figure 1.1. The normal distribution, of constant shape because the variance is assumed constant, is being displaced to follow the straight regression line as the explanatory variable changes.

    1.2 Exponential Dispersion Models

As mentioned above, generalized linear models are restricted to members of one particular family of distributions that has nice statistical properties. In fact, this restriction arises for purely technical reasons: the numerical algorithm, iterated weighted least squares (IWLS; see Section A.1.2), used for estimation only works within this family. With modern computing power, this limitation could easily be lifted; however, no such software, for a wider family of regression models, is currently being distributed. We shall now look more closely at this family.


    1.2.1 Exponential Family

Suppose that we have a set of independent random response variables, Z_i (i = 1, ..., n), and that the probability (density) function can be written in the form

    f(z_i; ξ_i) = r(z_i) s(ξ_i) exp[t(z_i) u(ξ_i)]
                = exp[t(z_i) u(ξ_i) + v(z_i) + w(ξ_i)]

with ξ_i a location parameter indicating the position where the distribution lies within the range of possible response values. Any distribution that can be written in this way is a member of the (one-parameter) exponential family. Notice the duality of the observed value, z_i, of the random variable and the parameter, ξ_i. (I use the standard notation whereby a capital letter signifies a random variable and a small letter its observed value.)

The canonical form for the random variable, the parameter, and the family is obtained by letting y = t(z) and θ = u(ξ). If these are one-to-one transformations, they simplify, but do not fundamentally change, the model, which now becomes

    f(y_i; θ_i) = exp[y_i θ_i − b(θ_i) + c(y_i)]

where b(θ_i) is the normalizing constant of the distribution. Now, Y_i (i = 1, ..., n) is a set of independent random variables with means, say μ_i, so that we might, classically, write y_i = μ_i + ε_i.

    Examples

Although it is not obvious at first sight, two of the most common discrete distributions are included in this family.

1. Poisson distribution

    f(y_i; μ_i) = μ_i^{y_i} e^{−μ_i} / y_i!
                = exp[y_i log(μ_i) − μ_i − log(y_i!)]

where θ_i = log(μ_i), b(θ_i) = exp[θ_i], and c(y_i) = −log(y_i!).

2. Binomial distribution

    f(y_i; π_i) = C(n_i, y_i) π_i^{y_i} (1 − π_i)^{n_i − y_i}
                = exp{y_i log[π_i/(1 − π_i)] + n_i log(1 − π_i) + log C(n_i, y_i)}

where θ_i = log[π_i/(1 − π_i)], b(θ_i) = n_i log(1 + exp[θ_i]), and c(y_i) = log C(n_i, y_i), with C(n_i, y_i) denoting the binomial coefficient. □

As we shall soon see, b(θ) is a very important function, its derivatives yielding the mean and the variance function.
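Because b(θ) for the Poisson is just exp(θ), this claim is easy to verify numerically. The following small check (mine, not the book's) differentiates b by central differences and recovers the mean and variance, both equal to μ for the Poisson:

    # Central-difference check that b'(theta) gives the mean and b''(theta)
    # the variance. For the Poisson, b(theta) = exp(theta), theta = log(mu).
    b     <- function(theta) exp(theta)
    mu    <- 3
    theta <- log(mu)
    h     <- 1e-4
    (b(theta + h) - b(theta - h)) / (2 * h)             # ~3: the mean
    (b(theta + h) - 2 * b(theta) + b(theta - h)) / h^2  # ~3: the variance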


    1.2.2 Exponential Dispersion Family

The exponential family can be generalized by including a (constant) scale parameter, say φ, in the distribution, such that

    f(y_i; θ_i, φ) = exp{[y_i θ_i − b(θ_i)]/a_i(φ) + c(y_i, φ)}    (1.1)

where θ_i is still the canonical form of the location parameter, some function of the mean, μ_i.

    Examples

Two common continuous distributions are members of this family.

1. Normal distribution

    f(y_i; μ_i, σ²) = (2πσ²)^{−1/2} exp[−(y_i − μ_i)²/(2σ²)]
                    = exp{[y_i μ_i − μ_i²/2]/σ² − y_i²/(2σ²) − log(2πσ²)/2}

where θ_i = μ_i, b(θ_i) = θ_i²/2, a_i(φ) = σ², and c(y_i, φ) = −[y_i²/φ + log(2πφ)]/2.

2. Gamma distribution

    f(y_i; μ_i, ν) = (ν/μ_i)^ν y_i^{ν−1} exp(−y_i ν/μ_i)/Γ(ν)
                   = exp{ν[−y_i/μ_i − log(μ_i)] + (ν − 1) log(y_i) + ν log(ν) − log[Γ(ν)]}

where θ_i = −1/μ_i, b(θ_i) = −log(−θ_i) = log(μ_i), a_i(φ) = 1/ν, and c(y_i, φ) = (ν − 1) log(y_i) + ν log(ν) − log[Γ(ν)]. □

Notice that the examples given above for the exponential family are also members of the exponential dispersion family, with a_i(φ) = 1. With φ known, this family can be taken to be a special case of the one-parameter exponential family; y_i is then the sufficient statistic for θ_i in both families.

In general, only the densities of continuous distributions are members of these families. As we can see in Appendix A, working with them implies that continuous variables are measured to infinite precision. However, the probability of observing any such point value is zero. Fortunately, such an approximation is often reasonable for location parameters when the sample size is small (although it performs increasingly poorly as sample size increases).

    1.2.3 Mean and Variance

For members of the exponential and exponential dispersion families, a special relationship exists between the mean and the variance: the latter is


a precisely defined and unique function of the former for each member (Tweedie, 1947).

The relationship can be shown in the following way. For any likelihood function, L(θ_i, φ; y_i) = f(y_i; θ_i, φ), for one observation, the first derivative of its logarithm,

    U_i = ∂ log[L(θ_i, φ; y_i)]/∂θ_i

is called the score function. (When this function, for a complete set of observations, is set to zero, the solution of the resulting equations, called the score equations, yields the maximum likelihood estimates.) From standard inference theory, it can be shown that

    E[U_i] = 0    (1.2)

and

    var[U_i] = E[U_i²] = −E[∂U_i/∂θ_i]    (1.3)

under mild regularity conditions that hold for these families.

From Equation (1.1), for the exponential dispersion family,

    log[L(θ_i, φ; y_i)] = [y_i θ_i − b(θ_i)]/a_i(φ) + c(y_i, φ)

Then, for θ_i,

    U_i = [y_i − ∂b(θ_i)/∂θ_i]/a_i(φ)    (1.4)

so that

    E[Y_i] = ∂b(θ_i)/∂θ_i = μ_i    (1.5)

from Equation (1.2), and

    ∂U_i/∂θ_i = −[∂²b(θ_i)/∂θ_i²]/a_i(φ)

from Equation (1.4). This yields

    var[U_i] = var[Y_i]/a_i(φ)² = [∂²b(θ_i)/∂θ_i²]/a_i(φ)


from Equations (1.3), (1.4), and (1.5), so that

    var[Y_i] = [∂²b(θ_i)/∂θ_i²] a_i(φ)

Usually, we can simplify by taking

    a_i(φ) = φ/w_i

where the w_i are known prior weights. Then, if we let ∂²b(θ_i)/∂θ_i² = τ_i², which we shall call the variance function, a function of μ_i (or θ_i) only, we have

    var[Y_i] = a_i(φ) τ_i² = φ τ_i²/w_i

a product of the dispersion parameter and a function of the mean only. Here, θ_i is the parameter of interest, whereas φ is usually a nuisance parameter. For these families of distributions, b(θ_i) and the variance function each uniquely distinguishes among the members.

    Examples

Distribution        Variance function
Poisson             μ = e^θ
Binomial            nπ(1 − π) = n e^θ/(1 + e^θ)²
Normal              1
Gamma               μ² = (−1/θ)²
Inverse Gaussian    μ³ = [−1/(2θ)]^{3/2}
                                                □

Notice how exceptional the normal distribution is, in that the variance function does not depend on the mean. This shows how it is possible to have the classical linear normal models with constant variance.
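These variance functions are built into the family objects of R, where they can be inspected directly; note that R's binomial family works on the proportion scale, so it returns μ(1 − μ) rather than nπ(1 − π):

    # The variance functions of the table, as stored in R's family objects.
    poisson()$variance(3)           # mu            -> 3
    binomial()$variance(0.25)       # mu * (1 - mu) -> 0.1875 (proportion scale)
    gaussian()$variance(3)          # constant      -> 1
    Gamma()$variance(3)             # mu^2          -> 9
    inverse.gaussian()$variance(3)  # mu^3          -> 27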

    1.3 Linear Structure

We have noted that one simplifying assumption in a model is often that some function of the mean response varies in a linear way as conditions change: the linear regression model. With n independent units observed, this can be written as a linear predictor. In the simplest case, the canonical location parameter is equated to a linear function of other parameters, of the form

    θ_i(μ_i) = Σ_j x_ij β_j


or

    θ(μ) = Xβ

where β is a vector of p < n (usually) unknown parameters, the matrix X_{n×p} = [x_1ᵀ, ..., x_nᵀ]ᵀ is a set of known explanatory variables, the conditions, called the design or model matrix, and Xβ is the linear structure. Here, θ is shown explicitly to be a function of the mean, something that was implicit in all that preceded.

For a qualitative or factor variable, x_ij will represent the presence or absence of a level of a factor and β_j the effect of that level; for a quantitative variable, x_ij is its value and β_j scales it to give its effect on the (transformed) mean.
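In R, for example, model.matrix() displays this coding explicitly: indicator columns for the levels of a factor, the observed values themselves for a variate. A small sketch with invented data:

    # How the design matrix codes a factor A versus a quantitative variate X.
    d <- data.frame(A = factor(c("a1", "a1", "a2", "a3")),
                    X = c(1.2, 0.7, 3.1, 2.4))
    model.matrix(~ A, data = d)  # intercept plus indicator columns for levels
    model.matrix(~ X, data = d)  # intercept plus the observed values of X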

This strictly linear model (in the parameters, but not necessarily the explanatory variables) can be further generalized by allowing other smooth functions of the mean, η(μ):

    η(μ) = Xβ

called the linear predictor. The model now has both linear and nonlinear components.

    1.3.1 Possible Models

In the model selection process, a series of regression models will be under consideration. It is useful to introduce terminology to describe the various common possibilities that may be considered.

Complete, full, or saturated model: The model has as many location parameters as observations, that is, n linearly independent parameters. Thus, it reproduces the data exactly but with no simplification, hence being of little use for interpretation.

Null model: This model has one common mean value for all observations. It is simple but usually does not adequately represent the structure of the data.

Maximal model: Here we have the largest, most complex model that we are actually prepared to consider.

Minimal model: This model contains the minimal set of parameters that must be present; for example, fixed margins for a contingency table.

Current model: This model lies between the maximal and minimal models and is presently under investigation.


The saturated model describes the observed data exactly (in fact, if the distribution contains an unknown dispersion parameter, the latter will often not even be estimable), but, for this very reason, has little chance of being adequate in replications of the study. It does not highlight the pertinent features of the data. In contrast, a minimal model has a good chance of fitting as well (or poorly!) to a replicate of the study. However, the important features of the data are missed. Thus, some reasonable balance must be found between closeness of fit to the observed data and simplicity.

    1.3.2 Notation for Model Formulae

For the expression of the linear component in models, it is often more convenient, and clearer, to be able to use terms exactly describing the variables involved, instead of the traditional Greek letters. It turns out that this has the added advantage that such expressions can be directly interpreted by computer software. In this section, let us then use the following convention for variables:

quantitative variate: X, Y, ...
qualitative factor:   A, B, ...


Note that these are abstract representations; in concrete cases, we shall use the actual names of the variables involved, with no such restrictions on the letters.

Then, the Wilkinson and Rogers (1973) notation has

Variable type   Interpretation    Algebraic component   Model term
Quantitative    Slope             β x_i                 X
Qualitative     Levels            α_i                   A
Interaction                       (αβ)_ij               A.B
Mixed           Changing slopes   β_i x                 B.X

Notice how these model formulae refer to variables, not to parameters.

    Operators

The actual model formula is set up by using a set of operators to indicate the relationships among the explanatory variables with respect to the (function of the) mean.


Operation                      Operator   Example
Combine terms                  +          X+Y+A+Y.A
Add terms to previous model    +          +X.A
Remove terms from model        -          -Y
No change                      .
Interaction                    .          A.B
Nested model                   /          A/B/C (= A+A.B+A.B.C)
Factorial model                *          A*B (= A+B+A.B)
Constant term                  1          X-1 (line through origin)

With some software, certain operator symbols are modified. For example, in R and S-Plus, the colon (:) signifies interaction. These software packages also allow direct specification of the response variable before the linear structure: Y ~ X1 + X2. I shall use the notation in the original article, as shown in the table, throughout the text.
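For readers working in R, the expansions in the table (and in the example below) can be checked mechanically with terms(), remembering that R writes ":" where Wilkinson and Rogers write ".":

    # R's formula algebra follows the Wilkinson and Rogers conventions,
    # with ":" in place of "." for interaction.
    attr(terms(~ A * B), "term.labels")        # "A" "B" "A:B"
    attr(terms(~ A / B), "term.labels")        # "A" "A:B"
    attr(terms(~ A / B / C), "term.labels")    # "A" "A:B" "A:B:C"
    attr(terms(~ (A + B) * C), "term.labels")  # "A" "B" "C" "A:C" "B:C"
    attr(terms(~ X - 1), "term.labels")        # "X", with the intercept removed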

    Example

(A*B).(C + D) = (A + B + A.B).(C + D)
              = A.C + A.D + B.C + B.D + A.B.C + A.B.D
(A*B)/C = A + B + A.B + A.B.C
A*B*C - A.(B*C) = A + B + C + B.C

A*B*C - B.C = A + B + C + A.B + A.C
A*B*C/A = A + B + C + B.C
                                                □

1.3.3 Aliasing

For various reasons, the design matrix, X_{n×p}, in a linear model may not be of full rank p. If the columns, x_1, ..., x_j, form a linearly dependent set, then some of the corresponding parameters, β_1, ..., β_j, are aliased. In numerical calculations, we can use a generalized inverse of the matrix in order to obtain estimates.

    Two types of alias are possible:

Intrinsic alias: The specification of the linear structure contains redundancy whatever the observed values in the model matrix; for example, the mean plus parameters for all levels of a factor (the sum of the matrix columns for the factor effects equals the column for the mean).

Extrinsic alias: An anomaly of the data makes the columns linearly dependent; for example, no observations are available for one level of a


factor (a zero column) or there is collinearity among explanatory variables.

Let us consider, in more detail, intrinsic alias. Suppose that the rank of X is r < p, that is, that there are p − r independent constraints on the p estimates, β̂. Many solutions will exist, but this is statistically unimportant, because the fitted values, μ̂, and the linear structure, η̂, will be the same for all possible values of β̂. Thus, these are simply different ways of expressing the same linear structure, the choice among them being made for ease of interpretation.

    Example

    Suppose that, in the regression model,

    η = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3

x_3 = x_1 + x_2, so that β_3 is redundant in explaining the structure of the data. Once information on x_1 and x_2 is removed from the data, no further information on x_3 remains. Thus, one adequate model will be

    η = γ_1 x_1 + γ_2 x_2

However,

    η = δ_1 x_1 + δ_2 x_2 + δ_3 x_3

is also possible if

    δ_1 = 0,   δ_2 = γ_2 − γ_1,   δ_3 = γ_1

or

    δ_1 = (2γ_1 − γ_2)/3,   δ_2 = (2γ_2 − γ_1)/3,   δ_3 = (γ_1 + γ_2)/3
                                                □

The first parametrization in this example, with δ_1 = 0, is called the baseline constraint, because all comparisons are being made with respect to the category having the zero value; a constraint such as δ_1 + δ_2 + δ_3 = 0 is known as the usual or conventional constraint. Constraints that make the parameters as meaningful as possible in the given context should be chosen.
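Both kinds of alias are easy to exhibit in software. The following R sketch, with simulated data, reconstructs the extrinsic alias of the example, x3 = x1 + x2; lm() works through a pivoted QR decomposition (one form of generalized inverse) and reports the aliased coefficient as NA:

    # Extrinsic alias: x3 = x1 + x2 makes the design matrix rank deficient.
    set.seed(1)
    x1 <- rnorm(20)
    x2 <- rnorm(20)
    x3 <- x1 + x2
    y  <- 1 + 2 * x1 - x2 + rnorm(20)
    coef(lm(y ~ x1 + x2 + x3))   # the coefficient of x3 is reported as NA
    alias(lm(y ~ x1 + x2 + x3))  # shows the dependency: x3 = x1 + x2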


    1.4 Three Components of a GLM

Consider again the simple linear (least-squares) regression plot of Figure 1.1. This model has been written, classically, as

    y_i = β_0 + β_1 x_i + ε_i,   where ε_i ~ N(0, σ²)

but is more clearly seen to be

    μ_i = β_0 + β_1 x_i

where μ_i is the mean of a normal distribution with constant variance, σ².

From this simple model, it is not necessarily obvious that three elements are in fact involved. We have already looked at two of them, the probability distribution and the linear structure, in some detail and have mentioned the third, the link function. Let us look at all three more closely.

1.4.1 Response Distribution or "Error Structure"

The Y_i (i = 1, ..., n) are independent random variables with means, μ_i. They share the same distribution from the exponential dispersion family, with a constant scale parameter.

1.4.2 Linear Predictor

Suppose that we have a set of p (usually) unknown parameters, β, and a set of known explanatory variables, X_{n×p} = [x_1ᵀ, ..., x_nᵀ]ᵀ, the design matrix, such that

    η = Xβ

where Xβ is the linear structure. This describes how the location of the response distribution changes with the explanatory variables.

If a parameter has a known value, the corresponding term in the linear structure is called an offset. (This will be important for a number of models in Chapters 3 and 6.) Most software packages have special facilities to handle this.
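In R, for instance, such a term is declared with offset(); a typical use, anticipating the Poisson-based models of later chapters, fixes the coefficient of a log observation time at one so that the model describes a rate. The data here are invented for illustration:

    # A log exposure time entering the linear predictor with known
    # coefficient 1, so the Poisson model describes a rate per unit time.
    d <- data.frame(events = c(2, 5, 9, 3),
                    time   = c(10, 20, 40, 15),
                    group  = factor(c("a", "a", "b", "b")))
    glm(events ~ group + offset(log(time)), family = poisson, data = d)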

    1.4.3 Link Function

If η_i = θ_i, our generalized linear model definition is complete. However, the further generalization to noncanonical transformations of the mean requires an additional component if the idea of a linear structure is to be retained. The relationship between the mean of the ith observation and its linear predictor will be given by a link function, g_i(·):

    η_i = g_i(μ_i) = x_iᵀ β


This function must be monotonic and differentiable. Usually, the same link function is used for all observations. Then, the canonical link function is that function which transforms the mean to a canonical location parameter of the exponential dispersion family member.

    Example

Distribution        Canonical link function
Poisson             log:          θ_i = log(μ_i)
Binomial            logit:        θ_i = log[π_i/(1 − π_i)] = log[μ_i/(n_i − μ_i)]
Normal              identity:     θ_i = μ_i
Gamma               reciprocal:   θ_i = −1/μ_i
Inverse Gaussian    reciprocal²:  θ_i = −1/(2μ_i²)
                                                □

With the canonical link function, all unknown parameters of the linear structure have sufficient statistics if the response distribution is a member of the exponential dispersion family and the scale parameter is known. However, the link function is just an artifact to simplify the numerical methods of estimation when a model involves a linear part, that is, to allow the IWLS algorithm to work. For strictly nonlinear regression models, it loses its meaning (Lindsey, 1974b).

Consider now the example of a canonical linear regression for the binomial distribution, called logistic regression, as illustrated in Figure 1.2. We see how the form of the distribution changes as the explanatory variable changes, in contrast to models involving a normal distribution, illustrated in Figure 1.1.

Link functions can often be used to advantage to linearize seemingly nonlinear structures. Thus, for example, logistic and Gompertz growth curves become linear when, respectively, the logit and complementary log log links are used (Chapter 4).

    Example

The Michaelis–Menten equation,

\[ \mu_i = \frac{\alpha_1 x_i}{1 + \alpha_2 x_i} \]

is often used in biology because of its asymptotic properties. With a reciprocal link, it can be written

\[ \frac{1}{\mu_i} = \beta_1 + \frac{\beta_2}{x_i} \]

where β₁ = α₂/α₁ and β₂ = 1/α₁. □
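In this linearized form, the curve can be fitted with standard GLM software using a reciprocal link and 1/x as the explanatory variable. A minimal R sketch, with illustrative data (not from the book):

    # Michaelis-Menten as a GLM: 1/mu = beta1 + beta2/x
    x <- c(0.02, 0.06, 0.11, 0.22, 0.56, 1.10)
    y <- c(76, 97, 123, 159, 191, 207)
    fit <- glm(y ~ I(1/x), family = gaussian(link = "inverse"))
    beta <- coef(fit)
    alpha1 <- 1 / beta[2]          # alpha1 = 1/beta2
    alpha2 <- beta[1] * alpha1     # alpha2 = beta1 * alpha1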


FIGURE 1.2. A simple linear logistic regression.

Thus, generalized linear models, as their name suggests, are restricted to having a linear structure for the explanatory variables. In addition, they are restricted to univariate, independent responses. Some ways of getting around these major constraints will be outlined in the next section and illustrated in some of the following chapters.

    1.5 Possible Models

    1.5.1 Standard Models

With GLM software, one can usually fit the following standard distributions, all members of the exponential dispersion family:

• Poisson
• binomial
• normal (also log normal)
• gamma (also log gamma, exponential, and Pareto)


• inverse Gaussian

and a series of link functions, only some of which are canonical:

• identity: η = μ
• reciprocal: η = 1/μ
• quadratic inverse: η = 1/μ²
• square root: η = √μ
• exponent: η = (μ + c₁)^{c₂} (c₁ and c₂ known)
• log: η = log(μ)
• logit: η = log[μ/(n − μ)]
• complementary log log: η = log[−log(1 − μ/n)]
• probit: η = Φ⁻¹(μ/n)

With some software, the user can define other models (distribution and/or link) if the distribution is a member of the exponential dispersion family.

    1.5.2 Extensions

A number of tricks can be used with standard GLM software in order to fit certain models that are not in the generalized linear family.

Distributions Close to the Exponential Dispersion Family

If a distribution would be a member of the exponential dispersion family except for one (shape) parameter, an extra iteration loop can be used to obtain the maximum likelihood estimate of that parameter.

    Example

The Weibull distribution,

\[ f(y; \alpha, \mu) = \frac{\alpha y^{\alpha-1}}{\mu^\alpha}\, e^{-(y/\mu)^\alpha} \]

with known shape parameter, α, is an exponential distribution (a gamma with ν = 1). If we take an initial value of the shape parameter, fit an exponential distribution with that value, and then estimate a new value, we can continue refitting until convergence. □
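A minimal R sketch of this trick (my code, with simulated data): because y^α is exponential with mean μ = E(y^α) when α is known, the extra loop can conveniently be written as a profile likelihood over α, refitting the GLM at each trial value.

    set.seed(1)
    x <- runif(100)
    y <- rweibull(100, shape = 1.5, scale = exp(1 + x))

    profile.ll <- function(alpha) {
      # given alpha, y^alpha is exponential: fit it as a gamma GLM, log link
      fit <- glm(I(y^alpha) ~ x, family = Gamma(link = "log"))
      mu <- fitted(fit)   # estimates of E(y^alpha)
      # Weibull log likelihood written in terms of mu
      sum(log(alpha) + (alpha - 1) * log(y) - log(mu) - y^alpha / mu)
    }
    alpha.hat <- optimize(profile.ll, c(0.5, 3), maximum = TRUE)$maximum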


    Parameters in the Link Function

Two possibilities are to plot likelihoods for various values of the unknown link parameter or to expand the link function in a Taylor series and include the first term as an extra covariate. In the latter case, we have to iterate to convergence.

    Example

An exponent link with unknown power parameter,

\[ \eta = \mu^\lambda \]

can be estimated by including an extra term,

\[ (\lambda - \lambda_0)\, \mu^{\lambda_0} \log(\mu) \]

in the linear model. Change in likelihood will provide a measure of acceptability. □
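A minimal R sketch of the first possibility (my code; the data are simulated and power() is the link constructor supplied with R):

    set.seed(2)
    x <- 1:20
    y <- (2 + 0.3 * x)^2 + rnorm(20)
    # profile the likelihood over a grid of link powers lambda
    lambdas <- seq(0.25, 1.5, by = 0.25)
    ll <- sapply(lambdas, function(lam)
      as.numeric(logLik(glm(y ~ x, family = gaussian(link = power(lam))))))
    cbind(lambdas, ll)   # lambda near 0.5 should give the largest value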

    Parameters in the Variance Function

In models from the exponential dispersion family, the likelihood equations for the linear structure can be solved without knowledge of the dispersion parameter (Section A.1.2). Some distributions have a parameter in the variance function that is not a dispersion parameter and, hence, cannot be estimated in the standard way. Usually, special methods are required for each case.

    Example

Consider the negative binomial distribution with unknown power parameter, ν, as will be given in Equation (2.4). If it were known and fixed, we would have a member of the exponential family. One approximate way in which this parameter can be estimated is by the method of moments, choosing a value that makes the Pearson chi-squared statistic equal to its expectation.

Another way, that I used in the motivating example and shall also use in Chapters 5 and 9, consists in trying a series of different values of the unknown parameter and choosing that with the smallest deviance. □
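A sketch of this second approach in R (my code, with simulated data), using negative.binomial() from the MASS package to fix the power parameter at each trial value:

    library(MASS)
    set.seed(3)
    x <- runif(50)
    y <- rnbinom(50, mu = exp(1 + x), size = 2)
    nus <- seq(0.5, 5, by = 0.5)
    devs <- sapply(nus, function(nu)
      deviance(glm(y ~ x, family = negative.binomial(nu))))
    nus[which.min(devs)]   # trial value with the smallest deviance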

    Nonlinear Structure

We can linearize a nonlinear parameter by a Taylor series approximation (Chapter 9), as for the link function.


    Example

If h(x, γ) is the nonlinear term (for example, e^{γx}), then

\[ h(x, \gamma) \doteq h(x, \gamma_0) + (\gamma - \gamma_0) \left[\frac{\partial h(x, \gamma)}{\partial \gamma}\right]_{\gamma = \gamma_0} \]

We can use two linear terms:

\[ \beta\, h(x, \gamma_0) + \delta \left[\frac{\partial h(x, \gamma)}{\partial \gamma}\right]_{\gamma = \gamma_0} \]

where δ = β(γ − γ₀). At each iteration,

\[ \gamma_{s+1} = \gamma_s + \frac{\hat\delta_s}{\hat\beta_s} \]

□
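A minimal R sketch of this iteration for h(x, γ) = e^{γx} with normal errors (my code, with simulated data; a sensible starting value may be needed):

    set.seed(4)
    x <- runif(100, 0, 2)
    y <- 2 * exp(0.7 * x) + rnorm(100, sd = 0.2)

    gamma <- 0.1                        # initial value gamma_0
    for (s in 1:50) {
      h  <- exp(gamma * x)              # h(x, gamma_0)
      dh <- x * exp(gamma * x)          # dh/dgamma at gamma_0
      b  <- coef(glm(y ~ h + dh - 1))   # beta * h + delta * dh
      gamma.new <- gamma + b["dh"] / b["h"]   # gamma_{s+1} = gamma_s + delta/beta
      if (abs(gamma.new - gamma) < 1e-8) break
      gamma <- gamma.new
    }
    c(gamma = gamma, beta = unname(b["h"]))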

    Survival Curves and Censored Observations

Many survival distributions can be shown to have a log likelihood that is essentially a Poisson distribution plus a constant term (an offset) not depending on the linear predictor (Section 6.3.2). A censored exponential distribution can be fitted with IWLS (no second iteration), whereas a number of others, including the Weibull, extreme value, and logistic distributions, require one simple extra iteration loop.
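For the exponential case, a minimal R sketch (my code, with hypothetical data; the general development is left to Section 6.3.2): the censoring indicator is treated as the Poisson response and the log survival time as the offset.

    # time: survival or censoring time; status: 1 = observed death, 0 = censored
    time   <- c(5, 8, 12, 3, 9, 14, 7, 2)
    status <- c(1, 1, 0, 1, 0, 1, 1, 1)
    x      <- c(0, 0, 0, 0, 1, 1, 1, 1)
    # the censored exponential log likelihood matches a Poisson GLM with offset
    fit <- glm(status ~ x + offset(log(time)), family = poisson)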

    Composite Link Functions

The link function may vary with (subsets of) the observations. In many cases, this can be handled as for user-programmed link functions (and distributions). Examples include the proportional odds models for ordered variables in a contingency table and certain components of dispersion (or variance) in random effects and repeated measurements models.

    1.6 Inference

Statistical software for generalized linear models generally produces deviance values (Section A.1.3) based on twice the difference of the log likelihood from that for a saturated model (that is, a difference of −2 log[L] values). However, as we have seen, the number of parameters in this saturated model depends on the number of observations, except in special cases; these models are a type of semiparametric model where the distribution is specified but the functional form of the systematic part, that is, the regression, is not. Hence, only differences in deviance, where this saturated term cancels out, may be relevant.


The one major exception is contingency tables, where the saturated model has a fixed number of parameters, not increasing with the number of observations.

Thus, semiparametric and nonparametric models, that is, those where a functional form is not specified either for the systematic or for the stochastic part, are generally at least partially saturated models with a number of parameters that increases with the sample size. Most often, they involve a factor variable whose levels depend on the data observed. This creates no problem for direct likelihood inference, where we condition on the observed data. Such saturated models often provide a point of comparison for the simpler parametric models.

In the examples in the following chapters, the AIC (Section A.1.4) is used for inference in the exploratory conditions of model selection. This is a simple penalization of the log likelihood function for complexity of the model, whereby some positive penalizing constant (traditionally unity) times the number of estimated parameters is subtracted from it. It only allows comparison of models; its absolute size is arbitrary, depending on what constants are left in the likelihood function, and, thus, has no meaning.

For contingency tables, I shall use an AIC based on the usual deviance provided by the software. In all other cases, I base it on the complete minus two log likelihood, including all constants. The latter differs from the AIC produced by some of these packages by an additive constant, but has the important advantage that models based on different distributions can be directly compared. Because of the factor of minus two in these AICs, the penalty, twice the number of estimated parameters, is applied to the minus two log likelihood. In all cases, a smaller AIC indicates a preferable model in terms of the data alone.

Generalized linear models provide us with a choice of distributions that frequentist inference, with its nesting requirements, does not easily allow us to compare. Direct likelihood inference overcomes this obstacle (Lindsey, 1974b, 1996b), and the AIC makes this possible even with different numbers of parameters estimated in the models to be compared.

In spite of some impressions, use of the AIC is not an automated process. The penalizing constant should be chosen, before collecting the data, to yield the desired complexity of models or smoothing of the data. However, for the usual sample sizes, unity (corresponding to minus two when the deviance is used) is often suitable. Obviously, if enough different models are tried, some will usually be found to fit well; the generalized linear model family, with its variety of distributions and link functions, already provides a sizable selection. However, a statistician will not blindly select that model with the smallest AIC; scientific judgment must also be weighed into the choice. Model selection is exploratory hypothesis generation; the chosen model must then be tested, on new data, the confirmatory part of the statistical endeavour.


If the AIC is to be used for model selection, then likelihood intervals for parameters must also be based on this criterion for inferences to be compatible. Otherwise, contradictions will arise (Section A.1.4). Thus, with a penalizing constant of unity, the interval for one parameter will be at 1/e = 0.368 normed likelihood. This is considerably narrower than those classically used: for example, a 5% asymptotic confidence interval, based on the chi-squared distribution, has an exp(−3.84/2) = 0.147 normed likelihood. The AIC corresponding to the latter has a penalizing constant of 1.96, adding 3.84 times the number of estimated parameters, instead of 2 times, to the deviance. This will result in the selection of much simpler models if one parameter is checked at a time. (For example, in Section 6.3.4, the exponential would be chosen over the Weibull.)

    For a further discussion of inference, see Appendix A.

    Summary

For a more general introduction to statistical modelling, the reader might like to consult Chapter 1 of Lindsey (1996b) and Chapter 2 of Lindsey (1993).

Books on the exponential family are generally very technical; see, for example, Barndorff-Nielsen (1978) or Brown (1986). Chapter 2 of Lindsey (1996b) provides a condensed survey. Jørgensen (1987) introduced the exponential dispersion family.

After the original paper by Nelder and Wedderburn (1972) on generalized linear models, several books have been published, principally McCullagh and Nelder (1989), Dobson (1990), and Fahrmeir and Tutz (1994).

For much of their history, generalized linear models have owed their success to the computer software GLIM. This has resulted in a series of books on GLIM, including Healy (1988), Aitkin et al. (1989), Lindsey (1989, 1992), and Francis et al. (1993), and the conference proceedings of Gilchrist (1982), Gilchrist et al. (1985), Decarli et al. (1989), van der Heijden et al. (1992), Fahrmeir et al. (1992), and Seeber et al. (1995).

For other software, the reader is referred to the appropriate sections of their manuals.

    For references to direct likelihood inference, see those listed at the endof Appendix A.

1.7 Exercises

1. (a) Figures 1.1 and 1.2 show, respectively, how the normal and binomial distributions change as the mean changes. Although informative, these graphics are, in some ways, fundamentally different. Discuss why.


(b) Construct a similar plot for the Poisson distribution. Is it more similar to the normal or to the binomial plot?

2. Choose some data set from a linear regression or analysis of variance course that you have had and suggest some more appropriate model for it than ones based on the normal distribution. Explain how the model may be useful in understanding the underlying data generating mechanism.

3. Why is intrinsic alias more characteristic of models for designed experiments, whereas extrinsic alias arises most often in observational studies such as sample surveys?

4. (a) Plot the likelihood function for the mean parameter of a Poisson distribution when the estimated mean is ȳ = 2.5 for n = 10 observations. Give an appropriate likelihood interval about the mean.

(b) Repeat for the same estimated mean when n = 30 and compare the results in the two cases.

(c) What happens to the graphs and to the intervals if one works with the canonical parameter instead of the mean?

    (d) How do these results relate to Fisher information? To the use of standard errors as a measure of estimation precision?


2 Discrete Data

    2.1 Log Linear Models

Traditionally, the study of statistics begins with models based on the normal distribution. This approach gives students a biased view of what is possible in statistics because, as we shall see, the most fundamental models are those for discrete data. As well, the latter are now by far the most commonly used in applied statistics. Thus, we begin our presentation of generalized linear regression modelling with the study of log linear models.

Log linear models and their special case for binary responses, logistic models, are designed for the modelling of frequency and count data, that is, those where the response variable involves discrete categories, as described in Section 1.1.4. Because they are based on the exponential family of distributions, they constitute a direct extension of traditional regression and analysis of variance. The latter models are based on the normal distribution (Chapter 9), whereas logistic and log linear models are based on the Poisson or multinomial distributions and their special cases, such as the binomial distribution. Thus, they are all members of the generalized linear model family.

Usually, although not necessarily, one models either the frequencies of occurrence of the various categories or the counts of events. Occasionally, as in some logistic regression models, the individual indicator variables of the categories are modelled. However, when both individual and grouped frequency data are available, they both give identical results. Thus, for the moment, we can concentrate here on grouped frequency data.


TABLE 2.1. A two-way table for change over time.

                    Time 2
                     A     B
    Time 1    A     45    13
              B     12    54

    2.1.1 Simple Models

In order to provide a brief and simple introduction to logistic and log linear models, I have chosen concrete applications to modelling changes over time. However, the same principles that we shall study here also apply to cross-sectional data with a set of explanatory variables.

    Observations over Time

Consider a simple two-way contingency table, Table 2.1, where some response variable with two possible values, A and B, was recorded at two points in time. A first characteristic that we may note is a relative stability over time, as indicated by the large frequencies on the diagonal. In other words, response at time 2 depends heavily on that at time 1, most often being the same.

As a simple model, we might consider that the responses at time 2 have a binomial distribution and that this distribution depends on what response was given at time 1. Thus, we might have the simple linear regression model

\[ \log\left(\frac{\pi_{1|j}}{\pi_{2|j}}\right) = \beta_0 + \beta_1 x_j \]

where x_j is the response at time 1 and π_{i|j} is the conditional probability of response i at time 2 given the observed value of x_j at time 1. Then, if β₁ = 0, this indicates independence, that is, that the second response does not depend on the first. In the Wilkinson and Rogers (1973) notation, the model can be written simply as the name of the variable:

    TIME1

If the software also required specification of the response variable at the same time, this would become

    TIME2 ~ TIME1

where TIME2 represents a 2 × 2 matrix of the frequencies in the table, with columns corresponding to the two possible response values at the second time point.

TABLE 2.2. A two-way table of clustered data.

                     Right eye
                      A     B
    Left eye   A     45    13
               B     12    54

This logistic regression model, with a logit link, the logarithm of the ratio of probabilities, is the direct analogue of classical (normal theory) linear regression. On the other hand, if x_j is coded (1, −1) or (0, 1), we may rewrite this as

\[ \log\left(\frac{\pi_{1|j}}{\pi_{2|j}}\right) = \mu + \alpha_j \]

where μ = β₀, the direct analogue of an analysis of variance model, with the appropriate constraints. With suitable software, TIME1 would simply be declared as a factor variable having two levels.

    Example

The parameter estimates for Table 2.1 are β̂₀ = μ̂ = 1.242 and β̂₁ = α̂₁ = −2.746, when x_j is coded (0, 1), with an AIC of 4. (The deviance is zero and there are two parameters.) That with β₁ = α₁ = 0, that is, independence, has AIC 48.8. Thus, in comparing the two models, the first, with dependence on the previous response, is much superior, as indicated by the smaller AIC. □
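These estimates are easily checked with standard software; a minimal R sketch (my code, not the book's):

    # Table 2.1 in grouped binomial form; rows are the time 1 response
    resp  <- cbind(A = c(45, 12), B = c(13, 54))
    time1 <- factor(c("A", "B"))
    fit   <- glm(resp ~ time1, family = binomial)
    coef(fit)   # intercept 1.242, coefficient -2.746
    indep <- glm(resp ~ 1, family = binomial)   # independence model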

    Clustered Observations

Let us momentarily leave data over time and consider, instead, the same table, now Table 2.2, as some data on the two eyes of people. We again have repeated observations on the same individuals, but here they may be considered as being made simultaneously rather than sequentially. Again, there will usually be a large number with similar responses, resulting from the dependence between the two similar eyes of each person.

Here, we would be more inclined to model the responses simultaneously, as a multinomial distribution over the four response combinations, with joint probability parameters, π_ij. In that way, we can look at the association between them. Thus, we might use a log link such that

\[ \log(\pi_{ij}) = \mu + \alpha_i + \beta_j + \gamma_{ij} \qquad (2.1) \]

With the appropriate constraints, this is again an analogue of classical analysis of variance. It is called a log linear model. If modelled by the Poisson representation (Section 2.1.2), it could be given in one of two equivalent ways:

    REYE·LEYE

or

    REYE + LEYE + REYE·LEYE

With specification of the response variable, the latter becomes

    FREQ ~ REYE + LEYE + REYE·LEYE

where FREQ is a four-element vector containing the frequencies in the table. Notice that, in this representation of the multinomial distribution, the response variable, FREQ, is not really a variable of direct interest at all.

    Example

Here, the parameter estimates for Table 2.1 are μ̂ = 2.565, α̂₁ = 1.424, β̂₁ = 1.242, and γ̂₁₁ = −2.746, with an AIC of 8. (Again, the deviance is zero, but here there are four parameters.) That with γ₁₁ = 0 has AIC 52.8. (This is 4 larger than in the previous case because the model has two more parameters, but the difference in AIC is the same.) The conclusion is identical, that the independence model is much inferior to that with dependence. □
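A minimal R sketch of the Poisson representation of this model (my code, not the book's; the sign and value of the reported interaction depend on the software's constraints, its magnitude being 2.746):

    freq <- c(45, 13, 12, 54)
    leye <- factor(c("A", "A", "B", "B"))
    reye <- factor(c("A", "B", "A", "B"))
    sat  <- glm(freq ~ reye * leye, family = poisson)   # saturated: deviance 0
    ind  <- glm(freq ~ reye + leye, family = poisson)   # independence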

    Log Linear and Logistic Models

The two models just described have a special relationship to each other. With the same constraints, the dependence parameter, γ, is identical in the two cases because

\[ \log\left(\frac{\pi_{1|1}\,\pi_{2|2}}{\pi_{2|1}\,\pi_{1|2}}\right) = \log\left(\frac{\pi_{11}\,\pi_{22}}{\pi_{12}\,\pi_{21}}\right) \]

The normed profile likelihoods for γ = 0 are also identical, although the AICs are not, because of the different numbers of parameters explicitly estimated in the two models (differences in AIC are, however, the same). This is a general result: in cases where both are applicable, logistic and log linear models yield the same conclusions. The choice is a matter of convenience.

This is a very important property, because it means that such models can be used for retrospective sampling. Common examples of this include, in medicine, case-control studies and, in the social sciences, mobility studies.

These results extend directly to larger tables, including higher-dimensional tables. There, direct analogues of classical regression and ANOVA models are still applicable. Thus, complex models of dependence among categorical variables can be built up by means of multiple regression. Explanatory variables can be discrete or continuous (at least if the data are not aggregated in a contingency table).

    2.1.2 Poisson Representation

With a log linear model, we may have more than two categories for the response variable(s), so that we require a multinomial, instead of a binomial,


distribution. This cannot generally be directly fitted by standard generalized linear modelling software. However, an important relationship exists between the multinomial and Poisson distributions that makes fitting such models possible.

Consider independent Poisson distributions with means λ_k and corresponding numbers of events n_k. Let us condition on the observed total number of events, n• = Σ_k n_k. From the properties of the Poisson distribution, this total will also have the same distribution, with mean λ• = Σ_k λ_k. Then, the conditional distribution will be

\[ \frac{\prod_k e^{-\lambda_k} \lambda_k^{n_k} / n_k!}{e^{-\lambda_\bullet} \lambda_\bullet^{n_\bullet} / n_\bullet!} = \binom{n_\bullet}{n_1 \cdots n_K} \prod_{k=1}^{K} \pi_k^{n_k} \]

a multinomial distribution with probabilities π_k = λ_k/λ•. Thus, any multinomial distribution can be fitted as a product of independent Poisson distributions with the appropriate conditioning on the total number of events.
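This identity is easily verified numerically; a quick R check with arbitrary values (my code, not the book's):

    n <- c(3, 5, 2)
    lambda <- c(2.0, 4.5, 1.5)
    lhs <- prod(dpois(n, lambda)) / dpois(sum(n), sum(lambda))
    rhs <- dmultinom(n, prob = lambda / sum(lambda))
    all.equal(lhs, rhs)   # TRUE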

Specifically, this means that, when fitting such models, the product of all explanatory variables must be included in the minimal log linear model, in order to fix the appropriate marginal totals in the table:

    R1 + R2 + ··· + E1·E2    (2.2)

where R_i represents a response variable and E_j an explanatory variable. This ensures that all responses have proper probability distributions (Lindsey, 1995b). Much of log linear modelling involves searching for simple structural models of relationships among responses (R_i) and of dependencies of responses on explanatory variables.

    2.2 Models of Change

One of the most important uses of log linear models has been in sample survey data. A particularly interesting area of this field is panel data. There, the same survey questions are administered at two or more points in time to the same people. In this way, we can study changes over time.

For simplicity, let us restrict attention, for the moment, to the observation of responses at only two points in time, that is, to two-dimensional tables, as in the simple example above. However, the generalization to more complex cases is fairly direct.

Suppose that the response has I categories, called the states, so that we have an I × I table and are studying changes in state over time. Then, our dependence parameter, γ, in Equation (2.1), will be an I × I matrix, but with only (I − 1) × (I − 1) independent values, because of the need for constraints. When I > 2, the idea is to reduce this number of parameters


by structuring the values in some informative way, that is, to be able to model the specific forms of dependence among successive responses.

The minimal model will be independence, that is, when π_ij = π_i• π_•j or, equivalently, γ_ij = 0 for all i, j. The maximal model is the saturated or nonparametric one. The latter is often not especially useful. Most interesting models, in this context, are based on Markov chains: the current response simply is made to depend on the previous one. These are models describing the transition probabilities of changing from one state to another between two points in time.

2.2.1 Mover–Stayer Model

Because, in the example above, we have noticed that there is often a rather large number of individuals who will give the same response the two times, let us first see how to model this.

Suppose that we have a mixture of two subpopulations or latent groups, one of which is susceptible to change while the other is not. This is called a mover–stayer model. We know that individuals recorded off the main diagonal will all belong to the first subpopulation, the movers, because they have changed. However, the main diagonal frequencies are more complex because they will contain both the stayers and any movers who did not happen to change within the observation period (more exactly, who were in the same place on both observation dates).

For a simple model, let us assume that the locations of the movers at the two points in time are independent. If we ignore the mixture on the diagonal, we can model the rest of the table by quasi-independence, that is, independence in an incomplete table where the diagonal is missing. Then, with this independence assumption, we can obtain estimates of the number of movers on the diagonal and, hence, of the number of stayers.

    Example

A 10% sample is available, from the published migration reports of the 1971 British census, of people migrating between 1966 and 1971 among four important centres of population in Britain, the Metropolitan Counties. The results are given in Table 2.3. All are British residents who were born in the New Commonwealth. Here, the numbers not moving, those on the diagonal, are very extreme.

TABLE 2.3. Place of residence in Britain in 1966 and 1971. (Fingleton, 1984, p. 142)

                                          1971
    1966                       CC    ULY     WM     GL
    Central Clydesdale        118     12      7     23
    Urban Lancs. & Yorks.      14   2127     86    130
    West Midlands               8     69   2548    107
    Greater London             12    110     88   7712

For this table, the deviance for independence,

    MOVE66 + MOVE71

is 19,884 (AIC 19,898) with nine degrees of freedom (d.f.), a strong indication of dependence, whereas that for the mover–stayer model (quasi-independence), fitted in the same way but to the table without the main diagonal, is 4.4 (26.4) with five d.f., a remarkable improvement. The loss of

four d.f. results from eliminating the diagonal entries; this is equivalent to allowing a separate parameter for each of them. This is taken into account in the AIC. Because the deviance is zero, the AIC for the saturated model of a contingency table is just two times the number of entries in the table, here 32, so that the mover–stayer model is also to be preferred to the saturated model.

Notice that the dependence arises almost entirely from stayers being in the same place at the two time points. The numbers of movers on the diagonal are estimated to be only 1.6, 95.2, 60.3, and 154.6, respectively. Thus, most people in the table can have their 1971 place of residence exactly predicted by that of 1966: they will be in the same place. This is the dependence just detected above. □
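A minimal R sketch of these two fits (my code, not the book's), where quasi-independence is obtained by simply dropping the diagonal cells:

    freq <- c(118, 12, 7, 23,
              14, 2127, 86, 130,
              8, 69, 2548, 107,
              12, 110, 88, 7712)
    lab <- c("CC", "ULY", "WM", "GL")
    move66 <- gl(4, 4, labels = lab)               # 1966 residence (rows)
    move71 <- gl(4, 1, length = 16, labels = lab)  # 1971 residence (columns)
    indep <- glm(freq ~ move66 + move71, family = poisson)
    quasi <- update(indep, subset = move66 != move71)
    deviance(indep)   # about 19,884 with 9 d.f.
    deviance(quasi)   # about 4.4 with 5 d.f.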

The mover–stayer model allows a different probability of staying within each response state. The special case where all of these probabilities are equal is called the loyalty model. This can be fitted by a factor variable with one level for all diagonal entries and a second for off-diagonal ones, instead of eliminating the diagonal, as in the sketch below.
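Continuing the previous sketch, a diagonal indicator gives the loyalty model:

    loyal   <- factor(move66 == move71)   # TRUE on the diagonal
    loyalty <- glm(freq ~ move66 + move71 + loyal, family = poisson)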

Note, however, that, for calculating conditional probabilities, such models are really illegitimate. The probability of being in a given state at the second time point depends on knowledge about whether or not one is a mover (or loyal), but this cannot be known until the state at the second time is available.

    2.2.2 Symmetry

Because, in panel data, the same response variables are being recorded two (or more) times, we might expect some symmetry among them.

    Complete Symmetry

Suppose that the probability of changing between any pair of categories is the same in both directions:

\[ \pi_{i|j} = \pi_{j|i} \qquad \forall i, j \qquad (2.3) \]


a model of complete symmetry. In terms of Markov chains, this is equivalent to the combination of two characteristics, reversibility and equilibrium, that we shall now study. In fact, we can separate the two.

Equilibrium

Consider, first, equilibrium: the marginal probabilities are the same at the two time points,

\[ \pi_{i\cdot} = \pi_{\cdot i} \qquad \forall i \]

In other words, the (marginal) distribution of the states remains the same at the different time points, hence the name. In the usual analysis of contingency tables, this is called marginal homogeneity; ho