

    GLM (Generalized Linear Model) #1 (version 9)

    Paul Johnson

    March 10, 2006

This handout describes the basics of estimating the Generalized Linear Model: the exponential distribution, familiar examples, the maximum likelihood estimation process, and iterative re-weighted least squares. The first version of this handout was prepared as lecture notes on Jeff Gill's handy book, Generalized Linear Models: A Unified Approach (Sage, 2001). In 2004 and 2006, I revised in light of

Myers, Montgomery and Vining, Generalized Linear Models with Applications in Engineering and the Sciences (Wiley, 2002),

Annette J. Dobson, An Introduction to Generalized Linear Models, 2ed (Chapman and Hall, 2002),

Wm. Venables and Brian Ripley, Modern Applied Statistics with S, 4ed (2004),

P. McCullagh and J.A. Nelder, Generalized Linear Models, 2ed (Chapman & Hall, 1989),

D. Firth, "Generalized Linear Models," in D.V. Hinkley, N. Reid and E.J. Snell, Eds, Statistical Theory and Modeling (Chapman and Hall, 1991).

This version has some things that remain only to explain (to myself and others) the approach in Gill, but most topics are now presented with a more general strategy.

Contents

1 Motivating idea
1.1 Consider anecdotes
1.1.1 linear model
1.1.2 logit/probit model
1.1.3 Poisson
1.2 The exponential family unites these examples
1.3 Simplest Exponential Family: one parameter
1.4 More General Exponential Family: introducing a dispersion parameter

2 GLM: the linkage between $\theta_i$ and $X_i b$
2.1 Terms
2.2 Think of a sequence

3 About the Canonical link
3.1 The canonical link is chosen so that $\eta_i = \theta_i$
3.2 Declare $\theta_i = q(\mu_i)$
3.3 How can you find $q(\mu_i)$?
3.4 Maybe you think I'm presumptuous to name that thing $q(\mu_i)$
3.5 Canonical means convenient, not mandatory

4 Examples of Exponential distributions (and associated canonical links)
4.1 Normal
4.1.1 Massage $N(\mu_i, \sigma_i^2)$ into canonical form of the exponential family
4.1.2 The Canonical Link function is the identity link
4.1.3 We reproduced the same old OLS, WLS, and GLS
4.2 Poisson
4.2.1 Massage the Poisson into canonical form
4.2.2 The canonical link is the log
4.2.3 Before you knew about the GLM
4.3 Binomial
4.3.1 Massage the Binomial into the canonical exponential form
4.3.2 Suppose $n_i = 1$
4.3.3 The canonical link is the logit
4.3.4 Before you knew about the GLM, you already knew this
4.4 Gamma
4.5 Other distributions (may fill these in later)

5 Maximum likelihood. How do we estimate these beasts?
5.1 Likelihood basics
5.2 Bring the regression coefficients back in: $\theta_i$ depends on $b$

6 Background information necessary to simplify the score equations
6.1 Review of ML handout #2
6.2 GLM Facts 1 and 2
6.2.1 GLM Fact #1
6.2.2 GLM Fact #2: The Variance function appears!
6.2.3 Variance functions for various distributions

7 Score equations for the model with a noncanonical link
7.1 How to simplify expression 72
7.2 The Big Result in matrix form

8 Score equations for the Canonical Link
8.1 Proof strategy 1
8.2 Proof Strategy 2
8.3 The Matrix form of the Score Equation for the Canonical Link

9 ML Estimation: Newton-based algorithms
9.1 Recall Newton's method
9.2 Newton-Raphson method with several parameters
9.3 Fisher Scoring
9.3.1 General blithering about Fisher Scoring and ML

10 Alternative description of the iterative approach: IWLS
10.1 Start by remembering OLS and GLS
10.2 Think of GLM as a least squares problem
10.3 The IWLS algorithm
10.4 Iterate until convergence
10.5 The Variance of b
10.6 The weights in step 3
10.7 Big insight from the First Order Condition

    1 Motivating idea.

    1.1 Consider anecdotes

    1.1.1 linear model

We already know

$y_i = X_i b + e_i$  (1)

where $X_i$ is a row vector of observations and $b$ is a column vector of coefficients. The error term is assumed Normal with a mean of 0 and variance $\sigma^2$.

Now think of it a different way. Suppose that instead of predicting $y_i$, we predict the mean of $y_i$. So put the linear predictor $X_i b$ into the Normal distribution where you ordinarily expect to find the mean:

$y_i \sim N(X_i b, \sigma^2)$  (2)

In the GLM literature, they often tire of writing $X_i b$ or such, and they start using (for shorthand) the Greek letter eta:

$\eta_i = X_i b$  (3)

    1.1.2 logit/probit model

We already studied logit and probit models. We know that we can think of the $X_i b$ part, the systematic part, as the input into a probability process that determines whether the observation is 1 or 0. We might write

$y_i \sim B(F(X_i b))$  (4)

where $F$ stands for a cumulative distribution of some probability model, such as a cumulative Normal or cumulative Logistic model, and $B$ is some stupid notation I just threw in to denote a binary process, such as a Bernoulli distribution (or a Binomial distribution). Use the Greek letter $\pi_i$ to represent the probability that $y_i$ equals 1, $\pi_i = F(X_i b)$. For the Logistic model,

$\pi_i = \frac{1}{1 + \exp(-X_i b)} = \frac{\exp(X_i b)}{1 + \exp(X_i b)}$  (5)

which is the same thing as:

$\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = X_i b$  (6)

    1.1.3 Poisson

We also now know about the Poisson model, where we can calculate the impact of $X_i b$ on the probability of a certain outcome according to the Poisson model. Check back on my Poisson notes if you can't remember, but it was basically the same thing, where we said we are taking the impact of the input to be $\exp(X_i b)$, or $e^{X_i b}$, and then we assert:

$y_i \sim \text{Poisson}(e^{X_i b})$  (7)
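These three anecdotes can all be fit with R's lm() and glm() functions. Below is a minimal sketch, not from the original handout, using simulated data and invented variable names, just to make the correspondence concrete.

## Three modeling anecdotes, fit with base R on simulated data (illustration only).
set.seed(42)
n <- 200
x <- rnorm(n)
## 1.1.1 linear model: y ~ N(Xb, sigma^2)
y_norm <- 1 + 2 * x + rnorm(n)
m1 <- lm(y_norm ~ x)                                   # or glm(..., family = gaussian)
## 1.1.2 logit model: y ~ Bernoulli(pi), with logit(pi) = Xb
y_bin <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))
m2 <- glm(y_bin ~ x, family = binomial(link = "logit"))
## 1.1.3 Poisson model: y ~ Poisson(exp(Xb))
y_pois <- rpois(n, exp(0.3 + 0.7 * x))
m3 <- glm(y_pois ~ x, family = poisson(link = "log"))
lapply(list(m1, m2, m3), coef)                         # estimated b's for each anecdote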

    1.2 The exponential family unites these examples

In these examples, there is a general pattern. We want to predict something about $y$, and we use $X_i b$ to predict it. We estimate the $b$'s and try to interpret them. How far can a person stretch this metaphor?

According to Firth (1991), Nelder and Wedderburn (1972) were the first to identify a general scheme, the Generalized Linear Model (GLM). The GLM ties together a bunch of modeling anecdotes.

    1.3 Simplest Exponential Family: one parameter

The big idea of the GLM is that $y_i$ is drawn from one of the many members of the exponential family of distributions. In that family, one finds the Normal, Poisson, Negative Binomial, exponential, extreme value distribution, and many others.

Depending on which book you read, this is described either in a simple, medium, or really general form. The simple form is this (as in Dobson, 2000, p. 44). Note the property that the effects of $y$ and $\theta$ are very SEPARABLE:

$\text{prob}(y_i|\theta_i) = f(y_i|\theta_i) = s(\theta_i)\, t(y_i)\, \exp\{a(y_i)\, b(\theta_i)\}$  (8)

Dobson says that if $a(y) = y$, then we have the so-called canonical form (canonical means standard):

$\text{prob}(y_i|\theta_i) = f(y_i|\theta_i) = s(\theta_i)\, t(y_i)\, \exp\{y_i\, b(\theta_i)\}$  (9)

Note we've got $y_i$ (the thing we observe), and an unknown parameter for each observation, $\theta_i$.


$b(\theta)$: in Dobson, the function $b(\cdot)$ is called the natural parameter. In other treatments, they assume $b(\theta) = \theta$, so $\theta$ itself is the natural parameter (see comments below).

$s(\theta)$: a function that depends on $\theta$ but NOT on $y$.

$t(y)$: a function that depends on $y$ but NOT on $\theta$.

This is the simplest representation.

Following Dobson, rearrange like so:

$f(y_i|\theta_i) = \exp\{y_i\, b(\theta_i) + \ln[s(\theta_i)] + \ln[t(y_i)]\}$  (10)

And re-label again:

$f(y_i|\theta_i) = \exp\{y_i\, b(\theta_i) - c(\theta_i) + d(y_i)\}$  (11)

Most treatments assume that $b(\theta_i) = \theta_i$. So one finds

$f(y_i|\theta_i) = \exp\{y_i\, \theta_i - c(\theta_i) + d(y_i)\}$  (12)

Many books relabel these functions $b(\cdot)$, $c(\cdot)$, and $d(y_i)$. It is a serious source of confusion.

1.4 More General Exponential Family: introducing a dispersion parameter

We want to include a parameter similar to the standard deviation, an indicator of dispersion or scale. Consider this reformulation of the exponential family (as in Firth (1991)):

$f(y_i) = \exp\left\{\frac{y_i\theta_i - c(\theta_i)}{\phi_i} + h(y_i, \phi_i)\right\}$  (13)

The new element here is $\phi_i$ (the Greek letter phi, pronounced "fee"). That is a dispersion parameter. If $\phi_i = 1$, then this formula reduces to the canonical form seen earlier in 12.

The critical thing, once again, is that the natural parameter $\theta_i$ and the observation $y_i$ appear in a particular way.

2 GLM: the linkage between $\theta_i$ and $X_i b$

In all these models, one writes that the observed $y_i$ depends on a probability process, and the probability process takes an argument $\theta_i$ which (through a chain of reasoning) reflects the input $X_i b$.


    2.1 Terms

    In order to qualify as a generalized linear model, the probability process has to have a certain

    kind of formula.

    Most sources on the GLM use this set of symbols:

$\eta_i = X_i b$: $\eta_i$ is the linear predictor.

$\mu_i = E(y_i)$: the expected value (or mean) of $y_i$.

$\eta_i = g(\mu_i)$: the link function, $g(\cdot)$, translates between the linear predictor and the mean of $y_i$. It is assumed to be monotonically increasing in $\mu_i$.

$g^{-1}(\eta_i) = \mu_i$: the inverse link function, $g^{-1}(\cdot)$, gives the mean as a function of the linear predictor.

The only difficult problem is to connect these ideas to the observed $y_i$.

2.2 Think of a sequence

$\eta_i$ depends on $X_i$ and $b$ (think of that as a function, also called $\eta_i(X_i, b)$).

$\mu_i$ depends on $\eta_i$ (via the inverse of the link function, $\mu_i = g^{-1}(\eta_i)$).

$\theta_i$ depends on $\mu_i$ (via some linkage that is distribution specific; below I will call it $q(\cdot)$).

$f(y_i|\theta_i)$ depends on $\theta_i$.

In this sequence, the only free choice for the modeler is $g(\cdot)$. The others are specified by the assumption of linearity in $X_i b$ and the mathematical formula of the probability distribution.
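To make the sequence concrete, here is a tiny illustration (mine, not the handout's) for a Poisson model with the log link; the design matrix and coefficients are made up.

## Tracing the sequence eta -> mu -> theta -> y for a Poisson/log-link model (toy values).
X     <- cbind(1, c(0.2, 1.5, -0.7))   # toy design matrix: intercept + one input
b     <- c(0.5, 1.0)                   # assumed coefficients
eta   <- drop(X %*% b)                 # eta_i = X_i b            (linear predictor)
mu    <- exp(eta)                      # mu_i  = g^{-1}(eta_i)    (inverse of the log link)
theta <- log(mu)                       # theta_i = q(mu_i); for the Poisson q = log, so theta = eta
set.seed(1)
y     <- rpois(length(mu), mu)         # y_i drawn from f(y_i | theta_i)
cbind(eta, mu, theta, y)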

    3 About the Canonical link

Recall that the link translates between $\mu_i$ (the mean of $y_i$) and the linear predictor.

$\eta_i = g(\mu_i)$  (14)

3.1 The canonical link is chosen so that $\eta_i = \theta_i$.

Use of the canonical link simplifies the maximum likelihood analysis (by which we find estimates of the $b$'s).

If you consider the idea of the sequence that translates $X_i b$ into $\theta_i$ as described in section 2.2, notice that a very significant simplifying benefit is obtained by the canonical link.

But the simplification is costly. If we are to end up with $\eta_i = \theta_i$, that means there is a major restriction on the selection of the link function, because whatever happens in $g^{-1}(\eta_i)$ to translate $\eta_i$ into $\mu_i$ has to be exactly undone by the translation from $\mu_i$ into $\theta_i$.

    McCullagh and Nelder state that the canonical link should not be used if it contradicts the

    substantive ideas that motivate a research project. In their experience, however, the canonical link

    is often substantively acceptable.


3.2 Declare $\theta_i = q(\mu_i)$.

I will invent a piece of terminology now! Henceforth, this is "Johnson's q." (Only kidding. You can just call it q.) Consider $\theta_i = q(\mu_i)$. (I chose the letter $q$ because, as far as I know, it is not used for anything important.)

The function $q(\mu_i)$ is not something we choose in the modeling process. Rather, it is a feature dictated by the probability distribution.

If we had the canonical link, $q(\mu_i)$ would have the exact opposite effect of $g^{-1}(\eta_i)$. That is to say, the link $g(\cdot)$ must be chosen so that:

$\eta_i = q(g^{-1}(\eta_i))$  (15)

One can reach that conclusion by taking the following steps.

By definition, $\mu_i = g^{-1}(\eta_i)$, which is the same as saying $\eta_i = g(\mu_i)$.

The GLM framework assumes that a mapping exists between $\mu_i$ and $\theta_i$. I'm just calling it by a name, $q$: $\theta_i = q(\mu_i)$. If you connect the dots, you find:

$\eta_i = g(\mu_i) = \theta_i = q(\mu_i)$  (16)

In other words, the function $g(\cdot)$ is the same as the function $q(\cdot)$ if the link is canonical. So if you find $q(\cdot)$ you have found the canonical link.

3.3 How can you find $q(\mu_i)$?

For convenience of the reader, here's the canonical family definition:

$f(y_i) = \exp\left\{\frac{y_i\theta_i - c(\theta_i)}{\phi_i} + h(y_i, \phi_i)\right\}$  (17)

A result detailed below in section 6.2 is that the average of $y_i$ is equal to the derivative of $c(\cdot)$. This is the finding I label GLM Fact #1:

$\mu_i = \frac{dc(\theta_i)}{d\theta_i}$  (18)

That means that the mysterious function $q(\mu_i)$ is the value you find by calculating $dc/d\theta_i$ and then re-arranging so that $\theta_i$ is a function of $\mu_i$.

Perhaps notation gets in the way here. If you think of $dc/d\theta_i$ as just an ordinary function, using notation like $\mu_i = c'(\theta_i)$, then it is (perhaps) easier to think of the inverse (or opposite, denoted by $^{-1}$) transformation, applied to $\mu_i$:

$\theta_i = [c']^{-1}(\mu_i)$  (19)

The conclusion, then, is that $q(\mu_i) = [c']^{-1}(\mu_i)$ and the canonical link are the same.

Just to re-state the obvious, if you have the canonical family representation of a distribution, the canonical link can be found by calculating $c'(\theta_i)$ and then solving for $\theta_i$.


3.4 Maybe you think I'm presumptuous to name that thing $q(\mu_i)$.

There is probably some reason why the textbooks don't see the need to define $\theta_i = q(\mu_i)$ and find the canonical link through my approach. Perhaps it makes sense only to me (not unlikely).

Perhaps you prefer the approach mentioned by Firth (p. 62). Start with expression 18. Apply $g(\cdot)$ to both sides:

$g(\mu_i) = g\left(\frac{dc(\theta_i)}{d\theta_i}\right)$  (20)

And from the definition of the canonical link function,

$\eta_i = \theta_i = g(\mu_i) = g\left(\frac{dc(\theta_i)}{d\theta_i}\right)$  (21)

As Firth observes, for a canonical link, one must choose $g(\cdot)$ so that it is the inverse of $\frac{dc(\theta_i)}{d\theta_i}$. That is, whatever $dc(\theta_i)/d\theta_i$ does to the input variable $\theta_i$, the link $g(\cdot)$ has to exactly reverse it, because we get $\theta_i$ as the end result. If $g$ undoes the derivative, then it will be true that

$\eta_i = g(\mu_i) = \theta_i$  (22)

That's the canonical link, of course. Eta equals theta.

    3.5 Canonical means convenient, not mandatory

    In many books, one finds that the authors are emphatic about the need to use the canonical link.

Myers, Montgomery, and Vining are agnostic on the matter, acknowledging the simplifying advantage of the canonical link but also working out the maximum likelihood results for noncanonical links (p. 165).

4 Examples of Exponential distributions (and associated canonical links)

We want to end up with a dependent variable that has a distribution that fits within the framework of equation 13.

    4.1 Normal

The Normal is impossibly easy! The canonical link is the identity,

$g(\mu_i) = \mu_i$

It's especially easy because the Normal distribution has a parameter named $\mu_i$ which happens also to equal the expected value. So, if you have the formula for the Normal, there's no need to do any calculation to find its mean. It is equal to the parameter $\mu_i$.


4.1.1 Massage $N(\mu_i, \sigma_i^2)$ into canonical form of the exponential family

Recall the Normal is

$f(y_i; \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{1}{2\sigma_i^2}(y_i - \mu_i)^2}$  (23)

$= \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{1}{2\sigma_i^2}(y_i^2 + \mu_i^2 - 2 y_i \mu_i)}$  (24)

$= \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{\frac{y_i\mu_i}{\sigma_i^2} - \frac{\mu_i^2}{2\sigma_i^2} - \frac{y_i^2}{2\sigma_i^2}}$  (25)

Next recall that

$\ln\frac{1}{\sqrt{2\pi\sigma_i^2}} = \ln(2\pi\sigma_i^2)^{-1/2} = -\frac{1}{2}\ln(2\pi\sigma_i^2)$  (26)

In addition, recall that

$A\, e^{x} = e^{x + \ln(A)}$  (27)

So that means 25 can be reduced to:

$\exp\left\{\frac{y_i\mu_i}{\sigma_i^2} - \frac{\mu_i^2}{2\sigma_i^2} - \frac{y_i^2}{2\sigma_i^2} - \frac{1}{2}\ln(2\pi\sigma_i^2)\right\}$  (28)

That fits easily into the general form of the exponential family in 13. The dispersion parameter $\phi_i = \sigma_i^2$.

    4.1.2 The Canonical Link function is the identity link

Match up expression 28 against the canonical distribution 13. Since the numerator of the first term in 28 is $y_i\mu_i$, that means the Normal mean parameter $\mu_i$ is playing the role of $\theta_i$. As a result,

$\theta_i = \mu_i$  (29)

The function $q(\mu_i)$ that I introduced above is trivially simple, $\theta_i = \mu_i = q(\mu_i)$. Whatever $\mu_i$ you put in, you get the same thing back out.

If you believed the result in section 3.2, then you know that $g(\cdot)$ is the same as $q(\cdot)$. So the canonical link is the identity link, $g(\mu) = \mu$.

You arrive at the same conclusion through the route suggested by equation 21. Note that the exponential family function $c(\cdot)$ corresponds to

$c(\theta_i) = \frac{1}{2}\theta_i^2$  (30)


and

$\frac{dc(\theta_i)}{d\theta_i} = c'(\theta_i) = \theta_i$  (31)

You put in $\theta_i$ and you get back $\theta_i$. The reverse must also be true. If you have the inverse of $c'(\theta_i)$, which was called $[c']^{-1}$ before, you know that

$[c']^{-1}(\mu_i) = \theta_i$  (32)

As Firth observed, the canonical link is such that $g(\mu_i) = [c']^{-1}(\mu_i)$. Since we already know that $\theta_i = \mu_i$, this means that $[c']^{-1}$ must be the identity function, because

$\mu_i = \theta_i = [c']^{-1}(\mu_i)$  (33)

    4.1.3 We reproduced the same old OLS, WLS, and GLS

We can think of the linear predictor as fitting directly into the Normal mean. That is to say, the natural parameter is equal to the mean parameter, and the mean equals the linear predictor:

$\theta_i = \mu_i = \eta_i = X_i b$

In other words, the model we are fitting is $y \sim N(\mu_i, \sigma_i^2)$ or $y \sim N(X_i b, \sigma_i^2)$.

This is the Ordinary Least Squares model if you assume that

i. $\sigma_i^2 = \sigma^2$ for all $i$, and

ii. $Cov(y_i, y_j) = 0$ for all $i \neq j$.

If you keep assumption ii but drop i, then this implies Weighted Least Squares.

If you drop both i and ii, then you have Generalized Least Squares.
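As a quick check of this equivalence (my own illustration, not from the handout), glm() with the Gaussian family and identity link reproduces the ordinary least squares estimates from lm() on simulated data:

## With the identity link and constant variance, the GLM reduces to OLS.
set.seed(7)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
coef(lm(y ~ x))                                          # ordinary least squares
coef(glm(y ~ x, family = gaussian(link = "identity")))   # same estimates from the GLM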

    4.2 Poisson.

Consider a Poisson dependent variable. It is tradition to refer to the single important parameter in the Poisson as $\lambda$, but since we know that $\lambda$ is equal to the mean of the variable, let's make our lives simple by using $\mu$ for the parameter. For an individual $i$,

$f(y_i|\mu_i) = \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}$  (34)

The parameter $\mu_i$ is known to equal the expected value of the Poisson distribution. The variance is equal to $\mu_i$ as well. As such, it is a one-parameter distribution.

If you were following along blindly with the Normal example, you would (wrongly) assume that you can use the identity link and just put in $\eta_i = X_i b$ in place of $\mu_i$. Strangely enough, if you do that, you would not have a distribution that fits into the canonical exponential family.


    4.2.1 Massage the Poisson into canonical form

Rewrite the Poisson as:

$f(y_i|\mu_i) = e^{-\mu_i + \ln(\mu_i^{y_i}) - \ln(y_i!)}$

which is the same as

$f(y_i|\mu_i) = \exp[y_i\ln(\mu_i) - \mu_i - \ln(y_i!)]$  (35)

There is no scale factor (dispersion parameter), so we don't have to compare against the general form of the exponential family, equation 13. Rather, we can line this up against the simpler form in equation 12.

    4.2.2 The canonical link is the log

In 35 we have the first term $y_i\ln(\mu_i)$, which is not exactly parallel to the exponential family, which requires $y_i\theta_i$. In order to make this match up against the canonical exponential form, note:

$\theta_i = \ln(\mu_i)$  (36)

So the mysterious function $q(\cdot)$ is $\ln(\mu_i)$. So the canonical link is

$g(\mu_i) = \ln(\mu_i)$  (37)

You get the same answer if you follow the other route, as suggested by 21. Note that in 35, $c(\theta) = \exp(\theta)$ and since

$\mu_i = \frac{dc(\theta_i)}{d\theta_i} = \exp(\theta_i)$  (38)

The inverse function for $\exp(\theta_i)$ is $\ln(\cdot)$, so $[c']^{-1} = \ln$, and as a result

$g(\mu_i) = [c']^{-1}(\mu_i) = \ln(\mu_i)$  (39)

The Poisson is a one-parameter distribution, so we don't have to worry about dispersion.

And the standard approach to the modeling of Poisson data is to use the log link.

    4.2.3 Before you knew about the GLM

In the standard count data (Poisson) regression model, as popularized in political science by Gary King in the late 1980s, we assume that $y_i$ follows a Poisson distribution with an expected value of $\exp[X_i b]$. That is to say, we assume

$\mu_i = \exp[X_i b]$

Put $\exp[X_i b]$ in place of $\mu_i$ in expression 34, and you have the Poisson count model:

$f(y_i|X_i, b) = \frac{e^{-\exp[X_i b]}\,(\exp[X_i b])^{y_i}}{y_i!}$  (40)


When you view this model in isolation, apart from GLM, the justification for using $\exp[X_i b]$ instead of $X_i b$ is usually given by hand-waving arguments, contending that "we need to be sure the average is 0 or greater, so exp is a suitable transformation" or "we really do think the impact of the input variables is multiplicative and exponential."

The canonical link from the GLM provides another justification that does not require quite so much hand waving.
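Here is a small sketch of the count model in equation 40, fit two ways on simulated data (my own example, not the handout's): once with glm() and the canonical log link, and once by directly maximizing the Poisson log likelihood with a generic optimizer.

## Poisson count model: glm() with the log link vs. direct maximization of the likelihood.
set.seed(3)
x   <- rnorm(150)
y   <- rpois(150, exp(0.4 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))
## negative log likelihood of equation 40 as a function of b = (b0, b1)
negll <- function(b) -sum(dpois(y, lambda = exp(b[1] + b[2] * x), log = TRUE))
opt <- optim(c(0, 0), negll)              # generic optimizer; same answer up to numerical error
rbind(glm = coef(fit), optim = opt$par)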

    4.3 Binomial

A binomial model is used to predict the number of successes that we should expect out of a number of tries. The probability of success is fixed at $p_i$ and the number of tries is $n_i$, and the probability of $y_i$ successes given $n_i$ tries is

$f(y_i|n_i, p_i) = \binom{n_i}{y_i}\, p_i^{y_i}(1 - p_i)^{n_i - y_i} = \frac{n_i!}{y_i!(n_i - y_i)!}\, p_i^{y_i}(1 - p_i)^{n_i - y_i}$  (41)

In this case, the role of the mean, $\mu_i$, is played by the letter $p_i$. We do that just for variety :)

    4.3.1 Massage the Binomial into the canonical exponential form

I find it a bit surprising that this fits into the same exponential family, actually. But it does fit because you can do this trick. It is always true that

$x = e^{\ln(x)} = \exp[\ln(x)]$  (42)

Ordinarily, we write the binomial model for the probability of $y$ successes out of $n$ trials

$\binom{n}{y}\, p^{y}(1 - p)^{n - y} = \frac{n!}{y!(n - y)!}\, p^{y}(1 - p)^{n - y}$  (43)

But that can be massaged into the exponential family:

$e^{\ln\left[\binom{n}{y}\, p^{y}(1 - p)^{n - y}\right]} = e^{\ln\binom{n}{y} + \ln(p^{y}) + \ln((1 - p)^{n - y})}$  (44)

$= e^{\ln\binom{n}{y} + y\ln(p) + (n - y)\ln(1 - p)} = e^{\ln\binom{n}{y} + y\ln(p) + n\ln(1 - p) - y\ln(1 - p)}$  (45)

Note that because $\ln(a) - \ln(b) = \ln\left(\frac{a}{b}\right)$, we have:

$e^{y\ln\left(\frac{p}{1 - p}\right) + n\ln(1 - p) + \ln\binom{n}{y}}$

This clearly fits within the simplest form of the exponential distribution:

$\exp\left[y_i\ln\left(\frac{p_i}{1 - p_i}\right) + n_i\ln(1 - p_i) + \ln\binom{n_i}{y_i}\right]$  (46)


4.3.2 Suppose $n_i = 1$

The probability of success on a single trial is $p_i$. Hence, the expected value with one trial is $p_i$. The expected value of $y_i$ when there are $n_i$ trials is $\mu_i = n_i p_i$. The parameter $n_i$ is a fixed constant for the $i$th observation.

When we do a logistic regression, we typically think of each observation as one test case. A survey respondent says yes or no and we code $y_i = 1$ or $y_i = 0$. Just one test of the random process is run per observation, which means $y_i$ is either 0 or 1.

So, in the following, let's keep it simple by fixing $n_i = 1$. And, in that context,

$\mu_i = p_i$  (47)

    4.3.3 The canonical link is the logit

Eyeball equation 46 against the canonical form 12, and you see that if you think the probability of a success is $p_i$, then you have to transform $p_i$ so that:

$\theta_i = \ln\left(\frac{p_i}{1 - p_i}\right)$  (48)

Since $p_i = \mu_i$, the mysterious function $q(\cdot)$ is $\ln\left(\frac{\mu_i}{1 - \mu_i}\right)$. That is the canonical link for the Binomial distribution. It is called the logit link and it maps from the mean of the observations, $\mu_i = p_i$, back to the natural parameter $\theta_i$.

    4.3.4 Before you knew about the GLM, you already knew this

That's just the plain old logistic regression equation for a (0,1) dependent variable. The logistic regression is usually introduced either as

$P(y_i = 1|X_i, b) = \frac{1}{1 + \exp(-X_i b)} = \frac{\exp(X_i b)}{1 + \exp(X_i b)}$  (49)

Or, equivalently,

$\ln\left(\frac{p_i}{1 - p_i}\right) = X_i b$  (50)

It is easy to see that, if the canonical link is used, $\eta_i = \theta_i$.

    4.4 Gamma

Gamma is a random variable that is continuous on $(0, +\infty)$. The probability of observing a score $y_i$ depends on 2 parameters, the shape and the scale. Call the shape $\alpha$ (same for all observations, so no subscript is used) and the scale $\beta_i$. The probability distribution for the Gamma is

$y_i \sim f(\alpha, \beta_i) = \frac{1}{\beta_i^{\alpha}\,\Gamma(\alpha)}\, y_i^{(\alpha - 1)}\, e^{-y_i/\beta_i}$  (51)

The mean of this distribution is $\mu_i = \alpha\beta_i$. We can use input variables and parameters to predict a value of $\mu_i$, which in turn influences the distribution of observed outcomes.

The canonical link is $1/\mu_i$. I have a separate writeup about Gamma GLM models.


    4.5 Other distributions (may fill these in later).

Pareto: $f(y|\theta) = \theta\, y^{-\theta - 1}$

Exponential: $f(y|\theta) = \theta\, e^{-\theta y}$ or $f(y|\theta) = \frac{1}{\theta}\, e^{-y/\theta}$ (depending on your taste)

Negative Binomial: $f(y|\theta) = \binom{y + r - 1}{r - 1}\, \theta^{r}(1 - \theta)^{y}$, where $r$ is a known value.

Extreme Value (Gumbel): $f(y|\theta) = \frac{1}{\phi}\exp\left(-\frac{y - \theta}{\phi}\right)\exp\left[-\exp\left(-\frac{y - \theta}{\phi}\right)\right]$

5 Maximum likelihood. How do we estimate these beasts?

The parameter fitting process should maximize the probability that the model fits the data.

5.1 Likelihood basics

According to classical maximum likelihood theory, one should calculate the probability of observing the whole set of data, for observations $i = 1, \ldots, N$, given a particular parameter:

$L(\theta|X_1, \ldots, X_N) = f(X_1|\theta)\, f(X_2|\theta)\, \cdots\, f(X_N|\theta)$

The log of the likelihood function is easier to work with, and maximizing that is logically equivalent to maximizing the likelihood. Remember $\ln(\exp(x)) = x$?

With the more general form of the exponential distribution, equation 13, the log likelihood for a single observation equals:

$l_i(\theta_i, \phi_i|y_i) = \log\left[\exp\left\{\frac{y_i\theta_i - c(\theta_i)}{\phi_i} + h(y_i, \phi_i)\right\}\right] = \frac{y_i\theta_i - c(\theta_i)}{\phi_i} + h(y_i, \phi_i)$  (52)

The dispersion parameter $\phi_i$ is often treated as a nuisance parameter. It is estimated separately from the other parameters, the ones that affect $\mu_i$.
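A small numerical check (my own, using the Poisson case where $\theta = \ln\mu$, $c(\theta) = e^{\theta}$, $\phi = 1$, and $h(y, \phi) = -\ln y!$) confirms that the exponential-family form of the log likelihood in equation 52 matches R's built-in density:

## Equation 52 written out for the Poisson, checked against dpois(..., log = TRUE).
y       <- 0:5
mu      <- 2.3
theta   <- log(mu)                       # canonical (natural) parameter
c_theta <- exp(theta)                    # c(theta) for the Poisson
h       <- -lfactorial(y)                # h(y, phi) with phi = 1
ll_expfam <- y * theta - c_theta + h     # equation 52 with phi_i = 1
ll_dpois  <- dpois(y, lambda = mu, log = TRUE)
all.equal(ll_expfam, ll_dpois)           # TRUE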


5.2 Bring the regression coefficients back in: $\theta_i$ depends on $b$

Recall the comments in section 2.2 which indicated that we can view the problem at hand as a sequence of transformations. The log likelihood contribution of a particular case $\ln L_i = l_i$ depends on $\theta_i$, and $\theta_i$ depends on $\mu_i$, and $\mu_i$ depends on $\eta_i$, and $\eta_i$ depends on $b$. So you'd have to think of some gnarly thing like

$l_i(b) = f(\theta_i(\mu_i(\eta_i(b))))$

The chain rule for derivatives applies to a case where you have a function of a function of a function.... In this case it implies, as noted in McCullagh and Nelder (p. 41), that the score equation for a coefficient $b_k$ is

$\frac{\partial l}{\partial b_k} = U_k = \sum_{i=1}^{N}\left[\frac{\partial l_i}{\partial \theta_i}\,\frac{\partial \theta_i}{\partial \mu_i}\,\frac{\partial \mu_i}{\partial \eta_i}\,\frac{\partial \eta_i}{\partial b_k}\right]$  (53)

The letter $U$ is a customary (very widely used) label for the score equations. It is necessary to simultaneously solve $p$ of these score equations (one for each of the $p$ parameters being estimated):

$\frac{\partial l}{\partial b_1} = U_1 = 0$

$\frac{\partial l}{\partial b_2} = U_2 = 0$

$\vdots$

$\frac{\partial l}{\partial b_p} = U_p = 0$

The assumptions linking $b$ to $\theta$ reduce the number of separate coefficients to be estimated quite dramatically. We started out with $\theta = (\theta_1, \theta_2, \ldots, \theta_N)$ natural location parameters, one for each case, but now there are just $p$ coefficients in the vector $b$.

6 Background information necessary to simplify the score equations

6.1 Review of ML handout #2

When you maximize something by optimally choosing an estimate, take the first derivative with respect to the parameter and set that result equal to zero. Then solve for the parameter. The score function is the first derivative of the log likelihood with respect to the parameter of interest.

    If you find the values which maximize the likelihood, it means the score function has to equal

    zero, and there is some interesting insight in that (Gill, p.22).

    From the general theory of maximum likelihood, as explained in the handout ML#2, it is true

    that:


ML Fact #1: $E\left[\frac{\partial l_i}{\partial \theta_i}\right] = 0$  (54)

and

ML Fact #2: $-E\left[\frac{\partial^2 l_i}{\partial \theta_i^2}\right] = Var\left[\frac{\partial l_i}{\partial \theta_i}\right]$  (55)

6.2 GLM Facts 1 and 2

GLM Facts 1 & 2 are almost invariably stated as simple facts in textbooks, but often they are not explained in detail (see Gill, p. 23; McCullagh and Nelder 1989, p. 28; Firth 1991, p. 61; and Dobson 2000, p. 47-48).

    6.2.1 GLM Fact #1:

For the exponential family as described in 13, the following is true.

GLM Fact #1: $E(y_i) = \mu_i = \frac{dc(\theta_i)}{d\theta_i}$  (56)

In words, the expected value of $y_i$ is equal to the derivative of $c(\theta_i)$ with respect to $\theta_i$. In Firth and in Dobson, the shorthand notation for $dc(\theta_i)/d\theta_i$ is $c'(\theta_i)$. The importance of this result is that if we are given an exponential distribution, we can calculate its mean by differentiating the $c(\cdot)$ function.

Proof.

Take the first derivative of $l_i$ (equation 52) with respect to $\theta_i$ and you find:

$\frac{\partial l_i}{\partial \theta_i} = \frac{1}{\phi_i}\left[y_i - \frac{dc(\theta_i)}{d\theta_i}\right]$  (57)

This is the "score" of the $i$th case. Apply the expected value operator to both sides, and taking into account ML Fact #1 stated above (54), the whole thing is equal to 0:

$E\left[\frac{\partial l_i}{\partial \theta_i}\right] = \frac{1}{\phi_i}\left\{E[y_i] - E\left[\frac{dc(\theta_i)}{d\theta_i}\right]\right\} = 0$  (58)

which implies

$E[y_i] = E\left[\frac{dc(\theta_i)}{d\theta_i}\right]$  (59)

Two simplifications immediately follow. First, by definition, $E[y_i] = \mu_i$. Second, because $c(\theta_i)$ does not depend on $y_i$, $dc/d\theta_i$ is the same for all values of $y_i$, and since the expected value of a constant is just the constant (recall $E(A) = A$), then

$E\left[\frac{dc(\theta_i)}{d\theta_i}\right] = \frac{dc(\theta_i)}{d\theta_i}$  (60)

As a result, expression 59 reduces to

$\mu_i = \frac{dc(\theta_i)}{d\theta_i}$  (61)

And the proof is finished!
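As a quick numeric illustration of GLM Fact #1 (my own, for the Poisson case where $c(\theta) = \exp(\theta)$), the derivative $dc/d\theta = \exp(\theta)$ should match the simulated mean of $y$:

## GLM Fact #1 spot check for the Poisson: E(y) should equal dc/dtheta = exp(theta).
theta <- 0.9
mu    <- exp(theta)                     # dc/dtheta evaluated at theta
set.seed(11)
c(simulated_mean = mean(rpois(1e6, lambda = exp(theta))),
  dc_dtheta      = mu)                  # the two numbers agree closely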


    6.2.2 GLM Fact #2: The Variance function appears!

The second ML fact (55) leads to a result about the variance of $y$ as it depends on the mean. The variance of $y$ is

GLM Fact #2: $Var(y_i) = \phi_i\,\frac{d^2 c(\theta_i)}{d\theta_i^2} = \phi_i\, V(\mu_i)$  (62)

The variance of $y_i$ breaks into two parts: one is the dispersion parameter $\phi_i$ and the other is the so-called variance function $V(\mu_i)$. This can be found in McCullagh & Nelder (1989, p. 29; Gill, p. 29; or Firth, p. 61).

Note that $V(\mu_i)$ does not mean "the variance of $\mu_i$." It means the variance as it depends on $\mu_i$. The variance of the observed $y_i$ combines two pieces: the part that depends on the mean, which is the second derivative of $c(\theta_i)$, sometimes labeled $c''(\theta_i)$, and the scaling element $\phi_i$, which does not.

Proof.

Apply the variance operator to 57:

$Var\left[\frac{\partial l_i}{\partial \theta_i}\right] = \frac{1}{\phi_i^2}\, Var\left[y_i - \frac{dc(\theta_i)}{d\theta_i}\right]$  (63)

Since (as explained in the previous section) $dc/d\theta_i$ is a constant, this reduces to

$Var\left[\frac{\partial l_i}{\partial \theta_i}\right] = \frac{1}{\phi_i^2}\, Var[y_i]$  (64)

Put that result aside for a moment. Take the partial derivative of 57 with respect to $\theta_i$ and the result is

$\frac{\partial^2 l_i}{\partial \theta_i^2} = \frac{1}{\phi_i}\left[\frac{\partial y_i}{\partial \theta_i} - \frac{d^2 c(\theta_i)}{d\theta_i^2}\right]$  (65)

I wrote this out term-by-term in order to draw your attention to the fact that $\frac{\partial y_i}{\partial \theta_i} = 0$ because $y_i$ is treated as a constant at this point. So the expression reduces to:

$\frac{\partial^2 l_i}{\partial \theta_i^2} = -\frac{1}{\phi_i}\,\frac{d^2 c(\theta_i)}{d\theta_i^2}$  (66)

And, of course, it should be trivially obvious (since the expected value of a constant is the constant) that

$E\left[\frac{\partial^2 l_i}{\partial \theta_i^2}\right] = -\frac{1}{\phi_i}\,\frac{d^2 c(\theta_i)}{d\theta_i^2}$  (67)

At this point, we use ML Fact #2 and replace the left hand side (with its sign reversed) with $Var\left[\frac{\partial l_i}{\partial \theta_i}\right]$, which results in

$Var\left[\frac{\partial l_i}{\partial \theta_i}\right] = \frac{1}{\phi_i}\,\frac{d^2 c(\theta_i)}{d\theta_i^2}$  (68)

Next, put to use the result stated in expression (64). That allows the left hand side to be replaced:

$\frac{1}{\phi_i^2}\, Var[y_i] = \frac{1}{\phi_i}\,\frac{d^2 c(\theta_i)}{d\theta_i^2}$  (69)


and the end result is:

$Var[y_i] = \phi_i\,\frac{d^2 c(\theta_i)}{d\theta_i^2}$  (70)

The variance function, $V(\cdot)$, is defined as

$V(\mu_i) = \frac{d^2 c(\theta_i)}{d\theta_i^2} = c''(\theta_i)$  (71)

The relabeling amounts to

$Var(y_i) = \phi_i\, V(\mu_i)$

    6.2.3 Variance functions for various distributions

Firth observes that the variance function characterizes the members in this class of distributions, as in (a quick numerical spot check follows the table):

Normal: $V(\mu) = 1$

Poisson: $V(\mu) = \mu$

Binomial: $V(\mu) = \mu(1 - \mu)$

Gamma: $V(\mu) = \mu^2$
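The following simulation-based check is mine, not the handout's; it verifies two rows of the table by comparing sample variances with $\phi V(\mu)$ (here $\phi = 1$):

## Var(y) = phi * V(mu): simulation checks for the Poisson and Binomial (n = 1) rows.
set.seed(5)
mu <- 1.7
c(sample_var = var(rpois(1e6, mu)),  V_mu = mu)             # Poisson: V(mu) = mu
p <- 0.3
c(sample_var = var(rbinom(1e6, 1, p)), V_mu = p * (1 - p))  # Binomial (n = 1): V(mu) = mu(1 - mu)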

    7 Score equations for the model with a noncanonical link.

Our objective is to simplify this in order to calculate the $p$ parameter estimates in $b = (b_1, b_2, \ldots, b_p)$ that solve the score equations (one for each parameter):

$\frac{\partial l}{\partial b_k} = U_k = \sum_{i=1}^{N}\left[\frac{\partial l_i}{\partial \theta_i}\,\frac{\partial \theta_i}{\partial \mu_i}\,\frac{\partial \mu_i}{\partial \eta_i}\,\frac{\partial \eta_i}{\partial b_k}\right] = 0$  (72)

The big result that we want to derive is this:

$\frac{\partial l}{\partial b_k} = U_k = \sum_{i=1}^{N}\frac{1}{\phi_i\, V(\mu_i)\, g'(\mu_i)}\,[y_i - \mu_i]\, x_{ik} = 0$  (73)

That's a big result because it looks almost exactly like the score equation from Weighted (or Generalized) Least Squares.

    7.1 How to simplify expression 72.

1. Claim: $\frac{\partial l_i}{\partial \theta_i} = [y_i - \mu_i]/\phi_i$. This is easy to show. Start with the general version of the exponential family:

$l(\theta_i|y_i) = \frac{y_i\theta_i - c(\theta_i)}{\phi_i} + h(y_i, \phi_i)$  (74)


Differentiate:

$\frac{\partial l_i}{\partial \theta_i} = \frac{1}{\phi_i}\left[y_i - \frac{dc(\theta_i)}{d\theta_i}\right]$  (75)

Use GLM Fact #1, $\mu_i = dc(\theta_i)/d\theta_i$:

$\frac{\partial l_i}{\partial \theta_i} = \frac{1}{\phi_i}\,[y_i - \mu_i]$  (76)

This is starting to look familiar, right? It's the difference between the observation and its expected (dare we say, predicted?) value. A residual, in other words.

2. Claim: $\frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{V(\mu_i)}$

Recall GLM Fact #1, $\mu_i = dc/d\theta_i$. Differentiate $\mu_i$ with respect to $\theta_i$:

$\frac{\partial \mu_i}{\partial \theta_i} = \frac{d^2 c(\theta_i)}{d\theta_i^2}$  (77)

In light of GLM Fact #2,

$\frac{\partial \mu_i}{\partial \theta_i} = \frac{Var(y_i)}{\phi_i}$  (78)

And $Var(y_i) = \phi_i\, V(\mu_i)$, so 77 reduces to

$\frac{\partial \mu_i}{\partial \theta_i} = V(\mu_i)$  (79)

For monotonic functions, it is true that

$\frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{\partial \mu_i/\partial \theta_i}$  (80)

and so the conclusion is

$\frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{V(\mu_i)}$  (81)

3. Claim: $\frac{\partial \eta_i}{\partial b_k} = x_{ik}$

This is a simple result because we are working with a linear model! The linear predictor is

$\eta_i = X_i b = b_1 x_{i1} + \ldots + b_p x_{ip}$

so

$\frac{\partial \eta_i}{\partial b_k} = x_{ik}$  (82)


The Big Result: Put these claims together to simplify 72.

There is one score equation for each parameter being estimated. Insert the results from the previous steps into 72. (See, e.g., McCullagh & Nelder, p. 41; MM&V, p. 331; or Dobson, p. 63.)

$\frac{\partial l}{\partial b_k} = U_k = \sum_{i=1}^{N}\frac{1}{\phi_i\, V(\mu_i)}\,[y_i - \mu_i]\,\frac{\partial \mu_i}{\partial \eta_i}\, x_{ik} = 0$  (83)

Note $\partial \mu_i/\partial \eta_i = 1/g'(\mu_i)$ because

$\frac{\partial \mu_i}{\partial \eta_i} = \frac{1}{\partial \eta_i/\partial \mu_i} = \frac{1}{g'(\mu_i)}$  (84)

As a result, we find:

$\frac{\partial l}{\partial b_k} = U_k = \sum_{i=1}^{N}\frac{1}{\phi_i\, V(\mu_i)\, g'(\mu_i)}\,[y_i - \mu_i]\, x_{ik} = 0$  (85)

    7.2 The Big Result in matrix form

In matrix form, $y$ and $\mu$ are column vectors and $X$ is the usual data matrix. So

$X'W(y - \mu) = 0$  (86)

The matrix $W$ is $N \times N$ square, but it has only diagonal elements $\frac{1}{\phi_i}\,\frac{1}{V(\mu_i)}\,\frac{\partial \mu_i}{\partial \eta_i}$. More explicitly, 86 would be:

$\begin{bmatrix} x_{11} & x_{21} & \cdots & x_{N1} \\ \vdots & & & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{Np} \end{bmatrix} \begin{bmatrix} \frac{1}{\phi_1 V(\mu_1)}\frac{\partial \mu_1}{\partial \eta_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\phi_2 V(\mu_2)}\frac{\partial \mu_2}{\partial \eta_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\phi_N V(\mu_N)}\frac{\partial \mu_N}{\partial \eta_N} \end{bmatrix} \begin{bmatrix} y_1 - \mu_1 \\ y_2 - \mu_2 \\ \vdots \\ y_N - \mu_N \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$  (87)

There are $p$ rows here, each one representing one of the score equations.

The matrix equation in 86 implies

$X'Wy = X'W\mu$  (88)

The problem remains to find estimates of $b$ so that these equations are satisfied.
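A numerical check of the big result (my own construction, not the handout's): fit a Poisson model with the noncanonical square-root link using glm(), then verify that $X'W(y - \mu)$ is essentially zero at the estimates.

## Score check for a NONcanonical link: Poisson with the sqrt link, where g'(mu) = 1/(2*sqrt(mu)).
set.seed(9)
x   <- runif(200)
y   <- rpois(200, (1 + 2 * x)^2)
fit <- glm(y ~ x, family = poisson(link = "sqrt"))
X   <- model.matrix(fit)
mu  <- fitted(fit)
w   <- (2 * sqrt(mu)) / mu              # 1/(phi * V(mu) * g'(mu)) with phi = 1, V(mu) = mu
drop(t(X) %*% (w * (y - mu)))           # both elements are essentially zero at the MLE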

    8 Score equations for the Canonical Link

Recall we want to solve this score equation:

$\frac{\partial l}{\partial b_k} = U_k = \sum_{i=1}^{N}\left[\frac{\partial l_i}{\partial \theta_i}\,\frac{\partial \theta_i}{\partial \mu_i}\,\frac{\partial \mu_i}{\partial \eta_i}\,\frac{\partial \eta_i}{\partial b_k}\right] = 0$


The end result will be the following:

$\frac{\partial l}{\partial b_k} = U_k = \sum_{i=1}^{N}\frac{1}{\phi_i}\,[y_i - \mu_i]\, x_{ik} = 0$  (89)

Usually it is assumed that the dispersion parameter is the same for all cases. That means $\phi$ is just a scaling factor that does not affect the calculation of optimal $b$. So this simplifies as

$\frac{\partial l}{\partial b_k} = U_k = \sum_{i=1}^{N}[y_i - \mu_i]\, x_{ik} = 0$  (90)

    I know of two ways to show that this result is valid. The first approach follows the same route

    that was used for the noncanonical case.

    8.1 Proof strategy 1

1. Claim 1:

$\frac{\partial l}{\partial b_k} = U_k = \sum_{i=1}^{N}\left[\frac{\partial l_i}{\partial \theta_i}\,\frac{\partial \eta_i}{\partial b_k}\right]$  (91)

This is true because the canonical link implies that

$\frac{\partial \theta_i}{\partial \mu_i}\,\frac{\partial \mu_i}{\partial \eta_i} = 1$  (92)

Because the canonical link is chosen so that $\eta_i = \theta_i$, it must be that $\partial \theta_i/\partial \eta_i = 1$ and $\partial \eta_i/\partial \theta_i = 1$. If you just think of the derivatives in 92 as fractions and rearrange the denominators, you can arrive at the right conclusion (but I was cautioned by calculus professors several times about doing things like that without exercising great caution).

It is probably better to remember my magical $q$, where $\theta_i = q(\mu_i)$, and note that $\mu_i = g^{-1}(\eta_i)$, so expression 92 can be written

$\frac{\partial q(\mu_i)}{\partial \mu_i}\,\frac{\partial g^{-1}(\eta_i)}{\partial \eta_i}$  (93)

In the canonical case, $\eta_i = q(\mu_i) = g(\mu_i)$, and also $\mu_i = g^{-1}(\eta_i) = q^{-1}(\eta_i)$. Recall the definition of the link function includes the restriction that it is a monotonic function. For monotonic functions, the derivative of the inverse is (here I cite my first year calculus book: Jerrold Marsden and Alan Weinstein, Calculus, Menlo Park, CA: Benjamin-Cummings, 1980, p. 224):

$\frac{\partial g^{-1}(\eta_i)}{\partial \eta_i} = \frac{\partial q^{-1}(\eta_i)}{\partial \eta_i} = \frac{1}{\frac{\partial q(\mu_i)}{\partial \mu_i}}$  (94)

Insert that into expression 93 and you see that the assertion in 92 is correct.


2. Claim 2: Previously obtained results allow a completion of this exercise.

The results of the proof for the noncanonical case can be used. Equations 76 and 82 indicate that we can replace $\frac{\partial l_i}{\partial \theta_i}$ with $\frac{1}{\phi_i}[y_i - \mu_i]$ and $\frac{\partial \eta_i}{\partial b_k}$ with $x_{ik}$.

8.2 Proof Strategy 2

Begin with the log likelihood of the exponential family.

$l(\theta_i|y_i) = \frac{y_i\theta_i - c(\theta_i)}{\phi_i} + h(y_i, \phi_i)$  (95)

Step 1. Recall $\theta_i = \eta_i = X_i b$:

$l(\theta_i|y_i) = \frac{y_i X_i b - c(X_i b)}{\phi_i} + h(y_i, \phi_i)$  (96)

We can replace $\theta_i$ by $X_i b$ because of the definition of the canonical link ($\theta_i = \eta_i$) and the model originally stated $\eta_i = X_i b$.

Step 2. Differentiate with respect to $b_k$ to find the $k$th score equation:

$\frac{\partial l_i}{\partial b_k} = \frac{y_i x_{ik} - c'(X_i b)\, x_{ik}}{\phi_i}$  (97)

The chain rule is used:

$\frac{dc(X_i b)}{db_k} = \frac{dc(X_i b)}{d\theta_i}\,\frac{dX_i b}{db_k} = c'(X_i b)\, x_{ik}$  (98)

Recall GLM Fact #1, which stated that $c'(\theta_i) = \mu_i$. That implies

$\frac{\partial l}{\partial b_k} = \frac{y_i x_{ik} - \mu_i x_{ik}}{\phi_i}$  (99)

and after re-arranging

$\frac{\partial l}{\partial b_k} = \frac{1}{\phi_i}\,[y_i - \mu_i]\, x_{ik}$  (100)

There is one of these score equations for each of the $p$ parameters. The challenge is to choose a vector of estimates $b = (b_1, b_2, \ldots, b_{p-1}, b_p)$ that simultaneously solves all equations of this form.

    8.3 The Matrix form of the Score Equation for the Canonical Link

In matrix form, the $p$ equations would look like this:

$\begin{bmatrix} x_{11} & x_{21} & x_{31} & \cdots & x_{(N-1)1} & x_{N1} \\ x_{12} & x_{22} & & \cdots & & x_{N2} \\ x_{13} & x_{23} & & \cdots & & x_{N3} \\ \vdots & & & & & \vdots \\ x_{1p} & x_{2p} & & \cdots & x_{(N-1)p} & x_{Np} \end{bmatrix} \begin{bmatrix} \frac{1}{\phi_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\phi_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\phi_N} \end{bmatrix} \begin{bmatrix} y_1 - \mu_1 \\ y_2 - \mu_2 \\ y_3 - \mu_3 \\ \vdots \\ y_{(N-1)} - \mu_{(N-1)} \\ y_N - \mu_N \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \end{bmatrix}$  (101)


MM&V note that we usually assume $\phi_i$ is the same for all cases. If it is a constant, it disappears from the problem.

$\begin{bmatrix} x_{11} & x_{21} & x_{31} & \cdots & x_{(N-1)1} & x_{N1} \\ x_{12} & x_{22} & & \cdots & & x_{N2} \\ x_{13} & x_{23} & & \cdots & & x_{N3} \\ \vdots & & & & & \vdots \\ x_{1p} & x_{2p} & & \cdots & x_{(N-1)p} & x_{Np} \end{bmatrix} \begin{bmatrix} y_1 - \mu_1 \\ y_2 - \mu_2 \\ y_3 - \mu_3 \\ \vdots \\ y_{(N-1)} - \mu_{(N-1)} \\ y_N - \mu_N \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \end{bmatrix}$  (102)

Assuming $\phi_i = \phi$, the matrix form is as follows. $y$ and $\mu$ are column vectors and $X$ is the usual data matrix. So this amounts to:

$\frac{1}{\phi}\, X'(y - \mu) = X'(y - \mu) = 0$  (103)

$X'y = X'\mu$  (104)

Although this appears to be not substantially easier to solve than the noncanonical version, it is in fact quite a bit simpler. In this expression, the only element that depends on $b$ is $\mu$. In the noncanonical version, we had both $W$ and $\mu$ that were sensitive to $b$.
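As a check of equation 104 (my own example), fit a logistic regression (the canonical logit link) with glm() and confirm that $X'(y - \mu)$ is essentially zero at the estimates:

## Canonical-link score equation X'(y - mu) = 0, checked at the logistic regression MLE.
set.seed(13)
x   <- rnorm(300)
y   <- rbinom(300, 1, plogis(-0.2 + 0.9 * x))
fit <- glm(y ~ x, family = binomial(link = "logit"))
X   <- model.matrix(fit)
drop(t(X) %*% (y - fitted(fit)))        # essentially (0, 0), as equation 104 requires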

    9 ML Estimation: Newton-based algorithms.

The problem outlined in 73 or 101 is not easy to solve for optimal estimates of $b$. The value of the mean, $\mu_i$, depends on the unknown coefficients, $b$, as well as the values of the $X$'s, the $y$'s, and possibly other parameters. And in the noncanonical system, there is the additional complication of the weight matrix $W$. There is no closed form solution to the problem.

The Bottom Line: we need to find the roots of the score functions. Adjust the estimated values of the $b$'s so that the score functions equal 0.

The general approach for iterative calculation is as follows:

$b^{t+1} = b^{t} + \text{Adjustment}$

The Newton and Fisher scoring methods differ only slightly, in the choice of the Adjustment.

9.1 Recall Newton's method

Review my Approximations handout and Newton's method. One can approximate a function $f(\cdot)$ by the sum of its value at a point and a linear projection ($f'$ means the derivative of $f$):

$f(b^{t+1}) = f(b^{t}) + (b^{t+1} - b^{t})\, f'(b^{t})$  (105)

This means that the value of $f$ at $b^{t+1}$ is approximated by the value of $f$ at $b^{t}$ plus an approximating increment.


The Newton algorithm for root finding (also called the Newton-Raphson algorithm) is a technique that uses Newton's approximation insight to find the roots of a first order equation. If this is a score equation (first order condition), we want the left hand side to be zero; that means that we should adjust the parameter estimate so that

$0 = f(b^{t}) + (b^{t+1} - b^{t})\, f'(b^{t})$  (106)

or

$b^{t+1} = b^{t} - \frac{f(b^{t})}{f'(b^{t})}$  (107)

This gives us an iterative procedure for recalculating new estimates of $b$.
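Here is a generic sketch (mine, not the handout's) of the scalar update in equation 107, applied to an arbitrary function just to show the mechanics:

## Scalar Newton root finding: b_{t+1} = b_t - f(b_t)/f'(b_t).
newton_root <- function(f, fprime, b0, tol = 1e-10, maxit = 50) {
  b <- b0
  for (t in seq_len(maxit)) {
    step <- f(b) / fprime(b)
    b <- b - step
    if (abs(step) < tol) break      # stop when the adjustment is negligible
  }
  b
}
newton_root(function(b) b^2 - 2, function(b) 2 * b, b0 = 1)   # converges to sqrt(2)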

    9.2 Newton-Raphson method with several parameters

The Newton approach to approximation is the basis for both the Newton-Raphson algorithm and the Fisher method of scoring.

Examine formula 107. That equation holds whether $b$ is a single scalar or a vector, with the obvious exception that, in a vector context, the denominator is thought of as the inverse of a matrix, $1/f''(b) = [f''(b)]^{-1}$. Note that $f'(b)$ is a column vector of $p$ derivatives, one for each variable in the vector $b$ (it is called a gradient). And $f''(b)$ is the matrix of second derivatives. For a three-parameter problem, for example:

$f'(b_1, b_2, b_3) = \begin{bmatrix} \frac{\partial l(b|y)}{\partial b_1} \\ \frac{\partial l(b|y)}{\partial b_2} \\ \frac{\partial l(b|y)}{\partial b_3} \end{bmatrix}$  (108)

$f''(b_1, b_2, b_3) = \frac{\partial^2 l}{\partial b\,\partial b'} = \begin{bmatrix} \frac{\partial^2 l(b|y)}{\partial b_1^2} & \frac{\partial^2 l(b|y)}{\partial b_1\partial b_2} & \frac{\partial^2 l(b|y)}{\partial b_1\partial b_3} \\ \frac{\partial^2 l(b|y)}{\partial b_1\partial b_2} & \frac{\partial^2 l(b|y)}{\partial b_2^2} & \frac{\partial^2 l(b|y)}{\partial b_2\partial b_3} \\ \frac{\partial^2 l(b|y)}{\partial b_1\partial b_3} & \frac{\partial^2 l(b|y)}{\partial b_2\partial b_3} & \frac{\partial^2 l(b|y)}{\partial b_3^2} \end{bmatrix}$  (109)

That matrix of second derivatives is also known as the Hessian. By custom, it is often called $H$.

The Newton-Raphson algorithm at step $t$ updates the estimate of the parameters in this way:

$b^{t+1} = b^{t} - \left[\frac{\partial^2 l(b^{t}|y)}{\partial b\,\partial b'}\right]^{-1}\frac{\partial l(b^{t}|y)}{\partial b}$  (110)

If you want to save some ink, write $b^{t+1} = b^{t} - H^{-1} f'(b)$ or $b^{t+1} = b^{t} - H^{-1}\,\partial l/\partial b$.

We have the negative of the Hessian matrix in this formula. From the basics of matrix calculus, if $H$ is negative semi-definite, it means we are in the vicinity of a local maximum in the likelihood function. The iteration process will cause the likelihood to increase with each re-calculation. (Note: saying $H$ is negative definite is the same as saying that $-H$ is positive definite.) In the vicinity of the maximum, all will be well, but when the likelihood is far from maximal, it may not be so.


The point of caution here is that the matrix $H^{-1}$ is not always going to take the estimate of the new parameter in the right direction. There is nothing that guarantees the matrix is negative semi-definite, which is what assures that each iteration climbs the surface of the likelihood function.

    9.3 Fisher Scoring

R.A. Fisher was a famous statistician who pioneered much of the maximum likelihood theory. In his method of scoring, instead of the negative Hessian, one should instead use the information matrix, which is the negative of the expected value of the Hessian matrix:

$I(b) = -E\left[\frac{\partial^2 l(b^{t}|y)}{\partial b\,\partial b'}\right]$  (111)

The matrix $E[\partial^2 l/\partial b\,\partial b']$ is always negative semi-definite, so it has a theoretical advantage over the Newton-Raphson approach. ($-E[\partial^2 l/\partial b\,\partial b']$ is positive semi-definite.)

There are several different ways to calculate the information matrix. In Dobson's An Introduction to Generalized Linear Models, 2ed, the following strategy is used. The item in the $j$th row, $k$th column of the information matrix is known to be

$I_{jk}(b) = E[U(b_j)\, U(b_k)]$  (112)

I should track down the book where I read that this is one of three ways to calculate the expected value of the Hessian matrix in order to get the Information. Until I remember what that was, just proceed. Fill in the values of the score equations $U(b_j)$ and $U(b_k)$ from the previous results:

$E\left\{\left[\sum_{i=1}^{N}\frac{y_i - \mu_i}{\phi\, V(\mu_i)\, g'(\mu_i)}\, x_{ij}\right]\left[\sum_{l=1}^{N}\frac{y_l - \mu_l}{\phi\, V(\mu_l)\, g'(\mu_l)}\, x_{lk}\right]\right\}$  (113)

That gives $N \times N$ terms in a sum, but almost all of them are 0: $E[(y_i - \mu_i)(y_l - \mu_l)] = 0$ if $i \neq l$, since the cases are statistically independent. That collapses the total to $N$ terms,

$\sum_{i=1}^{N}\frac{E[(y_i - \mu_i)^2]}{[\phi\, V(\mu_i)\, g'(\mu_i)]^2}\, x_{ij}\, x_{ik}$  (114)

The only randomly varying quantity is $y_i$, so the denominator is already reduced as far as it can go. Recall that the variance of $y_i$ is equal to the numerator, $Var(y_i) = E(y_i - \mu_i)^2$, and $Var(y_i) = \phi\, V(\mu_i)$, so we can do some rearranging and end up with

$\sum_{i=1}^{N}\frac{1}{\phi\, V(\mu_i)\,[g'(\mu_i)]^2}\, x_{ij}\, x_{ik}$  (115)

which is the same as

$\sum_{i=1}^{N}\frac{1}{\phi\, V(\mu_i)}\left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2 x_{ij}\, x_{ik}$  (116)

The part that includes the $X$ data information can be separated from the rest. Think of the not-$X$ part as a weight.


Using matrix algebra, letting $x_j$ represent the $j$th column from the data matrix $X$,

$I_{jk}(b) = x_j' W x_k$  (117)

The Fisher Information Matrix is simply compiled by filling in all of its elements in that way, and the whole information matrix is

$I(b) = X'WX$  (118)

The Fisher scoring equations, using the information matrix, are

$b^{t+1} = b^{t} + I(b^{t})^{-1}\, U(b^{t})$  (119)

Get rid of the inverse on the right hand side by multiplying through on the left by $I(b^{t})$:

$I(b^{t})\, b^{t+1} = I(b^{t})\, b^{t} + U(b^{t})$  (120)

Using the formulae that were developed for the information matrix and the score equation, the following is found (after about 6 steps, which Dobson wrote down clearly):

$X'WX\, b^{t+1} = X'Wz$  (121)

The reader should become very alert at this point because we have found something that should look very familiar. The previous equation is the VERY RECOGNIZABLE "normal equation" in regression analysis. It is the matrix equation for the estimation of a weighted regression model in which $z$ is the dependent variable and $X$ is a matrix of input variables. That means that a computer program written to do regression can be put to use in calculating the $b$ estimates for a GLM. One simply has to obtain the new (should I say "constructed") variable $z$ and run regressions. Over and over.

The variable $z$ appears here after some careful re-grouping of the right hand side. It is something of a magical value, appearing as if by accident (but probably not an accident in the eyes of the experts). The column vector $z$ has individual elements equal to the following ($t$ represents calculations made at the $t$th iteration):

$z_i = x_i b^{t} + (y_i - \mu_i^{t})\left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{t}$  (122)

Since $\eta_i = x_i b^{t}$, this is

$z_i = \eta_i^{t} + (y_i - \mu_i^{t})\left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{t}$  (123)

This is a Newtonian approximation formula. It gives an approximation of the linear predictor's value, starting at the point $(\mu_i^{t}, \eta_i^{t})$ and moving horizontally by the distance $(y_i - \mu_i^{t})$. The variable $z_i$ essentially represents our best guess of what the unmeasured linear predictor is for a case in which the observed value of the dependent variable is $y_i$. The estimation uses $\mu_i^{t}$, our current estimate of the mean corresponding to $y_i$, with the link function to make the approximation.
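To make the constructed variable concrete, here is a one-step illustration (mine, with invented names and starting values) of equation 123 for a Poisson model with the log link, where $\partial\eta/\partial\mu = 1/\mu$:

## One step of the working variable z for a Poisson/log-link model at a crude estimate b_t.
set.seed(1)
X   <- cbind(1, runif(50))
y   <- rpois(50, exp(0.5 + 1.2 * X[, 2]))
b_t <- c(0.1, 0.1)                       # ad hoc current estimate
eta <- drop(X %*% b_t)                   # current linear predictor
mu  <- exp(eta)                          # inverse log link
z   <- eta + (y - mu) * (1 / mu)         # equation 123: z_i = eta_i + (y_i - mu_i) * (d eta/d mu)
head(z)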


    9.3.1 General blithering about Fisher Scoring and ML

I have encountered a mix of opinion about $H$ and $I$ and why some authors emphasize $H$ and others emphasize $I$. In the general ML context, the Information matrix is more difficult to find because it requires us to calculate the expected value, which depends on a complicated distribution. That is William Greene's argument (Econometric Analysis, 5th edition, p. 939). He claims that most of the time the Hessian itself, rather than its expected value, is used in practice. In the GLM, however, we have special structure, the exponential family, which simplifies the problem.

In a general ML context, many excellent books argue in favor of using Fisher's information matrix rather than the Hessian matrix. This approach is called "Fisher scoring" most of the time, but the terminology is sometimes ambiguous. See Eugene Demidenko's Mixed Models: Theory and Applications (New York: Wiley, 2004). Demidenko outlines several reasons to prefer the Fisher approach. In the present context, this is the most relevant (p. 86):

"The negative Hessian matrix, $-\partial^2 l/\partial\theta^2$, or empirical information matrix, may not be positive definite (more precisely, not nonnegative definite), especially when the current approximation is far from the MLE. When this happens, the NR algorithm slows down or even fails. On the contrary, the expected information matrix, used in the FS algorithm, is always positive definite..."

Incidentally, it does not matter if we use the ordinary N-R algorithm or Fisher scoring with the canonical link in a GLM. The expected value of the Hessian equals the observed value when a canonical link is used. McCullagh & Nelder observe that $-H$ equals $I$.

The reader might want to review the ML theory. The Variance/Covariance matrix of the estimates of $b$ is correctly given by the inverse of $I(b)$. As mentioned in the comparison of $I$ and $H$, my observation is that, in the general maximum likelihood problem, $I(b)$ often can't be calculated, so an approximation based on the observed second derivatives is used. It is frequently asserted that the approximations are almost as good. I have found several sources which claim that the Newton-Raphson method will lead to slightly different estimates of the Variance/Covariance matrix of $b$ than the Fisher scoring method.

    10 Alternative description of the iterative approach: IWLS

The glm() procedure in R's stats library uses a routine called Iterated Weighted Least Squares. As the previous section should indicate, the calculating algorithm in the N-R/Fisher framework will boil down to the iterated estimation of a weighted least squares problem. The interesting thing, in my opinion, is that McCullagh & Nelder prefer to present the IWLS as their solution strategy. Comprehending IWLS as the solution requires the reader to make some truly heroic mental leaps. Nevertheless, the IWLS approach is a frequent starting place. After describing the IWLS algorithm, McCullagh & Nelder then showed that IWLS is equivalent to Fisher scoring. I don't mean simply equivalent in estimates of the $b$'s, but in fact algorithmically equivalent. I did not understand this argument on first reading, but I do now. The two approaches are EXACTLY THE SAME. Not just similar. Mechanically identical.

I wonder if the following is true. I believe that it probably is. In 1972, they had good algorithms for calculating weighted least squares problems, but not such good access to maximum likelihood software. So McCullagh and Nelder sought an estimating procedure for their GLM that would use existing tools. The IWLS approach is offered as a calculating algorithm, and then in the following section, McCullagh & Nelder showed that if one were to use Fisher scoring, one would arrive at the same iterative calculations.

Today, there are good general purpose optimization algorithms for maximum likelihood that could be used to calculate GLM estimates. One could use them instead of IWLS, probably. The algorithm would not be so obviously tailored to the substance of the problem, and a great deal of insight would be lost.

If you study IWLS, it makes your mental muscles stronger and it also helps you to see how all of the different kinds of regression fit together.

    10.1 Start by remembering OLS and GLS

Remember the estimators from linear regression in matrix form:

OLS: $\hat{y} = Xb$; minimize $(y - \hat{y})'(y - \hat{y})$; $\hat{b} = (X'X)^{-1}X'y$; $Var(\hat{b}) = \sigma^2(X'X)^{-1}$

WLS and GLS: $\hat{y} = Xb$; minimize $(y - \hat{y})'W(y - \hat{y})$; $\hat{b} = (X'WX)^{-1}X'Wy$; $Var(\hat{b}) = \sigma^2(X'WX)^{-1}$

    10.2 Think of GLM as a least squares problem

Something like this would be nice as an objective function:

$\text{Sum of Squares} = \sum(\mu_i - \hat{\mu}_i)^2$  (124)

If we could choose $b$ to make the predicted mean fit most closely against the true means, life would be sweet!

But recall in the GLM that we don't get to observe the mean, so we don't have $\mu_i$ to compare against $\hat{\mu}_i$. Rather we just have observations gathered from a distribution. In the case of a logit model, for example, we do not observe $\mu_i$, but rather we observe 1 or 0. In the Poisson model, we observe count values, not $\mu_i$.

Maybe we could think of the problem as a way of making the linear predictor fit most closely against the observations:

$\text{Sum of Squares} = \sum(\eta_i - \hat{\eta}_i)^2$

We certainly can calculate $\hat{\eta}_i = b_0 + b_1 x_{i1}$. But we can't observe $\eta_i$, the true value of the linear predictor, any more than we can observe the true mean $\mu_i$.

And, for that matter, if we could observe either one, we could just use the link function $g$ to translate between the two of them.

We can approximate the value of the unobserved linear predictor, however, by making an educated guess. Recall Newton's approximation scheme, where we approximate an unknown value by taking a known value and then projecting toward the unknown. If we somehow magically knew the mean value for a particular case, $\mu_i$, then we would use the link function to figure out what the linear predictor should be:

$\eta_i = g(\mu_i)$

If we want to know the linear predictor at some neighboring value, say $y_i$, we could use Newton's approximation method and calculate an approximation of $g(y_i)$ as

$g(y_i) \approx g(\mu_i) + (y_i - \mu_i)\, g'(\mu_i)$  (125)

The symbol $g(y_i)$ means the approximate value of the linear predictor and we could call it $\tilde{\eta}_i$. However, possibly because publishers don't like authors to use lots of expensive symbols, we instead use the letter $z_i$ to refer to that approximated value of the linear predictor:

$z_i = g(\mu_i) + (y_i - \mu_i)\, g'(\mu_i)$  (126)

If we use that estimate of the linear predictor, then we could think of the GLM estimation process as a process of calculating $\hat{\eta}_i = b_0 + b_1 x_1 + \ldots$ to minimize:

$\text{Sum of Squares} = \sum(z_i - \hat{\eta}_i)^2$  (127)

This should help you see the basic idea that the IWLS algorithm is exploiting.

The whole exercise here is premised on the idea that you know the mean, $\mu_i$, and in real life, you don't. That's why an iterative procedure is needed. Make a guess for $\mu_i$, then make an estimate for $z_i$, and repeat.

    10.3 The IWLS algorithm.

1. Begin with $\mu_i^{0}$, starting estimates of the mean of $y_i$.

2. Calculate a new variable $z_i^{0}$ (this is to be used as the response variable; it approximates the linear predictor $\eta_i$):

$z_i^{0} = g(\mu_i^{0}) + (y_i - \mu_i^{0})\, g'(\mu_i^{0})$  (128)

Because $\eta_i^{0} = g(\mu_i^{0})$, sometimes we write

$z_i^{0} = \eta_i^{0} + (y_i - \mu_i^{0})\, g'(\mu_i^{0})$  (129)

This is a Newton-style first-order approximation. It estimates the linear predictor's value that corresponds to the observed value of $y_i$, starting at the value of $g(\mu_i^{0})$ and adding the increment implied by moving the distance $(y_i - \mu_i^{0})$ with the slope $g'(\mu_i^{0})$. If $g(\mu_i^{0})$ is undefined, such as $\ln(0)$, then some workaround is required.

3. Estimate $b^{0}$ by weighted least squares, minimizing the squared distances

$\sum w_i^{0}(z_i^{0} - \eta_i)^2 = \sum w_i^{0}(z_i^{0} - X_i b^{0})^2$  (130)

Information on the weights is given below. The superscript on $w_i^{0}$ is needed because the weights have to be re-calculated at each step, just as $z^{0}$ and $b^{0}$ must be re-calculated.

In matrix form, this is

$b^{0} = (X'W^{0}X)^{-1}X'W^{0}z^{0}$  (131)

4. Use $b^{0}$ to calculate $\eta_i^{1} = X_i b^{0}$ and then calculate a new estimate of the mean as

$\mu_i^{1} = g^{-1}(\eta_i^{1})$  (132)

5. Repeat step 1, replacing $\mu^{0}$ by $\mu^{1}$. (A small R sketch of this loop appears right after this list.)
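The following is a compact sketch of the loop above, written by me for a Poisson model with the log link (so $g'(\mu) = 1/\mu$, and the standard IWLS weights $w_i = 1/(V(\mu_i)[g'(\mu_i)]^2)$ reduce to $\mu_i$); it is illustrative only, with ad hoc starting values, and it matches R's glm() to several decimals.

## Hand-rolled IWLS for a Poisson/log-link model, compared with glm().
set.seed(21)
x <- rnorm(200); X <- cbind(1, x)
y <- rpois(200, exp(0.5 + 0.8 * x))
mu <- y + 0.1                               # step 1: crude starting means (avoid log(0))
for (iter in 1:25) {
  eta <- log(mu)                            # g(mu) for the log link
  z   <- eta + (y - mu) * (1 / mu)          # step 2: working variable, g'(mu) = 1/mu
  w   <- mu                                 # step 3: weights for the Poisson/log case
  b   <- solve(t(X) %*% (w * X), t(X) %*% (w * z))   # weighted least squares
  mu  <- exp(drop(X %*% b))                 # steps 4-5: update eta and mu, then repeat
}
cbind(IWLS = drop(b), glm = coef(glm(y ~ x, family = poisson)))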

    10.4 Iterate until convergence

Repeat that process again and again, until the change from one iteration to the next is very small. A common tolerance criterion is $10^{-6}$, as in

$\frac{|b^{t+1} - b^{t}|}{|b^{t}|} < 10^{-6}$


10.7 Big insight from the First Order Condition

Next, the first order condition in 136 is written with matrices as

$X'W[z - \eta] = 0$  (138)

Recall the definition of $\eta = Xb$, so

$X'W[z - Xb] = 0$  (139)

$X'Wz = X'WX b$  (140)

$b = (X'WX)^{-1}X'Wz$  (141)

Wow! That's just like the WLS formula! We are regressing $z$ on $X$.