# Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It


Carina Mood

Logistic regression estimates do not behave like linear regression estimates in one important respect: they are affected by omitted variables, even when these variables are unrelated to the independent variables in the model. This fact has important implications that have gone largely unnoticed by sociologists. Importantly, we cannot straightforwardly interpret log-odds ratios or odds ratios as effect measures, because they also reflect the degree of unobserved heterogeneity in the model. In addition, we cannot compare log-odds ratios or odds ratios for similar models across groups, samples, or time points, or across models with different independent variables in a sample. This article discusses these problems and possible ways of overcoming them.

## Introduction

The use of logistic regression is routine in the social sciences when studying outcomes that are naturally or necessarily represented by binary variables. Examples are many in stratification research (educational transitions, promotion), demographic research (divorce, childbirth, nest-leaving), social medicine (diagnosis, mortality), research into social exclusion (unemployment, benefit take-up), and research about political behaviour (voting, participation in collective action). When facing a dichotomous dependent variable, sociologists almost automatically turn to logistic regression, and this practice is generally recommended in textbooks in quantitative methodology. However, our common ways of interpreting results from logistic regression have some important problems.[1]

The problems stem from unobservables, or the fact that we can seldom include in a model all variables that affect an outcome. Unobserved heterogeneity is the variation in the dependent variable that is caused by variables that are not observed (i.e. omitted variables).[2] Many sociologists are familiar with the problems of bias in effect estimates that arise if omitted variables are correlated with the observed independent variables, as is the case in ordinary least squares (OLS) regression. However, few recognize that in logistic regression omitted variables affect coefficients also through another mechanism, which operates regardless of whether the omitted variables are correlated with the independent variables or not. This article sheds light on the problem of unobserved heterogeneity in logistic regression and highlights three important but overlooked consequences:

(i) It is problematic to interpret log-odds ratios (LnOR) or odds ratios (OR) as substantive effects, because they also reflect unobserved heterogeneity.

(ii) It is problematic to compare LnOR or OR across models with different independent variables, because the unobserved heterogeneity is likely to vary across models.

(iii) It is problematic to compare LnOR or OR across samples, across groups within samples, or over time, even when we use models with the same independent variables, because the unobserved heterogeneity can vary across the compared samples, groups, or points in time.

*European Sociological Review, Volume 26, Number 1, 2010, 67–82. DOI: 10.1093/esr/jcp006. Online publication 9 March 2009. © The Author 2009. Published by Oxford University Press.*

Though some alarms have been raised about these problems (Allison, 1999), these insights have not penetrated our research practice, and the interrelations between the problems have not been fully appreciated. Moreover, these problems are ignored or even misreported in commonly used methodology books. After introducing and exemplifying the behaviour of logistic regression effect estimates in light of unobserved heterogeneity, I outline these three problems. Then, I describe and discuss possible strategies for overcoming them, and I conclude with some recommendations for ordinary users.

## The Root of the Problems: Unobserved Heterogeneity

In this section, I demonstrate the logic behind the effect of unobserved heterogeneity on logistic regression coefficients. I describe the problem in terms of an underlying latent variable, and I use a simple example with one dependent variable ($y$) and two independent variables ($x_1$ and $x_2$) to show that the LnOR or OR of $x_1$ will be affected in two different ways by excluding $x_2$ from the model. First, it will be biased upwards or downwards by a factor determined by (a) the correlation between $x_1$ and $x_2$ and (b) the correlation between $x_2$ and $y$ when controlling for $x_1$. Second, it will be biased downwards by a factor determined by the difference in the residual variance between the model including $x_2$ and the model excluding it. For readers less comfortable with algebra, verbal and graphical explanations of the problem and extensive examples are given in the following sections.

One can think of logistic regression as a way

of modelling the dichotomous outcome ($y$) as the observed effect of an unobserved propensity (or latent variable) ($y^*$), so that when $y^* > 0$, $y = 1$, and when $y^* \le 0$, $y = 0$. The latent variable is in turn linearly related to the independent variables in the model (Long, 1997, section 3.2). For example, if our observed binary outcome is participation in some collective action (yes/no), we can imagine that among those who participate, there are some who are highly enthusiastic about doing so while others are less enthusiastic, and that among those not participating there are some who can easily tip over to participation and some who will never participate. Thus, the latent variable can in this case be perceived as a taste for participation that underlies the choice of participation.

For simplicity, we assume only one independent variable. The latent variable model can then be written as:

$$y^*_i = \alpha + x_{1i}\beta_1 + \varepsilon_i \qquad (1)$$

where $y^*_i$ is the unobserved individual propensity, $x_{1i}$ is the independent variable observed for individual $i$, $\alpha$ and $\beta_1$ are parameters, and the errors $\varepsilon_i$ are unobserved but assumed to be independent of $x_1$. To estimate this model, we must also assume a certain distribution of $\varepsilon_i$, and in logistic regression we assume a standard logistic distribution with a fixed variance of 3.29 ($= \pi^2/3$). This distribution is convenient because it results in predictions of the logit, which can be interpreted as the natural logarithm of the odds of having $y = 1$ versus $y = 0$. Thus, logistic regression assumes the logit to be linearly related to the independent variables:

$$\ln\left(\frac{P}{1-P}\right) = a + x_{1i}b_1 \qquad (2)$$

where $P$ is the probability that $y = 1$. For purposes of interpretation, the logit may easily be transformed to odds or probabilities. The odds that $y_i = 1$ is obtained by $\exp(\text{logit})$, and the probability by $\exp(\text{logit})/[1 + \exp(\text{logit})]$. The logit can vary from $-\infty$ to $+\infty$, but it always translates to probabilities above 0 and below 1. Transforming the logit reveals that we assume that $y_i = 1$ if the odds is above 1 or, which makes intuitive sense, the probability is above 0.5.
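As a concrete illustration of these transformations, here is a minimal sketch (my own, not from the article) converting between logits, odds, and probabilities:

```python
import math

def logit_to_odds(logit):
    # odds = exp(logit)
    return math.exp(logit)

def logit_to_prob(logit):
    # P = exp(logit) / (1 + exp(logit))
    return math.exp(logit) / (1 + math.exp(logit))

def prob_to_logit(p):
    # inverse transformation: logit = ln(P / (1 - P))
    return math.log(p / (1 - p))

# A logit of 0 corresponds to odds of 1 and a probability of 0.5
print(logit_to_odds(0.0), logit_to_prob(0.0))
```

Note that the three scales carry the same information; only the metric differs, which is exactly why a change of scale in the latent variable shows up in the coefficients.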

Results from logistic regression are commonly presented in terms of log-odds ratios (LnOR) or odds ratios (OR). Log-odds ratios correspond to $b_1$ in Equation 2, and odds ratios are obtained by $e^{b_1}$, so LnOR gives the additive effect on the logit, while OR gives the multiplicative effect on the odds of $y = 1$ over $y = 0$. In other words, LnOR tells us how much the logit increases if $x_1$ increases by one unit, and OR tells us how many times higher the odds of $y = 1$ is if $x_1$ increases by one unit. While the effect on probabilities depends on the values of other independent variables in a model, LnOR and OR hold for any values of the other independent variables.

The total variance in $y^*$ in Equation 1 consists

of explained and unexplained variance. When we use Equation 2 to estimate this underlying latent variable model, we force the unexplained (residual) part of the variance to be fixed. This means that any increase in the explained variance forces the total variance of the dependent variable, and hence its scale, to increase. When the scale of the dependent variable increases, $b_1$ must also increase, since it now expresses the change in the dependent variable in another metric. So because the residual variance is fixed, the coefficients in (2) estimate the effect on the dependent variable on a scale that is not fixed, but depends on the degree of unobserved heterogeneity. This means that the size of $b_1$ reflects not only the effect of $x_1$ but also the degree of unobserved heterogeneity in the model. This can be seen if we rewrite Equation 1 in the following way:

$$y^*_i = \alpha + x_{1i}\beta_1 + \sigma\varepsilon_i \qquad (3)$$

where everything is as in Equation 1, except for the fact that the variance of $\varepsilon_i$ is now fixed and the factor $\sigma$ adjusts $\varepsilon_i$ to reflect its true variance. $\sigma$ is the ratio of the true standard deviation of the errors to the assumed standard deviation of the errors. Because we cannot observe $\sigma$, and because we force $\varepsilon_i$ to have a fixed variance, $b_1$ in the logistic regression model (Equation 2) estimates $\beta_1/\sigma$ and not $\beta_1$ (cf. Gail, Wieand and Piantadosi, 1984; Wooldridge, 2002: pp. 470–472).[3] In other words, we standardise the true coefficients $\beta_1$ so that the residuals can have the variance of the standard logistic distribution (3.29).

To further clarify the consequences of unobserved

heterogeneity in logistic regression, consider omitting the variable $x_2$ when the true underlying model is

$$y^*_i = \alpha + x_{1i}\beta_1 + x_{2i}\beta_2 + \varepsilon_i \qquad (4)$$

where $\varepsilon_i$ is logistically distributed with a true variance of 3.29 (which means that $\sigma = 1$ in this case), and the relation between $x_2$ and $x_1$ is

$$x_{2i} = \gamma_0 + \gamma_1 x_{1i} + \nu_i \qquad (5)$$

where $\gamma_0$ and $\gamma_1$ are parameters to be estimated and $\nu_i$ is the error term (which is uncorrelated with $\varepsilon_i$ in Equation 4). The omission of $x_2$ from Equation 4 leads to two problems, one that is familiar from the linear regression case and one that is not. First, just as in linear regression, the effect of $x_1$ is confounded with the effect of $x_2$, so that when (5) is substituted into (4), the effect of $x_1$ becomes $\beta_1 + \beta_2\gamma_1$. That is, to the degree that $x_1$ and $x_2$ are correlated, $\beta_1$ captures the effect of $x_2$.

The second problem, which does not occur in linear regression, is that the residual variance increases. In Equation 4, $\sigma = \sqrt{3.29}/\sqrt{3.29} = 1$, that is, the assumed variance equals the true variance and $1/\sigma$ is just 1. When omitting $x_2$, the true residual variance becomes $\mathrm{var}(\varepsilon) + \beta_2^2\,\mathrm{var}(\nu)$, and as a consequence $\sigma$ changes to $\sqrt{3.29 + \beta_2^2\,\mathrm{var}(\nu)}/\sqrt{3.29}$.

Taken together, these two problems imply that if we exclude $x_2$ from Equation 4, instead of $\beta_1$ we estimate:[4]

$$b_1 = (\beta_1 + \beta_2\gamma_1)\frac{\sqrt{3.29}}{\sqrt{3.29 + \beta_2^2\,\mathrm{var}(\nu)}} \qquad (6)$$

If $x_1$ and $x_2$ are uncorrelated, Equation 6 collapses to

$$b_1 = \beta_1\frac{\sqrt{3.29}}{\sqrt{3.29 + \beta_2^2\,\mathrm{var}(x_2)}} \qquad (7)$$

Therefore, the size of the unobserved heterogeneity depends on the variances of omitted variables and their effects on $y$, and LnOR and OR from logistic regression are affected by unobserved heterogeneity even when it is unrelated to the included independent variables. This is a fact that is very often overlooked. For example, one of the standard sociological references on logistic regression, Menard (1995, p. 59), states that 'Omitting relevant variables from the equation in logistic regression results in biased coefficients for the independent variables, to the extent that the omitted variable is correlated with the independent variables', and goes on to say that the direction and size of bias follows the same rules as in linear regression. This is clearly misleading, as the coefficients for the independent variables will as a matter of fact change when including other variables that are correlated with the dependent variable, even when these are unrelated to the original independent variables.

## An Example Using Probabilities

The previous section has explained how logistic regression coefficients depend on unobserved heterogeneity. The logic behind this is perhaps most easily grasped when we express it from the perspective of the estimated probabilities of $y = 1$. Imagine that we are interested in the effect of $x_1$ on $y$ and that $y$ is also affected by a dummy variable $x_2$ defining two equal-sized groups. For concreteness, assume that $x_1$ is intelligence (as measured by some IQ test), $y$ is the transition to university studies, and $x_2$ is sex (boy/girl). I construct a set of artificial data[5] where (a) IQ is normally distributed and strongly related to the transition to university, (b) girls are much more likely to enter university than are boys, and (c) sex and IQ are uncorrelated (all these relations are realities in many developed countries). I estimate the


following logistic models and the results are shown in Table 1:

$$\text{Model 1: } y^*_i = \alpha + x_{1i}\beta_1 + \varepsilon_i$$

$$\text{Model 2: } y^*_i = \alpha + x_{1i}\beta_1 + x_{2i}\beta_2 + \varepsilon_i$$

with $R(x_1, x_2) = 0$ and $P(y = 1) = 0.5$.

In Table 1, we can see that the LnOR and OR for IQ increase when we control for sex, even though IQ and sex are not correlated. This happens because we can explain the transition to university better with both IQ and sex than with only IQ, so in Model 2 the unobserved heterogeneity is less than in Model 1. The stronger the correlation between sex and the transition to university, the larger is the difference in the coefficient for IQ between Models 1 and 2.

The estimated odds of $y = 1$ versus $y = 0$ can be translated to probabilities ($P$) through the formula $P(y = 1) = \text{odds}/(1 + \text{odds})$. Figure 1 shows these predicted probabilities from Model 1 and Model 2 in Table 1, for different values of IQ. The figure contains the following curves: (1) the predicted transition probability at different values of IQ from Model 1; (2) the predicted transition probability at different values of IQ from Model 2, setting sex to its mean; (3) and (4) the predicted transition probability from Model 2 for boys and girls separately; and (5) the average predicted transition probability from Model 2 for small intervals of the IQ variable.

In Figure 1, we can see that curves (1) and (2) look quite different. In addition, we can see that even though there is no interaction effect in Model 2, the predicted transition probability at different levels of IQ differs between boys and girls [curves (3) and (4)]. The reason that curves (3) and (4) differ is that boys and girls have different average transition probabilities (i.e. different intercepts in the logistic regression). Importantly, curves (2), (3), and (4) are parallel in the sense that for any point on the y-axis, they have the same slope, which means that the curves represent the same OR and LnOR (recall that the estimated LnOR and OR are constant for all values of IQ even though the effect on the transition probability varies). However, curve (1) is more stretched out, and thus represents a smaller OR and LnOR. Why is this?
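The mechanism in this example can be reproduced with a small simulation. The sketch below is my illustration, not the article's data: the coefficient values, sample size, and the hand-rolled Newton–Raphson fitter are all assumptions made for the demonstration. It generates a latent outcome driven by a continuous $x_1$ ("IQ") and an uncorrelated binary $x_2$ ("sex"), then fits the two logistic models; the coefficient on $x_1$ comes out noticeably smaller when $x_2$ is omitted, even though $x_1$ and $x_2$ are uncorrelated.

```python
import math
import random

def solve(A, b):
    # Solve the linear system A d = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_logit(X, y, iters=20):
    # Newton-Raphson maximum likelihood for logistic regression.
    # Each row of X starts with a 1 for the intercept.
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        g = [0.0] * k
        H = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            w = p * (1.0 - p)
            for a in range(k):
                g[a] += (yi - p) * xi[a]
                for c in range(k):
                    H[a][c] += w * xi[a] * xi[c]
        beta = [b + d for b, d in zip(beta, solve(H, g))]
    return beta

random.seed(42)
data = []
for _ in range(20000):
    x1 = random.gauss(0.0, 1.0)              # "IQ": continuous, standard normal
    x2 = 1 if random.random() < 0.5 else 0   # "sex": dummy, uncorrelated with x1
    u = random.random()
    eps = math.log(u / (1.0 - u))            # standard logistic error
    y_star = x1 * 1.0 + x2 * 2.0 - 1.0 + eps  # true beta1 = 1, beta2 = 2
    data.append((x1, x2, 1 if y_star > 0 else 0))

b1_short = fit_logit([[1.0, x1] for x1, _, _ in data], [y for _, _, y in data])[1]
b1_full = fit_logit([[1.0, x1, x2] for x1, x2, _ in data], [y for _, _, y in data])[1]

# Omitting the uncorrelated x2 shrinks the coefficient on x1 toward zero,
# roughly by the factor sqrt(3.29 / (3.29 + 2**2 * 0.25)) from Equation 7.
print(b1_short, b1_full)
```

The full model recovers a coefficient near the true $\beta_1 = 1$, while the short model recovers roughly $\beta_1$ times the Equation 7 factor, here about 0.88, mirroring the IQ coefficients in Table 1.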

[Figure 1. Predicted probability of $Y = 1$ against IQ: (1) $P(Y=1)$ from Model 1; (2) $P(Y=1)$ from Model 2 with sex at its mean; (3) $P(Y=1)$ from Model 2, boys; (4) $P(Y=1)$ from Model 2, girls; (5) mean predicted $P(Y=1)$ from Model 2.]