Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It


    Carina Mood

    Logistic regression estimates do not behave like linear regression estimates in one important respect: They are affected by omitted variables, even when these variables are unrelated to the independent variables in the model. This fact has important implications that have gone largely unnoticed by sociologists. Importantly, we cannot straightforwardly interpret log-odds ratios or odds ratios as effect measures, because they also reflect the degree of unobserved heterogeneity in the model. In addition, we cannot compare log-odds ratios or odds ratios for similar models across groups, samples, or time points, or across models with different independent variables in a sample. This article discusses these problems and possible ways of overcoming them.

    Introduction

    The use of logistic regression is routine in the social sciences when studying outcomes that are naturally or necessarily represented by binary variables. Examples are many in stratification research (educational transitions, promotion), demographic research (divorce, childbirth, nest-leaving), social medicine (diagnosis, mortality), research into social exclusion (unemployment, benefit take-up), and research about political behaviour (voting, participation in collective action). When facing a dichotomous dependent variable, sociologists almost automatically turn to logistic regression, and this practice is generally recommended in textbooks in quantitative methodology. However, our common ways of interpreting results from logistic regression have some important problems.1

    The problems stem from unobservables, or the fact that we can seldom include in a model all variables that affect an outcome. Unobserved heterogeneity is the variation in the dependent variable that is caused by variables that are not observed (i.e. omitted variables).2 Many sociologists are familiar with the problems of bias in effect estimates that arise if omitted variables are correlated with the observed independent variables, as is the case in ordinary least squares (OLS) regression. However, few recognize that in logistic regression omitted variables affect coefficients also through another mechanism, one that operates regardless of whether the omitted variables are correlated with the independent variables or not. This article sheds light on the problem of unobserved heterogeneity in logistic regression and highlights three important but overlooked consequences:

    (i) It is problematic to interpret log-odds ratios (LnOR) or odds ratios (OR) as substantive effects, because they also reflect unobserved heterogeneity.

    (ii) It is problematic to compare LnOR or OR across models with different independent variables, because the unobserved heterogeneity is likely to vary across models.

    (iii) It is problematic to compare LnOR or OR across samples, across groups within samples, or over time, even when we use models with the same independent variables, because the unobserved heterogeneity can vary across the compared samples, groups, or points in time.

    European Sociological Review, Volume 26, Number 1, 2010, 67-82. DOI: 10.1093/esr/jcp006, available online at www.esr.oxfordjournals.org. Online publication 9 March 2009. The Author 2009. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

    Though some alarms have been raised about these problems (Allison, 1999), these insights have not penetrated our research practice, and the inter-relations between the problems have not been fully appreciated. Moreover, these problems are ignored or even misreported in commonly used methodology books. After introducing and exemplifying the behaviour of logistic regression effect estimates in light of unobserved heterogeneity, I outline these three problems. Then, I describe and discuss possible strategies of overcoming them, and I conclude with some recommendations for ordinary users.

    The Root of the Problems: Unobserved Heterogeneity

    In this section, I demonstrate the logic behind the effect of unobserved heterogeneity on logistic regression coefficients. I describe the problem in terms of an underlying latent variable, and I use a simple example with one dependent variable (y) and two independent variables (x₁ and x₂) to show that the LnOR or OR of x₁ will be affected in two different ways by excluding x₂ from the model: First, it will be biased upwards or downwards by a factor determined by (a) the correlation between x₁ and x₂ and (b) the correlation between x₂ and y when controlling for x₁. Second, it will be biased downwards by a factor determined by the difference in the residual variance between the model including x₂ and the model excluding it. For readers less comfortable with algebra, verbal and graphical explanations of the problem and extensive examples are given in the following sections.

    One can think of logistic regression as a way of modelling the dichotomous outcome (y) as the observed effect of an unobserved propensity (or latent variable) (y*), so that when y* > 0, y = 1, and when y* ≤ 0, y = 0. The latent variable is in turn linearly related to the independent variables in the model (Long, 1997, section 3.2). For example, if our observed binary outcome is participation in some collective action (yes/no), we can imagine that among those who participate, there are some who are highly enthusiastic about doing so while others are less enthusiastic, and that among those not participating there are some who can easily tip over to participation and some who will never participate. Thus, the latent variable can in this case be perceived as a taste for participation that underlies the choice of participation.

    For simplicity, we assume only one independent variable. The latent variable model can then be written as:

    y*ᵢ = α + β₁x₁ᵢ + εᵢ    (1)

    where y*ᵢ is the unobserved individual propensity, x₁ᵢ is the independent variable observed for individual i, α and β₁ are parameters, and the errors εᵢ are unobserved but assumed to be independent of x₁. To estimate this model, we must also assume a certain distribution of εᵢ, and in logistic regression we assume a standard logistic distribution with a fixed variance of 3.29. This distribution is convenient because it results in predictions of the logit, which can be interpreted as the natural logarithm of the odds of having y = 1 versus y = 0. Thus, logistic regression assumes the logit to be linearly related to the independent variables:

    Ln[P/(1 − P)] = a + b₁x₁ᵢ    (2)

    where P is the probability that y = 1. For purposes of interpretation, the logit may easily be transformed to odds or probabilities. The odds that yᵢ = 1 is obtained by exp(logit), and the probability by exp(logit)/[1 + exp(logit)]. The logit can vary from −∞ to +∞, but it always translates to probabilities above 0 and below 1. Transforming the logit reveals that we assume that yᵢ = 1 if the odds is above 1 or, which makes intuitive sense, the probability is above 0.5.
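These transformations recur throughout the article, so a minimal sketch may help (the function names are mine, not the article's):

```python
import math

def logit_to_odds(logit):
    """Odds of y = 1 versus y = 0: exp(logit)."""
    return math.exp(logit)

def logit_to_prob(logit):
    """Probability that y = 1: exp(logit) / [1 + exp(logit)]."""
    return math.exp(logit) / (1.0 + math.exp(logit))

# A logit of 0 is the threshold described in the text:
# odds of exactly 1 and a probability of exactly 0.5.
print(logit_to_odds(0.0))            # -> 1.0
print(logit_to_prob(0.0))            # -> 0.5
print(round(logit_to_prob(2.0), 3))  # -> 0.881
```

Any real-valued logit maps into the open interval (0, 1), which is why the logit can range over the whole real line while probabilities stay strictly above 0 and below 1.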

    Results from logistic regression are commonly presented in terms of log-odds ratios (LnOR) or odds ratios (OR). Log-odds ratios correspond to b₁ in Equation 2, and odds ratios are obtained by exp(b₁), so LnOR gives the additive effect on the logit, while OR gives the multiplicative effect on the odds of y = 1 over y = 0. In other words, LnOR tells us how much the logit increases if x₁ increases by one unit, and OR tells us how many times higher the odds of y = 1 is if x₁ increases by one unit. While the effect on probabilities depends on the values of other independent variables in a model, LnOR and OR hold for any values of the other independent variables.

    The total variance in y* in Equation 1 consists of explained and unexplained variance. When we use Equation 2 to estimate this underlying latent variable model, we force the unexplained (residual) part of the variance to be fixed. This means that any increase in the explained variance forces the total variance of the dependent variable, and hence its scale, to increase. When the scale of the dependent variable increases, b₁ must also increase, since it now expresses the change in the dependent variable in another metric. So because the residual variance is fixed, the coefficients in (2) estimate the effect on the dependent variable on a scale that is not fixed, but depends on the degree of unobserved heterogeneity. This means that the size of b₁ reflects not only the effect of x₁ but also the degree of unobserved heterogeneity in the model. This can be seen if we rewrite Equation 1 in the following way:

    y*ᵢ = α + β₁x₁ᵢ + σεᵢ    (3)

    where everything is as in Equation 1, except for the fact that the variance of εᵢ is now fixed and the factor σ adjusts εᵢ to reflect its true variance. σ is the ratio of the true standard deviation of the errors to the assumed standard deviation of the errors. Because we cannot observe σ, and because we force εᵢ to have a fixed variance, b₁ in the logistic regression model (Equation 2) estimates β₁/σ and not β₁ (cf. Gail, Wieand and Piantadosi, 1984; Wooldridge, 2002: pp. 470-472).3 In other words, we standardise the true coefficients β₁ so that the residuals can have the variance of the standard logistic distribution (3.29).

    To further clarify the consequences of unobserved heterogeneity in logistic regression, consider omitting the variable x₂ when the true underlying model is

    y*ᵢ = α + β₁x₁ᵢ + β₂x₂ᵢ + εᵢ    (4)

    where εᵢ is logistically distributed with a true variance of 3.29 (which means that σ = 1 in this case), and the relation between x₂ and x₁ is

    x₂ᵢ = γ₀ + γ₁x₁ᵢ + vᵢ    (5)

    where γ₀ and γ₁ are parameters to be estimated and vᵢ is the error term (which is uncorrelated with εᵢ in Equation 4). The omission of x₂ from Equation 4 leads to two problems, one that is familiar from the linear regression case and one that is not. First, just as in linear regression, the effect of x₁ is confounded with the effect of x₂, so that when (5) is substituted into (4), the effect of x₁ becomes β₁ + γ₁β₂. That is, to the degree that x₁ and x₂ are correlated, b₁ captures the effect of x₂.

    The second problem, which does not occur in linear regression, is that the residual variance increases. In Equation 4, σ = √3.29/√3.29 = 1, that is, the assumed variance equals the true variance and 1/σ is just 1. When omitting x₂, the true residual variance becomes var(ε) = 3.29 + β₂²var(v), and as a consequence σ changes to √(3.29 + β₂²var(v))/√3.29.

    Taken together, these two problems imply that if we exclude x₂ from Equation 4, instead of β₁ we estimate:4

    b₁ = (β₁ + γ₁β₂) · √3.29/√(3.29 + β₂²var(v))    (6)

    If x₁ and x₂ are uncorrelated, Equation 6 collapses to

    b₁ = β₁ · √3.29/√(3.29 + β₂²var(x₂))    (7)

    Therefore, the size of the unobserved heterogeneity depends on the variances of omitted variables and their effects on y, and LnOR and OR from logistic regression are affected by unobserved heterogeneity even when it is unrelated to the included independent variables. This is a fact that is very often overlooked. For example, one of the standard sociological references on logistic regression, Menard (1995, p. 59), states that "Omitting relevant variables from the equation in logistic regression results in biased coefficients for the independent variables, to the extent that the omitted variable is correlated with the independent variables", and goes on to say that the direction and size of bias follows the same rules as in linear regression. This is clearly misleading, as the coefficients for the independent variables will as a matter of fact change when including other variables that are correlated with the dependent variable, even when these are unrelated to the original independent variables.

    An Example Using Probabilities

    The previous section has explained how logistic regression coefficients depend on unobserved heterogeneity. The logic behind this is perhaps most easily grasped when we express it from the perspective of the estimated probabilities of y = 1. Imagine that we are interested in the effect of x₁ on y and that y is also affected by a dummy variable x₂ defining two equal-sized groups. For concreteness, assume that x₁ is intelligence (as measured by some IQ test), y is the transition to university studies, and x₂ is sex (boy/girl). I construct a set of artificial data5 where (a) IQ is normally distributed and strongly related to the transition to university, (b) girls are much more likely to enter university than are boys, and (c) sex and IQ are uncorrelated (all these relations are realities in many developed countries). I estimate the


    following logistic models and the results are shown in Table 1:

    Model 1: y*ᵢ = α + β₁x₁ᵢ + εᵢ

    Model 2: y*ᵢ = α + β₁x₁ᵢ + β₂x₂ᵢ + εᵢ

    R(x₁, x₂) = 0; P(y = 1) = 0.5
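A small simulation in this spirit (my own coefficients and code, not the article's actual artificial data) reproduces the pattern the text goes on to describe: the LnOR for IQ grows when the uncorrelated variable sex is added. The logistic fit is again a hand-rolled Newton-Raphson so that only NumPy is required.

```python
import numpy as np

def fit_logit(X, y, n_iter=30):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (p * (1.0 - p))[:, None]).T @ X, X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
n = 200_000
iq = rng.standard_normal(n)                  # standardized IQ, normally distributed
sex = rng.integers(0, 2, n).astype(float)    # girl = 1; two equal-sized groups, uncorrelated with IQ
eps = rng.logistic(size=n)                   # standard logistic errors
ystar = -0.75 + 1.0 * iq + 1.5 * sex + eps   # girls much more likely to enter university
y = (ystar > 0).astype(float)                # transition to university; P(y = 1) is about 0.5

b_m1 = fit_logit(np.column_stack([np.ones(n), iq]), y)       # Model 1: IQ only
b_m2 = fit_logit(np.column_stack([np.ones(n), iq, sex]), y)  # Model 2: IQ and sex

print(round(b_m1[1], 2), round(b_m2[1], 2))  # the IQ coefficient is larger in Model 2
```

Even though sex is uncorrelated with IQ, adding it reduces the unobserved heterogeneity, so the coefficient for IQ increases from Model 1 to Model 2, exactly as Table 1 shows.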

    In Table 1, we can see that the LnOR and OR for IQ increase when we control for sex, even though IQ and sex are not correlated. This happens because we can explain the transition to university better with both IQ and sex than with only IQ, so in Model 2 the unobserved heterogeneity is less than in Model 1. The stronger the correlation between sex and the transition to university, the larger is the difference in the coefficient for IQ between Models 1 and 2.

    The estimated odds of y = 1 versus y = 0 can be translated to probabilities (P) through the formula P(y = 1) = odds/(1 + odds). Figure 1 shows these predicted probabilities from Model 1 and Model 2 in Table 1, for different values of IQ. The figure contains the following curves: (1) the predicted transition probability at different values of IQ from Model 1; (2) the predicted transition probability at different values of IQ from Model 2, setting sex to its mean; (3) and (4) the predicted transition probability from Model 2 for boys and girls separately; and (5) the average predicted transition probability from Model 2 for small intervals of the IQ variable.

    In Figure 1, we can see that curves (1) and (2) look quite different. In addition, we can see that even though there is no interaction effect in Model 2, the predicted transition probability at different levels of IQ differs between boys and girls [curves (3) and (4)]. The reason that curves (3) and (4) differ is that boys and girls have different average transition probabilities (i.e. different intercepts in the logistic regression). Importantly, curves (2), (3), and (4) are parallel in the sense that for any point on the y-axis, they have the same slope, which means that the curves represent the same OR and LnOR (recall that the estimated LnOR and OR is constant for all values of IQ even though the effect on the transition probability varies). However, curve (1) is more stretched out, and thus represents a smaller OR and LnOR. Why is this?
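The parallelism of curves (3) and (4) can be made concrete with hypothetical Model 2 coefficients (these numbers are mine, not Table 1's): the odds ratio for sex is identical at every level of IQ, while the gap in probabilities varies with IQ.

```python
import math

def prob(logit):
    """P(y = 1) = exp(logit) / [1 + exp(logit)]."""
    return math.exp(logit) / (1.0 + math.exp(logit))

# Hypothetical coefficients: intercept a, LnOR for IQ, LnOR for sex (girl = 1)
a, b_iq, b_sex = -1.0, 1.5, 2.0

for iq in (-1.0, 0.0, 1.0):
    p_boy = prob(a + b_iq * iq)
    p_girl = prob(a + b_iq * iq + b_sex)
    # The girl/boy odds ratio is exp(b_sex) at every IQ, but the probability gap varies with IQ.
    or_sex = (p_girl / (1 - p_girl)) / (p_boy / (1 - p_boy))
    print(iq, round(p_girl - p_boy, 3), round(or_sex, 2))  # OR is always exp(2.0), about 7.39
```

This is precisely why a constant LnOR or OR does not translate into a constant effect on the probability of y = 1.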

    [Figure 1. Predicted probabilities of y = 1 at different values of IQ: (1) P(Y=1), Model 1; (2) P(Y=1), Model 2, sex = mean; (3) P(Y=1), Model 2, boys; (4) P(Y=1), Model 2, girls; (5) mean P(Y=1), Model 2.]