vi. logistic regression. an event occurs or doesnt. a category applies to an observation or doesnt....

139
VI. Logistic Regression

Upload: salma-chambless

Post on 30-Mar-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

VI. Logistic Regression

Page 2: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

An event occurs or doesn’t.

A category applies to an observation or doesn’t.

A student passes or fails.

A patient survives or dies.

A candidate wins or loses.

A person is poor or not poor.

A person is a citizen or not.

Page 3: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

These are examples of categorical data.

They are also examples of binary discrete phenomena.

Binary discrete phenomena usually take the form of a dichotomous indicator, or dummy, variable.

It’s best to code binary discrete phenomena 0/1 so that the mean of the dummy variable equals the proportion of cases with a value of 1, & can be interpreted as a probability: e.g., mean of female=.545 (=sample’s probability of being female).

Page 4: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Here’s what we’re going to figure out how to interpret:

Page 5: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logit hsci read math female, or nolog

Logit estimates Number of obs = 200

LR chi2(3) = 60.07

Prob > chi2 = 0.0000

Log likelihood = -79.013272 Pseudo R2 = 0.2754

------------------------------------------------------------------------------

hsci | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

read | 1.073376 .0274368 2.77 0.006 1.020926 1.128521

math | 1.10315 .0316097 3.43 0.001 1.042904 1.166877

female | .3380283 .1389325 -2.64 0.008 .1510434 .7564918

------------------------------------------------------------------------------

Page 6: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

OLS regression encounters serious problems in dealing with a binary dependent variable:

OLS’s explanatory variable coefficients can extend to positive or negative infinity, but binary probabilities & proportions can’t exceed 1 or fall below 0.

OLS is premised on linearity, but with a binary dependent variable the effects of an explanatory variable are non-linear at the binary variable’s lower & upper levels.

Page 7: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

(3) OLS is also premised on additivity, but with a binary outcome variable an explanatory variable’s effect depends on the relative effects of the other variables: if, say, one explanatory variable pushes the probability of the binary outcome variable near 0 or near 1, then the effects of the other explanatory variables can’t have much influence.

(4) And because a binary outcome variable has just two values, it violates the OLS assumptions of normality &, more important, non-constant variance of residuals.

Page 8: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

What to do? A logit transformation is an advantageous way of representing the S-shaped curve of a binary outcome variable’s y/x distribution.

A probit transformation, which has somewhat thinner tails, is also commonly used.

A complementary log-log transformation or a scobit transformation is often used if the binary outcome variable is highly skewed.

There are other such transformations as well (see, e.g., Long & Freese; & the Stata manuals).

Page 9: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

A logit transformation changes probabilities into logged odds, which eliminate the binary proportion (or probability) ceiling of 1.

Odds express the likelihood of occurrence relative to the likelihood of non-occurrence (i.e. odds are the ratio of the proportions of the two possible outcomes [see Moore & McCabe, chap. 15, pages 40-42]):

odds = event’s probability/1 – event’s probability

probability = event’s odds/1 + event’s odds

Page 10: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

To repeat, a logit transformation changes probabilities into logged odds, which eliminate the binary proportion (or probability) ceiling of 1.

Logged odds are also known as logits (i.e. the natural log of the odds).

Why not stick with odds rather than logged odds? Because logged odds (i.e. logits) also eliminate the binary proportion (or probability) floor of 0.

Page 11: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

So, on balance, the logit transformation eliminates the outcome variable’s proportion (or probability) ceiling of 1 & floor of 0.

Thus an explanatory variable coefficients can extend to positive or negative infinity.

Page 12: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Note: larger sample size is even more important for logistic regression than for OLS regression.

Page 13: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

So that we get a feel for what’s going on, & can do the exercises in Moore & McCabe, chap. 15, let’s compute some odds & logged odds (i.e. logits).

Page 14: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Display the odds of being in honors math:. tab hmath

(>=60) Freq. Percent Cum.

0 151 75.50 75.50

1 49 24.50 100.00

Total 200 100.00

. display .245/(1 - .245) = .3245

Interpretation? The event occurs .3245 times per each time it does not occur.

That is, there are 32.45 occurrences per 100 non-occurrences.

Page 15: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Display the logged odds of being in honors math:

. display ln(.3245) = -1.1255

Display the odds of not being in honors math:

. di .755/(1 - .755) = 3.083

Display the logged odds of not being in honors math:

. di ln(3.083) = 1.126

Page 16: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Although we’ll never have to do the following—thankfully, software will do it for us automatically—let’s make a variable that combines the odds of being in honors math versus not being in honors math.

. gen ohmath = .3245 if hmath==1

. replace ohmath = 3.083 if hmath==0

. tab ohmath

Page 17: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Let’s transform ohmath into another variable that represents logged odds:

. gen lohmath = ln(ohmath)

. su ohmath lohmath

And how could we display lohmath not as logged odds but rather as odds (i.e. as ohmath)?

. display exp(lohmath)

Page 18: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

From the standpoint of regression analysis, why should we indeed have transformed the variable into logged odds?

That is, what are the advantages of doing so?

Page 19: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Overall, a logit transformation of a binary outcome variable linearizes the non-linear relationship of X with the probability of Y.

It does so as the logit transformation:

eliminates the upper & lower probability ceilings of the binary variable; &

is symmetric around the mid-range probability of 0.5, so that probabilities below this value have negative logits (i.e. logged odds) while those above this value have positive logits (i.e. logged odds).

Page 20: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Let’s summarize:

the effect of X on the probability of binary Y is non-linear; but

the effect of X on logit-transformed binary Y is linear.

We call the latter either logit or logistic regression: they’re the same thing.

Let’s fit a model using hsb2.dta:

Page 21: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logit hsci read math female, nolog

Logit estimates Number of obs = 200

LR chi2(3) = 60.07

Prob > chi2 = 0.0000

Log likelihood = -79.013272 Pseudo R2 = 0.2754

------------------------------------------------------------------------------

hsci | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

read | .0708092 .0255612 2.77 0.006 .0207102 .1209082

math | .09817 .028654 3.43 0.001 .0420091 .1543308

female | -1.084626 .4110086 -2.64 0.008 -1.890188 -.2790636

_cons | -9.990621 1.60694 -6.22 0.000 -13.14017 -6.841076

------------------------------------------------------------------------------

Page 22: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

How, though, do we interpret logit coefficients in meaningful terms?

That is, how do we interpret ‘logged odds’?

The gain in parsimony via the logit transformation is mitigated by the loss in interpretability: the metric of logged odds (i.e. logits) is not instrinsically meaningful to us.

Page 23: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

An alternative, more comprehensible approach is to express regression coefficients not as not logged odds but rather as odds:

odds = event’s probability/1 – event’s probability

Page 24: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

The odds are obtained by taking the exponent, or anti-log, of the logged odds (i.e. the logit coefficient):

. odds of honors math: di exp(-1.1255) = .325

. odds of not honors math: di exp(1.126) =3.083

Review: interpretation?

Page 25: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

What are the odds of being in honors math?

odds: a ratio of probabilities

= event’s probability/1 - event’s probability

Page 26: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

What are the odds of being in honors math versus the odds of not being in honors math? This is called an odds ratio.

odds ratio: a ratio of odds

odds ratio = .325/3.083 = .105

Interpretation? The odds of being in honors math are .105 those of not being in honors math.

Page 27: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Via logit or logistic regression, Stata gives us slope coefficients as odds, instead of logged odds, in any of the following ways:

(1) logit hsci read math female, or nolog

(2) logistic hsci read math female, nolog

(3) quietly logit hsci read math female

listcoef, factor help

Page 28: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

But expressing slope coefficient as odds, instead of logged odds, causes a complication: the equation determining the odds is not additive but rather multiplicative.

Page 29: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logit hsci read math female, or nolog

Logit estimates Number of obs = 200

LR chi2(3) = 60.07

Prob > chi2 = 0.0000

Log likelihood = -79.013272 Pseudo R2 = 0.2754

------------------------------------------------------------------------------

hsci | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

read | 1.073376 .0274368 2.77 0.006 1.020926 1.128521

math | 1.10315 .0316097 3.43 0.001 1.042904 1.166877

female | .3380283 .1389325 -2.64 0.008 .1510434 .7564918

------------------------------------------------------------------------------

Page 30: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

For every 1-unit increase in reading score, the odds of being in honors science increase by the multiple of 1.07 on average, holding the other variables constant.

Every 1-unit increase in math score, the odds of being in honors science by the factor of 1.10 on average, holding the other variables constant.

The odds of being in honors science is lower for females than males by a multiple of .338 on average, holding the other variables constant.

Page 31: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logit hsci read math female, or nolog

Logit estimates Number of obs = 200

LR chi2(3) = 60.07

Prob > chi2 = 0.0000

Log likelihood = -79.013272 Pseudo R2 = 0.2754

------------------------------------------------------------------------------

hsci | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

read | 1.073376 .0274368 2.77 0.006 1.020926 1.128521

math | 1.10315 .0316097 3.43 0.001 1.042904 1.166877

female | .3380283 .1389325 -2.64 0.008 .1510434 .7564918

------------------------------------------------------------------------------

Page 32: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

The Metrics

logit = 0 is the equivalent of odds=1 & the equivalent of probability = .5

Page 33: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Regarding odds ratios: an odds ratio of .5, which indicates a negative effect, is of the same magnitude as a positive-effect odds ratio of 2.0.

Here’s a helpful way to disentangle this complication after estimating a logit (i.e. logistic) model:

. listcoef, reverse help

This reverses the outcome variable.

Page 34: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. listcoef, reverse help [So that the coefficients refer to the odds of not being in honors science.]logit (N=200): Factor Change in Odds

Odds of: 0 vs 1

----------------------------------------------------------------------

hsci | b z P>|z| e^b e^bStdX SDofX

-------------+--------------------------------------------------------

read | 0.07081 2.770 0.006 0.9316 0.4838 10.2529

math | 0.09817 3.426 0.001 0.9065 0.3986 9.3684

female | -1.08463 -2.639 0.008 2.9583 1.7185 0.4992

b = raw coefficient

z = z-score for test of b=0

P>|z| = p-value for z-test

e^b = exp(b) = factor change in odds for unit increase in X

e^bStdX = exp(b*SD of X) = change in odds for SD increase in X

SDofX = standard deviation of X

Page 35: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

listcoef, reverse related the explanatory variables to the odds of not being in hsci.

Page 36: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

An easier interpretation than odds ratios is percentage change in odds:

. quietly logit hsci read math female

. listcoef, percent help----------------------------------------------------------------------

hsci | b z P>|z| % %StdX SDofX

-------------+--------------------------------------------------------

read | 0.07081 2.770 0.006 7.3 106.7 10.2529

math | 0.09817 3.426 0.001 10.3 150.9 9.3684

female | -1.08463 -2.639 0.008 -66.2 -41.8 0.4992

----------------------------------------------------------------------

P>|z| = p-value for z-test

% = percent change in odds for unit increase in X

%StdX = percent change in odds for SD increase in X

SDofX = standard deviation of X

Page 37: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

For every 1-unit increase in reading score, the odds of being in honors science increase by 7.3% on average, holding the other variables constant.

For every 1-unit increase in math score, the odds of being in honors science increase by 10.3% on average, holding the other variables constant.

The odds of being in honors science are lower by 66.2% on average for females than males, holding the other variables constant.

Page 38: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

The percentage interpretation, then, eliminates the multiplicative aspect of the model.

Page 39: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Alternatives to know about are ‘relative risk’ & ‘relative risk ratio’.

See, e.g., Utts, chap. 12.

And see the downloadable command ‘relrisk’.

. logit hsci read math female, nolog

. relrisk

Page 40: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Pseudo-R2: this is not equivalent to OLS R2.

Many specialists (e.g., Pampel) recommend not to report pseudo-R2 .

Its metric is different from OLS R2—typically it’s much lower than OLS R2—but readers (including many academic & policy specialists) are not aware of this difference.

Page 41: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

In the logistic equation we could have specified the options robust &/or cluster (as in OLS regression).

Specifying ‘cluster’ automatically invokes robust standard errors.

Page 42: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Before we proceed, keep in mind the following:

odds=1 is the equivalent of:

logit=0

probability=0.5

Page 43: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Besides logits (i.e. logged odds) & odds, we can also interpret the relationships from the perspective of probabilities.

Recall, though, that the the relationship between X & the probability of binary Y is non-linear.

Thus the effect of X on binary Y has to be identified at particular X-values or at a particular sets of values for X’s; & we compare the effects across particular X-levels.

Page 44: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logit hsci read math female, or nolog

. prvalue, r(mean) delta brieflogit: Predictions for hsci

Pr(y=1|x): 0.1525 95% ci: (0.0998,0.2259)

Pr(y=0|x): 0.8475 95% ci: (0.7741,0.9002)

prvalue, like the rest of the pr-commands, can only be used after estimating a regression model.

It summarizes the samples 1/0 probabilities for the specified binary y-variable holding all of the data set’s (not just the model’s) other variables at their means (though medians can be specified alternatively).

Page 45: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. prvalue, x(female=0) r(mean) delta brieflogit: Predictions for hsci

Pr(y=1|x): 0.2453 95% ci: (0.1568,0.3622)

Pr(y=0|x): 0.7547 95% ci: (0.6378,0.8432)

. prvalue, x(female=1) r(mean) delta brieflogit: Predictions for hsci

Pr(y=1|x): 0.0990 95% ci: (0.0526,0.1784)

Pr(y=0|x): 0.9010 95% ci: (0.8216,0.9474)

Page 46: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Comparing males across particular scores:

. prvalue, x(read=40 math=40 female=0), delta b save

. prvalue, x(read=60 math=60 female=0), delta b dif

How else could we compare the estimated probabilities of being in honors science? By comparing female=0 versus female=1 at particular reading & math scores.

Note: ‘b’ – ‘brief’ (i.e. display only the model’s most relevant values)

Page 47: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Comparing males & females at particular scores:

. prvalue, x(read=40 math=40 female=0), delta b save

. prvalue, x(read=40 math=40 female=1), delta b dif

Or try prtab to see, e.g., how female versus male estimated probabilities vary across the range of math scores, holding reading scores constant at 40 (but does not provide a confidence interval):

. prtab math female, x(read=40) brief

Page 48: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Or try prchange to see, e.g., how female estimated probabilities vary as female math scores increase from 40 to 60 (which, however, does not provide a confidence interval):

. prchange math, x(female=1 math=40) fromto delta(20) uncentered brief

Page 49: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

A problem with prtab & prchange is that they don’t give confidence intervals, which prvalue delta does provide.

Here’s a different way of making predictions—for logged odds, odds, or probabilities—that gives confidence intervals:

. adjust math=40, by(female) ci

. adjust math=40, by(female) exp ci

. adjust math=40, by(female) pr ci

Note: the first variant can be used to obtain predicted coefficients with OLS regression as well.

Page 50: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Remember: we can examine the relationship of X with binary Y via:

logits (i.e. logged odds)

odds; or

probabilities

What are the differences in functional forms & interpretations?

Page 51: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

While we’re at it, why not give probit a look?

. probit hsci read math, nolog

. logit hsci read math, nolog

Although the logit coefficients exceed the probit coefficients by a factor of about 1.8, the functional y/x relationship is virtually the same: it is very slightly different at the low & high ends because the probit transformation has thinner tails.

But there are no odds or probabilities interpretations for probit models, which thus are harder to express in meaningful terms.

Page 52: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

For all forms of categorical-dependent variable regression, the same options exist for explanatory variables—interactions, power transformations, & categorizing—as for OLS regression.

Page 53: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Estimation & Model Fit

Because a dichotomous dependent variable violates OLS assumptions of normality & non-constant variance in residuals, models are estimated instead by maximum likelihood estimation.

Maximum likelihood estimation finds estimates of model parameters that are most likely to give rise to the pattern of observations in the sample data.

Page 54: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Maximum likelihood estimation yields not an F-statistic & F-test for model fit but rather a log likelihood value & likelihood ratio test.

The likelihood ratio test compares a model with the constant only (the baseline model) to the fitted (i.e. full) model: what’s the likelihood of giving rise to the sample estimates via the baseline model vs. the fitted (i.e. full) model?

It then assesses the reduction in log likelihood value anchored by the specified degrees of freedom & computes a Chi-squared test of the reduction.

Page 55: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

This is equivalent to the OLS F-statistic & F-test for model fit.

We use the likelihood ratio test to test nested models (via the command lrtest).

And as with OLS we examine the p-values of individual coefficients, although in this case these are computed via the Wald-test.

Page 56: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Here we begin by testing the ‘full’ model (& storing its estimates) & then test ‘reduced models’ versus the full model:

. logit hsci read write math female, or nolog

. estimates store full

. logit hsci read write, or nolog

. lrtest full likelihood-ratio test LR chi2(2) = 17.69

(Assumption: . nested in full) Prob > chi2 = 0.0001

Page 57: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

But if the models use robust standard errors, pweights, or cluster adjustment, the likelihood ratio test can’t be used.

And of course, if the number of observations varies from one model to another, the likelihood ratio test can’t be used either.

In such cases (aside from varying number of observations) use the Wald-test, which is what is applied in logistic regression via the test command:

Page 58: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logit hsci read write math female, or nolog

. test math female ( 1) math = 0

( 2) female = 0

chi2( 2) = 15.35

Prob > chi2 = 0.0005

The likelihood ratio test is considered superior to the Wald-test; the results may differ somewhat.

Page 59: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

There is a way to test non-nested models or nested models with unequal observations: BIC or AIC tests (see STATA Manual [‘estimates,’ table options]; Pampel; & Long/Freese).

See Long/Freese’s downloadable ‘fitstat’ command & their discussions of BIC & AIC.

Page 60: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logistic hsci read write math female, nolog

. fitstat, saving(m1)

. est store m1

Page 61: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logistic hsci read write, nolog

. fitstat, using(m1) bic

. est store m2

. est table m1 m2, eform star(.1 .05 .01) stats(N chi2 df_m bic)

Page 62: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

BIC test criteria: smaller BIC indicates better-fitting model.

. difference 0-2: weak support for complete model

. difference 2-6: positive support for complete model

. difference 6-10: strong support for complete model

. difference >10: very strong support for complete model

Page 63: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

AIC criterion: smaller AIC*n indicates better-fitting model.

AIC, BIC, & fitstat can be used to test OLS regression models, too.

Page 64: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Beware: Pitfalls in Comparing Logistic Coefficients across

Nested Models

Unlike Y in OLS, the error variance for Y in logistic changes with the addition/deletion of variables.

Hence, unlike OLS, comparing the values of logistic coefficients is tricky.

Check Stata listserv on this topic (e.g., commentaries by Richard Williams, Herb Smith, Roger Newsome).

Page 65: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

If you need to compare the values of logistic coefficients across nested models, one recommended approach is to use Stata’s ‘suest’ (‘seemingly unrelated estimation’) command.

Another approach is to standardize Y across the models (e.g., using ‘listcoef, std’).

Page 66: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

State-of-the-Art on Using Logistic or Probit Regression:

Glenn Hoetker, “The Use of Logit and Probit Models in Strategic Management Research: Critical Issues”

http://www.business.uiuc.edu/ghoetker/documents/Hoetker_logit_in_strategy.pdf

Hoetker’s Stata command ‘complogit’:

. net from http://www.business.uiuc.edu/ghoetker/

. findit oglm

Page 67: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Richard Williams’ Stata ‘oglm’ command:

http://econpapers.repec.org/software/bocbocode/s453402.htm

. findit oglm

Page 68: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Data Preparation

Data preparation is crucial for logistic regression.

Particular attention must be paid to any contingency table of the outcome variable with a categorical explanatory variable that has a zero cell or cell<5 observations, which can cause serious problems of model fitting

Options for a zero cell or cell<5 observations: collapse the categories in some sensible way to eliminate the problem; eliminate the variable (if advisable on other grounds); or if the variable is ordinal, model it as if it were continuous.

Page 69: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

A common problem arising from zero cells or low-count cells is ‘perfect prediction’ (see also ‘separation problem’ or ‘quasi-separation problem’).

Stata provides a warning for ‘perfect prediction,’ & in general the standard errors & confidence intervals are inflated.

There’s considerable literature on strategies for dealing with these problems, or consult Stata listserv.

Page 70: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

In sum, for a contingency table of the outcome variable with a categorical explanatory variable:

No zero cells.

No more than 20% of cells with less than 5 observations.

Page 71: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

The following approach to data preparation for logistic regression is based on Hosmer & Lemeshow, Applied Logistic Regression (2ed).

Page 72: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Variable Selection

Begin with a careful analysis of each variable.

For nominal, ordinal, & continuous variables with few integer values, do a contingency table of the outcome variable (y=0, 1) versus the levels of the explanatory variable, checking for zero cells, cells with <5 obs., &, at this stage, for p<=.25.

. tab hsci female, col chi2

. tab hsci race, col chi2

. tab hsci quintiles_math, col chi2

Page 73: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Take appropriate action to eliminate zero cells and minimize cells with <5 observations.

For continuous variables, a form of preparatory analysis is the two-sample ttest:

. ttest read, by(hsci) unequal

. ttest math, by(hsci) unequal

Page 74: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Likewise pertinent to continuous explanatory variables is a graph of x versus y=0, 1, to explore for possible curvilinearity:

. sparl hsci write, quad | logx

. twoway fpfit hsci write

. gr7 hsci write, c(s) ba(8)

. twoway mband hsci write, mband(8)

. twoway lowess hsci write, bwidth(.6)

Page 75: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Choosing from those variables having p<=.25, conceptualize & list a set of potential explanatory variables.

Discuss their anticipated relationship to outcome variable Y , including anticipated linearity or curvilinearity & relationships with the other explanatory variables.

Order the potential explanatory variables in terms of their conceptual importance in relation to Y.

Page 76: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Fit a preliminary main-effects model, dropping—for now—those variables that test insignificant.

Explore & add pertinent higher-order terms, dropping—for now—those higher-order terms that test insignificant.

Explore & add pertinent interaction terms, dropping—for now—those interaction terms that test insignificant.

Page 77: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Add those variables that previously tested insignificant, dropping—in the context of substantive judgement, including retention of appropriate controls—those variables that now test insignificant but whose absence doesn’t practically effect the other explanatory variables.

One by one, drop & re-add each explanatory variable, checking for changes in coefficients (direction, size) & p-values.

The set of substantively meaningful & statistically stable variables that’s left is the preliminary final model.

Page 78: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Re-add all variables that are conceptually/theoretically relevant: this is the final, complete model.

Conduct the diagnostic tests.

Test nested models (keeping the previously discussed caveats in mind), conducting the diagnostic tests for each model.

Page 79: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Let’s work through an example of logistic regression,

using toxic.dta.

Toxic wastes contaminated the grounds of two public schools in a small Vermont town (Hamilton, Regression with Graphics, chap. 7).

Some residents thought the schools should be closed until proven safe (which would involve high financial costs), while others though the schools should stay open.

Is the following model helpful in explaining the residents’ opinions?

Page 80: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Let’s assume that we’ve done the preparatory analysis, as outlined above.

. logistic close female kids nodad educ lived contam hsc, nolog

quantitative—educ (years); lived(years in town)

categorical—female; kids: kids<19 years old living in town; contam (whether believes his/her own property was contaminated); hsc: whether respondent attended town Health & Safety meeting

interaction—nodad: 1=interaction of male with no children in town, else=0

Page 81: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Multicollinearity:

Our model includes two quantitative variables. We can begin by fitting an OLS regression with the binary outcome variable & then assess the possibility of multicollinearity by issuing the command ‘vif’.

. reg close nodad educ lived contam hsc

. vif

There’s no collinearity problem.

Try Stata’s ‘collin’ (download).

Page 82: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Now we’ll test nested models:

Page 83: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logistic close female kids nodad educ lived contam hsc, nologLogit estimates Number of obs = 153

LR chi2(7) = 68.16

Prob > chi2 = 0.0000

Log likelihood = -70.524689 Pseudo R2 = 0.3258

------------------------------------------------------------------------------

close | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

female | .949745 .5291234 -0.09 0.926 .3187016 2.830282

kids | .51139 .2892497 -1.19 0.236 .1687719 1.549545

nodad | .1079607 .1078655 -2.23 0.026 .015234 .7650969

educ | .8138141 .075845 -2.21 0.027 .677947 .9769102

lived | .9544288 .0162015 -2.75 0.006 .9231968 .9867174

contam | 3.604136 1.734916 2.66 0.008 1.403007 9.25854

hsc | 11.22341 5.720166 4.74 0.000 4.133313 30.47554

Page 84: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. estimates store full

. fitstat, saving(m1) brief

Page 85: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. logistic close nodad educ lived contam hsc, nologLogit estimates Number of obs = 153

LR chi2(5) = 66.56

Prob > chi2 = 0.0000

Log likelihood = -71.326227 Pseudo R2 = 0.3181

------------------------------------------------------------------------------

close | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

nodad | .1771164 .128458 -2.39 0.017 .0427468 .7338607

educ | .8214643 .0760781 -2.12 0.034 .6851041 .9849652

lived | .9611269 .0148794 -2.56 0.010 .9324019 .9907369

contam | 3.663985 1.746363 2.72 0.006 1.439614 9.325267

hsc | 9.76251 4.787244 4.65 0.000 3.733856 25.52498

------------------------------------------------------------------------------

Page 86: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. lrtest full likelihood-ratio test LR chi2(2) = 1.60

(Assumption: . nested in full) Prob > chi2 = 0.4486

We fail to reject the null hypothesis & thus we select the reduced model.

Page 87: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. fitstat, using(m1) bic

Current Saved Difference

Model: logistic logistic

N: 153 153 0

AIC: 1.011 1.026 -0.016

AIC*n: 154.652 157.049 -2.397

BIC: -596.822 -588.364 -8.458

BIC': -41.407 -32.949 -8.458

Difference of 8.458 in BIC provides strong support for current model.

Page 88: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Caution in Comparing the Logistic Coefficients across

Nested Models

Keep in mind the pitfalls of comparing logistic coefficients across nested models.

Try experimenting with ‘suest’ and with standardizing Y (‘listcoef, std’).

Page 89: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Continuing, let’s display the coefficients as percentage change in odds:

. quietly logistic close nodad educ lived contam hsc

. listcoef, percent help

Page 90: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. listcoef, percent helplogit (N=153): Percentage Change in Odds

Odds of: close vs open

----------------------------------------------------------------------

close | b z P>|z| % %StdX SDofX

-------------+--------------------------------------------------------

nodad | -1.73095 -2.387 0.017 -82.3 -47.9 0.3768

educ | -0.19667 -2.124 0.034 -17.9 -38.0 2.4315

residence | -0.03965 -2.561 0.010 -3.9 -48.9 16.9547

contam | 1.29855 2.724 0.006 266.4 79.6 0.4510

hsc | 2.27855 4.647 0.000 876.3 187.1 0.4628

----------------------------------------------------------------------

Page 91: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Next we’ll look at model specification:

. linktest, nolog

Page 92: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. linktest, nologLogit estimates Number of obs = 153

LR chi2(2) = 66.85

Prob > chi2 = 0.0000

Log likelihood = -71.182064 Pseudo R2 = 0.3195

------------------------------------------------------------------------------

close | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

_hat | 1.007788 .16741 6.02 0.000 .6796707 1.335906

_hatsq | -.0488738 .0922346 -0.53 0.596 -.2296503 .1319027

_cons | .082427 .2637352 0.31 0.755 -.4344844 .5993384

There’s no problem at all.

Page 93: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

If linktest did test significant, we’d reconsider the model, such as whether or not the outcome variable is notably skewed & the possible need to add, transform, categorize, or drop explanatory variables.

It the outcome variable were notably skewed, we’d consider alternatives such as complementary log-log (cloglog) or scobit regression (see Long & Freese; Pampel; UCLA-ATS Stata web site; STATA Manuals).

Page 94: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Let’s now consider summary measures of model fit: ‘ldev’ & ‘estat gof, group(#)’. These reflect cell patterns of observed versus fitted values (i.e. residuals), as in contingency tables.

We want the tests to test insignificant so that we can fail to reject the null hypothesis that the model correctly predicts the cell-pattern of the data.

Covariate pattern: the set of values that corresponds to each combination of levels of the model’s covariates (i.e. explanatory variables); i.e. a configuration of explanatory variables & their values.

Page 95: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. ldevLogistic model deviance goodness-of-fit test

number of observations = 153

number of covariate patterns = 132

deviance goodness-of-fit = 129.47

degrees of freedom = 126

Prob > chi2 = 0.3980

Page 96: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. estat gof, group(10) tableLogistic model for close, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)

number of observations = 153

number of groups = 10

Hosmer-Lemeshow chi2(8) = 7.98

Prob > chi2 = 0.4359

Adjust the groups (to no fewer than 6) so that the number of observations is relatively large (5+) in each cell.

Page 97: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

If the summary measures of model fit did test significant, then we’d reconsider the model: re-check contingency tables of the outcome variable and categorical explanatory variables for zero cells & cells with <5 obs, & take remedial action; re-check whether comp log-log (cloglog) or scobit, or some other binary statistic, should be used; & re-consider the explanatory variables.

Page 98: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

While summary indicators of model fit are helpful, they overlook information about a model’s various components.

E.g., even when the overall model fits the data, there may be some interesting deviation from fit for a portion of the observations.

So, before concluding that a model fits, we need to inspect the residual diagnostics for the covariate patterns.

Page 99: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Regarding the diagnostics for individual observations & model fit, dd=deviance (i.e. residuals for y); hat=leverage (i.e. x-outliers); dx2 =chi2-measure of observaton fit; db=influence of each x observation on the beta coefficients in general; & n=identities of the covariate patterns.

Page 100: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

We begin by predicting the probability of outcome variable Y , followed by predicting the residual diagnostic indicators:

. predict p if e(sample)

. predict dd if e(sample), dd

. predict h if e(sample), hat

. predict db if e(sample), db

. predict dx2 if e(sample), dx2

. predict n if e(sample), n

Page 101: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. su p-n

Variable | Obs Mean Std. Dev. Min Max

-------------+------------------------------------------------------------

p | 153 .4313725 .3060614 .0111418 .9711162

dd | 153 1.066321 1.282536 .022654 6.885444

hat | 153 .0469299 .0337735 .010832 .1845852

db | 153 .0541295 .0922495 .0001247 .6595063

dx2 | 153 1.113449 2.926733 .0113907 29.19545

-------------+--------------------------------------------------------------

n | 153 64.32026 37.9672 1 132

Is dx2>4? Is dd or db>1? These levels are rough indicators only (see Hosmer & Lemeshow).

Page 102: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. scatter dx2 p, yline(4) ml(n)

12 34

5

6 78 9

10

11

12

13

14 15

161616

17

1818191920 212223 24242525 262626

27

2829 30 31

3232

33333435353536 37

38

39404041 4243 4445

46

47 48495051

52

53

545556575859

60

61 62626364

65

6667 68

69

7070

7172

73

74

75

7677 78 798081

8283

8485

86

87888989909090 9191

92

93949596

979899100101102 103 104

105

106106107108

109110111112113 114

115

116117118 119120120121122123124125126127 128129 130 1311320

10

20

30

H-L

dX

^2

0 .2 .4 .6 .8 1Pr(close)

Page 103: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. scatter dx2 p [w=db], yline(4)0

10

20

30

H-L

dX

^2

0 .2 .4 .6 .8 1Pr(close)

Page 104: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. scatter db p, yline(1) ml(n)

1

2

3

45

678 9

10

11

12

13

14 15

161616

17

1818

191920 212223 24242525 26262627

2829 30 31

3232

33333435353536 37

38

39404041 4243 4445

46

47 48

49

5051

52

53

5455565758

59

60

6162626364

65 66

67 68

69

7070

717273 7475

7677 7879

80

81

82

83

84

8586

87888989909090

9191

92

9394

9596

97

9899100101102

103104

105

106106107108

109

110

111112113 114

115

116117118

119

120120121122123124125126127

128

129

130

131

1320.2

.4.6

.8

Pre

gib

on's

dbeta

0 .2 .4 .6 .8 1Pr(close)

Page 105: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

0.2

.4.6

.8

Pre

gib

on's

dbeta

0 .2 .4 .6 .8 1Pr(close)

. scatter db p [w=dx2], yline(1)

Page 106: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

.l n dd h dx db if n==60 | n==11 | n==53 | n==109

n dd h dx db

11. 11 6.885444 .0141875 29.19545 .4201701

66. 53 4.144958 .0520987 6.468378 .3555159

73. 60 6.068201 .0347088 18.34161 .6595063

129. 109 1.931174 .1845852 1.468668 .3324619

Page 107: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Explanatory Variables

quantitative—educ (years); lived(years in town)

categorical—female; kids: kids<19 years old living in town; contam (whether believes his/her own property was contaminated); hsc: whether respondent attended town Health & Safety meeting

interaction—nodad: 1=interaction of male with no children in town, else=0

Page 108: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. l n close nodad educ lived contam hsc if n==60 | n==11 | n==53 | n==109

What about these covariate patterns makes them outliers? That is, how do they deviate from the model’s predictions?

+---------------------------------------------------+

n close nodad educ lived contam hsc

11. 11 open no 12 1 yes yes

66. 53 open no 12 40 yes yes

73. 60 close no 12 68 no no

129. 109 close yes 8 9 yes no

Page 109: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

What is it about each of these covariate patterns that the model can’t explain? What insights do these deviations yield concerning the model?

Should we take remedial action of some sort, such as the following?

Page 110: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Possible Remedial Action

Check for & correct data errors.

Possibly incorporate:

omitted variables

interactions

log or other transformations

Consider categorizing quantitative explanatory variables.

Consider other actions to temper outliers.

Page 111: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

For our immediate purposes, we’re satisfied, on balance, that the model fits well.

Page 112: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

What about testing for unequal slopes via a more or less ‘full interaction’ model?

Doing so it routine in OLS regression, but is hugely risky with a categorical dependent variable.

This is essentially due to the nonlinearity of logistic & other such models.

Page 113: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

On the pitfalls of exploring for unequal slopes (i.e. comparing coefficients across groups) with a categorical dependent variable, see Allison “Comparing Logit and Probit Coefficients Across Groups” (1999); Norton et al., “Computing Interaction Effects and Standard Errors in Logit and Probit Models” (2004); Hoetker, “Confounded Coefficients: Accurately Comparing Logit and Probit Coefficients Across Groups.

Here’s a lucid review of the literature by Richard Williams:

http://www.nd.edu/~rwilliam/oglm/index.html.

Page 114: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

STATA commands that address the problem: oglm (by Richard Williams, Notre Dame); inteff (by Norton et al., UNC Chapel Hill); complogit (by Glenn Hoetker, U. Illinois); vibl (by Michael Mitchell, ATS/UCLA)

In STATA: findit…

Page 115: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

A low-tech, common-sense approach to the problem:

Estimate a separate model for each category (e.g., male vs. female).

Compare the patterns of significance.

When an explanatory variable is significant in both models, consider the magnitudes of their coefficients to be significantly different from each other if their confidence intervals do not overlap.

Page 116: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Recalling the cautionary remarks, here’s a nifty way to check out possible interactions graphically via a downloaded command (xi3):

. findit xi3 [then download]

. findit postgr3 [then download]

. help postgr3

. xi3: logistic close nodad educ lived contam hsc, nolog

Page 117: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. postgr3 lived, by(contam)0

.2.4

.6.8

yhat_

, conta

m =

= n

o/y

hat_

, conta

m =

= y

es

0 20 40 60 80years in town

yhat_, contam == no yhat_, contam == yes

No interaction.

Page 118: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. postgr3 educ, by(contam)0

.2.4

.6.8

yhat_

, conta

m =

= n

o/y

hat_

, conta

m =

= y

es

5 10 15 20years education

yhat_, contam == no yhat_, contam == yes

No interaction.

Page 119: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

postgr3 can be used in OLS & other kinds of regression analysis, too.

Page 120: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Let’s next explore estimated probabilities for particular levels of the explanatory variables.

We’ll begin by obtaining the baseline estimated probabilities of Y=0, 1.

Page 121: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. prvalue, delta r(mean) logit: Predictions for close

Pr(y=close|x): 0.4113 95% ci: (0.3141,0.5159)

Pr(y=open|x): 0.5887 95% ci: (0.4841,0.6859)

nodad educ lived contam hsc

x= .16993464 12.954248 19.267974 .28104575 .30718954

prvalue: the estimated probability for an average respondent (i.e. holding each explanatory variable constant at its mean).

Page 122: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Strategy

Determine the range of values for independent variables having relatively wide ranges of estimated probability (or .5-.95 in cases of extreme X-outliers).

Find the extent to which change in one X-variable affects the estimated probability by allowing one X-variable to vary from its minimum to maximum while holding the others constant.

Emphasize y/x relations with relatively wide ranges of estimated probability.

Page 123: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Let’s next inspect the range of each X’s effects on estimated probabilities that respondents favor closing the town’s schools, holding each of the other explanatory variables constant at its mean:

. prchange, fromto r(mean) brief help from: to: dif: from: to: dif:

x=min x=max min->max x=0 x=1 0->1

nodad 0.4839 0.1424 -0.3415 0.4839 0.1424 -0.3415

educ 0.7328 0.1488 -0.5841 0.8993 0.8800 -0.0193

lived 0.5904 0.0570 -0.5334 0.6000 0.5904 -0.0096

contam 0.3266 0.6399 0.3133 0.3266 0.6399 0.3133

hsc 0.2576 0.7721 0.5145 0.2576 0.7721 0.5145

Page 124: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. prtab educ contam, r(mean) brieflogit: Predicted for close

What are the estimated probabilities of ‘close’ associated with a cross-classification of 2-4 explanatory variables: respondent’s education level & opinion that her/his own property is or isn’t contaminated:

Note: better to categorize a continuous variable such as ‘educ’.

Page 125: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

highest

Yr contam

compl no yes

6 0.6557 0.8746

7 0.6100 0.8514

8 0.5624 0.8248

9 0.5135 0.7946

10 0.4644 0.7606

12 0.3691 0.6819

13 0.3246 0.6378

14 0.2831 0.5913

15 0.2449 0.5431

16 0.2104 0.4940

17 0.1796 0.4451

18 0.1524 0.3972

20 0.1082 0.3077

Page 126: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

The estimated probabilities of ‘close’ according to whether respondent attended 2+ town meetings on the school problem & where respondent’s property is reportedly contaminated:

. prtab hsc contam, r(mean) brieflogit: Predicted for close

| believe own

Attend 2+ | property/water

HSC | contam

meetings? | no yes

----------+---------------

no | 0.1941 0.4688

yes | 0.7016 0.8960

Page 127: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Next let’s compare particular configurations of respondent traits:

. prvalue, x(educ=12 contam=0 hsc=0) delta r(mean) brief save

. prvalue, x(educ=16 contam=0 hsc=0) delta r(mean) brief dif

Page 128: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Finally, let’s graph the relationship of years living in town by years education to estimated probability of opinion that the contaminated schools should be closed.

We’ll use the ‘prgen’ command to create pseudo-variable data to be graphed.

Page 129: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. prgen lived, from(1) to(81) x(educ=10) gen(p10) n(11)

. prgen lived, from(1) to(81) x(educ=12) gen(p12) n(11)

. prgen lived, from(1) to(81) x(educ=14) gen(p14) n(11)

. prgen lived, from(1) to(81) x(educ=16) gen(p16) n(11)

. prgen lived, from(1) to(81) x(educ=18) gen(p18) n(11)

. prgen lived, from(1) to(81) x(educ=20) gen(p10) n(11)

. la var p10p1 "10 years educ"

. la var p12p1 "12 years educ"

. la var p14p1 "14 years educ"

. la var p16p1 "16 years educ"

. la var p18p1 "18 years educ"

. la var p20p1 "20 years educ"

Page 130: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. scatter p10p1 p12p1 p14p1 p16p1 p18p1 p20p1 p20x, c(l l l l l l) title(“Opinion by Years Residence & Years Education”, box bexpand) l2title(Pr(Open|Close)) yvar(“”) xvar(“Years residence in town”) legend(c(ltkhaki))

Page 131: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

0.2

.4.6

.8

0 20 40 60 80Years residence in tow n

10 years educ 12 years educ

14 years educ 16 years educ

18 years educ 20 years educ

Pr(

Clo

se|O

pen)

Opinion by Years Residence & Years Education

Page 132: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

See Long/Freese, Pampel, the UCLA-ATS Stata website, & the Stata manuals for alternatives to logistic regression.

Recall that probit generally gives the same results as logistic but gives results in probit coefficients only.

complementary log-log (cloglog in Stata) is for highly skewed outcome variables, but doesn’t give odds ratios. scobit, which is used for the same purpose, does give odds ratios but can be tricky to use.

Page 133: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

We have not discussed predicting probabilities with curvilinear explanatory variables.

On this topic, see Long/Freese, chap. 8.

Page 134: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Finally, models with binary outcome variables are useful in exploring patterns of missing values: is the pattern random or does it reflect bias of some kind or another?

. u hsb2_miss, clear

. egen mvals=rmiss(_all)

. tab mvals

. gen mval=(mvals>=1 & mvals<.)

. tab mval

Page 135: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

. tab mval female, col chi2

. tab mval ses, col chi2

. tab mval race, col chi2

. tab mval schtyp, col chi2

. ttest read, by(mval) unequal

. ttest math, by(mval) unequal

. xi: logistic mval female i.ses i.race i.schtyp, nolog

Page 136: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

In sum, use logistic regression or an alternative to analyze the pattern of missing values.

The topic of missing values has received increased attention, as its consequences for data analysis are commonly ignored but can be serious in terms of bias.

See Allison, Missing Data (Sage Publications).

Page 137: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

And see Gary King et al., “Analyzing Incomplete Political Science Data…,” APSR (March 2001); & King et al., “Amelia: A Program for Imputing Missing Data.” http://gking.harvard.edu/stats.shtml

Note: King et al.’s downloadable-to-Stata program “Clarify” is an alternative way to obtain predicted probabilities.

Page 138: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

One last question: are there any conditions under which you might start out doing OLS regression & decide to use logistic regression instead?

Page 139: VI. Logistic Regression. An event occurs or doesnt. A category applies to an observation or doesnt. A student passes or fails. A patient survives or dies

Remember: State-of-the-Art on Using Logistic or Probit Regression:

Glenn Hoetker, “The Use of Logit and Probit Models in Strategic Management Research: Critical Issues”

http://www.business.uiuc.edu/ghoetker/documents/Hoetker_logit_in_strategy.pdf