
Logistic Regression 2

Sociology 8811 Lecture 7

Copyright © 2007 by Evan Schofer
Do not copy or distribute without permission

Stata Notes: Logistic Regression
• Stata has two commands: “logit” & “logistic”
  – Logit, by default, produces raw coefficients
  – Logistic, by default, produces odds ratios
    • It exponentiates all coefficients for you!
• Note: Both yield identical results
  – The following pairs of commands are equivalent
  – For raw coefficients:
    • logit gun male educ income south liberal
    • logistic gun male educ income south liberal, coef
  – And for odds ratios:
    • logit gun male educ income south liberal, or
    • logistic gun male educ income south liberal

Review: Interpreting Coefficients
• Raw Coefficients: Change in log odds per unit change in X
  • Show direction
  • Magnitude is hard to interpret
• Odds Ratios: Multiplicative change in odds per unit change in X
  • OR > 1 = positive effect, OR < 1 = negative
  • Operates multiplicatively. Effect of a 2-point change is found by multiplying the odds by the OR twice (OR * OR)
• Percentage change in odds per unit change
  • (OR-1)*100%.
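As a quick numerical check of these rules, the sketch below converts raw coefficients into odds ratios and percentage changes in odds. It is written in Python rather than Stata purely for illustration. The two coefficients are the `male` and `liberal` estimates from the gun-ownership output shown later in these notes; the helper names `odds_ratio` and `pct_change_in_odds` are hypothetical.

```python
import math

def odds_ratio(b, units=1):
    """Multiplicative change in the odds for a `units`-point change in X."""
    return math.exp(b * units)

def pct_change_in_odds(b, units=1):
    """Percentage change in the odds: (OR - 1) * 100."""
    return (odds_ratio(b, units) - 1) * 100

# Raw coefficients taken from the gun-ownership model in these notes
b_male = 0.7837017
b_liberal = -0.1641107

or_male = odds_ratio(b_male)                      # matches the reported odds ratio for male
or_liberal_2pt = odds_ratio(b_liberal, units=2)   # same as multiplying the OR by itself

print(round(or_male, 3))
print(round(pct_change_in_odds(b_male), 1))
print(round(or_liberal_2pt, 3))
```

Exponentiating 2*b gives the same number as squaring the one-unit odds ratio, which is what "multiplying twice" means.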

Review: Interpreting Results
• Important point: Substantive effect of a variable on predicted probability differs depending on values of other variables
  • If probability is already high for a given case, additional increases may not have much effect
    – Suppose a 1-point change in X doubles the odds…
      • Effect isn’t substantively consequential if probability (Y=1) is already very high
        – Ex: 20:1 odds = .95 probability; 40:1 odds = .975 probability
        – Change in probability is only .025
      • Effect matters a lot for cases with probabilities near .5
        – 1:1 odds = .5 probability. 2:1 odds = .67 probability
        – Change in probability is nearly .2!
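The odds-to-probability arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch; `odds_to_prob` is a hypothetical helper implementing P = odds / (1 + odds).

```python
def odds_to_prob(odds):
    """Convert odds (e.g. 20 for 20:1) to a probability."""
    return odds / (1 + odds)

# Doubling the odds from a high baseline vs. from even odds
for before, after in [(20, 40), (1, 2)]:
    p0, p1 = odds_to_prob(before), odds_to_prob(after)
    print(f"{before}:1 -> {after}:1 odds: P goes {p0:.3f} -> {p1:.3f} (change {p1 - p0:+.3f})")
```

The same doubling of the odds moves the probability by a couple of points at the top of the scale and by roughly 17 points near .5.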

Review: Interpreting Results
• Predicted values of real (or hypothetical) cases can vividly illustrate findings
  • Stata “adjust” command is very useful
  • Example: Probabilities for men/women

. adjust, pr by(male)

------------------------------------------------------------------
 Dependent variable: gun     Command: logistic
 Variables left as is: educ, income, south, liberal
------------------------------------------------------------------

----------------------
     male |         pr
----------+-----------
        0 |    .225814
        1 |    .417045
----------------------

Note that the predicted probability for men is nearly twice as high as for women.

Stata Notes: Adjust Command
• Stata “adjust” command can be tricky
  – 1. By default it uses the entire sample, not just cases in your prior analysis
    • Best to specify prior sample:
    • adjust if e(sample), pr by(male)
  – 2. For non-specified variables, Stata uses group means (groups defined by the “by” option)
    • Don’t assume it pegs cases to overall sample mean
    • Variables “left as is” take on mean for subgroups
  – 3. It doesn’t take into account weighted data
    • Use “lincom” if you have weighted data

Predicted Probabilities: Stata
• Effect of political views & gender for PhD students

. adjust south=0 income=4 educ=20, pr by(liberal male)

------------------------------------------------------------
 Dependent variable: gun     Command: logistic
 Covariates set to value: south = 0, income = 4, educ = 20
------------------------------------------------------------

----------------------------
          |       male
  liberal |       0        1
----------+-----------------
        1 | .046588  .096652
        2 | .039818  .083241
        3 | .033996  .071544
        4 |    .029   .06138
        5 | .024719  .052578
        6 | .021057  .044978
        7 | .017927  .038433
----------------------------

Note that independent variables are set to values of interest. (Or can be set to mean).

Graphing Predicted Probabilities
• P(Y=1) for Women & Men by Liberal
  • scatter Women Men Liberal, c(l l)

[Figure: connected-line plot of predicted P(Y=1) for Women and Men (y-axis, about .02 to .1) by Liberal (x-axis, 0 to 8)]

Did model categorize cases correctly?
• We can choose a criterion: predicted P > .5:

. estat clas

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |        64            48  |        112
     -     |       229           509  |        738
-----------+--------------------------+-----------
   Total   |       293           557  |        850

Classified + if predicted Pr(D) >= .5
True D defined as gun != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   21.84%
Specificity                     Pr( -|~D)   91.38%
Positive predictive value       Pr( D| +)   57.14%
Negative predictive value       Pr(~D| -)   68.97%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    8.62%
False - rate for true D         Pr( -| D)   78.16%
False + rate for classified +   Pr(~D| +)   42.86%
False - rate for classified -   Pr( D| -)   31.03%
--------------------------------------------------
Correctly classified                        67.41%
--------------------------------------------------

The model yields predicted p>.5 for 112 people; only 64 of them actually have guns

Overall, this simple model doesn’t offer extremely accurate predictions…

67% of people are correctly classified

Note: Results change if you use a different criterion (e.g., p>.6)
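To see where the percentages in the classification output come from, the following Python sketch recomputes them from the four cell counts of the table (64, 48, 229, 509). The variable names are illustrative only; the formulas are the standard definitions printed beside each statistic in the Stata output.

```python
# Cell counts from the classification table (cutoff .5)
tp, fp = 64, 48     # classified +: true gun owners, true non-owners
fn, tn = 229, 509   # classified -: true gun owners, true non-owners
total = tp + fp + fn + tn

sensitivity = tp / (tp + fn)    # Pr(+ | D)
specificity = tn / (tn + fp)    # Pr(- | ~D)
ppv = tp / (tp + fp)            # Pr(D | +)
npv = tn / (tn + fn)            # Pr(~D | -)
accuracy = (tp + tn) / total    # correctly classified

for name, value in [("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("Positive predictive value", ppv),
                    ("Negative predictive value", npv),
                    ("Correctly classified", accuracy)]:
    print(f"{name:28s}{value * 100:6.2f}%")
```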

Sensitivity / Specificity of Prediction
• Sensitivity: Of gun owners, what proportion were correctly predicted to own a gun?
• Specificity: Of non-gun owners, what proportion did we correctly predict?
• Choosing a different probability cutoff affects those values
  • If we reduce the cutoff to P > .4, we’ll catch a higher proportion of gun owners (sensitivity rises)
  • But we’ll incorrectly classify more non-gun owners as gun owners (specificity falls)
  • That is, we’ll have more false positives.
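The trade-off can be illustrated with a small Python sketch. The predicted probabilities and outcomes below are made up for illustration (they are not from the gun-ownership data), and `sens_spec` is a hypothetical helper; cases are classified as positive when the predicted probability is at or above the cutoff.

```python
def sens_spec(probs, actual, cutoff):
    """Return (sensitivity, specificity) when classifying + if p >= cutoff."""
    tp = sum(1 for p, y in zip(probs, actual) if p >= cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, actual) if p < cutoff and y == 0)
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    return tp / n_pos, tn / n_neg

# Hypothetical predicted probabilities and true outcomes (1 = owns a gun)
probs = [0.90, 0.70, 0.55, 0.45, 0.42, 0.30, 0.20, 0.10]
owns = [1, 1, 0, 1, 0, 1, 0, 0]

for cutoff in (0.5, 0.4):
    sens, spec = sens_spec(probs, owns, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

In this toy data, lowering the cutoff from .5 to .4 picks up one more true gun owner but also misclassifies one more non-owner.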

Sensitivity / Specificity of Prediction
• Stata can produce a plot showing how predictions will change if we vary “P” cutoff:
  • Stata command: lsens

[Figure: lsens plot of Sensitivity and Specificity (y-axis, 0.00 to 1.00) against Probability cutoff (x-axis, 0.00 to 1.00)]

Hypothesis tests
• Testing hypotheses using logistic regression
  • H0: There is no effect of year in grad program on coffee drinking
  • H1: Year in grad school is associated with coffee
    – Or, one-tail test: Year in school increases probability of coffee
    – MLE estimation yields standard errors… like OLS
    – Test statistic: 2 options; both yield same results
      • t = b/SE… just like OLS regression (Stata output labels this “z”)
      • Wald test (Chi-square, 1df); essentially the square of t
    – Reject H0 if Wald or |t| > critical value
      • Or if p-value less than alpha (usually .05).
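The relationship between the two test statistics can be checked numerically. The Python sketch below uses the `male` coefficient and standard error from the gun-ownership output in these notes; the critical values 1.96 (two-tailed z, alpha = .05) and 3.841 (chi-square, 1 df, alpha = .05) are the usual table values.

```python
# Coefficient and standard error for "male" from the gun-ownership model
b, se = 0.7837017, 0.156764

z = b / se       # the "t = b/SE" statistic; Stata labels it z
wald = z ** 2    # Wald chi-square with 1 df

reject_z = abs(z) > 1.96      # two-tailed test at alpha = .05
reject_wald = wald > 3.841    # chi-square(1) critical value at alpha = .05

print(round(z, 2), round(wald, 2), reject_z, reject_wald)
```

Because the Wald statistic is the square of z, and 3.841 is (approximately) 1.96 squared, the two decisions agree.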

Model Fit: Likelihood Ratio Tests
• MLE computes a likelihood for the model
  • “Better” models have higher likelihoods
  • Log likelihood is typically a negative value, so “better” means a less negative value… -100 > -1000
• Log likelihood ratio test: Allows comparison of any two nested models
  • One model must be a subset of vars in other model
    – You can’t compare totally unrelated models!
  • Models must use the exact same sample.

Model Fit: Likelihood Ratio Tests
• Default LR test comparison: Current model versus “null model”
  • Null model = only a constant; no covariates; K=0
• Also useful: Compare small & large model
  • Do added variables (as a group) fit the data better?
    – Ex: Suppose a theory suggests 4 psychological variables will have an important effect…
      • We could use LR test to compare “base model” to model with 4 additional variables.
  • Stata: Run first model; store estimates (“estimates store”); run second model; use Stata command “lrtest” to compare models

Model Fit: Likelihood Ratio Tests
• Likelihood ratio test is based on the G-square
  • Chi-square distributed; df = K1 – K0
  • K = # variables; K1 = full model, K0 = simpler model
  • L1 = likelihood for full model; L0 = simpler model

$$G^2 = -2\ln\frac{L_0}{L_1} = -2\ln L_0 - (-2\ln L_1)$$

• Significant likelihood ratio test indicates that the larger model (L1) is an improvement
  • G² > critical value; or p-value < .05.

Model Fit: Likelihood Ratio Tests
• Stata’s default LR test; compares to null model

. logistic gun male educ income south liberal, coef

Logistic regression                               Number of obs   =        850
                                                  LR chi2(5)      =      89.53
                                                  Prob > chi2     =     0.0000
Log likelihood = -502.7251                        Pseudo R2       =     0.0818

------------------------------------------------------------------------------
         gun |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .7837017    .156764     5.00   0.000     .4764499    1.090954
        educ |  -.0767763   .0254047    -3.02   0.003    -.1265686    -.026984
      income |   .2416647   .0493794     4.89   0.000     .1448828    .3384466
       south |   .7363169   .1979038     3.72   0.000     .3484327    1.124201
     liberal |  -.1641107   .0578167    -2.84   0.005    -.2774294   -.0507921
       _cons |   -2.28572   .6200443    -3.69   0.000    -3.500984   -1.070455
------------------------------------------------------------------------------

LR chi2(5) indicates G-square for 5 degrees of freedom

Prob > chi2 is a p-value. p < .05 indicates a significantly better model

Model log likelihood = -502.7; the null model’s log likelihood is a lower value (more negative)

Model Fit: Likelihood Ratio Tests
• Example: Null model log likelihood: -547.5; Full model: -502.7
  • 5 new variables, so K1 – K0 = 5.

$$G^2 = -2\ln\frac{L_0}{L_1} = -2\ln L_0 - (-2\ln L_1)$$

$$G^2 = 2(547.5) - 2(502.7) \approx 89.5$$

(Stata reports 89.53, computed from the unrounded log likelihoods.)

• According to the χ² table, crit value = 11.07
  • Since 89.5 greatly exceeds 11.07, we are confident that the full model is an improvement
  • Also, observed p-value in Stata output is .000!
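The G-square arithmetic for this example can be sketched in Python as follows. The log likelihoods are the rounded values quoted on this slide and 11.07 is the chi-square critical value for 5 df at alpha = .05; the variable names are illustrative. With the rounded inputs the result comes out at about 89.6, slightly above the 89.53 that Stata computes from unrounded values.

```python
# Rounded log likelihoods quoted in the example
ll_null = -547.5   # constant-only model
ll_full = -502.7   # model with 5 covariates

# G-square = -2 ln L0 - (-2 ln L1)
g2 = -2 * ll_null - (-2 * ll_full)

df = 5              # K1 - K0
crit_value = 11.07  # chi-square critical value, 5 df, alpha = .05

print(round(g2, 1), g2 > crit_value)
```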

Model Fit: Pseudo R-Square
• Pseudo R-square
  • “A descriptive measure that indicates roughly the proportion of observed variation accounted for by the… predictors.” Knoke et al, p. 313

Logistic regression                               Number of obs   =        850
                                                  LR chi2(5)      =      89.53
                                                  Prob > chi2     =     0.0000
Log likelihood = -502.7251                        Pseudo R2       =     0.0818

------------------------------------------------------------------------------
         gun | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   2.189562   .3432446     5.00   0.000     1.610347    2.977112
        educ |    .926097   .0235272    -3.02   0.003     .8811137    .9733768
      income |   1.273367   .0628781     4.89   0.000     1.155904    1.402767
       south |    2.08823   .4132686     3.72   0.000     1.416845    3.077757
     liberal |    .848648    .049066    -2.84   0.005     .7577291    .9504762
------------------------------------------------------------------------------

Model explains roughly 8% of variation in Y
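The Pseudo R2 that Stata prints for logit/logistic is McFadden's measure, 1 - (ln L1 / ln L0). The Python sketch below recovers it from the numbers in the output above; the null-model log likelihood is not printed, so it is backed out from the LR chi-square (an assumption of this sketch, using G² = 2(ln L1 - ln L0)).

```python
# Values printed in the Stata output
ll_full = -502.7251   # log likelihood of the fitted model
lr_chi2 = 89.53       # LR chi2(5) versus the null model

# Back out the null-model log likelihood: G2 = 2 * (ll_full - ll_null)
ll_null = ll_full - lr_chi2 / 2

# McFadden's pseudo R-square
pseudo_r2 = 1 - ll_full / ll_null

print(round(ll_null, 2), round(pseudo_r2, 4))
```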

Assumptions & Problems
• Assumption: Independent random sample
  • Serial correlation or clustering violate assumptions; bias SE estimates and hypothesis tests
  • We will discuss possible remedies in the future
• Multicollinearity: High correlation among independent variables causes problems
  • Unstable, inefficient estimates
  • Watch for coefficient instability, check VIF/tolerance
  • Remove unneeded variables or create indexes of related variables.

Assumptions & Problems
• Outliers/Influential cases
  • Unusual/extreme cases can distort results, just like OLS
    – Logistic requires different influence statistics
      • Example: dbeta – very similar to OLS “Cook’s D”
    – Outlier diagnostics are available in Stata
      • After model: “predict outliervar, dbeta”
      • Lists & graphs of residuals & dbetas can identify influential cases.

Plotting Residuals by Casenumber
• predict sresid, rstandard
• gen casenum = _n
• scatter sresid casenum

[Figure: scatterplot of standardized Pearson residual (y-axis, -2 to 3) against casenum (x-axis, 0 to 3000)]

Assumptions & Problems
• Insufficient variance: You need cases for both values of the dependent variable
  • Extremely rare (or common) events can be a problem
  • Suppose N=1000, but only 3 are coded Y=1
  • Estimates won’t be great
    – Also: Maximum likelihood estimates cannot be computed if any independent variable perfectly predicts the outcome (Y=1)
      • Ex: Suppose Soc 8811 drives all students to drink coffee... So there is no variation…
        – In that case, you cannot include a dummy variable for taking Soc 8811 in the model.

Assumptions & Problems
• Model specification / Omitted variable bias
  • Just like any regression model, it is critical to include appropriate variables in the model
  • Omission of important factors or ‘controls’ will lead to misleading results.

Real World Example: Coups
• Issue: Many countries face the threat of a coup d’etat – violent overthrow of the regime
  • What factors affect whether a country will have a coup?
• Paper Handout: Belkin and Schofer (2005)
  • What are the basic findings?
  • How much do the odds of a coup differ for military regimes vs. civilian governments?
    – b=1.74; (e^1.74 - 1)*100% = +470%
  • What about a 2-point increase in log GDP?
    – b=-.233; ((e^-.233 * e^-.233) - 1)*100% = -37%
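Both percentage changes can be verified with a short Python calculation. This is only an arithmetic check of the two coefficients quoted above (1.74 and -.233); the variable names are illustrative.

```python
import math

# Military regime vs. civilian government: one-unit change, b = 1.74
pct_military = (math.exp(1.74) - 1) * 100

# 2-point increase in log GDP, b = -.233: multiply the odds ratio twice
pct_gdp_2pt = (math.exp(-0.233) * math.exp(-0.233) - 1) * 100

print(round(pct_military), round(pct_gdp_2pt))
```

Multiplying the odds ratio by itself is the same as exponentiating 2 * (-.233), so either form gives the 2-point effect.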

Real World Example
• Goyette, Kimberly and Yu Xie. 1999. “Educational Expectations of Asian American Youths: Determinants and Ethnic Differences.” Sociology of Education, 72, 1:22-36.
  • What was the paper about?
  • What was the analysis?
  • Dependent variable? Key independent variables?
  • Findings?
  • Issues / comments / criticisms?
