
Unit 4c: Taxonomies of Logistic Regression Models

© Andrew Ho, Harvard Graduate School of Education Unit 4c – Slide 1

http://xkcd.com/795/

• Building the Logistic Regression Model
• Dichotomous Predictors
• Interactions
• Post-Hoc GLH Tests

Course Roadmap: Unit 4c

Multiple Regression Analysis (MRA): $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

• Do your residuals meet the required assumptions? Test for residual normality. Use influence statistics to detect atypical datapoints.
• If your residuals are not independent, replace OLS by GLS regression analysis: use individual growth modeling, or specify a multi-level model.
• If time is a predictor, you need discrete-time survival analysis…
• If your outcome is categorical, you need to use… binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with: create taxonomies of fitted models and compare them; form composites of the indicators of any common construct; conduct a Principal Components Analysis; use Cluster Analysis; use Factor Analysis: EFA or CFA?
• If your outcome vs. predictor relationship is non-linear: use non-linear regression analysis; transform the outcome or predictor.

Today’s Topic Area (marked on the roadmap figure)

The Bivariate Distribution of HOME on HUBSAL

[Figure: Is Woman a Homemaker? (0 = In Labor Force, 1 = Homemaker) vs. Husband's Annual Salary (in $1,000), 0 to 50.]

RQ: In 1976, were married Canadian women who had children at home and husbands with higher salaries more likely to work at home rather than joining the labor force (when compared to their married peers with no children at home and husbands who earn less)?

The Bivariate Distribution of HOME on CHILD

[Figure: Is Woman a Homemaker? (0 to 1) vs. Are Children Present in the Home? (0 to 1).]

Scatterplots don’t work very well with dichotomous outcomes and dichotomous predictors.

Instead, try a 2x2 table with the “tabulate” command. Note (1,1) is in the lower right for tables but upper right for scatterplots.

. tabulate CHILD HOME, cell row

Key: frequency, row percentage, cell percentage

Are Children Present |  Is Woman a Homemaker?
in the Home?         |   In Labor   Homemaker |     Total
---------------------+------------------------+----------
No Child             |         88          48 |       136
                     |      64.71       35.29 |    100.00
                     |      20.28       11.06 |     31.34
---------------------+------------------------+----------
Children at Home     |         40         258 |       298
                     |      13.42       86.58 |    100.00
                     |       9.22       59.45 |     68.66
---------------------+------------------------+----------
Total                |        128         306 |       434
                     |      29.49       70.51 |    100.00
                     |      29.49       70.51 |    100.00

The “cell row” options specify conditional percentages by rows (and joint probabilities by cells):
• Given that there is a child present, the sample probability of being a homemaker is 86.58%.
• Given that there is no child present, the sample probability of being a homemaker is 35.29%.
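The row and cell percentages can be checked by hand from the four frequencies. The following is a minimal sketch in Python rather than Stata; the names `counts`, `row_percentages`, and `cell_percentages` are illustrative and do not come from the course code.

```python
# Frequencies from the tabulate output: rows are CHILD, columns are HOME.
counts = {
    "No Child": {"In Labor": 88, "Homemaker": 48},
    "Children at Home": {"In Labor": 40, "Homemaker": 258},
}


def row_percentages(row):
    """Conditional percentages within one row (what the 'row' option reports)."""
    total = sum(row.values())
    return {label: 100 * n / total for label, n in row.items()}


def cell_percentages(table):
    """Joint percentages over the whole table (what the 'cell' option reports)."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {
        child: {label: 100 * n / grand_total for label, n in row.items()}
        for child, row in table.items()
    }


row_pct = {child: row_percentages(row) for child, row in counts.items()}
cell_pct = cell_percentages(counts)

for child in counts:
    print(child, "row %:", {k: round(v, 2) for k, v in row_pct[child].items()})
    print(child, "cell %:", {k: round(v, 2) for k, v in cell_pct[child].items()})
```

The row percentages condition on CHILD, so they answer "given children are (or are not) present, how likely is a woman to be a homemaker?", while the cell percentages describe the whole sample of 434.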

Sample Probabilities, Odds, Log-Odds, and Odds Ratios

                  Sample        Sample     Sample          Sample   Sample   Sample
Are Children      Probability   Log-Odds   Difference in   Odds     Odds     Log-Odds
Present?          Homemaker     (Logit)    Log-Odds                 Ratio    Ratio
No Child          35.29%        -.606      2.47            .545     11.83    2.47
Children          86.58%        1.864                      6.45

I recommend understanding the logit scale (nonlinear in probability): -2 is around 10%, -1 is around 25%, 0 is 50%, 1 is around 75%, 2 is around 90%.

We note that an increment from No Child (0) to Children (1) increments the log-odds by 2.47.

I find it harder to interpret “odds” unless I want to know how much money I’ll win on a bet, or as “for every case where $Y = 0$, we estimate there are (odds) cases where $Y = 1$.” For example, the sample odds of 6.45 mean about 6.45 homemakers for every woman in the labor force among women with children at home.

Odds ratios are harder still. A way to compare two probabilities for a unit increment in $X$: The sample odds of being a homemaker increase by a factor of 11.83 if you have children (for a unit increment in CHILD).

We can see that the difference in log-odds is the log-odds ratio (by the properties of logs). To get to the odds ratio from the log-odds ratio, we exponentiate: $e^{2.47} \approx 11.8$.
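To make the chain from probability to odds to log-odds concrete, here is a small sketch that recomputes the table entries from the sample frequencies and also evaluates the logit-scale landmarks. It is written in Python (not the Stata used in class), and the helper names `odds`, `logit`, and `inv_logit` are assumptions for illustration.

```python
import math


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)


def logit(p):
    """Log-odds corresponding to a probability p."""
    return math.log(odds(p))


def inv_logit(x):
    """Probability corresponding to a log-odds value x."""
    return 1 / (1 + math.exp(-x))


# Sample probabilities of being a homemaker, from the 2x2 table.
p_no_child = 48 / 136
p_children = 258 / 298

odds_ratio = odds(p_children) / odds(p_no_child)
log_odds_ratio = logit(p_children) - logit(p_no_child)  # difference in log-odds

# Landmarks on the logit scale, converted back to probabilities.
landmarks = {x: inv_logit(x) for x in (-2, -1, 0, 1, 2)}

print("odds:", round(odds(p_no_child), 3), round(odds(p_children), 3))
print("logits:", round(logit(p_no_child), 3), round(logit(p_children), 3))
print("odds ratio:", round(odds_ratio, 3), "log-odds ratio:", round(log_odds_ratio, 3))
print({x: round(p, 3) for x, p in landmarks.items()})
```

Exponentiating `log_odds_ratio` returns `odds_ratio`, which is the "properties of logs" point: a difference of logs is the log of a ratio.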

Modeling a Dichotomous Outcome on a Dichotomous Predictor

. logit HOME CHILD

Iteration 0:   log likelihood = -263.22441
Iteration 1:   log likelihood = -207.40062
Iteration 2:   log likelihood = -205.81844
Iteration 3:   log likelihood = -205.81288
Iteration 4:   log likelihood = -205.81288

Logistic regression                               Number of obs   =        434
                                                  LR chi2(1)      =     114.82
                                                  Prob > chi2     =     0.0000
Log likelihood = -205.81288                       Pseudo R2       =     0.2181

------------------------------------------------------------------------------
        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       CHILD |   2.470216   .2471294    10.00   0.000     1.985851    2.954581
       _cons |  -.6061358   .1794351    -3.38   0.001    -.9578222   -.2544494
------------------------------------------------------------------------------

Our fitted model: $\widehat{\text{logit}}\big(P(HOME = 1)\big) = -0.606 + 2.470\,CHILD$

Alternatively: $\hat{P}(HOME = 1) = \dfrac{1}{1 + e^{-(-0.606 + 2.470\,CHILD)}}$

Are Children      Sample        Sample     Sample
Present in the    Probability   Log-Odds   Difference
Home?             Homemaker     (Logit)    in Log-Odds
No Child          35.29%        -.606      2.47
Children          86.58%        1.864
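With a single dichotomous predictor, the fitted model has exactly as many parameters as there are groups, so its fitted probabilities reproduce the sample proportions. A brief sketch, assuming Python and the coefficients copied from the logit output (the function names are hypothetical):

```python
import math

# Coefficients copied from the logit HOME CHILD output.
B0 = -0.6061358
B_CHILD = 2.470216


def fitted_logit(child):
    """Fitted log-odds that HOME = 1 for CHILD = 0 or 1."""
    return B0 + B_CHILD * child


def fitted_probability(child):
    """Fitted probability that HOME = 1, via the inverse logit."""
    return 1 / (1 + math.exp(-fitted_logit(child)))


for child in (0, 1):
    print(child, round(fitted_logit(child), 3), round(fitted_probability(child), 4))
```

The intercept is the log-odds for the No Child group, and the intercept plus the CHILD coefficient is the log-odds for the Children group.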

Building the logistic regression model

Our old friend eststo:
• Beginning with the baseline model, no predictors, constant only (Model 1).
• Adding main effects separately (Models 2 and 3), together (Model 4), and an interaction (Model 5).
• At each step, save the “deviance” (-2*loglikelihood).

As always, we store key statistics, including the $\chi^2$ statistic (akin to the omnibus $F$ statistic in OLS regression), the degrees of freedom for the model, the deviance, and the pseudo-$R^2$ value (which we recall is based on the reduction in model deviance over baseline).

Always include a descriptive title, the sample size, and variable abbreviation explanations.

Interpretation of Main Effects

Interpretation of main effects: On average, two observations that differ by 1 on $X$ are predicted to differ by $\hat\beta$ in the log-odds (logits) that $Y = 1$, if all else in the model can be held constant.

The fitted log-odds that the wife is a homemaker are 2.47 logits higher when children are present (Model 3), and 2.58 logits higher when children are present and the husband’s salary can be held constant (Model 4).

Equivalently, on average, two observations that differ by 1 on $X$ differ in their odds that $Y = 1$ by a factor of $e^{\hat\beta}$. The fitted odds that the wife is a homemaker are $e^{2.470} \approx 11.8$ times larger when children are present, and $e^{2.582} \approx 13.2$ times larger when children are present and the husband’s salary can be held constant.

Fitting logistic regression models for the probability of a wife being a homemaker (n=434)

HOME        Model 1     Model 2     Model 3     Model 4     Model 5
HUBSAL                  0.0808***               0.0922***   0.0591*
                        (0.0184)                (0.0203)    (0.0249)
CHILD                               2.470***    2.582***    1.504**
                                    (0.247)     (0.261)     (0.583)
HUBxCHLD                                                    0.0838*
                                                            (0.0417)
_cons       0.872***    -0.237      -0.606***   -1.948***   -1.461***
            (0.105)     (0.263)     (0.179)     (0.354)     (0.409)
chi2        2.05e-12    22.40       114.8       138.4       142.5
df_m        0           1           1           2           3
neg2ll      526.4       504.0       411.6       388.1       384.0
r2_p        3.89e-15    0.0425      0.218       0.263       0.271

Standard errors in parentheses
HUBSAL = Husband's salary in 1976 $1000; CHILD = Are children present?
* p<0.05, ** p<0.01, *** p<0.001
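Moving from the logit metric to the odds metric is just exponentiation of each coefficient. A minimal sketch in Python, using the rounded coefficients from the table above (the dictionary keys are labels chosen here for illustration):

```python
import math

# Rounded coefficients (in logits) taken from the taxonomy table.
coefficients = {
    "Model 3 CHILD": 2.470,
    "Model 4 CHILD": 2.582,
    "Model 4 HUBSAL": 0.0922,
}

# Exponentiating a logit coefficient gives the multiplicative change in the odds
# for a one-unit difference in that predictor, holding the others constant.
odds_factors = {name: math.exp(b) for name, b in coefficients.items()}

for name, factor in odds_factors.items():
    print(f"{name}: odds multiply by {factor:.3f} per unit")
```

For HUBSAL the unit is $1,000 of the husband's 1976 salary, so its odds factor is close to 1 even though the coefficient is statistically significant.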

Interpretation of Fit Statistics

The omnibus $\chi^2$ statistic is simply the difference in the deviances (-2loglikelihood) compared to the baseline model, evaluated on the number of degrees of freedom in the model. For example, Model 1 to Model 3, $\chi^2 = 526.4 - 411.6 = 114.8$ on 1 degree of freedom.

The critical chi-square value can always be looked up as . display invchi2tail(df, .05), where df represents the degrees of freedom, though you can just reference the p-value, too.

Nested tests between non-baseline models are simple differences in neg2ll (deviances) evaluated on differences in df_m (degrees of freedom).

See Unit 4b for further details on nested tests and interpretation and use of pseudo-$R^2$.
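The deviance arithmetic can be sketched as follows. This is Python with only the standard library, not the course's Stata code; the function names are assumptions, and the tail probability uses a closed form that holds only for 1 degree of freedom (for other df one would use a chi-square routine such as Stata's own functions).

```python
import math

# Deviances (neg2ll) and model degrees of freedom (df_m) from the taxonomy table.
neg2ll = {1: 526.4, 2: 504.0, 3: 411.6, 4: 388.1, 5: 384.0}
df_m = {1: 0, 2: 1, 3: 1, 4: 2, 5: 3}


def lr_test(reduced, full):
    """Difference in deviances and in degrees of freedom for two nested models."""
    return neg2ll[reduced] - neg2ll[full], df_m[full] - df_m[reduced]


def chi2_tail_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df (1 df only)."""
    return math.erfc(math.sqrt(x / 2))


def pseudo_r2(model):
    """Proportional reduction in deviance relative to the baseline (Model 1)."""
    return 1 - neg2ll[model] / neg2ll[1]


omnibus_chi2, omnibus_df = lr_test(1, 3)  # Model 3 against the baseline
nested_chi2, nested_df = lr_test(4, 5)    # does the interaction improve on Model 4?
nested_p = chi2_tail_1df(nested_chi2)

print("Model 1 vs 3:", round(omnibus_chi2, 1), "on", omnibus_df, "df")
print("Model 4 vs 5:", round(nested_chi2, 1), "on", nested_df, "df, p =", round(nested_p, 4))
print("pseudo-R2, Model 3:", round(pseudo_r2(3), 3))
```

Because the table reports rounded deviances, the nested statistic here is approximate; the same comparison run on the stored estimates would use unrounded log likelihoods.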


Graphical Representation of Model 4

[Figure: Is Woman a Homemaker? (0 to 1) vs. Husband's Annual Salary (in $1000, 0 to 50), with fitted curves labeled No Children and Children.]

It is always good practice to plot fitted curves only in the range of the data whose relationships they describe.

It is particularly important for graphing logistic regression models on the probability metric, where there are clearly nonlinear relationships.

See today’s code for details. Label curves.

How do we interpret the varying gap? As an interaction?

No! There is no interaction in Model 4.

The scale is not what it seems. This is actually a linear model in the log-odds.

The distance is just as large at the extremes as it is in the center; it just doesn’t seem that way, since we are plotting on the probability metric.
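The varying gap can be reproduced numerically. The sketch below, in Python with hypothetical function names, evaluates Model 4 at a few salary values using the rounded coefficients from the taxonomy table, and reports the Children vs. No Children gap on both metrics.

```python
import math

# Model 4 coefficients (rounded) from the taxonomy table: no interaction term.
B0, B_HUBSAL, B_CHILD = -1.948, 0.0922, 2.582


def logit_model4(hubsal, child):
    """Fitted log-odds of being a homemaker under Model 4."""
    return B0 + B_HUBSAL * hubsal + B_CHILD * child


def prob_model4(hubsal, child):
    """Fitted probability of being a homemaker under Model 4."""
    return 1 / (1 + math.exp(-logit_model4(hubsal, child)))


logit_gaps = {}
prob_gaps = {}
for hubsal in (0, 10, 20, 30, 40):
    logit_gaps[hubsal] = logit_model4(hubsal, 1) - logit_model4(hubsal, 0)
    prob_gaps[hubsal] = prob_model4(hubsal, 1) - prob_model4(hubsal, 0)
    print(hubsal, round(logit_gaps[hubsal], 3), round(prob_gaps[hubsal], 3))
```

The logit gap equals the CHILD coefficient at every salary, while the probability gap shrinks as both curves approach 1, which is why the plot can look like an interaction when there is none.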

Contrasting Graphical Representations of Model 4

In Probability Space:

[Figure: Is Woman a Homemaker? (0 to 1) vs. Husband's Annual Salary (in $1000, 0 to 50), with fitted curves labeled No Children and Children.]

In Logit (Log-Odds) Space:

[Figure: Log-Odds that Woman is a Homemaker (-2 to 5) vs. Husband's Annual Salary (in $1000, 0 to 50), with fitted lines labeled No Children and Children.]

Communicating differences in probabilities is best done by picking prototypical values of predictors, but, as you can see, the coefficient does not have a consistent relationship with differences in probabilities (nonlinear in probabilities).

Communicating differences in logits is fine if readers understand logits, and it has the benefit of having a consistent relationship with predictors (linear in the log-odds).

Finally, one can communicate differences in terms of odds ratios, but these are often misinterpreted as “relative risk,” a ratio of probabilities. They are, of course, a ratio of odds, not probabilities.
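The distinction between an odds ratio and relative risk is easy to see with the sample proportions from the CHILD by HOME table. A short illustrative sketch in Python (the variable names are chosen here, not taken from the slides):

```python
# Sample probabilities of being a homemaker, from the 2x2 table.
p_no_child = 48 / 136
p_children = 258 / 298

# Relative risk: a ratio of probabilities.
relative_risk = p_children / p_no_child

# Odds ratio: a ratio of odds, where odds = p / (1 - p).
odds_ratio = (p_children / (1 - p_children)) / (p_no_child / (1 - p_no_child))

print("relative risk:", round(relative_risk, 3))
print("odds ratio:", round(odds_ratio, 3))
```

Reading the odds ratio as if it were a ratio of probabilities would greatly overstate how much more likely women with children at home are to be homemakers.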

Interpretation of Model 5

As always, with statistically significant interaction effects, main effects become difficult to interpret, graphical representations become essential, and post-hoc GLH tests become particularly relevant.

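Write out the model: $\widehat{\text{logit}}\big(P(HOME = 1)\big) = -1.461 + 0.0591\,HUBSAL + 1.504\,CHILD + 0.0838\,(HUBSAL \times CHILD)$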
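One way to read a model with an interaction is to collapse it into a separate fitted line in the log-odds for each value of CHILD. A minimal sketch, assuming Python and the rounded Model 5 coefficients from the taxonomy table; `conditional_line` is a hypothetical helper.

```python
# Model 5 coefficients (rounded) from the taxonomy table.
B0, B_HUBSAL, B_CHILD, B_INTERACTION = -1.461, 0.0591, 1.504, 0.0838


def conditional_line(child):
    """Intercept and HUBSAL slope (in logits) of the fitted line for CHILD = 0 or 1."""
    intercept = B0 + B_CHILD * child
    slope = B_HUBSAL + B_INTERACTION * child
    return intercept, slope


for child in (0, 1):
    intercept, slope = conditional_line(child)
    print(f"CHILD = {child}: logit = {intercept:.3f} + {slope:.4f} * HUBSAL")
```

The two lines differ in both intercept and slope, which is what makes the main effects hard to interpret on their own.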


[Figure: Is Woman a Homemaker? (0 to 1) vs. Husband's Annual Salary (in $1000, 0 to 50), with curves labeled No Children and Children.]

Contrasting Graphical Representations of Model 5

In Probability Space:

[Figure: Is Woman a Homemaker? (0 to 1) vs. Husband's Annual Salary (in $1000, 0 to 50).]

In Logit (Log-Odds) Space:

[Figure: Log-Odds that Woman is a Homemaker (-2 to 7) vs. Husband's Annual Salary (in $1000, 0 to 50), with lines labeled No Children and Children.]

Post-Hoc GLH Tests: Gaps Between Conditional Logistic Curves

At HUBSAL = $1K, are differences statistically significant?

Following our same old rules, where $X_1$ is HUBSAL and $X_2$ is CHILD:

$\text{logit}\big(P(HOME = 1)\big) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$

At HUBSAL = 1 (in $1,000):
• With children ($X_2 = 1$): $\beta_0 + \beta_1 + \beta_2 + \beta_3$
• Without children ($X_2 = 0$): $\beta_0 + \beta_1$

Take the difference, and we test whether $H_0: \beta_2 + \beta_3 = 0$. We can reject $H_0$, $\chi^2(1) = 8.46$, $p = .0036$.

[Figure: Log-Odds that Woman is a Homemaker (-2 to 7) vs. Husband's Annual Salary (in $1000, 0 to 50), with lines labeled No Children and Children.]

. test CHILD+HUBxCHLD=0

 ( 1)  [HOME]CHILD + [HOME]HUBxCHLD = 0

           chi2(  1) =    8.46
         Prob > chi2 =    0.0036
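The estimated gap being tested, and the p-value that goes with the reported chi-square, can be sketched as below. This is Python, not Stata; it uses the rounded Model 5 coefficients, takes the chi-square statistic from the test output rather than recomputing it (that would need the coefficient covariance matrix, which the slides do not show), and the function names are assumptions.

```python
import math

# Rounded Model 5 coefficients from the taxonomy table.
B_CHILD, B_INTERACTION = 1.504, 0.0838


def child_gap_in_logits(hubsal):
    """Estimated Children minus No Children gap in log-odds at a given HUBSAL."""
    return B_CHILD + B_INTERACTION * hubsal


def chi2_tail_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df (1 df only)."""
    return math.erfc(math.sqrt(x / 2))


gap_at_1k = child_gap_in_logits(1)  # the quantity tested by CHILD + HUBxCHLD = 0
p_value = chi2_tail_1df(8.46)       # chi2(1) copied from the test output

print("gap at HUBSAL = 1:", round(gap_at_1k, 4), "logits")
print("p-value for chi2(1) = 8.46:", round(p_value, 4))
```

Changing the argument of `child_gap_in_logits` shows how the hypothesis would be written at other salaries: at HUBSAL = h, the tested combination is CHILD + h*HUBxCHLD.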

Post-Hoc GLH Tests: Conditional Slopes

When CHILD = 0 or CHILD = 1, are coefficients on HUBSAL statistically significant? Following our same old rules, where $X_1$ is HUBSAL and $X_2$ is CHILD:

$\text{logit}\big(P(HOME = 1)\big) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$

When CHILD = 0, $\text{logit}\big(P(HOME = 1)\big) = \beta_0 + \beta_1 X_1$. And we can reject $H_0: \beta_1 = 0$ from the original regression output, $p < .05$.

When CHILD = 1, $\text{logit}\big(P(HOME = 1)\big) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1$. So we can test the significance of the effective coefficient: $H_0: \beta_1 + \beta_3 = 0$.

We can reject $H_0$, $\chi^2(1) = 18.21$, $p < .0001$. Not surprising given results of previous test.

[Figure: Log-Odds that Woman is a Homemaker (-2 to 7) vs. Husband's Annual Salary (in $1000, 0 to 50), with lines labeled No Children and Children. Annotation on each line: Are these “slopes” 0 in the population?]

. test HUBSAL+HUBxCHLD=0

 ( 1)  [HOME]HUBSAL + [HOME]HUBxCHLD = 0

           chi2(  1) =   18.21
         Prob > chi2 =    0.0000
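As a rough check on the two conditional-slope conclusions, the sketch below computes the slope for each group from the rounded Model 5 coefficients, a Wald z-test for the CHILD = 0 slope from its reported standard error, and the p-value implied by the reported chi-square for the CHILD = 1 slope. It is Python with assumed helper names; the standard error of the summed slope is not shown on the slides, so that test statistic is copied rather than recomputed.

```python
import math

# Rounded Model 5 values from the taxonomy table.
B_HUBSAL, SE_HUBSAL = 0.0591, 0.0249
B_INTERACTION = 0.0838


def two_sided_p_from_z(z):
    """Two-sided normal p-value for a Wald z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def chi2_tail_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df (1 df only)."""
    return math.erfc(math.sqrt(x / 2))


slope_no_children = B_HUBSAL                  # beta_1
slope_children = B_HUBSAL + B_INTERACTION     # beta_1 + beta_3

z_no_children = B_HUBSAL / SE_HUBSAL
p_no_children = two_sided_p_from_z(z_no_children)
p_children = chi2_tail_1df(18.21)             # chi2(1) copied from the test output

print("CHILD = 0 slope:", slope_no_children, "p =", round(p_no_children, 4))
print("CHILD = 1 slope:", round(slope_children, 4), "p =", f"{p_children:.2e}")
```

The CHILD = 0 result agrees with the single asterisk on HUBSAL in Model 5, and the CHILD = 1 p-value is small enough that Stata prints it as 0.0000.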

Even more “general” GLH Tests: Any two points.

Across two different values of two different variables, are population probabilities the same?

Following our same old rules, where $X_1$ is HUBSAL and $X_2$ is CHILD:

• Wife with children and HUBSAL = 1 ($X_1 = 1$, $X_2 = 1$): $\beta_0 + \beta_1 + \beta_2 + \beta_3$
• Wife with no children and HUBSAL = 35 ($X_1 = 35$, $X_2 = 0$): $\beta_0 + 35\beta_1$

So we can test the significance of the gap by taking the difference of the two equations, $H_0: 34\beta_1 - \beta_2 - \beta_3 = 0$. We retain $H_0$, $\chi^2(1) = 0.41$, $p = .5222$.

[Figure: Log-Odds that Woman is a Homemaker (-2 to 7) vs. Husband's Annual Salary (in $1000, 0 to 50), with lines labeled No Children and Children.]

Does a wife with 1+ child and a low-income husband ($1K) have the same population probability of being a homemaker as a wife with no children but a more wealthy husband ($35K)?

. test 34*HUBSAL-CHILD-HUBxCHLD=0

 ( 1)  34*[HOME]HUBSAL - [HOME]CHILD - [HOME]HUBxCHLD = 0

           chi2(  1) =    0.41
         Prob > chi2 =    0.5222
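The weights in the test command fall out of subtracting the two fitted equations. A small sketch in Python, using the rounded Model 5 coefficients and hypothetical names, confirms that the difference in fitted logits between the two prototypical wives equals the weighted sum 34*HUBSAL - CHILD - HUBxCHLD.

```python
import math

# Rounded Model 5 coefficients from the taxonomy table.
B0 = -1.461
coef = {"HUBSAL": 0.0591, "CHILD": 1.504, "HUBxCHLD": 0.0838}


def logit_model5(hubsal, child):
    """Fitted log-odds of being a homemaker under Model 5."""
    return (B0 + coef["HUBSAL"] * hubsal + coef["CHILD"] * child
            + coef["HUBxCHLD"] * hubsal * child)


# Direct difference: no children with HUBSAL = 35, minus children with HUBSAL = 1.
contrast = logit_model5(35, 0) - logit_model5(1, 1)

# The same contrast written as the linear combination used in the test command.
weights = {"HUBSAL": 34, "CHILD": -1, "HUBxCHLD": -1}
weighted_sum = sum(weights[name] * coef[name] for name in coef)

# p-value implied by the reported chi2(1) = 0.41 (closed form valid for 1 df only).
p_value = math.erfc(math.sqrt(0.41 / 2))

print("contrast in logits:", round(contrast, 4))
print("weighted sum:", round(weighted_sum, 4))
print("p-value for chi2(1) = 0.41:", round(p_value, 3))
```

The intercept cancels in the subtraction, which is why it never appears in the test statement.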

Revisiting Model Fit and Error Variance

[Figure: Probability that Woman is a Homemaker (0 to 1) vs. Husband's Annual Salary (in $1000, 0 to 50).]

We cared a lot about error distributions in multiple regression. Why don’t we seem to care as much in logistic regression?

All the information about the variance of observations around the regression line is contained in the estimate of the probability, $\hat{p}$, itself.

See Excel Demo.

If those estimates (the fitted probabilities) describe the data, operationalized by the lpoly fit, then the model fits the data, and the error variance assumptions, which we usually have to diagnose separately, are satisfied.

In practice, this is an “eyeballed” test, like many others.

John Willett calls this “the only assumption that matters in logistic regression.”
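The claim that the probability carries all the variance information follows from the fact that a 0/1 outcome with probability $p$ has variance $p(1-p)$. The sketch below, in Python with names chosen for illustration, checks this against the No Child group from the 2x2 table (48 homemakers, 88 in the labor force); it is not a reproduction of the Excel demo.

```python
def bernoulli_variance(p):
    """Variance of a 0/1 outcome whose probability of 1 is p."""
    return p * (1 - p)


# The No Child group as raw 0/1 outcomes: 48 homemakers (1) and 88 in the labor force (0).
outcomes = [1] * 48 + [0] * 88

n = len(outcomes)
p_hat = sum(outcomes) / n

# Variance of the raw outcomes around their mean (dividing by n).
direct_variance = sum((y - p_hat) ** 2 for y in outcomes) / n

print("p-hat:", round(p_hat, 4))
print("direct variance:", round(direct_variance, 4))
print("p-hat * (1 - p-hat):", round(bernoulli_variance(p_hat), 4))
```

Once the probability is known there is no separate variance parameter to estimate, unlike the residual variance in OLS regression.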