Multiple Regression 4 Sociology 5811 Lecture 25 Copyright © 2005 by Evan Schofer Do not copy or distribute without permission

TRANSCRIPT

Page 1: (title slide)

Page 2:

Announcements

• Schedule:
  – Today: Multiple regression hypothesis tests, assumptions, and problems
  – Next class: More diagnostics
    • Including “outliers,” which you should address for the final paper. Don’t miss class!
• Reminder: The final paper deadline is coming up soon!
• Questions about the paper?

Page 3:

Review: Interaction Terms

• Interaction terms: The effect of a variable changes across groups or levels of a third variable
• Example: The effect of income on happiness may be different for women and men
  • If men are more materialistic, each dollar has a bigger effect
• The issue isn’t that men are “more” or “less” happy than women
  – Rather, the slope of a variable’s coefficient (for income) differs across groups
• Essentially: a different regression line (slope) for each group.

Page 4:

Review: Interaction Terms
• Visually: Women = blue, Men = red

[Scatterplot: HAPPY (0–10) on the Y axis vs. INCOME ($0–$100,000) on the X axis, with an overall slope drawn through all data points and separate lines for men and women]

Note: Here, the slope for men and women differs. The effect of income on happiness (X1 on Y) varies with gender (X2). This is called an “interaction effect.”

Page 5:

Review: Interaction Terms

• Examples of interaction:
  – The effect of education on income may interact with the type of school attended (public vs. private)
    • Private schooling has a bigger effect on income
  – The effect of aspirations on educational attainment interacts with poverty
    • Aspirations matter less if you don’t have money to pay for college.

Page 6:

Review: Interaction Terms

• Interaction effects: Differences in the relationship (slope) between two variables across the categories of a third variable
• Option #1: Analyze each group separately
  • Look for different-sized slopes in each group
• Option #2: Multiply the two variables of interest (DFEMALE, INCOME) to create a new variable
  – Called DFEMALE*INCOME
  – Add that variable to the multiple regression model.
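Option #2 can be sketched in Python (the course itself uses SPSS; the data below are simulated purely for illustration, and the variable names follow the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
income = rng.uniform(0, 100_000, n)
dfemale = rng.integers(0, 2, n)          # 1 = female, 0 = male

# Simulated outcome with a true interaction: income matters less for women
happy = 2 + 0.00006 * income - 0.00003 * dfemale * income + rng.normal(0, 1, n)

# Multiply DFEMALE and INCOME, then add the product as a predictor
X = np.column_stack([np.ones(n), income, dfemale, dfemale * income])
b, *_ = np.linalg.lstsq(X, happy, rcond=None)

# b[1] is the income slope for men; b[1] + b[3] is the slope for women
```

A negative coefficient on the product term indicates a flatter income slope for women, which is exactly the “different regression line for each group” idea.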

Page 7:

Review: Interaction Terms
• Example: Interaction of race and education affecting job prestige:

Coefficients (Dependent Variable: PRESTIGE)

             B          Std. Error   Beta     t        Sig.
(Constant)   8.855      1.744                 5.076    .000
EDUC         2.541      .118         .531     21.563   .000
INCOM16      6.636E-02  .396         .004     .167     .867
DBLACK       4.293      4.193        .088     1.024    .306
BL_EDUC      -.576      .332         -.149    -1.735   .083

DBLACK*EDUC (BL_EDUC) has a negative effect (nearly significant). The coefficient of -.576 indicates that the slope of education on job prestige is .576 points lower for Blacks than for non-Blacks.

Page 8:

Continuous Interaction Terms

• Two continuous variables can also interact
• Example: The effect of education and income on happiness
  – Perhaps highly educated people are less materialistic
  – As education increases, the slope between income and happiness would decrease
• Simply multiply education and income to create the interaction term EDUCATION*INCOME
• And add it to the model.

Page 9:

Interpreting Interaction Terms

• How do you interpret continuous-variable interactions?
• Example: EDUCATION*INCOME, coefficient = 2.0
• Answer: For each unit change in education, the slope of income vs. happiness increases by 2
  – Note: The coefficient is symmetrical: for each unit change in income, the education slope increases by 2
• Dummy interactions effectively estimate 2 slopes: one for each group
• Continuous interactions yield many slopes: each value of the other variable implies a different slope.

Page 10:

Interpreting Interaction Terms

• Interaction terms alter the interpretation of the “main effect” coefficients
• Including EDUC*INCOME changes the interpretation of EDUC and of INCOME
• See Allison, pp. 166–9
  – Specifically, the coefficient for EDUC represents the slope of EDUC when INCOME = 0
  • Likewise, INCOME shows the slope when EDUC = 0
  – Thus, main effects are like “baseline” slopes
  • And the interaction coefficient shows how the slope grows (or shrinks) for a given unit change.
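The “baseline slope” logic reduces to simple arithmetic. In this sketch, the interaction coefficient of 2.0 comes from the slide’s hypothetical example; the baseline income slope of 0.5 is a made-up value for illustration:

```python
# Hypothetical coefficients: 2.0 is the slide's EDUC*INCOME example value;
# 0.5 is an assumed "main effect" of INCOME (its slope when EDUC = 0)
b_income = 0.5
b_educ_x_income = 2.0

def income_slope(educ):
    """Implied slope of income on happiness at a given level of education."""
    return b_income + b_educ_x_income * educ
```

At EDUC = 0 the income slope is just the main effect (0.5), and each one-unit increase in education raises the implied slope by 2, which is why main effects act as baselines.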

Page 11:

Dummy Interactions

• It is also possible to construct interaction terms from two dummy variables
  – Instead of a “slope” interaction, dummy interactions show a difference in constants
  • The constant (not the slope) differs across values of a third variable
  – Example: The effect of race on school success varies by gender
  • African Americans do less well in school, but the difference is much larger for black males.

Page 12:

Dummy Interactions

• The strategy for dummy interactions is the same: multiply the two variables
  – Example: Multiply DBLACK and DMALE to create DBLACK*DMALE
  • Then include all 3 variables in the model
  – The effect of DBLACK*DMALE reflects the difference in the constant (level) for black males, compared to white males and black females
• You would observe a negative coefficient, indicating that black males fare worse in school than black females or white males.

Page 13:

Interaction Terms: Remarks

• 1. If you include an interaction, you should also include its component variables in the model:
  – A model with DFEMALE*INCOME should also include DFEMALE and INCOME
  • There are rare exceptions. But when in doubt, include them
• 2. Sometimes interaction terms are highly correlated with their components
• That can cause problems (multicollinearity, which we’ll discuss more soon)

Page 14:

Interaction Terms: Remarks

• 3. Make sure you have enough cases in each group for your interaction terms
  – Interaction terms involve estimating slopes for subgroups (e.g., black females vs. black males)
  • If there are hardly any black females in the dataset, you can have problems
• 4. “Three-way” interactions are also possible!
  • An interaction effect that varies across categories of yet another variable
  – Ex: The DMALE*DBLACK interaction may vary across class
• They are mainly used in experimental research settings with large sample sizes… but they are possible.

Page 15:

Multiple Regression Hypothesis Tests

• Hypothesis tests can be conducted independently for the slope (b) of each X variable
• For X1, X2, …, Xk, we can test hypotheses about b1, b2, …, bk
• The null and alternative hypotheses take the usual form:
  • H0: βk = 0
  • H1: βk ≠ 0; or, for one-tailed tests: H1: βk > 0 or H1: βk < 0
• Hypothesis tests are about the slope controlling for the other variables in the model
  • Sometimes people explicitly mention this in their hypotheses
• NOTE: Results with “controls” may differ from bivariate hypothesis tests!

Page 16:

Multiple Regression Hypothesis Tests

• Formula for multivariate hypothesis tests:

  t(N-K-1) = b_k / s(b_k)

• where b_k is a slope and s(b_k) is its standard error
• k indexes the kth independent variable
• K = total number of independent variables
• The t-test degrees of freedom depend on N and the number of independent variables
• Compare the observed t-value to the critical t; or compare p to α.
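Plugging numbers into the formula (using the EDUC coefficient from the SPSS table on page 7; the sample size N here is an assumed value, since the slides do not report it):

```python
# t-test for one slope in multiple regression: t = b_k / s(b_k), df = N - K - 1
b_k = 2.541     # EDUC coefficient from the earlier table
se_k = 0.118    # its standard error
N = 2000        # assumed sample size (not given in the slides)
K = 4           # number of independent variables in that model

t = b_k / se_k      # reproduces the t reported by SPSS (up to rounding)
df = N - K - 1      # degrees of freedom for the test
```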

Page 17:

Multiple Regression Estimation

• Calculating the b’s involves solving a set of equations to minimize squared error
• Analogous to the bivariate case, but the math is more complex
• The optimal estimator has minimum variance and is referred to as “BLUE”:
  • Best Linear Unbiased Estimate
• BLUE multiple regression has more assumptions than bivariate regression.

Page 18:

Multiple Regression Assumptions
• As discussed in Knoke, p. 256
• Note: Allison refers to error (e) as disturbance (U), and uses slightly different language… but the ideas are the same!
• 1a. Linearity: The relationship between the dependent and independent variables is linear
  • Just like bivariate regression
  • Points don’t all have to fall exactly on the line, but the error (disturbance) must not have a pattern
  – Check scatterplots of the X’s against the error (residual)
  • Watch out for non-linear trends: error that is systematically negative (or positive) for certain ranges of X
  • There are strategies to cope with non-linearity, such as including X and X-squared to model a curved relationship.
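A quick sketch of the X-plus-X-squared strategy in Python (simulated data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 300)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.5, 300)   # a curved relationship

# Include both X and X-squared as predictors to model the curvature
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b[2] picks up the curvature; a straight-line fit would instead leave a
# systematic pattern in the residuals
```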

Page 19:

Multiple Regression Assumptions

• 1b. And the model is properly specified:
  – No extra variables are included in the model, and no important variables are omitted. This is HARD!
• Correct model specification is critical
  • If an important variable is left out of the model, results are biased (“omitted variable bias”)
  – Example: If we model job prestige as a function of family wealth but do not include education
  • The coefficient estimate for wealth would be biased
  – Use theory and previous research to decide what critical variables must be included in your model.

Page 20:

Multiple Regression Assumptions

• 2. All variables are measured without error
• Unfortunately, error is common in measures
  – Survey questions can be biased
  – People give erroneous responses (or lie)
  – Aggregate statistics (e.g., GDP) can be inaccurate
• This assumption is often violated to some extent
  – We do the best we can:
  • Design surveys well; use the best available data
• And there are advanced methods for dealing with measurement error.

Page 21:

Multiple Regression Assumptions

• 3. The error term (ei) has certain properties
  • Recall: error is a case’s deviation from the regression line
  – Not the same as measurement error!
• After you run a regression, SPSS can tell you the error value for any or all cases (called the “residual”)
• 3a. Error is conditionally normal
  – For bivariate regression, we checked whether Y was conditionally normal; for multivariate regression, we check whether the error is conditionally normal
• Examine the “residuals” (ei) for normality at different values of the X variables.

Page 22:

Multiple Regression Assumptions

• 3b. The error term (ei) has a mean of 0
  – This affects the estimate of the constant (not a huge problem)
  • This is not a critical assumption to test
• 3c. The error term (ei) is homoskedastic (has constant variance)
  • Note: This affects standard error estimates and hypothesis tests
  – Look at the residuals to see if they spread out with changing values of X
  • Or plot standardized residuals vs. standardized predicted values.

Page 23:

Multiple Regression Assumptions

• 3d. Predictors (Xi’s) are uncorrelated with the error
  – This most often happens when we leave out an important variable that is correlated with another Xi
  – Example: Predicting job prestige with family wealth, but not including education
  – Omitting education will affect the error term: those with lots of education will have large positive errors
  • Since wealth is correlated with education, it will be correlated with that error!
  – Result: The coefficient for family wealth will be biased (vastly overestimated).

Page 24:

Multiple Regression Assumptions

• 4. In systems of equations, error terms of equations are uncorrelated

• Knoke, p. 256

– This is not a concern for us in this class• Worry about that later!

Page 25:

Multiple Regression Assumptions

• 5. The sample is independent; errors are random
  • Technically, part of 3c
  – Not only should errors not increase with X (heteroskedasticity); there should be no pattern at all!
• Things that cause patterns in error (autocorrelation):
  – Measuring data over long periods of time (e.g., every year): error from nearby years may be correlated
  • Called “serial correlation.”

Page 26:

Multiple Regression Assumptions

• More things that cause patterns in error (autocorrelation):
  – Measuring data in families: all members are similar and will have correlated error
  – Measuring data in geographic space
  • Example: Data on the 50 US states; states in the same region have correlated error
  • Called “spatial autocorrelation”
• There are variations of regression models to address each kind of correlated error.

Page 27:

Multiple Regression Assumptions

• Regression assumptions and final projects:
• 1. Check all your assumptions… but present results for only 1 or 2 X variables
• 2. Multivariate assumption checks involve plots of e (“error” or “residual”) to test linearity and heteroskedasticity
  • This contrasts with the bivariate case, where you plotted X vs. Y
  • Don’t forget to focus on “e”!
• 3. Also, check for outliers
  • To be discussed soon!

Page 28:

Regression: Outliers

• Note: Even if the regression assumptions are met, slope estimates can have problems
• Example: Outliers -- cases with extreme values that differ greatly from the rest of your sample
  • More formally: “influential cases”
• Outliers can result from:
  • Errors in coding or data entry
  • Highly unusual cases
  • Or, sometimes, important “real” variation
• Even a few outliers can dramatically change estimates of the slope, especially if N is small.

Page 29:

Regression: Outliers

• Outlier example:

[Scatterplot, X from -4 to 4, Y from -4 to 4: an extreme case pulls the regression line up; a second line shows the regression with the extreme case removed from the sample]

Page 30:

Regression: Outliers

• Strategy for identifying outliers:
• 1. Look at scatterplots or regression partial plots for extreme values
  • Easiest; a minimum for final projects
• 2. Ask SPSS to compute outlier diagnostic statistics
  – Examples: “leverage,” Cook’s D, DFBETA, residuals, standardized residuals.

Page 31:

Regression: Outliers

• SPSS outlier strategy: Go to Regression – Save
  – Choose “influence” and “distance” statistics such as Cook’s Distance, DFFIT, and the standardized residual
  – Result: SPSS will create new variables with the value of Cook’s D, DFFIT, etc. for each case
  – High values signal potential outliers
  – Note: This is less useful if you have a VERY large dataset, because you have to look at each case’s value.

Page 32:

Scatterplots
• Example: Study time and student achievement
  – X variable: average # of hours spent studying per day
  – Y variable: score on a reading test

Case    X     Y
1       2.6   28
2       1.4   13
3       .65   17
4       4.1   31
5       .25   8
6       1.9   16
7       3.5   6

[Scatterplot of the 7 cases: X axis 0–4, Y axis 0–30]

Page 33:

Outliers

• Results with the outlier included:

Model Summary (Dependent Variable: TESTSCOR; Predictors: (Constant), HRSTUDY)
R = .466   R Square = .217   Adjusted R Square = .060   Std. Error of the Estimate = 9.1618

Coefficients (Dependent Variable: TESTSCOR)

             B        Std. Error   Beta    t       Sig.
(Constant)   10.662   6.402                1.665   .157
HRSTUDY      3.081    2.617        .466    1.177   .292

Page 34:

Outlier Diagnostics

• Residuals: The numerical value of the error
  – Error = the distance a point falls from the line
  – Cases with unusually large error may be outliers
  – Note: Residuals have many other uses!
• Standardized residuals
  – The z-score of the residual… converts it to a neutral unit
  – Often, standardized residuals larger than 3 are considered worthy of scrutiny
  • But it isn’t the best outlier diagnostic.

Page 35:

Outlier Diagnostics

• Cook’s D: Identifies cases that are strongly influencing the regression line
  – SPSS calculates a value for each case
  • Go to the “Save” menu; click on Cook’s D
• How large a Cook’s D is a problem?
  – Rule of thumb: values greater than 4 / (n – k – 1)
  – Example: n = 7, k = 1: cut-off = 4/5 = .80
  – Cases with higher values should be examined.
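The rule of thumb can be checked by hand on the seven study-time cases from the slides. This is a plain-numpy sketch of Cook’s D (not SPSS’s implementation):

```python
import numpy as np

# Study-time data from the slides (case 7 is the apparent outlier)
x = np.array([2.60, 1.40, 0.65, 4.10, 0.25, 1.90, 3.50])
y = np.array([28.0, 13.0, 17.0, 31.0, 8.0, 16.0, 6.0])

n, k = len(x), 1                       # k = number of predictors
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# Leverage values: diagonal of the hat matrix H = X (X'X)^-1 X'
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

p = k + 1                              # parameters, including the constant
s2 = resid @ resid / (n - p)           # residual variance
cooks_d = resid**2 * h / (p * s2 * (1 - h) ** 2)

cutoff = 4 / (n - k - 1)               # rule of thumb: 4/5 = .80
```

Only case 7 (3.5 hours, score 6) exceeds the cutoff, matching the diagnostic table that follows.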

Page 36:

Outlier Diagnostics

• Example: Outlier/Influential Case Statistics

Hours Score Resid Std Resid Cook’s D

2.60 28 9.32 1.01 .124

1.40 13 -1.97 -.215 .006

.65 17 4.33 .473 .070

4.10 31 7.70 .841 .640

.25 8 -3.43 -.374 .082

1.90 16 -.515 -.056 .0003

3.50 6 -15.4 -1.68 .941

Page 37:

Outliers

• Results with the outlier removed:

Model Summary (Dependent Variable: TESTSCOR; Predictors: (Constant), HRSTUDY)
R = .903   R Square = .816   Adjusted R Square = .770   Std. Error of the Estimate = 4.2587

Coefficients (Dependent Variable: TESTSCOR)

             B       Std. Error   Beta    t       Sig.
(Constant)   8.428   3.019                2.791   .049
HRSTUDY      5.728   1.359        .903    4.215   .014

Page 38:

Regression: Outliers

• Question: What should you do if you find outliers? Drop outlier cases from the analysis, or leave them in?
  – Obviously, you should drop cases that are incorrectly coded or erroneous
  – But, generally speaking, you should be cautious about throwing out cases
  • If you throw out enough cases, you can produce any result you want! So be judicious when destroying data.

Page 39:

Regression: Outliers

• Circumstances where it can be good to drop outlier cases:
• 1. Coding errors
• 2. Single extreme outliers that radically change results
  • Your results should reflect the dataset, not one case!
• 3. If there is a theoretical reason to drop cases
  – Example: In an analysis of economic activity, communist countries may be outliers
  • If the study is about “capitalism,” they should be dropped.

Page 40:

Regression: Outliers

• Circumstances when it is good to keep outliers:
• 1. If they form a meaningful cluster
  – This often suggests an important subgroup in your data
  • Example: Asian-Americans in a dataset on education
  • In such a case, consider adding a dummy variable for them
  – Unless, of course, the research design is not interested in that subgroup… then drop them!
• 2. If there are many
  – Maybe they reflect a “real” pattern in your data.

Page 41:

Regression: Outliers

• When in doubt: Present results both with and without outliers
• Or present one set of results, but mention how results differ depending on how outliers were handled
• For final projects: Check for outliers!
  • At least with scatterplots
  • But a better strategy is to use partial plots and Cook’s D (or similar statistics)
  – In the text: Mention whether there were outliers, how you handled them, and the effect it had on results.

Page 44:

Extra Slides

Page 45:

Review

• Types of regression variables and the interpretation of their coefficients:
• 1. Normal variable coefficient: Reflects the slope of the line relating one variable to the dependent variable
  • The effect of a 1-point change in X on Y
• 2. Dummy variable: Reflects the difference in the constant for a group compared to the omitted group
  • Here, the effect is the difference in the constant (level) of Y for different groups.

Page 46:

Review

• 3. Interaction term, dummy * continuous: Indicates differences in slope for different groups
  • Example: DFEMALE*Education affecting income
  – The coefficient indicates the difference in slope for the dummy group compared to the slope of the reference group
• 4. Interaction term, dummy * dummy: Indicates differences in the constant
  • Example: DFEMALE*DBLACK
  – The coefficient indicates the difference in the constant for black females, compared to black males (and white females).

Page 47:

Review

• 5. Interaction term, continuous * continuous: Indicates differences in slope across values of the other variable
  • Example: ParentsWealth*Education affecting income
  – The coefficient indicates the change in slope for each unit change in the other continuous variable.

Page 48:

Log Transformations

• 1. Linearity and log transformations: When should you log your variables?
• There are two common reasons:
  – 1. To reduce extreme skewness (which often leads to non-linearity)
  – 2. For variables where the social meaning is clearly non-linear.

Page 49:

Log Transformations

• Example: Country GDP per capita
  – Highly skewed
  – Also, a shift from $1,000 to $2,000 is much more socially significant than a shift from $30,000 to $31,000
• Other example: wages (interval)
• Log transformations should be used judiciously
  • Don’t log all variables just to achieve a modest improvement in linearity.
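The intuition is easy to verify: on a log scale, equal proportional changes are equal distances, so the same $1,000 counts for far more at the bottom of the distribution:

```python
import math

# Doubling from $1,000 to $2,000 vs. adding $1,000 at $30,000
jump_poor = math.log(2000) - math.log(1000)     # = log(2), a doubling
jump_rich = math.log(31000) - math.log(30000)   # a ~3.3% increase

# After logging, the first shift is roughly 20 times larger than the
# second, matching its greater social significance
```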

Page 50:

Multiple Regression Problems

• Another common regression problem: Multicollinearity
• Definition: collinear = highly correlated
  – Multicollinearity = the inclusion of highly correlated independent variables in a single regression model
• Recall: High correlation among the X variables causes problems for estimating the slopes (b’s)
  – Recall: Denominators approach zero, and coefficients may be wrong or far too large.

Page 51:

Multiple Regression Problems

• Multicollinearity symptoms:
  – Adding a new variable to the model causes other variables to change wildly
  • Note: Occasionally a major change is expected (e.g., if a key variable is added, or for continuous interaction terms)
  – A variable typically has a small effect, but when paired with another variable, BOTH have big effects in opposite directions.

Page 52:

Multiple Regression Problems

• Diagnosing multicollinearity:
• 1. Look at the correlations of all independent variables
  – Watch out for variables with correlations above .7
  – Correlations over .9 are really bad
• 2. Use advanced tools:
  – Tolerances, VIF (Variance Inflation Factor)
• 3. Watch out for the symptoms mentioned previously.
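Tolerance and VIF can be sketched directly (simulated data for illustration; the tolerance for predictor j is 1 − R² from regressing Xj on the other predictors, and VIF = 1/tolerance):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a duplicate of x1
x3 = rng.normal(size=n)                   # unrelated predictor

def vif(target, others):
    """VIF = 1 / (1 - R^2) from regressing one predictor on the rest."""
    X = np.column_stack([np.ones(n)] + list(others))
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ b
    r2 = 1 - resid @ resid / ((target - target.mean()) ** 2).sum()
    return 1 / (1 - r2)

# x1 and x2 are collinear (correlation well above .9); x3 is not
vif_x1 = vif(x1, [x2, x3])
vif_x3 = vif(x3, [x1, x2])
```

A common heuristic treats VIF above roughly 10 as serious; here the collinear predictor’s VIF blows up while the unrelated predictor stays near 1.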

Page 53:

Multiple Regression Problems

• Solutions to multcollinearity– It can be difficult if a fully specified model requires

several collinear variables

• 1. Drop unnecessary variables

• 2. If two collinear variables are really measuring the same thing, drop one or make an index– Example: Attitudes toward recycling; attitude toward

pollution. Perhaps they reflect “environmental views”

• 3. Advanced techniques: e.g., Quantile regression, Ridge regression.

Page 54:

Entering Variables Into Regressions

• Question: For final papers, how should you enter variables into a regression?
• Forward, backward, stepwise, or all at once?
  – I recommend entering variables all at once, rather than using an automated procedure
  • Automated procedures are more useful for advanced models
  – It is often interesting to present more than one model
  • Example: Show how coefficients change with the addition of new variables.