Handout 16: Introduction to Multiple Linear Regression


Psychology 252 Ewart Thomas


Handout 7: Introduction to Multiple Regression

Readings: Howell, Chapter 15; Green & Salkind, Lessons 31, 33

0. Introduction. In simple regression, we use a quantitative independent variable, X, to predict, or to explain the variance in, a quantitative dependent variable, Y. For example, we might use "the number of stressful events in a 6-month period" (X) to predict a teenager's level of "depression" (Y). The least squares regression equation, ŷ = b0 + b1x, where ŷ is the expected value of Y given X, E(Y | X), is very useful in predicting Y when we know a person's value of X. The correlation, r, between X and Y (|r| is equal to the correlation between ŷ and Y, because ŷ is a linear function of X) is an index of the strength of the linear relationship, and r2, the percentage of variance in Y that is explained by X, is useful as an index of the quality or "goodness" of our explanation. If r2 is "low" (e.g., 10%, but this depends on the research area), we might want to augment our explanation by including more than 1 predictor or explanatory variable. None of the foregoing requires any assumptions about the distribution of Y, given X. We need a statistical model only when we want to make inferences about the effect of X on Y. For example, to test if this effect is statistically significant, we would assume the LINE model and then use a t-test or F-test based on b1 or on r.

In multiple regression, we use 2 or more quantitative independent variables, X1, X2, …, to predict, or to explain the variance in, a quantitative dependent variable, Y. For example, we might use "the number of stressful events in a 6-month period" (X1) and "coping skill" (X2) to predict a teenager's level of "depression" (Y). The least squares regression equation, ŷ = b0 + b1x1 + b2x2, is very useful in predicting Y when we know a person's values of (X1, X2). The correlation, R, between ŷ and Y is an index of the goodness of the prediction equation, but the preferred index is R2, which has the precise interpretation of the percentage of variance in Y that is explained by (X1, X2). If R2 is "low", we might want to augment our explanation by including even more predictors or explanatory variables, such as X3 = X1^2 (i.e., a quadratic effect of "number of stressful events"), or X4 = X1*X2 (i.e., the interaction between "number of stressful events" and "coping skill"). This flexibility is a major reason why multiple regression is such a powerful tool. To make inferences about the effects of the Xi's on Y, we would assume the LINE model and then use a t-test or F-test based on bi (to assess the effects of the i'th predictor) and R2 (to assess the overall fit of the regression model).

1. Examples of uses of multiple regression in prediction versus model or theory construction

In many practical situations (e.g., the setting of policy), we would like to predict, as accurately as possible, some dependent variable, Y, from the known values of other (independent or predictor) variables, X1, X2, ...; our interest is in maximising predictive accuracy rather than in understanding the theoretical mechanism that determines the dependent variable, Y. In other, research contexts (e.g., psychological research), we want to discover those variables, and only those variables, that appear to determine some dependent variable of interest, Y. Multiple regression is useful in both types of situations, although it is only in the second type (i.e., in model or theory construction) that we make statistical inferences about the effects of the independent variables. Therefore, it is only when we want to do more than predict (and for the researcher, this may be all of the time!) that we need to rely on the full distributional model (e.g., the LINE model) of regression. Some examples:


1.1. Y is the first-year college GPA (CGPA) of a student, and the Xi’s are high school GPA (HSGPA), SAT scores, and the rating (e.g., on a 1-5 scale) of the letters of recommendation for the student. Note that Y is quantitative, and all the Xi’s are also quantitative. If prediction is our only interest, we should use all the available information about the Xi’s and the Y’s from a random sample of previous college students to develop our multiple regression (i.e., prediction) equation. Armed with this equation, we could go to the folder of a college applicant, extract the Xi’s from the folder, enter the Xi’s into the prediction equation to get that applicant’s predicted college GPA, and then use this predicted GPA as one of the indices for determining whether or not to admit the applicant to college. In this frame of mind, we tend to have relatively little interest in whether the assumptions of the LINE model are satisfied. Suppose, now, that we want to discover the most parsimonious model of the determinants of 1st year college GPA (CGPA). For example, we might believe that only high school GPA (HSGPA) determines CGPA, but we need to answer a critic who says that SAT also determines CGPA. In that case, we would run a multiple regression of CGPA on HSGPA and SAT, and then test whether the coefficient of SAT is significantly different from 0. At the point of doing such a significance test, we would be relying on the assumptions of the LINE (or other) model. As researchers, we are almost always in this position, and this is why we need to understand the underlying statistical model of regression. 1.2. In a survey of people’s reaction to a highly publicised gift, Y is the perceived generosity of a donor, and the Xi’s are the perceived income of the donor and perceived size of the gift, whether the donor is from “old money” (e.g., is an heir to Standard Oil) or from “new money” (e.g., is the CEO of a Silicon Valley firm), etc. Note that Y is quantitative, and the Xi’s are quantitative or qualitative. 1.3. In an experiment of people’s reaction to scenarios about hypothetical donors, Y is the perceived generosity of the hypothetical donor, X1 is the stated income of the donor (e.g., with 3 levels), X2 is the stated size of the gift (e.g., with 4 levels), and X3 is whether the donor is from “old money” (e.g., is an heir to Standard Oil) or from “new money” (e.g., is the CEO of a Silicon Valley firm), etc. Note that Y is quantitative, and all the Xi’s are categorical (or qualitative). By using a ‘trick’ called dummy coding, we can convert a categorical factor with k levels into a set of k-1 dichotomous (e.g., 0/1) variables, the so-called ‘dummy variables’. Because a 0/1 variable behaves just like a quantitative variable, dummy coding allows us to apply standard multiple regression techniques (as in Ex. 1.1) even when the original independent variables are categorical. In essence, we are using the multiple regression procedure to do an ANOVA, and it is important to note that we can use regression to do a factorial ANOVA, even when the cell sizes are unequal. For this reason, regression (and the related General Linear Model) is a more general procedure than ANOVA. We need to spend some time learning the different ways to dummy code a categorical factor so that this factor can be entered into a regression analysis. 1.4. 
In a memory experiment, Y is the memory score, X1 is the amount of study time (e.g., with 3 levels), X2 is the time between study and test (e.g., with 4 levels), and X3 is the verbal ability of the participant. Note that Y is quantitative, the main independent factors are X1 and X2, which are categorical, and X3 is a quantitative 'nuisance' variable, called a covariate. If we ignore X3, and if we have equal n's in each cell, we could do a 2-way factorial ANOVA on these data. If we want to take account of X3, we could use GLM or regression, and include X3 as a covariate. Such an analysis is called an Analysis of Covariance (ANCOVA). We should note that, by using the Custom Model option in SPSS' GLM, we can test for interaction terms between a categorical factor and a covariate. If we use the regression procedure, we would have to define the interaction terms as additional columns in the SPSS data sheet, and then do the regression. 1.5. Y denotes whether a person was admitted (Y = 1) or not admitted (Y = 0), or passed/failed, or smiled/not smiled, or ..., i.e., Y is dichotomous; and the Xi's are quantitative measures of the ability or attitudes of the person, or categorical indices of the context.


Note that Y is categorical, and the Xi's may be quantitative or qualitative. Because Y is categorical, it cannot be Normally distributed; so the LINE model is violated and we should not use standard multiple regression methods. A good alternative method for the analysis of categorical dependent variables is logistic regression. (Even in this case, however, the use of the standard regression model often yields results that are comparable to those yielded by logistic regression – it's just that it is hard to defend the standard model.)

2. SPSS regression procedures

2.1. Analyze > Regression > Linear; define the Dependent and the Independent variables. Use this procedure if (i) Y is quantitative, (ii) all the Xi's are quantitative (possibly including dummy variables), and (iii) each interaction between the original independent variables is entered as a separate predictor variable (usually defined as the product of the original variables, e.g., as X3 = X1*X2, or as the product of the deviations of the original variables, e.g., as X3 = (X1 − X̄1)*(X2 − X̄2)). I like to use this procedure because it facilitates exploring how well different sets of predictors explain the variance in Y. However, the 'cost' of using this procedure is that you have to dummy code every categorical predictor, and define each interaction or non-linear term 'by hand.'

2.2. Analyze > General Linear Model > Univariate; define the Dependent variable, enter the categorical factors in the Fixed and Random boxes, as appropriate; and enter the quantitative variables in the Covariate(s) box. Use this procedure if (i) Y is quantitative, (ii) the Xi's are quantitative or categorical, and (iii) you are not interested in obtaining the regression coefficients of the Xi's. I find this last feature, (iii), a drawback in some situations, and I would use the regression procedure (2.1) in its stead. The advantage of GLM is that SPSS does the dummy coding of the categorical factors for you, and automatically includes the interactions among the categorical factors.

2.3. Analyze > Regression > Binary Logistic; define the Dependent variable, enter the categorical factors or quantitative variables in the Covariate(s) box. Use this procedure if (i) Y is dichotomous (as in Ex. 1.5 above), and (ii) the Xi's are quantitative or categorical. To test for the interaction between 2 covariates (in addition to their main effects), highlight the 2 variables, then click on the '>a*b>' button to add the interaction to the list of 'Covariates'. We need to spend some time learning how to interpret the SPSS output from this procedure.

3. An SPSS analysis of GPA prediction

3.1. The Model. The multiple regression approach extends the logic from 1 to k predictors:

ŷ = b0 + Σi=1,…,k bi xi .

In model testing, we often start with 1 predictor, then add another predictor and ask whether this 2nd predictor is significant; then add a 3rd predictor, etc. The sequence of equations might be:

ŷ = b0 + b1x1 ,
ŷ = b0 + b1x1 + b2x2 , and
ŷ = b0 + b1x1 + b2x2 + b3x3 .

Imagine we have the following data: for each subject we have college GPA, SAT (converted to a percentage), a rating from 1 to 5 for letters of recommendation, and also high school GPA. Imagine also that we are an admissions office trying to come up with the best predictor of college GPA.


We can start with a simple bivariate regression using SAT (Analyze > Regression > Linear):

Model Summary (Predictors: (Constant), SAT)

  R = .692    R Square = .479    Adjusted R Square = .450    Std. Error of the Estimate = .3052

Coefficients (Dependent Variable: GPA_C)

  Predictor     B       Std. Error   Beta    t       Sig.
  (Constant)    1.571   .447                 3.518   .002
  SAT           .0223   .005         .692    4.064   .001

Alternatively, we might include not only the SAT, but also the recommendation letter ratings. We use the same SPSS command and it yields:

Model Summary (Predictors: (Constant), RECO, SAT)

  R = .804    R Square = .646    Adjusted R Square = .604    Std. Error of the Estimate = .2589

The data (N = 20):

  ID   GPA (College)   SAT (%)   Reco   GPA (High)
   1   3.14             75.40    3      3.46
   2   3.49             74.84    4      3.03
   3   3.39             96.52    3      3.21
   4   3.36             77.88    4      2.80
   5   2.76             66.04    3      1.68
   6   2.79             80.67    2      2.77
   7   3.37             78.81    4      2.28
   8   4.15             96.50    4      3.15
   9   3.43             89.16    3      3.68
  10   3.37             71.15    4      3.46
  11   3.36             80.70    2      3.01
  12   3.04             83.44    1      3.38
  13   4.07             83.85    5      4.15
  14   3.15             72.82    5      3.07
  15   3.32             71.91    4      3.47
  16   2.79             63.72    3      2.00
  17   2.89             57.98    2      2.09
  18   3.91            105.20    3      3.54
  19   3.79            104.60    5      3.88
  20   3.73             77.38    3      3.54

Coefficients (Dependent Variable: GPA_C)

  Predictor     B       Std. Error   Beta    t       Sig.
  (Constant)    1.227   .398                 3.086   .007
  SAT           .0201   .005         .622    4.247   .001
  RECO          .157    .055         .415    2.831   .012

Alternatively, we could have predicted college GPA from SATs and high school GPA:

Model Summary (Predictors: (Constant), GPA_H, SAT)

  R = .772    R Square = .596    Adjusted R Square = .548    Std. Error of the Estimate = .2765

Coefficients (Dependent Variable: GPA_C)

  Predictor     B       Std. Error   Beta    t       Sig.
  (Constant)    1.412   .411                 3.438   .003
  SAT           .0138   .006         .429    2.204   .042
  GPA_H         .273    .123         .432    2.220   .040
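For readers who want to check these numbers outside SPSS, here is a minimal sketch in Python/NumPy that refits the three models by least squares on the 20-case data set listed above. The array names (gpa_c, sat, reco, gpa_h) and the helper function ols are my own; only the data values come from the handout, and the printed coefficients and R Square values should agree with the SPSS output above up to rounding.

# Sketch: reproducing the three regressions above with NumPy least squares.
import numpy as np

gpa_c = np.array([3.14, 3.49, 3.39, 3.36, 2.76, 2.79, 3.37, 4.15, 3.43, 3.37,
                  3.36, 3.04, 4.07, 3.15, 3.32, 2.79, 2.89, 3.91, 3.79, 3.73])
sat   = np.array([75.40, 74.84, 96.52, 77.88, 66.04, 80.67, 78.81, 96.50, 89.16, 71.15,
                  80.70, 83.44, 83.85, 72.82, 71.91, 63.72, 57.98, 105.20, 104.60, 77.38])
reco  = np.array([3, 4, 3, 4, 3, 2, 4, 4, 3, 4, 2, 1, 5, 5, 4, 3, 2, 3, 5, 3], dtype=float)
gpa_h = np.array([3.46, 3.03, 3.21, 2.80, 1.68, 2.77, 2.28, 3.15, 3.68, 3.46,
                  3.01, 3.38, 4.15, 3.07, 3.47, 2.00, 2.09, 3.54, 3.88, 3.54])

def ols(y, *predictors):
    """Return (intercept and slopes, R^2) from a least squares fit of y on the predictors."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ b
    r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    return b, r2

for name, preds in [("SAT only", (sat,)),
                    ("SAT + RECO", (sat, reco)),
                    ("SAT + GPA_H", (sat, gpa_h))]:
    b, r2 = ols(gpa_c, *preds)
    print(name, "coefficients:", np.round(b, 4), "R^2 =", round(r2, 3))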


3.2. Note on Multicollinearity. In the 1st regression, the coefficient of SAT is 0.022 (p = 0.001). When ‘letters of recommendation’ (Reco) is added, the SAT coefficient changes very slightly to 0.020 (p = 0.001), and Reco is significant (p = 0.012). However, when HSGPA (GPA_H) is added to the 1st regression, the SAT coefficient changes a lot to 0.014 (p = 0.042), which is just significant, and HSGPA is also just significant (p = 0.040). This shows how the size of a given coefficient (e.g., that of SAT) depends on what other variables are included in the regression. One can try to understand this dependency in specific cases (as opposed to “in general”) by inspecting the correlation matrix.

Correlations (N = 20; Pearson correlations, with two-tailed Sig. in parentheses)

            GPA_C           SAT             RECO            GPA_H
  GPA_C     1.000           .692** (.001)   .519*  (.019)   .693** (.001)
  SAT       .692** (.001)   1.000           .168   (.479)   .609** (.004)
  RECO      .519*  (.019)   .168   (.479)   1.000           .310   (.183)
  GPA_H     .693** (.001)   .609** (.004)   .310   (.183)   1.000

  ** Correlation is significant at the 0.01 level (2-tailed).
  *  Correlation is significant at the 0.05 level (2-tailed).

First, we can see that corr(CGPA, Reco) is 0.52 (significant), suggesting that Reco would be useful in predicting CGPA. However, corr(SAT, Reco) is 0.17 (not significant), suggesting that Reco and SAT have little "information" in common, and the "information" Reco brings to the prediction of CGPA is different from the "information" that SAT brings. This is why adding Reco to the regression of CGPA on SAT does not much affect the coefficient of SAT. Second, we see that corr(CGPA, SAT) and corr(CGPA, HSGPA) are both 0.69 (highly significant), but that corr(SAT, HSGPA) is 0.61 (also highly significant). This last correlation suggests that there is much overlap in the "information" that SAT and HSGPA bring to the prediction of CGPA. In a sense, the predictive power of the pair (SAT, HSGPA) is divided equally between SAT and HSGPA (the 2 p-values are both around 0.04). In general, when two predictors are highly correlated, the regression coefficients are very unstable, and inference based on them is unreliable. This problem of correlated predictors is called multicollinearity.

4. Interpretation of regression coefficients. How do we interpret the bi (SPSS B) coefficients in multiple regression? Each bi is the slope associated with xi when all other predictors are held constant. For this reason, these coefficients are sometimes called "partial regression coefficients." One problem with the phrase, "other predictors are held constant", is that, in a typical survey, the predictors tend to be correlated with each other. In that case, two people who differ by 1 unit in X1 will, in general, have different values of X2, etc., because of the correlations among the predictor variables. Therefore, what the phrase really envisages is (a) a selection of people with the same values on X2, X3, etc., and (b) the measurement, within this set of people, of the change in Y associated with a unit change in X1. We can say that we are statistically controlling the values of X2, X3, etc. In contrast, in the typical factorial experiment, this control of the levels of the factors (or variables) is part and parcel of the design of the experiment. In this case, we are exercising experimental control over the factor levels, and the factors are uncorrelated because subjects are randomly assigned to levels. Hence, in the typical experiment, it is transparent that we can vary one factor while holding others constant.
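The dependence of a partial coefficient on which other predictors are in the model (elaborated in the next paragraphs) can be seen in a small simulation. Everything in this sketch (the data, names, and coefficient values) is synthetic and illustrative, not taken from the handout's data sets.

# Sketch: how a slope changes when a correlated predictor is added to the model.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)        # built to correlate roughly 0.8 with x1
y = 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)

def fit(y, *xs):
    """Least squares intercept and slopes of y on the given predictors."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("corr(x1, x2) =", round(np.corrcoef(x1, x2)[0, 1], 2))
print("y on x1 alone:", np.round(fit(y, x1), 2))       # x1's slope absorbs part of x2's effect
print("y on x1 and x2:", np.round(fit(y, x1, x2), 2))  # x1's slope drops toward its partial value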


Another reason for being cautious when interpreting the regression coefficients is that each coefficient depends critically on what other predictors are included in the model. For instance, b1 in the model,

Y = b0 + b1X1 + ε,

need bear little relation to b1 in the model with another predictor X2 included,

Y = b0 + b1X1 + b2X2 + ε. The reason for this is that b1 can be thought of as the slope determined from a plot of Y versus X1•2 (the residuals of X1 regressed on X2). The value of b1 can vary widely, depending on how correlated X1 and X2 are. It can even change sign when different predictors are added to or removed from the model. However, if one is reasonably confident that all the relevant predictors are included in the equation, then the interpretation of the b’s is less problematic. We can test the significance of a bi (i.e. that the slope is not 0) by dividing it by its standard error (SPSS gives you the s.e. and the t for each predictor) with N-p-1 degrees of freedom, where N is the number of subjects and p is the number of predictors in the model. With 2 predictors, df = N-3. 4.1. Standardized regression coefficients (or betas). The standardized regression coefficients are the b’s we would obtain if all the variables were standardized. Note that in simple regression,

ẑy = β zx , where β = r. However, in multiple regression the standardized regression coefficients no longer correspond directly to the correlation. The relation between the raw coefficients, bi, and the standardized coefficients, βi, is

βi = bi (si / sy) ,

where si is the sd of the predictor and sy is the sd of Y. It is common to use the βi as measures of the “importance” of the i’th predictor, but even this practice should be followed with caution, because these coefficients depend in complicated ways on the relevant correlations and variances. 5. Residual variance. As before, the residual variance is the variance in Y, given X; let us denote this variance by σ2. The estimate of σ2 is:

MSresid ≡ MSerror ≡ MSE = SSE / (N − p − 1) = Σ (Y − ŷ)² / (N − p − 1) .

The parallel with ANOVA is intentional. Indeed, SPSS gives an ANOVA table to summarise the regression output:

  Source        SS                df          MS
  Regression    Σi (ŷi − ȳ)²      p           SSReg / p
  Error         Σi (yi − ŷi)²     N − p − 1   SSE / (N − p − 1)
  Total         Σi (yi − ȳ)²      N − 1       --
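As a check on these definitions, the following sketch computes the decomposition, MSE, R², the overall F, and (anticipating Section 5.3) the adjusted R² by hand. The data are synthetic and all names are my own.

# Sketch: the regression ANOVA decomposition computed directly, using the formulas of Secs. 5-5.3.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, p = 50, 2
X = rng.normal(size=(N, p))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=N)

Xd = np.column_stack([np.ones(N), X])            # design matrix with intercept
b = np.linalg.lstsq(Xd, y, rcond=None)[0]
yhat = Xd @ b

ss_total = np.sum((y - y.mean()) ** 2)           # SST, df = N - 1
ss_reg   = np.sum((yhat - y.mean()) ** 2)        # SSReg, df = p
ss_error = np.sum((y - yhat) ** 2)               # SSE, df = N - p - 1

mse    = ss_error / (N - p - 1)                  # estimate of sigma^2
r2     = ss_reg / ss_total
adj_r2 = 1 - (1 - r2) * (N - 1) / (N - p - 1)    # adjusted R^2 (Sec. 5.3)
F = (r2 / p) / ((1 - r2) / (N - p - 1))
p_value = stats.f.sf(F, p, N - p - 1)

print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}, adjR^2 = {adj_r2:.3f}, F({p},{N-p-1}) = {F:.2f}, p = {p_value:.4f}")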


5.1. The "dot" notation. We sometimes write MSE as s²y•(x1 … xp), to be read as "the variance of Y controlling for X1, …, Xp." The dot in a subscript means "controlling for." Also, X1•2 refers to what is left in the variable X1 after we have controlled for (or taken out all that is related to) X2. To obtain X1•2, we would regress X1 on X2 and store the residuals as X1•2.

5.2. Multiple correlation coefficient, R. R represents how well the model fits the data. In fact,

R = rYŷ . Most often we will look at R2, because it is easy to interpret: it is the percentage of variance accounted for. You can test the significance of R2:

F = (R² / p) / [(1 − R²) / (N − p − 1)] , with (p, N − p − 1) degrees of freedom.

A significant F means the whole model accounts for a portion of variance that is significantly greater than 0. Note also the following analysis of R2 that expresses R2 as a weighted sum of the individual correlations, rYXi:

R² = Σi βi rYXi .

5.3. The adjusted R2. The value of R2 that we observe in a study is an estimate of the population multiple correlation, Ρ2. It can be shown that R2 is a biased estimate of Ρ2; it is larger than it should be. Another bias in R2 is that it necessarily increases as the number of predictors increases (but I find this a reassuring property, not a drawback). To correct these biases, we define the adjusted R2, which is an unbiased estimate of Ρ2:

adjR² = 1 − (1 − R²)(N − 1) / (N − p − 1) .

From this definition, we can see that

1 − adjR² = (1 − R²)(N − 1) / (N − p − 1)  >  1 − R² , implying adjR² < R² .

For a given data set (i.e., N is fixed), the size of this adjustment depends on the number of predictors. In deciding which of 2 sets of predictors is better at predicting Y, we should look at both R2 and adjR2. 6. Testing full vs. reduced models We can compare the fits of two sets of predictors of Y by comparing the associated R2 or adjR2; the better set is the one with the higher R2. However, there is no satisfactory method, in general, for saying that one set of predictors is significantly better than another set. The one case in which we can make such a comparison is that in which one set of predictors (the “full” model) contains the other set of predictors (the “reduced” model). Starting with the full model, if our null hypothesis is that certain of the coefficients are equal to 0, then the reduced model is the model obtained from the full model when the null hypothesis is true. Some examples:


  Null Hypothesis       Full Model                                Reduced Model
  H0: b1 = 0            Y = b0 + b1X1 + b2X2 + ε                  Y = b0 + b2X2 + ε
  H0: b1 = b3 = 0       Y = b0 + b1X1 + b2X2 + b3X3 + ε           Y = b0 + b2X2 + ε
  H0: b1 = 0            Y = b0 + b1X1 + ε                         Y = b0 + ε

The logic of the test procedure is that if the null hypothesis is true, then there should be little difference between the "predictive power" of the full and reduced models. If, however, the null hypothesis is false, then the full model should explain substantially more variability in Y than does the reduced model. The test procedure formally specifies what constitutes "substantial" improvement in prediction. First, we should note that regardless of the number of predictors being used, the total sum of squares of Y (i.e., SST) is the same for the full and reduced models:

SST = SSReg + SSE . The more predictors in a model, the greater SSReg and the smaller SSE. The basis of the "extra sums of squares" test (also known as an "increment to R2" or ΔR2 test) is the difference between SSE (or SSReg) for the full and reduced models. If SSE is much smaller for the full model than for the reduced model, the null hypothesis is implausible. If SSEfull and SSEred are approximately equal, then the null hypothesis is very plausible. The test statistic is as follows:

F = [(SSEred − SSEfull) / (dfred − dffull)] / [SSEfull / dffull]
  = [(R²full − R²red) / (dfred − dffull)] / [(1 − R²full) / dffull]
  = (ΔR² / Δdf) / [(1 − R²full) / dffull] ,

with (Δdf, dffull) degrees of freedom, where df denotes the error degrees of freedom of a model and Δdf = dfred − dffull.

7. Partial correlation. Suppose we regress Y on the predictors, X2 and X3, and save the residuals as Y•23. We can regard these residuals as what is left of Y after the effects of X2 and X3 have been partialled out of (or removed from) Y. Next, suppose we regress the predictor, X1, on the predictors, X2 and X3, and save the residuals as X1•23. We can regard these residuals as what is left of X1 after the effects of X2 and X3 have been partialled out of (or removed from) X1. (As an example, refer to the data analysed in Sec. 3.1 above.) The correlation between Y•23 and X1•23 is called the partial correlation between X1 and Y, after partialling out X2 and X3. It measures the “overlap” between X1 and Y that remains after X2 and X3 have been partialled out. This statistic is very important in testing causal models of the interrelationships among Y, X1, X2 and X3. For example, if this partial correlation is 0, this would suggest that any simple correlation between X1 and Y is due to the joint dependence of X1 and Y on X2 and X3. In the simple case of 2 predictors, X1 and X2, the partial correlation between X1 and Y controlling for X2 is denoted by ry1•2. The formula for ry1•2 is:

ry1•2 = (ry1 − ry2 r12) / √[(1 − ry2²)(1 − r12²)] .

It can be shown that

r²y1•2 = SSReg(X1 | X2) / SSE(X2) .

This last expression shows that the square of the partial correlation is the proportion of variability unexplained by X2 that is explained when X1 is added to the model. The numerator is the amount of variability explained by X1 beyond that explained by X2. The denominator is the leftover variability unexplained by X2. So the squared partial correlation is the proportion of previously unexplained variance that is explained when a predictor is added to the model.

7.1. Inference for simple and partial correlations. An additional test for the significance of r is the Fisher-Z test. This test is the preferred method for testing the difference between two independent r's. Fisher's r-to-Z transformation is:

Zr = (0.5) loge[(1 + r) / (1 − r)] ,

where Z has the standard error

sZ = 1 / √(N − 3) .

A 95% CI for E(Z) would be approximately Z ± 2 sZ ≡ (z1, z2). To derive a 95% CI for E(r) ≡ ρ, we would use the inverse, Z-to-r transformation to get:

ρ1 = (e^(2 z1) − 1) / (e^(2 z1) + 1)  and  ρ2 = (e^(2 z2) − 1) / (e^(2 z2) + 1) .

To test the difference between two independent r’s, r1 and r2, we would convert each r into Fisher’s-Z, and test the difference between the Z’s via:

z = (Z1 − Z2) / √[ 1/(N1 − 3) + 1/(N2 − 3) ] ,

using the standard Normal. To construct similar tests for partial correlations, we may use the Fisher Z-transform with one amendment. The variance of the transformed correlation, Z, is not 1/(N – 3); rather, it is:

var(Zy1•23…) = 1 / (N − 3 − C) ,

where C = the number of variables being controlled for, or partialled out. For example, for ry1•2, only 1 variable is partialled out (C = 1), so the variance of Z would be 1/(N − 4), and for ry3•12, the variance of Z would be 1/(N − 5).
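To make Sections 7 and 7.1 concrete, here is a small sketch that computes a partial correlation both from residuals and from the formula above, and then builds a Fisher-Z confidence interval using var(Z) = 1/(N − 3 − C). The data are simulated and all names are my own.

# Sketch: partial correlation r_{y1.2} two ways, plus a Fisher-Z interval for it.
import numpy as np

rng = np.random.default_rng(3)
N = 80
x2 = rng.normal(size=N)
x1 = 0.6 * x2 + rng.normal(size=N)               # x1 and x2 are correlated
y  = 0.5 * x1 + 0.5 * x2 + rng.normal(size=N)

def residuals(a, b):
    """Residuals of a regressed on b (with intercept)."""
    X = np.column_stack([np.ones(len(b)), b])
    coef = np.linalg.lstsq(X, a, rcond=None)[0]
    return a - X @ coef

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Method 1: correlate the residuals Y.2 and X1.2
r_resid = corr(residuals(y, x2), residuals(x1, x2))

# Method 2: r_{y1.2} = (r_y1 - r_y2 r_12) / sqrt((1 - r_y2^2)(1 - r_12^2))
ry1, ry2, r12 = corr(y, x1), corr(y, x2), corr(x1, x2)
r_formula = (ry1 - ry2 * r12) / np.sqrt((1 - ry2**2) * (1 - r12**2))
print(f"partial r_y1.2: residual method = {r_resid:.3f}, formula = {r_formula:.3f}")

# Fisher-Z 95% CI for the partial correlation: C = 1 variable partialled out
C = 1
Z = 0.5 * np.log((1 + r_formula) / (1 - r_formula))
sZ = 1 / np.sqrt(N - 3 - C)
ci = [(np.exp(2 * z) - 1) / (np.exp(2 * z) + 1) for z in (Z - 1.96 * sZ, Z + 1.96 * sZ)]
print("95% CI for the partial correlation:", np.round(ci, 3))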

7.2. Venn diagram for partitioning variance in the DV, X0

[Venn diagram: three overlapping circles representing the variances of X0 (the DV), X1, and X2; the regions of overlap are labelled a through g.]


  Area        %SSY            Interpretation
  a+b+c+d     100%            Total variance of X0 to be explained
  b+c+d       R²0•12          Variance explained by regression
  a           1 − R²0•12      Residual variance
  b+c         r²01            Squared bivariate correlation
  c+d         r²02            Squared bivariate correlation
  b           r²0(1•2)        Squared semipartial correlation
  d           r²0(2•1)        Squared semipartial correlation
  b/(a+b)     r²01•2          Squared partial correlation
  d/(a+d)     r²02•1          Squared partial correlation

The Venn diagram above represents the variance of X0 and of two predictors X1 and X2. Overlap between the circles represents shared variance. [Note that diagrams, such as this one, are used to illustrate concepts, not to replace algebraic derivations. Indeed, diagrams can be misleading in certain cases, as when some of the correlations are negative.] The diagram shows that the increase in R2 when a new predictor is entered in the model is the squared semi-partial correlation for that variable.

8. Multicollinearity. As shown in Example 3.1, multicollinearity, i.e., the presence of high correlations among predictor variables, causes the regression coefficients to become unreliable, and the interpretation of these coefficients becomes problematic. In particular, it is hard to attribute a change in Y to one predictor rather than to another predictor that is correlated with the first predictor. One may be able to detect multicollinearity (a) by examining the correlation matrix for the predictors and looking for high correlations, (b) by noticing large changes in the coefficient for a predictor when other predictors are added to the equation (as happened in Ex. 3.1), and (c) by noticing large standard errors for the coefficients. A more satisfactory method for assessing multicollinearity would be to rely on an index produced by the Regression procedure in SPSS. In this procedure, click on Statistics and check the Collinearity diagnostics box. For the i'th predictor variable, let Ri² be the proportion of variation in Xi that is explained by all the other predictors. Then if Ri² is high, we have multicollinearity. The tolerance is defined as 1 − Ri²; the reciprocal of the tolerance is called the variance inflation factor (VIF):

VIFi = 1 / (1 − Ri²) .
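As a quick illustration of this definition, the sketch below computes Ri², the tolerance, and the VIF directly for each predictor. The data are simulated (x2 is built to be nearly collinear with x1) and the helper function is mine.

# Sketch: tolerance and VIF, computed by regressing each predictor on the others.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)          # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def r_squared(y, X):
    Xd = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(Xd, y, rcond=None)[0]
    resid = y - Xd @ b
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    Ri2 = r_squared(X[:, i], others)
    tol = 1 - Ri2
    print(f"X{i+1}: R_i^2 = {Ri2:.3f}, tolerance = {tol:.3f}, VIF = {1/tol:.2f}")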

VIF's in excess of about 10 are cause for concern.

9. Suppression and Mediation: by Benoit Monin (Psy 252 Instructor)

9.1 Suppression. In a study among students of the relationships among 'GPA', 'SES' (socio-economic status), and 'Work' (hours spent studying per week), SES is positively related to GPA. However, controlling for 'Work', SES is negatively related to GPA! This is one type of "suppression"; there are other types.


Correlations (N = 40; Pearson correlations, with two-tailed Sig. in parentheses)

           SES             WORK            GPA
  SES      1.000           .691** (.000)   .344*  (.030)
  WORK     .691** (.000)   1.000           .735** (.000)
  GPA      .344*  (.030)   .735** (.000)   1.000

  ** Correlation is significant at the 0.01 level (2-tailed).
  *  Correlation is significant at the 0.05 level (2-tailed).


Coefficients (Dependent Variable: GPA)

  Predictor    B       Std. Error   Beta    t        Sig.    Zero-order   Partial   Part
  (Constant)   .060    .309                 .194     .848
  SES          -.969   .450         -.313   -2.152   .038    .344         -.334     -.226
  WORK         1.861   .285         .951    6.538    .000    .735         .732      .687
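With standardized variables, the two-predictor betas can be obtained directly from the three correlations reported above, which makes the sign reversal easy to see. A minimal sketch follows; the formula used is the standard two-predictor expression for standardized coefficients, which is not given explicitly in the handout.

# Sketch: reproducing the suppression pattern from the reported correlations.
#   beta_1 = (r_y1 - r_y2 * r_12) / (1 - r_12^2), with standardized variables.
r_ses_gpa, r_work_gpa, r_ses_work = 0.344, 0.735, 0.691   # values from the correlation table above

beta_ses  = (r_ses_gpa  - r_work_gpa * r_ses_work) / (1 - r_ses_work**2)
beta_work = (r_work_gpa - r_ses_gpa  * r_ses_work) / (1 - r_ses_work**2)

print(f"beta for SES, with Work in the model:  {beta_ses:.3f}")    # about -0.31: the sign flips
print(f"beta for Work, with SES in the model:  {beta_work:.3f}")   # about +0.95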

[Figures: a scatterplot of GPA against SES, and a scatterplot of GPA controlling for Work (residuals) against SES; plus two causal diagrams relating SES, Work, and GPA: one in which Work carries a positive SES effect (positive SES → Work and Work → GPA paths, with a negative direct SES → GPA path), and one with a single positive SES → GPA path.]

For each causal model on the previous page, what are our predictions about

(a) the zero’th-order, simple correlations, e.g., corr(SES, Work) = r(SES)(Work),


(b)

[Five candidate causal models relating SES, Work, and GPA, numbered (1) through (5), drawn as path diagrams.]


(c) the increment in R2, the squared multiple correlation, when new predictors are added to the regression, e.g., ΔR² = R²Y.12 − R²Y.1, the difference in predictive power between using (X1, X2), rather than just (X1), as the predictor set for Y; and

(d) the first-order, partial correlations, e.g., the correlation between SES and GPA, controlling for Work, r(SES)(GPA)•(Work).

Useful 'Theorem': If X1 causes X2 and X2 causes X3, i.e., if X1 → X2 → X3, then r13 = r12 r23.

9.2 Mediation Analysis

Sources: Dave Kenny's website: http://nw3.nai.net/~dakenny/mediate.htm; Baron & Kenny (1986) in JPSP, 51(6): 1173-1182;

Kenny, Kashy & Bolger (1998) in Handbook of Social Psychology.

[Path diagrams: IV → DV (total effect, c); and IV → Med (path a), Med → DV (path b), IV → DV (direct effect, c').]

The logic of mediation analysis is to determine whether a manipulation impacts a dependent variable through the psychological pathways predicted by a theoretical model. Examples:

  Negative feedback → Drop in self-esteem → Hostility
  Stereotype threat → Stereotype activation → Performance decrement
  Low performance → Poor metacognitive skills → Inflated self-assessment
  Drug → Drop in metabolic activity → Fewer bar presses per minute

Baron & Kenny (1986) laid out steps to test mediation that are now considered the Gospel by reviewers in social psychology. It is important to know the technique because of its currency. Run three regressions:

  * DV = c·IV + Intercept            (gives you c)
  * Med = a·IV + Intercept           (gives you a)
  * DV = c'·IV + b·Med + Intercept   (gives you c' and b)

Then four conditions have to be met to claim mediation:

  * c must be significant (that's your original effect)
  * a must be significant
  * b must be significant
  * c' must be reduced:
      - to nonsignificance: complete mediation
      - still significant but lower than c: partial mediation; use the Goodman test


The Goodman (1960) test enables you to evaluate whether the change from c to c’ when the mediator is introduced is significant. It relies on the fact that the direct path can be broken down into:

c = c' + ab, thus c − c' = ab. So testing the drop in c is like testing that ab = 0. The most common formula for a sample is

z = ab / √(b² sa² + a² sb² − sa² sb²) ,

where a and b are unstandardized coefficients and the s's are their standard errors.

This is tested as a z (i.e., significant at the .05 level if the absolute value is greater than 1.96). There has been some debate about which to use, having to do mostly with the last term in the denominator (standard error) above. Baron & Kenny (1986) add it, which is more appropriate for a population than a sample. Sobel (1982) drops it altogether. In practice this doesn’t make much difference. This is often still referred to as a Sobel test even when people use one of the other formulae.
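The following is a small sketch of this test as a function; the argument w selects the variant (−1 for Goodman, 0 for Sobel, +1 for Baron & Kenny), matching the formulas in this section. The illustrative inputs are the rounded a, sa, b, sb values used in the worked partial-mediation example later in this section, so the Goodman line should come out near z = 1.76.

# Sketch: Sobel/Goodman test for an indirect effect a*b.
# a, b = unstandardized path coefficients; s_a, s_b = their standard errors.
import math
from scipy import stats

def mediation_z(a, s_a, b, s_b, w=0):
    se = math.sqrt(b**2 * s_a**2 + a**2 * s_b**2 + w * s_a**2 * s_b**2)
    z = (a * b) / se
    return z, 2 * stats.norm.sf(abs(z))        # two-tailed p from the standard Normal

# Rounded values from the partial-mediation example below: a = 7.8 (s_a = 2.44), b = .02 (s_b = .01)
for w, label in [(-1, "Goodman"), (0, "Sobel"), (1, "Baron & Kenny")]:
    z, p = mediation_z(7.8, 2.44, 0.02, 0.01, w)
    print(f"{label:13s} z = {z:.2f}, p = {p:.3f}")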

a) Complete mediation

Coefficients (Dependent Variable: ATT_1)

  Predictor    B       Std. Error   Beta    t        Sig.
  (Constant)   6.879   .178                 38.624   .000
  CHOICE       .268    .113         .410    2.379    .024

Coefficients (Dependent Variable: AROUSAL)

  Predictor    B        Std. Error   Beta    t        Sig.
  (Constant)   61.685   3.861                15.975   .000
  CHOICE       7.811    2.442        .517    3.198    .003

Coefficients (Dependent Variable: ATT_1)

  Predictor    B       Std. Error   Beta    t        Sig.
  (Constant)   5.672   .522                 10.859   .000
  CHOICE       .115    .121         .176    .949     .351
  AROUSAL      .0196   .008         .452    2.434    .022

[Path diagram: Choice → Arousal = 7.8**; Arousal → Att_1 = .02*; direct Choice → Att_1 = .12, ns (total effect .27*).]

A handy applet for mediation analysis is at: http://www.unc.edu/~preacher/sobel/sobel.htm

Note that the direct link from Choice to Attitude drops to non-significance when the mediator is introduced. Thus we have complete mediation: all the impact of choice can be explained by arousal. Note also that 7.8*.02+.12 = .276, which with rounding error is the same as .27, the direct path.

b) Partial mediation

Coefficients (Dependent Variable: ATT_2)

  Predictor    B       Std. Error   Beta    t        Sig.
  (Constant)   6.379   .178                 35.816   .000
  CHOICE       .468    .113         .618    4.155    .000


Coefficients (Dependent Variable: AROUSAL)

  Predictor    B        Std. Error   Beta    t        Sig.
  (Constant)   61.685   3.861                15.975   .000
  CHOICE       7.811    2.442        .517    3.198    .003

Coefficients (Dependent Variable: ATT_2)

  Predictor    B       Std. Error   Beta    t        Sig.
  (Constant)   5.172   .522                 9.902    .000
  CHOICE       .315    .121         .416    2.597    .015
  AROUSAL      .0196   .008         .390    2.434    .022

[Path diagram: Choice → Arousal = 7.8**; Arousal → Att_2 = .02*; direct Choice → Att_2 = .32* (total effect .47**).]

Note how Choice is still a significant predictor of Attitude, even controlling for arousal… Here we definitely need the Goodman test. We find the standard errors in the tables above:

z = ab / √(b² sa² + a² sb² − sa² sb²) = (7.8 × .02) / √(.02² × 2.44² + 7.8² × .01² − 2.44² × .01²) = 1.76 ,  p = .08 .

This enables us to claim that arousal does mediate, but only partially. Note that had we added the last term we would have gotten 1.64, and had we dropped it we would have gotten 1.70. This illustrates that it makes little difference which formula we use.

9.3. Alternate forms of the test statistic, z. The test statistic for mediation analysis was introduced above as:

z = ab / √(b² sa² + a² sb² + w sa² sb²) ,        (9.3.1)

in which Goodman recommends setting w = −1, Baron & Kenny recommend setting w = +1, and Sobel recommends setting w = 0. As stated above, this choice usually makes little difference, although it should be noted that the most conservative choice (leading to the smallest value of z) would be w = +1. Howell (p. 577) gives the test statistic as:


z = βa βb / √(βb² sa² + βa² sb² + w sa² sb²) ,
where sa and sb are the standard errors of βa and βb, respectively, not of a and b. To avoid confusion, let us use sa and sb to denote the s.e.s of a and b, and Sa and Sb to denote the s.e.s of βa and βb. So we rewrite the foregoing equation as:

z = βa βb / √(βb² Sa² + βa² Sb² + w Sa² Sb²) .        (9.3.2)

We also know, from the applet referred to above, that it is possible to do the Sobel/Goodman test using only

two numbers (instead of four numbers), ta and tb. In this section, we show that (a) the test statistics in Eqs. (9.3.1) and (9.3.2) are algebraically equivalent, and (b) an equivalent expression (Eq. (9.3.3) below) can be derived involving only ta and tb. First, in Section 4.1 above, we gave the relation between the raw coefficients, bi, and the standardized (or ‘beta’) coefficients, βi, in a regression as:

βi = bi (si / sy) ,

where si is the s.d. of the predictor, Xi, and sy is the s.d. of the dependent variable, Y. Applying this equation to the present context, we can replace the ratios, sy/si, by constants, ka and kb, and write:


a = ka βa and b = kb βb, which implies sa = ka Sa and sb = kb Sb. It is not hard to show that, on substituting for a, b, sa and sb,

ab = ka kb βa βb ,  and  b² sa² + a² sb² + w sa² sb² = (ka kb)² (βb² Sa² + βa² Sb² + w Sa² Sb²) .

On substituting these expressions in Eq. (9.3.1), we get the expression in Eq. (9.3.2). Second, the t-statistics corresponding to the raw regression coefficients are:

ta = a / sa and tb = b / sb , implying a = sa ta and b = sb tb .

On substituting for a and b,

ab = sa sb ta tb ,  and  b² sa² + a² sb² + w sa² sb² = (sa sb)² (tb² + ta² + w) .

On substituting these expressions in Eq. (9.3.1), we get

z = ta tb / √(ta² + tb² + w) .        (9.3.3)

Note that ta and tb are obtainable from the SPSS output for the regressions in a mediation analysis. Therefore, Eq. (9.3.3) is probably the simplest equation to use.
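Here is a sketch of Eq. (9.3.3) as a function; the t values plugged in are the CHOICE → AROUSAL and AROUSAL → ATT_2 t's from the partial-mediation output above. (The result differs slightly from the hand calculation earlier, which used coefficients rounded to 7.8, 2.44, .02, and .01.)

# Sketch: Eq. (9.3.3), the mediation z from the two t statistics alone.
import math
from scipy import stats

def mediation_z_from_t(t_a, t_b, w=0):
    z = (t_a * t_b) / math.sqrt(t_a**2 + t_b**2 + w)
    return z, 2 * stats.norm.sf(abs(z))

# t values read off the partial-mediation regression output above: t_a = 3.198, t_b = 2.434
for w in (-1, 0, 1):
    z, p = mediation_z_from_t(3.198, 2.434, w)
    print(f"w = {w:+d}: z = {z:.2f}, p = {p:.3f}")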

10. Dummy Coding How can the methods of Multiple Regression (MR) be extended to the case of qualitative predictor variables? (a) When the levels of a qualitative variable are coded, e.g., 1, 2, ..., k, these numbers lie on a nominal scale. For example, a `4' on a nominal scale is merely different from `2', it is not twice as large as `2'. For this reason, arithmetical operations (like adding or squaring) on the numbers from a nominal scale have no meaning; in particular, a MR analysis on such numbers has no meaning. (b) However, if the qualitative variable, A, has only 2 levels (e.g., `low' and `high'), these values can be coded as X = 0 and 1, respectively, and the equation,

E(Yi) = b0 + b1Xi , would imply that, for `low' levels of A (i.e., for Xi = 0),

E(Yi) = b0 + b1Xi = b0 + b1(0) = b0 , and, for `high' levels of A (i.e., for Xi = 1),

E(Yi) = b0 + b1Xi = b0 + b1(1) = b0 + b1.


Thus, the coefficient, b1, in the regression equation is equal to the difference between the mean Y for `high' and `low' levels of A, and b0 is the mean of Y for `low' levels of A. For example, doing a t-test or F-test to see if the 2 groups have the same mean is equivalent to doing a MR and seeing if b1 is different from 0. In other words, Multiple Regression can be applied in this case to yield interpretable results. This 0-1 coding of a qualitative variable is called dummy coding. (c) Suppose now that the qualitative variable, A, has k (> 2) levels (e.g., `low', `moderate' and `high'). In this case, we introduce k-1 dummy variables, X1, X2, ..., Xk-1, as follows:

(i) Each observation in the study is assigned a value (0 or 1) on each of the k-1 dummy variables, and a value on the dependent variable, Y.
(ii) Those observations at the 1st level of A are assigned a `1' on X1 and a 0 on the other dummy variables.
(iii) Those observations at the 2nd level of A are assigned a `1' on X2 and a 0 on the other dummy variables.
(iv) And so on. Those observations at the (k-1)'th level of A are assigned a `1' on Xk-1 and a 0 on the other dummy variables.
(v) And those observations at the k'th level of A are assigned a `0' on all the dummy variables.

(d) Other coding schemes, such as effect coding and orthogonal coding, use three values of X, namely, -1, 0 and 1. The differences between these schemes and 'dummy coding' are not great, and some texts refer to this whole class of recoding categorical variables as dummy coding. We will provide details of these schemes later on. (e) In the case of 2 factors, how do we code for the interaction? Ans. We first dummy code each factor. Let X be the dummy variable for the dichotomous Factor 1, and U1 and U2 be the dummy variables for the 3-valued Factor 2. Second, we create new dummy variables for the interaction by multiplying each dummy variable for Factor 1 by every dummy variable for Factor 2. [The number of these 'interaction' dummy variables is equal to the degrees of freedom for the interaction in the 2-way ANOVA, because, if a and b are the numbers of levels of the 2 factors, then there are a-1 dummy variables for Factor 1 and b-1 for Factor 2; thus there are (a-1)(b-1) products of dummy variables to represent the interaction.] For the present example, this would give X*U1 and X*U2, and the corresponding MR would be:

E(Y) = b0 + b1X + b2U1 + b3U2 + b4X*U1 + b5X*U2.
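A small sketch of this coding scheme follows; the helper dummy_code and the example factor levels are mine, but the construction (k−1 indicator columns per factor, and products of those columns for the interaction) follows the steps above.

# Sketch: dummy coding two factors and building interaction columns by multiplication.
import numpy as np

def dummy_code(levels, reference):
    """0/1 columns for every level except the reference level (which gets all zeros)."""
    levels = np.asarray(levels)
    categories = [c for c in np.unique(levels) if c != reference]
    return np.column_stack([(levels == c).astype(float) for c in categories]), categories

# Factor 1 (2 levels) and Factor 2 (3 levels) for a few illustrative observations
f1 = np.array(["low", "high", "high", "low", "high", "low"])
f2 = np.array(["a", "b", "c", "c", "a", "b"])

X, _ = dummy_code(f1, reference="low")        # one column: X
U, _ = dummy_code(f2, reference="c")          # two columns: U1, U2
XU = X * U                                    # (a-1)(b-1) = 2 interaction columns
design = np.column_stack([np.ones(len(f1)), X, U, XU])   # columns: 1, X, U1, U2, X*U1, X*U2
print(design)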

11: Comparing groups using means and regression slopes: A study of jury decision-making

Resources: 'voting1.sav', and 'voting1.spo'. Consider a study done by Angela Cole (now a professor at Howard University) and me in which mock jurors (i.e., undergrads) at 3 research sites ('site') were shown abbreviated legal cases (involving voting rights) and were asked, among other questions, to judge: (i) the strength of the evidence presented by the plaintiff (who was a minority group) relative to that presented by the State ('evid'), and (ii) how strongly they felt that they should uphold (versus reject) the plaintiff's arguments ('preference'). The (fictional) data are in 'voting1.sav'. Theory: Information-processing models focus on the relationship between 'preference' and 'perceived stimulus strength'.


Other theories of fairness, social identity, and social justice focus on variables such as 'group membership', 'need to be respected', etc. Regression Model. Consider the simple regression of 'preference' (P) on 'evidence strength' (E): P = b0 + b1E + ε, where var(ε) = σ2. The information-processing model predicts that b1 > 0. To check:

[Scatterplot of PREF (vertical axis, roughly 4 to 18) against EVID (horizontal axis, roughly 0 to 18) for the full sample.]
Exercises: (i) From the scatterplot, read off an estimate of b1. Ans: Slope = (Vertical distance)/(Horizontal distance) = (14.0 − 8.5)/(16 − 2) = 5.5/14 = .39 (approx). (ii) Read off the intercept; interpret it. (iii) Formally estimate the "best-fitting" line. Ans. The SPSS output gives the slope as .38 (Close!).

Coefficients (Dependent Variable: PREF)

  Predictor    B       Std. Error   Beta    t        Sig.
  (Constant)   7.658   .560                 13.670   .000
  EVID         .381    .061         .456    6.242    .000

Site differences: At the descriptive (as opposed to 'theoretical') level, sites might differ with respect to: (a) mean evidence (Why?), (b) mean preference, (c) the dependence of preference on evidence (in which case, differences between the slope and intercept become theoretically interesting – Why?), and (d) the variance of E and of P, and var(ε) = σ2. Ans. to (a)-(b):


ANOVA

  DV      Source            Sum of Squares   df    Mean Square   F        Sig.
  EVID    Between Groups    427.215          2     213.607       36.174   .000
          Within Groups     868.035          147   5.905
          Total             1295.250         149
  PREF    Between Groups    7.709            2     3.854         .634     .532
          Within Groups     893.164          147   6.076
          Total             900.873          149

Ans to (c).

[Scatterplot of PREF against EVID, with separate markers for the three research sites (BU, CC, RU).]
(i) From the scatterplots, read off estimates of the group slopes; compare them with 0.39. (ii) What formal statistical analysis would be helpful?

Tests of Between-Subjects Effects (Dependent Variable: PREF)

  Source            Type III Sum of Squares   df    Mean Square   F        Sig.
  Corrected Model   352.572                   3     117.524       31.294   .000
  Intercept         .002                      1     .002          .001     .980
  EVID              344.863                   1     344.863       91.829   .000
  SITE              162.240                   2     81.120        21.600   .000
  Error             548.301                   146   3.755
  Total             5386.576                  150
  Corrected Total   900.873                   149

  R Squared = .391 (Adjusted R Squared = .379)

Note that, although ‘preference’ did not vary among the 3 groups, we now have ‘site’ as a significant predictor of ‘preference’. Why? (iii) Might we use dummy coding to analyse the 2 df for ‘groups’ into 2 ‘interesting’ contrasts? Ans. Define ‘ethnicity’ as the contrast between BU and the average of RU and CC; and ‘tier’ as the contrast between RU and CC. Each contrast is significant.


Coefficients (Dependent Variable: PREF)

  Predictor    B       Std. Error   Beta    t        Sig.
  (Constant)   5.593   .595                 9.406    .000
  ETHNIC       .721    .133         .412    5.435    .000
  TIER         -.888   .209         -.293   -4.252   .000
  EVID         .619    .066         .741    9.389    .000

(iv) Is the dependence of ‘preference’ on ‘evidence’ the same in all groups? That is, is there a Site x Evidence interaction? Define ‘ethnicxev’ = ‘ethnic’*(‘evid’ – 8.68), and ‘tierxev’ = ‘tier’*(‘evid’ – 8.68).

Coefficients (Dependent Variable: PREF)

  Predictor    B       Std. Error   Beta    t        Sig.
  (Constant)   5.201   .623                 8.342    .000
  ETHNIC       .659    .135         .376    4.873    .000
  TIER         -.980   .232         -.323   -4.225   .000
  EVID         .682    .071         .817    9.618    .000
  ETHNXEV      .099    .044         .170    2.248    .026
  TIERXEV      .072    .096         .057    .750     .454

And so on! Data file:

  Site   Ethnicity   Tier    Evidence   Preference   Ethnxev   Tierxev
  1.00    1.00        1.00    7.00      11.00         -1.68     -1.68
  1.00    1.00        1.00    9.00       9.00           .32       .32
  2.00    1.00       -1.00    5.00       7.00         -3.68      3.68
  2.00    1.00       -1.00    2.00       9.00         -6.68      6.68
  3.00   -2.00         .00   16.00      14.00        -14.64       .00
  3.00   -2.00         .00    9.00      10.00          -.64       .00
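A sketch of how the contrast and interaction columns in this data file can be built from 'site' and 'evid' follows. The coding below (sites 1 and 2 get +1 on 'ethnicity' and +1/−1 on 'tier', while site 3 gets −2 and 0) is inferred from the rows above; which site number corresponds to which label (BU, CC, RU) is not stated in the handout, and 8.68 is the EVID mean used in the text.

# Sketch: building the 'ethnicity' and 'tier' contrasts and their interactions with centred evidence.
import numpy as np

site = np.array([1, 1, 2, 2, 3, 3])                 # the six illustrative rows shown above
evid = np.array([7.0, 9.0, 5.0, 2.0, 16.0, 9.0])

ethnic = np.where(site == 3, -2.0, 1.0)             # one site contrasted against the average of the other two
tier   = np.select([site == 1, site == 2], [1.0, -1.0], default=0.0)

evid_c  = evid - 8.68                               # centred evidence
ethnxev = ethnic * evid_c
tierxev = tier * evid_c
print(np.column_stack([site, ethnic, tier, evid, ethnxev, tierxev]))   # reproduces the columns above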

12. Testing differences between two correlations

12.1. Independent samples (see Sec. 7.1 above). Suppose we have two observed correlations, r1 and r2, from two independent samples of sizes n1 and n2, respectively. We wish to test the null hypothesis that E(r1) = E(r2). To do so, we (a) Convert each correlation into a Fisher’s Z-score, Z1 and Z2.

(b) Use the fact that var(Z) = 1/(n − 3) to find the variance of the difference:

var(Z1 − Z2) = 1/(n1 − 3) + 1/(n2 − 3) ≡ sd² .

Page 23: Handout 16: Introduction to Multiple Linear Regressionlangcog.stanford.edu/expts/MLL/Coursework/Statistical... · 2018. 11. 10. · Psychology 252 Ewart Thomas 1 Handout 7: Introduction

Psychology 252 Ewart Thomas

23

(c) Define a standard Normal variable as z = (Z1 − Z2) / sd , and use this as the test statistic.
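A sketch of steps (a)-(c) as a function follows; the two correlations and sample sizes fed to it are made-up illustrative numbers, not values from the handout.

# Sketch: testing the difference between two independent correlations via Fisher's Z (Sec. 12.1).
import math
from scipy import stats

def independent_r_test(r1, n1, r2, n2):
    z1 = 0.5 * math.log((1 + r1) / (1 - r1))        # Fisher r-to-Z
    z2 = 0.5 * math.log((1 + r2) / (1 - r2))
    sd = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / sd
    return z, 2 * stats.norm.sf(abs(z))             # two-tailed p

z, p = independent_r_test(0.60, 50, 0.30, 60)       # illustrative inputs
print(f"z = {z:.2f}, p = {p:.3f}")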

12.2. Related samples. Suppose r12, r13, and r23 are calculated for a given sample of size n. We wish to test if, e.g., E(r12) = E(r13). To do so, we need a complicated expression for the standard error. Let

sD² = (1 − r12²)² + (1 − r13²)² − 2 r23³ − (2 r23 − r12 r13)(1 − r12² − r13² − r23²) .

Then define a standard normal variate as:

z = (r12 − r13) √(n − 1) / sD ,

and use this as the test statistic. I extracted the above formula from an out-of-print text (Fundamental Research Statistics for the Behavioral Sciences, by John T. Roscoe, 1975, HRW) that I used to teach from at the University of Michigan in 1971! The references given there are to Hendrickson et al., American Educational Research Journal, 7, 189-194 (1970); and 639-641 (1970). 12.3. Alternative approaches when the raw data are available. In the case of independent samples, we can introduce the variable, 'group', to index the different samples, and then do a multiple regression of Y on X and 'group'. We may also include an X*'group' interaction term in the regression. This seems to be a better way to answer the questions of interest, but you would need to have access to the raw data. In the case of related samples, I cannot think of a better test of the specific hypothesis posed in 12.2 above. However, it is possible that the real issue of interest is not the difference between two correlations, but rather the validity of one or other model of mediation, suppression or other pattern of causality. These latter models can be tested as shown in Section 9 above.

13. Another Example 1. Kurt L., an eager young social scientist, has a theory that rich people live longer. To test it, he tries a simple study: he goes to 2 cemeteries (a Christian one and a Jewish one in two different parts of town) and picks 30 tombs randomly in each. He measures the height of each tombstone as a measure of wealth, and he computes the age of the deceased by looking at the dates on the tombstone. He predicts that the higher the tombstone, the older the deceased should be. The data is entered in the file cemetery.sav. Use SPSS for all the computations below except if otherwise noted.

a) Using Graphs/Scatter…/Simple, provide a scatterplot for this data (make sure that your choice of X and Y capture the hypothesized causal relationship).

b) Compute the correlation between height and age in the total sample. Is it significant? Interpret this statistic in terms of support for the hypothesis.

c) Provide a linear equation to predict age from tombstone height. Can you find the correlation coefficient computed above anywhere in the regression output? Explain why.

d) Get standardized values (z-scores) of height and age. To do that use Statistics/Summarize/Descriptives and make sure you click the box “save standardized values as variables”. How could you have computed these values if SPSS did not provide this shortcut?

e) Repeat questions b) and c) using these newly computed z-scores. Compare these results with your results on unstandardized values.

f) Kurt realizes that his two samples may be too different to aggregate. Test whether the two samples differ in average height and average age.


g) Re-draw the scatterplot as in a), but now enter cemetery in “Set markers by:”. What do you observe?

h) Using Data/Split file/Compare groups, you can tell SPSS to conduct an analysis separately on two groups. Use that function to compute separate correlation coefficients in the two groups. Compare those to your answer to b) and discuss.
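For exercises (b), (f) and (h), here is a sketch of the overall versus within-group comparison outside SPSS. The cemetery.sav data are not reproduced in this handout, so the arrays below are placeholder data generated purely so that the snippet runs; replace them with the height, age, and cemetery columns from the file.

# Sketch: overall vs. per-cemetery correlations between tombstone height and age at death.
import numpy as np
from scipy import stats

def report(label, h, a):
    r, p = stats.pearsonr(h, a)
    print(f"{label:10s} r = {r:.3f}, p = {p:.3f}, n = {len(h)}")

# Placeholder data (to be replaced by the cemetery.sav columns)
rng = np.random.default_rng(5)
cemetery = np.repeat(["Christian", "Jewish"], 30)
height = np.concatenate([rng.normal(55, 8, 30), rng.normal(60, 8, 30)])
age = np.concatenate([88 + 0.15 * (height[:30] - 55) + rng.normal(0, 2, 30),
                      90 + 0.15 * (height[30:] - 60) + rng.normal(0, 2, 30)])

report("overall", height, age)
for g in np.unique(cemetery):
    mask = cemetery == g
    report(g, height[mask], age[mask])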

[Scatterplots of age against tombstone height: the pooled sample (R Sq Linear = 0.09), and the same plot with separate fit lines by cemetery (Jewish: R Sq Linear = 0.212; Christian: R Sq Linear = 0.375).]

Tests of Between-Subjects Effects (Dependent Variable: age)

  Source              Type III Sum of Squares   df   Mean Square   F          Sig.
  Corrected Model     351.414                   3    117.138       32.721     .000
  Intercept           5982.080                  1    5982.080      1671.003   .000
  cemetery            2.588                     1    2.588         .723       .399
  height              81.249                    1    81.249        22.696     .000
  cemetery * height   3.280                     1    3.280         .916       .343
  Error               200.476                   56   3.580
  Total               476773.650                60
  Corrected Total     551.890                   59

  R Squared = .637 (Adjusted R Squared = .617)

Interpret!