
Psychology 252, Fall 2011
Thomas & Monin

Handout 5 – Part I: Introduction to Correlation and Regression

A- Simple Regression and Correlation

1. Types of data and types of tests of whether X and Y are related

In the class so far we have moved from having both the independent variable and the dependent variable be categorical (χ²) to having the dependent variable be quantitative, with 2 categories (t-test) or k categories (ANOVA) for the independent variable. Now we will learn to deal with the situation where both the independent variable and the dependent variable are quantitative. I prefer to refer to these variables as quantitative because in psychology they are often response scales, which are not really continuous but are quantitative. As the table below suggests, we will still need to consider the case of logistic regression, which is used to predict categorical variables from quantitative ones.

2. The simple regression line

Imagine we have 2 quantitative variables X and Y, and that we would like to predict Y from X. For example we may want to predict the number of violent incidents from the average temperature that day, depression from the number of dates in a given month, or latency of word recognition from the number of letters in the word. Let's take the latter example, and use totally made-up data averaged over a number of imaginary subjects.

The figure below is called a scatterplot or scattergram and is used to eyeball the relationship between the variables, as well as to give us a rough idea of how well a number of assumptions are met. Here it looks like a linear trend is present, i.e. the longer a word, the longer it takes people to recognize it (and the rate of increase seems roughly constant across word lengths). How can we test that? Let's call length of word the predictor variable, X, and latency of word recognition the predicted or criterion variable, Y, for the calculations.

We can actually use a lot of the language we learned in analysis of variance. What we basically want to explain is the variability in Y, which is captured by the sum of squares of Y, or SSY. Suppose that instead of using the actual number of letters for each word we created three equal-size groups of short, medium and long words. We would then do a simple ANOVA and get the table below, which shows a significant effect of word length, F(2,57) = 10.9, p < .001. We could then follow up with a contrast for a linear trend. But imagine we want to fine-grain our analysis and not lump words into groups.

> anova(lm(y~levelx,data=data0))
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value   Pr(>F)    
levelx     2  89.257  44.628  10.923 9.64e-05 ***
Residuals 57 232.879   4.086                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Type of test as a function of the type of each variable:

                          Independent Variable
Dependent Variable        Categorical                Quantitative
Categorical               χ² test of independence    Logistic regression
Quantitative              z- and t-tests, ANOVA      Regression & correlation


Note that the ANOVA above relies on the following structural model for each observation (I am not using population-value notation, for ease of understanding, although that would be more rigorous for structural models):

yij = ȳj + eij

where each observation can best be thought of as the mean for its group plus some individual error. The logic of regression is also to think of individual scores as random deviations around a meaningful underlying structure, but instead of the structure being a few group means, the underlying structure generates a different predicted score for every value of the predictor variable – in other words, observed values of Y are thought of as random deviations around a regression line that depends only on X. The structural model becomes:

yi = ŷi + ei = (bxi + a) + ei

where we now only index the y values with i because there are no more groups, just pairs of observations (xi, yi). The notation ŷi refers to the "predicted value" of y given xi, and is basically the value on the regression line at xi. We define it as ŷi = bxi + a because we assume a linear relationship between the two variables. We know we can capture a linear function with two parameters: the slope (b above, the amount of increase in ŷ for each increase of 1 in x) and the intercept (a above, corresponding to the value of ŷ for x = 0).

[Figure: scatterplot of latency of word recognition (y-axis, 0–10) against length of word (x-axis, 0–10).]

In R you can graph the regression line with:

> plot(x,y)
> abline(reg=lm(y~x))

3 – Computing the slope (b) and the intercept (a)


[Figure: schematic of the regression line ŷ = bx + a, showing a data point (xi, yi), its predicted value ŷi on the line, the residual ei = yi − ŷi, the intercept a, and the slope b (the rise for a run of 1).]


How do we choose a and b? It should be apparent from the figure above that the ideal values should provide a model that is as close as possible to the data, i.e. that requires the smallest ei's, or the smallest deviations from the predicted values. In other words we want to minimize the ei values, the vertical distances between the points and the regression line. This "error" around the regression line is akin to the error around group means, and in fact, once again, we will square these deviations and sum them up to generate a sum of squares; only here we want to minimize this value to maximize the fit of the model. How can we choose a and b to minimize this sum of squares (SSResidual)? [This approach is referred to as the "Method of Least Squares."]

SSResidual = Σ(yi − ŷi)² = Σ(yi − bxi − a)²

Here we introduce the means of X and Y, because we suspect we will want to extract sums of squares:

SSResidual = Σ[(yi − ȳ) − b(xi − x̄) + (ȳ − bx̄ − a)]²

Now we use the formula (u + v + w)² = u² + v² + w² + 2uv + 2vw + 2uw, which gives:

SSResidual = Σ(yi − ȳ)² + b²Σ(xi − x̄)² + N(ȳ − bx̄ − a)² − 2bΣ(xi − x̄)(yi − ȳ)

Note that the two remaining cross-terms of the expansion, 2(ȳ − bx̄ − a)Σ(yi − ȳ) and −2b(ȳ − bx̄ − a)Σ(xi − x̄), both sum to zero (deviations from a mean always sum to zero), which is why they do not appear in the sum above.

Let's call the expression Σ(xi − x̄)(yi − ȳ) the "sum of products" or SP, write Σ(xi − x̄)² and Σ(yi − ȳ)² as the familiar sums of squares SSX and SSY, and rewrite the above:

SSResidual = SSY + b²SSX − 2b·SP + N(ȳ − bx̄ − a)²

Remember that our goal is to choose a and b such that this sum is as small as possible. The 1st term, SSY, does not involve a or b at all, so we can leave it aside. The 2nd and 3rd terms, b²SSX − 2b·SP, involve only b, and together they are minimized if we choose

b = SP / SSX

The 4th term, N(ȳ − bx̄ − a)², is a square, so it can be driven all the way to zero: it is minimized if we choose

a = ȳ − bx̄

In turn, setting a and b in this fashion will minimize the sum of squares, and enable us to draw the regression line ŷ = bx + a that best fits the data available. Note that if we set a and b as above, we have:

ȳ = bx̄ + a

that is, the regression line always passes through the point (x̄, ȳ) defined by the two means.
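If you want to check these two formulas for yourself, here is a minimal R sketch (assuming the x and y vectors from the made-up word-length example are in your workspace; they are not reproduced in this handout). It computes SP, SSX, b and a by hand and compares them to what lm() returns:

SP  <- sum((x - mean(x)) * (y - mean(y)))    # sum of products
SSX <- sum((x - mean(x))^2)                  # sum of squares of X
b   <- SP / SSX                              # least-squares slope
a   <- mean(y) - b * mean(x)                 # least-squares intercept
c(intercept = a, slope = b)
coef(lm(y ~ x))                              # should give the same two numbers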


The output for this simple regression analysis is shown below; the way to interpret it is that the best-fitting regression line for these data has the equation ŷ = 0.7465x + 2.3131.

> lm(y~x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
     2.3131       0.7465  

> anova(lm(y~x))
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)    
x          1  94.811  94.811  24.190 7.545e-06 ***
Residuals 58 227.325   3.919                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1875 -1.3694 -0.3192  0.9844  4.9181 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.3131     0.8694   2.660   0.0101 *  
x             0.7465     0.1518   4.918 7.54e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.98 on 58 degrees of freedom
Multiple R-squared: 0.2943,	Adjusted R-squared: 0.2822 
F-statistic: 24.19 on 1 and 58 DF,  p-value: 7.545e-06

The ANOVA table here is similar to the table on the first page, except that it tests the significance of this regression model. Note that the total SS (here 94.811 + 227.325 = 322.136) is the same as in the earlier table, but that the regression model explains more of the sum of squares (94.811) than the categorical model did (89.257). This leaves less SSError, and coupled with a smaller numerator df (1 instead of 2), it explains why we get a larger F ratio. The regression approach is nearly always more powerful than lumping observations into groups: you lose precious information by lumping, whereas all of the information is analyzed in the regression, which explains its greater power. Note the usual additivity of sums of squares, such that SSTotal = SSRegression + SSResidual, where we already know how to compute SSTotal and SSResidual.
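To convince yourself of this additivity, here is a small sketch (same assumption: the x and y vectors from the example are loaded) that rebuilds the sums of squares and the F ratio from the fitted model:

fit     <- lm(y ~ x)
SStotal <- sum((y - mean(y))^2)              # SS to be explained
SSreg   <- sum((fitted(fit) - mean(y))^2)    # SS explained by the regression
SSres   <- sum(resid(fit)^2)                 # SS left unexplained
c(SStotal, SSreg + SSres)                    # the two should be equal
(SSreg / 1) / (SSres / (length(y) - 2))      # F ratio; compare with anova(fit)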


Thus SSResidual is a common index of error around the line. In fact, if we divide it by its (N − 2) degrees of freedom we get the residual variance or error variance, and its square root is the standard error of the estimate, often denoted sY.X (which is different from the s.e. attached to each coefficient!):

sY.X = √[ SSResidual / (N − 2) ]
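A quick sketch of that computation (same x and y assumption), checked against the "Residual standard error" that summary(lm()) reports:

fit  <- lm(y ~ x)
s.yx <- sqrt(sum(resid(fit)^2) / (length(y) - 2))   # standard error of the estimate
s.yx
summary(fit)$sigma                                  # R reports the same quantity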

4- Covariance and Correlation Coefficients

The Sum of Products (SP = Σ(xi − x̄)(yi − ȳ)) encountered above captures the degree of association between the two variables, and is in fact used to generate two common indices: the covariance and Pearson's product-moment correlation coefficient. Let's take a moment to build our intuition about the value of the SP.

It should be clear from the figure below (the scatterplot divided into four quadrants at x̄ and ȳ) that this number will be positive if most observations fall into the upper-right and lower-left quadrants, negative if most fall within the upper-left and lower-right, and closer to zero if the observations are all over the place. Thus it is a good starting point for measuring the association between two variables.

However, it is also dependent on other things that we may want to strip from this index. For example, the more observations, the more terms in the sum, and the greater the total. We can for example divide it by (N − 1), where N is the number of observations. This makes it resemble a sample variance, and in fact we will call this value the covariance of X and Y:

COVXY = SP / (N − 1) = Σ(xi − x̄)(yi − ȳ) / (N − 1)

Note that COVXX=VAR(X). This statistic is often used when we want to introduce a measure of association in a calculation of variances (e.g., to compute the variance of X-Y). However, when we’re not dealing directly with variances, it presents the issue that it is very dependent on the variance of X and Y. For example, by changing the scale arbitrarily (e.g., standardizing scores), we would change the covariance, whereas we would like a measure of association that is more independent of the variance of each of the two variables. That is where Pearson’s product-moment correlation coefficient comes in handy:

r = COVXY / (sX·sY) = SP / √(SSX·SSY) = Σ(zX·zY) / (N − 1)

where sX and sY refer to the standard deviations of X and Y, and zX and zY refer to standardized values of X and Y. A couple of relationships between the correlation coefficient and regression are worth noticing. First, we defined the slope of the line as b = SP/SSX, and r as SP/√(SSX·SSY); thus:

b = r · (sY / sX)   and   r = b · (sX / sY)


[Figure: the word-length scatterplot again, divided into four quadrants at x̄ and ȳ. Points in the upper-right and lower-left quadrants have (xi − x̄)(yi − ȳ) > 0; points in the upper-left and lower-right quadrants have (xi − x̄)(yi − ȳ) < 0.]


[Figure: scatterplot of Y against X with two regression lines: ŷ = bx + a (Y predicted from X) and x̂ = b'y + a' (X predicted from Y), which do not coincide.]


Note that this means one way to think of r is as the slope of the regression line if both variables were standardized. Another thing that is apparent from this formula is that whereas r is reflexive (rXY = rYX), the regression slope b is not: it is different when X predicts Y and when Y predicts X. Because the regression equation minimizes the vertical distances (as opposed to Euclidean distances, i.e. the perpendicular distance from each point to the line), you cannot obtain the slope or intercept of X = f(Y) directly from those of Y = f(X) (see the figure above).
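Both points are easy to check in R (again assuming x and y are loaded): the slope of the regression run on standardized scores equals r, whereas the raw slopes of y on x and of x on y are different.

cor(x, y)                            # r
coef(lm(scale(y) ~ scale(x)))[2]     # slope of the standardized regression = r
coef(lm(y ~ x))[2]                   # b for Y predicted from X
coef(lm(x ~ y))[2]                   # a different slope for X predicted from Y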

There is another enlightening relationship between the correlation coefficient and the regression model.

Remember that r = SP/√(SSX·SSY), so r² = SP²/(SSX·SSY).

Since SSRegression = Σ(ŷi − ȳ)² = b²·SSX = SP²/SSX, it follows that

r² = SSRegression / SSY = SSRegression / SSTotal = 1 − SSResidual / SSTotal

This makes the r coefficient extremely meaningful: Consider that SSTotal is the variance to be explained, and SSResidual is the variance unexplained. It follows that r2 estimates the percentage of variance explained. If we find r=.30, we know 9% of the variance is explained.

Let's make R work for us a little. Here's what the output for the covariance and the correlation looks like: r = .54 and COV = 2.15, which means that about 29% of the variance is accounted for.

> var(x,y)   # R uses the same command for variance and covariance
[1] 2.152656

> sum((x-mean(x))*(y-mean(y)))/(length(x)-1)
[1] 2.152656

> cor(x,y)
[1] 0.5425129

> var(x,y)/(sd(x)*sd(y))
[1] 0.5425129

> cor.test(x,y)

	Pearson's product-moment correlation

data:  x and y 
t = 4.9184, df = 58, p-value = 7.545e-06
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.3346949 0.7000065 
sample estimates:
      cor 
0.5425129

Let's dwell a little more on the partitioning of sums of squares in the regression. First, you'll recall that in the ANOVA (see the first page):

SSTotal = Σ(yij − ȳ)² = SSBetween + SSWithin

In regression we have analogous values:

SSTotal = Σ(yi − ȳ)²,   SSRegression = Σ(ŷi − ȳ)²,   SSResidual = Σ(yi − ŷi)²


And they also partition the total sum of squares:

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

The first term on the right captures r² of the variance, while the second captures (1 − r²).

Thus one way to think of r² is:

r² = SSRegression / SSTotal = (variance explained) / (total variance to be explained)

This last expression actually gives r a more general meaning that makes it applicable to most situations as an estimate of effect size. Because people are familiar with r and how to interpret it, some authors have advocated that r be used as the common metric for effect size; thus even for a t test you could report an r instead of a d, with the appropriate transformations. One virtue of r over the slope is that it is independent of the scales used and thus can be interpreted more readily with less knowledge about the study. Another way to deal with the scale issue is to standardize both the predictor and the predicted variable before constructing the regression equation. If you do that, you obtain the "standardized coefficient," denoted by β (beta). In the case of simple regression, beta is the same thing as r.

5 – Confidence interval on Y

Although you may not find yourself computing a CI on Y very often, it is useful to be exposed to it because it illustrates an important point about the quality of the prediction. The regression line is at its best when it predicts Y at the mean of X (which means at the mean of Y too, because the regression line by definition goes through the point defined by the two means).

Because accuracy is best at the mean of X, we apply a slight correction to the standard error of the estimate (sY.X) when predicting an individual Y at a given value xi:

s'Y.X = sY.X · √[ 1 + 1/N + (xi − x̄)² / SSX ]

And then the CI is defined as:

CI95% = ŷi ± t.025(N − 2) · s'Y.X

You can see from the corrected standard error of the estimate why the CI curves become wider as you move away from the mean of X on either side. [Figure: the word-length scatterplot with the regression line, the CI curves for an individual Y (outer lines), and the CI curves for the mean of Y (inner lines), all widening away from x̄.] On that graph the CI for Y is captured by the outer lines. I have also included the CI for the mean of Y because it makes the shape of the CI curves more obvious, although I am not showing you how to compute it at this point. If the intercept were of particular interest, there exist techniques that place the greatest accuracy around the intercept instead of the mean; in most cases, however, the intercept is of little relevance.
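In practice you would let predict() do this work. Here is a sketch, assuming the word-length model again; interval = "prediction" gives the wider CI for an individual Y (the outer curves), and interval = "confidence" the CI for the mean of Y (the inner curves):

fit <- lm(y ~ x)
new <- data.frame(x = c(2, mean(x), 8))        # a few word lengths to predict at
predict(fit, new, interval = "prediction")     # CI on an individual Y; widest away from mean(x)
predict(fit, new, interval = "confidence")     # narrower CI on the mean of Y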

6 – Hypothesis testing

From the standard error of the estimate (sY.X) we can generate a standard error for b. Realize that b is an estimate of a population parameter (sometimes written b*), just as x̄ is an estimate of μ. This estimate b is normally distributed around b*, with a standard error approximated by:

sb = sY.X / √SSX

And if we have an estimate of a population parameter and its standard error, we're all set to compute a t!


t = (b − b*) / sb, and if we are testing H0: b* = 0, t = b / sb, distributed on N − 2 df.

This is the t that R gave us for the b coefficient of X in the output above. Note that R gave us an F for the model, and then a t for both the slope and the intercept. In reality the test of the model is equivalent to a test of the slope being different from zero; in fact you may have noticed in the ANOVA table that F = t² (24.190 = 4.918²)! You may wonder why the test of the model does not involve the intercept. The test of the intercept (that a is not zero) is irrelevant for the quality of the model, because we optimized a by definition. In fact, if b = 0, then a = ȳ and ŷi = ȳ for every observation, so the model explains none of the variance. The usefulness of testing the significance of the whole model vs. testing the significance of a particular slope will become more apparent in multiple regression.
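You can verify the F = t² equivalence directly from the fitted object (same x and y assumption):

fit <- lm(y ~ x)
t.b <- coef(summary(fit))["x", "t value"]   # t for the slope
F.m <- anova(fit)["x", "F value"]           # F for the model
c(t.b^2, F.m)                               # the two numbers should match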

We can also test the significance of a correlation coefficient using:

t = r·√(N − 2) / √(1 − r²), with df = N − 2.

There's an interesting feature of that t that I am surprised to find rarely mentioned in textbooks: it depends only on N and r, so to look up the significance of an r we only need to know N. In other words, for any given r and N we can readily compute p, and yet I have rarely seen tables of critical r in textbooks. For example, without computing a t, if we have 100 observations, any r greater than or equal to .20 is significant at p < .05 (see the table below).

To test the difference between two independent r's, we use a transformation called "Fisher's r-to-z" on each of the two coefficients:

r' = (1/2) · ln[(1 + r)/(1 − r)]

and r' now has the standard error 1/√(N − 3). The difference can thus be tested via:

z = (r'1 − r'2) / √[ 1/(N1 − 3) + 1/(N2 − 3) ]

using the standard normal distribution. However, a more sophisticated approach is of course to include an interaction term in an lm() model.
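Here is a sketch of that z test written as a small throwaway function (bm.rdiff is my own name, not a base R function, and the sample values in the example call are made up):

bm.rdiff <- function(r1, n1, r2, n2) {
  z1 <- 0.5 * log((1 + r1) / (1 - r1))             # Fisher r-to-z for each coefficient
  z2 <- 0.5 * log((1 + r2) / (1 - r2))
  z  <- (z1 - z2) / sqrt(1/(n1 - 3) + 1/(n2 - 3))  # difference over its standard error
  c(z = z, p = 2 * pnorm(-abs(z)))                 # two-tailed p from the standard normal
}
bm.rdiff(.54, 60, .20, 80)                         # e.g., r=.54 (N=60) vs. r=.20 (N=80)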

Smallest r significant at p < .05, for various sample sizes N:

   r      N
 .07   1000
 .09    500
 .20    100
 .28     50
 .32     40
 .37     30
 .45     20
 .64     10
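The table can be regenerated in a couple of lines: solving t = r√(N − 2)/√(1 − r²) for r at the critical t gives r = t/√(t² + N − 2). The sketch below rounds up to two decimals, i.e. it reports the smallest two-decimal r that reaches p < .05, which should reproduce the values above:

crit.r <- function(N, alpha = .05) {
  tc <- qt(1 - alpha/2, df = N - 2)               # two-tailed critical t
  ceiling(100 * tc / sqrt(tc^2 + N - 2)) / 100    # smallest r (2 decimals) significant at alpha
}
sapply(c(1000, 500, 100, 50, 40, 30, 20, 10), crit.r)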


7—Binary logistic regression

Imagine that you want to predict relapse in alcoholics based on the number of counseling sessions attended. You have the data in the table below, where numsessions is the number of sessions (standardized) and relapse is whether the patient relapsed (1) or not (0). You can graph the data as a scatterplot of relapse against numsessions, adding the fitted curve obtained through the logit model.

Although you could technically run a regular regression through this, you'd be violating a number of distributional assumptions, and the equation would be fairly meaningless because you'd get many impossible predicted values. Note also that you couldn't simply choose a cutpoint, because of the overlap of the two distributions. The technique to use when the DV is categorical like this is logistic regression. Notice that the dependent variable in this model (relapse) is categorical and binary, in that it can only take two values. I enter numsessions in the model as a predictor and get the following output:

> glm(relapse~numsessions,family="binomial")

Call:  glm(formula = relapse ~ numsessions, family = "binomial")

Coefficients:
(Intercept)  numsessions  
    -0.1866      -2.0925  

Degrees of Freedom: 19 Total (i.e. Null);  18 Residual
Null Deviance:	    27.73 
Residual Deviance: 17.64 	AIC: 21.64

Note that the equation looks like a standard regression equation. It turns out that it is, except that instead of predicting relapse directly, it predicts log[p/(1 − p)] (the "logit"), where p is the probability of relapse:

log[p/(1 − p)] = −2.0925·numsessions − 0.1866

How do we interpret that log[p/(1 − p)] number? The quantity p/(1 − p) is called the odds. In common parlance (and horse-betting jargon) it is noted p:(1 − p). If you have a 50% chance of winning the lottery, your odds are 1:1. If you have a 33% chance, your odds are 1:2. If you have a 10% chance, your odds are 1:9, and so on.

numsessions   relapse
   -1.74      No relapse
    1.15      No relapse
    1.87      No relapse
     .62      No relapse
    -.47      Relapse
     .88      No relapse
    -.99      Relapse
     .81      No relapse
     .44      No relapse
   -1.35      Relapse
     .52      No relapse
     .14      Relapse
    -.49      Relapse
     .60      No relapse
    -.03      No relapse
    -.43      Relapse
    -.94      Relapse
    -.06      Relapse
    -.84      Relapse
   -1.81      Relapse


So now we know how to use the logistic regression output. Remember that numsessions is standardized. Imagine that someone had the average number of sessions (numsessions = 0). The equation gives

log[p/(1 − p)] = −0.1866, or p/(1 − p) = exp(−0.1866) ≈ 0.83,

suggesting that that person is less likely (.83 < 1) to relapse than to be fine – the probability of relapse is only 83% of the probability of no relapse. So at the average number of sessions you can be cautiously optimistic. At numsessions = 1, i.e., 1 SD above the mean, the odds of relapse are

p/(1 − p) = exp(−0.1866 − 2.0925) = exp(−2.2791) ≈ 0.10,

so the person is ten times more likely to be fine than to relapse. At numsessions = −1, i.e., 1 SD below the mean, the odds of relapse are

p/(1 − p) = exp(−0.1866 + 2.0925) = exp(1.9059) ≈ 6.7,

so the person is over six times more likely to relapse than to be fine.
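A sketch of these computations, assuming the relapse (coded 0/1) and numsessions variables from the table above are in the workspace; predict() on a glm returns the log-odds, exp() turns them into odds, and plogis() into probabilities:

fit.logit <- glm(relapse ~ numsessions, family = "binomial")
logit <- predict(fit.logit, data.frame(numsessions = c(-1, 0, 1)))  # log-odds at -1, 0, +1 SD
exp(logit)       # odds of relapse: roughly 6.7, 0.83 and 0.10
plogis(logit)    # the same predictions expressed as probabilities of relapse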


Handout 5 – Part II : Introduction to Multiple Regression

A. Extending simple linear regression to multiple predictors

The multiple regression approach extends the logic from 1 to p predictors. Often in multiple regression the outcome y is called x0 and the intercept a is called b0, so with p predictors the equation takes the form:

x̂0 = b0 + b1x1 + b2x2 + … + bpxp

In particular, with two predictors we have:

x̂0 = b0 + b1x1 + b2x2

We can then predict x0 (a.k.a. y) from the xi's. Imagine the following data: we have, for each subject, college GPA, SAT (converted to a percentage), a rating from 1 to 5 for letters of recommendation, and also high-school GPA. Imagine that we're an admissions officer trying to come up with the best predictor of college GPA.

We can start with a simple bivariate regression using SAT:

> summary(lm(coll_gpa~sat,data=data0))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.57321    0.44521   3.534 0.002373 ** 
sat          0.02228    0.00547   4.072 0.000715 ***
---
Residual standard error: 0.3043 on 18 degrees of freedom
Multiple R-squared: 0.4795,	Adjusted R-squared: 0.4506 
F-statistic: 16.58 on 1 and 18 DF,  p-value: 0.0007148

Alternatively, we could have decided to include not only the SAT, but also the recommendation letter ratings. We use the same R command to obtain:

> summary(lm(coll_gpa~sat+recs,data=data0))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.230924   0.396858   3.102 0.006481 ** 
sat          0.020041   0.004711   4.254 0.000535 ***
recs         0.155883   0.055176   2.825 0.011669 *  
---
Residual standard error: 0.2583 on 17 degrees of freedom
Multiple R-squared: 0.6458,	Adjusted R-squared: 0.6042 
F-statistic:  15.5 on 2 and 17 DF,  p-value: 0.0001473

Another option would have been to predict college GPA from SATs and high school GPA:

> summary(lm(coll_gpa~sat+hs_gpa,data=data0))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  1.415248   0.409492   3.456  0.00302 **
sat          0.013800   0.006254   2.207  0.04138 * 
hs_gpa       0.272462   0.122609   2.222  0.04013 * 
---
Residual standard error: 0.2756 on 17 degrees of freedom
Multiple R-squared: 0.5967,	Adjusted R-squared: 0.5492 
F-statistic: 12.58 on 2 and 17 DF,  p-value: 0.0004445

Note how introducing a second variable influences the predictive value of SAT differently in the two models. The correlation table below helps us understand why: SAT is more correlated with high-school GPA (r = .61) than it is with letters of recommendation (r = .17). So high-school GPA and SAT are to a large extent redundant predictors, whereas SAT and letters provide much more non-redundant predictive information.

ID   coll_gpa   sat (%)   recs   hs_gpa
 1      3.14      75.40      3     3.46
 2      3.49      74.84      4     3.03
 3      3.39      96.52      3     3.21
 4      3.36      77.88      4     2.80
 5      2.76      66.04      3     1.68
 6      2.79      80.67      2     2.77
 7      3.37      78.81      4     2.28
 8      4.15      96.50      4     3.15
 9      3.43      89.16      3     3.68
10      3.37      71.15      4     3.46
11      3.36      80.70      2     3.01
12      3.04      83.44      1     3.38
13      4.07      83.85      5     4.15
14      3.15      72.82      5     3.07
15      3.32      71.91      4     3.47
16      2.79      63.72      3     2.00
17      2.89      57.98      2     2.09
18      3.91     105.20      3     3.54
19      3.79     104.60      5     3.88
20      3.73      77.38      3     3.54

(coll_gpa = college GPA, sat = SAT as a percentage, recs = letter-of-recommendation rating, hs_gpa = high-school GPA)


> round(cor(data0),2)

ID coll_gpa sat recs hs_gpa

ID 1.00 0.22 0.07 0.05 0.21

coll_gpa 0.22 1.00 0.69 0.52 0.69

sat 0.07 0.69 1.00 0.17 0.61

recs 0.05 0.52 0.17 1.00 0.31

hs_gpa 0.21 0.69 0.61 0.31 1.00


Each bi coefficient (the "Estimate" column in the lm() output) captures the slope associated with xi when all other predictors are held constant. In the case of two predictors, you can think of it as the average slope for x1 at each level of x2.

As I show below, in multiple regression you essentially predict the outcome with the residuals of each predictor controlling for the other predictors. In the first graph I predict SAT with hs_gpa. The line is the part I am predicting, the spikes are the residuals. These residuals are displayed on their own on the second figure. Then in the third figure I use these residuals as predictors of coll_gpa. The slope in figure 3 is what is being captured by the coefficient for SAT in a multiple regression which includes hs_gpa as a second predictor of coll_gpa.

[Figure, three panels: (1) sat plotted against hs_gpa with the regression line and residual spikes; (2) model2$residuals plotted against hs_gpa; (3) coll_gpa plotted against model2$residuals (SAT controlling for hs_gpa), with its regression line.]

> round(cor(hs_gpa,model2$residuals),3)   # Residuals are the part of ‘sat’ which is uncorrelated with ‘hs_gpa’
[1] 0

Here’s the script we used for these figures:

par(mfrow=c(3,1),mai=c(.7,.6,.1,.1))   # Sets up my graphics area
model2<-lm(sat~hs_gpa)
plot(sat~hs_gpa)
abline(reg=model2)
segments(hs_gpa,model2$fitted.values,hs_gpa,sat)   # Draws spikes in the first panel
plot(model2$residuals~hs_gpa)
abline(reg=lm(model2$residuals~1))   # This draws a line at the mean of a sample
segments(hs_gpa,0,hs_gpa,model2$residuals)


plot(coll_gpa~model2$residuals,xlab="model2$residuals (SAT controlling for hs_gpa)")
abline(reg=lm(coll_gpa~model2$residuals))
par(mfrow=c(1,1))

Look at what happens when I predict coll_gpa with the residuals from the lm(sat~hs_gpa) model:

> summary(lm(coll_gpa~model2$resid))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.3650     0.0887  37.938   <2e-16 ***
model2$resid   0.0138     0.0090   1.533    0.143    
---
Residual standard error: 0.3967 on 18 degrees of freedom
Multiple R-squared: 0.1155,	Adjusted R-squared: 0.06638 
F-statistic: 2.351 on 1 and 18 DF,  p-value: 0.1426

Note that the coefficient when predicting coll_gpa with the residuals is .0138, the same as the coefficient for sat when hs_gpa is in the equation, even though we are doing a simple regression here. You can test the significance of a bi (i.e. that the slope is not 0) by dividing it by its standard error (R gives you the s.e. and the t for each predictor), with N − p − 1 degrees of freedom, where N is the number of subjects and p is the number of predictors in the model. With 2 predictors, df = N − 3.

Standardized regression coefficients (or betas) are the b's we would obtain if all the variables were standardized. They are easier to interpret because they are scale-free. Note that in multiple regression they no longer correspond to the correlation. You can go back and forth between the two, because

βi = bi · (si / s0)

where si is the sd of the predictor and s0 is the sd of the predicted variable. In contrast to other software, R does not give you the betas by default.
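Here is a sketch of two equivalent ways to get the betas for the sat + recs model (assuming data0 is loaded): rescale the b's by the ratio of standard deviations, or refit the model on z-scored variables.

fit  <- lm(coll_gpa ~ sat + recs, data = data0)
b    <- coef(fit)[-1]                                         # unstandardized slopes
beta <- b * sapply(data0[c("sat", "recs")], sd) / sd(data0$coll_gpa)
beta
coef(lm(scale(coll_gpa) ~ scale(sat) + scale(recs), data = data0))[-1]   # same values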

Residual variance: As before, it is essential to keep in mind the concept of residual variance. You should inspect the residuals, and you can define the residual variance as:

s²Y.X = SSResidual / (N − p − 1)

This is also called MSResidual or MSError. The parallel with ANOVA is intentional.

Multiple correlation coefficient: Represents how well the model is capturing the data. In fact, R is simply the correlation between the observed outcome and the values predicted by the model:

> cor(coll_gpa,lm(coll_gpa~sat+hs_gpa)$fitted)
[1] 0.7724583
> cor(coll_gpa,lm(coll_gpa~sat+hs_gpa)$fitted)^2
[1] 0.5966918

Most often we will look at R², because it is easy to interpret: it is the percentage of accountable variation. Check that the value above (.597) matches the R-squared listed in the lm() output for the sat + hs_gpa model. Although the field is lax about this, we should really be using the adjusted R²:

adjusted R² = 1 − (1 − R²)·(N − 1)/(N − p − 1)

For a given data set (N is fixed), an important correction factor in the formula above is the number of predictors. You can also test the significance of R²:

F = (R²/p) / [(1 − R²)/(N − p − 1)], with (p, N − p − 1) degrees of freedom.

A significant F means the whole model is predicting a chunk of the variance greater than 0.
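A sketch computing both quantities by hand for the sat + hs_gpa model and checking them against the summary(lm()) output shown earlier (assuming data0):

fit <- lm(coll_gpa ~ sat + hs_gpa, data = data0)
N   <- nrow(data0)
p   <- 2                                           # number of predictors
R2  <- summary(fit)$r.squared
1 - (1 - R2) * (N - 1) / (N - p - 1)               # adjusted R-squared
(R2 / p) / ((1 - R2) / (N - p - 1))                # F statistic for the whole model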

Here's the model with all the predictors in the mix. Compare the coefficients with the zero-order correlations in the correlation table above.

> summary(lm(coll_gpa~hs_gpa+sat+recs,data=data0))

Call:
lm(formula = coll_gpa ~ hs_gpa + sat + recs, data = data0)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33996 -0.14539 -0.04915  0.15624  0.45895 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  1.171032   0.375079   3.122  0.00657 **
hs_gpa       0.200249   0.112204   1.785  0.09328 . 
sat          0.014177   0.005519   2.569  0.02060 * 
recs         0.130288   0.053883   2.418  0.02790 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2431 on 16 degrees of freedom
Multiple R-squared: 0.7046,	Adjusted R-squared: 0.6492 
F-statistic: 12.72 on 3 and 16 DF,  p-value: 0.0001661

A useful function to know about is confint(), which gives you 95% confidence intervals on the coefficients. Try it on the full model:

> confint(lm(coll_gpa~hs_gpa+sat+recs,data=data0))
                   2.5 %     97.5 %
(Intercept)  0.375899813 1.96616432
hs_gpa      -0.037613795 0.43811176
sat          0.002477472 0.02587656
recs         0.016061439 0.24451436

We can use ANOVA to see whether adding a predictor to the model significantly changes the R² (see below). Here we see that if we drop hs_gpa, the drop in R² (from .70 to .65) is marginal, p = .09, and corresponds to the significance of the coefficient for hs_gpa. This may seem redundant with the above, but note that ANOVA enables us to test the impact of dropping/adding whole sets of predictors, i.e. to compare complex models with simpler (more parsimonious) models. The -0.18828 value is the reduction in SSResidual, or alternatively, the increase in SS explained by the model; it is divided by the change in number of predictors (here 1), and the F value is simply this change in SS divided by the MSE of the full model.

> model2<-lm(coll_gpa~hs_gpa+sat+recs,data=data0)
> model3<-lm(coll_gpa~sat+recs,data=data0)
> anova(model2,model3)
Analysis of Variance Table

Model 1: coll_gpa ~ hs_gpa + sat + recs
Model 2: coll_gpa ~ sat + recs
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1     16 0.94582                              
2     17 1.13410 -1  -0.18828 3.1851 0.09328 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The formula to test this change in R² is

F = [(N − f − 1)·(R²full − R²reduced)] / [(f − r)·(1 − R²full)]

where R²full is the R² for the full model, R²reduced is the R² for the reduced model, f is the number of predictors in the full model, r is the number of predictors in the reduced model, and N is the total number of observations. To compare model3 (whose R² appears a couple of pages back) with model2 by hand, we do the following, and we get the same F.

> length(coef(model2))-1->f
> length(coef(model3))-1->r
> length(data0$coll_gpa)->N
> summary(model2)[8][[1]]->R2f
> summary(model3)[8][[1]]->R2r
> (N-f-1)*(R2f-R2r)/((f-r)*(1-R2f))
[1] 3.185082

B. Measures of association


> round(cor(data1),2)

icecream drownings heat

icecream 1.00 0.46 0.71

drownings 0.46 1.00 0.58

heat 0.71 0.58 1.00

> round(cor(data2),2)

racetime practicetime practicetrack

racetime 1.00 0.11 0.06

practicetime 0.11 1.00 -0.91

practicetrack 0.06 -0.91 1.00

1 – Partial correlation

r01.2: the partial correlation between x0 and x1, partialling x2 out of both. A partial correlation is defined by the two variables being correlated and those being partialled out. So r01.2 is the correlation between Y (= x0) and X1 "partialling out" (or "controlling for") X2. It is obtained by regressing Y on X2 and saving the residuals, regressing X1 on X2 and saving the residuals, and then correlating the two sets of residuals.

Surprisingly, I couldn't find a partial correlation function in the base R packages (you can find one in the psych package, for example: it's called partial.r()). But this is a great illustration of the flexibility of R and of how to create a new function on the fly:

> bm.partial<-function(x,y,z) {round((cor(x,y)-cor(x,z)*cor(y,z))/sqrt((1-cor(x,z)^2)*(1-cor(y,z)^2)),2)}
> ls()
[1] "bm.partial" "data1"
> bm.partial(data1$icecream,data1$drownings,data1$heat)
[1] 0.08

# Now I am repeating it with the formula from the ‘psych’ package
> library(psych)
> partial.r(data1,1:2,3)
          icecream drownings
icecream      1.00      0.08
drownings     0.08      1.00

# Note that we obtain the same result by correlating residuals:
> cor(lm(icecream~heat,data=data1)$residuals,lm(drownings~heat,data=data1)$residuals)
[1] 0.0813568

2 – Semi-partial (or part) correlation

r0(1.2): the semi-partial or part correlation between x0 and x1, partialling x2 out of x1 only. Here we only partial the extraneous predictor out of the other predictor, but not out of the predicted variable, for which we use raw scores.

> bm.semipartial<-function(x,y,z) {round((cor(x,y)-cor(x,z)*cor(y,z))/sqrt((1-cor(y,z)^2)),2)}
> bm.semipartial(racetime,practicetime,practicetrack)
[1] 0.39

# Note that you get a very similar result by correlating a residual with ‘racetime’
# But in contrast to the partial correl, only one of the two terms is a residual here.
> cor(data2$racetime,lm(practicetime~practicetrack,data=data2)$residuals)
[1] 0.3969366

3- Breaking down the variance


[Figure: Venn diagram of the variances of X0 (the outcome) and of the two predictors X1 and X2. The regions inside X0 are labeled a (shared with neither predictor), b (shared only with X1), c (shared with both X1 and X2), and d (shared only with X2); e, f and g label the parts of X1 and X2 that fall outside X0.]

The Venn diagram above represents the variance of X0 and of two predictors, X1 and X2. Overlap between the circles represents shared variance. [This diagram is great for understanding the concepts, but is not always strictly accurate – e.g., with suppressor variables.]

It's important to note from this graph that the increase in R² when a new predictor is entered into the model is the squared semi-partial correlation for that variable.
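You can check this with the admissions data (a sketch, assuming data0 is loaded): the gain in R² when recs is added to a model already containing sat equals the squared semi-partial correlation of recs (sat partialled out of recs only).

r2.full    <- summary(lm(coll_gpa ~ sat + recs, data = data0))$r.squared
r2.reduced <- summary(lm(coll_gpa ~ sat, data = data0))$r.squared
r2.full - r2.reduced                                          # increase in R2 due to recs
cor(data0$coll_gpa, lm(recs ~ sat, data = data0)$residuals)^2 # squared semi-partial; same value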

4- How do they all relate? We've mentioned a number of ways that the association between variables can be measured. In this summary I remind you what each one is. I give formulas that are not meant for computation but to build an intuition of the relationships among the various indices. Try to understand each formula and what it captures.

r01: the Pearson product-moment correlation coefficient, or simply the correlation (also called the zero-order correlation), here between x0 and x1:

r01 = COV01 / (s0·s1) = SP01 / √(SS0·SS1)

r01.2: the partial correlation between x0 and x1, partialling x2 out of both. The numerator is easy to interpret: it's the "direct path" minus the "indirect path":

r01.2 = (r01 − r02·r12) / √[(1 − r02²)(1 − r12²)]

r0(1.2): the semi-partial or part correlation between x0 and x1, partialling x2 out of x1 only. Note that this is very similar to the partial correlation; only one term is removed from the denominator, to reflect the fact that x2 is not partialled out of x0 this time around:

r0(1.2) = (r01 − r02·r12) / √(1 − r12²)

b1: the unstandardized regression coefficient or "slope" for x1 predicting x0 – the amount of increase in x0 for each increase of one unit in x1, keeping the other predictors in the equation constant.

β1: the standardized regression coefficient or "slope" for x1. It corresponds to the unstandardized coefficient you would get if both the predictors and the predicted variable were standardized ("z-scored") – the amount of increase in x0 (in standard deviations) for each increase of one standard deviation in x1, keeping the other predictors in the equation constant.

Area       %SSY            Interpretation
a+b+c+d    100%            Total variance of X0 to be explained
b+c+d      R²0.12          Variance explained by the regression
a          1 − R²0.12      Residual variance
b+c        r01²            Squared bivariate correlation of X0 with X1
c+d        r02²            Squared bivariate correlation of X0 with X2
b          r0(1.2)²        Squared semipartial correlation for X1
d          r0(2.1)²        Squared semipartial correlation for X2
b/(a+b)    r01.2²          Squared partial correlation for X1
d/(a+d)    r02.1²          Squared partial correlation for X2


The standardized coefficient is the unstandardized coefficient corrected for scale:

β1 = b1 · (s1 / s0)

and it can also be written in terms of the semi-partial correlation:

β1 = r0(1.2) / √(1 − R²1.2…k)

So the standardized coefficient is the semi-partial correlation divided by the square root of the tolerance (see below). If the tolerance is 1, β1 and r0(1.2) are the same. Note also that these two equations enable us to express b1 as a function of the semi-partial correlation.

R0.12…k: the multiple correlation coefficient, also written simply R, is the correlation between the outcome variable and the values predicted by the regression equation (a linear combination of the predictor variables). Squared, it gives you the "R-square" that we keep talking about when evaluating regression models.

Ri.12…(i)…k: this is another use of the multiple correlation coefficient. It corresponds to predicting any predictor xi from all the other predictors in an equation, and correlating those predicted values with the real values of xi. [The subscript 12…(i)…k means "all the predictors from 1 to k except for i".] It is used to assess collinearity between predictors: 1 − R²i.12…(i)…k represents the proportion of variability in xi that is independent of the other predictors, and is referred to as the tolerance.
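A sketch of how you would compute the tolerance of one predictor, say hs_gpa in the three-predictor admissions model (assuming data0 is loaded):

R2.i <- summary(lm(hs_gpa ~ sat + recs, data = data0))$r.squared  # hs_gpa predicted by the others
1 - R2.i                                                          # tolerance of hs_gpa
1 / (1 - R2.i)                                                    # its reciprocal is the VIF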

C. Building intuition for regression models

1. With multicollinearity, R2 can be high while none of the predictors are significant!

> summary(lm(Y2~X3+X4,data=data3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -1.4557     0.7796  -1.867   0.0741 .
X3            1.1314     1.6791   0.674   0.5069  
X4            0.0366     1.6864   0.022   0.9829  
---
Residual standard error: 2.31 on 24 degrees of freedom
Multiple R-squared: 0.7945,	Adjusted R-squared: 0.7773 
F-statistic: 46.39 on 2 and 24 DF,  p-value: 5.68e-09

I am showing you this extreme example because it is such an important intuition to have in multiple regression. Here nearly 80% of the variance is explained by the model, the model as a whole is significant, F(2,24) = 46, p < .001, and yet neither predictor is anywhere near significance! Of course you know why that is the case by now, but I wanted to make sure you had seen it for yourself. The table of correlations below confirms what you might suspect: X3 and X4 are basically capturing the same variance (r > .99). It also tells you that simple regressions would most likely yield significant slopes for both X3 and X4 – we know that because, remember, r is identical to the beta in simple (but not multiple) regression.

> cor(data3[,7:9])
          X3        X4        Y2
X3 1.0000000 0.9973899 0.8913316
X4 0.9973899 1.0000000 0.8891501
Y2 0.8913316 0.8891501 1.0000000

2. Illustrating multiple regression with residuals


X1 X2 Y X2.1 X1.2

X1 1.000 0.751 0.422 0.000 0.661

X2 0.751 1.000 0.567 0.661 0.000

Y 0.422 0.567 1.000 0.378 -0.005

X2.1 0.000 0.661 0.378 1.000 -0.751

X1.2 0.661 0.000 -0.005 -0.751 1.000

Imagine that we have three variables, X1, X2 and Y, and that we will use X1 and X2 to predict Y. Compare the simple regression coefficients with the multiple regression ones:

> summary(lm(Y~X1,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -2.058     23.520  -0.087   0.9310  
X1            14.530      6.237   2.330   0.0282 *
---
Residual standard error: 61.48 on 25 degrees of freedom
Multiple R-squared: 0.1784,	Adjusted R-squared: 0.1455 
F-statistic: 5.428 on 1 and 25 DF,  p-value: 0.02819

> summary(lm(Y~X2,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -12.4663    19.9323  -0.625  0.53736   
X2            1.1639     0.3382   3.442  0.00204 **
---
Residual standard error: 55.87 on 25 degrees of freedom
Multiple R-squared: 0.3215,	Adjusted R-squared: 0.2944 
F-statistic: 11.85 on 1 and 25 DF,  p-value: 0.002042

> summary(lm(Y~X1+X2,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -12.1999    22.2745  -0.548   0.5890  
X1           -0.2571     8.7545  -0.029   0.9768  
X2            1.1755     0.5224   2.250   0.0339 *
---
Residual standard error: 57.02 on 24 degrees of freedom
Multiple R-squared: 0.3215,	Adjusted R-squared: 0.265 
F-statistic: 5.687 on 2 and 24 DF,  p-value: 0.009513

Notice how X1 is no longer significant once you enter X2. Also, the R²'s for these three models are .18, .32, and .32, so X1 doesn't add any predictive value as you go from X2 alone to X1 and X2 together. You would never do what I'm doing next; I am just doing it for illustrative purposes. I predict X1 with X2, and vice versa, making sure I save the residuals. I call these X1.2 and X2.1. They are the parts of the variance in X1 and X2 that are independent of the other predictor: X1.2 should be uncorrelated with X2, and X2.1 with X1. The residuals are the spikes from the regression line (left panel of the figure below). Once you save the residuals they are by definition unrelated to the first predictor.

[Figure, two panels: left, X2 plotted against X1 with the regression line and residual spikes; right, X2.1 (the residuals of X2 on X1) plotted against X1.]

> lm(X1~X2,data=data3)$residuals->data3$X1.2

To illustrate this, we compute the correlations between our five variables (see the table above). Notice that the correlation between X1 and X2 is high (.751**), which signals collinearity. It's not surprising, then, that the coefficients varied widely when we entered the predictors alone or conjointly. Second, notice that, as we expected, X1.2 and X2 are independent (.00), as are X2.1 and X1. As planned, these residual scores capture the part of each predictor that is independent of the other.


Partial and semi-partial ("part") correlations. The table above also shows the semi-partial or part correlations between X1 (controlling for X2) and Y [CORREL(X1.2, Y) = -.005] and between X2 (controlling for X1) and Y [CORREL(X2.1, Y) = .378]. Note that there are no equivalents of the partial correlations in that table. If we had saved the residuals when predicting Y with X1, and called them Y.1, then we would obtain CORREL(X2.1, Y.1) = .417*, which is the same as the partial correlation below. The difference between part and partial is that in a partial correlation you control for the variance explained by the third variable in both variables entering the correlation.

> partial.r(data3,2:3,1)
     X2    Y
X2 1.00 0.42
Y  0.42 1.00

Notice what happens when we run simple regressions using the residual scores (below): we now get the same slopes as we did using X1 and X2 in the multiple regression. The R²'s, by the way, are .000 and .143. Notice that .14 is the difference between .32 (when X1 and X2 were in the model) and .18 (when only X1 was in the model), so it's an estimate of the proportion of variance in Y that is uniquely accounted for by X2.

> summary(lm(Y~X1.2,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  45.2999    13.0536   3.470   0.0019 **
X1.2         -0.2571    10.4135  -0.025   0.9805   
---
Residual standard error: 67.83 on 25 degrees of freedom
Multiple R-squared: 2.438e-05,	Adjusted R-squared: -0.03997 
F-statistic: 0.0006094 on 1 and 25 DF,  p-value: 0.9805

> summary(lm(Y~X2.1,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.2999    12.0834   3.749 0.000941 ***
X2.1          1.1755     0.5752   2.044 0.051662 .  
---
Residual standard error: 62.79 on 25 degrees of freedom
Multiple R-squared: 0.1431,	Adjusted R-squared: 0.1089 
F-statistic: 4.176 on 1 and 25 DF,  p-value: 0.05166

If we enter X1 and X2.1 together (see below), we get the slope for X1 that we obtained in the simple regression, and the slope for X2.1 corresponds to the slope of X2 in the multiple regression. Note also that because the two variables are independent by definition, the slope for X2.1 is not perturbed by the introduction of X1 (compare with above), though the s.e. of the slope does change, and therefore the t does too. The R² is .322, as you would expect.

> summary(lm(Y~X1+X2.1,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -2.0579    21.8137  -0.094   0.9256  
X1           14.5303     5.7842   2.512   0.0191 *
X2.1          1.1755     0.5224   2.250   0.0339 *
---
Residual standard error: 57.02 on 24 degrees of freedom
Multiple R-squared: 0.3215,	Adjusted R-squared: 0.265 
F-statistic: 5.687 on 2 and 24 DF,  p-value: 0.009513

What would happen if we predicted Y with both X1.2 and X2.1? This sounds like it would give you a pretty clean solution, but it is misleading. As you can see in the table of correlations above, X1.2 and X2.1 are highly correlated (-.751**); in fact their correlation is the exact opposite of the one between X1 and X2 (.751**). This is always going to be the case: CORREL(X1.2, X2.1) = -CORREL(X1, X2). So when we enter them together it's not pretty (see below). The R² is the same as when you enter X1 and X2 together, .322, but the rest looks very different. We know we have multicollinearity, so we're not even sure what to make of the coefficients. What it looks like is a suppression effect: X1.2 jumps from a slope of -.257 on its own to a slope of 33*!


> summary(lm(Y~X1.2+X2.1,data=data3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.2999    10.9740   4.128 0.000381 ***
X1.2         33.2847    13.2500   2.512 0.019134 *  
X2.1          2.6663     0.7906   3.372 0.002523 ** 
---
Residual standard error: 57.02 on 24 degrees of freedom
Multiple R-squared: 0.3215,	Adjusted R-squared: 0.265 
F-statistic: 5.687 on 2 and 24 DF,  p-value: 0.009513

Here's a summary of all the regression equations above, to facilitate comparison (I have also added the equation involving X2 and X1.2). In your own work you would most likely only use something like the first three rows. The way to read this table is to eyeball down a column to see if and when a coefficient changes, and then try to understand why. Note that the R² can never get over .32.

Predictors        X1      X1.2    X2      X2.1    Intercept    R2
X1               14.5*                             -2.1        .18
X2                                1.2**            -12.5       .32
X1 and X2         -.3             1.2*             -12.2       .32
X1.2                      -.3                       45.3**     .00
X2.1                                      1.2†      45.3**     .14
X1 and X2.1      14.5*                    1.2*      -2.1       .32
X2 and X1.2               -.3     1.2**            -12.5       .32
X1.2 and X2.1            33.3*            2.7**     45.3**     .32

**: p < .01; *: p < .05; †: p < .10
