Lecture 6: Multiple Regression Laura McAvinue School of Psychology Trinity College Dublin

Post on 21-Dec-2015

Lecture 6: Multiple Regression

Laura McAvinue

School of Psychology

Trinity College Dublin

Previous Lectures

• Relationship between two variables

• Correlation
– Measure of strength of association between two variables

• Simple linear regression
– Measure of the ability of one variable (X) to predict the other variable (Y)
– Computes a regression equation that describes the relationship between the response variable (Y) and the predictor variable (X) by expressing Y as a function of X

Multiple Regression

• Used when there is more than one predictor variable

• Two purposes
– To predict Y, given a combination of predictor variables
– To assess the relative importance of each predictor variable in explaining the response variable Y

Regression Equations

Simple Linear Regression: Ŷ = bX + a

Multiple Regression: Ŷ = a + b1X1 + b2X2 + … + bkXk

b1 = Regression coefficient for first predictor variable, X1

b2 = Regression coefficient for second predictor variable, X2

a = Intercept, value of Y when all predictor variables are 0
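
As a minimal sketch of how the intercept a and the coefficients b1 … bk in the multiple regression equation can be estimated, the following pure-Python example solves the least-squares normal equations on a small made-up dataset. The data and the helper name `fit_ols` are illustrative assumptions, not part of the lecture; in practice a statistics package such as SPSS performs this step.

```python
def fit_ols(X, y):
    """Least-squares fit of y = a + b1*x1 + ... + bk*xk.

    X is a list of rows of predictor values; returns [a, b1, ..., bk].
    Solves the normal equations by Gauss-Jordan elimination.
    """
    rows = [[1.0] + list(r) for r in X]  # the leading 1 yields the intercept a
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]


# Made-up data generated from Y = 2 + 3*X1 - 1*X2 (no noise)
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [2 + 3 * x1 - x2 for x1, x2 in X]

a, b1, b2 = fit_ols(X, y)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Because the data contain no noise, the fit recovers the generating values of a, b1 and b2 up to rounding error.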

Statistical Models

• Running a regression analysis is not a simple matter of inputting data, clicking a button and obtaining a ‘fixed’ model of the data

• You create the model of your data
– Subjective process in many respects
– You shape the model you create
– Your job is to create the model that best describes the data

Multiple Regression

• Assessing the relative contribution of each predictor variable to the response variable
– Which variable contributes most?
– Which is the second biggest predictor?
– Which variables don’t seem to contribute to prediction?

• Problem
– The order in which you input the variables into the analysis influences the model
– Variable entered first is attributed more variance
– By the time the last variable is entered, there might be very little variance left to explain

Multiple Correlation

The predictor variables are correlated with each other and with the response variable. Which predictor variable gets credit for this shared variance?

[Venn diagram: variance in Y related to X1; variance in Y related to X2; variance in Y related to the variance shared between X1 & X2. Which variable gets credit, X1 or X2?]

Different Methods of Multiple Regression

• Hierarchical Regression

• Entry / Standard Regression

• Sequential Methods
– Forward Addition
– Backward Selection
– Stepwise

• Combinatorial Approach

Hierarchical Regression

• You decide the order in which the variables are entered

• Based on theory / prior research

• Allows you to assess whether each predictor adds anything to the model, given the predictors that are already in the model
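
The idea of asking what a predictor adds, given what is already in the model, can be sketched as a change in R Square between two nested models. The example below is only an illustration under stated assumptions: the data are made up, the two predictors are deliberately correlated, and the helper names `fit_ols` and `r_squared` are hypothetical. It also shows the order-of-entry problem: X1 is credited with much more variance when entered first than when entered after X2.

```python
def fit_ols(X, y):
    """Least-squares coefficients [a, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]


def r_squared(X, y):
    """Proportion of variance in y explained by the predictors in X."""
    coefs = fit_ols(X, y)
    pred = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data: x1 and x2 are correlated with each other and with y
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 3, 2, 5, 4, 6]
y = [2.1, 4.9, 5.2, 8.8, 9.1, 12.0]

r2_x1 = r_squared([[v] for v in x1], y)
r2_x2 = r_squared([[v] for v in x2], y)
r2_both = r_squared([[u, v] for u, v in zip(x1, x2)], y)

x1_entered_first = r2_x1              # variance credited to X1 at step 1
x1_entered_second = r2_both - r2_x2   # R Square change when X1 follows X2
print(f"X1 first:  {x1_entered_first:.3f}")
print(f"X1 second: {x1_entered_second:.3f}")
```

The total R Square of the two-predictor model is the same whichever order is used; only the split of credit between the predictors changes.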

Entry / Standard Regression

• Computer package enters all predictor variables into the model simultaneously
– Creates a regression equation including all predictor variables
– Allows us to assess the unique contribution of each predictor variable when all other variables are held constant

• Advantages & Disadvantages
– Easy to see which variables significantly predict the response variable
– May not create the best model for predicting Y as it will include variables that don’t significantly predict Y

Sequential Models

• Aim to create the ‘best model’
– The combination of variables that best predicts the response variable

• Build several models in a series of steps, adding or deleting variables at each step, depending on their contribution to predicting the response variable

• Final model includes only variables which significantly and uniquely predict the response variable

Sequential Methods

• Forward Addition

• Begins with only one variable in the model
– The variable that makes the biggest contribution to the response variable (highest r)

• Adds the variable with the next highest contribution

• Continues to add variables until there are no more variables that make a significant contribution to the response variable over and above the variables that are already in the equation

Sequential Methods

• Backward Selection
– Begins with all predictor variables in the model and successively deletes variables until only significant ones remain

• Stepwise Regression
– Similar to previous two but more versatile
– Generally moves forward, adding significant variables, but can move backward to eliminate a variable if it no longer significantly predicts when another variable is added
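
The forward addition procedure can be sketched in a few lines. This is a simplified illustration, not how SPSS implements it: a real package admits a variable on the basis of a significance test, whereas the sketch below uses a minimum gain in R Square (`min_gain`, an arbitrary assumed threshold) as a stand-in. The data and all function names are hypothetical.

```python
def fit_ols(X, y):
    """Least-squares coefficients [a, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]


def r_squared(X, y):
    coefs = fit_ols(X, y)
    pred = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def forward_selection(columns, y, min_gain=0.01):
    """Return predictor indices in order of entry.

    ASSUMPTION: a variable enters only if it raises R Square by at least
    min_gain; this replaces the significance test used by real packages.
    """
    selected, current_r2 = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        def r2_with(j):
            idx = selected + [j]
            X = [[columns[i][row] for i in idx] for row in range(len(y))]
            return r_squared(X, y)
        best = max(remaining, key=r2_with)
        gain = r2_with(best) - current_r2
        if gain < min_gain:
            break  # no remaining variable adds enough: stop
        selected.append(best)
        remaining.remove(best)
        current_r2 += gain
    return selected


# Made-up data: y depends on x1 and x2 only; x3 is irrelevant
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 3, 2, 5, 4, 6]
x3 = [3, 1, 4, 1, 5, 9]
y = [u + v for u, v in zip(x1, x2)]

order = forward_selection([x1, x2, x3], y)
print("Entered (by index):", order)
```

On this toy dataset the two relevant predictors enter and the irrelevant one is left out. Backward selection and stepwise regression would reuse the same ingredients with a deletion step.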

Sequential Methods

• Drawbacks
– Inclusion in the model depends on mathematical criterion rather than psychological theory or research

– Variable selection could depend upon tiny differences in correlation between each predictor variable and the response variable

• Slight numerical differences could therefore lead to major differences in theoretical interpretation

– Difficult to replicate results

Combinatorial Methods

• Best Subsets Method

• Computes models with all possible combinations of the predictor variables and chooses the model that explains most variance in the response variable
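
A best subsets search can be sketched by looping over every combination of predictors. One caveat motivates a design choice here: raw R Square never decreases when a predictor is added, so it would always favour the full model. The sketch therefore ranks models by adjusted R Square and lets ties go to the smaller model. That criterion, the made-up data and the function names are all assumptions for illustration; packages offer several alternative criteria.

```python
from itertools import combinations


def fit_ols(X, y):
    """Least-squares coefficients [a, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]


def r_squared(X, y):
    coefs = fit_ols(X, y)
    pred = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def best_subset(columns, y):
    """Try every non-empty subset of predictors; rank by adjusted R Square."""
    n = len(y)
    best, best_adj = None, float("-inf")
    for k in range(1, len(columns) + 1):          # smaller models first
        for idx in combinations(range(len(columns)), k):
            X = [[columns[i][row] for i in idx] for row in range(n)]
            adj = 1 - (1 - r_squared(X, y)) * (n - 1) / (n - k - 1)
            if adj > best_adj + 1e-9:             # ties go to the smaller model
                best, best_adj = idx, adj
    return best, best_adj


# Made-up data: y depends on x1 and x2 only; x3 is irrelevant
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 3, 2, 5, 4, 6]
x3 = [3, 1, 4, 1, 5, 9]
y = [u + v for u, v in zip(x1, x2)]

best, best_adj = best_subset([x1, x2, x3], y)
print("Best subset (by index):", best)
```

With k predictors there are 2^k − 1 candidate models, so this approach becomes expensive quickly as predictors are added.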

Critical Considerations for MR

• Sample size

• Distribution requirements: Residuals
– Residuals must be normally distributed

• Outliers

• Multi-collinearity

Sample Size

• Ratio of cases to predictors should be substantial

• Stevens (1996) advised about 15 participants per predictor variable
– Size matters: The more people in your sample the better the chance of the results being replicated

• However, an even bigger ratio is needed when
– Response variable has skewed data distribution
– Poor reliability in measures: substantial measurement error reduces size of true relationships of variables
– Stepwise methods (45-50 participants per predictor)

Residuals

• Recall the Method of Least Squares
– Fits the regression line by minimising the prediction error of the line
– Minimises the sum of squares of the residuals, Σ(Y − Ŷ)²

• Fits a line of the form

Y = a + b1X1 + b2X2 + … + bkXk + e

Assumes: Y = Fit + noise

Residuals

• Method of Least Squares models the noise (e) in the data using the normal distribution

– Assumes the noise is normally distributed with mean of 0 and

variance σ2

• If this assumption is violated, the results of your regression analysis may not be valid

– You need to check this by plotting the residuals
– Standardised Residual Plots
• Histogram
• Normal Probability Plot

Histogram

[Figure: histogram; x-axis: variable (0 to 1000), y-axis: Frequency (0 to 70); Mean = 72.5765, Std. Dev. = 123.52254, N = 85]

Normal Probability Plot

Plots the residual value that was obtained for each data point (observed) against the value you would expect if the residuals were normally distributed (expected)

Should be a straight diagonal line
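
To make the observed-versus-expected idea concrete, here is a minimal sketch that computes the points of a normal probability plot using only the Python standard library. The residual values are made up, the function name is hypothetical, and the plotting position (i − 0.5) / n is one common convention that is assumed here; statistics packages may use slightly different formulas or plot cumulative probabilities instead.

```python
from statistics import NormalDist, mean, stdev


def normal_probability_points(residuals):
    """Return (expected, observed) z-score pairs for a normal probability plot."""
    m, s = mean(residuals), stdev(residuals)
    observed = sorted((e - m) / s for e in residuals)
    n = len(observed)
    # Expected z-score of the i-th smallest of n normally distributed values,
    # using the (i - 0.5) / n plotting position (an assumed convention).
    expected = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, observed))


residuals = [-12.0, -7.5, -3.1, -0.4, 0.8, 2.6, 5.9, 13.7]  # made-up values
points = normal_probability_points(residuals)
for exp, obs in points:
    print(f"expected {exp:+.2f}   observed {obs:+.2f}")
```

If the residuals are roughly normal, the printed pairs fall close to the line observed = expected; systematic bending away from that line suggests skew or heavy tails.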

Outliers

• Data points that lie far from the rest of the data and have large residuals

• Big influence on regression analysis

• You can check for outliers

– Scatterplots examining relationship between response variable and predictor variables separately

– Casewise diagnostics in SPSS
– Plots of the standardised residuals

Plot of Standardised Residuals

Plot of Standardized Predicted Values × Studentised Deleted Residuals

Studentised deleted residuals: residual scores divided by their standard deviation, which is calculated leaving out any suspiciously outlying data points

Based on the assumption of normality, about 99.7% of residuals should lie within +3 & −3 standard deviations

Any point outside this range is treated as an outlier

[Figure: scatterplot of standardised predicted values against studentised deleted residuals; both axes in standard deviation units, from about −3 to +4]
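
The ±3 standard deviation rule can be sketched as follows. Note the simplification: this example uses plain standardised residuals (each residual divided by the standard error of the estimate), not the studentised deleted residuals described above, which require refitting the model without each case. The residual values and function names are made up for illustration.

```python
from math import sqrt


def standardised_residuals(residuals, n_predictors):
    """Divide each residual by the standard error of the estimate."""
    n = len(residuals)
    see = sqrt(sum(e * e for e in residuals) / (n - n_predictors - 1))
    return [e / see for e in residuals]


def flag_outliers(residuals, n_predictors, cutoff=3.0):
    """Return the indices of cases whose standardised residual exceeds the cutoff."""
    z = standardised_residuals(residuals, n_predictors)
    return [i for i, zi in enumerate(z) if abs(zi) > cutoff]


# Made-up residuals from a model with 2 predictors: 28 small, 2 extreme
residuals = [2.0, -2.0] * 14 + [-30.0, 30.0]
flagged = flag_outliers(residuals, n_predictors=2)
print("Cases outside +/-3 SD:", flagged)
```

Flagged cases are candidates for inspection rather than automatic deletion; a data-entry error and a genuinely unusual participant call for different responses.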

Multi-Collinearity

• Occurs when predictor variables are highly correlated with one another

– High bivariate correlations (.7 / .8 or above)
– High multivariate correlation

• Not a desired feature of the dataset

– Some predictor variables are redundant
– Statistically, leads to unstable results

Multi-Collinearity

• To assess whether multi-collinearity is present

– Examine the bivariate correlations between predictor variables
– Tolerance Statistic
• 1 – squared multiple correlation (R²) between each predictor variable and all the others
• If tolerance is low, then the multiple correlation must be high and multi-collinearity is a problem

• Solution

– Leave out one of the predictor variables
– Combine two highly correlated predictor variables

Let’s take an example

• Interested in a theory which suggests that a person’s level of optimism (X1) and the social support (X2) that he/she has in his/her life predicts how long he/she will survive (Y) after being diagnosed with cancer.

• Three steps to Regression Analysis:

– A. Examine the relationship between the predictor and response variables separately

– B. Perform and interpret the multiple regression

– C. Assess the appropriateness of the regression analysis

Let’s take an example

• Open the following dataset
• Software / Kevin Thomas / Multiple Regression Dataset
• Run Correlations between…
– Survival & Optimism
– Survival & Social Support

Correlations

                                         survival in weeks   LOT score
survival in weeks   Pearson Correlation  1                   .599**
                    Sig. (2-tailed)                          .000
                    N                    200                 200
Optimism            Pearson Correlation  .599**              1
                    Sig. (2-tailed)      .000
                    N                    200                 200

**. Correlation is significant at the 0.01 level (2-tailed).

Correlations

                                         survival in weeks   SS score
survival in weeks   Pearson Correlation  1                   .326**
                    Sig. (2-tailed)                          .000
                    N                    200                 200
Social Support      Pearson Correlation  .326**              1
                    Sig. (2-tailed)      .000
                    N                    200                 200

**. Correlation is significant at the 0.01 level (2-tailed).

Create Scatterplots & fit regression line

Graphs / Scatter / Simple Scatter / y = Survival, X = Predictor Variable

Fit regression line: Double click on chart, then Elements / Fit line at total

Step 2: The Multiple Regression

• Analyse, Regression, Linear

– Dependent variable: Survival
– Independent variables: Social, optimism

• Method: Enter (gives a standard multiple regression)

• Statistics

– Regression Coefficients
• Estimates
• Model fit
• Descriptives

Answer the questions on your worksheet

1. Does this model (i.e. combination of social support and optimism) significantly predict the response variable (survival in months)?

ANOVA(b)

Model 1      Sum of Squares   df    Mean Square   F        Sig.
Regression   528045.0         2     264022.487    67.733   .000(a)
Residual     767907.0         197   3898.005
Total        1295952          199

a. Predictors: (Constant), SS score, LOT score
b. Dependent Variable: survival in weeks

Yes, F (2, 197) = 67.73, p < .001
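
As a quick check on where the F ratio and its degrees of freedom come from, the sketch below recomputes them from the sums of squares in the ANOVA table. The variable names are illustrative; the numbers are those reported in the output above.

```python
# Values taken from the ANOVA table
ss_regression = 528045.0
ss_residual = 767907.0
n, k = 200, 2                 # participants, predictors

df_regression = k             # one degree of freedom per predictor
df_residual = n - k - 1       # 200 - 2 - 1 = 197

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

print(f"F({df_regression}, {df_residual}) = {f_ratio:.2f}")
```

The F ratio is reported with the regression and residual degrees of freedom (2 and 197); 199 is the total degrees of freedom, n − 1.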

Answer the questions on your worksheet

2. What percentage of variance in the response variable, survival in months, is explained by this model?

40.1% (Adjusted R Square = .401; unadjusted R Square = .407)

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .638(a)   .407       .401                62.43401

a. Predictors: (Constant), SS score, LOT score

R Square adjusted = Estimate of the population proportion of variation in survival due to optimism & support

Penalises for number of variables in the model
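
The penalty for the number of variables can be made explicit. The sketch below recomputes R Square and adjusted R Square from the sums of squares in the ANOVA table, using the standard adjustment 1 − (1 − R²)(n − 1)/(n − k − 1); the variable names are illustrative.

```python
# Values taken from the ANOVA table
ss_regression = 528045.0
ss_total = 1295952.0
n, k = 200, 2                 # participants, predictors

r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R Square          = {r_squared:.3f}")
print(f"Adjusted R Square = {adjusted_r_squared:.3f}")
```

With 200 participants and only 2 predictors the penalty is small; it grows as k approaches n.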

Answer the questions on your worksheet

3. Write the regression equation

Survival in months = 3.67(optimism) + 12.99(social support) + 4.34

Coefficients(a)

             Unstandardized Coefficients   Standardized Coefficients
Model 1      B         Std. Error          Beta                        t        Sig.
(Constant)   4.340     22.760                                          .191     .849
LOT score    3.670     .367                .558                        10.005   .000
SS score     12.987    3.226               .225                        4.026    .000

a. Dependent Variable: survival in weeks

Answer the questions on your worksheet

4. What does this equation tell us about the relationship between months of survival and social support?

As social support increases by one unit, predicted survival increases by almost 13 months, holding optimism constant

Answer the questions on your worksheet

5. Do both variables significantly predict survival in months?

Yes, for optimism, t = 10.005, p < .001 & for social support, t = 4.026, p < .001
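
Each t value is the unstandardized coefficient B divided by its standard error. The short sketch below recomputes the two t values from the B and Std. Error figures reported in the Coefficients output; the dictionary layout is just an illustrative choice.

```python
# (B, Std. Error) pairs as reported in the Coefficients output
coefficients = {
    "LOT score": (3.670, 0.367),
    "SS score": (12.987, 3.226),
}

t_values = {name: b / se for name, (b, se) in coefficients.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
```

The optimism value comes out very slightly below the reported 10.005 because B and Std. Error are themselves rounded to three decimals in the output.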

Answer the questions on your worksheet

6. Which of the predictor variables contributes most to the response variable?

Optimism has a Beta value of .558 and so, contributes more than social support, which has a Beta value of .225

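Beta = Standardized Regression Coefficient (B × SD of the predictor / SD of the response variable, i.e. the coefficient obtained when all variables are converted to z-scores)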

Can be used to compare strength of contribution of predictor variables

Answer the questions on your worksheet

7. Use the regression equation to make the following prediction: If a person has an optimism score of 10 and a social support score of 2, how long would you expect them to survive?

Survival in months = 3.67(optimism) + 12.99(social support) + 4.34

Survival in months = 3.67(10) + 12.99(2) + 4.34

Survival in months = 36.7 + 25.98 + 4.34

Survival in months = 67.02

67 months!
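
The same prediction can be written as a small function, which makes it easy to try other scores. The function name is hypothetical; the coefficients are the rounded values from the regression equation above.

```python
def predict_survival(optimism, social_support):
    """Predicted survival from the fitted regression equation."""
    return 4.34 + 3.67 * optimism + 12.99 * social_support


prediction = predict_survival(optimism=10, social_support=2)
print(f"Predicted survival: {prediction:.2f}")
```

Predictions are only trustworthy for scores within the range observed in the sample; the equation says nothing reliable about values far outside it.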

Answer the questions on your worksheet

8. What is the standard error of this prediction?

62.43 months
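
The figure quoted here is the standard error of the estimate from the Model Summary, which is the square root of the residual mean square. As a hedged check, the sketch below recomputes it from the residual sum of squares reported in the ANOVA output; the variable names are illustrative.

```python
from math import sqrt

ss_residual = 767907.0        # residual sum of squares from the ANOVA table
n, k = 200, 2                 # participants, predictors

standard_error_of_estimate = sqrt(ss_residual / (n - k - 1))
print(f"Std. Error of the Estimate = {standard_error_of_estimate:.2f}")
```

It describes the typical size of a residual across the whole sample, so it is best read as a rough guide to the uncertainty of any single prediction.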

Step 3: Assess the appropriateness of the Analysis

• Distribution of Residuals
• Outliers
• Multi-collinearity

• Re-run regression but this time…
– Statistics
• Collinearity Diagnostics
• Residuals, casewise diagnostics
– Outliers outside 3 standard deviations
– Plots
• Histogram
• Normal Probability Plot
• Plot of Standardized Predicted Values (Y: ZPRED) by Studentized Deleted Residuals (X: SDRESID)

Distribution of Residuals

Outliers

Residuals Statistics(a)

                       Minimum     Maximum    Mean        Std. Deviation   N
Predicted Value        59.9595     326.4863   193.2000    51.5121          200
Residual               -182.7696   162.2596   5.400E-15   62.1195          200
Std. Predicted Value   -2.587      2.587      .000        1.000            200
Std. Residual          -2.927      2.599      .000        .995             200

a. Dependent Variable: length of survival (months)

All residuals lie within -3 and 3 standard deviations

Note that under normality you expect only about 0.3% of cases to lie outside this range, so in a large sample, one or two such cases could be OK

Outliers

All residuals lie within -3 and 3 standard deviations

Multi-Collinearity

Bivariate correlations seem to be low (r = .182) even though significant (p = .01)

Tolerance is high, meaning that the multiple correlation is small, meaning that multi-collinearity is not a feature of this dataset

Correlations

                                 SS score   LOT score
SS score    Pearson Correlation  1          .182**
            Sig. (2-tailed)                 .010
            N                    200        200
LOT score   Pearson Correlation  .182**     1
            Sig. (2-tailed)      .010
            N                    200        200

**. Correlation is significant at the 0.01 level (2-tailed).

Coefficients(a)

             Unstandardized Coefficients   Standardized Coefficients                     Collinearity Statistics
Model 1      B         Std. Error          Beta                        t        Sig.    Tolerance   VIF
(Constant)   4.340     22.760                                          .191     .849
LOT score    3.670     .367                .558                        10.005   .000    .967        1.034
SS score     12.987    3.226               .225                        4.026    .000    .967        1.034

a. Dependent Variable: survival in weeks
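
With exactly two predictors, the multiple correlation between one predictor and "all the others" is simply their bivariate correlation, so the Tolerance and VIF figures can be checked by hand from r = .182. The sketch below does this; the variable names are illustrative, and the shortcut does not apply once there are three or more predictors.

```python
r = 0.182                     # correlation between SS score and LOT score

tolerance = 1 - r ** 2        # 1 minus the squared (multiple) correlation
vif = 1 / tolerance           # variance inflation factor

print(f"Tolerance = {tolerance:.3f}, VIF = {vif:.3f}")
```

A tolerance close to 1 (and a VIF close to 1) indicates that the predictors share very little variance, which matches the conclusion that multi-collinearity is not a feature of this dataset.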

Summary

• Multiple Regression
– To predict Y given a combination of predictor variables
– To assess the relative importance of each predictor variable in explaining the response variable

• Statistical modelling
– Different Methods

• Three steps
– Examine the relationship between the predictor and response variables separately
– Perform and interpret the multiple regression
– Assess the appropriateness of the regression analysis

• There are a number of critical considerations