Correlation and Regression

By

Walden University Statsupport Team

March 2011

• Introduction
• Linear Correlation
• Assumptions
• Linear Regression
• Assumptions

Introduction

Correlation measures the strength and direction of the relationship between two variables. It is used as a measure of association based on assumptions such as linearity of the relationship, the same level of relationship throughout the range of the independent variable (homoscedasticity), and interval or near-interval data.

Homoscedasticity refers to constant conditional variance: the spread of one variable stays the same across the levels of the other variable (or over time, when observations are ordered in time).

Regression deals with a functional relationship between a dependent variable and one or more independent variables.

Regression analysis is used when you want to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, then logistic regression should be used.

Linear Correlation

The most commonly used measure of linear correlation is the product-moment correlation (Pearson's r).

Pearson's r is a measure of association which varies from -1 to +1, with 0 indicating no linear relationship (as in a random pairing of values) and +1 indicating a perfect positive relationship, taking the form: the more the x, the more the y, and vice versa.

A value of -1 is a perfect negative relationship, taking the form: the more the x, the less the y, and vice versa.

Since it is a measure of association, the presence of a significant linear correlation between two variables does not imply causation.
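The behavior of Pearson's r at its two extremes can be illustrated with a short computation. The following is a minimal Python sketch, not part of the SPSS workflow used in these slides; the function name `pearson_r` and the data values are made up for illustration and are not the bicycle helmet data.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation computed directly from its definition."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values (NOT the bicycle helmet data).
x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # the more the x, the more the y
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # the more the x, the less the y
print(r_positive, r_negative)
```

The first pairing gives r = +1 and the second gives r = -1, because in both cases the points fall exactly on a straight line.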

Assumptions in Using Linear Correlation

In situations where the assumptions of linear correlation are violated, correlation becomes inadequate to explain a given relationship. The three crucial assumptions in linear correlation are:

1. Normality

2. Linearity

3. Homoscedasticity

The assumption of normality requires that the distribution of both variables approximates the normal distribution and is not skewed in either the positive or the negative direction.

The linearity assumption requires that the relationship between the two variables is linear and proportional.

The homoscedasticity assumption requires the variance to remain constant over time for each variable studied. In other words, it calls for constancy of the variance of a measure over the levels of the factor under study.
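As an informal screen for the normality assumption, the sample skewness of each variable can be inspected: values near zero are consistent with a symmetric distribution, while clearly positive or negative values point to skew. The sketch below is only an illustration in Python; the function name `sample_skewness` and both data lists are hypothetical, and skewness alone is not a formal test of normality.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical values chosen only to show the two cases.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_skewed)
print(skew_symmetric, skew_right)
```

The symmetric list has zero skewness, while the list with one large value has positive skewness.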

Let us look at the linear relationship between the percent of students receiving reduced-fee lunch and the percent of students wearing bicycle helmets. Here the X variable is socioeconomic status, measured as the percentage of children in a neighborhood receiving free or reduced-fee lunches at school. The Y variable is bicycle helmet use, measured as the percentage of bicycle riders in the neighborhood wearing helmets.

The bicycle data is shown in the next slide. The first step in conducting linear correlation analysis is to use a scatter plot to visually inspect the pattern of the relationship between the two variables.

To generate a scatter plot in SPSS, do the following:

Graphs > Legacy Dialogs > Scatter/Dot… and then click Simple Scatter. Click the Define button, move percent receiving reduced-fee lunch to the X-axis and percent wearing helmets to the Y-axis, and finally click OK.

Data on the relationship between percent receiving reduced-fee lunch and percent wearing helmets.

Simple Scatter Plot Selected

X-axis and Y-axis variables selected for scatter plot

Scatter plot of percent receiving reduced-fee lunch and percent wearing helmets

The scatter plot looks fairly linear. The direction of the relationship is such that the two variables are inversely related.

We also observe some outliers in the scatter plot. An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

To obtain the linear correlation coefficient, do the following in SPSS:

Analyze > Correlate > Bivariate, then move both variables to the Variables box and click OK.

You will obtain the output shown in the Correlations table.

A demonstration on how to execute the linear correlation coefficient calculation in SPSS

Demonstration on how to select the variables for which correlation is to be determined.

Correlations

                                               Percent receiving        Percent wearing
                                               reduced or free meals    helmets
Percent receiving       Pearson Correlation    1                        -.581*
reduced or free meals   Sig. (2-tailed)                                 .037
                        N                      13                       13
Percent wearing         Pearson Correlation    -.581*                   1
helmets                 Sig. (2-tailed)        .037
                        N                      13                       13

*. Correlation is significant at the 0.05 level (2-tailed).

As can be seen in the Correlations table above, the Pearson correlation is -0.581 and its p-value is 0.037, which indicates that there is a statistically significant linear relationship between percent receiving reduced-fee lunch and percent wearing bicycle helmets. Note that the negative sign indicates that the relationship is an inverse one. That means neighborhoods with a lower percentage of students receiving reduced-fee lunch have a higher percentage of students wearing helmets, and vice versa.
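The significance reported for r can be cross-checked by hand with the usual t statistic for a correlation coefficient, t = r·sqrt(n − 2) / sqrt(1 − r²), on n − 2 degrees of freedom. The Python sketch below uses only the r and N values printed in the Correlations table; the critical value 2.201 is the two-tailed 5% point of Student's t with 11 degrees of freedom taken from a standard t table, and the variable names are illustrative.

```python
import math

r = -0.581   # Pearson correlation reported in the Correlations table
n = 13       # N reported in the Correlations table

df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Two-tailed 5% critical value of Student's t with 11 degrees of freedom,
# taken from a standard t table (not computed here).
t_crit = 2.201

significant = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
print("significant at the 0.05 level:", significant)
```

The resulting t is roughly -2.37, whose absolute value exceeds 2.201, which is consistent with the reported p-value of 0.037 being below 0.05.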

Linear Regression

Linear regression models the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).

Regression is better suited than correlation for studying samples in which the investigator fixes the distribution of X. That means the investigator can control changes in the level of X so as to examine corresponding changes in Y.

The most common method for fitting a regression line is the method of least squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the straight line.
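For a single explanatory variable, the least-squares line has a closed form: b = Sxy / Sxx and a = mean(Y) − b·mean(X). The following Python sketch illustrates that calculation; the function names and the five data points are hypothetical (they are not the bicycle helmet data), and SPSS performs the equivalent computation internally.

```python
def least_squares_fit(x, y):
    """Return (a, b) for the line Y = a + bX that minimizes squared vertical deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    return a, b

def sum_squared_residuals(x, y, a, b):
    """Sum of squared vertical deviations of the points from the line a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, for illustration only.
x = [10, 20, 30, 40, 50]
y = [42, 35, 37, 28, 30]

a, b = least_squares_fit(x, y)
best = sum_squared_residuals(x, y, a, b)
print(f"Y = {a:.2f} + ({b:.2f})X, sum of squared residuals = {best:.2f}")

# Any other line does worse, e.g. after nudging the intercept or the slope.
print(sum_squared_residuals(x, y, a + 1, b) > best)
print(sum_squared_residuals(x, y, a, b + 0.05) > best)
```

Nudging either coefficient away from the fitted values increases the sum of squared residuals, which is what "best-fitting" means under the least-squares criterion.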

Assumptions of Linear Regression

There are four principal assumptions which justify the use of linear regression models:

1. linearity of the relationship between dependent and independent variables

2. independence of the errors (no serial correlation)

3. homoscedasticity (constant variance) of the errors
   (a) versus time
   (b) versus the predictions (or versus any independent variable)

4. normality of the error distribution.

Nonlinearity can be detected by plotting the observed versus predicted values or by plotting residuals versus predicted values, which are a part of standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot or a horizontal line in the latter plot. Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.

The best way to check the independence assumption is to examine the autocorrelation plot of the residuals. Most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at roughly plus-or-minus 2-over-the-square-root-of-n, where n is the sample size. The Durbin-Watson statistic can also help to test for significant residual autocorrelation.
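Both checks can be sketched in a few lines of Python. The Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares; values near 2 suggest no first-order autocorrelation, while values toward 0 or 4 suggest positive or negative autocorrelation. The function names and the residuals below are hypothetical, chosen to show a strongly alternating pattern.

```python
import math

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def lag_autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    den = sum((e - mean) ** 2 for e in residuals)
    num = sum((residuals[t] - mean) * (residuals[t - lag] - mean) for t in range(lag, n))
    return num / den

# Hypothetical residuals that alternate in sign (negative serial correlation).
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw = durbin_watson(residuals)
r1 = lag_autocorrelation(residuals, 1)
band = 2 / math.sqrt(len(residuals))   # approximate 95% band: +/- 2 / sqrt(n)

print(f"Durbin-Watson = {dw:.3f}")
print(f"lag-1 autocorrelation = {r1:.3f}, band = +/-{band:.3f}")
print("lag-1 outside band:", abs(r1) > band)
```

For these alternating residuals the lag-1 autocorrelation falls outside the band and the Durbin-Watson statistic is close to 4, both pointing to a violation of independence.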

Violations of the homoscedasticity assumption can be detected by looking at plots of residuals versus predicted values. Residuals that get larger (i.e., more spread out), either as a function of time or as a function of the predicted value, suggest the presence of heteroscedasticity. A plot of residuals versus some of the independent variables might also help to discern the presence of heteroscedasticity.

A check for violation of the normality assumption can be done with a normal probability plot of the residuals. The normal probability plot is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line.
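The coordinates of such a plot can be computed with Python's standard library, as a rough sketch. The function name, the residual values, and the plotting positions (i + 0.5) / n are assumptions made for illustration; statistical packages may use slightly different plotting positions.

```python
from statistics import NormalDist, mean, stdev

def normal_probability_points(residuals):
    """Pair each ordered residual with the matching fractile of a normal
    distribution that has the same mean and standard deviation."""
    n = len(residuals)
    reference = NormalDist(mean(residuals), stdev(residuals))
    ordered = sorted(residuals)
    # Plotting positions (i + 0.5) / n: one common convention, assumed here.
    return [(reference.inv_cdf((i + 0.5) / n), e) for i, e in enumerate(ordered)]

# Hypothetical residuals, for illustration only.
residuals = [-3.1, 0.4, 1.8, -0.9, 2.6, -1.5, 0.7]

points = normal_probability_points(residuals)
for theoretical, observed in points:
    print(f"{theoretical:7.2f} {observed:7.2f}")
```

Plotting the second column against the first gives the normal probability plot; points that stay close to the line "observed = theoretical" support the normality assumption.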


The regression techniques are illustrated below using the bicycle data. The regression model and its parameter estimates can be generated in SPSS by clicking:

Analyze > Regression > Linear, then move percent receiving reduced-fee lunch to the Independent(s) box and percent wearing helmets to the Dependent box. Then click OK.

This gives us the important outputs such as the Model Summary table, the ANOVA table, and the Coefficients table.

A screen demonstrating the steps for running linear regression.

Demonstration of how to pick the dependent and independent variables for fitting the linear regression model.

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .581a   .338       .277                14.2824

a. Predictors: (Constant), Percent receiving reduced or free meals

ANOVAb

Model           Sum of Squares   df   Mean Square   F       Sig.
1  Regression   1143.847         1    1143.847      5.607   .037a
   Residual     2243.842         11   203.986
   Total        3387.689         12

a. Predictors: (Constant), Percent receiving reduced or free meals

b. Dependent Variable: percent wearing helmets
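The entries of the Model Summary and ANOVA tables are linked by simple arithmetic, which can be verified from the sums of squares alone. The Python sketch below uses only numbers printed in the ANOVA table; the variable names are illustrative.

```python
import math

# Values copied from the ANOVA table.
ss_regression = 1143.847
ss_residual = 2243.842
df_regression = 1
df_residual = 11

ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1          # 13 observations

r_square = ss_regression / ss_total          # share of variability explained
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
mean_square_residual = ss_residual / df_residual
f_statistic = (ss_regression / df_regression) / mean_square_residual
std_error_estimate = math.sqrt(mean_square_residual)

print(f"R Square          = {r_square:.3f}")
print(f"Adjusted R Square = {adjusted_r_square:.3f}")
print(f"F                 = {f_statistic:.3f}")
print(f"Std. Error        = {std_error_estimate:.4f}")
```

The computed values agree with the tables: R Square .338, Adjusted R Square .277, F 5.607, and Std. Error of the Estimate 14.2824.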

Coefficientsa

                                              Unstandardized Coefficients   Standardized Coefficients
Model                                         B         Std. Error          Beta                        t        Sig.
1  (Constant)                                 43.638    6.282                                           6.946    .000
   Percent receiving reduced or free meals    -.331     .140                -.581                       -2.368   .037

a. Dependent Variable: percent wearing helmets

Linear regression model output in which percent wearing helmets is estimated as a function of percent receiving reduced-fee lunch.

Interpretation of the fitted regression model output:

1. The Model Summary table indicates that the R square value is 0.338 ≈ 0.34. This can be viewed as a poor model fit, since it means that only about 34% of the variability in percent wearing helmets is explained by percent receiving reduced-fee lunch.

2. The ANOVA table indicates that the fitted regression model is statistically significant, since the p-value is 0.037, which is less than 0.05.

3. The Coefficients table shows that the intercept is 43.638 and the slope is -0.331. The p-value for the slope is 0.037, which is less than 0.05. Therefore, percent receiving reduced-fee lunch is a significant predictor of percent wearing helmets.

The slope of the regression model is interpreted as the average change in Y per unit change in X. In this case, the slope of -0.331 predicts about 0.33 fewer helmet users per 100 bicycle riders for each additional percentage point of children receiving reduced-fee meals.
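The fitted equation, Y = 43.638 − 0.331X, can be used to predict helmet use for a given value of X. The Python sketch below uses the intercept and slope from the Coefficients table; the function name and the X values passed to it are hypothetical, and predictions are only meaningful within the range of X observed in the data.

```python
INTERCEPT = 43.638   # (Constant) from the Coefficients table
SLOPE = -0.331       # coefficient for percent receiving reduced or free meals

def predict_percent_wearing_helmets(percent_reduced_fee_lunch):
    """Predicted percent wearing helmets from the fitted line Y = a + bX."""
    return INTERCEPT + SLOPE * percent_reduced_fee_lunch

# Hypothetical neighborhoods, for illustration only.
for pct in (10, 20, 30):
    print(f"X = {pct}%: predicted Y = {predict_percent_wearing_helmets(pct):.1f}%")

# A one-point increase in X changes the prediction by the slope.
change = predict_percent_wearing_helmets(21) - predict_percent_wearing_helmets(20)
print(f"change per percentage point = {change:.3f}")
```

The last line shows the interpretation of the slope directly: each additional percentage point of X lowers the predicted Y by 0.331 percentage points.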

Final Remarks

In regression analysis, residual analysis and the identification of outliers and influential points are crucial. For instance, in this dataset, observation 13 was identified as an outlier from the scatter plot made earlier. If we remove this observation and refit the regression model, the model parameter estimates change considerably.
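The effect of a single outlying observation on the fitted line can be demonstrated by fitting the model with and without it. The Python sketch below uses made-up data in which the last point is deliberately out of line with the others; it does not reproduce the bicycle data or observation 13, and the function name `fit_line` is illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical data: the last point is deliberately far from the trend.
x = [10, 20, 30, 40, 50, 60]
y = [40, 36, 33, 29, 26, 60]

a_all, b_all = fit_line(x, y)                 # fit using every observation
a_trim, b_trim = fit_line(x[:-1], y[:-1])     # fit after dropping the last one

print(f"all points:      Y = {a_all:.2f} + ({b_all:.3f})X")
print(f"outlier removed: Y = {a_trim:.2f} + ({b_trim:.3f})X")
```

In this made-up example a single point is enough to flip the sign of the slope, which is why outliers and influential points need to be examined before interpreting a fitted model.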

A thorough analysis of the effects of outliers and influential points will be covered under multiple regression in Week 12.

It is also important to note that statistical associations are not always causal. The distinction between causal and non-causal associations in health and disease has considerable practical relevance.