correlation and regression by walden university statsupport team march 2011
TRANSCRIPT
Correlation and Regression
By
Walden University Statsupport Team
March 2011
Correlation and Regression
• Introduction
• Linear Correlation
• Assumptions
• Linear Regression
• Assumptions
Correlation measures the strength and direction of relationship between two variables. It is used as a measure of association based on assumptions such as linearity of relationships, the same level of relationship throughout the range of the independent variable (homoscedasticity) and interval or near-interval data.
Homoscedasticity refers to constant conditional variance over time.
Regression deals with a functional relationship between a dependent variable and independent variable.
Regression analysis is used when you want to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, then logistic regression should be used.
Introduction
The most commonly used measure of linear correlation is product-moment correlation (Pearson's r).
Pearson's r is a measure of association which varies from -1 to +1, with 0 indicating no relationship (random pairing of values) and 1 indicating perfect relationship, taking the form: the more the x, the more the y, and vice versa.
A value of -1 is a perfect negative relationship, taking the form: the more the x, the less the y, and vice versa.
Since it is a measure of association, the presence of significant linear correlation between two variables does not imply causation.
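The range of Pearson's r described above can be illustrated with a small sketch. The function name pearson_r and the demo values are assumptions made for illustration; they are not part of the slides and are not the bicycle data.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, for illustration only (not the bicycle data).
x_demo = [1, 2, 3, 4, 5]
r_up = pearson_r(x_demo, [2, 4, 6, 8, 10])    # the more the x, the more the y
r_down = pearson_r(x_demo, [10, 8, 6, 4, 2])  # the more the x, the less the y
```

Here r_up comes out at the +1 end of the scale and r_down at the -1 end, matching the two perfect relationships described above.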
Linear Correlation
In situations where the assumptions of linear correlation are violated, correlation becomes inadequate to explain a given relationship. The three crucial assumptions in linear correlation are:
1. Normality
2. Linearity
3. Homoscedasticity
The assumption of normality requires that the distribution of both variables approximates the normal distribution and is not skewed in either the positive or the negative direction.
The linearity assumption requires that the relationship between the two variables is linear and proportional.
The homoscedasticity assumption requires that the variance remain constant over time for each variable studied. In other words, it calls for constancy of the variance of a measure over the levels of the factor under study.
Assumptions in Using Linear Correlation
Let us look at the linear relationship between percent of students receiving reduced-fee lunch and percent of students wearing bicycle helmets. Here the X variable is socioeconomic status measured as the percentage of children in a neighborhood receiving free or reduced-fee lunches at school. The Y variable is bicycle helmet use measured as the percentage of bicycle riders in the neighborhood wearing helmets.
The bicycle data is shown in the next slide. The first step in conducting linear correlation analysis is to use scatter plots to visually inspect the pattern of relationship between the two variables.
To generate a scatter plot in SPSS, do the following:
Graphs > Legacy Dialogs > Scatter/Dot… and then click on Simple Scatter. Then click on the Define button and move percent receiving reduced-fee lunch to the X-axis and percent wearing helmets to the Y-axis. Finally, click OK.
Data on the relationship between percent receiving reduced-fee lunch and percent wearing helmets.
Simple Scatter Plot Selected
X-axis and Y-axis variables selected for scatter plot
Scatter plot of percent receiving reduced-fee lunch and percent wearing helmets
The scatter plot looks fairly linear. The direction of the relationship is such that the two variables are inversely related.
We also observe some outliers in the scatter plot. An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
To obtain the linear correlation coefficient, do the following in SPSS:
Analyze > Correlate > Bivariate and then move both variables to the Variables box and then click OK.
You would obtain the output indicated in the Correlations table.
A demonstration on how to execute the linear correlation coefficient calculation in SPSS
Demonstration on how to select the variables for which correlation is to be determined.
Correlations
                                                              Percent receiving       Percent wearing
                                                              reduced or free meals   helmets
Percent receiving reduced or free meals  Pearson Correlation  1                       -.581*
                                         Sig. (2-tailed)                              .037
                                         N                    13                      13
Percent wearing helmets                  Pearson Correlation  -.581*                  1
                                         Sig. (2-tailed)      .037
                                         N                    13                      13
*. Correlation is significant at the 0.05 level (2-tailed).
As can be seen in the Correlations table above, the Pearson correlation is -0.581 and its p-value is 0.037, which indicates that there is a statistically significant linear relationship between percent receiving reduced-fee lunch and percent wearing bicycle helmets. Note that the negative sign indicates that the relationship is an inverse one. That means neighborhoods with a lower percentage of students receiving reduced-fee lunch have a higher percentage of students wearing helmets, and vice versa.
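The reported p-value can be checked by hand from r and N alone. The sketch below uses the standard conversion t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The function names are assumptions for illustration, and the p-value is approximated by numerically integrating the t density (Simpson's rule) so that no statistics package is needed; SPSS itself may compute it differently.

```python
import math

def t_from_r(r, n):
    """t statistic for testing that the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value, integrating the density on [0, |t|] by Simpson's rule."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 2 * (0.5 - area)

# Values taken from the Correlations table: r = -.581, N = 13.
r, n = -0.581, 13
t_stat = t_from_r(r, n)
p_value = two_tailed_p(t_stat, n - 2)
```

With these inputs t_stat is close to the t value SPSS reports for the slope in the regression output, and p_value is close to the .037 shown in the table.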
Linear Regression
Linear regression models the relationship between two variables by
fitting a linear equation to observed data. One variable is considered to be
an explanatory variable, and the other is considered to be a dependent
variable.
A linear regression line has an equation of the form Y = a + bX, where X
is the explanatory variable and Y is the dependent variable. The slope of
the line is b, and a is the intercept (the value of y when x = 0).
Regression is better suited than correlation for studying samples in
which the investigator fixes the distribution of X. That means the
investigator can control changes in the level of X so as to examine
corresponding changes in Y.
The most common method for fitting a regression line is the method of
least-squares. This method calculates the best-fitting line for the observed
data by minimizing the sum of the squares of the vertical deviations from
each data point to the straight line.
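A minimal sketch of the least-squares method for Y = a + bX follows. The function names and the five demo points are hypothetical (they are not the bicycle data); the slope and intercept use the usual closed-form expressions for simple linear regression.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the point of means
    return a, b

def sum_squared_deviations(x, y, a, b):
    """Sum of squared vertical deviations of the points from Y = a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, for illustration only.
x_demo = [0, 1, 2, 3, 4]
y_demo = [1, 3, 2, 5, 4]
a_hat, b_hat = fit_line(x_demo, y_demo)
best = sum_squared_deviations(x_demo, y_demo, a_hat, b_hat)
shifted = sum_squared_deviations(x_demo, y_demo, a_hat + 0.1, b_hat)
```

Moving the fitted line in any direction, as with the shifted intercept here, increases the sum of squared vertical deviations, which is what "best-fitting" means in this method.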
Assumptions of Linear Regression
There are four principal assumptions which justify the use of linear regression models:
1. linearity of the relationship between dependent and independent variables
2. independence of the errors (no serial correlation)
3. homoscedasticity (constant variance) of the errors
(a) versus time
(b) versus the predictions (or versus any independent variable)
4. normality of the error distribution.
Nonlinearity can be detected by plotting the observed versus predicted
values or by plotting residuals versus predicted values, which are a part
of standard regression output. The points should be symmetrically
distributed around a diagonal line in the former plot or a horizontal line in
the latter plot. Look carefully for evidence of a "bowed" pattern, indicating
that the model makes systematic errors whenever it is making unusually
large or small predictions.
The best way to check the independence assumption is to examine the
autocorrelation plot of the residuals. Most of the residual autocorrelations should fall
within the 95% confidence bands around zero, which are located at roughly
plus-or-minus 2-over-the-square-root-of-n, where n is the sample size. The
Durbin-Watson statistic can also help to test for significant residual autocorrelation.
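The two independence checks can be sketched as follows. The function names and the alternating demo residuals are assumptions for illustration only; the band uses the 2-over-root-n rule quoted above, and the Durbin-Watson statistic uses its usual definition (sum of squared successive differences divided by the sum of squared residuals).

```python
import math

def autocorrelation_band(n):
    """Approximate 95% band half-width around zero: 2 / sqrt(n)."""
    return 2 / math.sqrt(n)

def lag_autocorrelation(residuals, lag=1):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    denom = sum(d * d for d in dev)
    return sum(dev[i] * dev[i - lag] for i in range(lag, n)) / denom

def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest little lag-1 autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

band_13 = autocorrelation_band(13)  # n = 13, as in the bicycle example

# Hypothetical residuals that alternate in sign (strong negative autocorrelation).
demo = [1, -1, 1, -1, 1, -1]
r1_demo = lag_autocorrelation(demo)
dw_demo = durbin_watson(demo)
```

For the alternating demo series the lag-1 autocorrelation is strongly negative and the Durbin-Watson statistic sits well above 2; a series with positive serial correlation would push the statistic toward 0 instead.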
Violations of the homoscedasticity assumption can be detected by looking at plots of
residuals versus predicted values. Residuals that are getting larger
(i.e., more spread-out) either as a function of time or as a function of the predicted value
suggest the presence of heteroscedasticity. A plot of residuals versus some of the
independent variables might also help to discern the presence of heteroscedasticity.
A check for violation of the normality assumption can be done with a normal probability
plot of the residuals. The normal probability plot is a plot of the fractiles of the error
distribution versus the fractiles of a normal distribution having the same mean and
variance. If the distribution is normal, the points on this plot should fall close to the
diagonal line.
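The coordinates of a normal probability plot can be sketched as below. The function name, the demo residuals, and the plotting positions (i - 0.5) / n are assumptions for illustration; other plotting-position conventions exist, and SPSS may use a different one.

```python
from statistics import NormalDist, mean, stdev

def normal_probability_points(residuals):
    """Pair each ordered residual with the matching fractile of a normal
    distribution that has the same mean and standard deviation."""
    n = len(residuals)
    ordered = sorted(residuals)
    reference = NormalDist(mean(residuals), stdev(residuals))
    points = []
    for i, value in enumerate(ordered, start=1):
        prob = (i - 0.5) / n  # assumed plotting position
        points.append((reference.inv_cdf(prob), value))
    return points

# Hypothetical residuals, for illustration only.
demo_residuals = [-1.2, 0.4, -0.3, 1.1, 0.0]
points = normal_probability_points(demo_residuals)
```

Each pair is (normal fractile, observed residual). Plotting the pairs and comparing them with the diagonal gives the visual check described above.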
Assumptions of Linear Regression Continued…
An illustration of regression techniques will be given as follows using the bicycle data. The regression model and its parameter estimates can be generated in SPSS by clicking:
Analyze > Regression > Linear and then move percent receiving reduced-fee lunch to the Independent(s) box and percent wearing helmets to the Dependent box. Then click OK.
This gives us the important outputs such as the Model Summary table, ANOVA table and Coefficients table.
A screen demonstrating the steps for running linear regression.
Demonstration of how to pick the dependent and independent variables for fitting the linear regression model.
Model Summary
Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .581a  .338      .277               14.2824
a. Predictors: (Constant), Percent receiving reduced or free meals

ANOVAb
Model          Sum of Squares  df  Mean Square  F      Sig.
1  Regression  1143.847        1   1143.847     5.607  .037a
   Residual    2243.842        11  203.986
   Total       3387.689        12
a. Predictors: (Constant), Percent receiving reduced or free meals
b. Dependent Variable: percent wearing helmets

Coefficientsa
                                            Unstandardized Coefficients  Standardized Coefficients
Model                                       B        Std. Error          Beta                       t       Sig.
1  (Constant)                               43.638   6.282                                          6.946   .000
   Percent receiving reduced or free meals  -.331    .140                -.581                      -2.368  .037
a. Dependent Variable: percent wearing helmets
Linear regression model output in which percent wearing helmets is estimated as a function of percent receiving reduced-fee lunch.
Interpretation of the fitted regression model output:
1. The Model Summary table indicates that the R square value is 0.338, or about 0.34. This can be viewed as poor model fit since it means that only about 34% of the variability in percent wearing helmets is explained by percent receiving reduced-fee lunch.
2. The ANOVA table indicates that the fitted regression model is statistically significant since the p-value is 0.037, which is less than 0.05.
3. The Coefficients table shows that the intercept is 43.638 and the slope is -0.331. The p-value for the slope is 0.037, which is less than 0.05. Therefore, percent receiving reduced-fee lunch is a significant predictor of percent wearing helmets.
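The figures in the Model Summary and ANOVA tables are tied together by simple arithmetic, which the following sketch reproduces from the sums of squares alone. The variable names are assumptions for illustration; the input numbers are copied from the ANOVA table.

```python
import math

# Values copied from the ANOVA table.
ss_regression = 1143.847
ss_residual = 2243.842
ss_total = 3387.689
df_regression = 1
df_residual = 11
n = 13

r_squared = ss_regression / ss_total                           # R Square
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual    # Adjusted R Square
ms_regression = ss_regression / df_regression                  # Mean Square, regression
ms_residual = ss_residual / df_residual                        # Mean Square, residual
f_stat = ms_regression / ms_residual                           # F
std_error_estimate = math.sqrt(ms_residual)                    # Std. Error of the Estimate
```

Up to rounding, these reproduce the .338, .277, 5.607 and 14.2824 shown in the SPSS output. With a single predictor, R Square is also the square of the Pearson correlation (-.581) reported earlier.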
The slope of the regression model is interpreted as the average change in Y per unit change in X. In this case, the slope of -0.331 predicts about 0.33 fewer helmet users per 100 bicycle riders for each additional percentage point of children receiving reduced-fee meals.
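The fitted line can be written out as a small prediction function. The function name and the input values of 20 and 21 percent are hypothetical choices for illustration; the coefficients are those in the Coefficients table, and predictions are only meaningful within the range of lunch percentages actually observed in the data.

```python
# Coefficients copied from the Coefficients table.
INTERCEPT = 43.638  # predicted percent wearing helmets when X = 0
SLOPE = -0.331      # change in percent wearing helmets per one-point change in X

def predict_helmet_percent(lunch_percent):
    """Predicted percent wearing helmets from the fitted line Y = a + bX."""
    return INTERCEPT + SLOPE * lunch_percent

# Hypothetical inputs one percentage point apart.
at_20 = predict_helmet_percent(20)
at_21 = predict_helmet_percent(21)
difference = at_21 - at_20
```

The difference between the two predictions equals the slope, which is the "average change in Y per unit change in X" reading given above.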
Final Remarks
In regression analysis, residual analysis and the tasks of
identifying the influence of outliers and influential points are
crucial. For instance in this dataset, observation 13 was found to
be an outlier from the scatter plot made earlier. If we remove
this observation and refit the regression model, the model
parameter estimates change significantly.
A thorough analysis of the effects of outliers and influential
points will be covered under multiple regression in Week 12.
It is also important to note that statistical associations are not
always causal. The distinction between causal and non-causal
associations in health and disease has considerable practical
relevance.