Chapter 15
Introduction
When referring to interval-ratio variables, a commonly used synonym for association is correlation
We will be looking for the existence, strength, and direction of the relationship
We will only look at bivariate relationships in this chapter
Scattergrams
The first step is to construct and examine a scattergram
Example in the book
Analysis of how dual wage-earner families cope with housework
They want to know if the number of children in the family is related to the amount of time the husband contributes to housekeeping chores
Scattergram of Relationship Between the Two Variables
[Figure: Regression of Husband’s Hours of Housework by the Number of Children in the Family. X axis: Number of Children (0 to 6). Y axis: Hours Per Week Husband Spends on Housework (-2 to 8).]
Construction of a Scattergram
Draw two axes of about equal length and at right angles to each other
Put the independent (X) variable along the horizontal axis (the abscissa) and the dependent (Y) variable along the vertical axis (the ordinate)
For each case, locate the point along the abscissa that corresponds to the score of that case on the X variable
Draw a straight line up from that point and at right angles to the axis
Then locate the point along the ordinate that corresponds to the score of that same case on the Y variable
Place a dot where the two positions meet to represent the case, and then repeat with all cases
Regression Line and its Purpose
It checks for linearity of the data points on the scattergram
It gives information about the existence, strength, and direction of the association
It is used to predict the score of a case on one variable from the score of that case on the other variable
It is a floating mean through all the data points
Existence of a Relationship
Two variables are associated if the distributions of Y change for the various conditions of X
The scores along the abscissa (number of children) are the conditions or values of X
The dots above each X value can be thought of as the conditional distributions of Y (scores on Y for each value of X)
In the example, Y tends to increase as X increases
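The idea of conditional distributions can be sketched in a few lines of Python. The scores below are hypothetical values for twelve families (they are not given in this transcript), and the function name is an illustrative choice. The sketch groups the Y scores by each value of X and reports the mean of each group, so you can see whether the conditional means change as X changes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (number of children, husband's weekly housework hours) pairs.
# These scores are an assumption for illustration only.
families = [(1, 1), (1, 2), (1, 3), (1, 5), (2, 3), (2, 1),
            (3, 5), (3, 0), (4, 6), (4, 3), (5, 7), (5, 4)]

def conditional_means(pairs):
    """Group the Y scores by X value and return the mean of Y for each X."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in sorted(groups.items())}

cond = conditional_means(families)
for x, m in cond.items():
    print(f"X = {x}: mean of Y = {m}")
```

With these made-up scores the conditional means do not rise at every step, but they are clearly higher for larger families, which is the pattern the slide describes.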
Existence of a Relationship, cont.
The existence of a relationship is reinforced by the fact that the regression line lies at an angle to the X axis (the abscissa)
There is no linear relationship between two interval-level variables when the regression line on a scattergram is parallel to the horizontal axis
Strength of the Association
The strength of the association is judged by observing the spread of the dots around the regression line
A perfect association between variables can be seen on a scattergram when all dots lie on the regression line
The closer the dots are to the regression line, the stronger the association
So, for a given X, there should not be much variation in the Y scores
Direction of the Relationship
The direction of the relationship can be judged by observing the angle of the regression line with respect to the abscissa
The relationship is positive when the line slopes upward from left to right
The association is negative when it slopes down
Your book shows a positive relationship, because cases with high scores on X also tend to have high scores on Y
For a negative relationship, cases with high scores on X would tend to have low scores on Y, and vice versa
Your book also shows a zero relationship: no association between the variables, in that their scores are randomly paired with each other
Linearity
The key assumption (first step in the five-step model) with correlation and regression is that the two variables have an essentially linear relationship
The points or dots must form a pattern of a straight line
It is important to begin with a scattergram before doing correlations and regressions
If the relationship is nonlinear, you may need to treat the variables as if they were ordinal rather than interval-ratio
Regression and Prediction
The final use of the scattergram is to predict scores of cases on one variable from their score on the other
You may want to predict the number of hours of housework a husband with a family of four children would do each week
Use regression to predict outside the range of the data only with caution: there are no data points to show what happens beyond the observed range, where the trend may change (for example, it may suddenly turn down)
The Predicted Score on Y
The symbol for this is Y’, or Y prime; most other books use Y hat, but that symbol is difficult to produce on a computer or to print
It is found by first locating the score on X (X = 4, for four children) and then drawing a straight line from that point on the abscissa up to the regression line
From the regression line, another straight line parallel to the abscissa is drawn across to the Y axis or ordinate
Y’ is found at the point where this second line crosses the Y axis
Or, you can compute Y’ = a + bX
Y’ is the expected Y value for a given X
Formula for the Regression Line
The formula for the straight line that fits closest to the conditional means of Y:
Y = a + bX
Where
Y = score on the dependent variable
a = the Y intercept, or the point where the regression line crosses the Y axis
b = the slope of the regression line, or the amount of change produced in Y by a unit change in X
X = score on the independent variable
Regression Line
The position of the least-squares regression line is defined by two elements: the Y intercept and the slope of the line
It also passes through the point where the mean of X meets the mean of Y
The weaker the effect of X on Y (the weaker the association between the variables), the closer the value of the slope (b) is to zero
If the two variables are unrelated, the least-squares regression line would be parallel to the abscissa, and b would be 0 (the line would have no slope)
Equations for the Slope of the Regression Line
You need to compute “b” first, since it is needed in the formula for “a”
Slope:
$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$
Which is the covariance of X and Y divided by the variance of X
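As a minimal sketch of the slope computation, the Python below divides the covariation of X and Y by the sum of squared deviations of X. The twelve scores are hypothetical (they do not appear in this transcript) and were picked so that the slope lands near the .69 quoted for the book's example; the function name is an illustrative choice.

```python
def slope(xs, ys):
    """Least-squares slope b: covariation of X and Y over squared deviations of X."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    covariation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    return covariation / ss_x

# Hypothetical scores: number of children (X) and weekly housework hours (Y).
children = [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
hours = [1, 2, 3, 5, 3, 1, 5, 0, 6, 3, 7, 4]

b = slope(children, hours)
print(round(b, 2))  # roughly 0.69 for these assumed scores
```

Dividing both sums by N - 1 would turn them into the covariance and the variance; the divisor cancels, so the slope is the same either way.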
Interpretation of the Value of the Slope
If you put your scattergram on graph paper, you can see that as X increases by one box, “b” is how many units Y increases on the regression line
So, a slope of .69 indicates that, for each unit increase in X, there is an increase of .69 units in Y
If the slope is 1.5, for every unit of change in X, there is an increase of 1.5 units in Y
The interpretation is stated in units, since correlation and regression allow you to compare apples and oranges: two completely different variables
Interpretation of “b”, cont.
So, to find what one unit of X or one unit of Y is, you have to go back to the labels for each variable
For the example in your book, which has a slope (b) of .69:
The addition of each child (an increase of one unit in X, where one unit is one child)
Results in an increase of .69 hours of housework being done by the husband (an increase of .69 units, or hours, in Y)
Formula for the Intercept of the Regression Line
$$a = \bar{Y} - b\bar{X}$$
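A short Python sketch of the intercept computation follows. It computes b first, then a as the mean of Y minus b times the mean of X, and confirms that the resulting line passes through the point where the two means meet. The scores are hypothetical (not taken from this transcript), and `least_squares` is an illustrative name.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y' = a + bX."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar  # intercept: mean of Y minus b times mean of X
    return a, b

# Hypothetical scores: number of children (X) and weekly housework hours (Y).
children = [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
hours = [1, 2, 3, 5, 3, 1, 5, 0, 6, 3, 7, 4]

a, b = least_squares(children, hours)
print(f"a = {a:.2f}, b = {b:.2f}")

# Predicting at the mean of X gives back the mean of Y.
passes_through_means = abs((a + b * mean(children)) - mean(hours)) < 1e-9
print(passes_through_means)
```

For these assumed scores the intercept comes out at about 1.5. Working by hand with the slope and means rounded to two decimals gives a slightly different figure, which is one way a value such as 1.49 can arise.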
Interpretation of the Intercept
The intercept for the example in the book is 1.49
The least-squares regression line will cross the Y axis at the point where Y equals 1.49
You need a second point to draw the regression line
You can begin at Y of 1.49, and for the next value of X, which is 1 child, you will go up .69 units of Y
Or, you can use the intersection of the mean of X and the mean of Y, since the regression line always goes through this point
Interpretation of “a”, cont.
Most of the time, you can’t interpret the value of the intercept
Technically, it is the value that Y would take if X were zero
But, most often a zero X is not meaningful
Or, as in the case in your book, zero is outside the range of the data
You don’t have any information about the hours of housework that husbands do when they have no children
Technically, the intercept of 1.49 is the amount of predicted housework a husband with zero children would do, but you can’t say that with certainty
Least Squares Regression Line
Now that you know “a” and “b”, you can fill in the full least-squares regression line
Y = a + bX
Y = (1.49) + (.69) X
This formula can be used to predict scores on Y, as was mentioned earlier
For any value of X, it will give you the predicted value of Y (Y’)
The predictions of husband’s housework are “educated guesses”
The accuracy of our predictions will increase as relationships become stronger (as dots are closer to the regression line)
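The prediction step can be sketched as a one-line function in Python. The intercept (1.49) and slope (.69) are the values quoted in the text; the function and constant names are illustrative assumptions.

```python
A = 1.49  # Y intercept quoted in the text
B = 0.69  # slope quoted in the text

def predict_hours(children):
    """Predicted weekly housework hours (Y') for a given number of children (X)."""
    return A + B * children

# A husband in a family with four children: 1.49 + (.69)(4)
print(round(predict_hours(4), 2))  # 4.25
```

So the regression line predicts a little over four hours of housework a week for a family of four children. The same function will return a number for any X, including values far outside the observed range, which is exactly where the prediction deserves the least trust.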
The Correlation Coefficient (Pearson’s r)
Pearson’s r varies from 0 to plus or minus 1
With 0 indicating no linear association
And +1 and -1 indicating perfect positive and perfect negative relationships
The definitional formula for Pearson’s r is in your book
Similar to the formula for the slope (b), the numerator is the covariation between X and Y (usually called the covariance)
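The transcript does not reproduce the formula, so the sketch below uses the standard definitional form of Pearson's r: the covariation of X and Y divided by the square root of the product of the two sums of squared deviations. The scores are hypothetical (not taken from this transcript), and the function name is an illustrative choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: covariation of X and Y over sqrt(SS of X times SS of Y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    covariation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return covariation / sqrt(ss_x * ss_y)

# Hypothetical scores: number of children (X) and weekly housework hours (Y).
children = [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
hours = [1, 2, 3, 5, 3, 1, 5, 0, 6, 3, 7, 4]

r = pearson_r(children, hours)
print(round(r, 2))
```

The numerator is the same covariation used for the slope; only the denominator differs, which is why r and b always share the same sign.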
Interpreting r and r-squared
Interpretation of “r” will be the same as for all the other measures of association
An “r” of .5 would indicate a moderate positive linear relationship between the variables
Interpretation of the Coefficient of Determination (r-squared)
The square of Pearson’s r is also called the coefficient of determination
“r” measures the strength of the linear relationship between two variables
But values of “r” between 0 and 1 or -1 have no direct interpretation
Interpretation, cont.
The coefficient of determination can be interpreted with the logic of PRE (proportional reduction in error)
First, Y is predicted while ignoring the information supplied by X
Second, the independent variable is taken into account when predicting the dependent
When working with variables measured at the interval-ratio level, the predictions of Y under the first condition (while ignoring X) will be the mean of the Y scores (Y bar) for every case
We know that the mean of any distribution is closer than any other point to all the scores in the distribution (it minimizes the sum of squared deviations)
Interpretation, cont.
Predicting the mean of Y for every case, we will make many errors in predicting Y
The amount of error is shown in Figure 16.6
The formula for the error is the sum of (Y minus Y bar) squared
This is called the total variation in Y, meaning the total amount that all the points are off the mean of Y
The next step will be to find the extent to which knowledge of X improves our ability to predict Y (will we make predictions that come closer to the actual points than predictions made using the mean of Y?)
Interpretation, cont.
If the two variables have a linear relationship, then predicting scores on Y from the least-squares regression equation will use knowledge of X and reduce our errors of prediction
The formula for the predicted Y score for each value of X will be: Y’ = a + bX
This is also the formula for the regression line
Unexplained Variation
Unless the association is perfect, some of the variation in Y is unexplained by X
The proportion of the total variation in Y unexplained by X can be found by subtracting the value of r-squared from 1.00
Other variables will be needed to explain one hundred percent of the variation in Y (the dependent variable)
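The PRE logic can be checked numerically. The Python sketch below, using hypothetical scores (not taken from this transcript) and illustrative names, splits the total variation in Y into the part explained by the regression line and the part left unexplained, then expresses each as a proportion of the total.

```python
def variation_breakdown(xs, ys):
    """Return (total, explained, unexplained) variation in Y for the line Y' = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    predicted = [a + b * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)                        # errors when ignoring X
    unexplained = sum((y - yp) ** 2 for y, yp in zip(ys, predicted))  # errors when using X
    explained = sum((yp - y_bar) ** 2 for yp in predicted)            # reduction in error
    return total, explained, unexplained

# Hypothetical scores: number of children (X) and weekly housework hours (Y).
children = [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
hours = [1, 2, 3, 5, 3, 1, 5, 0, 6, 3, 7, 4]

total, explained, unexplained = variation_breakdown(children, hours)
r_squared = explained / total
print(f"r-squared = {r_squared:.2f}, unexplained = {unexplained / total:.2f}")
```

For these assumed scores about a quarter of the variation in Y is explained by X and about three quarters is not, so the two proportions sum to 1.00 as the slide states.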
Unexplained Variation, cont.
Unexplained variation is usually attributed to the influence of three things:
Some combination of other variables, as in the example of the husband’s housework
Measurement error: people over- or underestimate how much time they spend doing housework
Random chance: your sample may be unrepresentative, particularly if it is small
Testing Pearson’s r for Significance
When “r” is based on data from a random sample, you need to test “r” for its statistical significance
When testing Pearson’s r for significance, the null hypothesis is that there is no linear association between the variables in the population from which the sample was drawn
We will use the t distribution for this test
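The transcript does not give the test statistic, so the sketch below assumes the commonly used form, t = r times the square root of (N - 2) / (1 - r squared), with N - 2 degrees of freedom. The inputs (r = .5, the moderate value mentioned earlier, and a sample of 12 cases) are hypothetical, and the function name is an illustrative choice.

```python
from math import sqrt

def t_for_r(r, n):
    """Obtained t for testing Pearson's r against zero, with its degrees of freedom."""
    df = n - 2
    t = r * sqrt(df / (1 - r ** 2))
    return t, df

# Hypothetical inputs: r = .5 from a random sample of 12 cases.
t_obtained, df = t_for_r(0.5, 12)
print(f"t = {t_obtained:.3f}, df = {df}")
```

The obtained t is then compared with the critical value from a t table for the chosen alpha level and these degrees of freedom. Note that the same r produces a larger t as N grows, so a moderate correlation can fail the test in a small sample and pass it in a large one.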
Assumptions for the Significance Test
We make some additional assumptions in Step 1
Need to assume that both variables are normal in distribution
Need to assume that the relationship between the two variables is roughly linear in form
The third new assumption involves the concept of homoscedasticity
Homoscedasticity
A homoscedastic relationship is one where the variance of the Y scores is uniform for all values of X
If the Y scores are evenly spread above and below the regression line for the entire length of the line, the relationship is homoscedastic
If the variance around the regression line is greater at one end or the other, the relationship is heteroscedastic
A visual inspection of the scattergram is usually sufficient to find the extent to which the relationship conforms to the assumptions of linearity and homoscedasticity
If the data points fall in a roughly symmetrical, cigar-shaped pattern, whose shape can be approximated with a straight line, then it is appropriate to proceed with this test of significance