slide 2 chapter 4 descriptive methods in regression and correlation
TRANSCRIPT
Slide 2
Chapter 4
Descriptive Methods in Regression and Correlation
Slide 3
C4S1 – Linear Equation with One Independent Variable
Linear equations with one independent variable can be written as $y = b_0 + b_1 x$, where $b_0$ and $b_1$ are constants (fixed numbers), x is the independent variable, and y is the dependent variable.
The graph of a linear equation is a straight line. In slope-intercept form, $y = mx + b$, the slope is $m = b_1$ and the y-intercept is $b = b_0$.
Linear Equations
Slide 4
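Figure: graph of the line y = mx + b, marking the y-intercept (0, y) and the x-intercept (x, 0); horizontal axis x, vertical axis y.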
Slide 5
Figure 4.6
Positive slope: the line falls from right to left (rises from left to right).
Negative slope: the line falls from left to right.
Horizontal line: has a slope of 0.
Slide 6
Plotting the data in a scatterplot helps us visualize any apparent relationship between x and y. Generally speaking, a scatterplot (or scatter diagram) is a graph of data from two quantitative variables of a population. To construct a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other. Each pair of observations is then plotted as a point.
C4S2 – The Regression Equation
Slide 7
Because we could draw many different lines through the cluster of data points, we need a method to choose the “best” line.
The method, called the least-squares criterion, is based on an analysis of the errors made in using a line to fit the data points.
$\hat{y} = b_0 + b_1 x$
Slide 8
To avoid confusion, we use $\hat{y}$ to denote the y-value predicted for a value of x.
To measure quantitatively how well a line fits the data, we first consider the errors, e, made in using the line to predict the y-values of the data points.
In general, an error, e, is the signed vertical distance from the line to a data point. The error made in using the line to predict the y-value is $e = y - \hat{y}$.
To decide which line best fits the data, we compute the sum of the squared errors, $\sum e_i^2$.
The line with the smaller sum of squared errors is the one that fits the data better.
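The comparison of candidate lines by their sum of squared errors can be sketched in a few lines of Python. This is only an illustration: the data set and the two candidate lines (A and B) are hypothetical values chosen for the example and are not taken from the slides.

```python
def sum_squared_errors(xs, ys, b0, b1):
    """Sum of e_i^2, where e_i = y_i - (b0 + b1 * x_i)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two hypothetical candidate lines drawn through the same points.
sse_a = sum_squared_errors(xs, ys, b0=2.2, b1=0.6)  # line A: y-hat = 2.2 + 0.6x
sse_b = sum_squared_errors(xs, ys, b0=1.0, b1=1.0)  # line B: y-hat = 1.0 + 1.0x

# The line with the smaller sum of squared errors fits better.
better = "A" if sse_a < sse_b else "B"
print(f"SSE(A) = {sse_a:.2f}, SSE(B) = {sse_b:.2f}, better fit: {better}")
```

For these numbers, line A has the smaller sum of squared errors, so by the least-squares criterion it fits the data better than line B.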
Slide 9
Slide 10
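The regression equation for a set of n data points is $\hat{y} = b_0 + b_1 x$, where

$$b_0 = \frac{\sum y \sum x^2 - \sum x \sum xy}{n \sum x^2 - \left(\sum x\right)^2} \quad \text{(y-intercept)}$$

$$b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \quad \text{(slope)}$$

Mean for y:

$$\bar{y} = \frac{\sum y}{n}$$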
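As a rough sketch, the sum-based formulas for the y-intercept and slope can be computed directly in Python. The function name `least_squares_line` and the data values are assumptions made for this illustration; they do not come from the slides.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the line y-hat = b0 + b1 * x using the sum formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denom = n * sum_x2 - sum_x ** 2          # shared denominator
    b1 = (n * sum_xy - sum_x * sum_y) / denom        # slope
    b0 = (sum_y * sum_x2 - sum_x * sum_xy) / denom   # y-intercept
    return b0, b1

# Hypothetical data points (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
y_bar = sum(ys) / len(ys)  # mean for y
print(f"y-hat = {b0:.2f} + {b1:.2f}x, mean of y = {y_bar:.2f}")
```

The denominator is zero when all x-values are equal, in which case no regression line of this form exists; the sketch does not guard against that case.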
Slide 11
Extrapolation
Suppose that a scatterplot indicates a linear relationship between two variables.
Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable.
However, to do so outside that range, which is called extrapolation, may not be reasonable because the linear relationship between the predictor and response variables may not hold there.
Grossly incorrect predictions can result from extrapolation.
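One simple way to guard against accidental extrapolation is to check whether a new x-value lies within the range of the observed predictor values before trusting the prediction. The sketch below assumes a hypothetical helper, `predict_with_range_check`, and hypothetical coefficients and data; none of these names or numbers come from the slides.

```python
def predict_with_range_check(x_new, b0, b1, observed_xs):
    """Return (prediction, is_extrapolation) for the line y-hat = b0 + b1 * x."""
    x_min, x_max = min(observed_xs), max(observed_xs)
    # A prediction is an extrapolation when x_new is outside the observed range.
    is_extrapolation = not (x_min <= x_new <= x_max)
    return b0 + b1 * x_new, is_extrapolation

# Hypothetical observed predictor values and fitted coefficients.
observed_xs = [1, 2, 3, 4, 5]
b0, b1 = 2.2, 0.6

inside = predict_with_range_check(3.5, b0, b1, observed_xs)
outside = predict_with_range_check(40, b0, b1, observed_xs)
print(inside)   # prediction within the observed range
print(outside)  # flagged: x = 40 is far outside the observed range
```

The flag does not say the extrapolated prediction is wrong, only that the data give no evidence that the linear relationship holds there.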
Slide 12
Outliers and Influential Observations
An outlier is an observation that lies outside the overall pattern of the data.
In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points.
An outlier can sometimes have a significant effect on a regression analysis.
We must also watch for influential observations.
In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably.
A data point separated in the x-direction from the other data points is often an influential observation because the regression line is “pulled” toward such a data point without counteraction by other data points.
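The effect of an influential observation can be illustrated by fitting the regression line with and without one point that is separated in the x-direction. The data below, including the extra point (20, 2), are hypothetical values made up for this sketch.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    b1 = (n * sum_xy - sum_x * sum_y) / denom
    b0 = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    return b0, b1

# Hypothetical data; the last point is far from the others in the x-direction.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 5, 4, 5, 2]

b0_with, b1_with = least_squares_line(xs, ys)
b0_without, b1_without = least_squares_line(xs[:-1], ys[:-1])  # drop (20, 2)

print(f"with (20, 2):    y-hat = {b0_with:.2f} + {b1_with:.3f}x")
print(f"without (20, 2): y-hat = {b0_without:.2f} + {b1_without:.3f}x")
```

In this made-up example, removing the single distant point changes the slope from negative to positive, which is the kind of considerable change that marks an influential observation.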
Slide 13
Regression analysis is used when you want to show whether and how one variable can be used to predict another variable. On its own, it does not establish that changes in one variable cause changes in the other.
The slope of the best-fit line is $m = r \, \frac{s_y}{s_x}$, where r is the correlation between x and y, and $s_x$ and $s_y$ are the standard deviations of x and y.
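The relationship between the slope, the correlation, and the two standard deviations can be checked numerically. The sketch below uses Python's standard `statistics` module and hypothetical data; the variable names are assumptions for illustration only.

```python
import math
import statistics

# Hypothetical data points (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# Linear correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# Slope computed as r times the ratio of sample standard deviations.
slope_from_r = r * statistics.stdev(ys) / statistics.stdev(xs)

# Slope computed directly from the least-squares fit, for comparison.
slope_direct = s_xy / s_xx

print(f"r = {r:.4f}, slope from r = {slope_from_r:.4f}, direct slope = {slope_direct:.4f}")
```

Both routes give the same slope, up to floating-point rounding.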
Slide 14
C4S3 – The Coefficient of Determination
Slide 15
The coefficient of determination, r², always lies between 0 and 1.
A value of r² near 0 suggests that the regression equation is not very useful for making predictions.
A value of r² near 1 suggests that the regression equation is quite useful for making predictions.
It shows whether the regression equation predicts the observed y-values better than simply using the mean of y.
It is the proportion (percentage) of the variation in the observed y-values that is explained by the regression.
Slide 16
Regression Identity
The total sum of squares (SST) equals the regression sum of squares (SSR) plus the error sum of squares (SSE).
SST = SSR + SSE
This identity always holds.
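The regression identity can be verified numerically for a least-squares line. The sketch below assumes the usual definitions, SST = Σ(y − ȳ)², SSR = Σ(ŷ − ȳ)², and SSE = Σ(y − ŷ)², and computes the coefficient of determination as SSR / SST. The data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    b1 = (n * sum_xy - sum_x * sum_y) / denom
    b0 = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    return b0, b1

# Hypothetical data points (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
y_bar = sum(ys) / len(ys)
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression sum of squares
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # error sum of squares

r_squared = ssr / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}, r^2 = {r_squared:.2f}")
```

The identity relies on the line being the least-squares line; for an arbitrary line through the data, SSR + SSE need not equal SST.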
Slide 17
C4S4 – Linear Correlation
We hear statements such as “there is a positive correlation between x and y” and “x and y are uncorrelated.” These are explained in this section.
The linear correlation coefficient, r, measures the strength of the linear relationship between two variables.
The computing formula for r is used for hand calculations.
The defining formula for r reveals its meaning and basic properties.
Slide 18
Understanding the Linear Correlation Coefficient
r is independent of the choice of units and always lies between -1 and 1.
When r is close to ±1, there is a strong linear relationship and the regression equation is extremely useful for making predictions. The data points are clustered closely about the regression line.
When r is near 0, the linear relationship is weak and the regression equation is a poor predictor. The data points are essentially scattered about a horizontal line.
Keep in mind that r measures the strength of the linear relationship between two variables and that the following properties of r are meaningful only when the data points are scattered about a line.
• r reflects the slope of the scatterplot.
• The magnitude of r indicates the strength of the linear relationship.
• The sign of r suggests the type of linear relationship.
• The sign of r and the sign of the slope of the regression line are identical.
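Two of these properties, independence from the choice of units and agreement in sign with the regression slope, can be checked with a short sketch. The function `pearson_r`, the data, and the unit conversions are all hypothetical choices made for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data points (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)

# Change units: rescale x (for example inches to centimeters), rescale and shift y.
xs_rescaled = [2.54 * x for x in xs]
ys_rescaled = [10 * y + 3 for y in ys]
r_rescaled = pearson_r(xs_rescaled, ys_rescaled)

# Slope of the least-squares line for the original data.
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))

print(f"r = {r:.4f}, r after changing units = {r_rescaled:.4f}, slope = {slope:.2f}")
```

Here r is unchanged by the change of units, and both r and the slope are positive.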
Slide 19
Figure 4.17
Understanding the Linear Correlation Coefficient
To graphically portray the meaning of the linear correlation coefficient, we present various degrees of linear correlation in Fig. 4.17.
Slide 20
Relationship Between the Correlation Coefficient and the Coefficient of Determination
The coefficient of determination, r², is a descriptive measure of the utility of the regression equation for making predictions.
The coefficient of determination, r², equals the square of the linear correlation coefficient, r.
The linear correlation coefficient, r, is a descriptive measure of the strength of the linear relationship between two variables.
Because the linear correlation coefficient describes the strength of the linear relationship between two variables, it should be used as a descriptive measure only when a scatterplot indicates that the data points are scattered about a line.
Slide 21
Relationship Between the Correlation Coefficient and the Coefficient of Determination
When using the linear correlation coefficient, you must also watch for outliers and influential observations, because sample means and sample standard deviations are not resistant to outliers and other extreme values.
We cannot say that a value of r near 0 implies that there is no relationship, and we cannot say that a value of r near ±1 implies that a linear relationship exists. Such statements are meaningful only when the scatterplot indicates that the data points are scattered about a line.
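A small sketch shows why r near 0 does not imply that there is no relationship. The hypothetical data below follow y = x² exactly, so y is completely determined by x, yet the linear correlation coefficient is 0 because the relationship is not linear. The function name and data are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data lying exactly on the curve y = x^2.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.4f}")  # no linear relationship, but a perfect curved one
```

A scatterplot of these points would show the curve immediately, which is why the scatterplot should be examined before interpreting r.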