Elementary Statistics: Correlation and Regression
TRANSCRIPT
Correlation
A relationship between two variables

x = Explanatory (Independent) Variable
y = Response (Dependent) Variable

x                            y
Hours of Training            Number of Accidents
Shoe Size                    Height
Cigarettes smoked per day    Lung Capacity
Score on SAT                 Grade Point Average
Height                       IQ

What type of relationship exists between the two variables, and is the correlation significant?
Scatter Plots and Types of Correlation
Negative Correlation: as x increases, y decreases
x = hours of training, y = number of accidents
[Scatter plot: Accidents (0 to 60) versus Hours of Training (0 to 20)]
Scatter Plots and Types of Correlation
Positive Correlation: as x increases, y increases
x = SAT score, y = GPA
[Scatter plot: GPA (1.50 to 4.00) versus Math SAT (300 to 800)]
Scatter Plots and Types of Correlation
No linear correlation
x = height, y = IQ
[Scatter plot: IQ (80 to 160) versus Height (60 to 80)]
Correlation Coefficient
A measure of the strength and direction of a linear relationship between two variables.
The range of r is from –1 to 1.
If r is close to 1, there is a strong positive correlation.
If r is close to –1, there is a strong negative correlation.
If r is close to 0, there is no linear correlation.
Application
x = Absences, y = Final Grade

x     y
8     78
2     92
5     90
12    58
15    43
9     74
6     81

[Scatter plot: Final Grade (40 to 95) versus Absences (0 to 16)]
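Computation of r

n      x     y     xy      x²     y²
1      8     78    624     64     6084
2      2     92    184     4      8464
3      5     90    450     25     8100
4      12    58    696     144    3364
5      15    43    645     225    1849
6      9     74    666     81     5476
7      6     81    486     36     6561
Sum    57    516   3751    579    39898

$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = \frac{7(3751) - (57)(516)}{\sqrt{7(579) - 57^2}\,\sqrt{7(39898) - 516^2}} \approx -0.975$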
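The column sums in the table can be checked with a short script. This is a minimal sketch, assuming the usual sum-based formula for Pearson's r; the function and variable names are illustrative and do not come from the slides.

```python
from math import sqrt

# Data pairs from the Application slide: x = absences, y = final grade.
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]


def correlation_coefficient(xs, ys):
    """Pearson's r computed from the sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


r = correlation_coefficient(absences, grades)
print(round(r, 3))  # -0.975
```

The value is close to –1, which matches the strong negative pattern in the absences data.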
Hypothesis Test for Significance
r is the correlation coefficient for the sample. The correlation coefficient for the population is ρ (rho).
The sampling distribution for r is a t-distribution with n – 2 d.f.
Standardized test statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
For a two-tailed test for significance:
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
For left-tailed and right-tailed tests of negative or positive significance:
Left-tailed: $H_0: \rho \geq 0$, $H_a: \rho < 0$
Right-tailed: $H_0: \rho \leq 0$, $H_a: \rho > 0$
Test of Significance
You found the correlation between the number of times absent and a final grade to be r = –0.975. There were seven pairs of data. Test the significance of this correlation. Use α = 0.01.
1. Write the null and alternative hypotheses.
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
2. State the level of significance.
α = 0.01
3. Identify the sampling distribution.
A t-distribution with 5 degrees of freedom
4. Find the critical value.
Critical values: ±t0 = ±4.032
5. Find the rejection region.
Rejection regions: t < –4.032 or t > 4.032
6. Find the test statistic.
$t = \frac{-0.975}{\sqrt{\frac{1 - (-0.975)^2}{7 - 2}}} \approx -9.811$
7. Make your decision.
t = –9.811 falls in the rejection region. Reject the null hypothesis.
8. Interpret your decision.
There is a significant correlation between the number of times absent and final grades.
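The eight steps can be traced in a few lines of code. This is a sketch only: it assumes the standardized test statistic t = r / sqrt((1 – r²)/(n – 2)) and takes the critical value 4.032 from the slide rather than computing it. The names `t_statistic` and `reject_null` are illustrative.

```python
from math import sqrt


def t_statistic(r, n):
    """Standardized test statistic for a sample correlation r from n pairs."""
    return r / sqrt((1 - r ** 2) / (n - 2))


r, n = -0.975, 7          # sample correlation and number of data pairs
critical_value = 4.032    # two-tailed, alpha = 0.01, 5 d.f. (value from the slide)

t = t_statistic(r, n)
# Two-tailed test: reject H0 when t lies beyond either critical value.
reject_null = abs(t) > critical_value
print(round(t, 2), reject_null)  # -9.81 True
```

The unrounded statistic differs from the slide's –9.811 only in the third decimal place, which does not affect the decision.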
The Line of Regression
Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line.
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.
The line of regression is: $\hat{y} = mx + b$
The slope m is: $m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$
The y-intercept is: $b = \bar{y} - m\bar{x}$
[Figure: revenue (180 to 260) versus Ad $ (1.5 to 3.0) with the regression line. For each data point (xi, yi), the point on the line with the same x-value is marked, and the vertical difference between the two is a residual.]
The Line of Regression
Write the equation of the line of regression with x = number of absences and y = final grade.
Calculate m and b, using the column sums from the computation of r: n = 7, Σx = 57, Σy = 516, Σxy = 3751, Σx² = 579.
$m = \frac{7(3751) - (57)(516)}{7(579) - 57^2} = \frac{-3155}{804} \approx -3.924$
$b = \bar{y} - m\bar{x} = 73.714 - (-3.924)(8.143) \approx 105.667$
m = –3.924 and b = 105.667
The line of regression is: $\hat{y} = -3.924x + 105.667$
[Scatter plot with regression line: Final Grade (40 to 95) versus Absences (0 to 16)]
Note that the point $(\bar{x}, \bar{y})$ = (8.143, 73.714) is on the line.
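The slope and intercept can be recomputed directly from the data. This is a minimal sketch, assuming the standard least-squares formulas m = (nΣxy – ΣxΣy)/(nΣx² – (Σx)²) and b = ȳ – m·x̄; the helper name `regression_line` is illustrative.

```python
# Data pairs from the Application slide: x = absences, y = final grade.
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]


def regression_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * (sum_x / n)   # b = y-bar - m * x-bar
    return m, b


m, b = regression_line(absences, grades)
mean_x = sum(absences) / len(absences)
mean_y = sum(grades) / len(grades)
print(round(m, 3))                            # -3.924
print(abs(m * mean_x + b - mean_y) < 1e-9)    # True: (x-bar, y-bar) is on the line
```

Because the slides round m to –3.924 before computing b, the unrounded intercept from this script differs from 105.667 by about 0.001.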
Predicting y Values
The regression line can be used to predict values of y for values of x falling within the range of the data.
The regression equation for number of times absent and final grade is:
$\hat{y} = -3.924x + 105.667$
Use this equation to predict the expected grade for a student with
(a) 3 absences (b) 12 absences
(a) $\hat{y} = -3.924(3) + 105.667 = 93.895$
(b) $\hat{y} = -3.924(12) + 105.667 = 58.579$
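The restriction to x-values inside the range of the data can be made explicit in code. The sketch below uses the rounded slope and intercept from the slides and refuses to extrapolate. The function name `predict_grade` and the choice to raise an error are illustrative assumptions, not part of the original material.

```python
# Slope and intercept as rounded on the slides.
SLOPE, INTERCEPT = -3.924, 105.667
# Smallest and largest number of absences in the sample data.
X_MIN, X_MAX = 2, 15


def predict_grade(absences):
    """Predict a final grade, but only for x inside the observed range."""
    if not X_MIN <= absences <= X_MAX:
        raise ValueError(
            f"{absences} absences is outside the data range {X_MIN} to {X_MAX}"
        )
    return SLOPE * absences + INTERCEPT


print(round(predict_grade(3), 3))   # 93.895
print(round(predict_grade(12), 3))  # 58.579
```

Calling `predict_grade(20)` raises `ValueError`, since 20 absences lies beyond the largest observed x-value of 15.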
The Coefficient of Determination
The coefficient of determination, r², is the ratio of explained variation in y to the total variation in y.
The correlation coefficient of number of times absent and final grade is r = –0.975. The coefficient of determination is r² = (–0.975)² = 0.9506.
Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
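The "explained over total variation" definition can be checked numerically against the absences data. This is a sketch under the assumption that explained variation is Σ(ŷ – ȳ)² and total variation is Σ(y – ȳ)²; the function name is illustrative.

```python
# Data pairs from the Application slide: x = absences, y = final grade.
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]


def coefficient_of_determination(xs, ys):
    """Ratio of explained variation to total variation in y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Least-squares slope and intercept from deviations about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    predicted = [m * x + b for x in xs]
    explained = sum((p - mean_y) ** 2 for p in predicted)
    total = sum((y - mean_y) ** 2 for y in ys)
    return explained / total


r_squared = coefficient_of_determination(absences, grades)
print(round(r_squared, 2))  # 0.95
```

The unrounded ratio agrees with the slide's 0.9506 to about three decimal places; the small gap comes from squaring the rounded r = –0.975.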
The Standard Error of Estimate
The standard error of estimate, $s_e$, is the standard deviation of the observed $y_i$ values about the predicted value $\hat{y}_i$:
$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$
Calculate $(y_i - \hat{y}_i)^2$ for each x.

n      x     y     ŷ         (y – ŷ)²
1      8     78    74.275    13.8756
2      2     92    97.819    33.8608
3      5     90    86.047    15.6262
4      12    58    58.579    0.3352
5      15    43    46.807    14.4932
6      9     74    70.351    13.3152
7      6     81    82.123    1.2611
Sum                          92.767

$s_e = \sqrt{\frac{92.767}{5}} \approx 4.307$
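The table of squared differences can be reproduced programmatically. This minimal sketch uses the rounded regression line ŷ = –3.924x + 105.667 from the slides and divides by n – 2; the function and variable names are illustrative.

```python
from math import sqrt

# Data pairs from the Application slide: x = absences, y = final grade.
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]
# Slope and intercept as rounded on the slides.
SLOPE, INTERCEPT = -3.924, 105.667


def standard_error_of_estimate(xs, ys, m, b):
    """Standard deviation of observed y values about the regression line."""
    n = len(xs)
    sum_sq_residuals = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sum_sq_residuals / (n - 2))


s_e = standard_error_of_estimate(absences, grades, SLOPE, INTERCEPT)
print(round(s_e, 3))  # 4.307
```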
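Prediction Intervals
Given a specific linear regression equation and x0, a specific value of x, a c-prediction interval for y is:
$\hat{y} - E < y < \hat{y} + E$
where
$E = t_c s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - (\sum x)^2}}$
Use a t-distribution with n – 2 degrees of freedom.
The point estimate is $\hat{y}$ and E is the maximum error of estimate.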
Application
Construct a 90% prediction interval for a final grade when a student has been absent 6 times.
1. Find the point estimate:
$\hat{y} = -3.924(6) + 105.667 = 82.123$
The point (6, 82.123) is the point on the regression line with x-coordinate of 6.
2. Find E, using the critical value $t_c = 2.015$ for a t-distribution with 5 d.f.
At the 90% level of confidence, the maximum error of estimate is 9.438.
3. Find the interval: 82.123 – 9.438 < y < 82.123 + 9.438, that is, 72.685 < y < 91.561.
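The interval can be computed end to end as a check. This sketch assumes the maximum error of estimate has the form E = t_c · s_e · sqrt(1 + 1/n + n(x0 – x̄)²/(nΣx² – (Σx)²)). The critical value 2.015 is an assumed table value (t-distribution, 5 d.f., c = 0.90), and the slope, intercept, and s_e are the rounded slide values. The function name is illustrative.

```python
from math import sqrt

# x-values (absences) from the Application slide.
absences = [8, 2, 5, 12, 15, 9, 6]
SLOPE, INTERCEPT = -3.924, 105.667   # rounded regression line from the slides
S_E = 4.307                          # standard error of estimate from the slides
T_C = 2.015                          # assumed t-table value: 5 d.f., c = 0.90


def prediction_interval(x0, xs, m, b, s_e, t_c):
    """Return (lower, upper, E) for a prediction interval at x = x0."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    mean_x = sum_x / n
    y_hat = m * x0 + b   # point estimate on the regression line
    margin = t_c * s_e * sqrt(
        1 + 1 / n + n * (x0 - mean_x) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    return y_hat - margin, y_hat + margin, margin


low, high, margin = prediction_interval(6, absences, SLOPE, INTERCEPT, S_E, T_C)
print(round(margin, 2), round(low, 1), round(high, 1))  # 9.44 72.7 91.6
```

The computed margin matches the slide's 9.438 to within rounding of the intermediate values.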