Regression and Correlation Analysis
Objectives
To determine the relationship between the response variable and the independent variables for prediction purposes.
• Compute a simple linear regression model
• Interpret the slope and intercept in a linear regression model
• Check model adequacy
• Use the model for prediction purposes
Contents
1. Introduction
- Regression and correlation
2. Simple Linear Regression
- Simple linear regression model (deals with one independent variable)
- Least-squares estimation of parameters
- Hypothesis testing on the parameters
- Interpretation
3. Correlation
- Correlation coefficient
- Coefficient of determination and its interpretation
Learning Outcomes
Students will be able to:
• Identify the nature of the association between a given pair of variables
• Find a suitable regression model for a given set of data on two variables
• Check the model assumptions
• Interpret the parameters of the fitted model
• Predict or estimate Y values for given X values
Introduction
Regression and correlation are important statistical tools used to identify and quantify the relationship between two or more variables.
Regression is applied in almost every field, including engineering, the physical and chemical sciences, economics, the life and biological sciences, and the social sciences.
Regression analysis was first developed by Sir Francis Galton (1822-1911).
Regression and correlation are two different but closely related concepts.
Regression is a quantitative expression of the basic nature of the relationship between the dependent and independent variables.
Correlation is the strength of the relationship; that is, correlation measures how strong the relationship between two variables is.
Dependent variable
• In a research study, the dependent variable is the variable that you believe might be influenced or modified by some treatment or exposure. It may also represent the variable you are trying to predict. Sometimes the dependent variable is called the outcome variable. This definition depends on the context of the study
If one variable depends on another, we can say that one variable is a function of the other:
Y = ƒ(X)
Here Y depends on X in some manner. Because Y depends on X, Y is called the dependent variable, criterion variable, or response variable.
Independent variable
In a research study, an independent variable is a variable that you believe might influence your outcome measure.
X is called the independent variable, predictor variable, regressor, or explanatory variable.
This might be a variable that you control, like a treatment, or a variable not under your control, like an exposure.
It also might represent a demographic factor like age or gender
Regression
• Simple, Y = ƒ(X): linear or non-linear
• Multiple, Y = ƒ(X1, X2, …, Xk): linear or non-linear
CONTENTS
• Coefficients of correlation
– meaning
– values
– role
– significance
• Regression
– line of best fit
– prediction
– significance
• Correlation: the strength of the linear relationship between two variables
• Regression analysis: determines the nature of the relationship
Example: Is there a relationship between the number of units of alcohol consumed and the likelihood of developing cirrhosis of the liver?
Correlation and Covariance
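Correlation is the standardized covariance: $r = \dfrac{\operatorname{cov}(X, Y)}{s_X \, s_Y}$, where $s_X$ and $s_Y$ are the standard deviations of X and Y.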
The correlation coefficient measures the relative strength of the linear relationship between two variables.
The correlation is scale invariant: the units of measurement do not matter (it is unit-less).
Its sign gives the direction (- or +) and its magnitude gives the strength (0 to 1) of the linear relationship between X and Y.
• It is always true that -1 ≤ corr(X, Y) ≤ 1; that is, the correlation ranges between -1 and 1
• The closer to -1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker any linear relationship. A value close to zero indicates almost no linear association, but it does not mean there is no relationship
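The description of correlation as standardized covariance, and the claim that it is unit-less, can be checked numerically. The following is a minimal Python sketch using only the standard library; the height and weight values are hypothetical, and the helper names (`covariance`, `std_dev`, `correlation`) are illustrative and do not come from the slides.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance, using the n - 1 divisor."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def std_dev(values):
    """Sample standard deviation (square root of the variance)."""
    return math.sqrt(covariance(values, values))

def correlation(xs, ys):
    """Correlation as covariance divided by both standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

# Hypothetical data, for illustration only.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [105, 120, 130, 150, 155, 170]

r = correlation(heights_in, weights_lb)

# Changing the units (inches to cm, lbs to kg) should leave r unchanged.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

print(round(r, 4), round(r_metric, 4))
```

The two printed values should agree, because rescaling X or Y multiplies the covariance and the standard deviations by the same factors.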
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -0.6, r = 0, r = +0.3, r = +1, and a second panel with r = 0.]
Linear Correlation
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships.]
Linear Correlation
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships.]
Linear Correlation
[Figure: scatter plots of Y against X showing no relationship.]
Interpreting the Pearson correlation coefficient
• The value of r for this data is 0.39, indicating a weak positive linear association.
• Omitting the last observation, r is 0.96.
• Thus, r is sensitive to extreme observations.
[Figure: Scatterplot of Weight (lbs) vs Height (inches), with one extreme observation marked.]
• The value of r here is 0.94.
• However, a straight line model may not be suitable.
• The relationship appears curvilinear.
[Figure: scatter plot of Response against Predictor.]
continued…
Extreme Observation
• The value of r is -0.07.
• But the plot indicates positive linear association.
• Again, this anomaly is due to extreme data values.
[Figure: Scatterplot of Final marks vs OBT marks.]
• The value of r is around 0.006, thus indicating almost no linear association.
• However, the plot shows a strong relationship between the two variables.
• This exemplifies that r does not provide evidence of all relationships.
• These examples highlight the importance of looking at scatter plots of data prior to deciding on a model function.
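The two cautions illustrated by these examples (r is sensitive to extreme observations, and r near zero does not rule out a strong non-linear relationship) can be reproduced with a small sketch. The data below are hypothetical and are not the values behind the slides' plots; `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from sums of squares and products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical, nearly linear data.
x = [60, 62, 64, 66, 68, 70, 72]
y = [100, 110, 118, 130, 139, 150, 160]
r_clean = pearson_r(x, y)

# Adding a single extreme observation pulls r down sharply.
r_with_extreme = pearson_r(x + [76], y + [95])

# A symmetric U-shaped relationship: y is fully determined by x,
# yet the linear correlation is zero.
age = [0, 10, 20, 30, 40]
reaction = [(a - 20) ** 2 / 10 for a in age]
r_curved = pearson_r(age, reaction)

print(round(r_clean, 3), round(r_with_extreme, 3), round(r_curved, 3))
```

In both cases a scatter plot would reveal what the single number r hides.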
[Figure: Scatterplot of Reaction time in Seconds vs Age in years.]
Coefficient of Determination
R² has a value of 0.6483. This means that 64.83% of the variation in the auction selling prices (y) is explained by the regression model. The remaining 35.17% is unexplained, i.e., due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.
In general, the higher the value of R², the better the model fits the data.
R² = 1: Perfect match between the line and the data points.
R² = 0: There is no linear relationship between x and y.
Coefficient of determination
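[Figure: two data points (x1, y1) and (x2, y2) plotted with the regression line and the mean of y.]
Two data points (x1, y1) and (x2, y2) of a certain sample are shown.
$$(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 = (\hat{y}_1 - \bar{y})^2 + (\hat{y}_2 - \bar{y})^2 + (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2$$
Total variation in y = Variation explained by the regression line + Unexplained variation (error)
Variation in y = SSR + SSE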
Coefficient of Determination
• How "strong" is the relationship between predictor and outcome? (Fraction of the observed variance of the outcome variable explained by the predictor variables.)
• Relationship among SST, SSR, SSE:
SST = SSR + SSE
$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
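The decomposition SST = SSR + SSE, and the coefficient of determination as the explained fraction SSR/SST, can be verified on a small data set. This is a sketch under stated assumptions: the x and y values are hypothetical, and the line is fitted by ordinary least squares.

```python
# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained by the line
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained (error)

r_squared = ssr / sst

print(round(sst, 4), round(ssr + sse, 4), round(r_squared, 4))
```

The first two printed values should match, which is the identity SST = SSR + SSE for a least-squares line with an intercept.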
REGRESSION
Estimation Process
• Regression model: $y = \beta_0 + \beta_1 x + \varepsilon$
• Regression equation: $E(y) = \beta_0 + \beta_1 x$
• Unknown parameters: $\beta_0$, $\beta_1$
• Sample data: pairs $(x_1, y_1), \ldots, (x_n, y_n)$
• Estimated regression equation: $\hat{y} = b_0 + b_1 x$
• Sample statistics: $b_0$, $b_1$
• $b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$
Introduction
• We will examine the relationship between quantitative variables x and y via a mathematical equation.
• The motivation for using the technique:
– Forecast the value of a dependent variable (y) from the values of independent variables (x1, x2, …, xk).
– Analyze the specific relationships between the independent variables and the dependent variable.
For a continuous variable X the easiest way of checking for a linear relationship with Y is by means of a scatter plot of Y against X.
Hence, regression analysis can be started with a scatter plot.
Least Squares
1. "Best fit" means the differences between actual Y values and predicted Y values are a minimum. But positive differences offset negative ones, so square the errors!
2. LS minimizes the sum of the squared differences (errors) (SSE):
$$\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2}$$
Coefficient Equations
• Prediction equation: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
• Sample slope: $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
• Sample Y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
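The coefficient equations can be turned into a few lines of code. The sketch below computes the sample slope as SSxy / SSxx and the intercept as the mean of y minus the slope times the mean of x. The function name `least_squares` and the data are hypothetical; the points are chosen to lie exactly on the line y = 4 + 2x so the result is easy to check by hand.

```python
def least_squares(xs, ys):
    """Return (b0, b1): least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx          # sample slope
    b0 = y_bar - b1 * x_bar     # sample Y-intercept
    return b0, b1

# Hypothetical points lying exactly on y = 4 + 2x.
x = [0, 1, 2, 3, 4]
y = [4, 6, 8, 10, 12]

b0, b1 = least_squares(x, y)
print(b0, b1)  # prints 4.0 2.0

# Prediction equation applied to a new x value inside the observed range.
y_pred = b0 + b1 * 2.5
```

With noisy data the same function returns the line that minimizes the sum of squared errors rather than an exact fit.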
Interpreting regression coefficients
You should interpret the slope and the intercept of this line as follows:
– The slope represents the estimated average change in Y when X increases by one unit.
– The intercept represents the estimated average value of Y when X equals zero.
Interpretation of Coefficients
1. Slope ($\hat{\beta}_1$)
– Estimated Y changes by $\hat{\beta}_1$ for each 1-unit increase in X
• If $\hat{\beta}_1 = 2$, then Y is expected to increase by 2 for each 1-unit increase in X
2. Y-intercept ($\hat{\beta}_0$)
– Average value of Y when X = 0
• If $\hat{\beta}_0 = 4$, then the average of Y is expected to be 4 when X is 0
The Model
• The first-order linear model
$y = \beta_0 + \beta_1 x + \varepsilon$
y = dependent variable
x = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line (Rise/Run)
$\varepsilon$ = error variable
$\beta_0$ and $\beta_1$ are unknown population parameters and are therefore estimated from the data.
The Least Squares (Regression) Line
A good line is one that minimizes the sum of squared differences between the points and the line.
Model adequacy checking
When conducting linear regression, it is important to make sure the assumptions behind the model are met. It is also important to verify that the estimated linear regression model is a good fit for the data (often a linear regression line can be estimated by SAS, SPSS, MINITAB etc. even if it’s not appropriate—in this case it is up to you to judge whether the model is a good one).
Assumptions
• The relationship between the explanatory variable and the outcome variable is linear. In other words, each increase by one unit in an explanatory variable is associated with a fixed increase in the outcome variable.
• The regression equation describes the mean value of the dependent variable for a given value of the independent variable.
• The individual data points of Y (the response variable) for each value of the explanatory variable are normally distributed about the line of means (regression line).
• The variance of the data points about the line of means is the same for each value of explanatory variable.
Assumptions About the Error Term
1. The error $\varepsilon$ is a random variable with mean of zero.
2. The variance of $\varepsilon$, denoted by $\sigma^2$, is the same for all values of the independent variable.
3. The values of $\varepsilon$ are independent (randomly distributed).
4. The error $\varepsilon$ is a normally distributed random variable with mean zero and variance $\sigma^2$.
Testing the assumptions for regression - 2
• Normality (interval-level variables)
– Skewness and kurtosis must lie within acceptable limits (-1 to +1)
• How to test?
• You can examine a histogram. Normality of the distribution of Y data points can be checked by plotting a histogram of the residuals.
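The skewness and kurtosis check can be sketched in a few lines. This example assumes the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3); statistical packages such as SPSS apply small-sample adjustments, so their values may differ slightly. The residuals are hypothetical and the function name is illustrative.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    m2 = sum(d ** 2 for d in deviations) / n
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis

# Hypothetical residuals from a fitted regression line.
residuals = [-2, -1, -1, 0, 0, 0, 1, 1, 2]

skew, kurt = skewness_and_kurtosis(residuals)

# Rule of thumb from the slide: both statistics should lie within -1 to +1.
within_limits = abs(skew) <= 1 and abs(kurt) <= 1
print(skew, round(kurt, 4), within_limits)
```

A histogram of the same residuals remains the more informative check; the two statistics only summarize asymmetry and tail weight.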
• If the condition is violated?
– The regression procedure can overestimate significance (increasing the type I error rate), so a note of caution should be added to the interpretation of results.
Testing the assumptions - normality
To compute skewness and kurtosis for the included cases, select Descriptive Statistics|Descriptives… from the Analyze menu.
Testing the assumptions - normality
First, mark the checkboxes for Kurtosis and Skewness.
Second, click on the Continue button to complete the options.
Analysis of Residual
• To examine whether the regression model is appropriate for the data being analyzed, we can check the residual plots.
• Residual plots are:
– Plot a histogram of the residuals.
– Plot residuals against the fitted values.
– Plot residuals against the independent variable.
– Plot residuals over time if the data are chronological.
Analysis of Residual
• A histogram of the residuals provides a check on the normality assumption. A Normal quantile plot of the residuals can also be used to check the Normality assumption.
• Regression inference is robust against moderate lack of Normality. On the other hand, outliers and influential observations can invalidate the results of inference for regression.
• A plot of residuals against fitted values or the independent variable can be used to check the assumption of constant variance and the aptness of the model.
Analysis of Residual
• A plot of residuals against time provides a check on the assumption that the error terms are independent.
• The assumption of independence is the most critical one.
Residual plots
• The residuals should have no systematic pattern.
• The residual plot shown here is a scatter of points with no unusual individual observations and no systematic change as x increases.
[Figure: Degree Days residual plot, with residuals (about -1 to 1) plotted against Degree Days (0 to 60).]
Residual plots
• The points in this residual plot have a curved pattern, so a straight line fits poorly.
Residual plots
• The points in this plot show more spread for larger values of the explanatory variable x, so prediction will be less accurate when x is large.
Heteroscedasticity
• When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
• Diagnose heteroscedasticity by plotting the residuals against the predicted y.
[Figure: a scatter plot of the data and a plot of the residuals against the predicted y; the spread increases with y.]
Non-Independence of Error Variables
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Figure: two plots of residuals against time. In the first, note the runs of positive residuals, replaced by runs of negative residuals. In the second, note the oscillating behavior of the residuals around zero.]
Outliers
• An outlier is an observation that is unusually small or large.
• Several possibilities need to be investigated when an outlier is observed:
– There was an error in recording the value.
– The point does not belong in the sample.
– The observation is valid.
• Identify outliers from the scatter diagram.
• It is customary to suspect an observation is an outlier if its |standard residual| > 2
• An observation is also suspect if its DFFITS value is > 2
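The standardized-residual rule of thumb can be sketched as follows. This example assumes the simple definition of a standardized residual, the residual divided by the standard error of estimate sqrt(SSE / (n - 2)); software may instead report leverage-adjusted (studentized) residuals, which differ slightly. The data are hypothetical, with one deliberately unusual point, and DFFITS is not computed here.

```python
import math

# Hypothetical data: y = 2x except for one unusual observation at x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 20, 12, 14, 16, 18, 20]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Standard error of estimate, with n - 2 degrees of freedom.
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

standardized = [e / s_e for e in residuals]

# Flag observations whose |standardized residual| exceeds 2.
flagged = [i for i, z in enumerate(standardized) if abs(z) > 2]
print(flagged)  # index positions of suspected outliers
```

A flagged point is only a candidate: as the list above notes, it may be a recording error, a point that does not belong in the sample, or a valid observation.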
Variable transformations
• If the residual plot suggests that the variance is not constant, a transformation can be used to stabilize the variance.
• If the residual plot suggests a non-linear relationship between x and y, a transformation may reduce it to one that is approximately linear.
• Common linearizing transformations are: $1/x$, $\log(x)$
• Variance-stabilizing transformations are: $1/y$, $\log(y)$, $\sqrt{y}$, $y^2$
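A linearizing transformation can be illustrated with a sketch in which y depends on log(x) rather than on x itself. The data are hypothetical and constructed so that y = 3 + 2 log(x) exactly; `pearson_r` is an illustrative helper, and the comparison of correlations is only one informal way to see the effect.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from sums of squares and products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a logarithmic relationship.
x = [1, 2, 4, 8, 16, 32, 64]
y = [3 + 2 * math.log(v) for v in x]

# Before the transformation: the relationship with x is curved.
r_raw = pearson_r(x, y)

# After the transformation: y is exactly linear in log(x).
log_x = [math.log(v) for v in x]
r_log = pearson_r(log_x, y)

print(round(r_raw, 4), round(r_log, 4))
```

After transforming, the straight-line model is fitted to (log(x), y), and the residual plots should be re-examined on the transformed scale.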
Example
• The following observations come from a study of the relationship between scores on a mathematics placement test conducted at a faculty and the final grades of 20 students. The faculty had decided not to admit students who scored below 35 on the placement test.
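Table

| Placement test | Final grade |
|---|---|
| 50 | 53 |
| 35 | 41 |
| 35 | 51 |
| 40 | 62 |
| 55 | 68 |
| 65 | 63 |
| 35 | 22 |
| 60 | 70 |
| 90 | 85 |
| 35 | 40 |
| 90 | 75 |
| 80 | 91 |
| 60 | 58 |
| 60 | 71 |
| 60 | 71 |
| 40 | 49 |
| 55 | 58 |
| 50 | 57 |
| 65 | 77 |
| 50 | 59 |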
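As a rough illustration of the whole workflow on this example, the sketch below fits a least-squares line to the 20 (placement test, final grade) pairs, computes the correlation coefficient, and predicts a final grade for a placement score inside the observed range (35 to 90). The pairing of values follows the order in which they appear in the table; the variable names and the chosen score of 60 are illustrative assumptions.

```python
import math

# (placement test, final grade) pairs, in the order given in the table.
placement = [50, 35, 35, 40, 55, 65, 35, 60, 90, 35,
             90, 80, 60, 60, 60, 40, 55, 50, 65, 50]
final = [53, 41, 51, 62, 68, 63, 22, 70, 85, 40,
         75, 91, 58, 71, 71, 49, 58, 57, 77, 59]

n = len(placement)
x_bar = sum(placement) / n
y_bar = sum(final) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(placement, final))
ss_xx = sum((x - x_bar) ** 2 for x in placement)
ss_yy = sum((y - y_bar) ** 2 for y in final)

b1 = ss_xy / ss_xx                    # estimated slope
b0 = y_bar - b1 * x_bar               # estimated intercept
r = ss_xy / math.sqrt(ss_xx * ss_yy)  # correlation coefficient

# Residuals, for use in the adequacy checks described earlier.
residuals = [y - (b0 + b1 * x) for x, y in zip(placement, final)]

# Predict only within the range of x used to build the model.
score = 60
predicted_grade = b0 + b1 * score

print(round(b0, 2), round(b1, 3), round(r, 3), round(predicted_grade, 1))
```

Before relying on the prediction, the residuals should be plotted and checked as described in the model adequacy slides.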
Scatter plot
[Figure: Scatterplot of Final grade vs placement test.]
Correlations: Daily RF(0.01cm), Particle weight (µg/m3)
• Pearson correlation of Daily RF(0.01cm) and Particle weight (µg/m3) = 0.726
• P-Value = 0.011
SAS For Regression and Correlation
PROC REG
Submit the following program in SAS. In addition to the PROC REG and MODEL statements with which you are familiar, the three PLOT statements request a plot of level by weight with the fitted regression line, a plot of the residuals by weight, and a plot of the studentized (standardized) residuals by weight:
    PROC REG DATA = blood;
      MODEL level = weight;
      PLOT level * weight;
      PLOT residual. * weight;
      PLOT student. * weight;
    RUN;
Interpreting Output
Notice that the overall F-test has a p-value of 0.2160, which is greater than 0.05. Therefore, we fail to reject H0: β1 = 0; the data do not provide significant evidence of a linear relationship between blood level and weight.
Now look at the following plots:
Plot of Regression Line: Notice it is the same plot as the one you created from PROC GPLOT, except the fitted regression line has been added to it.
Plot of residuals * weight: you want an even spread of points above and below the dashed line. This is a good way to eyeball the data for potential outliers.
Plot of studentized residuals * weight: look for values with an absolute value larger than 2.6 to determine if there are any outliers.
You can see from the plot that the observation with weight = 128 (observation #4) is an outlier.
The residual plots also help you determine whether the assumption of constant variance is met. Because the residuals appear to be randomly scattered without any definite pattern, this suggests that the data are independent with constant variance.
The Normality Assumption
A convenient way to test for normality is by constructing a "Normal Quantile-Quantile" (NQQ) plot. This plots the residuals you would see under normality versus the residuals that are actually observed. If the data are completely normal, the residuals will follow a 45° line. Use the following code in SAS to make the NQQ plot:
    PLOT residual. * nqq.;
    RUN;
Residual vs. NQQ Plot
Interpreting the NQQ Plot
The residuals do not clearly follow a 45° line. Because the tails of this line seem curved, this suggests that the data may be skewed, not normally distributed.
Recommendations
• It is extremely important to look at plots of the raw data prior to selecting a tentative model.
• Be cautious in interpreting the correlation coefficient r.
• Proper model assessment should be done prior to using the fitted model for predictions.
• Focus on the range of x values used for building the model prior to making predictions at a desired x value.