TRANSCRIPT
Review Session
Linear Regression
Correlation
• Pearson’s r – Measures the strength and
type of the relationship between the x and y variables
– Ranges from -1 to +1

Value  Relationship Meaning
+1     Perfect direct relationship between x and y. As x gets bigger, so does y.
 0     No relationship exists between x and y.
-1     Perfect inverse relationship between x and y. As x gets bigger, y gets smaller.
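The definition above can be sketched directly: Pearson’s r is the covariance of x and y divided by the product of their standard deviations. This is a minimal illustration, not from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations. Ranges from -1 to +1."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly direct relationship gives r = +1,
# a perfectly inverse one gives r = -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0
```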
Correlation printout in Minitab
• Top number is the correlation
• Bottom number is the p-value
Simple Linear Regression
y   response
x1  predictor
b0  constant (y-intercept)
b1  coefficient for x1 (slope)
e   error

y = b0 + b1x1 + e
Simple Linear Regression – Making a Point Prediction
y = b0 + b1x1 + e
GPA = 1.47 + 0.00323(GMAT)
For a person with a GMAT Score of 400, what is the expected 1st year GPA?
GPA = 1.47 + 0.00323(GMAT)
GPA = 1.47 + 0.00323(400)
GPA = 1.47 + 1.292
GPA = 2.76
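The point prediction above is just the fitted equation evaluated at the chosen GMAT score; a one-line sketch using the slide’s coefficients:

```python
# Fitted model from the slide: GPA = 1.47 + 0.00323 * GMAT
b0, b1 = 1.47, 0.00323

def predict_gpa(gmat):
    """Point prediction: plug the predictor value into the fitted equation."""
    return b0 + b1 * gmat

print(round(predict_gpa(400), 2))  # 2.76
```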
Simple Linear Regression
y = b0 + b1x1 + e
GPA = 1.47 + 0.00323(GMAT)
What’s the 95% CI for the GPA of a person with a GMAT score of 400?
GPA = 2.76
SE = 0.26
2.76 +/- 2(0.26)
95% CI = (2.24, 3.28)
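The interval above uses the rough rule of two standard errors either side of the point prediction; a sketch with the slide’s numbers:

```python
# Approximate 95% CI: point prediction +/- 2 standard errors,
# using the prediction (2.76) and SE (0.26) from the slide.
prediction, se = 2.76, 0.26
lower, upper = prediction - 2 * se, prediction + 2 * se
print(round(lower, 2), round(upper, 2))  # 2.24 3.28
```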
Coefficient CIs and Testing
y = b0 + b1x1 + e
GPA = 1.47 + 0.00323(GMAT)
Find the 95% CI for the coefficients.
b0 = 1.47 +/- 2(0.22) = 1.47 +/- 0.44 = (1.03, 1.91)
b1 = 0.0032 +/- 2(0.0004) = 0.0032 +/- 0.0008 = (0.0024, 0.0040)
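Both coefficient intervals follow the same estimate ± 2·SE pattern, so one helper covers them; a sketch using the slide’s estimates and standard errors:

```python
def coef_ci(b, se):
    """Approximate 95% CI for a coefficient: estimate +/- 2 standard errors."""
    return (b - 2 * se, b + 2 * se)

# Estimates and standard errors from the slide.
lo0, hi0 = coef_ci(1.47, 0.22)      # intercept b0
lo1, hi1 = coef_ci(0.0032, 0.0004)  # slope b1
print(round(lo0, 2), round(hi0, 2))  # 1.03 1.91
print(round(lo1, 4), round(hi1, 4))  # 0.0024 0.004
```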
Coefficient Testing
y = b0 + b1x1 + e
GPA = 1.47 + 0.00323(GMAT)
H0: b = 0
H1: b <> 0
The p-value for each coefficient is the result of a hypothesis test
If p-value <= 0.05, reject H0 and keep the coefficient in the model.
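The test behind each printed p-value divides the estimate by its standard error and compares the result to a reference distribution. As a sketch (using a normal approximation rather than the exact t distribution Minitab uses), with the slide’s slope:

```python
import math

def coef_p_value(b, se):
    """Two-sided p-value for H0: b = 0, using a normal approximation
    to the t distribution (reasonable for moderately large samples)."""
    z = abs(b / se)
    # Standard normal CDF written with the error function.
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 2 * (1 - cdf)

# Slope from the slide: b1 = 0.0032 with SE = 0.0004, so z = 8 and p is tiny.
p = coef_p_value(0.0032, 0.0004)
print(p <= 0.05)  # True: reject H0 and keep the coefficient
```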
R2
• r2 and R2
• Square of Pearson's r
• Little r2 is for simple regression
• Big R2 is used for multiple regression
R2 interpretation
0 – No correlation
1 – Perfect correlation
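For simple regression, r2 is literally the square of Pearson’s r, so it can be computed from the same sums; an illustrative sketch (the data are made up, not from the slides):

```python
def r_squared(xs, ys):
    """For simple regression, r^2 is the square of Pearson's r: the
    fraction of the variation in y explained by the linear fit on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))           # 1.0: perfect fit
print(round(r_squared([1, 2, 3, 4], [2, 4, 5, 9]), 2))  # 0.93: strong but imperfect
```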
Sample R2 values
[Four scatter plots of y versus X, illustrating sample fits with R2 = 0.80, R2 = 0.60, R2 = 0.30, and R2 = 0.20]
Regression ANOVA
• H0: b1 = b2 = … = bk = 0
• Ha: at least one b <> 0
• F-statistic, df1, df2, p-value
• If p <= 0.05, at least one of the b’s is not zero
• If p > 0.05, it’s possible that all of the b’s are zero
Diagnostics - Residuals
• Residuals = errors
• Residuals should be normally distributed
• Residuals should have a constant variance
– Homoscedasticity: no pattern in the residual distribution
– Heteroscedasticity: error magnitude increases or decreases with the magnitude of an independent variable (a pattern in the residual distribution)
• Autocorrelation: successive errors are correlated with each other
• Heteroscedasticity and autocorrelation indicate problems with the model
• Use the 4-in-one plot for these diagnostics
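Residuals are just observed minus fitted values; computing them is the first step before any of the plots above. A minimal sketch with an illustrative model and data (not from the slides):

```python
# Residual = observed y minus fitted y. These are what the 4-in-one
# plot examines for normality and constant variance.
b0, b1 = 1.0, 2.0                       # illustrative fitted coefficients
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.8, 7.2, 8.9, 11.1]         # illustrative observations

fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
print([round(r, 2) for r in residuals])  # [0.1, -0.2, 0.2, -0.1, 0.1]
```

The residuals here are small and show no trend against the fitted values, which is what homoscedasticity looks like numerically.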
Adding a Power Transformation
• Each “bump” or “U” shape in a scatter plot indicates that an additional power may be involved.
– 0 bumps: x
– 1 bump: x2
– 2 bumps: x3
• Standard equation is y = b0 + b1x + b2x^2
• Don’t forget: check that b1 and b2 are statistically significant, and that the overall model is also statistically significant.
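In practice the power term is just another column of data: compute x^2 from x and include it as an extra predictor when fitting y = b0 + b1x + b2x^2. A sketch with illustrative x values:

```python
# Build the predictor columns for y = b0 + b1*x + b2*x^2.
# The squared term is computed from x and added as an ordinary column.
xs = [1, 2, 3, 4, 5]
design = [(1, x, x ** 2) for x in xs]  # columns: intercept, x, x^2
print(design[2])  # (1, 3, 9)
```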
Categorical Variables
• Occasionally it is necessary to add a categorical variable to a regression model.
• Suppose that we have a car dealership, and we want to model the sale price based on the time on the lot and the salesperson (Tom, Dick, or Harry).
– The time on the lot is a continuous (linear) variable.
– Salesperson is a categorical variable.
Categorical Variables
• Categorical variables are modeled in regression using Boolean logic
Boolean Logic
Yes 1
No 0
Example: y = b0 + b_time·x_time + b_Tom·x_Tom + b_Dick·x_Dick

        x_Tom  x_Dick
Tom       1      0
Dick      0      1
Harry     0      0
Categorical Variables

        x_Tom  x_Dick
Tom       1      0
Dick      0      1
Harry     0      0
Harry is the baseline category for the model
Tom’s and Dick’s performance will be gauged relative to Harry’s, but not against each other’s.
Example: y = b0 + b_time·x_time + b_Tom·x_Tom + b_Dick·x_Dick
Categorical Variables
• Interpretation
– Tom’s average sale price is b_Tom more than Harry’s
– Dick’s average sale price is b_Dick more than Harry’s
        x_Tom  x_Dick
Tom       1      0    y = b0 + b_time·x_time + b_Tom
Dick      0      1    y = b0 + b_time·x_time + b_Dick
Harry     0      0    y = b0 + b_time·x_time

y = b0 + b_time·x_time + b_Tom·x_Tom + b_Dick·x_Dick
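The dummy coding in the table above can be sketched directly: each salesperson maps to a 0/1 pair, with Harry as the all-zeros baseline. The coefficient values below are illustrative, not from the slides.

```python
def dummy_encode(salesperson):
    """Boolean (0/1) dummy variables for the slide's example.
    Harry is the baseline: both dummies are 0."""
    x_tom = 1 if salesperson == "Tom" else 0
    x_dick = 1 if salesperson == "Dick" else 0
    return x_tom, x_dick

def predict_price(time, salesperson, b0, b_time, b_tom, b_dick):
    """y = b0 + b_time*x_time + b_Tom*x_Tom + b_Dick*x_Dick.
    In practice the coefficients come from the fitted regression;
    the values passed below are made up for illustration."""
    x_tom, x_dick = dummy_encode(salesperson)
    return b0 + b_time * time + b_tom * x_tom + b_dick * x_dick

# Harry gets only the baseline terms; Tom's dummy adds b_Tom on top.
print(predict_price(10, "Harry", b0=20000, b_time=-50, b_tom=500, b_dick=-300))
print(predict_price(10, "Tom", b0=20000, b_time=-50, b_tom=500, b_dick=-300))
```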
Multicollinearity
• Multicollinearity: Predictor variables are correlated with each other.
• Multicollinearity results in instability in the estimation of the b’s
– P-values will be larger
– Confidence in the b’s decreases or disappears
(magnitude and sign may be different from the expected values)
– A small change in the data results in large variations in the coefficients
– Read 11.11
VIF – Variance Inflation Factor
• Measures the degree to which the confidence in the estimate of a coefficient is decreased by multicollinearity.
• The larger the VIF, the greater a problem multicollinearity is.
• If VIF > 10, there may be a problem
• If VIF >= 15, there may be a serious problem
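The slides don’t show the formula, but the standard definition is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors; a sketch:

```python
def vif(r_squared_j):
    """Variance Inflation Factor for predictor j, where r_squared_j is
    the R^2 from regressing predictor j on the other predictors."""
    return 1.0 / (1.0 - r_squared_j)

print(round(vif(0.90), 2))  # 10.0 -> possible multicollinearity problem
print(round(vif(0.50), 2))  # 2.0  -> little cause for concern
```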
Model Selection
1) Start with everything.
2) Delete variables with high VIFs one at a time.
3) Delete variables one at a time, deleting the one with the largest p-value.
4) Stop when all p-values are less than 0.05.
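Steps 3–4 above can be sketched as a loop. The p-values below stand in for a hypothetical fitted model’s output; in a real run the model would be refit after every deletion (and the VIF pass of step 2 would come first).

```python
# Backward elimination: repeatedly drop the predictor with the largest
# p-value until everything remaining is significant at 0.05.
# These p-values are hypothetical, for illustration only.
p_values = {"price": 0.001, "income": 0.03, "ads": 0.40, "weather": 0.72}

while max(p_values.values()) > 0.05:
    worst = max(p_values, key=p_values.get)
    del p_values[worst]          # in practice: refit the model here

print(sorted(p_values))  # ['income', 'price']
```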
Demand Price Curve
The demand-price function is nonlinear:
D = b0·P^(b1)
A log transformation makes it linear:
ln(D) = ln(b0) + b1·ln(P)
Run the regression on the transformed variables.
Plug the coefficients into the equation below (the fitted intercept plays the role of ln(b0)):
D = e^(b0)·P^(b1)
Make your projections with this last equation.
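The whole procedure fits in a few lines: log both variables, run ordinary least squares on the logs, then exponentiate the intercept to recover the constant. The demand data below are synthetic, generated from D = 100·P^(-2) so the fit recovers the parameters exactly.

```python
import math

# Log-transform, fit a straight line by least squares, then back-transform
# to recover D = k * P^b1.
prices = [1.0, 2.0, 4.0, 5.0]
demand = [100 * p ** -2 for p in prices]   # synthetic: k = 100, b1 = -2

lx = [math.log(p) for p in prices]
ly = [math.log(d) for d in demand]

n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / \
     sum((x - mx) ** 2 for x in lx)
b0 = my - b1 * mx              # intercept on the log scale = ln(k)
k = math.exp(b0)

print(round(k, 4), round(b1, 4))  # 100.0 -2.0
```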
Demand Price Curve
1) Create a variable for the natural log of demand and the natural log of each independent variable.
– In Excel: =LN(demand), =LN(price), =LN(income), etc.
2) Run the regression on the transformed variables.
3) Place the coefficients in the equation: d = e^(constant)·p^(b1)·i^(b2)
4) Simplify to: d = k·p^(b1)·i^(b2)
(Note that e^(constant) = k)
5) If income is not included, then the equation is just: d = k·p^(b1)

The demand-price function is nonlinear: d = k·p^(b1)
A log transformation makes it linear: ln(d) = ln(k) + b1·ln(p)