TRANSCRIPT
Review Session
Linear Regression
Correlation
• Pearson’s r – Measures the strength and
type of the relationship between the x and y variables
– Ranges from -1 to +1

Value  Relationship Meaning
+1     Perfect direct relationship between x and y. As x gets bigger, so does y.
 0     No relationship exists between x and y.
-1     Perfect inverse relationship between x and y. As x gets bigger, y gets smaller.
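The definition above can be sketched directly: Pearson’s r is the covariance of x and y divided by the product of their standard deviations. This is a minimal illustration, not from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations. Ranges from -1 to +1."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly direct relationship gives r = +1,
# a perfectly inverse one gives r = -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0
```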
Correlation printout in Minitab
• Top number is the correlation
• Bottom number is the p-value
Simple Linear Regression
y   response
x1  predictor
b0  constant (y-intercept)
b1  coefficient for x1 (slope)
e   error

y = b0 + b1x1 + e
Simple Linear Regression – Making a Point Prediction
y = b0 + b1x1 + e
GPA = 1.47 + 0.00323(GMAT)
For a person with a GMAT Score of 400, what is the expected 1st year GPA?
GPA = 1.47 + 0.00323(GMAT)
GPA = 1.47 + 0.00323(400)
GPA = 1.47 + 1.292
GPA = 2.76
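The point prediction above is just the fitted equation evaluated at the chosen GMAT score; a one-line sketch using the slide’s coefficients:

```python
# Fitted model from the slide: GPA = 1.47 + 0.00323 * GMAT
b0, b1 = 1.47, 0.00323

def predict_gpa(gmat):
    """Point prediction: plug the predictor value into the fitted equation."""
    return b0 + b1 * gmat

print(round(predict_gpa(400), 2))  # 2.76
```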
Simple Linear Regression
y = b0 + b1x1 + e
GPA = 1.47 + 0.00323(GMAT)
What’s the 95% CI for the GPA of a person with a GMAT score of 400?
GPA = 2.76
SE = 0.26
2.76 +/- 2(0.26)
95% CI = (2.24, 3.28)
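The interval above uses the rough rule of two standard errors either side of the point prediction; a sketch with the slide’s numbers:

```python
# Approximate 95% CI: point prediction +/- 2 standard errors,
# using the prediction (2.76) and SE (0.26) from the slide.
prediction, se = 2.76, 0.26
lower, upper = prediction - 2 * se, prediction + 2 * se
print(round(lower, 2), round(upper, 2))  # 2.24 3.28
```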
Coefficient CIs and Testing
y = b0 + b1x1 + e
GPA = 1.47 + 0.00323(GMAT)
Find the 95% CI for the coefficients.
b0 = 1.47 +/- 2(0.22) = 1.47 +/- 0.44 = (1.03, 1.91)
b1 = 0.0032 +/- 2(0.0004) = 0.0032 +/- 0.0008 = (0.0024, 0.0040)
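Both coefficient intervals follow the same estimate ± 2·SE pattern, so one helper covers them; a sketch using the slide’s estimates and standard errors:

```python
def coef_ci(b, se):
    """Approximate 95% CI for a coefficient: estimate +/- 2 standard errors."""
    return (b - 2 * se, b + 2 * se)

# Estimates and standard errors from the slide.
lo0, hi0 = coef_ci(1.47, 0.22)      # intercept b0
lo1, hi1 = coef_ci(0.0032, 0.0004)  # slope b1
print(round(lo0, 2), round(hi0, 2))  # 1.03 1.91
print(round(lo1, 4), round(hi1, 4))  # 0.0024 0.004
```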
Coefficient Testing
y = b0 + b1x1 + e
GPA = 1.47 + 0.00323(GMAT)
H0: b = 0
H1: b <> 0
The p-value for each coefficient is the result of a hypothesis test
If p-value <= 0.05, reject H0 and keep the coefficient in the model.
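The test behind each printed p-value divides the estimate by its standard error and compares the result to a reference distribution. As a sketch (using a normal approximation rather than the exact t distribution Minitab uses), with the slide’s slope:

```python
import math

def coef_p_value(b, se):
    """Two-sided p-value for H0: b = 0, using a normal approximation
    to the t distribution (reasonable for moderately large samples)."""
    z = abs(b / se)
    # Standard normal CDF written with the error function.
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 2 * (1 - cdf)

# Slope from the slide: b1 = 0.0032 with SE = 0.0004, so z = 8 and p is tiny.
p = coef_p_value(0.0032, 0.0004)
print(p <= 0.05)  # True: reject H0 and keep the coefficient
```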
R2
• r2 and R2
• Square of Pearson's r
• Little r2 is for simple regression
• Big R2 is used for multiple regression
R2 interpretation
0 – No correlation
1 – Perfect correlation
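For simple regression, r2 is literally the square of Pearson’s r, so it can be computed from the same sums; an illustrative sketch (the data are made up, not from the slides):

```python
def r_squared(xs, ys):
    """For simple regression, r^2 is the square of Pearson's r: the
    fraction of the variation in y explained by the linear fit on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))           # 1.0: perfect fit
print(round(r_squared([1, 2, 3, 4], [2, 4, 5, 9]), 2))  # 0.93: strong but imperfect
```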
Sample R2 values
[Four scatter plots of y versus X, illustrating sample fits with R2 = 0.80, R2 = 0.60, R2 = 0.30, and R2 = 0.20]
Regression ANOVA
• H0: b1 = b2 = … = bk = 0
• Ha: at least one b <> 0
• F-statistic, df1, df2, p-value
• If p <= 0.05, at least one of the b’s is not zero
• If p > 0.05, it’s possible that all of the b’s are zero
Diagnostics - Residuals
• Residuals = errors
• Residuals should be normally distributed
• Residuals should have a constant variance
– Homoscedasticity: no pattern in the residual distribution
– Heteroscedasticity: error magnitude increases or decreases with the magnitude of an independent variable (a pattern in the residual distribution)
• Autocorrelation: successive errors are correlated with each other
• Heteroscedasticity and autocorrelation indicate problems with the model
• Use the 4-in-one plot for these diagnostics
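Residuals are just observed minus fitted values; computing them is the first step before any of the plots above. A minimal sketch with an illustrative model and data (not from the slides):

```python
# Residual = observed y minus fitted y. These are what the 4-in-one
# plot examines for normality and constant variance.
b0, b1 = 1.0, 2.0                       # illustrative fitted coefficients
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.8, 7.2, 8.9, 11.1]         # illustrative observations

fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
print([round(r, 2) for r in residuals])  # [0.1, -0.2, 0.2, -0.1, 0.1]
```

The residuals here are small and show no trend against the fitted values, which is what homoscedasticity looks like numerically.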
Adding a Power Transformation
• Each “bump” or “U” shape in a scatter plot indicates that an additional power may be involved.
– 0 bumps: x
– 1 bump: x2
– 2 bumps: x3
• Standard equation is y = b0 + b1x + b2x^2
• Don’t forget: check that b1 and b2 are statistically significant, and that the overall model is also statistically significant.
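In practice the power term is just another column of data: compute x^2 from x and include it as an extra predictor when fitting y = b0 + b1x + b2x^2. A sketch with illustrative x values:

```python
# Build the predictor columns for y = b0 + b1*x + b2*x^2.
# The squared term is computed from x and added as an ordinary column.
xs = [1, 2, 3, 4, 5]
design = [(1, x, x ** 2) for x in xs]  # columns: intercept, x, x^2
print(design[2])  # (1, 3, 9)
```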
Categorical Variables
• Occasionally it is necessary to add a categorical variable to a regression model.
• Suppose that we have a car dealership, and we want to model the sale price based on the time on the lot and the salesperson (Tom, Dick, or Harry).
– The time on the lot is a continuous (linear) variable.
– Salesperson is a categorical variable.
Categorical Variables
• Categorical variables are modeled in regression using Boolean logic
Boolean Logic
Yes 1
No 0
Example: y = b0 + b_time·x_time + b_Tom·x_Tom + b_Dick·x_Dick

        x_Tom  x_Dick
Tom       1      0
Dick      0      1
Harry     0      0
Categorical Variables

        x_Tom  x_Dick
Tom       1      0
Dick      0      1
Harry     0      0
Harry is the baseline category for the model
Tom’s and Dick’s performance will be gauged relative to Harry’s, but not against each other’s.
Example: y = b0 + b_time·x_time + b_Tom·x_Tom + b_Dick·x_Dick
Categorical Variables
• Interpretation
– Tom’s average sale price is b_Tom more than Harry’s
– Dick’s average sale price is b_Dick more than Harry’s
        x_Tom  x_Dick
Tom       1      0    y = b0 + b_time·x_time + b_Tom
Dick      0      1    y = b0 + b_time·x_time + b_Dick
Harry     0      0    y = b0 + b_time·x_time

y = b0 + b_time·x_time + b_Tom·x_Tom + b_Dick·x_Dick
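The dummy coding in the table above can be sketched directly: each salesperson maps to a 0/1 pair, with Harry as the all-zeros baseline. The coefficient values below are illustrative, not from the slides.

```python
def dummy_encode(salesperson):
    """Boolean (0/1) dummy variables for the slide's example.
    Harry is the baseline: both dummies are 0."""
    x_tom = 1 if salesperson == "Tom" else 0
    x_dick = 1 if salesperson == "Dick" else 0
    return x_tom, x_dick

def predict_price(time, salesperson, b0, b_time, b_tom, b_dick):
    """y = b0 + b_time*x_time + b_Tom*x_Tom + b_Dick*x_Dick.
    In practice the coefficients come from the fitted regression;
    the values passed below are made up for illustration."""
    x_tom, x_dick = dummy_encode(salesperson)
    return b0 + b_time * time + b_tom * x_tom + b_dick * x_dick

# Harry gets only the baseline terms; Tom's dummy adds b_Tom on top.
print(predict_price(10, "Harry", b0=20000, b_time=-50, b_tom=500, b_dick=-300))
print(predict_price(10, "Tom", b0=20000, b_time=-50, b_tom=500, b_dick=-300))
```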
Multicollinearity
• Multicollinearity: Predictor variables are correlated with each other.
• Multicollinearity results in instability in the estimation of the b’s
– P-values will be larger
– Confidence in the b’s decreases or disappears
(magnitude and sign may be different from the expected values)
– A small change in the data results in large variations in the coefficients
– Read 11.11
VIF – Variance Inflation Factor
• Measures the degree to which the confidence in the estimate of a coefficient is decreased by multicollinearity.
• The larger the VIF, the greater a problem multicollinearity is.
• If VIF > 10, there may be a problem
• If VIF >= 15, there may be a serious problem
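The slides don’t show the formula, but the standard definition is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors; a sketch:

```python
def vif(r_squared_j):
    """Variance Inflation Factor for predictor j, where r_squared_j is
    the R^2 from regressing predictor j on the other predictors."""
    return 1.0 / (1.0 - r_squared_j)

print(round(vif(0.90), 2))  # 10.0 -> possible multicollinearity problem
print(round(vif(0.50), 2))  # 2.0  -> little cause for concern
```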
Model Selection
1) Start with everything.
2) Delete variables with high VIFs one at a time.
3) Delete variables one at a time, deleting the one with the largest p-value.
4) Stop when all p-values are less than 0.05.
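Steps 3–4 above can be sketched as a loop. The p-values below stand in for a hypothetical fitted model’s output; in a real run the model would be refit after every deletion (and the VIF pass of step 2 would come first).

```python
# Backward elimination: repeatedly drop the predictor with the largest
# p-value until everything remaining is significant at 0.05.
# These p-values are hypothetical, for illustration only.
p_values = {"price": 0.001, "income": 0.03, "ads": 0.40, "weather": 0.72}

while max(p_values.values()) > 0.05:
    worst = max(p_values, key=p_values.get)
    del p_values[worst]          # in practice: refit the model here

print(sorted(p_values))  # ['income', 'price']
```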
Demand Price Curve
The demand-price function is nonlinear:
D = b0·P^(b1)
A log transformation makes it linear:
ln(D) = ln(b0) + b1·ln(P)
Run the regression on the transformed variables.
Plug the coefficients into the equation below (the fitted intercept plays the role of ln(b0)):
D = e^(b0)·P^(b1)
Make your projections with this last equation.
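The whole procedure fits in a few lines: log both variables, run ordinary least squares on the logs, then exponentiate the intercept to recover the constant. The demand data below are synthetic, generated from D = 100·P^(-2) so the fit recovers the parameters exactly.

```python
import math

# Log-transform, fit a straight line by least squares, then back-transform
# to recover D = k * P^b1.
prices = [1.0, 2.0, 4.0, 5.0]
demand = [100 * p ** -2 for p in prices]   # synthetic: k = 100, b1 = -2

lx = [math.log(p) for p in prices]
ly = [math.log(d) for d in demand]

n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / \
     sum((x - mx) ** 2 for x in lx)
b0 = my - b1 * mx              # intercept on the log scale = ln(k)
k = math.exp(b0)

print(round(k, 4), round(b1, 4))  # 100.0 -2.0
```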
Demand Price Curve
1) Create a variable for the natural log of demand and the natural log of each independent variable.
– In Excel: =LN(demand), =LN(price), =LN(income), etc.
2) Run the regression on the transformed variables.
3) Place the coefficients in the equation: d = e^(constant)·p^(b1)·i^(b2)
4) Simplify to: d = k·p^(b1)·i^(b2)
(Note that e^(constant) = k)
5) If income is not included, then the equation is just: d = k·p^(b1)

The demand-price function is nonlinear: d = k·p^(b1)
A log transformation makes it linear: ln(d) = ln(k) + b1·ln(p)