
Regression and Model Selection
Book Chapters 3 and 6

Carlos M. Carvalho
The University of Texas McCombs School of Business

1

1. Simple Linear Regression
2. Multiple Linear Regression
3. Dummy Variables
4. Residual Plots and Transformations
5. Variable Selection and Regularization
6. Dimension Reduction Methods

1. Regression: General Introduction

- Regression analysis is the most widely used statistical tool for understanding relationships among variables.

- It provides a conceptually simple method for investigating functional relationships between one or more factors and an outcome of interest.

- The relationship is expressed in the form of an equation or a model connecting the response (or dependent) variable to one or more explanatory (or predictor) variables.

1

1st Example: Predicting House Prices

Problem:

- Predict market price based on observed characteristics.

Solution:

- Look at property sales data where we know the price and some observed characteristics.

- Build a decision rule that predicts price as a function of the observed characteristics.

2

Predicting House Prices

It is much more useful to look at a scatterplot:

[Scatterplot of price (vertical axis, roughly 60 to 160) against size (horizontal axis, roughly 1.0 to 3.5)]

In other words, view the data as points in the X × Y plane.

3

Regression Model

Y = response or outcome variable
X1, X2, X3, ..., Xp = explanatory or input variables

The general relationship is approximated by

Y = f(X1, X2, ..., Xp) + e

and a linear relationship is written

Y = b0 + b1X1 + b2X2 + ... + bpXp + e

4

Linear Prediction

There appears to be a linear relationship between price and size: as size goes up, price goes up.

The line shown was fit by the "eyeball" method.

5

Linear Prediction

[Diagram of a line in the X–Y plane: intercept b0, and slope b1 shown as the rise over a one-unit run in X]

Y = b0 + b1X

Our "eyeball" line has b0 = 35, b1 = 40.

6

Linear Prediction

Can we do better than the eyeball method?

We desire a strategy for estimating the slope and intercept parameters in the model Y = b0 + b1X.

A reasonable way to fit a line is to minimize the amount by which the fitted value differs from the actual value.

This amount is called the residual.

7

Linear Prediction

What is the "fitted value"?

[Diagram: an observed point (Xi, Yi) and its fitted value Ŷi on the line]

The dots are the observed values and the line represents our fitted values, given by Ŷi = b0 + b1Xi.

8

Linear Prediction

What is the "residual" for the ith observation?

[Diagram: the residual ei = Yi − Ŷi is the vertical distance from the observed point (Xi, Yi) to the fitted value Ŷi]

We can write Yi = Ŷi + (Yi − Ŷi) = Ŷi + ei.

9

Least Squares

Ideally we want to minimize the size of all residuals:

- If they were all zero we would have a perfect line.
- There is a trade-off between moving closer to some points and at the same time moving away from other points.
- Minimize the "total" of the residuals to get the best fit.

Least Squares chooses b0 and b1 to minimize the sum of squared residuals:

Σᵢ eᵢ² = e1² + e2² + ... + eN²
       = (Y1 − Ŷ1)² + (Y2 − Ŷ2)² + ... + (YN − ŶN)²
       = Σᵢ (Yi − [b0 + b1Xi])²

10

Least Squares – Excel Output

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.909
  R Square            0.827
  Adjusted R Square   0.813
  Standard Error      14.138
  Observations        15

ANOVA
              df       SS          MS          F        Significance F
  Regression   1   12393.108   12393.108    61.998      2.66E-06
  Residual    13    2598.626     199.894
  Total       14   14991.733

              Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
  Intercept      38.885          9.094         4.276   0.0009      19.238       58.531
  Size           35.386          4.494         7.874   2.66E-06    25.677       45.095

11
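The same kind of fit can be reproduced outside Excel. A minimal Python sketch using statsmodels, with made-up size/price numbers standing in for the house data (which is not reproduced on the slides):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in data: size in 1000 sqft, price in $1000s
size  = np.array([1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 2.8, 3.0, 3.2, 3.5])
price = np.array([70, 78, 92, 100, 105, 118, 125, 140, 144, 150, 165])

X = sm.add_constant(size)      # adds the intercept column
fit = sm.OLS(price, X).fit()   # least squares fit

print(fit.summary())           # R^2, ANOVA-style sums of squares, coefficients, t-stats, CIs
print(fit.params)              # [b0, b1]
```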

2nd Example: Offensive Performance in Baseball

1. Problems:
   - Evaluate/compare traditional measures of offensive performance
   - Help evaluate the worth of a player

2. Solutions:
   - Compare prediction rules that forecast runs as a function of either AVG (batting average), SLG (slugging percentage) or OBP (on-base percentage)

12


2nd Example: Offensive Performance in Baseball

13

Baseball Data – Using AVG

Each observation corresponds to a team in MLB. Each quantity is the average over a season.

- Y = runs per game; X = AVG (batting average)

LS fit: Runs/Game = -3.93 + 33.57 AVG

14

Baseball Data – Using SLG

- Y = runs per game
- X = SLG (slugging percentage)

LS fit: Runs/Game = -2.52 + 17.54 SLG

15

Baseball Data – Using OBP

- Y = runs per game
- X = OBP (on-base percentage)

LS fit: Runs/Game = -7.78 + 37.46 OBP

16

Place your Money on OBP!!!

Average Squared Error = (1/N) Σᵢ eᵢ², where eᵢ = Runsᵢ − predicted Runsᵢ

  Measure   Average Squared Error
  AVG       0.083
  SLG       0.055
  OBP       0.026

17
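The comparison above is easy to automate. A small sketch of the calculation, written so it could be run on the team-level arrays from the slides (which are not reproduced here), so the data names below are placeholders:

```python
import numpy as np

def avg_squared_error(x, y):
    """Fit Y = b0 + b1*X by least squares and return (1/N) * sum of squared residuals."""
    b1, b0 = np.polyfit(x, y, 1)       # slope, intercept
    resid = y - (b0 + b1 * x)
    return np.mean(resid ** 2)

# With arrays runs_per_game, avg, slg, obp holding the 30 team values:
# for name, x in [("AVG", avg), ("SLG", slg), ("OBP", obp)]:
#     print(name, round(avg_squared_error(x, runs_per_game), 3))
```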

The Least Squares Criterion

The formulas for b0 and b1 that minimize the least squares criterion are:

b1 = rxy × (sy / sx)        b0 = Ȳ − b1X̄

where

- X̄ and Ȳ are the sample means of X and Y
- corr(x, y) = rxy is the sample correlation
- sx and sy are the sample standard deviations of X and Y

18
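These formulas are easy to check numerically. A minimal sketch, on made-up data, comparing the formula-based slope and intercept with a direct least squares fit:

```python
import numpy as np

def ls_line(x, y):
    """Least squares slope and intercept via the correlation/SD formulas from the slide."""
    r_xy = np.corrcoef(x, y)[0, 1]               # sample correlation
    b1 = r_xy * y.std(ddof=1) / x.std(ddof=1)    # slope: r * (sy / sx)
    b0 = y.mean() - b1 * x.mean()                # intercept: Ybar - b1 * Xbar
    return b0, b1

# Quick check on made-up data: both routes give the same line.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([62.0, 85.0, 101.0, 118.0, 142.0, 159.0])
print(ls_line(x, y))                 # (b0, b1) from the formulas
print(np.polyfit(x, y, 1)[::-1])     # same values from np.polyfit
```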

Sample Mean and Sample Variance

- Sample Mean: measure of centrality

  Ȳ = (1/n) Σᵢ Yi

- Sample Variance: measure of spread

  s²y = (1/(n − 1)) Σᵢ (Yi − Ȳ)²

- Sample Standard Deviation:

  sy = √(s²y)

19

Example

s²y = (1/(n − 1)) Σᵢ (Yi − Ȳ)²

[Two scatterplots of the sampled X values and Y values against observation index (0 to 60), with the deviations (Xi − X̄) and (Yi − Ȳ) marked around the sample means]

sx = 9.7     sy = 16.0

20

Covariance

Covariance measures the direction and strength of the linear relationship between Y and X:

Cov(Y, X) = Σᵢ (Yi − Ȳ)(Xi − X̄) / (n − 1)

[Scatterplot of Y against X, divided into quadrants at (X̄, Ȳ): points in the upper-right and lower-left quadrants have (Yi − Ȳ)(Xi − X̄) > 0, points in the other two quadrants have (Yi − Ȳ)(Xi − X̄) < 0]

- sy = 15.98, sx = 9.7
- Cov(X, Y) = 125.9

How do we interpret that?

21

Correlation

Correlation is the standardized covariance:

corr(X, Y) = cov(X, Y) / √(s²x s²y) = cov(X, Y) / (sx sy)

The correlation is scale invariant and the units of measurement don't matter: it is always true that −1 ≤ corr(X, Y) ≤ 1.

This gives the direction (− or +) and strength (0 → 1) of the linear relationship between X and Y.

22
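A quick numerical sketch of covariance and correlation, on a made-up paired sample standing in for the X and Y on the slides:

```python
import numpy as np

# Made-up paired sample
x = np.array([-18.0, -9.0, -4.0, 0.0, 3.0, 8.0, 14.0, 20.0])
y = np.array([-35.0, -12.0, -10.0, 2.0, 1.0, 15.0, 22.0, 38.0])

sx, sy = x.std(ddof=1), y.std(ddof=1)
cov_xy = np.cov(x, y, ddof=1)[0, 1]    # sample covariance
corr_xy = cov_xy / (sx * sy)           # standardized covariance

print(cov_xy, corr_xy)
print(np.corrcoef(x, y)[0, 1])         # same correlation, computed directly
```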

Correlation

corr(Y, X) = cov(X, Y) / √(s²x s²y) = cov(X, Y) / (sx sy) = 125.9 / (15.98 × 9.7) = 0.812

[Same scatterplot of Y against X as before, with the quadrants around (X̄, Ȳ) marked]

23

Correlation

[Four scatterplots of standardized data illustrating corr = 1, corr = .5, corr = .8, and corr = −.8]

24

Correlation

Correlation only measures linear relationships: corr(X, Y) = 0 does not mean the variables are not related!

[Two scatterplots: one with corr = 0.01, one with corr = 0.72]

Also be careful with influential observations.

25

Back to Least Squares

1. Intercept:

   b0 = Ȳ − b1X̄  ⇒  Ȳ = b0 + b1X̄

   - The point (X̄, Ȳ) is on the regression line!
   - Least squares finds the point of means and rotates the line through that point until getting the "right" slope.

2. Slope:

   b1 = corr(X, Y) × (sY / sX) = Σᵢ (Xi − X̄)(Yi − Ȳ) / Σᵢ (Xi − X̄)² = Cov(X, Y) / var(X)

   - So the right slope is the correlation coefficient times a scaling factor that ensures the proper units for b1.

26

Decomposing the Variance

How well does the least squares line explain variation in Y?
Remember that Y = Ŷ + e.

Since Ŷ and e are uncorrelated, i.e. corr(Ŷ, e) = 0,

var(Y) = var(Ŷ + e) = var(Ŷ) + var(e)

Σᵢ (Yi − Ȳ)² / (n − 1) = Σᵢ (Ŷi − mean(Ŷ))² / (n − 1) + Σᵢ (ei − ē)² / (n − 1)

Given that ē = 0 and mean(Ŷ) = Ȳ (why?), we get to:

Σᵢ (Yi − Ȳ)² = Σᵢ (Ŷi − Ȳ)² + Σᵢ eᵢ²

27

Decomposing the Variance

SSR: variation in Y explained by the regression line.
SSE: variation in Y that is left unexplained.

SSR = SST ⇒ perfect fit.

Be careful of similar acronyms; e.g. some texts use SSR for the "residual" sum of squares.

28

A Goodness of Fit Measure: R²

The coefficient of determination, denoted by R², measures goodness of fit:

R² = SSR / SST = 1 − SSE / SST

- 0 < R² < 1.
- The closer R² is to 1, the better the fit.

29
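The variance decomposition and R² are a few lines of arithmetic once a line has been fit. A minimal sketch on made-up data:

```python
import numpy as np

# Made-up data; any (x, y) sample works the same way
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([62.0, 85.0, 101.0, 118.0, 142.0, 159.0])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
e = y - y_hat

sst = np.sum((y - y.mean()) ** 2)        # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)    # explained by the line
sse = np.sum(e ** 2)                     # left unexplained

print(sst, ssr + sse)                    # the decomposition: SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)          # R^2, both ways
```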

Back to the House Data

[Excel output for the house regression, with SSR, SSE and SST highlighted]

R² = SSR / SST = 12393 / 14991 = 0.82

30

Back to Baseball

Three very similar, related ways to look at a simple linear regression... with only one X variable, life is easy!

         R²     corr    SSE
  OBP    0.88   0.94    0.79
  SLG    0.76   0.87    1.64
  AVG    0.63   0.79    2.49

31

Prediction and the Modeling Goal

There are two things that we want to know:

- What value of Y can we expect for a given X?
- How sure are we about this forecast? Or how different could Y be from what we expect?

Our goal is to measure the accuracy of our forecasts, or how much uncertainty there is in the forecast. One method is to specify a range of Y values that are likely, given an X value.

Prediction Interval: a probable range for Y values given X.

32

Prediction and the Modeling Goal

Key Insight: To construct a prediction interval, we will have to assess the likely range of error values corresponding to a Y value that has not yet been observed!

We will build a probability model (e.g., normal distribution).

Then we can say something like "with 95% probability the error will be no less than −$28,000 and no larger than $28,000".

We must also acknowledge that the "fitted" line may be fooled by particular realizations of the residuals.

33

Prediction and the Modeling Goal

- Suppose you only had the purple points in the graph. The dashed line fits the purple points. The solid line fits all the points. Which line is better? Why?

[Scatterplot of runs per game (RPG) against AVG, with a solid line fit to all points and a dashed line fit to the purple subset]

- In summary, we need to work with the notion of a "true line" and a probability distribution that describes deviation around the line.

34

The Simple Linear Regression Model

The power of statistical inference comes from the ability to make precise statements about the accuracy of the forecasts.

In order to do this we must invest in a probability model.

Simple Linear Regression Model:  Y = β0 + β1X + ε,   ε ~ N(0, σ²)

- β0 + β1X represents the "true line": the part of Y that depends on X.
- The error term ε is independent "idiosyncratic noise": the part of Y not associated with X.

35

The Simple Linear Regression Model

Y = β0 + β1X + ε

[Scatterplot of simulated (x, y) data scattered around the true line]

36

The Simple Linear Regression Model – Example

You are told (without looking at the data) that

β0 = 40;  β1 = 45;  σ = 10

and you are asked to predict the price of a 1500 square foot house.

What do you know about Y from the model?

Y = 40 + 45(1.5) + ε = 107.5 + ε

Thus our prediction for price is Y | X = 1.5 ~ N(107.5, 10²), and a 95% Prediction Interval for Y is 87.5 < Y < 127.5.

37
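The arithmetic on this slide, written out as a tiny sketch with the parameters treated as known (as in the example):

```python
# Prediction under the model Y = b0 + b1*X + eps, eps ~ N(0, sigma^2),
# with parameters treated as known, per the slide.
beta0, beta1, sigma = 40.0, 45.0, 10.0
x = 1.5                                  # size in 1000 sqft

y_expected = beta0 + beta1 * x           # 107.5
pi_low, pi_high = y_expected - 2 * sigma, y_expected + 2 * sigma

print(y_expected, (pi_low, pi_high))     # 107.5, (87.5, 127.5)
```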

Conditional Distributions

Y = β0 + β1X + ε

[Scatterplot of the house data with the regression line and the normal conditional distribution of Y sketched at several values of x]

The conditional distribution for Y given X is Normal: Y | X = x ~ N(β0 + β1x, σ²).

38

Estimation of Error Variance

We estimate σ² with:

s² = (1/(n − 2)) Σᵢ eᵢ² = SSE / (n − 2)

(2 is the number of regression coefficients, i.e. 2 for β0 and β1).

We have n − 2 degrees of freedom because 2 have been "used up" in the estimation of b0 and b1.

We usually use s = √(SSE/(n − 2)), which is in the same units as Y. It's also called the regression standard error.

39
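A minimal sketch of this estimate, again on made-up (x, y) values:

```python
import numpy as np

def regression_std_error(x, y):
    """s = sqrt(SSE / (n - 2)) for a simple linear regression, per the slide."""
    b1, b0 = np.polyfit(x, y, 1)
    e = y - (b0 + b1 * x)
    sse = np.sum(e ** 2)
    return np.sqrt(sse / (len(y) - 2))

# e.g. with the made-up (x, y) used earlier:
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([62.0, 85.0, 101.0, 118.0, 142.0, 159.0])
print(regression_std_error(x, y))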

Estimation of Error Variance

Where is s in the Excel output?

[Excel output: s appears as the "Standard Error" entry in the Regression Statistics block]

Remember that whenever you see "standard error", read it as estimated standard deviation: σ is the standard deviation.

40

One Picture Summary of SLR

- The plot below has the house data, the fitted regression line (b0 + b1X) and ±2s...
- From this picture, what can you tell me about β0, β1 and σ²? How about b0, b1 and s²?

[Scatterplot of price against size with the fitted line and ±2s bands]

41

Understanding Variation... Runs per Game and AVG

- blue line: all points
- red line: only purple points
- Which slope is closer to the true one? How much closer?

[Scatterplot of runs per game (RPG) against AVG with both fitted lines]

42

The Importance of Understanding Variation

When estimating a quantity, it is vital to develop a notion of the precision of the estimate; for example:

- estimate the slope of the regression line
- estimate the value of a house given its size
- estimate the expected return on a portfolio
- estimate the value of a brand name
- estimate the damages from patent infringement

Why is this important? We are making decisions based on estimates, and these may be very sensitive to the accuracy of the estimates!

43

Sampling Distribution of b1

The sampling distribution of b1 describes how the estimator b1 varies over different samples, with the X values fixed.

It turns out that b1 is normally distributed (approximately): b1 ~ N(β1, s²b1).

- b1 is unbiased: E[b1] = β1.
- sb1 is the standard error of b1. In general, the standard error is the standard deviation of an estimate. It determines how close b1 is to β1.
- This is a number directly available from the regression output.

44

Sampling Distribution of b1

Can we intuit what should be in the formula for sb1?

- How should s figure in the formula?
- What about n?
- Anything else?

s²b1 = s² / Σᵢ (Xi − X̄)² = s² / ((n − 1) s²x)

Three factors: sample size (n), error variance (s²), and X-spread (sx).

45
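The formula can be checked against software output. A short sketch on made-up data, comparing the hand computation with the standard error statsmodels reports:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])        # made-up data again
y = np.array([62.0, 85.0, 101.0, 118.0, 142.0, 159.0])

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)
s = np.sqrt(np.sum(e ** 2) / (len(y) - 2))           # regression standard error

sb1 = s / np.sqrt((len(x) - 1) * x.var(ddof=1))      # sqrt of s^2 / ((n-1) sx^2)
print(sb1)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.bse[1])                                    # same standard error from statsmodels
```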

Sampling Distribution of b0

The intercept is also normal and unbiased: b0 ~ N(β0, s²b0), where

s²b0 = var(b0) = s² × (1/n + X̄² / ((n − 1) s²x))

What is the intuition here?

46

Understanding Variation... Runs per Game and AVG

Regression with all points:

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.798
  R Square            0.638
  Adjusted R Square   0.625
  Standard Error      0.298
  Observations        30

ANOVA
              df      SS       MS       F        Significance F
  Regression   1    4.389    4.389    49.26     1.24E-07
  Residual    28    2.495    0.089
  Total       29    6.884

              Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
  Intercept     -3.936          1.294         -3.04    0.0051      -6.587      -1.286
  AVG           33.572          4.783          7.02    1.24E-07    23.774      43.370

sb1 = 4.78

47

Understanding Variation... Runs per Game and AVG

Regression with subsample:

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.934
  R Square            0.872
  Adjusted R Square   0.829
  Standard Error      0.245
  Observations        5

ANOVA
              df      SS       MS       F        Significance F
  Regression   1    1.221    1.221    20.37     0.0203
  Residual     3    0.180    0.060
  Total        4    1.400

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept     -7.956          2.874         -2.77    0.0697    -17.104       1.191
  AVG           48.694         10.790          4.51    0.0203     14.356      83.033

sb1 = 10.78

48

Confidence Intervals

- 68% Confidence Interval: b1 ± 1 × sb1
- 95% Confidence Interval: b1 ± 2 × sb1
- 99% Confidence Interval: b1 ± 3 × sb1

Same thing for b0:

- 95% Confidence Interval: b0 ± 2 × sb0

The confidence interval provides you with a set of plausible values for the parameters.

49
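A short sketch of the ±2 standard error rule next to the exact interval reported by software (made-up data; with a sample this small the exact interval, which uses the t distribution, is noticeably wider):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([62.0, 85.0, 101.0, 118.0, 142.0, 159.0])

fit = sm.OLS(y, sm.add_constant(x)).fit()
b1, sb1 = fit.params[1], fit.bse[1]

print(b1 - 2 * sb1, b1 + 2 * sb1)   # rough 95% CI from the +/- 2 SE rule
print(fit.conf_int()[1])            # exact 95% CI for the slope
```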

Example: Runs per Game and AVG

Regression with all points:

[Same Excel output as two slides back: b1 = 33.57, sb1 = 4.78, 95% CI from 23.77 to 43.37]

[b1 − 2 × sb1 ; b1 + 2 × sb1] ≈ [23.77; 43.37]

50

Testing

Suppose we want to assess whether or not β1 equals a proposed value β1⁰. This is called hypothesis testing.

Formally we test the null hypothesis

H0: β1 = β1⁰

vs. the alternative

H1: β1 ≠ β1⁰

51

Testing

There are two ways we can think about testing:

1. Building a test statistic... the t-stat,

   t = (b1 − β1⁰) / sb1

   This quantity measures how many standard deviations the estimate (b1) is from the proposed value (β1⁰).

   If the absolute value of t is greater than 2, we need to worry (why?)... we reject the hypothesis.

52
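A minimal sketch of the t-stat calculation for any proposed value, on made-up data (the proposed value 30 below is arbitrary, just to illustrate testing something other than zero):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([62.0, 85.0, 101.0, 118.0, 142.0, 159.0])
fit = sm.OLS(y, sm.add_constant(x)).fit()

b1, sb1 = fit.params[1], fit.bse[1]

def t_stat(proposed):
    """How many standard errors b1 is from the proposed value."""
    return (b1 - proposed) / sb1

print(t_stat(0.0), fit.tvalues[1])   # test beta1 = 0; matches the reported t-stat
print(t_stat(30.0))                  # test an arbitrary proposed value, e.g. beta1 = 30
# Reject when |t| > 2 (equivalently, when the proposed value falls outside b1 +/- 2*sb1).
```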

Testing

2. Looking at the confidence interval. If the proposed value is outside the confidence interval, you reject the hypothesis.

   Notice that this is equivalent to the t-stat. An absolute value of t greater than 2 implies that the proposed value is outside the confidence interval... therefore reject.

   This is my preferred approach for the testing problem. You can't go wrong by using the confidence interval!

53

Example: Mutual Funds

Let's investigate the performance of the Windsor Fund, an aggressive large cap fund by Vanguard...

[Scatterplot: monthly returns for Windsor vs. the S&P500]

54

Example: Mutual Funds

Consider a CAPM regression for the Windsor mutual fund:

rw = β0 + β1 rsp500 + ε

Let's first test β1 = 0. Is the Windsor fund related to the market?

H0: β1 = 0
H1: β1 ≠ 0

55

Example: Mutual Funds

[Excel output for the Windsor regression on the S&P500: b1 ≈ 0.94, sb1 ≈ 0.029]

Hypothesis Testing – Windsor Fund Example

- t = 32.10... reject β1 = 0!!
- the 95% confidence interval is [0.87; 0.99]... again, reject!!

56

Example: Mutual Funds

Now let's test β1 = 1. What does that mean?

H0: β1 = 1 (Windsor is as risky as the market).
H1: β1 ≠ 1 (Windsor softens or exaggerates market moves).

We are asking whether or not Windsor moves in a different way than the market (e.g., is it more conservative?).

57

Example: Mutual Funds

[Same Excel output for the Windsor regression: b1 ≈ 0.94, sb1 ≈ 0.029]

Hypothesis Testing – Windsor Fund Example

- t = (b1 − 1) / sb1 = −0.0643 / 0.0291 = −2.205... reject.
- the 95% confidence interval is [0.87; 0.99]... again, reject, but...

58

Testing – Why I like Conf. Int.

- Suppose in testing H0: β1 = 1 you got a t-stat of 6 and the confidence interval was

  [1.00001, 1.00002]

  Do you reject H0: β1 = 1? Could you justify that to your boss? Probably not! (why?)

59

Testing – Why I like Conf. Int.

- Now, suppose in testing H0: β1 = 1 you got a t-stat of −0.02 and the confidence interval was

  [−100, 100]

  Do you accept H0: β1 = 1? Could you justify that to your boss? Probably not! (why?)

The Confidence Interval is your best friend when it comes to testing!!

60

Testing – Summary

- A large t and a small p-value mean the same thing...
- p-value < 0.05 is equivalent to a t-stat > 2 in absolute value.
- A small p-value means something weird happened, if the null hypothesis were true...
- Bottom line: small p-value → REJECT! Large t → REJECT!
- But remember, always look at the confidence interval!

61

Forecasting

[Scatterplot of Price against Size, with the fitted line extended well beyond the range of the observed sizes]

- Careful with extrapolation!

62

House Data – one more time!

- R² = 82%
- Great R², so we are happy using this model to predict house prices, right?

[Scatterplot of price against size with the fitted line]

63

House Data – one more time!

- But s = 14, leading to a predictive interval width of about US$60,000!! How do you feel about the model now?
- As a practical matter, s is a much more relevant quantity than R². Once again, intervals are your friend!

[Same scatterplot of price against size with the fitted line]

64

2. The Multiple Regression Model

Many problems involve more than one independent variable or factor which affects the dependent or response variable.

- More than size to predict house price!
- Demand for a product given prices of competing brands, advertising, household attributes, etc.

In SLR, the conditional mean of Y depends on X. The Multiple Linear Regression (MLR) model extends this idea to include more than one independent variable.

65

The MLR Model

Same as always, but with more covariates:

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε

Recall the key assumptions of our linear regression model:

(i) The conditional mean of Y is linear in the Xj variables.
(ii) The error terms (deviations from the line)
     - are normally distributed
     - are independent from each other
     - are identically distributed (i.e., they have constant variance)

Y | X1, ..., Xp ~ N(β0 + β1X1 + ... + βpXp, σ²)

66

The MLR Model

If p = 2, we can plot the regression surface in 3D. Consider sales of a product as predicted by the price of this product (P1) and the price of a competing product (P2):

Sales = β0 + β1P1 + β2P2 + ε

67

Least Squares

Y = β0 + β1X1 + ... + βpXp + ε,   ε ~ N(0, σ²)

How do we estimate the MLR model parameters?

The principle of Least Squares is exactly the same as before:

- Define the fitted values.
- Find the best fitting plane by minimizing the sum of squared residuals.

68

Least Squares

Model: Salesi = β0 + β1P1i + β2P2i + εi,   ε ~ N(0, σ²)

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.99
  R Square            0.99
  Adjusted R Square   0.99
  Standard Error      28.42
  Observations        100

ANOVA
              df       SS            MS            F        Significance F
  Regression   2    6004047.24    3002023.62    3717.29     0.00
  Residual    97      78335.60        807.58
  Total       99    6082382.84

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept     115.72           8.55          13.54    0.00       98.75      132.68
  p1            -97.66           2.67         -36.60    0.00     -102.95      -92.36
  p2            108.80           1.41          77.20    0.00      106.00      111.60

b0 = 115.72, b1 = −97.66, b2 = 108.80, s = 28.42

69

Least Squares

Just as before, each bi is our estimate of βi.

Fitted Values: Ŷi = b0 + b1X1i + b2X2i + ... + bpXpi.

Residuals: ei = Yi − Ŷi.

Least Squares: find b0, b1, b2, ..., bp to minimize Σᵢ eᵢ².

In MLR the formulas for the bi's are too complicated, so we won't talk about them...

70
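Even without the formulas, the least squares fit is just a linear algebra problem that software solves directly. A small sketch with made-up numbers (two covariates playing the role of P1 and P2, a response playing the role of Sales):

```python
import numpy as np

# Made-up example with two covariates and a response
X1 = np.array([5.1, 3.5, 7.3, 4.7, 6.0, 2.9])
X2 = np.array([5.2, 8.1, 11.7, 8.4, 10.0, 6.5])
Y  = np.array([144.5, 637.3, 620.8, 549.0, 520.0, 580.0])

A = np.column_stack([np.ones_like(X1), X1, X2])   # design matrix: intercept, X1, X2
b, *_ = np.linalg.lstsq(A, Y, rcond=None)         # least squares coefficients b0, b1, b2

Y_hat = A @ b                                     # fitted values
e = Y - Y_hat                                     # residuals
print(b)
print(np.sum(e ** 2))                             # the minimized sum of squared residuals
```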

Fitted Values in MLR

A useful way to plot the results for MLR problems is to look at Y (true values) against Ŷ (fitted values).

[Scatterplot of y = Sales against the fitted values from the MLR on p1 and p2]

If things are working, these values should form a nice straight line. Can you guess the slope of the blue line?

71

Fitted Values in MLR

With just P1...

[Scatterplots of Sales against: p1; the fitted values from the SLR on p1; and the fitted values from the MLR on p1 and p2]

- Left plot: Sales vs. P1
- Right plot: Sales vs. Ŷ (only P1 as a regressor)

72

Fitted Values in MLR

Now, with P1 and P2...

[Three scatterplots of Sales against the fitted values from: the SLR on p1; the SLR on p2; and the MLR on p1 and p2]

- First plot: Sales regressed on P1 alone...
- Second plot: Sales regressed on P2 alone...
- Third plot: Sales regressed on P1 and P2

73

R-squared

- We still have our old variance decomposition identity...

  SST = SSR + SSE

- ... and R² is once again defined as

  R² = SSR / SST = 1 − SSE / SST

  telling us the percentage of variation in Y explained by the X's.

- In Excel, R² is in the same place, and "Multiple R" refers to the correlation between Y and Ŷ.

74

Intervals for Individual Coefficients

As in SLR, the sampling distribution tells us how close we can expect bj to be to βj.

The LS estimators are unbiased: E[bj] = βj for j = 0, ..., p.

- We denote the sampling distribution of each estimator as

  bj ~ N(βj, s²bj)

75

Intervals for Individual Coefficients

Intervals and t-statistics are exactly the same as in SLR.

- A 95% C.I. for βj is approximately bj ± 2 sbj.
- The t-stat tj = (bj − βj⁰) / sbj is the number of standard errors between the LS estimate and the null value (βj⁰).
- As before, we reject the null when the t-stat is greater than 2 in absolute value.
- Also as before, a small p-value leads to a rejection of the null.
- Rejecting when the p-value is less than 0.05 is equivalent to rejecting when |tj| > 2.

76

Intervals for Individual Coefficients

IMPORTANT: Intervals and testing via bj & sbj are one-at-a-time procedures:

- You are evaluating the jth coefficient conditional on the other X's being in the model, but regardless of the values you've estimated for the other b's.

77

Understanding Multiple Regression

The Sales Data:

- Sales: units sold in excess of a baseline
- P1: our price in $ (in excess of a baseline price)
- P2: competitor's price (again, in excess of a baseline)

78

Understanding Multiple Regression

- If we regress Sales on our own price, we obtain a somewhat surprising conclusion... the higher the price, the more we sell!!

In this data we have weekly observations on:
  Sales: # units (in excess of base level)
  p1 = our price: $ (in excess of base)
  p2 = competitor's price: $ (in excess of base)

     p1        p2        Sales
  5.13567    5.2042    144.49
  3.49546    8.0597    637.25
  7.27534   11.6760    620.79
  4.66282    8.3644    549.01
  ...        ...       ...
  (each row corresponds to a week)

The regression of Sales on own price gives

  Sales = 211.165 + 63.7130 p1       (S = 223.401, R-Sq = 19.6%)

The regression line has a positive slope!!

- It looks like we should just raise our prices, right? NO, not if you have taken this statistics class!

79

Understanding Multiple Regression

- The regression equation for Sales on own price (P1) is:

  Sales = 211 + 63.7 P1

- If we now add the competitor's price to the regression we get:

  Sales = 116 − 97.7 P1 + 109 P2

- Does this look better? How did it happen?
- Remember: −97.7 is the effect on Sales of a change in P1 with P2 held fixed!!

80
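The sign flip is easy to reproduce on simulated data built in the spirit of this example (the data-generating process below is made up, not the actual sales data): the competitor's price p2 drives our price p1, and sales fall in p1 once p2 is held fixed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
p2 = rng.uniform(0, 15, n)
p1 = 1.0 + 0.45 * p2 + rng.normal(0, 0.7, n)              # prices move together
sales = 115 - 97 * p1 + 109 * p2 + rng.normal(0, 28, n)   # sales fall in p1, rise in p2

slr = sm.OLS(sales, sm.add_constant(p1)).fit()                         # Sales on p1 only
mlr = sm.OLS(sales, sm.add_constant(np.column_stack([p1, p2]))).fit()  # Sales on p1 and p2

print(slr.params[1])   # positive slope on p1: p1 is proxying for p2
print(mlr.params[1])   # negative slope on p1 once p2 is held fixed
```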

Understanding Multiple Regression

- How can we see what is going on? Let's compare Sales in two different observations: weeks 82 and 99.
- We see that an increase in P1, holding P2 constant, corresponds to a drop in Sales!

[Two scatterplots with weeks 82 and 99 highlighted: p1 against p2, and Sales against p1]

- Note the strong relationship (dependence) between P1 and P2!!

81

Understanding Multiple Regression

- Let's look at a subset of points where P1 varies and P2 is held approximately constant...

[Scatterplots of p1 against p2 and of Sales against p1, with the selected subset highlighted]

- For a fixed level of P2, variation in P1 is negatively correlated with Sales!!

82

Understanding Multiple Regression

- Below, different colors indicate different ranges for P2...

[Scatterplots of p1 against p2 and of Sales against p1, colored by ranges of p2: for each fixed level of p2 there is a negative relationship between Sales and p1, and larger p1 are associated with larger p2]

83

Understanding Multiple Regression

- Summary:

  1. A larger P1 is associated with a larger P2, and the overall effect leads to bigger sales.
  2. With P2 held fixed, a larger P1 leads to lower sales.
  3. MLR does the trick and unveils the "correct" economic relationship between Sales and prices!

84

Understanding Multiple Regression

Beer Data (from an MBA class)

- nbeer – number of beers before getting drunk
- height and weight

[Scatterplots: nbeer against height, and height against weight]

The correlations:
            nbeer   weight
  weight    0.692
  height    0.582   0.806

The two x's are highly correlated!!

Is number of beers related to height?

85

Understanding Multiple Regression

nbeers = β0 + β1 height + ε

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.58
  R Square            0.34
  Adjusted R Square   0.33
  Standard Error      3.11
  Observations        50

ANOVA
              df     SS       MS       F       Significance F
  Regression   1   237.77   237.77   24.60     0.00
  Residual    48   463.86     9.66
  Total       49   701.63

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept     -36.92           8.96         -4.12     0.00      -54.93      -18.91
  height          0.64           0.13          4.96     0.00        0.38        0.90

Yes! Beers and height are related...

86

Understanding Multiple Regression

nbeers = β0 + β1 weight + β2 height + ε

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.69
  R Square            0.48
  Adjusted R Square   0.46
  Standard Error      2.78
  Observations        50

ANOVA
              df     SS       MS       F       Significance F
  Regression   2   337.24   168.62   21.75     0.00
  Residual    47   364.38     7.75
  Total       49   701.63

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept     -11.19          10.77         -1.04     0.30      -32.85       10.48
  weight          0.09           0.02          3.58     0.00        0.04        0.13
  height          0.08           0.20          0.40     0.69       -0.32        0.47

What about now?? Height is not necessarily a factor...

87

Understanding Multiple Regression

[Same plots and regression output as before: nbeer on height alone, and nbeer on height and weight]

- If we regress "beers" only on height we see an effect. Bigger heights go with more beers.
- However, when height goes up weight tends to go up as well... in the first regression, height was a proxy for the real cause of drinking ability. Bigger people can drink more, and weight is a more accurate measure of "bigness".

88

Understanding Multiple Regression

[Same plots and regression output as before]

- In the multiple regression, when we consider only the variation in height that is not associated with variation in weight, we see no relationship between height and beers.

89

Understanding Multiple Regression

nbeers = β0 + β1 weight + ε

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.69
  R Square            0.48
  Adjusted R Square   0.47
  Standard Error      2.76
  Observations        50

ANOVA
              df     SS        MS        F       Significance F
  Regression   1   336.03    336.03    44.12     2.60E-08
  Residual    48   365.59      7.62
  Total       49   701.63

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept     -7.021          2.213         -3.172    0.003     -11.471      -2.571
  weight         0.093          0.014          6.642    0.000       0.065       0.121

Why is this a better model than the one with weight and height??

90

Understanding Multiple Regression

In general, when we see a relationship between y and x (or x's), that relationship may be driven by variables "lurking" in the background which are related to your current x's.

This makes it hard to reliably find "causal" relationships. Any correlation (association) you find could be caused by other variables in the background... correlation is NOT causation.

Any time a report says two variables are related and there's a suggestion of a "causal" relationship, ask yourself whether or not other variables might be the real reason for the effect. Multiple regression allows us to control for all important variables by including them in the regression. "Once we control for weight, height and beers are NOT related"!!

91

Back to Baseball – Let's try to add AVG on top of OBP

R/G = β0 + β1 AVG + β2 OBP + ε

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.948
  R Square            0.899
  Adjusted R Square   0.891
  Standard Error      0.161
  Observations        30

ANOVA
              df     SS       MS        F        Significance F
  Regression   2   6.188    3.094    120.11      3.64E-14
  Residual    27   0.696    0.026
  Total       29   6.884

              Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
  Intercept     -7.934          0.844         -9.40    5.3E-10     -9.666      -6.201
  AVG            7.810          4.015          1.95    0.062       -0.427      16.048
  OBP           31.779          3.803          8.36    5.7E-09     23.977      39.581

Is AVG any good?

92

Back to Baseball – Now let's add SLG

R/G = β0 + β1 OBP + β2 SLG + ε

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.956
  R Square            0.913
  Adjusted R Square   0.907
  Standard Error      0.149
  Observations        30

ANOVA
              df     SS       MS        F        Significance F
  Regression   2   6.287    3.144    142.32      4.56E-15
  Residual    27   0.596    0.022
  Total       29   6.884

              Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
  Intercept     -7.014          0.820         -8.55    3.6E-09     -8.697      -5.332
  OBP           27.593          4.003          6.89    2.1E-07     19.379      35.807
  SLG            6.031          2.022          2.98    0.006        1.883      10.179

What about now? Is SLG any good?

93

Back to Baseball

Correlations:
          AVG    OBP    SLG
  AVG     1
  OBP     0.77   1
  SLG     0.75   0.83   1

- When AVG is added to the model with OBP, no additional information is conveyed. AVG does nothing "on its own" to help predict Runs per Game...
- SLG, however, measures something that OBP doesn't (power!) and by doing something "on its own" it is relevant to help predict Runs per Game. (Okay, but not much...)

94

3. Dummy Variables... Example: House Prices

We want to evaluate the difference in house prices in a couple of different neighborhoods.

        Nbhd   SqFt   Price
  1      2     1.79   114.3
  2      2     2.03   114.2
  3      2     1.74   114.8
  4      2     1.98    94.7
  5      2     2.13   119.8
  6      1     1.78   114.6
  7      3     1.83   151.6
  8      3     2.16   150.7
  ...   ...    ...    ...

95

Dummy Variables... Example: House Prices

Let's create the dummy variables dn1, dn2 and dn3...

        Nbhd   SqFt   Price   dn1   dn2   dn3
  1      2     1.79   114.3    0     1     0
  2      2     2.03   114.2    0     1     0
  3      2     1.74   114.8    0     1     0
  4      2     1.98    94.7    0     1     0
  5      2     2.13   119.8    0     1     0
  6      1     1.78   114.6    1     0     0
  7      3     1.83   151.6    0     0     1
  8      3     2.16   150.7    0     0     1
  ...   ...    ...    ...     ...   ...   ...

96

Dummy Variables... Example: House Prices

Pricei = β0 + β1 dn1i + β2 dn2i + β3 Sizei + εi

E[Price | dn1 = 1, Size] = β0 + β1 + β3 Size          (Nbhd 1)
E[Price | dn2 = 1, Size] = β0 + β2 + β3 Size          (Nbhd 2)
E[Price | dn1 = 0, dn2 = 0, Size] = β0 + β3 Size      (Nbhd 3)

97
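Building the dummies and fitting this model is mechanical. A minimal sketch using only the eight preview rows shown above (so the estimates will not match the full 128-house output on the next slide):

```python
import numpy as np
import statsmodels.api as sm

# The eight preview rows of the neighborhood data from the slides
nbhd  = np.array([2, 2, 2, 2, 2, 1, 3, 3])
size  = np.array([1.79, 2.03, 1.74, 1.98, 2.13, 1.78, 1.83, 2.16])
price = np.array([114.3, 114.2, 114.8, 94.7, 119.8, 114.6, 151.6, 150.7])

dn1 = (nbhd == 1).astype(float)   # dummy: 1 if the house is in neighborhood 1
dn2 = (nbhd == 2).astype(float)   # dummy: 1 if in neighborhood 2 (nbhd 3 is the baseline)

X = sm.add_constant(np.column_stack([dn1, dn2, size]))
fit = sm.OLS(price, X).fit()
print(fit.params)                 # b0, b1 (dn1), b2 (dn2), b3 (size)
```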

Dummy Variables... Example: House Prices

Pricei = β0 + β1 dn1 + β2 dn2 + β3 Size + εi

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.828
  R Square            0.685
  Adjusted R Square   0.677
  Standard Error      15.260
  Observations        128

ANOVA
              df      SS          MS        F       Significance F
  Regression   3   62809.15   20936.4     89.91     5.8E-31
  Residual   124   28876.06     232.87
  Total      127   91685.21

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept      62.78          14.25          4.41     0.00       34.58       90.98
  dn1           -41.54           3.53        -11.75     0.00      -48.53      -34.54
  dn2           -30.97           3.37         -9.19     0.00      -37.63      -24.30
  size           46.39           6.75          6.88     0.00       33.03       59.74

Pricei = 62.78 − 41.54 dn1 − 30.97 dn2 + 46.39 Size + εi

98

Page 101: Regression and Model Selection Book Chapters 3 …1. Simple Linear Regression 2. Multiple Linear Regression 3. Dummy Variables 4. Residual Plots and Transformations 5. Variable Selection

Dummy Variables... Example: House Prices

[Figure: Price vs. Size, with separate fitted lines for Nbhd = 1, Nbhd = 2, and Nbhd = 3]


Dummy Variables... Example: House Prices

Pricei = β0 + β1Size + εi

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.553
  R Square             0.306
  Adjusted R Square    0.300
  Standard Error       22.476
  Observations         128

ANOVA
              df    SS        MS         F        Significance F
  Regression    1   28036.4   28036.36   55.501   1E-11
  Residual    126   63648.9   505.1496
  Total       127   91685.2

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept    -10.09         18.97           -0.53     0.60      -47.62       27.44
  size          70.23          9.43            7.45     0.00       51.57       88.88

Pricei = −10.09 + 70.23Size + εi


Dummy Variables... Example: House Prices

[Figure: Price vs. Size, with the single “just Size” fitted line shown alongside the three neighborhood lines]


Sex Discrimination Case

[Figure: Salary vs. Year Hired, with males and females plotted separately]

Does it look like the effect of experience on salary is the same for males and females?


Sex Discrimination Case

Could we try to expand our analysis by allowing a different slope for each group?

Yes... Consider the following model:

Salaryi = β0 + β1Expi + β2Sexi + β3Expi × Sexi + εi

For Females:

Salaryi = β0 + β1Expi + εi

For Males:

Salaryi = (β0 + β2) + (β1 + β3)Expi + εi
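A minimal sketch of this interaction model in statsmodels, assuming a DataFrame with the Salary, Exp and Sex columns used above (Sex = 1 for males, 0 for females); the file name is hypothetical. The term Exp * Sex in the formula expands to Exp + Sex + Exp:Sex.

```python
import pandas as pd
import statsmodels.formula.api as smf

salaries = pd.read_csv("salary.csv")      # hypothetical file name
fit = smf.ols("Salary ~ Exp * Sex", data=salaries).fit()
print(fit.params)

# Implied lines:
#   Females (Sex = 0): Salary = b0 + b1 * Exp
#   Males   (Sex = 1): Salary = (b0 + b2) + (b1 + b3) * Exp
```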


Sex Discrimination Case

What does the data look like?

       YrHired   Gender   Salary   Sex   SexExp
  1      92      Male     32.00     1      92
  2      81      Female   39.10     0       0
  3      83      Female   33.20     0       0
  4      87      Female   30.60     0       0
  5      92      Male     29.00     1      92
  ...    ...     ...      ...      ...     ...
  208    62      Female   30.00     0       0


Sex Discrimination Case

Salaryi = β0 + β1Sexi + β2Exp + β3Exp ∗ Sex + εi

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.799130351
  R Square             0.638609318
  Adjusted R Square    0.63329475
  Standard Error       6.816298288
  Observations         208

ANOVA
              df    SS          MS        F        Significance F
  Regression    3   16748.88    5582.96   120.16   7.513E-45
  Residual    204    9478.232     46.4619
  Total       207   26227.11

              Coefficients     Standard Error   t Stat     P-value   Lower 95%    Upper 95%
  Intercept     61.12479795     8.770854         6.96908   4E-11      43.831649    78.41795
  Gender       114.4425931     11.7012           9.78041   9E-19      91.371794   137.5134
  YrHired       -0.279963351    0.102456        -2.7325    0.0068     -0.4819713   -0.077955
  GenderExp     -1.247798369    0.136676        -9.1296    7E-17      -1.5172765   -0.97832

Salaryi = 61 + 114Sexi − 0.27Exp − 1.24Exp ∗ Sex + εi


Sex Discrimination Case

[Figure: Salary vs. Year Hired, with the fitted lines for males and females]


Variable Interaction

So, the effect of experience on salary is different for males and females... in general, when the effect of the variable X1 on Y depends on another variable X2, we say that X1 and X2 interact with each other.

We can extend this notion by the inclusion of multiplicative effects through interaction terms.

Yi = β0 + β1X1i + β2X2i + β3(X1iX2i) + ε

∂E[Y |X1,X2] / ∂X1 = β1 + β3X2

This is our first non-linear model!


4. Residual Plots and Transformations

What kind of properties should the residuals have??

ei ≈ N(0, σ2), iid and independent of the X's

I We should see no pattern between e and each of the X's

I This can be summarized by looking at the plot between Ŷ and e

I Remember that Ŷ is “pure X”, i.e., a linear function of the X's.

If the model is good, the regression should have pulled out of Y all of its “x-ness”... what is left over (the residuals) should have nothing to do with X.
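A minimal residual-plot sketch on simulated data (not the course data): a straight line is fit to a relationship that is actually curved, and the leftover curvature shows up immediately in the plot of residuals against fitted values.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 30, 200)
y = 10 + 2.3 * x - 0.04 * x**2 + rng.normal(0, 1, 200)   # truly quadratic mean

fit = sm.OLS(y, sm.add_constant(x)).fit()                 # misspecified straight-line fit

plt.scatter(fit.fittedvalues, fit.resid)                  # residuals vs. fitted values
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```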


Non Linearity

Example: Telemarketing

I How does length of employment affect productivity (number of calls per day)?


Non Linearity

Example: Telemarketing

I Residual plot highlights the non-linearity!


Non Linearity

What can we do to fix this? We can use multiple regression and transform our X to create a non-linear model...

Let's try Y = β0 + β1X + β2X² + ε

The data...

  months   months2   calls
    10       100       18
    10       100       19
    11       121       22
    14       196       23
    15       225       25
   ...       ...      ...


Telemarketing: Adding Polynomials


[Figure: the linear fit to the telemarketing data (calls vs. months)]


Telemarketing: Adding Polynomials


[Figure: the fit with the X² term added]



Telemarketing

What is the marginal effect of X on Y?

∂E[Y |X ] / ∂X = β1 + 2β2X

I To better understand the impact of changes in X on Y you should evaluate different scenarios (a short sketch of this calculation follows below).

I Moving from 10 to 11 months of employment raises productivity by 1.47 calls.

I Going from 25 to 26 months only raises the number of calls by 0.27.
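A minimal sketch of the scenario calculation above, assuming a DataFrame with the months, months2 and calls columns from the telemarketing data (the file name is hypothetical): fit the quadratic model and compare expected calls at neighboring values of months.

```python
import pandas as pd
import statsmodels.formula.api as smf

tele = pd.read_csv("telemarketing.csv")        # hypothetical file name
tele["months2"] = tele["months"] ** 2          # recreate the squared term if needed

fit = smf.ols("calls ~ months + months2", data=tele).fit()
b1, b2 = fit.params["months"], fit.params["months2"]

for lo, hi in ((10, 11), (25, 26)):
    change = b1 * (hi - lo) + b2 * (hi**2 - lo**2)
    print(f"from {lo} to {hi} months: expected calls change by {change:.2f}")
```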


Polynomial Regression

Even though we are limited to a linear mean, it is possible to get nonlinear regression by transforming the X variable.

In general, we can add powers of X to get polynomial regression:

Y = β0 + β1X + β2X² + . . . + βmX^m

You can fit any mean function if m is big enough. Usually, m = 2 does the trick.


Closing Comments on Polynomials

We can always add higher powers (cubic, etc) if necessary.

Be very careful about predicting outside the data range. The curve may do unintended things beyond the observed data.

Watch out for over-fitting... remember, simple models are “better”.


Be careful when extrapolating...

[Figure: the quadratic fit of calls vs. months extrapolated well beyond the observed range of months]


...and, be careful when adding more polynomial terms!

[Figure: a higher-order polynomial fit of calls vs. months, behaving erratically beyond the data]


Non-constant Variance

Example...

This violates our assumption that all εi have the same σ2.


Non-constant Variance

Consider the following relationship between Y and X :

Y = γ0 X^β1 (1 + R)

where we think about R as a random percentage error.

I On average we assume R is 0...

I but when it turns out to be 0.1, Y goes up by 10%!

I Often we see this: the errors are multiplicative and the variation is something like ±10% and not ±10.

I This leads to non-constant variance (or heteroskedasticity)


The Log-Log Model

We have data on Y and X and we still want to use a linear regression model to understand their relationship... what if we take the log (natural log) of Y?

log(Y ) = log[γ0 X^β1 (1 + R)]

log(Y ) = log(γ0) + β1 log(X ) + log(1 + R)

Now, if we call β0 = log(γ0) and ε = log(1 + R) the above leads to

log(Y ) = β0 + β1 log(X ) + ε

a linear regression of log(Y ) on log(X )!
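A minimal log-log sketch in Python, with made-up numbers standing in for a dataset with positive Y and X; the point is only that taking logs of both variables turns the multiplicative model into an ordinary linear regression.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up illustrative data (any positive X and Y would do)
df = pd.DataFrame({"X": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
                   "Y": [95.0, 52.0, 38.0, 27.0, 22.0, 18.0, 16.0]})

fit = smf.ols("np.log(Y) ~ np.log(X)", data=df).fit()
print(fit.params)      # the slope on np.log(X) is the estimated elasticity beta1
```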


Elasticity and the log-log Model

In a log-log model, the slope β1 is sometimes called elasticity.

In English, a 1% increase in X gives a β1% increase in Y.

β1 ≈ d%Y / d%X    (Why?)


Price Elasticity

In economics, the slope coefficient β1 in the regression log(sales) = β0 + β1 log(price) + ε is called price elasticity.

This is the % change in sales per 1% change in price.

The model implies that E[sales] = A ∗ price^β1

where A = exp(β0)


Price Elasticity of OJ

A chain of gas station convenience stores was interested in the dependency between Price and Sales for orange juice... They decided to run an experiment and change prices randomly at different locations. With the data in hand, let's first run a regression of Sales on Price:

Sales = β0 + β1Price + ε

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.719
  R Square             0.517
  Adjusted R Square    0.507
  Standard Error       20.112
  Observations         50

ANOVA
              df   SS          MS          F        Significance F
  Regression    1  20803.071   20803.071   51.428   0.000
  Residual     48  19416.449     404.509
  Total        49  40219.520

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept     89.642        8.610           10.411   0.000      72.330     106.955
  Price        -20.935        2.919           -7.171   0.000     -26.804     -15.065


Price Elasticity of OJ

[Figure: the fitted linear model of Sales vs. Price (left) and the corresponding residual plot (right)]

No good!!


Price Elasticity of OJ

But... would you really think this relationship would be linear? Moving a price from $1 to $2 is the same as changing it from $10 to $11?? We should probably be thinking about the price elasticity of OJ...

log(Sales) = γ0 + γ1 log(Price) + ε

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.869
  R Square             0.755
  Adjusted R Square    0.750
  Standard Error       0.386
  Observations         50

ANOVA
              df   SS       MS       F         Significance F
  Regression    1  22.055   22.055   148.187   0.000
  Residual     48   7.144    0.149
  Total        49  29.199

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
  Intercept      4.812        0.148            32.504   0.000       4.514       5.109
  LogPrice      -1.752        0.144           -12.173   0.000      -2.042      -1.463

How do we interpret γ1 = −1.75? (When prices go up 1%, sales go down by 1.75%.)


Price Elasticity of OJ

[Figure: the fitted log-log model shown in the original Price and Sales scale (left) and its residual plot (right)]

Much better!!


Making Predictions

What if the gas station store wants to predict their sales of OJ if they decide to price it at $1.80?

The predicted log(Sales) = 4.812 + (−1.752) × log(1.8) = 3.78

So, the predicted Sales = exp(3.78) = 43.82. How about the plug-in prediction interval?

In the log scale, our prediction interval is

[log(Sales) − 2s; log(Sales) + 2s] = [3.78 − 2(0.38); 3.78 + 2(0.38)] = [3.02; 4.54].

In terms of actual Sales, the interval is [exp(3.02), exp(4.54)] = [20.5; 93.7].
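A small numerical check of the calculation above, plugging in the fitted values reported on the slides (intercept 4.812, slope −1.752, and s ≈ 0.38):

```python
import numpy as np

b0, b1, s = 4.812, -1.752, 0.38

log_pred = b0 + b1 * np.log(1.8)               # predicted log(Sales) at Price = $1.80
lo, hi = log_pred - 2 * s, log_pred + 2 * s    # plug-in interval in the log scale

print(round(log_pred, 2), round(np.exp(log_pred), 1))   # about 3.78 and 43.9 (the slide uses exp(3.78) = 43.82)
print(round(np.exp(lo), 1), round(np.exp(hi), 1))       # roughly [20.5, 93.9]; the slide's [20.5, 93.7] uses the rounded 4.54
```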


Making Predictions

[Figure: plug-in prediction bands in the original scale of Price and Sales (left) and in the log(Price), log(Sales) scale (right)]

I In the log scale (right) we have [Ŷ − 2s; Ŷ + 2s]

I In the original scale (left) we have [exp(Ŷ) ∗ exp(−2s); exp(Ŷ) ∗ exp(2s)]


Some additional comments...

I Another useful transformation to deal with non-constant variance is to take only the log(Y ) and keep X the same. Clearly the “elasticity” interpretation no longer holds.

I Always be careful in interpreting the models after a transformation

I Also, be careful in using the transformed model to make predictions


Summary of Transformations

Coming up with a good regression model is usually an iterative procedure. Use plots of residuals vs. X or Ŷ to determine the next step.

Log transform is your best friend when dealing with non-constant variance (log(X ), log(Y ), or both).

Add polynomial terms (e.g. X²) to get nonlinear regression.

The bottom line: you should combine what the plots and the regression output are telling you with your common sense and knowledge about the problem. Keep playing around with it until you get something that makes sense and has nothing obviously wrong with it.


Airline Data

Monthly passengers in the U.S. airline industry (in 1,000s of passengers) from 1949 to 1960... we need to predict the number of passengers in the next couple of months.

[Figure: monthly Passengers vs. Time]

Any ideas?


Airline Data

How about a “trend model”? Yt = β0 + β1t + εt

[Figure: Passengers vs. Time with the fitted values from the linear trend model]

What do you think?


Airline Data

Let’s look at the residuals...

[Figure: standardized residuals vs. Time for the trend model]

Is there any obvious pattern here? YES!!


Airline Data

The variance of the residuals seems to be growing in time... Let's try taking the log.

log(Yt) = β0 + β1t + εt

[Figure: log(Passengers) vs. Time with the fitted values from the log trend model]

Any better?


Airline Data

Residuals...

[Figure: standardized residuals vs. Time for the log trend model]

Still we can see some obvious temporal/seasonal pattern....


Airline Data

Okay, let’s add dummy variables for months (only 11 dummies)...log(Yt) = β0 + β1t + β2Jan + ...β12Dec + εt

[Figure: log(Passengers) vs. Time with the fitted values from the trend plus monthly dummies model]

Much better!!


Airline Data

Residuals...

[Figure: standardized residuals vs. Time for the trend plus monthly dummies model]

I am still not happy... it doesn't look normal iid to me...


Airline Data

Residuals...

[Figure: residual e(t) vs. e(t−1); corr(e(t), e(t−1)) = 0.786]

I was right! The residuals are dependent on time...


Airline Data

We have one more tool... let's add one lagged term.

log(Yt) = β0 + β1t + β2Jan + ... + β12Dec + β13 log(Yt−1) + εt
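A minimal sketch of this final model in Python, assuming a DataFrame with a Passengers column ordered by month (the file name is hypothetical and the series is assumed to start in January); the trend, the month dummies and the lagged log term are built explicitly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

air = pd.read_csv("airline.csv")                 # hypothetical file name
air["t"] = np.arange(len(air))                   # linear time trend
air["month"] = air["t"] % 12 + 1                 # 1..12, assuming the series starts in January
air["logY"] = np.log(air["Passengers"])
air["lag_logY"] = air["logY"].shift(1)           # log(Y_{t-1})

fit = smf.ols("logY ~ t + C(month) + lag_logY", data=air.dropna()).fit()
print(fit.summary())                             # trend + 11 month dummies + lagged term
```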

[Figure: log(Passengers) vs. Time with the fitted values from the model with the lagged term]

Okay, good...


Airline Data

Residuals...

[Figure: standardized residuals vs. Time for the model with the lagged term]

Much better!!


Airline Data

Residuals...

[Figure: residual e(t) vs. e(t−1); corr(e(t), e(t−1)) = −0.11]

Much better indeed!!


Model Building Process

When building a regression model remember that simplicity is your friend... smaller models are easier to interpret and have fewer unknown parameters to be estimated.

Keep in mind that every additional parameter represents a cost!!

The first step of every model building exercise is the selection of the universe of variables to be potentially used. This task is entirely solved through your experience and context-specific knowledge...

I Think carefully about the problem

I Consult subject matter research and experts

I Avoid the mistake of selecting too many variables


Model Building Process

With a universe of variables in hand, the goal now is to select the model. Why not include all the variables?

Big models tend to over-fit and find features that are specific to the data in hand... i.e., not generalizable relationships. The results are bad predictions and bad science!

In addition, bigger models have more parameters and potentially more uncertainty about everything we are trying to learn...

We need a strategy to build a model in a way that accounts for the trade-off between fitting the data and the uncertainty associated with the model.


5. Variable Selection and Regularization

When working with linear regression models where the number of X variables is large, we need to think about strategies to select what variables to use...

We will focus on 3 ideas:

I Subset Selection

I Shrinkage

I Dimension Reduction


Subset Selection

The idea here is very simple: fit as many models as you can and compare their performance based on some criteria!

Issues:

I How many possible models? Total number of models = 2^p. Is this large?

I What criteria to use? Just as before, if prediction is what we have in mind, out-of-sample predictive ability should be the criterion.


Information Criteria

Another way to evaluate a model is to use Information Criteria metrics, which attempt to quantify how well our model would have predicted the data (regardless of what you've estimated for the βj's).

A good alternative is the BIC: the Bayes Information Criterion, which is based on a “Bayesian” philosophy of statistics.

BIC = n log(s²) + p log(n)

You want to choose the model that leads to minimum BIC.


Information Criteria

One nice thing about the BIC is that you can interpret it in terms of model probabilities. Given a list of possible models {M1, M2, . . . , MR}, the probability that model i is correct is

P(Mi) ≈ exp(−BIC(Mi)/2) / Σ_{r=1}^{R} exp(−BIC(Mr)/2) = exp(−[BIC(Mi) − BICmin]/2) / Σ_{r=1}^{R} exp(−[BIC(Mr) − BICmin]/2)

(Subtract BICmin = min{BIC(M1) . . .BIC(MR)} for numerical stability.)
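A minimal sketch of this model-probability calculation; the model names and BIC values below are made up, and in practice the BICs would come from fitted models (e.g. the .bic attribute of a statsmodels OLS fit).

```python
import numpy as np

def bic_model_probs(bics):
    """bics: dict mapping model name -> BIC. Returns approximate model probabilities."""
    names = list(bics)
    b = np.array([bics[name] for name in names], dtype=float)
    b -= b.min()                       # subtract BIC_min for numerical stability
    w = np.exp(-0.5 * b)
    return dict(zip(names, w / w.sum()))

# Made-up BIC values for three candidate models:
print(bic_model_probs({"OBP": -52.1, "OBP + SLG": -58.4, "OBP + SLG + AVG": -55.0}))
```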

Similar, alternative criteria include AIC, Cp, adjusted R²... bottom line: these are only useful if we lack the ability to compare models based on their out-of-sample predictive ability!!!


Search Strategies: Stepwise Regression

One computational approach to building a regression model step-by-step is “stepwise regression”. There are 3 options:

I Forward: adds one variable at a time until no remaining variable makes a significant contribution (or meets a certain criterion... which could be out-of-sample prediction); a minimal sketch of this procedure follows below

I Backwards: starts with all possible variables and removes one at a time until further deletions would do more harm than good

I Stepwise: just like the forward procedure, but allows for deletions at each step
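A minimal forward-selection sketch (not the author's code), assuming a DataFrame whose response column is named y and whose remaining columns are the candidate predictors; BIC is used as the criterion here for brevity, but out-of-sample error could be swapped in.

```python
import statsmodels.formula.api as smf

def forward_select(df, response="y"):
    remaining = [c for c in df.columns if c != response]
    selected = []
    best_bic = smf.ols(f"{response} ~ 1", data=df).fit().bic   # intercept-only model
    improved = True
    while improved and remaining:
        improved = False
        # BIC of every model that adds one more candidate to the current selection
        scores = [(smf.ols(f"{response} ~ {' + '.join(selected + [c])}", data=df).fit().bic, c)
                  for c in remaining]
        bic, best_cand = min(scores)
        if bic < best_bic:            # keep the addition only if it lowers BIC
            best_bic = bic
            selected.append(best_cand)
            remaining.remove(best_cand)
            improved = True
    return selected, best_bic
```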


Shrinkage Methods

An alternative way to deal with selection is to work with all p predictors at once while placing a constraint on the size of the estimated coefficients.

This is a regularization technique that reduces the variability of the estimates and tends to lead to better predictions.

The hope is that by having the constraint in place, the estimation procedure will be able to focus on “the important β's”.


Ridge Regression

Ridge Regression is a modification of the least squares criterion that minimizes (as a function of the β's)

Σ_{i=1}^{n} (Yi − β0 − Σ_{j=1}^{p} βj Xij)² + λ Σ_{j=1}^{p} βj²

for some value of λ > 0

I The “blue” part of the equation (the first term) is the traditional objective function of LS

I The “red” part (the second term) is the shrinkage penalty, i.e., something that makes it costly to have big values for β


Ridge Regression

Σ_{i=1}^{n} (Yi − β0 − Σ_{j=1}^{p} βj Xij)² + λ Σ_{j=1}^{p} βj²

I if λ = 0 we are back to least squares

I when λ → ∞, it is “too expensive” to allow for any β to be different than 0...

I So, for different values of λ we get a different solution to the problem


Ridge Regression

I What ridge regression is doing is exploring the bias-variance trade-off! The larger the λ, the more bias (towards zero) is being introduced in the solution, i.e., the less flexible the model becomes... at the same time, the solution has less variance

I As always, the trick is to find the “right” value of λ that makes the model not too simple but not too complex!

I Whenever possible, we will choose λ by comparing the out-of-sample performance, usually via cross-validation (see the sketch below)
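A minimal ridge sketch with scikit-learn (not the original course code), assuming a numeric predictor matrix X and response vector y supplied by the user; λ is called alpha in scikit-learn, the X's are centered and scaled first, and the value is picked by 10-fold cross-validation over a grid.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_ridge(X, y):
    alphas = np.logspace(-3, 3, 50)                       # grid of candidate penalties
    model = make_pipeline(StandardScaler(),               # center and scale the X's
                          RidgeCV(alphas=alphas, cv=10))  # 10-fold CV over the grid
    model.fit(X, y)
    return model

# Usage: ridge = fit_ridge(X, y); ridge[-1].alpha_ is the selected penalty
```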


Ridge Regression

[Figure: squared bias (black), variance (green), and test MSE (purple) as functions of λ (left) and of ‖βλ‖2/‖β‖2 (right)]

Comments:

I Ridge is computationally very attractive as the “computing cost” is almost the same as least squares (contrast that with subset selection!)

I It's good practice to always center and scale the X's before running ridge


LASSO

The LASSO is a shrinkage method that performs automatic selection. It is similar to ridge, but it provides solutions that are sparse, i.e., some β's exactly equal to 0! This facilitates interpretation of the results...

[Figure: standardized coefficient paths for Ridge (left) and LASSO (right), plotted against λ and against the shrinkage fraction; the legends highlight the Income, Limit, Rating, and Student coefficients]


LASSO

The LASSO solves the following problem:

arg min_β  Σ_{i=1}^{n} (Yi − β0 − Σ_{j=1}^{p} βj Xij)² + λ Σ_{j=1}^{p} |βj|

I Once again, λ controls how flexible the model gets to be

I Still a very efficient computational strategy

I Whenever possible, we will choose λ by comparing the out-of-sample performance, usually via cross-validation (see the sketch below)
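A minimal LASSO sketch with scikit-learn, under the same assumptions as the ridge sketch (numeric X, response y); LassoCV builds its own grid of penalties and picks one by cross-validation, and the exact zeros in coef_ are the dropped variables.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_lasso(X, y):
    model = make_pipeline(StandardScaler(),
                          LassoCV(cv=10, n_alphas=100))
    model.fit(X, y)
    kept = np.flatnonzero(model[-1].coef_)   # indices of predictors with nonzero coefficients
    return model, kept
```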


Ridge vs. LASSO

Why does the LASSO output zeros?


Ridge vs. LASSO

Which one is better?

I It depends...

I In general, LASSO will perform better than Ridge when a relatively small number of predictors have a strong effect on Y, while Ridge will do better when Y is a function of many of the X's and the coefficients are of moderate size

I LASSO can be easier to interpret (the zeros help!)

I But, if prediction is what we care about, the only way to decide which method is better is comparing their out-of-sample performance


Choosing λ

The idea is to solve the ridge or LASSO objective function over a grid of possible values for λ...

[Figure: 10-fold cross-validated RMSE as a function of log(lambda) for Ridge (left) and LASSO (right)]


6. Dimension Reduction Methods

Sometimes, the number (p) of X variables available is too large for us to work with the methods presented above.

Perhaps we could first summarize the information in the predictors into a smaller set of variables (m << p) and then try to predict Y.

In general, these summaries are often linear combinations of the original variables.


Principal Components Regression

A very popular way to summarize multivariate data is Principal Components Analysis (PCA).

PCA is a dimensionality reduction technique that tries to represent p variables with k < p “new” variables.

These “new” variables are created by linear combinations of the original variables, and the hope is that a small number of them are able to effectively represent what is going on in the original data.


Principal Components Analysis

Assume we have a dataset where p variables are observed. Let Xi be the i-th observation of the p-dimensional vector X. PCA writes:

xij = b1jzi1 + b2jzi2 + · · ·+ bkjzik + eij

where zij is the i-th observation of the j-th principal component.

You can think about these z variables as the “essential variables” responsible for all the action in X.
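A minimal PCA sketch with scikit-learn on made-up data, similar in spirit to the two-variable picture below: two highly correlated x's are summarized by their principal components, and PC1 carries almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=200)      # strongly correlated with x1
X = np.column_stack([x1, x2])

X_std = StandardScaler().fit_transform(X)        # center (and scale) before PCA
pca = PCA(n_components=2).fit(X_std)

Z = pca.transform(X_std)                         # the z scores (principal components)
print(pca.explained_variance_ratio_)             # PC1 should explain most of the variance
print(pca.components_)                           # rows give the b weights on x1 and x2
```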


Principal Components Analysis

Here's a picture...

Fitting principal components via least squares: PCA looks for high-variance projections from the multivariate x (i.e., the long direction) and finds the least squares fit. Components are ordered by the variance of the fitted projection.

[Figure: scatter of x1 vs. x2 with the PC1 and PC2 directions drawn through the cloud of points]

These two variables x1 and x2 are very correlated. PC1 tells you almost everything that is going on in this dataset! PCA will look for linear combinations of the original variables that account for most of their variability!


Principal Components Analysis

Let’s look at a simple example... Data: Protein consumption byperson by country for 7 variables: red meat, white meat, eggs,milk, fish, cereals, starch, nuts, vegetables.

[Figure: countries plotted by RedMeat vs. WhiteMeat (left) and by their PC1 vs. PC2 scores (right)]

Looks to me that PC1 measures how rich you are and PC2 something to do with the Mediterranean diet!


Principal Components Analysis

[Figure: PC1 vs. PC2 scores for each country, with the variable weights overlaid]

These are the weights defining the principal components...


Principal Components Analysis

3 variables might be enough to represent this data... Most of the variability (75%) is explained with PC1, PC2 and PC3.

[Figure: scree plot of the food principal component variances]


Principal Components Analysis: Comments

I PCA is a great way to summarize data

I It “clusters” both variables and observations simultaneously!

I The choice of k can be evaluated as a function of the interpretation of the results or via the fit (% of the variation explained)

I The units of each PC are not interpretable in an absolute sense. However, relative to each other they are... see the example above.

I Always a good idea to center the data before running PCA.


Principal Components Regression (PCR)

Let’s go back to and think of predicting Y with a potentially largenumber of X variables...

PCA is sometimes used as a way to reduce the dimensionality ofX ... if only a small number of PC’s are enough to represent X , Idon’t need to use all the X ’s, right? Remember, smaller modelstend to do better in predictive terms!

This is called Principal Component Regression. First represent Xvia k principal components (Z ) and then run a regression of Yonto Z . PCR assumes that the directions in which shows the mostvariation (the PCs), are the directions associated with Y .

The choice of k can be done by comparing the out-of-samplepredictive performance.
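A minimal PCR sketch with scikit-learn, assuming numeric arrays X and y supplied by the user: summarize X with k principal components, then regress Y on the component scores; k would normally be chosen by out-of-sample comparison.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_pcr(X, y, k=2):
    model = make_pipeline(StandardScaler(),        # center and scale the X's
                          PCA(n_components=k),     # keep the first k components (Z)
                          LinearRegression())      # regress Y on Z
    model.fit(X, y)
    return model

# Usage: pcr = fit_pcr(X, y, k=2); predictions = pcr.predict(X_new)
```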


Principal Components Regression (PCR)

Example: Roll Call Votes in Congress... all votes in the 111th Congress (2009-2011); p = 1647, n = 445.

Goal: predict party, i.e., how “liberal” a district is, as a function of the votes by their representative.

Let's first take the principal component decomposition of the votes...

[Figure: scree plot of the roll-call vote principal components]

It looks like 2 PCs capture much of what is going on...


Principal Components Regression (PCR)

Histogram of loadings on PC1... What bills are important in defining PC1?

[Figure: histogram of the PC1 vote loadings, with the votes labeled “Afford. Health (amdt.)” and “TARP” highlighted]


Principal Components Regression (PCR)

Histogram of PC1... “Ideology Score”

[Figure: histogram of the PC1 vote scores, with Royball (CA-34), Deutch (FL-19), Rosleht (FL-18), and Conaway (TX-11) marked]


Principal Components Regression (PCR)

[Figure: TX-11 and CA-34, the two extremes...]


Principal Components Regression (PCR)

[Figure: FL-18 and FL-19, the swing state!]


Principal Components Regression (PCR)

[Figure: % Obama vote vs. PC1 (left) and PC2 vs. PC1 (right) for each district]

All we need is PC1 to predict party affiliation!

How can this picture help you understand what happened in the 2010 election?


Principal Components Regression (PCR)

[Figure: % Obama vote vs. PC1, with the districts whose representatives lost in 2010 highlighted in the right panel]


Partial Least Squares (PLS)

PLS works very similarly to PCR in that it creates “new” variables (Z) by taking linear combinations of the original variables (X).

The difference is that PLS attempts to find the directions of variation in X that help explain BOTH X and Y.

It is a supervised learning alternative to PCR...


Partial Least Squares (PLS)

PLS works as follows:

1. The weights of the first linear combination (Z1) are defined by the regression of Y onto each of the X's... i.e., large weights are going to be placed on the X variables most related to Y in a univariate sense

2. Regress each X variable onto Z1 and compute the residuals

3. Repeat step (1) using the residuals from (2) in place of X

4. iterate

As always, the choice of where to stop, i.e., how many Z variables to use, should be made by comparing the out-of-sample predictive performance.
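A minimal PLS sketch using scikit-learn's off-the-shelf implementation rather than the step-by-step recipe above (numeric X and y supplied by the user); the number of components is compared by 10-fold cross-validated R².

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def pls_cv_scores(X, y, max_components=5):
    scores = {}
    for m in range(1, max_components + 1):
        pls = PLSRegression(n_components=m, scale=True)
        scores[m] = cross_val_score(pls, X, y, cv=10, scoring="r2").mean()
    return scores     # pick the number of components with the best cross-validated R^2
```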


Partial Least Squares (PLS)

Roll Call Data again... it looks like the first component from PLS is the same as the first principal component!

[Figure: % Obama vote vs. PC1 (left) and vs. the first PLS component (right)]


Partial Least Squares (PLS)

But, using two components PLS does better than PCR!!

[Figure: % Obama vote vs. fitted values for PCR (R² = 0.448, left) and PLS (R² = 0.558, right), each using two components]


Partial Least Squares (PLS)

Not easy to understand the difference between the second component in each method (how is that for a homework!)... the bottom line is that by using the information from Y in summarizing the X variables, PLS finds a second component that has the ability to explain part of Y.

[Figure: second vs. first component for PCR (left) and PLS (right)]
