Week 12 Annotated


ACTL2002/ACTL5101 Probability and Statistics: Week 12

© Katja Ignatieva
School of Risk and Actuarial Studies, Australian School of Business
University of New South Wales
[email protected]

First nine weeks

- Introduction to probability;
- Moments: (non-)central moments, mean, variance (standard deviation), skewness & kurtosis;
- Special univariate (parametric) distributions (discrete & continuous);
- Joint distributions;
- Convergence, with applications: LLN & CLT;
- Estimators (MME, MLE, and Bayesian);
- Evaluation of estimators;
- Interval estimation.

Last two weeks

Simple linear regression:
- Idea;
- Estimation using LSE (& BLUE estimator & relation to MLE);
- Partition of the variability of the variable;
- Testing: i) slope; ii) intercept; iii) regression line; iv) correlation coefficient.

Multiple linear regression:
- Matrix notation;
- LSE estimates;
- Tests;
- R-squared and adjusted R-squared.

Modelling with Linear Regression

Modelling assumptions in linear regression:
- Confounding effects
- Collinearity
- Heteroscedasticity

Special explanatory variables:
- Interaction of explanatory variables
- Categorical explanatory variables

Model selection:
- Reduction of the number of explanatory variables
- Model validation

Modelling assumptions in linear regression

Confounding effects

Linear regression measures the effect of explanatory variables X1, . . . , Xn on the dependent variable Y.

The assumptions are:
- Effects of the covariates (explanatory variables) must be additive;
- Homoskedastic (constant) variance;
- Errors must be independent of the explanatory variables with mean zero (weak assumptions);
- Errors must be Normally distributed, and hence symmetric (strong assumptions).

But what about confounding variables?

Correlation does not imply causality!

C is a confounder of the relation between X and Y if: C influences X and C influences Y, but X does not influence Y (directly).

How to use (and not use) confounding variables correctly?

- If the confounding variable is observable: add the confounding variable to the regression.
- If the confounding variable is unobservable: be careful with the interpretation.

The predictor variable has an indirect influence on the dependent variable.
Example: Age ⇒ Experience ⇒ Probability of a car accident. Experience cannot be measured, thus age can be a proxy for experience.

The predictor variable has no direct influence on the dependent variable.
Example: Becoming older does not make you a better driver.

Hence, such a predictor variable works as a predictor, but action taken on the predictor itself will have no effect.


Collinearity

Multicollinearity occurs when one explanatory variable is a (nearly) linear combination of the other explanatory variables.

If an explanatory variable is collinear, this variable is redundant: it provides no/little additional information.

Example: a perfect fit for y = −87 + x1 + 18 x2, but also for y = −7 + 9 x1 + 2 x2:

i     1    2    3    4
yi    23   83   63   103
xi1   2    8    6    10
xi2   6    9    8    10

Note that x2 = 5 + x1/2; thus replacing β2 by β2 + c, β1 by β1 − c/2 and β0 by β0 − 5c gives exactly the same fitted values for any c.

Collinearity:

- Does not affect the fit, nor the predictions.
- Estimates of the error variance, and thus also of model adequacy, are still reliable.
- Standard errors of individual regression coefficients are higher, leading to small t-ratios.

Detecting collinearity:
i) Regress xj on the other explanatory variables;
ii) Determine the coefficient of determination R²j of this regression;
iii) Calculate the Variance Inflation Factor: VIFj = (1 − R²j)⁻¹. If it is large (> 10), severe collinearity exists.

When severe collinearity exists, often the only option is to remove one or more variables from the regression equation.
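The three detection steps above translate directly into code. Below is a minimal numpy sketch (not from the lecture materials); the toy data reuse the collinear example x2 = 5 + x1/2 from the previous slide, perturbed slightly so that R²j stays below one and the VIFs remain finite.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: regress each column of X on the remaining
    columns and return VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
        resid = target - Z @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Toy data from the collinearity example: x2 = 5 + x1/2, slightly perturbed
x1 = np.array([2.0, 8.0, 6.0, 10.0])
x2 = 5.0 + x1 / 2.0 + np.array([0.0, 0.1, -0.1, 0.05])
print(vif(np.column_stack([x1, x2])))   # both VIFs far above the rule-of-thumb cut-off of 10
```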


Heteroscedasticity

We have assumed homoscedastic residuals; if the variance of the residuals differs across observations, we have heteroscedasticity.

- The least squares estimator is unbiased, even in the presence of heteroscedasticity.
- The least squares estimator might, however, not be the optimal estimator.
- Confidence intervals and hypothesis tests depend on homoscedastic residuals.

Graphical check: plot the estimated residuals against the endogenous variable.

[Figure: scatter plots of the residuals εi against yi, x1i and x2i, and of the residuals εi* against yi* = log(yi).]

Solution: use a transformation of Y, i.e., yi* = log(yi).

Graphical check: plot the estimated residuals against the explanatory variables (here LM = SSM/2 = 34, while χ²0.99(2) = 9.21 and χ²0.95(2) = 5.99, so homoscedasticity is rejected).

[Figure: scatter plots of the residuals εi against yi, x1i and x2i, and of x2i against x1i.]

Detecting heteroscedasticity: F-test (using two groups of data), White (1980) test, or the Breusch and Pagan (1980) test:

Test H0: homoscedastic residuals v.s. H1: Var(yi) = σ² + zi⊤γ, where zi is a known vector of variables and γ is a p-dimensional vector of parameters.

Test procedure:
1. Fit the regression model and determine the residuals εi.
2. Calculate the squared standardized residuals εi*² = εi²/s².
3. Fit a regression model of εi*² on zi (zi can be all xi).
4. Test statistic: LM = SSM/2, where SSM is the model sum of squares of the regression in step 3, i.e., the sum of squared deviations of the fitted εi*² from their mean.
5. Reject H0 if LM > χ²1−α(p).
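As a rough illustration (my own simulated data, not the lecture example), the five steps can be coded as follows; here zi is taken to be all the explanatory variables and s² is the ML estimate SSE/n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 4, n)
x2 = rng.uniform(0, 20, n)
y = 10 + 2 * x1 + 0.5 * x2 + rng.normal(0, 1 + 0.3 * x2)   # error std dev grows with x2

# Step 1: fit the regression and keep the residuals
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
eps = y - X @ beta

# Step 2: squared standardized residuals, using s^2 = SSE / n
u = eps ** 2 / (eps @ eps / n)

# Step 3: regress the squared standardized residuals on z_i (here: all x_i)
gamma, *_ = np.linalg.lstsq(X, u, rcond=None)
fitted = X @ gamma

# Steps 4-5: LM = SSM / 2, compared with a chi-square(p) quantile (p = 2 variables in z)
LM = np.sum((fitted - u.mean()) ** 2) / 2
print(LM, stats.chi2.ppf(0.95, 2))   # reject homoscedasticity when LM exceeds the quantile
```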

GLS & WLS (OPTIONAL)

Sometimes you know how much variation there should be in the residual of each observation.

Example: differences in risk exposure.

Application: mortality modelling (exposures by age), proportions claiming (changing portfolio sizes).

Heteroscedasticity: E[εε⊤] = σ²Ω with Ω ≠ In; write Ω⁻¹ = P⊤P.

Find the OLS estimates of the regression model which is pre-multiplied by P:

Py = PXβ + Pε  ⇒  ỹ = X̃β + ε̃.

Find P using (for example) the Cholesky decomposition.

GLS & WLS (OPTIONAL)

Note that we can apply OLS to the model pre-multiplied by P, because:

E[ε̃ε̃⊤] = E[Pεε⊤P⊤] = P E[εε⊤] P⊤ = P σ²Ω P⊤ = σ²In.

Hence, we have the Generalized Least Squares estimator:

β̂ = (X̃⊤X̃)⁻¹ X̃⊤ỹ = (X⊤P⊤PX)⁻¹ X⊤P⊤Py = (X⊤Ω⁻¹X)⁻¹ X⊤Ω⁻¹y

Var(β̂) = σ² (X̃⊤X̃)⁻¹ = σ² (X⊤Ω⁻¹X)⁻¹.

GLS & WLS (OPTIONAL)

Weighted least squares: each observation is weighted by 1/√ωi:

E[εε⊤] = σ²Ω = σ² diag(ω1, . . . , ωn),

Ω⁻¹ = diag(1/ω1, . . . , 1/ωn) = P⊤P  ⇒  P = diag(1/√ω1, . . . , 1/√ωn).

Only applicable when you know the relative variances ω1, . . . , ωn.

Can we also estimate the variance-covariance matrix Ω?
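A short numpy sketch (illustrative only) of WLS with known relative variances ωi; it checks that pre-multiplying by P = diag(1/√ωi) and running OLS gives the same answer as the closed-form GLS expression from the previous slide. The data-generating process is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(1, 10, n)
omega = x ** 2                                  # assumed known relative variances
y = 1 + 2 * x + rng.normal(0, np.sqrt(omega))   # heteroscedastic errors

X = np.column_stack([np.ones(n), x])

# Pre-multiply the model by P = diag(1/sqrt(omega_i)) and apply ordinary OLS
P = np.diag(1 / np.sqrt(omega))
beta_wls, *_ = np.linalg.lstsq(P @ X, P @ y, rcond=None)

# Closed-form GLS: (X' Omega^-1 X)^-1 X' Omega^-1 y  -- identical result
Oinv = np.diag(1 / omega)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)
print(beta_wls, beta_gls)
```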

EGLS/FGLS (OPTIONAL)

Feasible GLS (FGLS) or estimated GLS (EGLS) does not impose the structure of the heteroskedasticity, but estimates it from the data.

Estimation procedure:
1. Estimate the regression using OLS.
2. Regress the squared residuals on the explanatory variables.
3. Determine the expected squared residuals ⇒ ω̂i ⇒ Ω̂ = diag(ω̂1, . . . , ω̂n).
4. Use WLS with weights ω̂i to find the EGLS/FGLS estimate.
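A possible implementation sketch of the four-step procedure, assuming (as in step 2) that the expected squared residuals are a linear function of the explanatory variables; the clipping guard is my own addition to keep the estimated weights positive.

```python
import numpy as np

def fgls(X, y):
    """Sketch of the four-step EGLS/FGLS procedure."""
    # 1. OLS fit
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    eps2 = (y - X @ beta_ols) ** 2
    # 2.-3. Regress squared residuals on the explanatory variables -> expected omega_i
    gamma, *_ = np.linalg.lstsq(X, eps2, rcond=None)
    omega = np.clip(X @ gamma, 1e-8, None)   # guard: keep the estimated weights positive
    # 4. WLS with weights 1/omega_i
    P = np.diag(1 / np.sqrt(omega))
    beta_fgls, *_ = np.linalg.lstsq(P @ X, P @ y, rcond=None)
    return beta_fgls

# Example usage: print(fgls(X, y)) with X, y as in the WLS snippet above.
```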


Special explanatory variables

Interaction of explanatory variables

In linear regression covariates are additive.

Multiplicative relations are non-additive.

Example: liquidity of stocks can be explained by: price, volume, and value (= price × volume).

Note that interaction terms might lead to high collinearity.

For symmetric distributions, if the explanatory variables are centered, there is no correlation between the interaction term and the main effects. Thus, center variables to reduce collinearity issues, as illustrated in the sketch below.

Example: yi = β0 + β1 · X1 + β2 · X2 + β3 · X1X2 + εi, where β1 and β2 are main effects and β3 is the interaction effect. The marginal effect of X1 given X2 = x2 is ∂yi/∂x1 = β1 + β3 x2.
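The claim about centering is easy to verify numerically. In this small simulation (normal, hence symmetric, regressors of my own choosing) the raw interaction x1·x2 is strongly correlated with the main effects, while the centered interaction is nearly uncorrelated with them.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(5, 1, n)     # symmetric regressors, located away from zero
x2 = rng.normal(10, 2, n)

raw = x1 * x2
centered = (x1 - x1.mean()) * (x2 - x2.mean())

print(np.corrcoef(x1, raw)[0, 1], np.corrcoef(x2, raw)[0, 1])            # both large
print(np.corrcoef(x1, centered)[0, 1], np.corrcoef(x2, centered)[0, 1])  # both near zero
```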

Interaction of random variables

A moderator variable is a predictor (e.g., X1) that interacts with another predictor (e.g., X2) in explaining variance in a predicted variable (e.g., Y).

Moderating effects are equivalent to interaction effects.

Rearranging, we get:

yi = (β0 + β1 · X1) + (β2 + β3 · X1) · X2 + εi,

where the first term shows that the intercept depends on X1, and the second that the slope of X2 depends on X1.

Always include the marginal effects in the regression.

Example: interaction of random variables

Regression output (see excel file); each cell shows the coefficient and its t-statistic:

            Main effects      Full model        Excluding X2      Centered
            Coef   t-stat     Coef   t-stat     Coef   t-stat     Coef   t-stat
intercept   -0.71  -0.94      -0.61  -1.51      -0.46  -1.55      1.90   6.08
X1           3.88   4.29       0.83   1.24       1.02   1.83      2.73   5.30
X2           0.46   0.71       0.18   0.53                        1.08   3.01
X1 · X2                        1.11   6.57       1.12   6.83      1.11   6.57

Correlations (in the centered panel each xi is replaced by xi − E[Xi]):

non-centered   x1     x2     x1·x2        centered   x1     x2     x1·x2
y              0.92   0.82   0.97         y          0.92   0.82   0.51
x1                    0.86   0.80         x1                0.86   0.07
x2                           0.80         x2                       0.07

Always include the marginal effects (heteroscedastic variance):

[Figure: scatter plots of the residuals εi against x1i and against x1i · x2i; the variance of εi is a decreasing function of x1i · x2i.]


Categorical explanatory variables

Binary explanatory variables

Categorical variables provide a numerical label for measurements of observations that fall in distinct groups or categories.

A binary variable is a variable which can only take two values, namely zero and one.

Example: gender (male = 1, female = 0), years of education (1 if more than 12 years, 0 otherwise).

Regression (see next slide):

LnFacei = β0 + β1 LnIncomei + β2 Singlei + εi
LnFacei = −0.42 + 1.12 LnIncomei − 0.51 Singlei + εi,

with standard errors 0.56, 0.05 and 0.16, and residual standard deviation s = 0.57.
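As an illustration of how such a binary regressor enters the design matrix, here is a small simulation whose coefficients are chosen to echo the fitted equation above (the sample size, the distribution of LnIncome and the Single indicator are all invented for the example).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 275                                          # invented sample size
ln_income = rng.uniform(6, 13, n)
single = rng.integers(0, 2, n).astype(float)     # binary indicator: 1 = single
ln_face = -0.42 + 1.12 * ln_income - 0.51 * single + rng.normal(0, 0.57, n)

# The dummy simply enters the design matrix as an extra 0/1 column
X = np.column_stack([np.ones(n), ln_income, single])
beta, *_ = np.linalg.lstsq(X, ln_face, rcond=None)
print(beta)   # estimates of (beta0, beta1, beta2): intercept, LnIncome slope, shift for singles
```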

Interpretation of the coefficients:
- β0: intercept for non-singles;
- β1: marginal effect of LnIncome;
- β2: difference in intercept between singles and non-singles, i.e., β0 + β2 is the intercept for singles.

[Figure: fitted regression lines of LnFace on LnIncome for non-singles and singles.]

[Figure: LnFace against LnIncome, non-singles and singles.]

[Figure: residuals against LnIncome, non-singles and singles.]

Binary explanatory variables

Regression (see next slide):

LnFacei = β0 + β1 LnIi + β2 Si + β3 Si × LnIi + εi
LnFacei = 0.11 + 1.07 LnIi − 3.28 Si + 0.27 Si × LnIi + εi,

with standard errors 0.61, 0.06, 1.41 and 0.14, and residual standard deviation s = 0.56.

Interpretation of the coefficients:
- β0: intercept for non-singles;
- β1: marginal effect of LnIncome for non-singles;
- β2: difference in intercept between singles and non-singles, i.e., β0 + β2 is the intercept for singles;
- β3: difference in the marginal effect of LnIncome between singles and non-singles, i.e., β1 + β3 is the marginal effect of LnIncome for singles.

[Figure: LnFace against LnIncome, non-singles and singles.]

[Figure: residuals against LnIncome, non-singles and singles.]

Binary explanatory variables

Now consider an example where we try to explain the hourly wage of individuals in a random sample.

As explanatory variable we have years of education.

Question: Why are you here?

Regression (see next slide):

HWi = β0 + β1 YEi + εi
HWi = −406 + 33 YEi + εi,

with standard errors 8.97 and 0.65, and residual standard deviation s = 12.

Solution: To earn more money later on?

Question: Explain the extent of correlation and causality in this case.

[Figure: hourly wage against years of education.]

[Figure: residuals against years of education.]

Binary explanatory variables

Regression (see next slide):

HWi = β0 + β1 YEi + β2 (YEi − 14) × Di + εi
HWi = −151 + 13.7 YEi + 35 (YEi − 14) × Di + εi,

with standard errors 3.91, 0.30 and 0.47, and residual standard deviation s = 2.37, where Di = 1 if YEi > 14 and zero otherwise.

Interpretation of the coefficients:
- β0: intercept for hourly wage;
- β1: marginal effect of years of education up to 14 years;
- β2: difference in the marginal effect of years of education before and after 14 years, i.e., β1 + β2 is the marginal effect of years of education for years > 14.
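The term (YEi − 14) × Di is simply the hinge function max(YEi − 14, 0), so the fit is an ordinary OLS regression on one extra constructed column. A sketch on invented data (not the lecture data set):

```python
import numpy as np

def hinge_fit(ye, hw, knot=14.0):
    """Fit HW = b0 + b1*YE + b2*(YE - knot)*D + eps with D = 1 if YE > knot;
    the slope is b1 below the knot and b1 + b2 above it."""
    d = (ye > knot).astype(float)
    X = np.column_stack([np.ones_like(ye), ye, (ye - knot) * d])
    beta, *_ = np.linalg.lstsq(X, hw, rcond=None)
    return beta

# Invented data with a kink at 14 years of education
rng = np.random.default_rng(5)
ye = rng.uniform(12, 16, 300)
hw = -151 + 13.7 * ye + 35 * np.maximum(ye - 14, 0) + rng.normal(0, 2.37, 300)
print(hinge_fit(ye, hw))
```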

[Figure: hourly wage against years of education.]

[Figure: residuals against years of education.]

Categorical explanatory variables

Now consider an example where we try to explain the yearly income of individuals in a random sample.

As explanatory variable we have the highest degree: high school, college, university, or PhD.

Define:
Ci = 0, if the highest degree is high school;
Ci = 1, if the highest degree is college;
Ci = 2, if the highest degree is a university degree;
Ci = 3, if the highest degree is a PhD.

Regression (see next slide):

YIi = β0 + β1 Ci + εi
YIi = 33.0 + 8.00 Ci + εi,

with standard errors 1.36 and 0.91, and residual standard deviation s = 15.7.

[Figure: left panel ("Empirical"): yearly income against education level; right panel ("Categorial"): residuals of the categorical regression against education level.]

[Figure: left panel ("Empirical"): yearly income against education level; right panel ("Dummies"): residuals of the dummy-variable regression against education level.]

Comparing categorical and dummy explanatory variables

Change the categorical variable into dummy variables.

Regression (see previous slide):

YIi = β0 + β1 D1,i + β2 D2,i + β3 D3,i + εi
YIi = 33.4 + 5.32 D1,i + 18.07 D2,i + 14.50 D3,i + εi,

with standard errors 1.53, 2.12, 2.02 and 4.05, and residual standard deviation s = 15.5.

Interpretation:
- β0: average income with a high school degree;
- β1: average additional income of college relative to high school;
- β2: average additional income of university relative to high school;
- β3: average additional income of PhD relative to high school.

Conclusion: only use a single numerically coded categorical variable if the marginal effects of all categories are equal!
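A small sketch of the recoding: the single numeric code Ci is replaced by indicator columns D1, D2, D3, with high school as the baseline. The income numbers below are made up purely to show the mechanics.

```python
import numpy as np

def dummies(codes, n_levels):
    """Indicator columns D1, ..., D_{n_levels-1}; level 0 (high school) is the baseline."""
    return np.column_stack([(codes == level).astype(float)
                            for level in range(1, n_levels)])

# Education codes as on the slide: 0 = high school, 1 = college, 2 = university, 3 = PhD
c = np.array([0, 1, 2, 3, 1, 0, 2, 3, 1, 2])
y = np.array([30.0, 38, 52, 48, 37, 33, 50, 47, 40, 51])   # made-up yearly incomes

X_cat = np.column_stack([np.ones(len(y)), c.astype(float)])   # forces one common slope
X_dum = np.column_stack([np.ones(len(y)), dummies(c, 4)])     # one shift per degree level
print(np.linalg.lstsq(X_cat, y, rcond=None)[0])
print(np.linalg.lstsq(X_dum, y, rcond=None)[0])   # beta_j = mean difference vs high school
```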


Model selection

Reduction of the number of explanatory variables

There are many possible combinations of explanatory variables:

number of X   combinations     number of X   combinations
1             1                6             63
2             3                7             127
3             7                8             255
4             15               9             511
5             31               10            1023

You do not want to check them all.

How to decide which explanatory variables to include?

Stepwise regression algorithm

(i) Consider all possible regressions using one explanatory variable. For each of the regressions calculate the t-ratio. Select the explanatory variable with the highest absolute t-ratio (if larger than the critical value, CV).

(ii) Add a variable to the model from the previous step. The variable to enter is the one that makes the largest significant contribution; its t-ratio must be above the CV.

(iii) Delete a variable from the model of the previous step. The variable to be removed is the one that makes the smallest contribution; its t-ratio must be below the CV.

(iv) Repeat steps (ii) and (iii) until all possible additions and deletions have been performed.
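A compact sketch of the forward part of this algorithm (steps (i)-(ii)); the backward deletion step would scan the selected variables and drop any whose |t| falls below the critical value. The critical value cv = 2.0 is an arbitrary illustrative choice, not a prescription from the slides.

```python
import numpy as np

def t_ratios(X, y):
    """OLS t-ratios for each column of X (X should already contain the intercept)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se

def forward_stepwise(X, y, cv=2.0):
    """Greedy forward selection on the absolute t-ratio of the candidate variable."""
    n, k = X.shape
    selected = []
    while True:
        best, best_t = None, cv
        for j in range(k):
            if j in selected:
                continue
            cols = np.column_stack([np.ones(n)] + [X[:, i] for i in selected + [j]])
            t = abs(t_ratios(cols, y)[-1])          # t-ratio of the candidate variable
            if t > best_t:
                best, best_t = j, t
        if best is None:
            return selected                          # indices of the chosen columns of X
        selected.append(best)
```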

Stepwise regression algorithm

+ Useful algorithm that quickly searches through a number of candidate models.

- The procedure "snoops" through a large number of candidate models and may fit the data "too well".
- There is no guarantee that the selected model is the best.
- The algorithm uses only one criterion, namely the t-ratio, and does not consider other criteria such as s, R², R²a, and so on (s will decrease if the absolute value of the t-ratio is larger than one).
- The algorithm does not take into account the joint effect of explanatory variables.
- Purely automatic procedures may not take into account an investigator's special knowledge.


Model validation

Comparing models

How to compare two regression models with the same number of explanatory variables:
- F-statistic;
- the variability of the residuals (s);
- R-squared.

How to compare two regression models with an unequal number of explanatory variables:
- the variability of the residuals (s);
- adjusted R-squared;
- likelihood ratio test:

  −2 · (ℓp − ℓp+q) ∼ χ²q.

  Reject the hypothesis that the model with p variables is as good as the model with p + q variables if −2 · (ℓp − ℓp+q) > χ²1−α(q);
- (Optional:) information criteria ⇒ select the model with the lowest AIC = 2k − 2ℓ.
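For nested Gaussian linear models the log-likelihoods needed by the LR test and the AIC follow directly from the residual sum of squares, so both criteria can be sketched in a few lines (here k counts the regression coefficients, as in the slide's AIC = 2k − 2ℓ; some texts also count σ²).

```python
import numpy as np
from scipy import stats

def gaussian_loglik(X, y):
    """Maximised Gaussian log-likelihood of a linear model (plug in sigma^2 = SSE/n)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(sse / n) + 1)

def lr_test(X_small, X_big, y, alpha=0.05):
    """LR test of the nested p-variable model against the (p+q)-variable model."""
    lp, lpq = gaussian_loglik(X_small, y), gaussian_loglik(X_big, y)
    q = X_big.shape[1] - X_small.shape[1]
    lr = -2 * (lp - lpq)
    return lr, lr > stats.chi2.ppf(1 - alpha, q)   # True: the extra q variables help

def aic(X, y):
    k = X.shape[1]            # number of regression coefficients, as in AIC = 2k - 2*loglik
    return 2 * k - 2 * gaussian_loglik(X, y)
```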

Out-of-Sample Validation Procedure (OPTIONAL)

(i) Begin with a sample of size n and divide it into two subsamples of sizes n1 and n2.

(ii) Using the model-development subsample (i = 1, . . . , n1), fit a candidate model.

(iii) Using the model from step (ii), predict ŷi in the validation subsample.

(iv) Assess the proximity of the predictions to the held-out data. One measure is the sum of squared prediction errors:

SSPE = Σ_{i=n1+1}^{n} (yi − ŷi)².
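A minimal sketch of steps (i)-(iv), assuming the first n1 rows form the model-development subsample and the rest form the validation subsample (the simulated data are only there to make the snippet runnable).

```python
import numpy as np

def sspe(X, y, n1):
    """Fit on the first n1 rows (model development) and sum the squared
    prediction errors over the remaining validation rows."""
    beta, *_ = np.linalg.lstsq(X[:n1], y[:n1], rcond=None)
    return np.sum((y[n1:] - X[n1:] @ beta) ** 2)

# Example with a 70/30 split of simulated data
rng = np.random.default_rng(6)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)
print(sspe(X, y, n1=70))
```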

PRESS Validation Procedure (OPTIONAL)

Predicted residual sum of squares:

(i) For the full sample, omit the i-th observation and use the remaining n − 1 observations to compute the regression coefficients.

(ii) Use the regression coefficients from (i) to compute the predicted response for the i-th point, ŷ(i).

(iii) Repeat steps (i) and (ii) for i = 1, . . . , n and define:

PRESS = Σ_{i=1}^{n} (yi − ŷ(i))².

Note: one can rewrite this into a computationally less intensive procedure:

yi − ŷ(i) = εi / (1 − xi⊤ (X⊤X)⁻¹ xi).
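Both the literal leave-one-out loop and the computational shortcut on this slide are easy to code; the two functions below should return the same value (a sketch, using the hat-matrix diagonal hii = xi⊤(X⊤X)⁻¹xi).

```python
import numpy as np

def press_shortcut(X, y):
    """PRESS via y_i - yhat_(i) = e_i / (1 - h_ii), with h_ii the hat-matrix diagonal."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    return np.sum((e / (1 - h)) ** 2)

def press_loop(X, y):
    """Literal steps (i)-(iii): refit without observation i and predict it."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2
    return total
```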

Data analysis and modelling

Box (1980): examine the data, hypothesize a model, compare the data to the candidate model, formulate an improved model.

1. Examine the data graphically; use prior knowledge of relationships (economic theory, industry practice).

2. The assumptions the model is based on must be consistent with the data.

3. Diagnostic checks (data and model criticism): the data and the model must be consistent with one another before inferences can be made.