Heteroskedasticity and Errors in Variables


Page 2: Heteroskedasticity and Errors in variables

. *DEFINITIONS OF ARTIFICIAL DATA SET
. mat m=(12,20,0)                       /*matrix of means of RHS vars: edu, exp, error*/
. mat c=(5,-.6,0 \ -.6,119,0 \ 0,0,.1)  /*covariance matrix of RHS vars*/
. mat l m                               /*displays matrix of means*/

        c1   c2   c3
  r1    12   20    0

. mat l c                               /*displays covariance matrix*/

symmetric c[3,3]
        c1    c2    c3
  r1     5
  r2   -.6   119
  r3     0     0    .1

. drawnorm edu exp e,n(2300) means(m) cov(c)
(obs 2300)

. *Compare normal and lognormal distribution
. g Y=exp(logY)
. gr Y,bin(40) norm saving($pathc\e1,replace)
. gr logY,bin(40) norm saving($pathc\e2,replace)
. gr using $pathc\e1 $pathc\e2

[Figure: two histograms with normal overlay -- Fraction vs Y (range 1492.82 to 18969.20) and Fraction vs logY (range 7.30842 to 9.85057)]
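The transcript uses logY (e.g. in "g Y=exp(logY)") before showing how it was built. Judging by the coefficients used later to construct logY_a and logY_b, it was presumably generated earlier in the do-file from the same wage-type equation; a minimal sketch, assuming that construction:

. *Assumed construction of logY (not shown in the transcript); the coefficients
. *match those used below for logY_a and logY_b
. g exp2=exp^2                          /* quadratic experience term */
. g logY=7.6 + edu*.07 + exp*.012 - exp2*.0005 + e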

Page 3: Heteroskedasticity and Errors in variables

. *============= HETEROSKEDASTICITY =======================
  References: Stata Reference Manual [N-R], regression diagnostics, pp.357-
              Stata Programming [P], _robust, pp.342
              Wooldridge, Heteroskedasticity, pp.257
              Kennedy, ch.8, pp.133-156

. *Original error w/o heteroskedasticity
. reg logY edu exp exp2

      Source |       SS       df       MS              Number of obs =    2140
-------------+------------------------------           F(  3,  2136) =  220.21
       Model |  62.7069133     3  20.9023044           Prob > F      =  0.0000
    Residual |  202.746788  2136  .094918908           R-squared     =  0.2362
-------------+------------------------------           Adj R-squared =  0.2352
       Total |  265.453702  2139  .124101777           Root MSE      =  .30809

------------------------------------------------------------------------------
        logY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0675944   .0029829    22.66   0.000     .0617447    .0734441
         exp |   .0111987   .0028507     3.93   0.000     .0056083    .0167891
        exp2 |  -.0004682    .000069    -6.78   0.000    -.0006035   -.0003328
       _cons |   7.636382   .0448211   170.37   0.000     7.548485     7.72428
------------------------------------------------------------------------------

. predict res,res
. g res2=res^2
. predict logY_h
(option xb assumed; fitted values)
. gr res logY_h,xlab ylab yline(0) t1("No heter") saving($pathc\e3,replace)

. mat se=sqrt(el(e(V),1,1))   /*sqrt(diagonal element of V-C)=std. error of the estimator*/
. mat l se

symmetric se[1,1]
             c1
  r1  .00298291

. hettest                     /* test using fitted values of logY */

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of logY

         chi2(1)      =     1.90
         Prob > chi2  =   0.1676

. hettest edu                 /* test using edu */

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: edu

         chi2(1)      =     1.76
         Prob > chi2  =   0.1843
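For intuition, the Breusch-Pagan / Cook-Weisberg statistic that hettest reports can be reproduced by hand: regress the squared residuals, scaled by their mean, on the fitted values and take half of the explained sum of squares. A minimal sketch using res2 and logY_h created above (the variable name u2s is introduced here only for illustration):

. *Sketch: Breusch-Pagan test by hand
. qui sum res2
. g u2s=res2/r(mean)           /* squared residuals scaled by their mean */
. qui reg u2s logY_h           /* auxiliary regression on the fitted values */
. di "BP chi2(1) = " e(mss)/2  /* half the model SS ~ chi2(1) under Ho */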

Page 4: Heteroskedasticity and Errors in variables

. hettest,rhs                  /* test using all RHS variables */

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: edu exp exp2

         chi2(3)      =     3.34
         Prob > chi2  =   0.3415

. * Heteroskedastic error term: variance is function of edu
. g e_a=sqrt(edu)*e
. gr e_a edu,xlab ylab yline(0) t1("Heter=f(edu)") saving($pathc\e4,replace)
. g logY_a=7.6+ edu*.07 + exp*.012 - exp2*.0005 + e_a
. reg logY_a edu exp exp2

      Source |       SS       df       MS              Number of obs =    2140
-------------+------------------------------           F(  3,  2136) =   15.94
       Model |  53.7596024     3  17.9198675           Prob > F      =  0.0000
    Residual |  2400.99192  2136  1.12405989           R-squared     =  0.0219
-------------+------------------------------           Adj R-squared =  0.0205
       Total |  2454.75152  2139  1.14761642           Root MSE      =  1.0602

------------------------------------------------------------------------------
      logY_a |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0640565    .010265     6.24   0.000     .0439261    .0841869
         exp |   .0095317     .00981     0.97   0.331    -.0097065    .0287699
        exp2 |   -.000398   .0002375    -1.68   0.094    -.0008637    .0000677
       _cons |   7.693405   .1542414    49.88   0.000     7.390926    7.995884
------------------------------------------------------------------------------

. predict logY_ah
(option xb assumed; fitted values)
. predict res_a,res
. g res_a2=res_a^2
. gr res_a logY_ah,xlab ylab yline(0) t1("Heter=f(edu)") saving($pathc\e5,replace)

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of logY_a

         chi2(1)      =    16.79
         Prob > chi2  =   0.0000

. hettest edu

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: edu

         chi2(1)      =    23.74
         Prob > chi2  =   0.0000

. reg res_a2 edu,noc           /* Note that coefficient on edu=VAR(e) */

      Source |       SS       df       MS              Number of obs =    2140
-------------+------------------------------           F(  1,  2139) = 1070.55

Page 5: Heteroskedasticity and Errors in variables

       Model |   2749.7806     1   2749.7806           Prob > F      =  0.0000
    Residual |  5494.14973  2139  2.56855995           R-squared     =  0.3336
-------------+------------------------------           Adj R-squared =  0.3332
       Total |  8243.93034  2140   3.8523039           Root MSE      =  1.6027

------------------------------------------------------------------------------
      res_a2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0933613   .0028534    32.72   0.000     .0877656     .098957
------------------------------------------------------------------------------

. * Heteroskedastic error term: variance =f(external var)
. g x=uniform()                /* Generate uniformly distributed variable x */
. g e_b=e*(x+.01)              /* Heteroskedastic error: variance =f(external variable x) */
. gr e_b x,xlab ylab yline(0) t1("Heter=f(x)") saving($pathc\e6,replace)
. g logY_b=7.6+ edu*.07 + exp*.012 - exp2*.0005 + e_b
. reg logY_b edu exp exp2

      Source |       SS       df       MS              Number of obs =    2140
-------------+------------------------------           F(  3,  2136) =  661.37
       Model |  64.8910038     3  21.6303346           Prob > F      =  0.0000
    Residual |   69.858911  2136  .032705483           R-squared     =  0.4816
-------------+------------------------------           Adj R-squared =  0.4808
       Total |  134.749915  2139  .062996688           Root MSE      =  .18085

------------------------------------------------------------------------------
      logY_b |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0683546   .0017509    39.04   0.000     .0649209    .0717884
         exp |   .0117226   .0016733     7.01   0.000     .0084411    .0150042
        exp2 |  -.0004879   .0000405   -12.04   0.000    -.0005673   -.0004084
       _cons |   7.624937   .0263097   289.81   0.000     7.573342    7.676532
------------------------------------------------------------------------------

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of logY_b

         chi2(1)      =     3.80
         Prob > chi2  =   0.0511

. hettest,rhs

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: edu exp exp2

         chi2(3)      =     8.93
         Prob > chi2  =   0.0302

. hettest x

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: x

         chi2(1)      =   852.34
         Prob > chi2  =   0.0000
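Because the variance of e_a was constructed to be proportional to edu (the res_a2 regression above estimates Var(e_a) as roughly .09*edu), the textbook remedy for that case is weighted least squares with weights inversely proportional to edu. A minimal sketch, not run in the transcript, assuming that variance structure:

. *Sketch: WLS for the edu-driven heteroskedasticity; analytic weights are taken
. *proportional to the inverse of the error variance
. reg logY_a edu exp exp2 [aweight=1/edu]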

Page 6: Heteroskedasticity and Errors in variables

. gr using $pathc\e3 $pathc\e4 $pathc\e5 $pathc\e6

[Figure: four panels -- "No heter" (Residuals vs Fitted values), "Heter=f(edu)" (e_a vs edu), "Heter=f(edu)" (Residuals vs Fitted values), "Heter=f(x)" (e_b vs x)]

Page 7: Heteroskedasticity and Errors in variables

. *Heteroskedasticity robust estimate of coef. V-C matrix: sandwich estimator
. reg logY_b edu exp exp2,robust        /* Robust estimation of V-C matrix */

Regression with robust standard errors                 Number of obs =    2163
                                                       F(  3,  2159) =  574.80
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4513
                                                       Root MSE      =   .1911

------------------------------------------------------------------------------
             |               Robust
      logY_b |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0679193   .0019325    35.15   0.000     .0641294    .0717091
         exp |   .0091909   .0017577     5.23   0.000     .0057439     .012638
        exp2 |  -.0004475   .0000433   -10.33   0.000    -.0005325   -.0003626
       _cons |   7.656033   .0283748   269.82   0.000     7.600388    7.711678
------------------------------------------------------------------------------

. mat Vreg=e(V)                         /* Robust coef. V-C matrix */
. mat l Vreg

symmetric Vreg[4,4]
                edu         exp        exp2       _cons
   edu    3.735e-06
   exp   -4.410e-08   3.090e-06
  exp2    1.573e-09  -7.360e-08   1.878e-09
 _cons   -.00004498   -.0000256   5.474e-07   .00080513

. reg logY_b edu exp exp2,mse1          /* OLS w/o robust V-C; mse1 forces Root MSE=1 so e(V)=(X'X)^-1 */

      Source |       SS       df       MS              Number of obs =    2163
-------------+------------------------------           F(  3,  2163) =   21.62
       Model |  64.8580318     3  21.6193439           Prob > F      =  0.0000
    Residual |   78.843017  2163  .036450771           R-squared     =  0.4513
-------------+------------------------------           Adj R-squared =  0.4516
       Total |  143.701049  2162   .06646672           Root MSE      =       1

------------------------------------------------------------------------------
      logY_b |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0679193   .0099265     6.84   0.000     .0484527    .0873858
         exp |   .0091909   .0091562     1.00   0.316     -.008765    .0271469
        exp2 |  -.0004475   .0002255    -1.98   0.047    -.0008897   -5.38e-06
       _cons |   7.656033    .146813    52.15   0.000     7.368124    7.943942
------------------------------------------------------------------------------

. predict res_b,res
. mat D=e(V)                            /* Non-robust V-C matrix */
. mat l D

symmetric D[4,4]
                edu         exp        exp2       _cons
   edu    .00009854
   exp   -1.571e-06   .00008384
  exp2    5.026e-08  -1.995e-06   5.083e-08
 _cons   -.00118645  -.00069289   .00001473   .02155404

Page 8: Heteroskedasticity and Errors in variables

. matrix accum M = edu exp exp2 [iweight=res_b^2] /*Salami for the Sandwich*/

. mat l M

symmetric M[4,4]
               edu         exp        exp2       _cons
   edu   12021.856
   exp   19469.987   39234.904
  exp2   475286.24   1056954.2    30403276
 _cons   957.88854   1604.5539   39234.904   78.843017

. mat V=e(N)/(e(N)-e(df_m)-1) * D*M*D   /* Sandwich (X'X)^-1 X'WX (X'X)^-1 with d.f. correction */
. mat l V

symmetric V[4,4]
                edu         exp        exp2       _cons
   edu    3.735e-06
   exp   -4.410e-08   3.090e-06
  exp2    1.573e-09  -7.360e-08   1.878e-09
 _cons   -.00004498   -.0000256   5.474e-07   .00080513

. *Compare matrix V with Vreg above: they are identical.
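In matrix terms, the hand-built estimate above is the usual sandwich with Stata's small-sample correction,

$$\hat V_{\text{robust}} \;=\; \frac{N}{N-K-1}\,(X'X)^{-1}\Big(\sum_i \hat u_i^2\, x_i x_i'\Big)(X'X)^{-1},$$

where D = (X'X)^{-1} comes from the mse1 regression, M = sum of squared residuals times x_i x_i' is built by matrix accum with iweight=res_b^2 (the "salami"), K = e(df_m) is the number of slope regressors, and N/(N-K-1) is the degrees-of-freedom correction e(N)/(e(N)-e(df_m)-1).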

Page 10: Heteroskedasticity and Errors in variables

MEASUREMENT ERROR

Measurement Error in the Dependent Variable

(1)  $y^{*} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K + u$

(2)  $y = y^{*} + e_0$

Suppose equation (1) represents the population model. Instead of observing $y^{*}$, we observe $y$, i.e. $y^{*}$ plus measurement error ($e_0$, with $E[e_0]=0$).

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_K x_K + \underbrace{(u + e_0)}_{v}$$

$$\operatorname{Var}(v) = \sigma_u^2 + \sigma_{e_0}^2 \qquad \text{(requires } \operatorname{Cov}(u, e_0)=0\text{)}$$

To see the implications of measurement error in $y$, plug eq. (2) into eq. (1). The OLS estimators of the $\beta_j$ are affected to the extent that the composite error $v$ is correlated with the explanatory variables. If the measurement error $e_0$ is correlated with the $x_j$, the OLS estimators are biased and inconsistent. Under the classical errors-in-variables assumption, $\operatorname{Cov}(y^{*}, e_0)=0$, and thus $v$ and the $x_j$ are uncorrelated.

Measurement Error in an Explanatory Variable (K=1)

(3)  $y = \beta_0 + \beta_1 x_1^{*} + u$

(4)  $x_1 = x_1^{*} + e_1$

Suppose instead that one of the explanatory variables is measured with error, i.e. we observe $x_1$ instead of $x_1^{*}$ in equation (3). (Again, $E[e_1]=0$.)

$$y = \beta_0 + \beta_1 x_1 + \underbrace{(u - \beta_1 e_1)}_{v}$$

$$\operatorname{plim}\hat\beta_1 = \beta_1 + \frac{\operatorname{Cov}(x_1, v)}{\operatorname{Var}(x_1)}$$

To see the implications of measurement error in $x_1$, plug eq. (4) into eq. (3). The OLS estimator of $\beta_1$ is affected to the extent that the composite error $v$ is correlated with $x_1$. Under the classical errors-in-variables assumption, $\operatorname{Cov}(x_1^{*}, e_1)=0$. Thus,

$$\operatorname{Cov}(x_1, v) = E\big[(x_1^{*}+e_1)(u-\beta_1 e_1)\big] = -\beta_1\,\sigma_{e_1}^2$$

$$\operatorname{Var}(x_1) = \operatorname{Var}(x_1^{*}) + \operatorname{Var}(e_1) = \sigma_{x_1^{*}}^2 + \sigma_{e_1}^2$$

$$\operatorname{plim}\hat\beta_1 = \beta_1\left(\frac{\sigma_{x_1^{*}}^2}{\sigma_{x_1^{*}}^2 + \sigma_{e_1}^2}\right)
\qquad\qquad
\operatorname{plim}\hat\beta_1 = \beta_1\left(\frac{\sigma_{r_1^{*}}^2}{\sigma_{r_1^{*}}^2 + \sigma_{e_1}^2}\right)$$

Under the classical errors-in-variables assumption, it can be shown that the OLS estimator is inconsistent and (asymptotically) biased toward zero, as in the first plim expression above. The term multiplying $\beta_1$ is called the attenuation bias (it is always $<1$). When $K>1$ (and $x_1^{*}$ is the only mismeasured variable), the attenuation bias takes the second form above, where $r_1^{*}$ is the population error from the regression of $x_1^{*}$ on all the other explanatory variables.

Page 14: Heteroskedasticity and Errors in variables

. *=========== ERRORS IN VARIABLES =========================
. *Case A: Error on logY
. g error=invnorm(uniform())            /* Measurement error */
. g logYX=logY+.2*error                 /* logY with error */
. dotplot logY logYX , ny(25) saving($pathc\e7,replace)
. gr logY logYX logY,xlab ylab s(op) saving($pathc\e8,replace)
. reg logY edu exp exp2                 /* Model w/o error */

      Source |       SS       df       MS              Number of obs =    2140
-------------+------------------------------           F(  3,  2136) =  220.21
       Model |  62.7069133     3  20.9023044           Prob > F      =  0.0000
    Residual |  202.746788  2136  .094918908           R-squared     =  0.2362
-------------+------------------------------           Adj R-squared =  0.2352
       Total |  265.453702  2139  .124101777           Root MSE      =  .30809

------------------------------------------------------------------------------
        logY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0675944   .0029829    22.66   0.000     .0617447    .0734441
         exp |   .0111987   .0028507     3.93   0.000     .0056083    .0167891
        exp2 |  -.0004682    .000069    -6.78   0.000    -.0006035   -.0003328
       _cons |   7.636382   .0448211   170.37   0.000     7.548485     7.72428
------------------------------------------------------------------------------

. reg logYX edu exp exp2                /* Model with error in logY */
. *Note that the edu coefficient is essentially unchanged; only the std. errors and R2 change

      Source |       SS       df       MS              Number of obs =    2140
-------------+------------------------------           F(  3,  2136) =  170.44
       Model |  70.1650238     3  23.3883413           Prob > F      =  0.0000
    Residual |  293.116262  2136  .137226715           R-squared     =  0.1931
-------------+------------------------------           Adj R-squared =  0.1920
       Total |  363.281286  2139  .169836973           Root MSE      =  .37044

------------------------------------------------------------------------------
       logYX |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0703127   .0035866    19.60   0.000     .0632792    .0773463
         exp |   .0115047   .0034276     3.36   0.001     .0047828    .0182265
        exp2 |  -.0005005    .000083    -6.03   0.000    -.0006632   -.0003378
       _cons |   7.613564   .0538921   141.27   0.000     7.507877     7.71925
------------------------------------------------------------------------------
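This matches the theory for measurement error in the dependent variable: roughly, $\operatorname{Var}(v) = \sigma_u^2 + \sigma_{e_0}^2 \approx .095 + (.2)^2 \approx .135$, close to the residual MS of .1372 in the logYX regression, while the coefficients are left essentially unchanged; only the standard errors grow and R2 falls.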

Page 15: Heteroskedasticity and Errors in variables

. *Case B: Stochastic error in edu

. g eduX=edu+2*error                    /* Education years with error */
. dotplot edu eduX , ny(25) saving($pathc\e9,replace)
. gr edu eduX edu,xlab(0,9,13,18) ylab(0,9,13,18) s(op) saving($pathc\e10,replace)
. reg logY edu exp exp2

    Residual |  202.746788  2136  .094918908           R-squared     =  0.2362
------------------------------------------------------------------------------
        logY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0675944   .0029829    22.66   0.000     .0617447    .0734441
         ...
------------------------------------------------------------------------------

. reg logY eduX exp exp2                /* See that the edu coefficient is smaller */

      Source |       SS       df       MS              Number of obs =    2140
-------------+------------------------------           F(  3,  2136) =  140.63
       Model |  43.7823947     3  14.5941316           Prob > F      =  0.0000
    Residual |  221.671307  2136  .103778702           R-squared     =  0.1649
-------------+------------------------------           Adj R-squared =  0.1638
       Total |  265.453702  2139  .124101777           Root MSE      =  .32215

------------------------------------------------------------------------------
        logY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        eduX |   .0388487   .0022919    16.95   0.000      .034354    .0433434
         exp |   .0104975   .0029804     3.52   0.000     .0046528    .0163422
        exp2 |  -.0004398   .0000721    -6.10   0.000    -.0005813   -.0002983
       _cons |   7.979665   .0394515   202.27   0.000     7.902298    8.057033
------------------------------------------------------------------------------

. corr eduX error,cov                   /* Bias ~ COV(eduX,error)/VAR(eduX) */

             |     eduX    error
-------------+------------------
        eduX |  9.24254
       error |  2.05978  .996781

. gr using $pathc\e7 $pathc\e8 $pathc\e9 $pathc\e10
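The drop in the edu coefficient is the attenuation bias derived above: with $\operatorname{Var}(edu) \approx 5$ (from the covariance matrix c), $\operatorname{Var}(e_1) = \operatorname{Var}(2\cdot error) \approx 4$, and edu nearly uncorrelated with exp and exp2,

$$\operatorname{plim}\hat\beta_{edu} \approx .07\cdot\frac{5}{5+4} \approx .039,$$

which is essentially the estimated coefficient of .0388 on eduX.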

Page 16: Heteroskedasticity and Errors in variables

. *Case C: Systematic error =f(edu)
. g eduQ=.8*edu                         /* Education years with error */
. gr edu eduQ edu,xlab(0,9,13,18) ylab(0,9,13,18) saving($pathc\e11,replace)
. dotplot edu eduQ , ny(25) saving($pathc\e12,replace)
. reg logY edu exp exp2                 /* Pure regression */

      Source |       SS       df       MS              Number of obs =    2140
    Residual |  202.746788  2136  .094918908           R-squared     =  0.2362

        logY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0675944   .0029829    22.66   0.000     .0617447    .0734441
         ...

. reg logY eduQ exp exp2                /* See that the edu coefficient is larger */

      Source |       SS       df       MS              Number of obs =    2140
-------------+------------------------------           F(  3,  2136) =  220.21
       Model |   62.706915     3   20.902305           Prob > F      =  0.0000
    Residual |  202.746787  2136  .094918908           R-squared     =  0.2362
-------------+------------------------------           Adj R-squared =  0.2352
       Total |  265.453702  2139  .124101777           Root MSE      =  .30809

------------------------------------------------------------------------------
        logY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        eduQ |   .0844929   .0037286    22.66   0.000     .0771808    .0918051
         exp |   .0111987   .0028507     3.93   0.000     .0056083    .0167891
        exp2 |  -.0004682    .000069    -6.78   0.000    -.0006035   -.0003328
       _cons |   7.636382   .0448211   170.37   0.000     7.548485     7.72428
------------------------------------------------------------------------------
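Nothing is lost in this case: rescaling edu by the constant factor .8 simply rescales its coefficient by 1/.8 (.0675944/.8 = .0844929) and leaves the fit unchanged (same R2, F, and t statistics). A systematic, proportional measurement error changes only the scale in which the coefficient is interpreted.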

Page 17: Heteroskedasticity and Errors in variables

. gr using $pathc\e11 $pathc\e12

[Figure: two panels -- scatter of edu and eduQ against edu (axes labelled 0, 9, 13, 18) and dotplot of edu and eduQ (range 3.27135 to 19.6281)]

. locpoly logY eduX,plot(scatter logY edu)

[Figure: scatter of logY against edu overlaid with the degree-0 local polynomial smooth of logY on eduX, eduX from 0 to 20]