POLSCI 702: Non-Normality and Heteroskedasticity

Dave Armstrong

University of Wisconsin – Milwaukee, Department of Political Science

e: [email protected]
w: www.quantoid.net/UWM702.html

Goals of this Lecture

• Discuss methods for detecting non-normality, non-constant error variance, and nonlinearity
• Each of these reflects a problem with the specification of the model
• Discuss various ways that transformations can be used to remedy these problems

Non-Normal Errors
    Assessing Non-normality

Non-constant Error Variance
    Assessing Non-constant Error Variance
    Testing for Non-constant Error Variance
    Fixing Non-constant Error Variance: With a Model
    Fixing Non-constant Error Variance: Without a Model

Non-normally distributed errors

• The least-squares fit is based on the conditional mean
• The mean is not a good measure of center for either a highly skewed distribution or a multi-modal distribution
• Non-normality does not produce bias in the coefficient estimates, but it does have two important consequences:
    • It poses problems for efficiency, i.e., the OLS standard errors are no longer the smallest; weighted least squares (WLS) is more efficient
    • Standard errors can be biased, i.e., confidence intervals and significance tests may lead to wrong conclusions; robust standard errors can compensate for this problem
• Transformations can often remedy the heavy-tailed problem
• Re-specification of the model, i.e., including a missing discrete predictor, can sometimes fix a multi-modal problem


Distribution of the Residuals: Example - Inequality Data

• Quantile comparison plots and density estimates of the residuals from a model are useful for assessing normality
• The density estimate of the studentized residuals clearly shows a positive skew, and the possibility of a grouping of cases to the right

[Figure: density estimate of rstudent(mod1) with a normal reference curve; x-axis: rstudent(mod1), y-axis: probability density function]

> library(sm)
> Weakliem <- read.table("http://www.quantoid.net/702/Weakliem.txt")
> mod1 <- lm(secpay ~ gini + democrat, data=Weakliem)
> sm.density(rstudent(mod1), model="normal")

Assessing Unusual Cases

• A quantile comparison plot can give us a sense of which observations depart from normality.
• We can see that the points with the biggest departure are the Czech Republic and Slovakia.

[Figure: quantile comparison plot of the studentized residuals from mod1; x-axis: t Quantiles, y-axis: Studentized Residuals(mod1)]

> library(car)
> qqPlot(mod1)

Studentized Residuals after Removing the Czech Republic and Slovakia

[Figure, two panels: (1) quantile comparison plot of the studentized residuals from mod2 against t quantiles; (2) density estimate of rstudent(mod2) with a normal reference curve]

> Weakliem <- Weakliem[-c(25,49), ]
> mod2 <- lm(secpay ~ gini*democrat, data=Weakliem)
> qqPlot(mod2, simulate=T, labels=FALSE)
> sm.density(rstudent(mod2), model="normal")


Non-constant Error Variance

• Also called Heteroskedasticity

• An important assumption of the least-squares regression model is that the variance of the errors around the regression surface is everywhere the same: $V(\varepsilon) = V(Y \mid x_1, \ldots, x_k) = \sigma^2$
• Non-constant error variance does not cause biased estimates, but it does pose problems for efficiency, and the usual formulas for standard errors are inaccurate
• OLS estimates are inefficient because they give equal weight to all observations, regardless of the fact that those with large residuals contain less information about the regression
• Two types of non-constant error variance are relatively common:
    • Error variance increases as the expectation of Y increases;
    • There is a systematic relationship between the errors and one of the X's


Assessing Non-constant Error Variance

• Direct examination of the data is usually not helpful in assessing non-constant error variance, especially if there are many predictors. Instead, we look to the residuals to uncover the distribution of the errors.
• It is also not helpful to plot $Y$ against the residuals $E$, because there is a built-in correlation between $Y$ and $E$: $Y = \hat{Y} + E$
• The least-squares fit ensures that the correlation between $\hat{Y}$ and $E$ is 0, so a plot of these (a residual plot) can help us uncover non-constant error variance.
• The pattern of changing spread is often more easily seen by plotting the studentized residuals $E_i^{*2}$ against $\hat{Y}$
• If the values of $Y$ are all positive, we can use a spread-level plot:
    • plot $\log |E_i^*|$ (called the log spread) against $\log \hat{Y}$ (called the log level)
    • the slope $b$ of the regression line fit to this plot suggests the variance-stabilizing transformation $Y^{(p)}$, with $p = 1 - b$ (see the sketch below)
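In R, car's spreadLevelPlot() draws the spread-level plot for a fitted lm model and prints a suggested power transformation; the two lines after it are a minimal hand sketch of the same calculation.

> library(car)
> spreadLevelPlot(mod1)   # prints the suggested power transformation
> # The same slope by hand: regress log spread on log level
> b <- coef(lm(log(abs(rstudent(mod1))) ~ log(fitted(mod1))))[2]
> 1 - b                   # suggested variance-stabilizing power p = 1 - b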

Assessing Heteroskedasticity: Example - Inequality Data (1)

• Two things are obvious:
    1. The residuals have a recognizable pattern, suggesting the model is missing something systematic
    2. There are two outlying cases (the Czech Republic and Slovakia)
• We next take out the outliers and fit the model including an interaction between democracy and gini.

[Figure: studentized residuals versus fitted values for mod1; x-axis: fitted.values(mod1), y-axis: rstudent(mod1)]

> plot(fitted.values(mod1), rstudent(mod1),
+ main="Studentized Residuals versus Fitted Values")


Assessing Heteroskedasticity: Example - Inequality Data (2)

> mod2 <- lm(secpay ~ gini*democrat, data=Weakliem)
> plot(fitted.values(mod2), rstudent(mod2),
+ main="Studentized Residuals versus Fitted Values")
> abline(h=0, lty=2)

Assessing Heteroskedasticity: Example - Inequality Data (3)

• The non-constant error variance is not as obviously problematic as it was before, but as we will see below, recent research offers assistance on testing for and remediating heteroskedasticity.

[Figure: studentized residuals versus fitted values for mod2; x-axis: fitted.values(mod2), y-axis: rstudent(mod2)]

Assessing Heteroskedasticity: Example - Inequality Data (4)

[Figure: studentized residuals vs fitted values for mod2; x-axis: fitted.values(mod2), y-axis: rstudent(mod2)]

• In the residual plot, we see the familiar "fanning" out of the data, i.e., the variance of the residuals is increasing as the fitted values get larger

> plot(fitted.values(mod2), rstudent(mod2),
+ main="Studentized Residuals vs Fitted Values")
> abline(h=0, lty=2)


Testing for Non-Constant Error Variance (1)

• Assume that a discrete X (or combination of X's) partitions the data into m groups.
• Let $Y_{ij}$ denote the ith of the $n_j$ outcome-variable scores in group j
• Within-group sample variances are then calculated as follows (a quick sketch in R follows below):

$$S_j^2 = \frac{\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2}{n_j - 1}$$

• We could then compare these within-group sample variances to see if they differ
• If the distribution of the errors is non-normal, however, tests that examine $S_j^2$ directly are not valid because the mean is not a good summary of the data
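As a quick sketch of these quantities for the inequality data, using democrat as the grouping variable (my choice of grouping, for illustration):

> # Within-group sample variances S_j^2 and group sizes n_j of the outcome:
> with(Weakliem, tapply(secpay, democrat, var))
> with(Weakliem, tapply(secpay, democrat, length))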

Testing for Non-Constant Error Variance (2): Score Test

• A score test for the null hypothesis that all of the error variances $\sigma_i^2$ are the same provides a better alternative

1. We start by calculating the standardized squared residuals

$$U_i = \frac{E_i^2}{\hat{\sigma}^2} = \frac{E_i^2}{\sum E_i^2 / n}$$

2. Regress the $U_i$ on all of the explanatory variables X, finding the fitted values:

$$U_i = \eta_0 + \eta_1 X_{i1} + \cdots + \eta_p X_{ip} + \omega_i$$

3. The score test statistic, which is distributed as $\chi^2$ with p degrees of freedom, is:

$$S_0^2 = \frac{\sum (\hat{U}_i - \bar{U})^2}{2}$$
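The statistic is simple to compute by hand. This sketch runs the auxiliary regression on the fitted values (one degree of freedom), which is what the ncvTest() call on the next slide does by default:

> e <- residuals(mod2)
> U <- e^2 / (sum(e^2) / length(e))          # standardized squared residuals
> aux <- lm(U ~ fitted(mod2))                # auxiliary regression
> S2 <- sum((fitted(aux) - mean(U))^2) / 2   # score statistic, chi-squared, 1 df
> pchisq(S2, df = 1, lower.tail = FALSE)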

R script: testing for non-constant error variance

• The ncvTest function in the car library provides a simple way to carry out the score test
• The results below show that the non-constant error variance is statistically significant

> ncvTest(mod2, data=Weakliem)

Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 7.926311 Df = 1 p = 0.004872103

> ncvTest(mod2, var.formula=~gini*democrat, data=Weakliem[-c(25,49),])

Non-constant Variance Score Test
Variance formula: ~ gini * democrat
Chisquare = 10.69581 Df = 3 p = 0.01348979


Weighted least squares (1)

• If the error variances are proportional to a particular X (i.e., the error variances are known up to a constant of proportionality $\sigma^2_\varepsilon$, so that $V(\varepsilon_i) = \sigma^2_\varepsilon / w_i$), weighted least squares provides a good alternative to OLS
• WLS minimizes the weighted sum of squares $\sum w_i E_i^2$, giving greater weight to observations with smaller variance
• The WLS maximum likelihood estimators are defined as:

$$\hat{\beta} = (X'WX)^{-1} X'Wy$$
$$\hat{\sigma}^2_\varepsilon = \frac{\sum w_i E_i^2}{n}$$

• The estimated asymptotic covariance matrix for the estimators is:

$$\hat{V}(\hat{\beta}) = \hat{\sigma}^2_\varepsilon \left( X'WX \right)^{-1}$$

• Here W is a square diagonal matrix with the individual weights $w_i$ on the diagonal and zeros elsewhere (a matrix-algebra sketch follows below)
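As a check on the algebra, the closed form can be computed directly with matrix operations; a minimal sketch using gini as the weight (anticipating the example that follows):

> X <- model.matrix(mod2)
> y <- model.frame(mod2)$secpay
> w <- Weakliem$gini
> b.wls <- solve(crossprod(X, w * X), crossprod(X, w * y))   # (X'WX)^{-1} X'Wy
> # Matches lm() with a weights argument:
> cbind(b.wls, coef(lm(secpay ~ gini * democrat, data = Weakliem, weights = gini)))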

Weighted Least Squares Example: Inequality Data

• The "fanning" pattern in the residual plot indicates that the error variance changes systematically with the fitted values.
• We could then proceed to estimate a WLS model using the weight gini.

WLS Example: Inequality Data (2)

In R, we simply add a weight argument to the lm function:

> mod.wls <- update(mod2, weight=gini, data=Weakliem)
> summary(mod.wls)

Call:
lm(formula = secpay ~ gini * democrat, data = Weakliem, weights = gini)

Weighted Residuals:
     Min       1Q   Median       3Q      Max
-0.86399 -0.26798 -0.01882  0.27793  0.99575

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.926664   0.057076  16.236  < 2e-16 ***
gini           0.005343   0.001365   3.915 0.000318 ***
democrat       0.499980   0.087891   5.689 1.04e-06 ***
gini:democrat -0.011182   0.002347  -4.764 2.18e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpreting WLS Results

• It is important to remember that the parameters in the WLS model are estimators of $\beta$, just like the OLS parameters are (they are just more efficient in the presence of heteroskedasticity).
• Thus interpretation takes the same form as it does with the OLS parameters.
• The $R^2$ is less interesting here because we are explaining variance in the weighted outcome $\sqrt{w_i} \times$ secpay, rather than in secpay itself.


Generalized Least Squares (1)

• Sometimes, we do not know the relationship between $x_i$ and $\mathrm{var}(u_i \mid x_i)$.
• In this case, we can use a Feasible GLS (FGLS) model.
• FGLS estimates the weight from the data. That weight is then used in a WLS fashion.

GLS: Steps

1. Regress y on $x_i$ and obtain the residuals $\hat{u}_i$.
2. Create $\log(\hat{u}_i^2)$ by squaring and then taking the natural log of the OLS residuals from step 1.
3. Run a regression of $\log(\hat{u}_i^2)$ on $x_i$ and obtain the fitted values $\hat{g}_i$.
4. Generate $\hat{h}_i = \exp(\hat{g}_i)$.
5. Estimate the WLS of y on $x_i$ with weights $1/\hat{h}_i$.

FGLS Example: Inequality Data

> mod1.ols <- lm(secpay ~ gini*democrat, data=Weakliem)
> aux.mod1 <- update(mod1.ols, log(resid(mod1.ols)^2) ~ .)
> h <- exp(predict(aux.mod1))
> mod.fgls <- update(mod1.ols, weight=1/h)
> with(summary(mod.fgls), printCoefmat(coefficients))

               Estimate Std. Error t value  Pr(>|t|)
(Intercept)    0.9619416  0.0476415 20.1913 < 2.2e-16 ***
gini           0.0044149  0.0013294  3.3210  0.001836 **
democrat       0.4662928  0.0806427  5.7822 7.577e-07 ***
gini:democrat -0.0102999  0.0022211 -4.6374 3.288e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table: Comparing Models

                   OLS      WLS     FGLS
(Intercept)      0.941    0.927    0.962
                (0.060)  (0.057)  (0.048)
gini             0.005    0.005    0.004
                (0.002)  (0.001)  (0.001)
democrat         0.486    0.500    0.466
                (0.088)  (0.088)  (0.081)
gini:democrat   -0.011   -0.011   -0.010
                (0.002)  (0.002)  (0.002)


Robust Standard Errors (1)

• Robust standard errors can be calculated to compensate for an unknown pattern of non-constant error variance
• Robust standard errors require fewer assumptions about the model than WLS (which is better if the error variance increases with the level of Y)
• Robust standard errors do not change the OLS coefficient estimates or solve the inefficiency problem, but they do give more accurate p-values
• There are several methods for calculating heteroskedasticity-consistent standard errors (known variously as White, Eicker, or Huber standard errors), but most are variants of the method originally proposed by White (1980)

Robust Standard Errors (2): White’s Standard Errors

• The covariance matrix of the OLS estimator is:

$$V(b) = (X'X)^{-1} X' \Sigma X (X'X)^{-1} = (X'X)^{-1} X' V(y) X (X'X)^{-1}$$

• If the assumptions of normality and homoskedasticity are satisfied, $V(y) = \sigma^2_\varepsilon I_n$, and the variance simplifies to:

$$V(b) = \sigma^2_\varepsilon (X'X)^{-1}$$

• In the presence of non-constant error variance, however, $V(y)$ no longer equals $\sigma^2_\varepsilon I_n$: it has unequal variances (and, more generally, possibly nonzero covariances)
• In these cases, White suggests a consistent estimator of the variance that constrains $\hat{\Sigma}$ to a diagonal matrix containing only squared residuals

Robust Standard Errors (3): White’s Standard Errors

• The heteroskedasticity-consistent covariance matrix (HCCM) estimator is then:

$$\hat{V}(b) = (X'X)^{-1} X' \hat{\Phi} X (X'X)^{-1}$$

where $\hat{\Phi} = \mathrm{diag}(e_i^2)$ and the $e_i$ are the OLS residuals
• This is what is known as HC0: White's (1980) original recipe.
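A minimal sketch of HC0 computed by hand, checked against vcovHC() from the sandwich package (a standard implementation of these estimators):

> library(sandwich)
> X <- model.matrix(mod2)
> e <- residuals(mod2)
> bread <- solve(crossprod(X))      # (X'X)^{-1}
> meat <- crossprod(X * e)          # X' diag(e^2) X
> V.hc0 <- bread %*% meat %*% bread
> all.equal(V.hc0, vcovHC(mod2, type = "HC0"), check.attributes = FALSE)
> sqrt(diag(V.hc0))                 # HC0 robust standard errors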

Page 9: POLSCI 702 Non-Normality and Heteroskedasticity

Hat Values

Other HCCMs use the "hat values", which are the diagonal elements of $H = X(X'X)^{-1}X'$

• These give a sense of how far each observation is from the mean of the X's.
• Below is a figure that shows two hypothetical X variables; the plotting symbols are proportional in size to the hat values

[Figure: scatterplot of two hypothetical X variables (X[, 2] against X[, 3]), with point size proportional to the hat value]
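In R, the hat values for a fitted model come directly from hatvalues(); a quick sketch:

> h <- hatvalues(mod2)              # diagonal of X(X'X)^{-1}X'
> head(sort(h, decreasing = TRUE))  # observations farthest from the center of the X's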

Other HCCM’s

MacKinnon and White (1985) considered three alternatives: HC1, HC2and HC3, each of which o↵ers a di↵erent method for finding �.

• HC1: N

N�K

⇥ HC0.• HC2: � = diag

e

2i

1�h

ii

where h

ii

= x

i

(X

0X)�1

x

0i

• HC3: � = diag

e

2i

(1�h

ii

)2

Long and Ervin (2000) find that in small samples (e.g., < 500) the HC3errors are the“best” in terms of size and power.

• They suggest using HC3 all of the time, as they do not do poorly inthe presence of homoskedasticity and outperform all other options inthe presence of heteroskedasticity.

34 / 47
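Each of these variants is available through the type argument of vcovHC() in the sandwich package, and coeftest() from the lmtest package runs the usual coefficient tests with a chosen covariance matrix. A brief sketch:

> library(sandwich)
> library(lmtest)
> # HC0-HC3 robust standard errors side by side:
> sapply(c("HC0", "HC1", "HC2", "HC3"),
+   function(t) sqrt(diag(vcovHC(mod2, type = t))))
> # Tests using HC3, the Long and Ervin recommendation:
> coeftest(mod2, vcov = vcovHC(mod2, type = "HC3"))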

Small Sample Properties of Screening Tests

• Long and Ervin (2000) performed a Monte Carlo study of screening tests for heteroskedasticity.
• They find that in small samples (i.e., N < 250), the standard tests (as discussed above) have very little power.
• With small samples, if there is any reason to suspect heteroskedasticity may be a problem, use HC3 robust SEs

HC4 Standard Errors

• HC3 standard errors are shown to outperform the alternatives in small samples
• HC3 standard errors can still fail to generate the appropriate Type I error rate when outliers are present.
• HC4 standard errors can produce the appropriate test statistics even in the presence of outliers:

$$\hat{\Phi} = \mathrm{diag}\left[ \frac{e_i^2}{(1 - h_{ii})^{\delta_i}} \right]$$

• $\delta_i = \min\left\{ 4, \frac{n h_{ii}}{p} \right\}$, with n = number of observations and p = number of parameters in the model
• HC4 outperforms HC3 in the presence of influential observations, but not in other situations.


HC4m Standard Errors

• HC4 standard errors are not universally better than the others and, as Cribari-Neto and da Silva (2011) show, HC4 SEs have relatively poor performance when there are many regressors and when the maximal leverage point is extreme.
• Cribari-Neto and da Silva propose a modified HC4 estimator, called HC4m, where, as above,

$$\hat{\Phi} = \mathrm{diag}\left[ \frac{e_i^2}{(1 - h_{ii})^{\delta_i}} \right]$$

• and here, $\delta_i = \min\left\{ \gamma_1, \frac{n h_{ii}}{p} \right\} + \min\left\{ \gamma_2, \frac{n h_{ii}}{p} \right\}$
• They find that the best values of the $\gamma$ parameters are $\gamma_1 = 1$ and $\gamma_2 = 1.5$.

HC5

• HC5 standard errors are also meant to provide different discounting than the HC4 and HC4m estimators. The HC5 standard errors are operationalized as:

$$\hat{\Phi} = \mathrm{diag}\left[ \frac{e_i^2}{(1 - h_{ii})^{\delta_i}} \right]$$

• and here, $\delta_i = \min\left\{ \frac{n h_{ii}}{p}, \max\left\{ 4, \frac{n k h_{\max}}{p} \right\} \right\}$, with k = 0.7
• For observations with bigger hat values, the residuals get increased in size, thus (generally) increasing the standard error.
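The next three figures plot these discounts for the inequality model. A sketch of how the $\delta_i$ can be computed from the formulas above (a hand computation; vcovHC() in recent versions of sandwich also offers HC4, HC4m, and HC5 directly):

> h <- hatvalues(mod2)
> n <- length(h)
> p <- length(coef(mod2))
> d.hc4  <- pmin(4, n * h / p)
> d.hc4m <- pmin(1, n * h / p) + pmin(1.5, n * h / p)
> d.hc5  <- pmin(n * h / p, pmax(4, 0.7 * n * max(h) / p))
> head(cbind(HC4 = d.hc4, HC4m = d.hc4m, HC5 = d.hc5))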

Deltas for the Inequality Model

[Figure: $\delta_i$ as a function of the hat value h for HC4, HC4m, and HC5]

Denominator of HCC for Inequality Data

[Figure: $(1 - h_i)^{\delta_i}$, the HCCM denominator, as a function of the hat value h for HC4, HC4m, and HC5]

Discounts for Inequality Data

[Figure: discounted squared residuals $e_i^2 / (1 - h_i)^{\delta_i}$ plotted against the residuals e for HC4, HC4m, and HC5]

Function for lm output that includes Robust Standard Errors

• I modified the summary.lm() function to allow an argument for the robust standard error type.
• This just makes a call to vcovHC() from the sandwich package.

> source("http://www.quantoid.net/reg3/summary.lm.robust.R")

Summary of the OLS Model

               Estimate Std. Error t value  Pr(>|t|)
(Intercept)    0.9407661  0.0596787 15.7638 < 2.2e-16 ***
gini           0.0049952  0.0015163  3.2945   0.00198 **
democrat       0.4860714  0.0881817  5.5122 1.864e-06 ***
gini:democrat -0.0108399  0.0024760 -4.3780 7.528e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Output for summary.lm.robust

> summary.lm.robust(mod2, typeHC="HC4m")

Call:

lm(formula = secpay ~ gini * democrat, data = Weakliem)

Residuals:
      Min        1Q    Median        3Q       Max
-0.179825 -0.046637 -0.004133  0.047674  0.155604

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.940766   0.058297  16.137  < 2e-16 ***
gini           0.004995   0.001747   2.859  0.00653 **
democrat       0.486071   0.106380   4.569 4.09e-05 ***
gini:democrat -0.010840   0.002998  -3.616  0.00078 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Comparison

[Figure: t-statistics for each coefficient of mod2 ((Intercept), gini, democrat, gini:democrat) computed under the HC0, HC1, HC2, HC3, HC4, HC4m, and HC5 covariance estimators]
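A sketch of how such a comparison can be assembled (my code, not the author's; the HC4m and HC5 types require a recent version of sandwich):

> library(sandwich)
> types <- c("HC0", "HC1", "HC2", "HC3", "HC4", "HC4m", "HC5")
> t.stats <- sapply(types,
+   function(t) coef(mod2) / sqrt(diag(vcovHC(mod2, type = t))))
> round(t.stats, 2)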

Robust Standard Errors (4)

• Since the HCCM is found without a formal model of the heteroskedasticity, relying instead only on the regressors and the residuals from the OLS fit for its computation, it can be easily adapted to many applications
• For example, robust standard errors can be used to improve statistical inference from clustered data, pooled time-series data with autocorrelated errors (e.g., Newey-West SEs), and panel data

Cautions:
• Robust standard errors should not be seen as a substitute for careful model specification. In particular, if the pattern of heteroskedasticity is known, it can often be corrected more effectively, and the model estimated more efficiently, using WLS

Conclusions

• Heteroskedasticity, while not bias-inducing, can cause problems with the efficiency of the model.
• Tests of heteroskedasticity only have sufficient power when n is large (e.g., > 250).
• If the errors are found to be heteroskedastic, there are a number of potential fixes that all require various assumptions:
    1. If the heteroskedasticity is proportional to a single variable, weighted least squares can be used.
    2. A model of the variance can be obtained through FGLS (if the model of the variance is a "nuisance" model) or heteroskedastic regression (if the model of the variance is substantively interesting).
    3. If heteroskedasticity is thought to exist and no suitable functional form of the variance can be found, then HC3 robust standard errors are the best bet (especially in small samples), or HC4 standard errors in the presence of high-leverage points.