
  • Lecture 23: Transformations and Weighted Least Squares

    Pratheepa Jeganathan

    11/13/2019

  • Recap

    - What is a regression model?
    - Descriptive statistics – graphical
    - Descriptive statistics – numerical
    - Inference about a population mean
    - Difference between two population means
    - Some tips on R
    - Simple linear regression (covariance, correlation, estimation, geometry of least squares)
    - Inference on the simple linear regression model
    - Goodness of fit of regression: analysis of variance
    - F-statistics
    - Residuals
    - Diagnostic plots for simple linear regression (graphical methods)

  • Recap

    - Multiple linear regression
    - Specifying the model
    - Fitting the model: least squares
    - Interpretation of the coefficients
    - Matrix formulation of multiple linear regression
    - Inference for multiple linear regression
      - T-statistics revisited
      - More F-statistics
      - Tests involving more than one β
    - Diagnostics – more on graphical methods and numerical methods
    - Different types of residuals
    - Influence
    - Outlier detection
    - Multiple comparison (Bonferroni correction)
    - Residual plots:
      - partial regression (added variable) plot
      - partial residual (residual plus component) plot

  • Recap

    - Adding qualitative predictors
    - Qualitative variables as predictors to the regression model
    - Adding interactions to the linear regression model
    - Testing for equality of the regression relationship in various subsets of a population
    - ANOVA
      - All qualitative predictors
      - One-way layout
      - Two-way layout

  • Transformation

  • Outline

    - We have been working with linear regression models so far in the course.
    - Some models are nonlinear but can be transformed to a linear model (CH Chapter 6).
    - We will also see that transformations can sometimes stabilize the variance, making constant variance a more reasonable assumption (CH Chapter 6).
    - Finally, we will see how to correct for unequal variance using a technique called weighted least squares (WLS) (CH Chapter 7).

  • Bacterial colony decay (CH Chapter 6.3, Page 167)

    - Here is a simple dataset showing the number of bacteria alive in a colony, $n_t$, as a function of time $t$.
    - A simple linear regression model is clearly not a very good fit.

    bacteria.table = read.table('http://stats191.stanford.edu/data/bacteria.table',
                                header=T)
    head(bacteria.table)

    ##   t N_t
    ## 1 1 355
    ## 2 2 211
    ## 3 3 197
    ## 4 4 166
    ## 5 5 142
    ## 6 6 106

  • Fitting (Bacterial colony decay)

    [Figure: scatter plot of N_t against t (t from 4 to 12 on the horizontal axis, N_t from 0 to 300 on the vertical axis).]
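    The code that produced this figure is not shown in the transcript. A minimal sketch, assuming the simple linear fit is stored as bacteria.lm and the ggplot2 scatter plot as p (both names are used by later slides):

    library(ggplot2)

    # ordinary least squares fit of counts on time (clearly a poor fit here)
    bacteria.lm = lm(N_t ~ t, data=bacteria.table)

    # scatter plot of the raw data; fitted lines are layered on top later
    p = ggplot(bacteria.table, aes(x = t, y = N_t)) +
      geom_point()
    p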

  • Diagnostics (Bacterial colony decay)

    [Figure: standard lm diagnostic plots for the linear fit: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 1, 8, and 15 are flagged.]

  • Exponential decay model

    - Suppose the expected number of cells grows like
      $E(n_t) = n_0 e^{\beta_1 t}, \qquad t = 1, 2, 3, \ldots$
    - If we take logs of both sides,
      $\log E(n_t) = \log n_0 + \beta_1 t.$
    - A reasonable (?) model:
      $\log n_t = \log n_0 + \beta_1 t + \varepsilon_t, \qquad \varepsilon_t \overset{IID}{\sim} N(0, \sigma^2).$

  • bacteria.log.lm = lm(log(N_t) ~ t, bacteria.table)
    df = cbind(bacteria.table,
               lm.fit = fitted(bacteria.lm),
               lm.log.fit = exp(fitted(bacteria.log.lm)))

    p = p +
      geom_abline(intercept = bacteria.lm$coef[1],
                  slope = bacteria.lm$coef[2],
                  col = "red") +
      geom_line(data = df,
                aes(x = t, y = lm.log.fit),
                color = "green")

  • [Figure: scatter plot of N_t against t with the ordinary least squares line (red) and the back-transformed exponential fit exp(fitted(bacteria.log.lm)) (green) overlaid; t from 4 to 12, N_t from 0 to 300.]

  • Diagnostics

    par(mfrow=c(2,2))
    plot(bacteria.log.lm, pch=23, bg='orange')

    [Figure: diagnostic plots for bacteria.log.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 2, 7, and 10 are flagged.]

  • Logarithmic transformation

    - This model is slightly different than the original model:
      $E(\log n_t) \le \log E(n_t),$
      but may be approximately true.
    - If $\varepsilon_t \sim N(0, \sigma^2)$, then
      $n_t = n_0 \cdot \gamma_t \cdot e^{\beta_1 t}.$
    - $\gamma_t = e^{\varepsilon_t}$ is called a log-normal $(0, \sigma^2)$ random variable.

  • Linearizing regression function

    - We see that an exponential growth or decay model can be made (approximately) linear.
    - Here are a few other models that can be linearized (see the sketch after this list):
      - $y = \alpha x^\beta$: use $\tilde{y} = \log(y)$, $\tilde{x} = \log(x)$;
      - $y = \alpha e^{\beta x}$: use $\tilde{y} = \log(y)$;
      - $y = x/(\alpha x - \beta)$: use $\tilde{y} = 1/y$, $\tilde{x} = 1/x$;
      - more in the textbook.

  • Caveats

    - Just because the expected value linearizes doesn't mean that the errors behave correctly.
    - In some cases, this can be corrected using weighted least squares (more later).
    - The constant variance and normality assumptions should still be checked.

  • Stabilizing variance

    - Sometimes a transformation can turn non-constant variance errors into "close to" constant variance errors. This is another situation in which we might consider a transformation.
    - Example: by the "delta rule", if
      $\mathrm{Var}(Y) = \sigma^2 E(Y),$
      then
      $\mathrm{Var}(\sqrt{Y}) \simeq \frac{\sigma^2}{4}.$
    - In practice, we might not know which transformation is best. Box-Cox transformations offer a tool to find a "best" transformation (see the sketch below).
      http://en.wikipedia.org/wiki/Power_transform
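    A quick illustration of Box-Cox in R (not in the original slides): MASS::boxcox profiles the log-likelihood over the power parameter lambda, and a value of lambda near 0 supports the log transformation used above. This sketch assumes the bacteria.lm fit from earlier:

    library(MASS)

    # profile log-likelihood over the Box-Cox power parameter lambda;
    # lambda = 0 corresponds to the log transform, lambda = 1 to no transform
    boxcox(bacteria.lm, lambda = seq(-1, 1, 0.05))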

  • Delta rule

    The following approximation is ubiquitous in statistics.

    - Taylor series expansion:
      $f(Y) = f(E(Y)) + \dot{f}(E(Y)) (Y - E(Y)) + \ldots$
    - Taking the variance of both sides yields:
      $\mathrm{Var}(f(Y)) \simeq \dot{f}(E(Y))^2 \cdot \mathrm{Var}(Y)$

  • Delta rule

    - So, for our previous example:
      $\mathrm{Var}(\sqrt{Y}) \simeq \frac{\mathrm{Var}(Y)}{4 \cdot E(Y)}$
    - Another example:
      $\mathrm{Var}(\log(Y)) \simeq \frac{\mathrm{Var}(Y)}{E(Y)^2}.$
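    A small simulation (not in the original slides) checking the log example; the gamma distribution is just a convenient choice with known mean and variance:

    set.seed(2)

    # Y ~ Gamma(shape, rate): E(Y) = shape/rate, Var(Y) = shape/rate^2
    shape = 50; rate = 2
    Y = rgamma(100000, shape = shape, rate = rate)

    var(log(Y))                           # empirical variance of log(Y)
    (shape / rate^2) / (shape / rate)^2   # delta-rule approximation Var(Y) / E(Y)^2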

  • Caveats

    - Just because a transformation makes the variance constant doesn't mean the regression function is still linear (or even that it was linear)!
    - The models are approximations, and once a model is selected, our standard diagnostics should be used to assess the adequacy of fit.
    - It is possible to have non-constant variance, but the variance-stabilizing transformation may destroy linearity of the regression function.
      - Solution: try weighted least squares (WLS).

  • Weighted Least Squares (CH Chapter 7)

  • Correcting for unequal variance: weighted least squares

    - We will now see an example in which there seems to be strong evidence for variance that changes based on Region.
    - After observing this, we will create a new model that attempts to correct for this and come up with better estimates.
    - Correcting for unequal variance, as we describe it here, generally requires a model for how the variance depends on observable quantities.

  • Correcting for unequal variance: weighted least squares (CH Chapter 7.4, Page 197)

    Variable   Description
    Y          Per capita education expenditure by state
    X1         Per capita income in 1973 by state
    X2         Proportion of population under 18
    X3         Proportion in urban areas
    Region     Which region of the country the states are located in

  • lm

    education.table = read.table('http://stats191.stanford.edu/data/education1975.table',
                                 header=T)
    education.table$Region = factor(education.table$Region)
    education.lm = lm(Y ~ X1 + X2 + X3, data=education.table)

  • lm

    summary(education.lm)

  • Diagnostics

    par(mfrow=c(2,2))
    plot(education.lm)

    [Figure: diagnostic plots for education.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 49, 10, and 7 are flagged; observation 49 also has high leverage.]

  • Diagnostics

    - There is an outlier; let's drop the outlier and refit.

    boxplot(rstandard(education.lm) ~ education.table$Region,
            col=c('red', 'green', 'blue', 'yellow'))

    [Figure: boxplots of rstandard(education.lm) by Region (1 to 4); the standardized residuals range from about -2 to 3, with clearly different spread across regions.]

  • Fit a model without the outlier

    keep.subset = (education.table$STATE != 'AK')
    education.noAK.lm = lm(Y ~ X1 + X2 + X3,
                           subset=keep.subset,
                           data=education.table)

  • Fit a model without the outlier

    summary(education.noAK.lm)

  • par(mfrow=c(2,2))
    plot(education.noAK.lm)

    [Figure: diagnostic plots for education.noAK.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. Observations 10, 15, and 7 are flagged.]

  • Diagnostics (refitted model)

    par(mfrow=c(1,1))
    boxplot(rstandard(education.noAK.lm) ~ education.table$Region[keep.subset],
            col=c('red', 'green', 'blue', 'yellow'))

    [Figure: boxplots of rstandard(education.noAK.lm) by Region (1 to 4), with Alaska removed; the spread of the standardized residuals still differs noticeably across regions.]

  • Re-weighting observations

    - If you have a reasonable guess of the variance as a function of the predictors, you can use this to re-weight the data.
    - Hypothetical example:
      $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2 X_i^2).$
    - Setting $\tilde{Y}_i = Y_i / X_i$, $\tilde{X}_i = 1/X_i$, the model becomes
      $\tilde{Y}_i = \beta_0 \tilde{X}_i + \beta_1 + \gamma_i, \qquad \gamma_i \sim N(0, \sigma^2).$

  • Weighted Least Squares

    - Fitting this model is equivalent to minimizing
      $\sum_{i=1}^n \frac{1}{X_i^2} (Y_i - \beta_0 - \beta_1 X_i)^2$
      (see the R sketch after this list).
    - Weighted least squares:
      $SSE(\beta, w) = \sum_{i=1}^n w_i (Y_i - \beta_0 - \beta_1 X_i)^2, \qquad w_i = \frac{1}{X_i^2}.$
    - In general, the weights should be
      $w_i = \frac{1}{\mathrm{Var}(\varepsilon_i)}.$
    - Our education expenditure example assumes
      $w_i = W_{\mathrm{Region}[i]}.$
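    As a check (not in the original slides), here is a minimal R sketch on simulated data showing that lm() with weights = 1/X^2 reproduces the coefficients of OLS on the transformed variables; all data and object names are hypothetical:

    set.seed(3)

    # simulate Y_i = beta0 + beta1 * X_i + eps_i with Var(eps_i) = sigma^2 * X_i^2
    n = 100
    X = runif(n, 1, 5)
    Y = 3 + 2 * X + rnorm(n, sd = 0.5 * X)

    # weighted least squares with w_i = 1 / X_i^2
    wls.fit = lm(Y ~ X, weights = 1/X^2)

    # OLS on the transformed model: Y/X = beta0 * (1/X) + beta1 + gamma
    ols.transformed = lm(I(Y/X) ~ I(1/X))

    coef(wls.fit)         # (Intercept) estimates beta0, X estimates beta1
    coef(ols.transformed) # (Intercept) estimates beta1, I(1/X) estimates beta0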

  • Common weighting schemes

    - If you have a qualitative variable, then it is easy to estimate the weight within groups (our example today).
    - "Often"
      $\mathrm{Var}(\varepsilon_i) = \mathrm{Var}(Y_i) = V(E(Y_i))$
    - Many non-Gaussian (non-normal) models behave like this: logistic and Poisson regression.

  • What if we didn’t re-weight?

    - Our (ordinary) least squares estimator with design matrix $X$ is
      $\hat{\beta} = \hat{\beta}_{OLS} = (X^T X)^{-1} X^T Y = \beta + (X^T X)^{-1} X^T \varepsilon.$
    - Our model says that $\varepsilon \mid X \sim N(0, \sigma^2 V)$, so
      $E[(X^T X)^{-1} X^T \varepsilon] = E\left[ E[(X^T X)^{-1} X^T \varepsilon \mid X] \right] = 0.$
      So the OLS estimator is unbiased.
    - The variance of $\hat{\beta}_{OLS}$ is
      $\mathrm{Var}((X^T X)^{-1} X^T \varepsilon) = \sigma^2 (X^T X)^{-1} X^T V X (X^T X)^{-1},$
      where $V = \mathrm{diag}(X_1^2, \ldots, X_n^2)$ (a sandwich-standard-error sketch in R follows below).
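    For reference (not in the original slides), heteroscedasticity-consistent "sandwich" standard errors of this form can be computed with the sandwich and lmtest packages; a minimal sketch using the education fit from earlier:

    library(sandwich)
    library(lmtest)

    # sandwich (heteroscedasticity-consistent) covariance estimate for the OLS fit
    vcovHC(education.noAK.lm, type = "HC0")

    # t-tests using sandwich standard errors instead of the usual OLS ones
    coeftest(education.noAK.lm, vcov = vcovHC(education.noAK.lm, type = "HC0"))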

  • Two-stage procedure

    - Suppose we have a hypothesis about the weights, i.e. they are constant within Region, or they are something like
      $w_i^{-1} = \mathrm{Var}(\varepsilon_i) = \alpha_0 + \alpha_1 X_{i1}^2.$
    - We pre-whiten (a sketch for the variance model above follows after this list):
      1. Fit the model using OLS (ordinary least squares) to get an initial estimate $\hat{\beta}_{OLS}$.
      2. Use predicted values from this model to estimate $w_i$.
      3. Refit the model using WLS (weighted least squares).
      4. If needed, iterate the previous two steps.
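    A minimal sketch (not in the original slides) of the two-stage procedure for the variance model $w_i^{-1} = \alpha_0 + \alpha_1 X_{i1}^2$; the data and object names are hypothetical:

    set.seed(4)

    # hypothetical data whose error variance increases with X1^2
    n = 200
    X1 = runif(n, 1, 10)
    Y = 1 + 0.5 * X1 + rnorm(n, sd = sqrt(1 + 2 * X1^2))

    # stage 1: OLS fit, then a model for the variance of the errors
    ols.fit = lm(Y ~ X1)
    var.fit = lm(resid(ols.fit)^2 ~ I(X1^2))   # estimates alpha0 and alpha1

    # stage 2: refit with estimated weights w_i = 1 / Var-hat(eps_i)
    w = 1 / pmax(fitted(var.fit), 1e-8)        # guard against non-positive variance estimates
    wls.fit2 = lm(Y ~ X1, weights = w)
    summary(wls.fit2)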

  • Example (two-stage procedure)

    - Let's use $w_i^{-1} = \widehat{\mathrm{Var}}(\varepsilon_i)$, estimated within each Region.

    # Weight vector for each observation in the full table
    # (Alaska is dropped again later via subset=keep.subset)
    educ.weights = 0 * education.table$Y
    for (region in levels(education.table$Region)) {
      # residuals of the no-Alaska fit belonging to this region
      subset.region = (education.table$Region[keep.subset] == region)
      # weight = 1 / (within-region estimate of the residual variance)
      educ.weights[education.table$Region == region] =
        1.0 / (sum(resid(education.noAK.lm)[subset.region]^2) /
               sum(subset.region))
    }

  • Example (two-stage procedure)

    - Weights for the observations in each Region:

    unique(educ.weights)

    ## [1] 0.0006891263 0.0004103443 0.0040090885 0.0010521628

  • Weighted least squares regression

    - Here is our new model.
    - Note that the scale of the estimates is unchanged.
    - Numerically, the estimates are similar.
    - What changes most is the Std. Error column.

    education.noAK.weight.lm =
      lm(Y ~ X1 + X2 + X3,
         weights=educ.weights,
         subset=keep.subset,
         data=education.table)

  • Weighted least squares regression

    summary(education.noAK.weight.lm)

  • Diagnostics

    par(mfrow=c(2,2))
    plot(education.noAK.weight.lm)

    [Figure: diagnostic plots for education.noAK.weight.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). A few observations (e.g. 10, 15, 42, 45, 47) are flagged across panels.]

  • Diagnostics

    - Let's look at the boxplot again. It looks better, but perhaps not perfect.

    par(mfrow=c(1,1))
    boxplot(resid(education.noAK.weight.lm,
                  type='pearson') ~
              education.table$Region[keep.subset],
            col=c('red', 'green', 'blue', 'yellow'))

  • Diagnostics

    [Figure: boxplots of the Pearson residuals of education.noAK.weight.lm by Region (1 to 4); after weighting, the residual spread is much more similar across regions (roughly -1.5 to 1.5).]

  • Unequal variance: effects on inference

    - So far, we have just mentioned that things may have unequal variance, but not thought about how it affects inference.
    - In general, if we ignore unequal variance, our estimates of variance are not very good. The covariance has the "sandwich form" we saw above:
      $\mathrm{Cov}(\hat{\beta}_{OLS}) = (X^T X)^{-1} (X^T W^{-1} X) (X^T X)^{-1},$
      with $W = \mathrm{diag}(1/\sigma_i^2)$, an $n \times n$ diagonal matrix.
    - **If our Std. Error is incorrect, so are our conclusions based on t-statistics!**
    - In the education expenditure data example, correcting for weights seemed to make the t-statistics larger. **This will not always be the case!**

  • Unequal variance: effects on inference

    - Weighted least squares estimator:
      $\hat{\beta}_{WLS} = (X^T W X)^{-1} X^T W Y$
      (a small numerical check follows below).
    - If we have the correct weights, then
      $\mathrm{Cov}(\hat{\beta}_{WLS}) = (X^T W X)^{-1}.$
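    A quick numerical check (not in the original slides) that R's weighted lm matches the closed form; it reuses the simulated X, Y, and wls.fit from the earlier hypothetical sketch:

    # design matrix and weight matrix for the simulated example above
    Xmat = cbind(1, X)
    W = diag(1 / X^2)

    # closed-form WLS estimator (X^T W X)^{-1} X^T W Y
    beta.wls = solve(t(Xmat) %*% W %*% Xmat, t(Xmat) %*% W %*% Y)
    cbind(beta.wls, coef(wls.fit))   # the two columns agree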

  • Efficiency

    - The efficiency of an unbiased estimator of β is 1 / variance.
    - Estimators can be compared by their efficiency: the more efficient, the better.
    - The other reason to correct for unequal variance (besides getting valid inference) is efficiency.


  • Illustrative example

    - Suppose
      $Z_i = \mu + \varepsilon_i, \qquad \varepsilon_i \sim N(0, i^2 \cdot \sigma^2), \qquad 1 \le i \le n.$
    - Three unbiased estimators of $\mu$:
      $\hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^n Z_i$
      $\hat{\mu}_2 = \frac{1}{\sum_{i=1}^n i^{-2}} \sum_{i=1}^n i^{-2} Z_i$
      $\hat{\mu}_3 = \frac{1}{\sum_{i=1}^n i^{-1}} \sum_{i=1}^n i^{-1} Z_i$


  • Illustrative example

    - The estimator $\hat{\mu}_2$ will always have lower variance, and hence tighter confidence intervals.
    - The estimator $\hat{\mu}_3$ has incorrect weights, but they are "closer" to correct than the naive mean's weights, which assume each observation has equal variance.

  • Illustrative example

    ntrial = 1000    # how many trials will we be doing?
    nsample = 20     # how many points in each trial
    sd = c(1:20)     # how does the variance change
    mu = 2.0

    get.sample = function() {
      return(rnorm(nsample)*sd + mu)
    }

    unweighted.estimate = numeric(ntrial)
    weighted.estimate = numeric(ntrial)
    suboptimal.estimate = numeric(ntrial)

  • Illustrative example

    - Let's simulate a number of experiments and compare the three estimates.

    for (i in 1:ntrial) {
      cur.sample = get.sample()
      unweighted.estimate[i] = mean(cur.sample)
      weighted.estimate[i] = sum(cur.sample/sd^2) / sum(1/sd^2)
      suboptimal.estimate[i] = sum(cur.sample/sd) / sum(1/sd)
    }

  • Illustrative example

    - Compute $SE(\hat{\mu}_i)$:

    data.frame(mean(unweighted.estimate),
               sd(unweighted.estimate))

    ##   mean.unweighted.estimate. sd.unweighted.estimate.
    ## 1                  1.991227                2.590985

    data.frame(mean(weighted.estimate),
               sd(weighted.estimate))

    ##   mean.weighted.estimate. sd.weighted.estimate.
    ## 1                2.014456             0.7788919

    data.frame(mean(suboptimal.estimate),
               sd(suboptimal.estimate))

    ##   mean.suboptimal.estimate. sd.suboptimal.estimate.
    ## 1                  2.016207                1.227501


  • Illustrative example

    [Figure: "Comparison of sampling distribution of the estimators": overlaid density estimates for the optimal (weighted), suboptimal, and unweighted mean estimators, all centered near mu = 2; the weighted estimator's density is the most concentrated and the unweighted mean's the widest.]
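    The plotting code for this figure is not in the transcript; a minimal sketch of one way to produce such a comparison, assuming the three estimate vectors from the simulation above:

    # kernel density estimates of the three sampling distributions
    d.weighted   = density(weighted.estimate)
    d.suboptimal = density(suboptimal.estimate)
    d.unweighted = density(unweighted.estimate)

    plot(d.weighted, xlim=c(-4, 8), ylim=c(0, 0.5), col="blue",
         main="Comparison of sampling distribution of the estimators",
         xlab="estimate", ylab="density")
    lines(d.suboptimal, col="green")
    lines(d.unweighted, col="red")
    legend("topright", legend=c("optimal", "suboptimal", "mean"),
           col=c("blue", "green", "red"), lty=1)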

  • Reference

    - CH Chapter 6 and Chapter 7
    - Lecture notes of Jonathan Taylor:
      http://statweb.stanford.edu/~jtaylo/
