
  • Lecture 23: Transformations and Weighted Least Squares

    Pratheepa Jeganathan

    11/13/2019

  • Recap

    - What is a regression model?
    - Descriptive statistics – graphical
    - Descriptive statistics – numerical
    - Inference about a population mean
    - Difference between two population means
    - Some tips on R
    - Simple linear regression (covariance, correlation, estimation, geometry of least squares)
    - Inference on the simple linear regression model
    - Goodness of fit of regression: analysis of variance
    - F-statistics
    - Residuals
    - Diagnostic plots for simple linear regression (graphical methods)

  • Recap

    - Multiple linear regression
    - Specifying the model
    - Fitting the model: least squares
    - Interpretation of the coefficients
    - Matrix formulation of multiple linear regression
    - Inference for multiple linear regression
      - T-statistics revisited
      - More F-statistics
      - Tests involving more than one β
    - Diagnostics – more on graphical methods and numerical methods
    - Different types of residuals
    - Influence
    - Outlier detection
    - Multiple comparison (Bonferroni correction)
    - Residual plots:
      - partial regression (added variable) plot
      - partial residual (residual plus component) plot

  • Recap

    - Adding qualitative predictors
    - Qualitative variables as predictors to the regression model
    - Adding interactions to the linear regression model
    - Testing for equality of the regression relationship in various subsets of a population
    - ANOVA
      - All qualitative predictors
      - One-way layout
      - Two-way layout

  • Transformation

  • Outline

    - We have been working with linear regression models so far in the course.
    - Some models are nonlinear but can be transformed to a linear model (CH Chapter 6).
    - We will also see that transformations can sometimes stabilize the variance, making constant variance a more reasonable assumption (CH Chapter 6).
    - Finally, we will see how to correct for unequal variance using a technique called weighted least squares (WLS) (CH Chapter 7).

  • Bacterial colony decay (CH Chapter 6.3, Page 167)

    - Here is a simple dataset showing the number of bacteria alive in a colony, $n_t$, as a function of time $t$.
    - A simple linear regression model is clearly not a very good fit.

    bacteria.table = read.table('http://stats191.stanford.edu/data/bacteria.table',
                                header=T)
    head(bacteria.table)

    ##   t N_t
    ## 1 1 355
    ## 2 2 211
    ## 3 3 197
    ## 4 4 166
    ## 5 5 142
    ## 6 6 106

  • Fitting (Bacterial colony decay)

    [Figure: scatter plot of N_t against t (t from 4 to 12 on the horizontal axis, N_t from 0 to 300 on the vertical axis).]
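    The code that produced this figure is not shown in the transcript. A minimal sketch, assuming the simple linear fit is stored as bacteria.lm and the ggplot2 scatter plot as p (both names are used by later slides):

    library(ggplot2)

    # ordinary least squares fit of counts on time (clearly a poor fit here)
    bacteria.lm = lm(N_t ~ t, data=bacteria.table)

    # scatter plot of the raw data; fitted lines are layered on top later
    p = ggplot(bacteria.table, aes(x = t, y = N_t)) +
      geom_point()
    p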

  • Diagnostics (Bacterial colony decay)

    [Figure: standard lm diagnostic plots for the linear fit: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 1, 8, and 15 are flagged.]

  • Exponential decay model

    - Suppose the expected number of cells grows like
      $E(n_t) = n_0 e^{\beta_1 t}, \qquad t = 1, 2, 3, \ldots$
    - If we take logs of both sides,
      $\log E(n_t) = \log n_0 + \beta_1 t.$
    - A reasonable (?) model:
      $\log n_t = \log n_0 + \beta_1 t + \varepsilon_t, \qquad \varepsilon_t \overset{IID}{\sim} N(0, \sigma^2).$

  • bacteria.log.lm = lm(log(N_t) ~ t, bacteria.table)
    df = cbind(bacteria.table,
               lm.fit = fitted(bacteria.lm),
               lm.log.fit = exp(fitted(bacteria.log.lm)))

    p = p +
      geom_abline(intercept = bacteria.lm$coef[1],
                  slope = bacteria.lm$coef[2],
                  col = "red") +
      geom_line(data = df,
                aes(x = t, y = lm.log.fit),
                color = "green")

  • [Figure: scatter plot of N_t against t with the ordinary least squares line (red) and the back-transformed exponential fit exp(fitted(bacteria.log.lm)) (green) overlaid; t from 4 to 12, N_t from 0 to 300.]

  • Diagnostics

    par(mfrow=c(2,2))
    plot(bacteria.log.lm, pch=23, bg='orange')

    [Figure: diagnostic plots for bacteria.log.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 2, 7, and 10 are flagged.]

  • Logarithmic transformation

    - This model is slightly different than the original model:
      $E(\log n_t) \le \log E(n_t),$
      but may be approximately true.
    - If $\varepsilon_t \sim N(0, \sigma^2)$, then
      $n_t = n_0 \cdot \gamma_t \cdot e^{\beta_1 t}.$
    - $\gamma_t = e^{\varepsilon_t}$ is called a log-normal $(0, \sigma^2)$ random variable.

  • Linearizing regression function

    - We see that an exponential growth or decay model can be made (approximately) linear.
    - Here are a few other models that can be linearized (see the sketch after this list):
      - $y = \alpha x^\beta$: use $\tilde{y} = \log(y)$, $\tilde{x} = \log(x)$;
      - $y = \alpha e^{\beta x}$: use $\tilde{y} = \log(y)$;
      - $y = x/(\alpha x - \beta)$: use $\tilde{y} = 1/y$, $\tilde{x} = 1/x$;
      - more in the textbook.

  • Caveats

    - Just because the expected value linearizes doesn't mean that the errors behave correctly.
    - In some cases, this can be corrected using weighted least squares (more later).
    - The constant variance and normality assumptions should still be checked.

  • Stabilizing variance

    - Sometimes a transformation can turn non-constant variance errors into "close to" constant variance errors. This is another situation in which we might consider a transformation.
    - Example: by the "delta rule", if
      $\mathrm{Var}(Y) = \sigma^2 E(Y),$
      then
      $\mathrm{Var}(\sqrt{Y}) \simeq \frac{\sigma^2}{4}.$
    - In practice, we might not know which transformation is best. Box-Cox transformations offer a tool to find a "best" transformation (see the sketch below).
      http://en.wikipedia.org/wiki/Power_transform
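    A quick illustration of Box-Cox in R (not in the original slides): MASS::boxcox profiles the log-likelihood over the power parameter lambda, and a value of lambda near 0 supports the log transformation used above. This sketch assumes the bacteria.lm fit from earlier:

    library(MASS)

    # profile log-likelihood over the Box-Cox power parameter lambda;
    # lambda = 0 corresponds to the log transform, lambda = 1 to no transform
    boxcox(bacteria.lm, lambda = seq(-1, 1, 0.05))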

  • Delta rule

    The following approximation is ubiquitous in statistics.

    - Taylor series expansion:
      $f(Y) = f(E(Y)) + \dot{f}(E(Y)) (Y - E(Y)) + \ldots$
    - Taking the variance of both sides yields:
      $\mathrm{Var}(f(Y)) \simeq \dot{f}(E(Y))^2 \cdot \mathrm{Var}(Y)$

  • Delta rule

    - So, for our previous example:
      $\mathrm{Var}(\sqrt{Y}) \simeq \frac{\mathrm{Var}(Y)}{4 \cdot E(Y)}$
    - Another example:
      $\mathrm{Var}(\log(Y)) \simeq \frac{\mathrm{Var}(Y)}{E(Y)^2}.$
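    A small simulation (not in the original slides) checking the log example; the gamma distribution is just a convenient choice with known mean and variance:

    set.seed(2)

    # Y ~ Gamma(shape, rate): E(Y) = shape/rate, Var(Y) = shape/rate^2
    shape = 50; rate = 2
    Y = rgamma(100000, shape = shape, rate = rate)

    var(log(Y))                           # empirical variance of log(Y)
    (shape / rate^2) / (shape / rate)^2   # delta-rule approximation Var(Y) / E(Y)^2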

  • Caveats

    - Just because a transformation makes the variance constant doesn't mean the regression function is still linear (or even that it was linear)!
    - The models are approximations, and once a model is selected, our standard diagnostics should be used to assess the adequacy of fit.
    - It is possible to have non-constant variance, but the variance-stabilizing transformation may destroy linearity of the regression function.
      - Solution: try weighted least squares (WLS).

  • Weighted Least Squares (CH Chapter 7)

  • Correcting for unequal variance: weighted least squares

    - We will now see an example in which there seems to be strong evidence for variance that changes based on Region.
    - After observing this, we will create a new model that attempts to correct for this and come up with better estimates.
    - Correcting for unequal variance, as we describe it here, generally requires a model for how the variance depends on observable quantities.

  • Correcting for unequal variance: weighted least squares (CH Chapter 7.4, Page 197)

    Variable   Description
    Y          Per capita education expenditure by state
    X1         Per capita income in 1973 by state
    X2         Proportion of population under 18
    X3         Proportion in urban areas
    Region     Which region of the country the states are located in

  • lm

    education.table = read.table('http://stats191.stanford.edu/data/education1975.table',
                                 header=T)
    education.table$Region = factor(education.table$Region)
    education.lm = lm(Y ~ X1 + X2 + X3, data=education.table)

  • lm

    summary(education.lm)

  • Diagnostics

    par(mfrow=c(2,2))
    plot(education.lm)

    [Figure: diagnostic plots for education.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 49, 10, and 7 are flagged; observation 49 also has high leverage.]

  • Diagnostics

    - There is an outlier; let's drop the outlier and refit.

    boxplot(rstandard(education.lm) ~ education.table$Region,
            col=c('red', 'green', 'blue', 'yellow'))

    [Figure: boxplots of rstandard(education.lm) by Region (1 to 4); the standardized residuals range from about -2 to 3, with clearly different spread across regions.]

  • Fit a model without the outlier

    keep.subset = (education.table$STATE != 'AK')
    education.noAK.lm = lm(Y ~ X1 + X2 + X3,
                           subset=keep.subset,
                           data=education.table)

  • Fit a model without the outlier

    summary(education.noAK.lm)

  • par(mfrow=c(2,2))
    plot(education.noAK.lm)

    [Figure: diagnostic plots for education.noAK.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. Observations 10, 15, and 7 are flagged.]

  • Diagnostics (refitted model)

    par(mfrow=c(1,1))
    boxplot(rstandard(education.noAK.lm) ~ education.table$Region[keep.subset],
            col=c('red', 'green', 'blue', 'yellow'))

    [Figure: boxplots of rstandard(education.noAK.lm) by Region (1 to 4), with Alaska removed; the spread of the standardized residuals still differs noticeably across regions.]

  • Re-weighting observations

    - If you have a reasonable guess of the variance as a function of the predictors, you can use this to re-weight the data.
    - Hypothetical example:
      $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2 X_i^2).$
    - Setting $\tilde{Y}_i = Y_i / X_i$, $\tilde{X}_i = 1/X_i$, the model becomes
      $\tilde{Y}_i = \beta_0 \tilde{X}_i + \beta_1 + \gamma_i, \qquad \gamma_i \sim N(0, \sigma^2).$

  • Weighted Least Squares

    - Fitting this model is equivalent to minimizing
      $\sum_{i=1}^n \frac{1}{X_i^2} (Y_i - \beta_0 - \beta_1 X_i)^2$
      (see the R sketch after this list).
    - Weighted least squares:
      $SSE(\beta, w) = \sum_{i=1}^n w_i (Y_i - \beta_0 - \beta_1 X_i)^2, \qquad w_i = \frac{1}{X_i^2}.$
    - In general, the weights should be
      $w_i = \frac{1}{\mathrm{Var}(\varepsilon_i)}.$
    - Our education expenditure example assumes
      $w_i = W_{\mathrm{Region}[i]}.$
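    As a check (not in the original slides), here is a minimal R sketch on simulated data showing that lm() with weights = 1/X^2 reproduces the coefficients of OLS on the transformed variables; all data and object names are hypothetical:

    set.seed(3)

    # simulate Y_i = beta0 + beta1 * X_i + eps_i with Var(eps_i) = sigma^2 * X_i^2
    n = 100
    X = runif(n, 1, 5)
    Y = 3 + 2 * X + rnorm(n, sd = 0.5 * X)

    # weighted least squares with w_i = 1 / X_i^2
    wls.fit = lm(Y ~ X, weights = 1/X^2)

    # OLS on the transformed model: Y/X = beta0 * (1/X) + beta1 + gamma
    ols.transformed = lm(I(Y/X) ~ I(1/X))

    coef(wls.fit)         # (Intercept) estimates beta0, X estimates beta1
    coef(ols.transformed) # (Intercept) estimates beta1, I(1/X) estimates beta0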

  • Common weighting schemes

    - If you have a qualitative variable, then it is easy to estimate the weight within groups (our example today).
    - "Often"
      $\mathrm{Var}(\varepsilon_i) = \mathrm{Var}(Y_i) = V(E(Y_i))$
    - Many non-Gaussian (non-normal) models behave like this: logistic and Poisson regression.

  • What if we didn’t re-weight?

    - Our (ordinary) least squares estimator with design matrix $X$ is
      $\hat{\beta} = \hat{\beta}_{OLS} = (X^T X)^{-1} X^T Y = \beta + (X^T X)^{-1} X^T \varepsilon.$
    - Our model says that $\varepsilon \mid X \sim N(0, \sigma^2 V)$, so
      $E[(X^T X)^{-1} X^T \varepsilon] = E\left[ E[(X^T X)^{-1} X^T \varepsilon \mid X] \right] = 0.$
      So the OLS estimator is unbiased.
    - The variance of $\hat{\beta}_{OLS}$ is
      $\mathrm{Var}((X^T X)^{-1} X^T \varepsilon) = \sigma^2 (X^T X)^{-1} X^T V X (X^T X)^{-1},$
      where $V = \mathrm{diag}(X_1^2, \ldots, X_n^2)$ (a sandwich-standard-error sketch in R follows below).
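    For reference (not in the original slides), heteroscedasticity-consistent "sandwich" standard errors of this form can be computed with the sandwich and lmtest packages; a minimal sketch using the education fit from earlier:

    library(sandwich)
    library(lmtest)

    # sandwich (heteroscedasticity-consistent) covariance estimate for the OLS fit
    vcovHC(education.noAK.lm, type = "HC0")

    # t-tests using sandwich standard errors instead of the usual OLS ones
    coeftest(education.noAK.lm, vcov = vcovHC(education.noAK.lm, type = "HC0"))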

  • Two-stage procedure

    - Suppose we have a hypothesis about the weights, i.e. they are constant within Region, or they are something like
      $w_i^{-1} = \mathrm{Var}(\varepsilon_i) = \alpha_0 + \alpha_1 X_{i1}^2.$
    - We pre-whiten (a sketch for the variance model above follows after this list):
      1. Fit the model using OLS (ordinary least squares) to get an initial estimate $\hat{\beta}_{OLS}$.
      2. Use predicted values from this model to estimate $w_i$.
      3. Refit the model using WLS (weighted least squares).
      4. If needed, iterate the previous two steps.
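    A minimal sketch (not in the original slides) of the two-stage procedure for the variance model $w_i^{-1} = \alpha_0 + \alpha_1 X_{i1}^2$; the data and object names are hypothetical:

    set.seed(4)

    # hypothetical data whose error variance increases with X1^2
    n = 200
    X1 = runif(n, 1, 10)
    Y = 1 + 0.5 * X1 + rnorm(n, sd = sqrt(1 + 2 * X1^2))

    # stage 1: OLS fit, then a model for the variance of the errors
    ols.fit = lm(Y ~ X1)
    var.fit = lm(resid(ols.fit)^2 ~ I(X1^2))   # estimates alpha0 and alpha1

    # stage 2: refit with estimated weights w_i = 1 / Var-hat(eps_i)
    w = 1 / pmax(fitted(var.fit), 1e-8)        # guard against non-positive variance estimates
    wls.fit2 = lm(Y ~ X1, weights = w)
    summary(wls.fit2)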

  • Example (two-stage procedure)

    - Let's use $w_i^{-1} = \widehat{\mathrm{Var}}(\varepsilon_i)$, estimated within each Region.

    # Weight vector for each observation in the full table
    # (Alaska is dropped again later via subset=keep.subset)
    educ.weights = 0 * education.table$Y
    for (region in levels(education.table$Region)) {
      # residuals of the no-Alaska fit belonging to this region
      subset.region = (education.table$Region[keep.subset] == region)
      # weight = 1 / (within-region estimate of the residual variance)
      educ.weights[education.table$Region == region] =
        1.0 / (sum(resid(education.noAK.lm)[subset.region]^2) /
               sum(subset.region))
    }

  • Example (two-stage procedure)

    - Weights for the observations in each Region:

    unique(educ.weights)

    ## [1] 0.0006891263 0.0004103443 0.0040090885 0.0010521628

  • Weighted least squares regression

    - Here is our new model.
    - Note that the scale of the estimates is unchanged.
    - Numerically, the estimates are similar.
    - What changes most is the Std. Error column.

    education.noAK.weight.lm =
      lm(Y ~ X1 + X2 + X3,
         weights=educ.weights,
         subset=keep.subset,
         data=education.table)

  • Weighted least squares regression

    summary(education.noAK.weight.lm)

  • Diagnostics

    par(mfrow=c(2,2))
    plot(education.noAK.weight.lm)

    [Figure: diagnostic plots for education.noAK.weight.lm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). A few observations (e.g. 10, 15, 42, 45, 47) are flagged across panels.]

  • Diagnostics

    - Let's look at the boxplot again. It looks better, but perhaps not perfect.

    par(mfrow=c(1,1))
    boxplot(resid(education.noAK.weight.lm,
                  type='pearson') ~
              education.table$Region[keep.subset],
            col=c('red', 'green', 'blue', 'yellow'))

  • Diagnostics

    [Figure: boxplots of the Pearson residuals of education.noAK.weight.lm by Region (1 to 4); after weighting, the residual spread is much more similar across regions (roughly -1.5 to 1.5).]

  • Unequal variance: effects on inference

    - So far, we have just mentioned that things may have unequal variance, but not thought about how it affects inference.
    - In general, if we ignore unequal variance, our estimates of variance are not very good. The covariance has the "sandwich form" we saw above:
      $\mathrm{Cov}(\hat{\beta}_{OLS}) = (X^T X)^{-1} (X^T W^{-1} X) (X^T X)^{-1},$
      with $W = \mathrm{diag}(1/\sigma_i^2)$, an $n \times n$ diagonal matrix.
    - **If our Std. Error is incorrect, so are our conclusions based on t-statistics!**
    - In the education expenditure data example, correcting for weights seemed to make the t-statistics larger. **This will not always be the case!**

  • Unequal variance: effects on inference

    - Weighted least squares estimator:
      $\hat{\beta}_{WLS} = (X^T W X)^{-1} X^T W Y$
      (a small numerical check follows below).
    - If we have the correct weights, then
      $\mathrm{Cov}(\hat{\beta}_{WLS}) = (X^T W X)^{-1}.$
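    A quick numerical check (not in the original slides) that R's weighted lm matches the closed form; it reuses the simulated X, Y, and wls.fit from the earlier hypothetical sketch:

    # design matrix and weight matrix for the simulated example above
    Xmat = cbind(1, X)
    W = diag(1 / X^2)

    # closed-form WLS estimator (X^T W X)^{-1} X^T W Y
    beta.wls = solve(t(Xmat) %*% W %*% Xmat, t(Xmat) %*% W %*% Y)
    cbind(beta.wls, coef(wls.fit))   # the two columns agree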

  • Efficiency

    - The efficiency of an unbiased estimator of β is 1 / variance.
    - Estimators can be compared by their efficiency: the more efficient, the better.
    - The other reason to correct for unequal variance (besides getting valid inference) is efficiency.


  • Illustrative example

    - Suppose
      $Z_i = \mu + \varepsilon_i, \qquad \varepsilon_i \sim N(0, i^2 \cdot \sigma^2), \qquad 1 \le i \le n.$
    - Three unbiased estimators of $\mu$:
      $\hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^n Z_i$
      $\hat{\mu}_2 = \frac{1}{\sum_{i=1}^n i^{-2}} \sum_{i=1}^n i^{-2} Z_i$
      $\hat{\mu}_3 = \frac{1}{\sum_{i=1}^n i^{-1}} \sum_{i=1}^n i^{-1} Z_i$


  • Illustrative example

    - The estimator $\hat{\mu}_2$ will always have lower variance, and hence tighter confidence intervals.
    - The estimator $\hat{\mu}_3$ has incorrect weights, but they are "closer" to correct than the naive mean's weights, which assume each observation has equal variance.

  • Illustrative example

    ntrial = 1000    # how many trials will we be doing?
    nsample = 20     # how many points in each trial
    sd = c(1:20)     # how does the variance change
    mu = 2.0

    get.sample = function() {
      return(rnorm(nsample)*sd + mu)
    }

    unweighted.estimate = numeric(ntrial)
    weighted.estimate = numeric(ntrial)
    suboptimal.estimate = numeric(ntrial)

  • Illustrative example

    - Let's simulate a number of experiments and compare the three estimates.

    for (i in 1:ntrial) {
      cur.sample = get.sample()
      unweighted.estimate[i] = mean(cur.sample)
      weighted.estimate[i] = sum(cur.sample/sd^2) / sum(1/sd^2)
      suboptimal.estimate[i] = sum(cur.sample/sd) / sum(1/sd)
    }

  • Illustrative example

    - Compute $SE(\hat{\mu}_i)$:

    data.frame(mean(unweighted.estimate),
               sd(unweighted.estimate))

    ##   mean.unweighted.estimate. sd.unweighted.estimate.
    ## 1                  1.991227                2.590985

    data.frame(mean(weighted.estimate),
               sd(weighted.estimate))

    ##   mean.weighted.estimate. sd.weighted.estimate.
    ## 1                2.014456             0.7788919

    data.frame(mean(suboptimal.estimate),
               sd(suboptimal.estimate))

    ##   mean.suboptimal.estimate. sd.suboptimal.estimate.
    ## 1                  2.016207                1.227501


  • Illustrative example

    [Figure: "Comparison of sampling distribution of the estimators": overlaid density estimates for the optimal (weighted), suboptimal, and unweighted mean estimators, all centered near mu = 2; the weighted estimator's density is the most concentrated and the unweighted mean's the widest.]
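    The plotting code for this figure is not in the transcript; a minimal sketch of one way to produce such a comparison, assuming the three estimate vectors from the simulation above:

    # kernel density estimates of the three sampling distributions
    d.weighted   = density(weighted.estimate)
    d.suboptimal = density(suboptimal.estimate)
    d.unweighted = density(unweighted.estimate)

    plot(d.weighted, xlim=c(-4, 8), ylim=c(0, 0.5), col="blue",
         main="Comparison of sampling distribution of the estimators",
         xlab="estimate", ylab="density")
    lines(d.suboptimal, col="green")
    lines(d.unweighted, col="red")
    legend("topright", legend=c("optimal", "suboptimal", "mean"),
           col=c("blue", "green", "red"), lty=1)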

  • Reference

    - CH Chapter 6 and Chapter 7
    - Lecture notes of Jonathan Taylor:
      http://statweb.stanford.edu/~jtaylo/
