Source: Purdue University, bacraig/notes512/topic_9.pdf (transcript)
Topic 9: Remedies
Outline
• Review diagnostics for residuals
• Discuss remedies
  – Nonlinear relationship
  – Nonconstant variance
  – Non-Normal distribution
  – Outliers
Diagnostics for residuals
• Look at residuals to find serious/obvious violations of the model assumptions
  – nonlinear relationship
  – nonconstant variance
  – non-Normal errors
  – presence of outliers
  – a strongly skewed distribution
Recommendations for checking assumptions
• Plot Y vs X (is it a linear relationship?)
• Use i=sm## in the SYMBOL statement if using SAS to get a smoothed curve fit
• If reasonable, fit the model to get residuals
• Look at the distribution of the residuals
• Plot residuals vs X, time/run order, or any other potential explanatory variable
Plots of Residuals
• Plot residuals vs
  – Time (run order)
  – X or predicted value (b0 + b1X)
• In all cases look for
  – nonrandom patterns
  – outliers (unusual observations)
1. Residuals vs Time
• A pattern in the plot suggests dependent errors / lack of independence
• The pattern is usually a linear or quadratic trend and/or cyclical
• If you are interested in more info read KNNL pgs 108-110
2. Residuals vs X or Ŷ
• Can look for
  – nonconstant variance
  – nonlinear relationship
  – outliers
  – to some extent, non-Normality of residuals
Assessment of Normality
• Look at the distribution of the residuals
  – Can look at them together because E(e) = 0 (unlike the Y’s)
• Common to use the eye test
  – Histogram
  – Normal quantile plot
  – Scatter in residual plot
Tests for Normality
• H0: data are an i.i.d. sample from a Normal population
• Ha: data are not an i.i.d. sample from a Normal population
• KNNL (p 115) suggest a correlation test that requires a table look-up, based on the relationship between the observations and their normal scores
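As a sketch of this idea (in Python rather than the course's SAS, with made-up residuals), the correlation test computes the correlation between the ordered residuals and their expected normal scores; the Blom plotting positions used below are one common approximation, and the formal cutoff would still come from the KNNL table:

```python
# Hypothetical sketch of the KNNL-style correlation test for Normality:
# correlate the ordered residuals with their expected normal scores.
from statistics import NormalDist, mean

def normal_scores_correlation(residuals):
    """Correlation between sorted residuals and Blom normal scores."""
    n = len(residuals)
    e = sorted(residuals)
    # Expected normal scores via the Blom plotting positions
    z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    me, mz = mean(e), mean(z)
    num = sum((a - me) * (b - mz) for a, b in zip(e, z))
    den = (sum((a - me) ** 2 for a in e) *
           sum((b - mz) ** 2 for b in z)) ** 0.5
    return num / den

# Roughly Normal-looking residuals (made up) should give r close to 1
r = normal_scores_correlation([-1.2, -0.6, -0.1, 0.0, 0.3, 0.7, 1.1])
```

A value of r near 1 is consistent with Normality; the formal decision compares r to the tabled critical value for the given n and α.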
Tests for Normality
• SAS has several choices for the significance testing procedure
• PROC UNIVARIATE with the normal option provides four procedures:

  proc univariate normal;

• Shapiro-Wilk is the most common choice
Tests for Normality

Test                 Statistic        p Value
Shapiro-Wilk         W     0.978904   Pr < W     0.8626
Kolmogorov-Smirnov   D     0.09572    Pr > D    >0.1500
Cramer-von Mises     W-Sq  0.033263   Pr > W-Sq >0.2500
Anderson-Darling     A-Sq  0.207142   Pr > A-Sq >0.2500
All p-values > 0.05, so we do not reject H0.
This is not the same as concluding the errors are Normally distributed; there is just not enough evidence to reject.
Toluca data set
Other tests for model assumptions
• Durbin-Watson test for serially correlated errors (KNNL p 114)
• Modified Levene test for homogeneity of variance (KNNL p 116-118)
• Breusch-Pagan test for homogeneity of variance (KNNL p 118)
• For SAS commands see topic9.sas
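The Durbin-Watson statistic itself is simple to compute; here is a hedged Python sketch (made-up residual series, not from the Toluca data). Values near 2 suggest independent errors, values near 0 positive serial correlation, and values near 4 negative serial correlation:

```python
# Durbin-Watson statistic on a residual series (made-up data):
# d = sum of squared successive differences / sum of squared residuals
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(v ** 2 for v in e)

bouncy   = [0.5, -1.1, 0.8, -0.3, 1.2, -0.9, 0.1, -0.4]  # sign changes often
trending = [1.0, 0.9, 0.7, 0.4, 0.0, -0.4, -0.7, -0.9]   # smooth drift

d1 = durbin_watson(bouncy)    # large d: successive residuals differ a lot
d2 = durbin_watson(trending)  # small d: successive residuals are similar
```

The smooth trending series gives a small d, the signature of positively correlated errors over time.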
Plots vs significance tests
• Significance test results are very dependent on the sample size; with sufficiently large samples we can reject slight deviations from the null hypothesis that would not invalidate results if ignored
• Plots are more likely to suggest a remedy if one is needed
Default graphics with SAS 9.4
proc reg data=toluca;
  model hours=lotsize;
  id lotsize;
run;
Will discuss these diagnostics more with multiple regression
Plots provide rule of thumb limits
Questionable observation (30,273)
Additional summaries in these plots
• Rstudent: Studentized residual…almost all should be between ± 2
• Leverage: “Distance” of X from center…helps determine outlying X values in multivariable setting…outlying X values may be influential
• Cook’s D: Influence of ith case on all predicted values
Lack of fit
• When we have repeat observations at various values of X, we can do a significance test for nonlinearity
• Description in KNNL Section 3.7
• Details of the approach discussed when we get to KNNL 17.9, p 762
• Basic idea is to compare two models
• GPLOT with a smooth is a better (i.e., simpler) approach for this assessment
SAS code and output

proc reg data=toluca;
  model hours=lotsize / lackfit;
run;

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1          252378  252378        105.88  <.0001
Error            23           54825  2383.71562
  Lack of Fit     9           17245  1916.06954      0.71  0.6893
  Pure Error     14           37581  2684.34524
Corrected Total  24          307203
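To show where those sums of squares come from, here is a minimal Python sketch of the lack-of-fit decomposition (made-up data with repeats, not the Toluca data): SSE from the regression splits into pure error, the variation of repeat Y's around their X-level means, and lack of fit.

```python
# Lack-of-fit F test sketch: data are made up for illustration.
from collections import defaultdict

x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [2.1, 2.3, 3.9, 4.2, 6.1, 5.8, 8.0, 8.3]
n = len(x)

# Fit simple linear regression by least squares
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Pure error: squared deviations of repeat Y's around their group mean
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
ss_pe = sum(sum((yi - sum(g) / len(g)) ** 2 for yi in g)
            for g in groups.values())

c = len(groups)                                  # distinct X levels
ss_lf = sse - ss_pe                              # lack-of-fit SS
f_stat = (ss_lf / (c - 2)) / (ss_pe / (n - c))   # F with (c-2, n-c) df
```

A large F (small p-value) would say the straight-line means fit worse than the unrestricted group means, i.e., evidence of nonlinearity.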
Nonlinear relationships
• We can model many nonlinear relationships with linear models; some have several explanatory variables (i.e., need to move to multiple linear regression)
  – Y = β0 + β1X + β1X², i.e., Y = β0 + β1X + β2X² + e (quadratic)
  – Y = β0 + β1log(X) + e
Nonlinear Relationships
• Sometimes one can transform a nonlinear equation into a linear equation
• Consider Y = β0exp(β1X) + e
• Can form a linear model using the log:
  log(Y) = log(β0) + β1X + d
• Note that we have changed our assumptions about the error; if we back-transform, Y = β0exp(β1X)exp(d)
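A small Python illustration of this linearization (example values made up so that β0 ≈ 1 and β1 ≈ 1): regress log(Y) on X by least squares, then back-transform the intercept.

```python
# Linearizing Y = b0*exp(b1*X) by taking logs (made-up example values)
import math

x = [1, 2, 3, 4, 5]
y = [2.7, 7.4, 20.1, 54.6, 148.4]   # roughly exp(x), so b0 ~ 1, b1 ~ 1

ly = [math.log(v) for v in y]       # work on the log scale
n = len(x)
xbar, lbar = sum(x) / n, sum(ly) / n
b1 = sum((xi - xbar) * (li - lbar) for xi, li in zip(x, ly)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = math.exp(lbar - b1 * xbar)     # back-transform the intercept
```

The error on the back-transformed scale is multiplicative, exp(d), which is exactly the changed assumption the slide warns about.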
Nonlinear Relationships
• If we don’t want to alter assumptions on the errors, we can instead perform a nonlinear regression analysis
  – Means vary in a nonlinear manner
  – Observations deviate about these means just as in linear regression
• KNNL Chapter 13
• SAS PROC NLIN
Nonconstant variance
• Sometimes we need to model the way in which the error variance changes
  – may be functionally related to X
  – in other words, changes with the mean
• In this case, we use a weighted analysis
• This is discussed in KNNL 11.1
• Use a weight statement in PROC REG
Non-Normal errors
• Transformations of Y often allow you to still use linear regression
• If not, use a procedure that allows different distributions for the error term
  – SAS PROC GENMOD
Generalized Linear Model
• Possible distributions of Y:
  – Binomial (Y/N or percentage data)
  – Poisson (count data)
  – Gamma (exponential)
  – Inverse Gaussian
  – Negative binomial
  – Multinomial
• Specify a link function for E(Y)
  – May be a linear or nonlinear function of X
Transformations: Ladder of Re-expression
• The transformation is x^p or y^p for a power p on the ladder: …, -1.0, -0.5, 0.0 (log), 0.5, 1.0, 1.5, …
• A transform of y affects the error assumptions
Circle of Transformations
• Ask which quadrant best describes the shape of the X vs Y relationship:
  – X up, Y up
  – X up, Y down
  – X down, Y up
  – X down, Y down
• Transform using the appropriate power
Box-Cox Transformations
• Also called power transformations
• These transformations adjust for non-Normality and nonconstant variance
• Y′ = Y^λ or Y′ = (Y^λ − 1)/λ
• In the second form, the limit as λ approaches zero is the (natural) log
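A quick numeric check of that limit: for fixed Y, (Y^λ − 1)/λ approaches log(Y) as λ shrinks toward zero.

```python
# Numeric check that (Y**lam - 1)/lam -> log(Y) as lam -> 0
import math

y = 5.0
gaps = []
for lam in (0.1, 0.01, 0.001):
    approx = (y ** lam - 1) / lam
    # gap between the power form and log(y) shrinks with lam
    gaps.append(abs(approx - math.log(y)))
```

This is why λ = 0 is defined as the log transformation in the special cases below.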
Important Special Cases
• λ = 1: Y′ = Y^1, no transformation
• λ = .5: Y′ = Y^(1/2), square root
• λ = -.5: Y′ = Y^(-1/2), one over square root
• λ = -1: Y′ = Y^(-1) = 1/Y, inverse
• λ = 0: Y′ = (natural) log of Y
Box-Cox Details
• We can estimate λ by including it as a parameter in a nonlinear model
  Y^λ = β0 + β1X + e
  and use the method of maximum likelihood to estimate it
• Details are in KNNL p 134-137
• SAS code is in boxcox.sas
Box-Cox Solution
• Standardized transformed Y is
  – K1(Y^λ − 1) if λ ≠ 0
  – K2 log(Y) if λ = 0
  where K2 = (Π Yi)^(1/n) (the geometric mean) and K1 = 1/(λ K2^(λ−1))
• Run regressions with X as the explanatory variable
• Best choice of λ minimizes SSE
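The steps above can be sketched in Python (the course does this in SAS; the data here are made up to echo the plasma example): compute the geometric mean, form the standardized transform for each λ on a grid, regress on X, and keep the λ with the smallest SSE.

```python
# Box-Cox grid search via the standardized transform (made-up data)
import math

x = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y = [13.4, 12.8, 10.1, 11.4, 9.8, 9.0, 6.0, 5.1, 4.9, 5.7]
n = len(y)
k2 = math.exp(sum(math.log(v) for v in y) / n)   # geometric mean

def sse_for(lam):
    """SSE from regressing the standardized transform of y on x."""
    if abs(lam) < 1e-8:
        yl = [k2 * math.log(v) for v in y]       # lambda = 0 case
    else:
        k1 = 1 / (lam * k2 ** (lam - 1))
        yl = [k1 * (v ** lam - 1) for v in y]
    xbar, ybar = sum(x) / n, sum(yl) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, yl)) / \
         sum((a - xbar) ** 2 for a in x)
    b0 = ybar - b1 * xbar
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, yl))

lams = [l / 10 for l in range(-10, 11)]          # grid -1.0 to 1.0 by 0.1
best = min(lams, key=sse_for)                    # lambda minimizing SSE
```

Standardizing by K1 and K2 is what makes the SSE values comparable across different λ.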
data a1;
  input age plasma @@;
cards;
0 13.44 0 12.84 0 11.91 0 20.09
0 15.60 1 10.11 1 11.38 1 10.28
1  8.96 1  8.59 2  9.83 2  9.00
2  8.65 2  7.85 2  8.88 3  7.94
3  6.01 3  5.14 3  6.90 3  6.77
4  4.86 4  5.10 4  5.67 4  5.75
4  6.23
;
Example
Variability is large initially and gets smaller. Is it just different at age = 0, or is it more a function of age?
Box-Cox Procedure

*Procedure that will automatically find the Box-Cox transformation;
proc transreg data=a1;
  model boxcox(plasma)=identity(age);
run;
Transformation Information for BoxCox(plasma)
Lambda R-Square Log Like
-2.50 0.76 -17.0444
-2.00 0.80 -12.3665
-1.50 0.83 -8.1127
-1.00 0.86 -4.8523 *
-0.50 0.87 -3.5523 <
0.00 + 0.85 -5.0754 *
0.50 0.82 -9.2925
1.00 0.75 -15.2625
1.50 0.67 -22.1378
2.00 0.59 -29.4720
2.50 0.50 -37.0844
< - Best Lambda
* - Confidence Interval
+ - Convenient Lambda
Based on these results, either take inverse square root or log
*The first part of the program gets the geometric mean;
data a2; set a1;
  lplasma=log(plasma);
proc univariate data=a2 noprint;
  var lplasma;
  output out=a3 mean=meanl;
Doing the procedure inside SAS data steps

data a4; set a2;
  if _n_ eq 1 then set a3;
  keep age yl l;
  k2=exp(meanl);
  do l = -1.0 to 1.0 by .1;
    k1=1/(l*k2**(l-1));
    yl=k1*(plasma**l - 1);
    if abs(l) < 1E-8 then yl=k2*log(plasma);
    output;
  end;
proc sort data=a4 out=a4; by l;
proc reg data=a4 noprint outest=a5;
  model yl=age;
  by l;
data a5; set a5;
  n=25; p=2;
  sse=(n-p)*(_rmse_)**2;
proc print data=a5; var l sse;
Obs     l      sse
  1  -1.0  33.9089
  2  -0.9  32.7044
  3  -0.8  31.7645
  4  -0.7  31.0907
  5  -0.6  30.6868
  6  -0.5  30.5596 ***
  7  -0.4  30.7186
  8  -0.3  31.1763
  9  -0.2  31.9487
 10  -0.1  33.0552
symbol1 v=none i=join;
proc gplot data=a5;
  plot sse*l;
run;
Compare Regressions
• Can also create a large data set of standardized transformed Y’s and compare R² fits
• Best λ maximizes R²
• This approach is also provided in the SAS code
data a2; set a1;
  do l = -1.0 to -.1 by .1;
    yl=(plasma**l - 1)/l; output;
  end;
  l=0; yl=log(plasma); output;
  do l = .1 to 1 by .1;
    yl=(plasma**l - 1)/l; output;
  end;
proc sort data=a2 out=a2; by l;
proc reg noprint outest=a3;
  model yl=age/selection=rsquare;
  by l;
proc print data=a3; var l _rsq_;
proc gplot data=a3; plot _rsq_*l/frame;
run;
Now let’s apply the transformation

data a1; set a1;
  tplasma = plasma**(-.5);
  tplasma1 = log(plasma);
symbol1 v=circle i=sm50;
proc gplot; plot tplasma*age;
proc gplot; plot tplasma1*age;
run;
Background Reading
• Sections 3.4 - 3.7 describe significance tests for assumptions (read it if you are interested).
• Box-Cox transformation is in boxcox.sas
• Read Sections 4.1, 4.2, 4.4, 4.5, and 4.6