Source: Purdue University, bacraig/notes512/topic_9.pdf (transcript)
Topic 9: Remedies
Outline
• Review diagnostics for residuals
• Discuss remedies
  – Nonlinear relationship
  – Nonconstant variance
  – Non-Normal distribution
  – Outliers
Diagnostics for residuals
• Look at residuals to find serious/obvious violations of the model assumptions
  – nonlinear relationship
  – nonconstant variance
  – non-Normal errors
  – presence of outliers
  – a strongly skewed distribution
Recommendations for checking assumptions
• Plot Y vs X (is it a linear relationship?)
• Use i=sm## in the SYMBOL statement if using SAS to get a smoothed curve fit
• If reasonable, fit the model to get residuals
• Look at the distribution of the residuals
• Plot residuals vs X, time/run order, or any other potential explanatory variable
Plots of Residuals
• Plot residuals vs
  – Time (run order)
  – X or predicted value (b0 + b1X)
• In all cases look for
  – nonrandom patterns
  – outliers (unusual observations)
1. Residuals vs Time
• A pattern in the plot suggests dependent errors / lack of independence
• The pattern is usually a linear or quadratic trend and/or cyclical
• If you are interested in more info read KNNL pgs 108-110
2. Residuals vs X or Ŷ
• Can look for
  – nonconstant variance
  – nonlinear relationship
  – outliers
  – to some extent, non-Normality of residuals
Assessment of Normality
• Look at the distribution of the residuals
  – Can look at them together because E(e) = 0 (unlike the Y’s)
• Common to use the eye test
  – Histogram
  – Normal quantile plot
  – Scatter in residual plot
Tests for Normality
• H0: data are an i.i.d. sample from a Normal population
• Ha: data are not an i.i.d. sample from a Normal population
• KNNL (p 115) suggest a correlation test that requires a table look-up, based on the relationship between the observations and their normal scores
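As a sketch of this idea (in Python rather than the course's SAS, with made-up residuals), the correlation test computes the correlation between the ordered residuals and their expected normal scores; the Blom plotting positions used below are one common approximation, and the formal cutoff would still come from the KNNL table:

```python
# Hypothetical sketch of the KNNL-style correlation test for Normality:
# correlate the ordered residuals with their expected normal scores.
from statistics import NormalDist, mean

def normal_scores_correlation(residuals):
    """Correlation between sorted residuals and Blom normal scores."""
    n = len(residuals)
    e = sorted(residuals)
    # Expected normal scores via the Blom plotting positions
    z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    me, mz = mean(e), mean(z)
    num = sum((a - me) * (b - mz) for a, b in zip(e, z))
    den = (sum((a - me) ** 2 for a in e) *
           sum((b - mz) ** 2 for b in z)) ** 0.5
    return num / den

# Roughly Normal-looking residuals (made up) should give r close to 1
r = normal_scores_correlation([-1.2, -0.6, -0.1, 0.0, 0.3, 0.7, 1.1])
```

A value of r near 1 is consistent with Normality; the formal decision compares r to the tabled critical value for the given n and α.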
Tests for Normality
• SAS has several choices for the significance testing procedure
• PROC UNIVARIATE with the normal option provides four procedures:

  proc univariate normal;

• Shapiro-Wilk is the most common choice
Tests for Normality

Test                 Statistic        p Value
Shapiro-Wilk         W     0.978904   Pr < W     0.8626
Kolmogorov-Smirnov   D     0.09572    Pr > D    >0.1500
Cramer-von Mises     W-Sq  0.033263   Pr > W-Sq >0.2500
Anderson-Darling     A-Sq  0.207142   Pr > A-Sq >0.2500
All p-values > 0.05, so we do not reject H0.
This is not the same as concluding the errors are Normally distributed; there is just not enough evidence to reject.
Toluca data set
Other tests for model assumptions
• Durbin-Watson test for serially correlated errors (KNNL p 114)
• Modified Levene test for homogeneity of variance (KNNL p 116-118)
• Breusch-Pagan test for homogeneity of variance (KNNL p 118)
• For SAS commands see topic9.sas
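The Durbin-Watson statistic itself is simple to compute; here is a hedged Python sketch (made-up residual series, not from the Toluca data). Values near 2 suggest independent errors, values near 0 positive serial correlation, and values near 4 negative serial correlation:

```python
# Durbin-Watson statistic on a residual series (made-up data):
# d = sum of squared successive differences / sum of squared residuals
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(v ** 2 for v in e)

bouncy   = [0.5, -1.1, 0.8, -0.3, 1.2, -0.9, 0.1, -0.4]  # sign changes often
trending = [1.0, 0.9, 0.7, 0.4, 0.0, -0.4, -0.7, -0.9]   # smooth drift

d1 = durbin_watson(bouncy)    # large d: successive residuals differ a lot
d2 = durbin_watson(trending)  # small d: successive residuals are similar
```

The smooth trending series gives a small d, the signature of positively correlated errors over time.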
Plots vs significance tests
• Significance test results are very dependent on the sample size; with sufficiently large samples we can reject slight deviations from the null hypothesis that would not invalidate results if ignored
• Plots are more likely to suggest a remedy if one is needed
Default graphics with SAS 9.4
proc reg data=toluca;
  model hours=lotsize;
  id lotsize;
run;
Will discuss these diagnostics more with multiple regression
Plots provide rule of thumb limits
Questionable observation (30,273)
Additional summaries in these plots
• Rstudent: Studentized residual…almost all should be between ± 2
• Leverage: “Distance” of X from center…helps determine outlying X values in multivariable setting…outlying X values may be influential
• Cook’s D: Influence of ith case on all predicted values
Lack of fit
• When we have repeat observations at various values of X, we can do a significance test for nonlinearity
• Description in KNNL Section 3.7
• Details of the approach discussed when we get to KNNL 17.9, p 762
• Basic idea is to compare two models
• GPLOT with a smooth is a better (i.e., simpler) approach for this assessment
SAS code and output

proc reg data=toluca;
  model hours=lotsize / lackfit;
run;

Analysis of Variance

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             1          252378  252378        105.88  <.0001
Error            23           54825  2383.71562
  Lack of Fit     9           17245  1916.06954      0.71  0.6893
  Pure Error     14           37581  2684.34524
Corrected Total  24          307203
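To show where those sums of squares come from, here is a minimal Python sketch of the lack-of-fit decomposition (made-up data with repeats, not the Toluca data): SSE from the regression splits into pure error, the variation of repeat Y's around their X-level means, and lack of fit.

```python
# Lack-of-fit F test sketch: data are made up for illustration.
from collections import defaultdict

x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [2.1, 2.3, 3.9, 4.2, 6.1, 5.8, 8.0, 8.3]
n = len(x)

# Fit simple linear regression by least squares
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Pure error: squared deviations of repeat Y's around their group mean
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
ss_pe = sum(sum((yi - sum(g) / len(g)) ** 2 for yi in g)
            for g in groups.values())

c = len(groups)                                  # distinct X levels
ss_lf = sse - ss_pe                              # lack-of-fit SS
f_stat = (ss_lf / (c - 2)) / (ss_pe / (n - c))   # F with (c-2, n-c) df
```

A large F (small p-value) would say the straight-line means fit worse than the unrestricted group means, i.e., evidence of nonlinearity.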
Nonlinear relationships
• We can model many nonlinear relationships with linear models; some have several explanatory variables (i.e., need to move to multiple linear regression)
  – Y = β0 + β1X + β1X², i.e., Y = β0 + β1X + β2X² + e (quadratic)
  – Y = β0 + β1log(X) + e
Nonlinear Relationships
• Sometimes one can transform a nonlinear equation into a linear equation
• Consider Y = β0exp(β1X) + e
• Can form a linear model using the log:
  log(Y) = log(β0) + β1X + d
• Note that we have changed our assumptions about the error; if we back-transform, Y = β0exp(β1X)exp(d)
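A small Python illustration of this linearization (example values made up so that β0 ≈ 1 and β1 ≈ 1): regress log(Y) on X by least squares, then back-transform the intercept.

```python
# Linearizing Y = b0*exp(b1*X) by taking logs (made-up example values)
import math

x = [1, 2, 3, 4, 5]
y = [2.7, 7.4, 20.1, 54.6, 148.4]   # roughly exp(x), so b0 ~ 1, b1 ~ 1

ly = [math.log(v) for v in y]       # work on the log scale
n = len(x)
xbar, lbar = sum(x) / n, sum(ly) / n
b1 = sum((xi - xbar) * (li - lbar) for xi, li in zip(x, ly)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = math.exp(lbar - b1 * xbar)     # back-transform the intercept
```

The error on the back-transformed scale is multiplicative, exp(d), which is exactly the changed assumption the slide warns about.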
Nonlinear Relationships
• If we don’t want to alter assumptions on the errors, we can instead perform a nonlinear regression analysis
  – Means vary in a nonlinear manner
  – Observations deviate about these means just as in linear regression
• KNNL Chapter 13
• SAS PROC NLIN
Nonconstant variance
• Sometimes we need to model the way in which the error variance changes
  – may be functionally related to X
  – in other words, changes with the mean
• In this case, we use a weighted analysis
• This is discussed in KNNL 11.1
• Use a weight statement in PROC REG
Non-Normal errors
• Transformations of Y often allow you to still use linear regression
• If not, use a procedure that allows different distributions for the error term
  – SAS PROC GENMOD
Generalized Linear Model
• Possible distributions of Y:
  – Binomial (Y/N or percentage data)
  – Poisson (count data)
  – Gamma (exponential)
  – Inverse Gaussian
  – Negative binomial
  – Multinomial
• Specify a link function for E(Y)
  – May be a linear or nonlinear function of X
Transformations: Ladder of Re-expression
• The transformation is x^p or y^p for a power p on the ladder: …, -1.0, -0.5, 0.0 (log), 0.5, 1.0, 1.5, …
• A transform of y affects the error assumptions
Circle of Transformations
• Ask which quadrant best describes the shape of the X vs Y relationship:
  – X up, Y up
  – X up, Y down
  – X down, Y up
  – X down, Y down
• Transform using the appropriate power
Box-Cox Transformations
• Also called power transformations
• These transformations adjust for non-Normality and nonconstant variance
• Y′ = Y^λ or Y′ = (Y^λ − 1)/λ
• In the second form, the limit as λ approaches zero is the (natural) log
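A quick numeric check of that limit: for fixed Y, (Y^λ − 1)/λ approaches log(Y) as λ shrinks toward zero.

```python
# Numeric check that (Y**lam - 1)/lam -> log(Y) as lam -> 0
import math

y = 5.0
gaps = []
for lam in (0.1, 0.01, 0.001):
    approx = (y ** lam - 1) / lam
    # gap between the power form and log(y) shrinks with lam
    gaps.append(abs(approx - math.log(y)))
```

This is why λ = 0 is defined as the log transformation in the special cases below.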
Important Special Cases
• λ = 1: Y′ = Y^1, no transformation
• λ = .5: Y′ = Y^(1/2), square root
• λ = -.5: Y′ = Y^(-1/2), one over square root
• λ = -1: Y′ = Y^(-1) = 1/Y, inverse
• λ = 0: Y′ = (natural) log of Y
Box-Cox Details
• We can estimate λ by including it as a parameter in a nonlinear model
  Y^λ = β0 + β1X + e
  and use the method of maximum likelihood to estimate it
• Details are in KNNL p 134-137
• SAS code is in boxcox.sas
Box-Cox Solution
• Standardized transformed Y is
  – K1(Y^λ − 1) if λ ≠ 0
  – K2 log(Y) if λ = 0
  where K2 = (Π Yi)^(1/n) (the geometric mean) and K1 = 1/(λ K2^(λ−1))
• Run regressions with X as the explanatory variable
• Best choice of λ minimizes SSE
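The steps above can be sketched in Python (the course does this in SAS; the data here are made up to echo the plasma example): compute the geometric mean, form the standardized transform for each λ on a grid, regress on X, and keep the λ with the smallest SSE.

```python
# Box-Cox grid search via the standardized transform (made-up data)
import math

x = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y = [13.4, 12.8, 10.1, 11.4, 9.8, 9.0, 6.0, 5.1, 4.9, 5.7]
n = len(y)
k2 = math.exp(sum(math.log(v) for v in y) / n)   # geometric mean

def sse_for(lam):
    """SSE from regressing the standardized transform of y on x."""
    if abs(lam) < 1e-8:
        yl = [k2 * math.log(v) for v in y]       # lambda = 0 case
    else:
        k1 = 1 / (lam * k2 ** (lam - 1))
        yl = [k1 * (v ** lam - 1) for v in y]
    xbar, ybar = sum(x) / n, sum(yl) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, yl)) / \
         sum((a - xbar) ** 2 for a in x)
    b0 = ybar - b1 * xbar
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, yl))

lams = [l / 10 for l in range(-10, 11)]          # grid -1.0 to 1.0 by 0.1
best = min(lams, key=sse_for)                    # lambda minimizing SSE
```

Standardizing by K1 and K2 is what makes the SSE values comparable across different λ.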
data a1;
  input age plasma @@;
cards;
0 13.44 0 12.84 0 11.91 0 20.09
0 15.60 1 10.11 1 11.38 1 10.28
1  8.96 1  8.59 2  9.83 2  9.00
2  8.65 2  7.85 2  8.88 3  7.94
3  6.01 3  5.14 3  6.90 3  6.77
4  4.86 4  5.10 4  5.67 4  5.75
4  6.23
;
Example
Variability is large initially and gets smaller. Is it just different at age = 0, or is it more a function of age?
Box-Cox Procedure

*Procedure that will automatically find the Box-Cox transformation;
proc transreg data=a1;
  model boxcox(plasma)=identity(age);
run;
Transformation Information for BoxCox(plasma)
Lambda R-Square Log Like
-2.50 0.76 -17.0444
-2.00 0.80 -12.3665
-1.50 0.83 -8.1127
-1.00 0.86 -4.8523 *
-0.50 0.87 -3.5523 <
0.00 + 0.85 -5.0754 *
0.50 0.82 -9.2925
1.00 0.75 -15.2625
1.50 0.67 -22.1378
2.00 0.59 -29.4720
2.50 0.50 -37.0844
< - Best Lambda
* - Confidence Interval
+ - Convenient Lambda
Based on these results, either take inverse square root or log
*The first part of the program gets the geometric mean;
data a2; set a1;
  lplasma=log(plasma);
proc univariate data=a2 noprint;
  var lplasma;
  output out=a3 mean=meanl;
Doing the procedure inside SAS data steps

data a4; set a2;
  if _n_ eq 1 then set a3;
  keep age yl l;
  k2=exp(meanl);
  do l = -1.0 to 1.0 by .1;
    k1=1/(l*k2**(l-1));
    yl=k1*(plasma**l - 1);
    if abs(l) < 1E-8 then yl=k2*log(plasma);
    output;
  end;
proc sort data=a4 out=a4; by l;
proc reg data=a4 noprint outest=a5;
  model yl=age;
  by l;
data a5; set a5;
  n=25; p=2;
  sse=(n-p)*(_rmse_)**2;
proc print data=a5; var l sse;
Obs     l      sse
  1  -1.0  33.9089
  2  -0.9  32.7044
  3  -0.8  31.7645
  4  -0.7  31.0907
  5  -0.6  30.6868
  6  -0.5  30.5596 ***
  7  -0.4  30.7186
  8  -0.3  31.1763
  9  -0.2  31.9487
 10  -0.1  33.0552
symbol1 v=none i=join;
proc gplot data=a5;
  plot sse*l;
run;
Compare Regressions
• Can also create a large data set of standardized transformed Y’s and compare R² fits
• Best λ maximizes R²
• This approach is also provided in the SAS code
data a2; set a1;
  do l = -1.0 to -.1 by .1;
    yl=(plasma**l - 1)/l; output;
  end;
  l=0; yl=log(plasma); output;
  do l = .1 to 1 by .1;
    yl=(plasma**l - 1)/l; output;
  end;
proc sort data=a2 out=a2; by l;
proc reg noprint outest=a3;
  model yl=age/selection=rsquare;
  by l;
proc print data=a3; var l _rsq_;
proc gplot data=a3; plot _rsq_*l/frame;
run;
Now let’s apply the transformation

data a1; set a1;
  tplasma = plasma**(-.5);
  tplasma1 = log(plasma);
symbol1 v=circle i=sm50;
proc gplot; plot tplasma*age;
proc gplot; plot tplasma1*age;
run;
Background Reading
• Sections 3.4 - 3.7 describe significance tests for assumptions (read it if you are interested).
• Box-Cox transformation is in boxcox.sas
• Read Sections 4.1, 4.2, 4.4, 4.5, and 4.6