topic 9: remedies - purdue universitybacraig/notes512/topic_9.pdf · poisson (count data) – gamma...

51
Topic 9: Remedies

Upload: dodang

Post on 05-Feb-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Topic 9: Remedies

Page 2: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Outline

• Review diagnostics for residuals• Discuss remedies

– Nonlinear relationship– Nonconstant variance– Non-Normal distribution– Outliers

Page 3: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Diagnostics for residuals• Look at residuals to find serious/obvious

violations of the model assumptions – nonlinear relationship – non-constant variance– non-Normal errors

• presence of outliers• a strongly skewed distribution

Page 4: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Recommendations for checking assumptions

• Plot Y vs X (is it a linear relationship?)

• Use the i=sm## in symbol statement if using SAS to get smoothed curve fit

• If reasonable, fit model to get residuals• Look at distribution of residuals• Plot residuals vs X, time/run order, or any

other potential explanatory variable

Page 5: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Plots of Residuals• Plot residuals vs

– Time (run order)– X or predicted value (b0+b1X)

• In all cases look for –nonrandom patterns–outliers (unusual observations)

Page 6: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

1. Residuals vs Time

• Pattern in plot suggests dependent errors / lack of indep

• Pattern usually a linear or quadratic trend and/or cyclical

• If you are interested in more info read KNNL pgs 108-110

Page 7: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

2. Residuals vs X or .

• Can look for –nonconstant variance–nonlinear relationship–outliers–To some extent non-Normality

of residuals

Y

Page 8: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Assessment of Normality• Look at distribution of residuals

–Can look at then together because E(e) = 0 (unlike Y’s)

• Common to use eye-test–Histogram–Normal quantile plot–Scatter in residual plot

Page 9: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Tests for Normality• H0: data are an i.i.d. sample from a

Normal population• Ha: data are not an i.i.d. sample

from a Normal population• KNNL (p 115) suggest a correlation

test that requires a table look-up based on relationship between obsand normal scores

Page 10: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Tests for Normality• SAS has several choices for the

significance testing procedure • PROC UNIVERIATE with the normal

option provides four proceduresproc univariate normal;

• Shapiro-Wilk is most common choice

Page 11: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Tests for Normality

Test Statistic p ValueShapiro-Wilk W 0.978904 Pr < W 0.8626

Kolmogorov-Smirnov D 0.09572 Pr > D >0.1500

Cramer-von Mises W-Sq 0.033263 Pr > W-Sq >0.2500

Anderson-Darling A-Sq 0.207142 Pr > A-Sq >0.2500

All P-values > 0.05…Do not reject H0

Not the same as concluding errors are Normally distributed. Just not enough evidence to reject

Toluca data set

Page 12: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Other tests for model assumptions

• Durbin-Watson test for serially correlated errors (KNNL p 114)

• Modified Levene test for homogeneity of variance (KNNL p 116-118)

• Breusch-Pagan test for homogeneity of variance (KNNL p 118)

• For SAS commands see topic9.sas

Page 13: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Plots vs significance tests• Significance tests results are very

dependent on the sample size; with sufficiently large samples we can reject slight deviations from null hypothesis that would not invalidate results if ignored

• Plots are more likely to suggest a remedy if one is needed

Page 14: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Default graphics with SAS 9.4

proc reg data=toluca; model hours=lotsize; id lotsize;

run;

Page 15: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Will discuss these diagnostics more with multiple regression

Plots provide rule of thumb limits

Questionable observation (30,273)

Page 16: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Additional summaries in these plots

• Rstudent: Studentized residual…almost all should be between ± 2

• Leverage: “Distance” of X from center…helps determine outlying X values in multivariable setting…outlying X values may be influential

• Cook’s D: Influence of ith case on all predicted values

Page 17: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution
Page 18: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution
Page 19: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Lack of fit• When we have repeat observations at

various values of X, we can do a significance test for nonlinearity

• Description in KNNL Section 3.7• Details of approach discussed when

we get to KNNL 17.9, p 762• Basic idea is to compare two models• Gplot with a smooth is a better (i.e.,

simpler) approach for this assesment

Page 20: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

SAS code and outputproc reg data=toluca; model hours=lotsize / lackfit;

run;Analysis of Variance

Source DFSum of

SquaresMean

Square F Value Pr > FModel 1 252378 252378 105.88 <.0001Error 23 54825 2383.71562Lack of Fit 9 17245 1916.06954 0.71 0.6893Pure Error 14 37581 2684.34524Corrected Total 24 307203

Page 21: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Nonlinear relationships• We can model many nonlinear

relationships with linear models, some have several explanatory variables (i.e., need to move to multiple linear regression)–Y = β0 + β1X + β2X2 + e (quadratic)

–Y = β0 + β1log(X) + e

Page 22: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Nonlinear Relationships• Sometimes one can transform a

nonlinear equation into a linear equation

• Consider Y = β0exp(β1X) + e• Can form linear model using log

log(Y) = log(β0) + β1X + dNote that we have changed our

assumptions about the error. If we back-transform Y = β0exp(β1X)exp(d)

Page 23: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Nonlinear Relationships• If we don’t want to alter assumptions

on errors, we can instead perform a nonlinear regression analysis– Means vary in nonlinear manner– Obs deviate about these means

just as in linear regression• KNNL Chapter 13• SAS PROC NLIN

Page 24: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Nonconstant variance• Sometimes we need to model the way in

which the error variance changes– may be functionally related to X– In other words changes with the mean

• In this case, we use a weighted analysis• This is discussed KNNL 11.1• Use a weight statement in PROC REG

Page 25: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Non-Normal errors• Transformations of Y often

allow you to still use linear regression

• If not, use a procedure that allows different distributions for the error term–SAS PROC GENMOD

Page 26: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Generalized Linear Model• Possible distributions of Y:

– Binomial (Y/N or percentage data)– Poisson (Count data)– Gamma (exponential)– Inverse gaussian– Negative binomial– Multinomial

• Specify a link function for E(Y)– May be linear or nonlinear function of X

Page 27: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Transformations Ladder of Re-expression

p

0.0(log)0.5

-0.5-1.0

1.01.5 Transformation

is xp or yp

Transform of y affects error assumptions

Page 28: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Circle of TransformationsWhat quadrant best describes the shape of the relationship

X up, Y up

X up, Y down

X down, Y up

X down, Y down

X

Y

Transform using appropriate power

Page 29: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Box-Cox Transformations

• Also called power transformations• These transformations adjust for

non-Normality and nonconstant variance

• Y´ = Yλ or Y´ = (Yλ - 1)/λ• In the second form, the limit as λ

approaches zero is the (natural) log

Page 30: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Important Special Cases

• λ = 1, Y´ = Y1, no transformation• λ = .5, Y´ = Y1/2, square root• λ = -.5, Y´ = Y-1/2, one over square root• λ = -1, Y´ = Y-1 = 1/Y, inverse• λ = 0, Y´ = (natural) log of Y

Page 31: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Box-Cox Details• We can estimate λ by including it as

a parameter in a non-linear model• Yλ = β0 + β1X + e

and use the method of maximum likelihood to estimate it

• Details are in KNNL p 134-137• SAS code is in boxcox.sas

Page 32: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Box-Cox Solution• Standardized transformed Y is

–K1(Yλ - 1) if λ ≠ 0–K2log(Y) if λ = 0where K2 = (Π Yi)1/n (the geometric mean)and K1 = 1/ (λ K2

λ-1)• Run regressions with X as

explanatory variable • Best choice of λ minimizes SSE

Page 33: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

data a1; input age plasma @@;cards;0 13.44 0 12.84 0 11.91 0 20.090 15.60 1 10.11 1 11.38 1 10.281 8.96 1 8.59 2 9.83 2 9.002 8.65 2 7.85 2 8.88 3 7.943 6.01 3 5.14 3 6.90 3 6.774 4.86 4 5.10 4 5.67 4 5.754 6.23;

Example

Page 34: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Variability large initially and gets smaller. Is it just different at age=0? Or is it more a function of age?

Page 35: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Box Cox Procedure*Procedure that will automatically find the Box-Cox transformation;

proc transreg data=a1;

model boxcox(plasma)=identity(age);

run;

Page 36: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Transformation Information for BoxCox(plasma)

Lambda R-Square Log Like

-2.50 0.76 -17.0444

-2.00 0.80 -12.3665

-1.50 0.83 -8.1127

-1.00 0.86 -4.8523 *

-0.50 0.87 -3.5523 <

0.00 + 0.85 -5.0754 *

0.50 0.82 -9.2925

1.00 0.75 -15.2625

1.50 0.67 -22.1378

2.00 0.59 -29.4720

2.50 0.50 -37.0844

< - Best Lambda

* - Confidence Interval

+ - Convenient Lambda

Based on these results, either take inverse square root or log

Page 37: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution
Page 38: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

*The first part of the program gets the geometric mean;

data a2; set a1; lplasma=log(plasma);

proc univariate data=a2 noprint; var lplasma; output out=a3 mean=meanl;

Doing procedure inside SAS data steps

Page 39: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

data a4; set a2; if _n_ eq 1 then set a3; keep age yl l;k2=exp(meanl);do l = -1.0 to 1.0 by .1;

k1=1/(l*k2**(l-1));yl=k1*(plasma**l -1);if abs(l) < 1E-8 then yl=k2*log(plasma);

output;end;

Page 40: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

proc sort data=a4 out=a4; by l;

proc reg data=a4 noprint outest=a5; model yl=age; by l;

data a5; set a5; n=25; p=2; sse=(n-p)*(_rmse_)**2;

proc print data=a5; var l sse;

Page 41: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Obs l sse1 -1.0 33.90892 -0.9 32.70443 -0.8 31.76454 -0.7 31.09075 -0.6 30.68686 -0.5 30.5596***7 -0.4 30.71868 -0.3 31.17639 -0.2 31.9487

10 -0.1 33.0552

Page 42: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

symbol1 v=none i=join;proc gplot data=a5;

plot sse*l;run;

Page 43: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution
Page 44: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Compare Regressions• Can also create large data set of

standardized transformed Y’s and compare R2 fits

• Best l maximizes R2

• This approach is also provided in the SAS code.

Page 45: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

data a2; set a1;do l = -1.0 to -.1 by .1;

yl=(plasma**l -1)/l; output;

end;l=0; yl=log(plasma); output;do l = .1 to 1 by .1;

yl=(plasma**l -1)/l; output;

end;

Page 46: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

proc sort data=a2 out=a2; by l;

proc reg noprint outest=a3; model yl=age/selection=rsquare; by l;

proc print data=a3; var l _rsq_;proc gplot data=a3; plot _rsq_*l/frame;

run;

Page 47: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution
Page 48: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

data a1; set a1;

tplasma = plasma**(-.5);

tplasma1 = log(plasma);

symbol1 v=circle i=sm50;

proc gplot; plot tplasma*age;

proc gplot; plot tplasma1*age;

run;

Now let’s apply the transformation

Page 49: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution
Page 50: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution
Page 51: Topic 9: Remedies - Purdue Universitybacraig/notes512/Topic_9.pdf · Poisson (Count data) – Gamma (exponential) – Inverse gaussian ... SAS code is in boxcox.sas. Box-Cox Solution

Background Reading

• Sections 3.4 - 3.7 describe significance tests for assumptions (read it if you are interested).

• Box-Cox transformation is in boxcox.sas

• Read Sections 4.1, 4.2, 4.4, 4.5, and 4.6