
Scatterplot Smoothing Using PROC LOESS and Restricted Cubic Splines

Jonas V. Bilenas

Barclays Global Retail Bank/UK

Adjunct Faculty, Saint Joseph University, School of Business

June 23, 2011

Introduction

• In this tutorial we will look at 2 scatterplot smoothing techniques:
  – The LOESS Procedure:
    • Non-parametric regression smoothing (local regression or DWLS; Distance Weighted Least Squares).
  – Restricted Cubic Splines:
    • Parametric smoothing that can be used in regression procedures to fit functional models.

SUG, RUG, & LUG Pictures

LOESS documentation from SAS

• The LOESS procedure implements a nonparametric method for estimating regression surfaces pioneered by Cleveland, Devlin, and Grosse (1988), Cleveland and Grosse (1991), and Cleveland, Grosse, and Shyu (1992). The LOESS procedure allows great flexibility because no assumptions about the parametric form of the regression surface are needed.

• The main features of the LOESS procedure are as follows:
  – fits nonparametric models
  – supports the use of multidimensional data
  – supports multiple dependent variables
  – supports both direct and interpolated fitting that uses kd trees
  – performs statistical inference
  – performs automatic smoothing parameter selection
  – performs iterative reweighting to provide robust fitting when there are outliers in the data
  – supports graphical displays produced through ODS Graphics

LOESS Procedure Details

• LOESS fits a local regression function to the data within a chosen neighborhood of points.

• The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points. This percentage is specified by a smoothing parameter (0 < smooth <= 1). The larger the smoothing parameter, the smoother the graphed function.
  – The default smoothing parameter is 0.5.
  – The smoothing parameter can also be optimized with the SELECT= option:
    • AICC specifies the AICC criterion.
    • AICC1 specifies the AICC1 criterion.
    • GCV specifies the generalized cross validation criterion.

• The procedure performs a fit weighted by the distance of points from the center of the neighborhood. Observations with missing values are deleted.
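The distance weighting described above can be written out explicitly. The following is a sketch under the usual LOESS conventions (tricube weights and a local linear fit); the symbols x_0, d_i, d_q, and w_i are introduced here for illustration, and any normalizing constant on the weights is omitted because it does not change the fit.

```latex
% x_0 : point at which the smooth is evaluated
% q   : number of points in the local neighborhood (roughly smooth * n)
% d_i = |x_i - x_0|,  d_q = distance from x_0 to its q-th nearest data point
w_i =
\begin{cases}
  \left(1 - \left(\dfrac{d_i}{d_q}\right)^{3}\right)^{3}, & d_i < d_q \\[6pt]
  0, & \text{otherwise}
\end{cases}

% Local linear fit at x_0 by weighted least squares
(\hat\beta_0, \hat\beta_1)
  = \arg\min_{\beta_0,\,\beta_1}
    \sum_{i=1}^{n} w_i \,\bigl(y_i - \beta_0 - \beta_1 (x_i - x_0)\bigr)^2,
\qquad
\hat{y}(x_0) = \hat\beta_0
```

With smooth=0.8 and n=428, as in the example that follows, 0.8 × 428 = 342.4, which is consistent with the 342 points in the local neighborhood reported in the Fit Summary.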

Example of some LOESS

proc loess data=sashelp.cars;
   ods output outputstatistics=outstay;
   model MPG_Highway=MSRP / smooth=0.8 alpha=.05 all;
run;

Fit Summary

Fit Method                          kd Tree
Blending                            Linear
Number of Observations              428
Number of Fitting Points            9
kd Tree Bucket Size                 68
Degree of Local Polynomials         1
Smoothing Parameter                 0.80000
Points in Local Neighborhood        342
Residual Sum of Squares             8913.89292
Trace[L]                            3.77247
GCV                                 0.04953
AICC                                4.05885
AICC1                               1737.19028
Delta1                              424.12399
Delta2                              424.20690
Equivalent Number of Parameters     3.66893
Lookup Degrees of Freedom           424.04109
Residual Standard Error             4.58445


Example of some LOESS

/* Sort by the x variable so that i=j joins the predicted values from left to right */
proc sort data=outstay;
   by MSRP;
run;

axis1 label = (angle=90 "MPG HIGHWAY");
axis2 label = (h=1.5 "MSRP");

symbol1 i=none c=black v=dot h=0.5;                /* observed points */
symbol2 i=j value=none color=red l=1 width=30;     /* LOESS fit       */

proc gplot data=outstay;
   plot (depvar pred)*MSRP / overlay haxis=axis2 vaxis=axis1 grid;
   title "LOESS Smooth=0.8";
run;
quit;

LOESS with ODS GRAPHICS

ods html;
ods graphics on;
proc loess data=sashelp.cars;
   model MPG_Highway=MSRP / smooth=(0.5 0.6 0.7 0.8) alpha=.05 all;
run;
ods graphics off;
ods html close;

Optimized LOESS

ods html;
ods graphics on;
proc loess data=sashelp.cars;
   model MPG_Highway=MSRP / SELECT=AICC;
run;
ods graphics off;
ods html close;

LOESS in SGPLOT

ods html;
ods graphics on;
title 'LOESS/SMOOTH=0.60';
proc sgplot data=sashelp.cars;
   loess x=MSRP y=MPG_Highway / smooth=0.60;
run; quit;
ods graphics off;
ods html close;

Optimized LOESS

ods html;
ods graphics on;
proc loess data=sashelp.cars;
   model MPG_Highway=MSRP Horsepower / SELECT=AICC;
run;
ods graphics off;
ods html close;


LOESS for Time Series Plots

ods html;
ods graphics on;
title 'Time series plot';
proc loess data=ENSO;
   model Pressure = Month / SMOOTH=0.1 0.2 0.3 0.4;
run; quit;
ods graphics off;
ods html close;

Data from Cohen (SUGI 24)

Data also online: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_loess_sect033.htm

LOESS for Time Series Plots (AICC optimized)

Large Number of Observations

• Peter Flom blog: http://www.statisticalanalysisconsulting.com/scatterplots-dealing-with-overplotting/

• Set PLOTS(MAXPOINTS= ) in PROC LOESS. The default limit is 5000.

• Run PROC LOESS on all data, but plot after binning the independent variable and running PROC MEANS on the binned data.
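A minimal sketch of the first option, assuming the same large data set `test` that the binning code below uses and that ODS Graphics is available. MAXPOINTS=NONE lifts the plotting limit entirely; a number larger than the default could be given instead.

```sas
ods graphics on;

/* Lift the default 5000-point limit so the fit plot is still produced */
proc loess data=test plots(maxpoints=none);
   model MPG_Highway = horsepower / smooth=0.4;
run;

ods graphics off;
```

With very many points the resulting scatterplot may still suffer from overplotting, which is the motivation for the binning approach.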

proc loess data=test; /* output 300 for each record */
   ods output outputstatistics=outstay;
   model MPG_Highway=horsepower / smooth=0.4;
run;

proc rank data=outstay groups=100 ties=low out=ranked;
   var horsepower;
   ranks r_horsepower;
run;

proc means data=ranked noprint nway;
   class r_horsepower;
   var depvar pred Horsepower;
   output out=means mean=;
run;

axis1 label = (angle=90 "MPG HIGHWAY");
axis2 label = (h=1.5 "Horsepower");

symbol1 i=none c=black v=dot h=0.5;
symbol2 i=j value=none color=red l=1 width=10;

proc gplot data=means;
   plot (depvar pred)*Horsepower / overlay haxis=axis2 vaxis=axis1 grid;
   title "LOESS Smooth=0.4";
run;
quit;

Large Number of Observations


Restricted Cubic Splines

• Recommended by Frank Harrell

• Knots are specified in advance.

• Placement of knots is not critical. Knots are usually placed at predetermined percentiles, with the number of knots (k) chosen based on sample size:

k    Quantiles
3    .10    .5     .90
4    .05    .35    .65    .95
5    .05    .275   .5     .725   .95
6    .05    .23    .41    .59    .77    .95
7    .025   .1833  .3417  .5     .6583  .8167  .975

Restricted Cubic Splines

• Percentile values can be derived using PROC UNIVARIATE.

• The number of knots can be optimized by choosing the number that minimizes AICC.

• Provides a parametric regression function.

• The knot transformations can sometimes make interpretation difficult.

• May be difficult to incorporate interaction terms.

• Much more efficient than categorizing continuous variables into dummy terms.

• Macro available: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SasMacros/survrisk.txt

Restricted Cubic Splines

proc univariate data=sashelp.cars noprint;
   var horsepower;
   output out=knots pctlpre=P_ pctlpts=5 27.5 50 72.5 95;
run;

proc print data=knots; run;

Obs    P_5    P_27_5    P_50    P_72_5    P_95
  1    115       170     210       245     340

Restricted Cubic Splines

options nocenter mprint;
data test;
   set sashelp.cars;
   %rcspline (horsepower,115, 170, 210, 245, 340);
run;

LOG:
MPRINT(RCSPLINE):   DROP _kd_;
MPRINT(RCSPLINE):   _kd_= (340 - 115)**.666666666666 ;
MPRINT(RCSPLINE):   horsepower1=max((horsepower-115)/_kd_,0)**3+((245-115)*max((horsepower-340)/_kd_,0)**3-(340-115)*max((horsepower-245)/_kd_,0)**3)/(340-245);
MPRINT(RCSPLINE):   ;
MPRINT(RCSPLINE):   horsepower2=max((horsepower-170)/_kd_,0)**3+((245-170)*max((horsepower-340)/_kd_,0)**3-(340-170)*max((horsepower-245)/_kd_,0)**3)/(340-245);
MPRINT(RCSPLINE):   ;
MPRINT(RCSPLINE):   horsepower3=max((horsepower-210)/_kd_,0)**3+((245-210)*max((horsepower-340)/_kd_,0)**3-(340-210)*max((horsepower-245)/_kd_,0)**3)/(340-245);
MPRINT(RCSPLINE):   ;
43   run;
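The statements generated in the log above follow one pattern, which can be read off as a general formula. The notation below (knots t_1 < ... < t_k, scale factor tau, and the positive-part operator) is introduced here for illustration; it is a transcription of what the log shows, not a statement about the macro's internals beyond that.

```latex
% (u)_+ = max(u, 0);  t_1 < t_2 < \dots < t_k are the knots
\tau = (t_k - t_1)^{2/3}

% One constructed variable per knot j = 1, \dots, k-2
x_j = \left(\frac{x - t_j}{\tau}\right)_+^{3}
      + \frac{(t_{k-1} - t_j)\left(\dfrac{x - t_k}{\tau}\right)_+^{3}
              - (t_k - t_j)\left(\dfrac{x - t_{k-1}}{\tau}\right)_+^{3}}
             {t_k - t_{k-1}},
\qquad j = 1, \dots, k-2
```

For the five knots used here (115, 170, 210, 245, 340), k - 2 = 3, which gives the three variables horsepower1 to horsepower3, with tau = (340 - 115)^{2/3} matching _kd_. Together with the original horsepower term, the model therefore needs k - 1 = 4 regression coefficients plus the intercept.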

Restricted Cubic Splines

proc reg data=test;
   model MPG_Highway = horsepower horsepower1 horsepower2 horsepower3;
   LINEAR: TEST horsepower1, horsepower2, horsepower3;
run; quit;

Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               4   8147.64458   2036.91115    145.37   <.0001
Error             423   5926.86710     14.01151
Corrected Total   427        14075

Root MSE           3.74319    R-Square   0.5789
Dependent Mean    26.84346    Adj R-Sq   0.5749
Coeff Var         13.94453

Parameter Estimates

                              Parameter    Standard
Variable      Label       DF   Estimate       Error   t Value   Pr > |t|
Intercept     Intercept    1   63.32145     2.50445     25.28     <.0001
Horsepower                 1   -0.22900     0.01837    -12.46     <.0001
horsepower1                1    0.83439     0.12653      6.59     <.0001
horsepower2                1   -2.53834     0.49019     -5.18     <.0001
horsepower3                1    2.55417     0.66356      3.85     0.0001

Test LINEAR Results for Dependent Variable MPG_Highway

                         Mean
Source          DF     Square   F Value   Pr > F
Numerator        3  750.78949     53.58   <.0001
Denominator    423   14.01151
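One way to overlay the fitted spline on the data is sketched below. It assumes the data set `test` built above; the output data set name `rcs_fit` and the variable name `pred` are hypothetical choices for this illustration, and SGPLOT is used here only as one possible plotting route.

```sas
/* Refit the 5-knot model and keep the predicted values */
proc reg data=test noprint;
   model MPG_Highway = horsepower horsepower1 horsepower2 horsepower3;
   output out=rcs_fit p=pred;
run; quit;

/* Sort by the x variable so the fitted curve is drawn left to right */
proc sort data=rcs_fit;
   by horsepower;
run;

proc sgplot data=rcs_fit;
   scatter x=horsepower y=MPG_Highway;
   series  x=horsepower y=pred / lineattrs=(color=red);
   title "Restricted Cubic Spline (5 Knots)";
run;
```

Because the spline terms are ordinary regressors, the same OUTPUT statement could also request confidence limits if they are wanted on the plot.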

Restricted Cubic Splines (5 Knots)

Restricted Cubic Splines (7 Knots): Time Series Data

Regression terms not significant


References

• Akaike, H. (1973), “Information Theory and an Extension of the Maximum Likelihood Principle,” in Petrov and Csaki, eds., Proceedings of the Second International Symposium on Information Theory, 267–281.

• Cleveland, W. S., Devlin, S. J., and Grosse, E. (1988), “Regression by Local Fitting,” Journal of Econometrics, 37, 87–114.

• Cleveland, W. S. and Grosse, E. (1991), “Computational Methods for Local Regression,” Statistics and Computing, 1, 47–62.

• Cohen, R. A. (SUGI 24), “An Introduction to PROC LOESS for Local Regression,” Paper 273-24.

• Harrell, F. (2010), “Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (Springer Series in Statistics),” Springer.

• Harrell RCSPLINE macro: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SasMacros/survrisk.txt

• Stone, C. J. and Koo, C. Y. (1985), “Additive Splines in Statistics,” in Proceedings of the Statistical Computing Section ASA, 45–48, Washington, DC.