stata red tutorial

Post on 07-Apr-2015


STATA: The Red tutorial

This tutorial presentation is prepared by

Mohammad Ehsanul Karim (ehsan.karim@gmail.com)


Contents

1. Introduction to Linear Regression
2. Tests for Normality of Residuals
3. Tests for Heteroscedasticity
4. Tests for Multicollinearity
5. Tests for Autocorrelation
6. Detecting Unusual and Influential Data
7. Tests for Model Specification

Linear Regression Analysis

1. Introduction to Linear Regression

Linear Regression

The regress command performs linear regression. The first variable after regress is always the dependent (left-hand-side) variable; the independent (right-hand-side) variables that we choose to include in the model follow it.

. clear
. use hs1, clear
. regress write read female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------

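As a small sketch of how the fitted coefficients can be used, the stored results _b[...] give the predicted writing score for a chosen student. It assumes hs1.dta is in the working directory and that female is coded 1 for female and 0 for male (an assumption; the output here does not show the coding). The variable name yhat is hypothetical.

```stata
* Sketch only: assumes hs1.dta is in the current working directory
use hs1, clear
regress write read female

* Predicted write for a female student with read = 50,
* assuming female is coded 1 = female, 0 = male
display _b[_cons] + _b[read]*50 + _b[female]*1

* Fitted values for every observation (yhat is a hypothetical name)
predict yhat, xb
```

With the coefficients reported above, the displayed value is 20.228 + 0.566*50 + 5.487, or about 54.0.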

2. Tests for Normality of Residuals

Tests for Normality of Residuals

We use the predict command with the resid option to generate residuals, and we name the residuals r.

. predict r, resid

Shapiro-Wilk W test for Normality

To check whether the residuals are normally distributed, an assumption that underlies the t and F tests in regression, we use the Shapiro-Wilk W test for normal data.

. swilk r

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
           r |    200    0.98714      1.919     1.499    0.06692

Here Prob>z = 0.067 is above 0.05, so the test does not reject normality of the residuals at the 5% level.

The kdensity command with the normal option displays a kernel density graph of the residuals with a normal distribution superimposed on the graph.

. kdensity r, normal

The pnorm command produces a standardized normal probability (P-P) plot, another graphical check of whether the residuals from the regression are normally distributed.

. pnorm r

The qnorm command produces a normal quantile plot. It is yet another graphical check of whether the residuals are normally distributed.

. qnorm r

Summary of Tests for Normality of Residuals

swilk performs the Shapiro-Wilk W test for normality.

kdensity produces a kernel density plot with a normal distribution overlaid.

pnorm graphs a standardized normal probability (P-P) plot.

qnorm plots the quantiles of varname against the quantiles of a normal distribution.
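The four checks summarized above can be run in one pass. The following is a minimal sketch, assuming hs1.dta is in the working directory and that no variable named r exists yet.

```stata
* Sketch only: assumes hs1.dta is in the current working directory
use hs1, clear
regress write read female

* Residuals from the model above
predict r, resid

* Shapiro-Wilk W test
swilk r

* Kernel density of the residuals with a normal density overlaid
kdensity r, normal

* P-P plot: most sensitive to departures in the middle of the distribution
pnorm r

* Quantile plot: most sensitive to departures in the tails
qnorm r
```

Reading the formal test together with the plots is usually more informative than either alone, because the plots show where any departure from normality occurs.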


3. Tests for Heteroscedasticity

Tests for Heteroscedasticity

One of the basic assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals.

There are graphical and non-graphical methods for detecting heteroscedasticity.

Cook-Weisberg test for heteroskedasticity

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of write
     Ho: Constant variance
         chi2(1)      =     5.79
         Prob > chi2  =   0.0161

Here Prob > chi2 = 0.016 is below 0.05, so the test rejects the null hypothesis of constant variance at the 5% level.

As a graphical check, we use the rvfplot command, which plots residuals against fitted values, with the yline(0) option to put a reference line at y=0.

. rvfplot, yline(0)

Summary of Tests for Heteroscedasticity

hettest performs the Cook and Weisberg test for heteroscedasticity.

rvfplot graphs a residual-versus-fitted plot.
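In newer Stata releases the same test is reached through the estat postestimation command. The sketch below assumes hs1.dta is in the working directory; estat hettest reports the Breusch-Pagan / Cook-Weisberg test using the fitted values by default.

```stata
* Sketch only: assumes hs1.dta is in the current working directory
use hs1, clear
regress write read female

* Breusch-Pagan / Cook-Weisberg test (newer syntax for hettest)
estat hettest

* Residual-versus-fitted plot with a reference line at zero
rvfplot, yline(0)
```

In the plot, a band of residuals that widens or narrows as the fitted values increase points to non-constant variance.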


4. Tests for Multicollinearity

Tests for Multicollinearity

Multicollinearity is a concern in multiple regression not because it exists, but because of its degree.

With severe multicollinearity, the estimated regression coefficients become unstable and their standard errors can become wildly inflated.

We can use the vif command after the regression to check for multicollinearity.

vif stands for variance inflation factor.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
      female |      1.00    0.997182
        read |      1.00    0.997182
-------------+----------------------
    Mean VIF |      1.00

A variable whose VIF is greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is also used to check the degree of collinearity. A tolerance value lower than 0.1 corresponds to a VIF above 10.
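To see where these numbers come from, the VIF of a predictor can be reproduced by hand: regress that predictor on the other predictors and take 1/(1 - R-squared). A minimal sketch for read, assuming hs1.dta is in the working directory:

```stata
* Sketch only: assumes hs1.dta is in the current working directory
use hs1, clear

* Auxiliary regression of one predictor on the other predictor(s)
quietly regress read female

* Tolerance is 1 - R-squared of the auxiliary regression
display "tolerance = " 1 - e(r2)

* VIF is the reciprocal of the tolerance
display "VIF       = " 1/(1 - e(r2))
```

With only two predictors the auxiliary R-squared is the same whichever one is treated as the outcome, which is why female and read share the same VIF in the table above.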


Summary of Tests for Multicollinearity

vif calculates the variance inflation factor for the independent variables in the linear model.

5. Tests for Autocorrelation

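Tests for Autocorrelation

The Durbin-Watson test needs an ordering of the observations, so we first declare id as the time variable with tsset and then run dwstat after the regression.

. tsset id
        time variable:  id, 1 to 200

. dwstat

Durbin-Watson d-statistic(  3,   200) =  1.93992

A d-statistic close to 2, as here, gives little evidence of first-order autocorrelation.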
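For reference, the Durbin-Watson statistic can be written out as follows. The notation is not from the slides: e_t denotes the residual of observation t in the declared order, n the number of observations, and \hat{\rho} the sample first-order autocorrelation of the residuals.

```latex
d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
  \approx 2\,(1 - \hat{\rho})
```

The approximation shows why values near 2 correspond to \hat{\rho} near 0, values toward 0 to positive autocorrelation, and values toward 4 to negative autocorrelation.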

6. Detecting Unusual and Influential Data

Detecting Unusual and Influential Data

• Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.

• Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimate of regression coefficients.

• Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

Here we summarize the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).

Measure        Value
leverage       > (2k+2)/n
abs(rstu)      > 2
Cook's D       > 4/n
abs(DFITS)     > 2*sqrt(k/n)
abs(DFBETA)    > 2/sqrt(n)
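For the model used in this tutorial (k = 2 predictors, n = 200 observations), the cutoffs can be evaluated directly. This is a small sketch meant to be run as a do-file; the local macro names k and n are chosen here for illustration.

```stata
* Rule-of-thumb cutoffs for k = 2 predictors and n = 200 observations
local k = 2
local n = 200

display "leverage cutoff = " (2*`k' + 2)/`n'
display "Cook's D cutoff = " 4/`n'
display "DFITS cutoff    = " 2*sqrt(`k'/`n')
display "DFBETA cutoff   = " 2/sqrt(`n')
```

These evaluate to 0.03, 0.02, 0.2, and about 0.141 respectively.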

We use the predict command with the rstudent option to generate studentized residuals, and we name the residuals r. Studentized residuals are a type of standardized residual that can be used to identify outliers. If the variable r from Section 2 is still in memory, drop it first (drop r), because predict will not overwrite an existing variable.

. predict r, rstudent

. stem r

Stem-and-leaf plot for r (Studentized residuals)

r rounded to nearest multiple of .01
plot in units of .01

  -2** | 50,42
  -2** | 26,21
  -2** | 18
  -1** | 92,85,84,83
  -1** | 75,72,69,61,61,60
  -1** | 50,48,46,46,42
  -1** | 33,32,22,20,20,20
  -1** | 17,16,13,12,10,01
  -0** | 97,97,96,96,93,93,92,92,90,89,89,89,86,86,84,82,82,80,80
  -0** | 74,74,71,70,67
  -0** | 59,59,58,53,49,49,47,42,42,40
  -0** | 35,35,33,31,31,31,30,28,28,28,28,27,25,23,23,22
  -0** | 19,17,16,16,16,16,14,13,13,09,09,07,04,03,03,02
   0** | 00,02,02,04,04,04,04,07,09,11,14,16,16,19
   0** | 21,23,23,24,24,26,28,29,30,33,33,35,35
   0** | 40,44,44,51,51,54,54,54,54,56,56,57,57,57
   0** | 61,63,64,64,64,64,64,66,70,70,71,73,73,73,74,78
   0** | 88,88,89,93,94,94,97,98,99
   1** | 01,06,06,08,08,13,13,13,13,15,19
   1** | 23,29,32,36,36,37,37,39
   1** | 42,43,44,48,51,52,53,55
   1** | 60,68,73,73,75,77
   1** | 80,84
   2** | 16

. sort r

. list r in 1/10

             r
  1. -2.503566
  2. -2.421219
  3. -2.255832
  4. -2.210221
  5. -2.178212
  6. -1.916192
  7. -1.848524
  8. -1.843611
  9. -1.831068
 10. -1.750652

. list r in -10/l

             r
191.  1.551833
192.  1.602682
193.  1.677923
194.  1.726393
195.  1.730591
196.  1.749522
197.  1.774811
198.  1.798141
199.  1.840841
200.  2.160904

We should pay attention to studentized residuals that exceed +2 or -2, be more concerned about residuals that exceed +2.5 or -2.5, and be still more concerned about residuals that exceed +3 or -3.

. list r if r<-2 | r>2

             r
  1. -2.503566
  2. -2.421219
  3. -2.255832
  4. -2.210221
  5. -2.178212
200.  2.160904

. list r if r<-2.5 | r>2.5

             r
  1. -2.503566

To get leverage values, we use the predict command with the leverage option, and we name them lev.

. predict lev, leverage
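The leverage rule of thumb, (2k+2)/n, can then be applied to lev. The sketch below assumes that regress write read female has been run, that lev was created as above, and that all observations in memory were used in the estimation, so that _N equals n.

```stata
* Cutoff (2k+2)/n with k = 2 predictors; _N is the number of observations
* !missing(lev) keeps missing values from being treated as large
list read female lev if lev > (2*2 + 2)/_N & !missing(lev)
```

Observations listed here are worth inspecting, though high leverage alone does not mean that a point is influential.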

Cook's D and DFITS both combine information on the residual and leverage. The two measures are scaled differently, but they give similar answers.

. predict d, cooksd

. list female read d if d>4/_N

       female   read          d
 13.     male     50   .0234054
 39.     male     47   .0212312
123.   female     57   .0202435
142.     male     76   .0327483

. predict dfit, dfits

With k = 2 predictors and n = 200 observations, the DFITS cutoff is 2*sqrt(2/200) = 0.2.

. list dfit if abs(dfit)>2*sqrt(2/_N)

The above measures are general measures of influence.

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. It is more computationally intensive than summary statistics such as Cook's D.

In Stata, the dfbeta command will produce the DFBETAs for each of the predictors.

. dfbeta
                      DFread:  DFbeta(read)
                    DFfemale:  DFbeta(female)

. list DFread DFfemale in 1/5

         DFread    DFfemale
  1.   .0492348    .1971976
  2.  -.0887463   -.1617497
  3.   .0915453    .1802994
  4.   .0434659    .1740918
  5.   .0717626   -.1374498
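To flag observations rather than list the first few, the DFBETA rule of thumb 2/sqrt(n) can be applied. The names that dfbeta gives its variables can differ across Stata versions, so this sketch instead creates one explicitly with predict; dfb_read is a hypothetical name, and the sketch assumes regress write read female is the most recent estimation.

```stata
* DFBETA for the coefficient on read, stored under a name we choose
predict dfb_read, dfbeta(read)

* Flag observations whose deletion shifts the read coefficient noticeably
list read female dfb_read if abs(dfb_read) > 2/sqrt(_N) & !missing(dfb_read)
```

The same two lines can be repeated with dfbeta(female) to examine the other coefficient.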

There are also several graphs that can be used to search for unusual and influential observations.

The avplot command graphs an added-variable plot. It works not only for variables in the model but also for variables that are not in the model, which is why it is called an added-variable plot.

We can do an avplot on the variable grade.

. avplot grade

[Figure: added-variable plot for grade]

rvpplot is another convenience command. It produces a plot of the residuals versus a specified predictor, and it is also used after regress or anova.

. rvpplot read

lvr2plot stands for leverage-versus-residual-squared plot.

. lvr2plot

Summary of Detecting Unusual and Influential Data

predict creates predicted values, residuals, and measures of influence.

dfbeta calculates DFBETAs for all the independent variables.

avplot graphs an added-variable plot.

lvr2plot graphs a leverage-versus-squared-residual plot.

rvpplot graphs a residual-versus-predictor plot.

rvfplot graphs a residual-versus-fitted plot.

7. Tests for Model Specification

Tests for Model Specification

A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model.

There are several methods to detect specification errors.

The linktest command performs a model specification link test for single-equation models.

. linktest

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   79.86
       Model |  8005.11739     2  4002.55869           Prob > F      =  0.0000
    Residual |  9873.75761   197   50.120597           R-squared     =  0.4477
-------------+------------------------------           Adj R-squared =  0.4421
       Total |   17878.875   199   89.843593           Root MSE      =  7.0796

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   2.807497   1.052071     2.67   0.008     .7327302    4.882264
      _hatsq |  -.0170281   .0098827    -1.72   0.086    -.0365176    .0024615
       _cons |  -47.29516   27.77544    -1.70   0.090    -102.0705    7.480201
------------------------------------------------------------------------------

If the model is specified correctly, _hatsq should not be significant. Here its p-value is 0.086, so it is not significant at the 5% level.

The ovtest command performs a regression specification error test (RESET) for omitted variables.

. ovtest

Ramsey RESET test using powers of the fitted values of write
       Ho:  model has no omitted variables
                 F(3, 194) =      1.95
                  Prob > F =      0.1233

Here Prob > F = 0.123, so the test does not reject the null hypothesis of no omitted variables at the 5% level.

Summary of Tests for Model Specification

linktest performs a link test for model specification.

ovtest performs a regression specification error test (RESET) for omitted variables.
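Both specification checks can be run back to back. The following is a minimal sketch, assuming hs1.dta is in the working directory; it uses estat ovtest, the newer syntax for ovtest, and refits the original model before the RESET test as a precaution so that the test clearly refers to that model.

```stata
* Sketch only: assumes hs1.dta is in the current working directory
use hs1, clear
regress write read female

* Link test: _hatsq should be insignificant under correct specification
linktest

* Refit the original model before the next postestimation test
quietly regress write read female

* Ramsey RESET test (newer syntax for ovtest)
estat ovtest
```

If either test signals a problem, candidate remedies include adding omitted predictors or transforming existing ones, followed by rerunning both checks.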
