stata red tutorial
Post on 07-Apr-2015
STATA: The Red tutorial
This tutorial presentation is prepared by
Mohammad Ehsanul Karim (ehsan.karim@gmail.com)
Contents
1. Introduction to Linear Regression
2. Tests for Normality of Residuals
3. Tests for Heteroscedasticity
4. Tests for Multicollinearity
5. Tests for Autocorrelation
6. Detecting Unusual and Influential Data
7. Tests for Model Specification
Linear Regression Analysis
1. Introduction to Linear Regression
Linear Regression
The regress command performs linear regression. The first variable after regress is always the dependent variable (the left-hand-side variable); the independent variables that we choose to include in the model follow it (the right-hand-side variables).
. clear
. use hs1, clear
. regress write read female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
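As a rough illustration of how these estimates can be reused, the sketch below reads a few stored results after regress and computes one fitted value by hand. The reading score of 50 is an arbitrary illustrative value, and female is assumed to be coded 1 for female students. With the coefficients shown above, the fitted writing score comes out at about 54.0.

```stata
* Sketch only: run immediately after the regress command above
display "observations = " e(N)
display "R-squared    = " e(r2)
* Fitted writing score for a female student (assumed coded 1) with read = 50
display _b[_cons] + _b[read]*50 + _b[female]*1
```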
2. Tests for Normality of Residuals
Tests for Normality of Residuals
We use the predict command with the resid option to generate residuals, and we name the residuals r.
. predict r, resid
Shapiro-Wilk W test for Normality
To check that the residuals are normally distributed, an assumption underlying the t and F tests that regress reports, we use the Shapiro-Wilk W test for normal data.
. swilk r

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
           r |    200    0.98714      1.919     1.499    0.06692
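In the output above, Prob>z is 0.067, which is larger than 0.05, so the test does not reject normality of the residuals at the 5% level. As a hedged sketch of a second numerical check, the skewness and kurtosis test (sktest) can be run on the same residuals. The variable name res_chk is illustrative and is chosen only to avoid clashing with r.

```stata
* Sketch only: assumes the regression above is still the active estimation
* res_chk is an illustrative variable name
predict double res_chk, resid
swilk res_chk
* Skewness and kurtosis test for normality, as a second check
sktest res_chk
```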
The kdensity command with the normal option displays a kernel density graph of the residuals with a normal distribution superimposed, which gives a visual check of normality.
. kdensity r, normal
The pnorm command produces a standardized normal probability (P-P) plot, which is another way of checking whether the residuals from the regression are normally distributed.
. pnorm r
The qnorm command produces a normal quantile plot.
It is yet another method for testing if the residuals are normally distributed.
. qnorm r
Summary of Tests for Normality of Residuals
swilk performs the Shapiro-Wilk W test for normality.
kdensity produces a kernel density plot with a normal distribution overlaid.
pnorm graphs a standardized normal probability (P-P) plot.
qnorm plots the quantiles of varname against the quantiles of a normal distribution.
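The three graphical checks can be viewed side by side. The following is a minimal sketch, assuming r still holds the residuals from predict r, resid. The graph names g_kd, g_pp and g_qq are illustrative.

```stata
* Sketch only: draw each diagnostic plot under its own (illustrative) name
kdensity r, normal name(g_kd, replace)
pnorm r, name(g_pp, replace)
qnorm r, name(g_qq, replace)
* Arrange the three stored graphs in a single figure
graph combine g_kd g_pp g_qq
```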
3. Tests for Heteroscedasticity
Tests for Heteroscedasticity
One of the basic assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals.
There are graphical and non-graphical methods for detecting heteroscedasticity.
Cook-Weisberg test for heteroskedasticity
. hettest
Cook-Weisberg test for heteroskedasticity using fitted values of write
     Ho: Constant variance
         chi2(1)      =      5.79
         Prob > chi2  =    0.0161
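Here Prob > chi2 is 0.0161, below 0.05, so the hypothesis of constant variance is rejected at the 5% level. The sketch below shows, under the assumption of a reasonably recent Stata release, the same test in its estat form, White's general test as an alternative, and one common response to heteroscedasticity: refitting with robust standard errors. The robust refit is a suggestion, not a step from the original tutorial.

```stata
* Sketch only: refit the model so the postestimation commands refer to it
quietly regress write read female
* Same test as hettest, in the estat form used by more recent releases
estat hettest
* White's general test for heteroskedasticity
estat imtest, white
* Robust standard errors: coefficients are unchanged, standard errors differ
regress write read female, vce(robust)
```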
For a graphical check, we use the rvfplot command with the yline(0) option to put a reference line at y=0.
. rvfplot, yline(0)
Summary of Tests for Heteroscedasticity
hettest performs the Cook and Weisberg test for heteroskedasticity.
rvfplot graphs a residual-versus-fitted plot.
4. Tests for Multicollinearity
Tests for Multicollinearity
In multiple regression, the concern is not whether multicollinearity exists but how severe it is.
When multicollinearity is severe, the estimates of the regression coefficients become unstable and their standard errors can become wildly inflated.
We can use the vif command after the regression to check for multicollinearity.
vif stands for variance inflation factor.
. vif

    Variable |       VIF       1/VIF
-------------+----------------------
      female |      1.00    0.997182
        read |      1.00    0.997182
-------------+----------------------
    Mean VIF |      1.00
A variable whose VIF is greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is also used to check the degree of collinearity; a tolerance below 0.1 corresponds to a VIF above 10.
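To see where these numbers come from, the VIF of a predictor can be reproduced by regressing it on the other predictors and computing 1/(1 - R-squared). The following is a minimal sketch of that calculation for read, not a replacement for the vif command.

```stata
* Sketch only: auxiliary regression of one predictor on the other predictor(s)
quietly regress read female
* Tolerance is 1 - R-squared, so the VIF is its reciprocal
display "tolerance(read) = " 1 - e(r2)
display "VIF(read)       = " 1/(1 - e(r2))
* Refit the model of interest so later postestimation commands refer to it
quietly regress write read female
```

With only two predictors the auxiliary R-squared is the same whichever predictor is treated as the outcome, which is why female and read share the same VIF in the table.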
Summary of Tests for Multicollinearity
vif calculates the variance inflation factor for the independent variables in the linear model.
5. Tests for Autocorrelation
Tests for Autocorrelation
. tsset id
        time variable:  id, 1 to 200
. dwstat
Durbin-Watson d-statistic(  3,   200) =  1.93992
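The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. Values near 2, such as the 1.94 above, suggest little first-order autocorrelation. The sketch below reproduces the statistic by hand, assuming tsset id has been run as above. The names dw_e, dw_num, dw_den and dw_top are illustrative. The result is only meaningful if the ordering by id reflects a real sequence in the data, which this sketch takes for granted.

```stata
* Sketch only: assumes tsset id has already been run
quietly regress write read female
* estat form of the same test in more recent releases
estat dwatson
* Hand computation of d
predict double dw_e, resid
* L.dw_e is the previous residual; it is missing for the first observation
generate double dw_num = (dw_e - L.dw_e)^2
generate double dw_den = dw_e^2
quietly summarize dw_num
scalar dw_top = r(sum)
quietly summarize dw_den
display "d = " dw_top / r(sum)
```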
6. Detecting Unusual and Influential Data
Detecting Unusual and Influential Data
• Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.
• Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimate of regression coefficients.
• Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.
Here we summarize the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).
Measure        Value
leverage       > (2k+2)/n
abs(rstu)      > 2
Cook's D       > 4/n
abs(DFITS)     > 2*sqrt(k/n)
abs(DFBETA)    > 2/sqrt(n)
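Rather than typing k and n by hand, the cutoffs can be computed from the results that regress stores. The following is a minimal sketch; the scalar names beginning with cut_ are illustrative. For the model in this tutorial (k = 2, n = 200) the cutoffs work out to 0.03 for leverage, 0.02 for Cook's D, 0.2 for DFITS and about 0.141 for DFBETA.

```stata
* Sketch only: e(N) is the number of observations, e(df_m) the number of predictors
quietly regress write read female
scalar cut_lev    = (2*e(df_m) + 2)/e(N)
scalar cut_cooksd = 4/e(N)
scalar cut_dfits  = 2*sqrt(e(df_m)/e(N))
scalar cut_dfbeta = 2/sqrt(e(N))
scalar list cut_lev cut_cooksd cut_dfits cut_dfbeta
```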
We use the predict command with the rstudent option to generate studentized residuals, and we name them r. Studentized residuals are a type of standardized residual that can be used to identify outliers. Because predict will not overwrite an existing variable, the earlier r must be dropped first (drop r) if it still holds the raw residuals.
. predict r, rstudent
. stem r

Stem-and-leaf plot for r (Studentized residuals)

r rounded to nearest multiple of .01
plot in units of .01

  -2** | 50,42
  -2** | 26,21
  -2** | 18
  -1** | 92,85,84,83
  -1** | 75,72,69,61,61,60
  -1** | 50,48,46,46,42
  -1** | 33,32,22,20,20,20
  -1** | 17,16,13,12,10,01
  -0** | 97,97,96,96,93,93,92,92,90,89,89,89,86,86,84,82,82,80,80
  -0** | 74,74,71,70,67
  -0** | 59,59,58,53,49,49,47,42,42,40
  -0** | 35,35,33,31,31,31,30,28,28,28,28,27,25,23,23,22
  -0** | 19,17,16,16,16,16,14,13,13,09,09,07,04,03,03,02
   0** | 00,02,02,04,04,04,04,07,09,11,14,16,16,19
   0** | 21,23,23,24,24,26,28,29,30,33,33,35,35
   0** | 40,44,44,51,51,54,54,54,54,56,56,57,57,57
   0** | 61,63,64,64,64,64,64,66,70,70,71,73,73,73,74,78
   0** | 88,88,89,93,94,94,97,98,99
   1** | 01,06,06,08,08,13,13,13,13,15,19
   1** | 23,29,32,36,36,37,37,39
   1** | 42,43,44,48,51,52,53,55
   1** | 60,68,73,73,75,77
   1** | 80,84
   2** | 16
. sort r
. list r in 1/10

             r
  1.  -2.503566
  2.  -2.421219
  3.  -2.255832
  4.  -2.210221
  5.  -2.178212
  6.  -1.916192
  7.  -1.848524
  8.  -1.843611
  9.  -1.831068
 10.  -1.750652
. list r in -10/l

             r
191.   1.551833
192.   1.602682
193.   1.677923
194.   1.726393
195.   1.730591
196.   1.749522
197.   1.774811
198.   1.798141
199.   1.840841
200.   2.160904
We should pay attention to studentized residuals that exceed +2 or -2, be more concerned about residuals that exceed +2.5 or -2.5, and be more concerned still about residuals that exceed +3 or -3.
. list r if r<-2 | r>2

             r
  1.  -2.503566
  2.  -2.421219
  3.  -2.255832
  4.  -2.210221
  5.  -2.178212
200.   2.160904
. list r if r<-2.5 | r>2.5

             r
  1.  -2.503566
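A compact way to write these conditions uses abs(). Stata treats a missing value as larger than any number, so a condition such as r>2 would also pick up observations with a missing residual. The sketch below guards against that and lists the identifying variables next to the residual. It assumes the dataset has the id variable used with tsset earlier.

```stata
* Sketch only: r holds the studentized residuals created above
* !missing(r) keeps observations with a missing residual out of the listing
list id write read female r if abs(r) > 2 & !missing(r)
count if abs(r) > 2 & !missing(r)
```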
To get leverage values, we use the predict command with the leverage option and name the new variable lev.
. predict lev, leverage
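One possible next step, sketched here under the assumption that the regression is still the active estimation, is to summarize lev and list the observations above the (2k+2)/n rule of thumb. The mean leverage always equals (k+1)/n, which is 0.015 for this model, so the summary is a quick sanity check.

```stata
* Sketch only: the mean of lev should equal (k+1)/n
summarize lev
* List observations above the (2k+2)/n cutoff, with k = e(df_m) and n = e(N)
list id read female lev if lev > (2*e(df_m) + 2)/e(N) & !missing(lev)
```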
Cook's D and DFITS both combine information on the residual and the leverage. The two measures are scaled differently, but they generally give similar answers.
. predict d, cooksd
. list female read d if d>4/_N

       female   read          d
 13.     male     50   .0234054
 39.     male     47   .0212312
123.   female     57   .0202435
142.     male     76   .0327483
. predict dfit, dfits
. list dfit if abs(dfit)>2*sqrt(2/200)
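The two measures can also be checked together, taking k and n from the fitted model instead of typing them. The following is a minimal sketch that assumes d and dfit exist as created above and that the regression is still the active estimation. The variable name infl_flag is illustrative.

```stata
* Sketch only: flag observations that exceed either rule-of-thumb cutoff
generate byte infl_flag = (d > 4/e(N) | abs(dfit) > 2*sqrt(e(df_m)/e(N))) if !missing(d, dfit)
list id female read d dfit if infl_flag == 1
```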
The above measures are general measures of influence.
We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. It is more computationally intensive than summary statistics such as Cook's D. In Stata, the dfbeta command will produce the DFBETAs for each of the predictors.
. dfbeta
                         DFread:  DFbeta(read)
                       DFfemale:  DFbeta(female)
. list DFread DFfemale in 1/5

         DFread    DFfemale
  1.   .0492348    .1971976
  2.  -.0887463   -.1617497
  3.   .0915453    .1802994
  4.   .0434659    .1740918
  5.   .0717626   -.1374498
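To apply the 2/sqrt(n) rule of thumb to the DFBETAs, something like the sketch below could be used. It assumes dfbeta created variables named DFread and DFfemale, as in the output above; other Stata releases may name them differently (for example _dfbeta_1 and _dfbeta_2), in which case the names need adjusting.

```stata
* Sketch only: list observations whose DFBETA exceeds 2/sqrt(n) for either predictor
list id DFread DFfemale ///
    if (abs(DFread) > 2/sqrt(e(N)) | abs(DFfemale) > 2/sqrt(e(N))) ///
    & !missing(DFread, DFfemale)
```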
There are also several graphs that can be used to search for unusual and influential observations.
The avplot command graphs an added-variable plot.
The avplot command works not only for the variables in the model but also for variables that are not in the model, which is why it is called an added-variable plot.
We can do an avplot on variable grade.
. avplot grade
Added-Variable plot
rvpplot is another convenience command; it produces a plot of the residuals versus a specified predictor and is also used after regress or anova.
. rvpplot read
lvr2plot stands for leverage-versus-residual-squared plot.
. lvr2plot
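These diagnostic plots are easier to read when the points are labelled, so that an unusual observation can be traced back to the data. The following is a minimal sketch that assumes the dataset has the id variable used with tsset earlier.

```stata
* Sketch only: label each point with its id
lvr2plot, mlabel(id)
avplot read, mlabel(id)
* Added-variable plots for every predictor in the model, in one figure
avplots
```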
Summary of Detecting Unusual and Influential Data
predict creates predicted values, residuals, and measures of influence.
dfbeta calculates DFBETAs for all the independent variables.
avplot graphs an added-variable plot.
lvr2plot graphs a leverage-versus-squared-residual plot.
rvpplot graphs a residual-versus-predictor plot.
rvfplot graphs a residual-versus-fitted plot.
7. Tests for Model Specification
Tests for Model Specification
A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model.
There are several methods to detect specification errors.
The linktest command performs a model specification link test for single-equation models.
. linktest

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   79.86
       Model |  8005.11739     2  4002.55869           Prob > F      =  0.0000
    Residual |  9873.75761   197   50.120597           R-squared     =  0.4477
-------------+------------------------------           Adj R-squared =  0.4421
       Total |   17878.875   199   89.843593           Root MSE      =  7.0796

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   2.807497   1.052071     2.67   0.008     .7327302    4.882264
      _hatsq |  -.0170281   .0098827    -1.72   0.086    -.0365176    .0024615
       _cons |  -47.29516   27.77544    -1.70   0.090    -102.0705    7.480201
------------------------------------------------------------------------------
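The link test refits the dependent variable on the fitted values (_hat) and their square (_hatsq). If the model is specified correctly, the squared term should add nothing, so the p-value on _hatsq is the one to read. Here it is 0.086, which is not significant at the 5% level. The sketch below reproduces the same regression by hand to show what linktest does. The variable names lt_hat and lt_hatsq are illustrative.

```stata
* Sketch only: the regression that linktest runs internally
quietly regress write read female
predict double lt_hat, xb
generate double lt_hatsq = lt_hat^2
regress write lt_hat lt_hatsq
* Refit the original model before running further postestimation commands
quietly regress write read female
```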
The ovtest command performs a regression specification error test (RESET) for omitted variables.
. ovtest
Ramsey RESET test using powers of the fitted values of write
       Ho:  model has no omitted variables
                 F(3, 194) =      1.95
                  Prob > F =      0.1233
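With Prob > F = 0.1233, the test does not reject the hypothesis of no omitted variables at the 5% level. As a hedged sketch, assuming a reasonably recent Stata release, the same test is available in its estat form, together with a variant based on powers of the predictors rather than of the fitted values.

```stata
* Sketch only: refit the model so the postestimation commands refer to it
quietly regress write read female
* Same RESET test as ovtest, in the estat form
estat ovtest
* Variant that uses powers of the right-hand-side variables
estat ovtest, rhs
```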
Summary of Tests for Model Specification
linktest performs a link test for model specification.
ovtest performs a regression specification error test (RESET) for omitted variables.