Week 5 Slide #1
Adjusted R², Residuals, and Review
• Adjusted R²
• Residual Analysis
• Stata Regression Output revisited
– The Overall Model
– Analyzing Residuals
• Review for Exam 2
Week 5 Slide #2
Exercise Review
– Use the caschool.dta dataset
– Run a model in Stata using Average Income (avginc) to predict Average Test Scores (testscr)
– Examine the univariate distributions of both variables and the residuals
• Walk through the entire interpretation
• Build a Stata do-file as you go
Week 5 Slide #3
Exercise Review, continued
      Source |       SS       df       MS              Number of obs =     420
-------------+------------------------------           F(  1,   418) =  430.83
       Model |   77204.394     1   77204.394           Prob > F      =  0.0000
    Residual |  74905.1997   418  179.199042           R-squared     =  0.5076
-------------+------------------------------           Adj R-squared =  0.5064
       Total |  152109.594   419  363.030056           Root MSE      =  13.387

------------------------------------------------------------------------------
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avginc |    1.87855   .0905044    20.76   0.000     1.700649     2.05645
       _cons |   625.3836   1.532405   408.11   0.000     622.3714    628.3958
------------------------------------------------------------------------------
Week 5 Slide #4
Exercise Review, continued
[Figure: scatterplot of testscr (vertical axis, 600 to 750) against avginc (horizontal axis, 0 to 60), with fitted values and 95% CI]
Week 5 Slide #5
Adjusted R²: An Alternative “Goodness of Fit” Measure
• Recall that R² is calculated as:
$$R^2 = \frac{ESS}{TSS}, \quad \text{where } ESS = \sum_i (\hat{Y}_i - \bar{Y})^2 \text{ and } TSS = \sum_i (Y_i - \bar{Y})^2$$
ESS: “explained sum of squares”; TSS: “total sum of squares”
• Hypothetically, as K approaches n, R² approaches one (why?) – “degrees of freedom”
• Adjusted R² compensates for that tendency
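As a quick check of the definition, the sketch below recomputes R² from the sums of squares in the Stata output on Slide #3, treating the "Model" SS as ESS and the "Total" SS as TSS. The function name `r_squared` is only an illustrative choice.

```python
# Sums of squares copied from the Stata output for the testscr-on-avginc model
ESS = 77204.394    # "Model" sum of squares (explained)
TSS = 152109.594   # "Total" sum of squares


def r_squared(ess, tss):
    """Share of the total variation in Y that the model explains."""
    return ess / tss


r2 = r_squared(ESS, TSS)
print(round(r2, 4))  # compare with R-squared = 0.5076 in the Stata output
```

The difference TSS - ESS should also reproduce the "Residual" SS row of the same table, up to rounding.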
Week 5 Slide #6
Calculating Adjusted R²
$$R_a^2 = R^2 - \frac{K-1}{n-K}\,(1 - R^2)$$
• The bigger the sample size (n), the smaller the adjustment
• The more complex the model (the bigger K is), the larger the adjustment
• The bigger R² is, the smaller the adjustment
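The formula can be checked against the Stata output with a few lines of Python. This is a minimal sketch that assumes K counts every estimated coefficient including the constant (so K = 2 for the testscr-on-avginc model); the function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = R^2 - ((K - 1) / (n - K)) * (1 - R^2).

    k counts every estimated coefficient, including the constant
    (an assumption about how the slide defines K).
    """
    return r2 - (k - 1) / (n - k) * (1 - r2)


# Values from the Stata output: R^2 = Model SS / Total SS, n = 420, K = 2
r2 = 77204.394 / 152109.594
adj = adjusted_r2(r2, n=420, k=2)
print(round(adj, 4))  # compare with Adj R-squared = 0.5064 in the Stata output

# Size of the adjustment under different sample sizes and model sizes
print(r2 - adjusted_r2(r2, n=420, k=2))   # baseline
print(r2 - adjusted_r2(r2, n=42, k=2))    # smaller n: larger adjustment
print(r2 - adjusted_r2(r2, n=420, k=10))  # bigger K: larger adjustment
```

With n = 420 and only two coefficients the adjustment is tiny, which is why R-squared and Adj R-squared in the output differ only in the third decimal place.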
Week 5 Slide #7
Residual Analysis: Trouble Shooting
• Conceptual use of residuals
– e, or what the model can’t explain
• Visual Diagnostics
– Ideal: a “Sneeze plot”
– Diagnostics using Residual Plots:
  • Checking for heteroscedasticity
  • Checking for non-linearity
  • Checking for outliers
• Saving and Analyzing Residuals in Stata
Week 5 Slide #8
Review: Assumptions Necessary for Estimating Linear Models
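1. Errors have identical distributions
   Zero mean, same variance, across the range of X
2. Errors are independent of X and of other errors
   $E[\varepsilon_i] \neq f(X)$ and $E[\varepsilon_i] \neq f(\varepsilon_j,\ j \neq i)$
3. Errors are normally distributed
[Figure: error distribution centered at $\varepsilon_i = 0$, drawn against X]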
Week 5 Slide #9
The Ideal: Sneeze Splatter
[Figure: residuals (e) plotted against Predicted Y]
Problems: It is possible to “over-interpret” residual plots; it is also possible to miss patterns when there are large numbers of observations
Week 5 Slide #10
Heteroscedasticity
[Figure: residuals (e) plotted against Predicted Y]
Problem: Error variance is not constant; the estimated standard errors are biased, so hypothesis tests are invalid
Week 5 Slide #11
Non-Linearity
[Figure: residuals (e) plotted against Predicted Y]
Problem: Biased estimated coefficients, inefficient model
Week 5 Slide #12
Checking for Outliers
[Figure: residuals (e) plotted against Predicted Y, in two panels: “Residuals for model using all data”, with possible outliers marked, and “Residuals for model with outliers deleted”]
Problem: Under-specified model; measurement error
Week 5 Slide #13
Stata Regression Model: Regressing “testscr” onto “avginc”
      Source |       SS       df       MS              Number of obs =     420
-------------+------------------------------           F(  1,   418) =  430.83
       Model |   77204.394     1   77204.394           Prob > F      =  0.0000
    Residual |  74905.1997   418  179.199042           R-squared     =  0.5076
-------------+------------------------------           Adj R-squared =  0.5064
       Total |  152109.594   419  363.030056           Root MSE      =  13.387

------------------------------------------------------------------------------
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avginc |    1.87855   .0905044    20.76   0.000     1.700649     2.05645
       _cons |   625.3836   1.532405   408.11   0.000     622.3714    628.3958
------------------------------------------------------------------------------
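To see how the pieces of the output fit together, the following sketch recomputes the overall F statistic, the Root MSE, and the t statistic for avginc from other numbers in the same table. It only re-derives quantities Stata already prints; the variable names are illustrative.

```python
import math

# Values copied from the Stata output
ms_model, ms_resid = 77204.394, 179.199042   # mean squares (SS / df)
coef, se = 1.87855, 0.0905044                # avginc slope and its standard error

f_stat = ms_model / ms_resid     # overall F test, F(1, 418)
root_mse = math.sqrt(ms_resid)   # standard deviation of the residuals
t_stat = coef / se               # t test for the avginc slope

print(round(f_stat, 2))     # compare with F(1, 418) = 430.83
print(round(root_mse, 3))   # compare with Root MSE = 13.387
print(round(t_stat, 2))     # compare with t = 20.76
```

With a single predictor, the squared t statistic for the slope equals the overall F statistic (up to rounding), so both tests tell the same story for this model.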
Week 5 Slide #14
Regression Plot (again)
[Figure: scatterplot of testscr (vertical axis, 600 to 750) against avginc (horizontal axis, 0 to 60), with fitted values and 95% CI]
Week 5 Slide #15
Residual Plot
[Figure: histogram of the residuals; horizontal axis Residual (-40 to 40), vertical axis Density (0 to .03)]
Week 5 Slide #16
Examination of Residuals
gsort e (or use “-e” to sort in descending order, as in the listing below)
list observat testscr avginc yhat e in 1/5

. list observat testscr avginc yhat e in 1/5

     +---------------------------------------------------+
     | observat   testscr   avginc       yhat          e |
     |---------------------------------------------------|
  1. |      393     683.4   13.567   650.8699   32.53016 |
  2. |      386     681.6   14.177   652.0157    29.5842 |
  3. |      419     672.2    9.952   644.0789   28.12111 |
  4. |      366     675.7   11.834   647.6143   28.08568 |
  5. |      371    676.95   12.934   649.6807   27.26921 |
     +---------------------------------------------------+
Use the case ID number to find the relevant observation in the data set
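The yhat and e columns follow directly from the fitted equation. As a rough check, this sketch rebuilds them for the five listed cases from the rounded coefficients in the regression output, so small differences in the last decimals are expected.

```python
# Coefficients from the Stata output (rounded as displayed)
b0, b1 = 625.3836, 1.87855

# (observat, testscr, avginc) for the five cases in the listing
rows = [
    (393, 683.4, 13.567),
    (386, 681.6, 14.177),
    (419, 672.2, 9.952),
    (366, 675.7, 11.834),
    (371, 676.95, 12.934),
]

residuals = {}
for obs, testscr, avginc in rows:
    yhat = b0 + b1 * avginc   # predicted test score
    e = testscr - yhat        # residual: observed minus predicted
    residuals[obs] = e
    print(obs, round(yhat, 2), round(e, 2))
```

All five residuals are large and positive: these districts score well above what their average income alone would predict.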
Week 5 Slide #17
Residuals v. Predicted Values
Using an “ocular test,” non-linearity seems probable, but heteroscedasticity is not obvious here. But should we trust our eyeballs?
[Figure: Residuals (vertical axis, -40 to 40) plotted against Fitted values (horizontal axis, 640 to 740)]
Week 5 Slide #18
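Formal Test for Non-linearity: Omitted Variables
Tests whether adding 2nd, 3rd and 4th powers of X will improve the fit of the model:
$$Y = b_0 + b_1 X + b_2 X^2 + b_3 X^3 + b_4 X^4 + e$$
(The Stata output below uses powers of the fitted values; with a single predictor the fitted values are a linear function of X, so the two versions are equivalent.)

. ovtest

Ramsey RESET test using powers of the fitted values of testscr
       Ho:  model has no omitted variables
                 F(3, 415) =     17.75
                  Prob > F =      0.0000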
Week 5 Slide #19
Formal Tests for Heteroscedasticity
Tests to see whether the squared standardized residuals are linearly related to the predicted value of Y:
$$\mathrm{std}(e^2) = b_0 + b_1(\text{Predicted } Y)$$
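The auxiliary regression described on this slide can be sketched in plain Python. The data below are hypothetical (not caschool.dta) and built so that the error spread grows with x; "standardized" is taken here to mean dividing each squared residual by the mean squared residual, which is an assumption. The sketch shows only the auxiliary slope, not the test statistic Stata would report.

```python
def ols(x, y):
    """Simple (one-predictor) OLS: returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical data (NOT caschool.dta): errors alternate in sign and grow with x
x = list(range(1, 21))
y = [10 + 2 * xi + (0.5 * xi if xi % 2 else -0.5 * xi) for xi in x]

# Step 1: fit the original model and save predicted values and residuals
b0, b1 = ols(x, y)
yhat = [b0 + b1 * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, yhat)]

# Step 2: scale the squared residuals, then regress them on predicted Y
sigma2 = sum(ei ** 2 for ei in e) / len(e)
scaled_e2 = [ei ** 2 / sigma2 for ei in e]
a0, a1 = ols(yhat, scaled_e2)
print(a1)  # a slope clearly different from zero points to heteroscedasticity
```

Under constant error variance the squared residuals should show no trend in the predicted values, so the auxiliary slope would sit near zero.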
Week 5 Slide #20
Case-wise Influence Analysis
The Leverage versus Squared Residual Plot
[Figure: Leverage (vertical axis, 0 to .08) plotted against Normalized residual squared (horizontal axis, 0 to .02), with each point labeled by its observation number]
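The two axes of this plot can be computed by hand for a one-predictor model. The sketch below uses hypothetical data (not caschool.dta) and assumes that "normalized residual squared" means each squared residual divided by the sum of squared residuals; the function name is illustrative.

```python
def leverage_and_norm_resid_sq(x, y):
    """For a one-predictor OLS fit, return (leverage, normalized e^2) per case."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(ei ** 2 for ei in e)
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]   # hat-matrix diagonal
    nrs = [ei ** 2 / sse for ei in e]                    # share of the residual SS
    return lev, nrs


# Hypothetical data (NOT caschool.dta):
# case 6 has an extreme x value, case 5 sits far from the fitted line
x = [1, 2, 3, 4, 5, 15]
y = [2.1, 3.9, 6.2, 7.8, 14.0, 30.5]

lev, nrs = leverage_and_norm_resid_sq(x, y)
for case, (h, r) in enumerate(zip(lev, nrs), start=1):
    print(case, round(h, 3), round(r, 3))
```

Cases high on the leverage axis have unusual X values; cases far to the right have large residuals. A case that is extreme on both is the one most likely to be driving the estimates.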
Week 5 Slide #21
What to Do?
• Nonlinearity
– Polynomial regression: try X and X²
– Variable transformation: logged variables
– Use non-OLS regression (curve fitting)
• Heteroscedasticity
– Re-specify model
  • Omitted variables?
  • Use non-OLS regression (WLS)
  • Use robust standard errors
• Influential and Deviant Cases
– Evaluate the cases
– Run with controls (multivariate model)
– Omit cases (last option)
Week 5 Slide #22
Next Week
• Review regression diagnostics
• Introduction to Matrix Algebra
• Review for Exam