Simple and Multiple Regression
2.1 Simple Linear Regression
Let's examine the relationship between the size of school and academic performance to see if the size of the school is related to academic performance.
For this example, api00 is the dependent variable and enroll is the predictor.
Dependent variable: api00 (academic performance of the school)
Independent variable: enroll (number of students)
F-test: F = 44.83, which means that the model is statistically significant.
R-squared: approximately 10% of the variance of api00 is accounted for by the model, in this case by enroll.
T-test: the t value for enroll is -6.70, which is statistically significant, meaning that the regression coefficient for enroll is significantly different from zero.
Coefficient: the coefficient for enroll is -.1998674, or approximately -.2, meaning that for a one-unit increase in enroll, we would expect a .2-unit decrease in api00.
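A minimal sketch of the command that produces these statistics, assuming the school dataset containing api00 and enroll is already loaded in memory (the slides do not show how the data are loaded):

```stata
* Assumption: the school data with api00 and enroll are already in memory.
* Simple linear regression of api00 (outcome) on enroll (predictor).
regress api00 enroll
```

In the regress output, the F statistic and R-squared appear in the header block, while the coefficient and t value for enroll appear in the coefficient table.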
Predicted Value
After you run a regression, you can create a variable that contains the predicted values using the predict command.
For this example, our new variable name will be fv
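As a sketch, predict is run right after the regression, again assuming the school data are in memory. Listing the first ten observations is only an illustrative choice, not something the slides specify:

```stata
* Fit the model first; predict uses the most recent estimation results.
regress api00 enroll
* With no options, predict stores the linear prediction (fitted values).
predict fv
* Illustrative check: compare observed and fitted values for a few schools.
list api00 fv in 1/10
```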
Below we can show a scatterplot of the outcome variable, api00 and the predictor, enroll.
[Figure: scatterplot of api 2000 (vertical axis) against number of students (horizontal axis)]
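A plot of this kind can be drawn with a one-line command, sketched here on the assumption that the school data are in memory:

```stata
* Outcome on the vertical axis, predictor on the horizontal axis.
scatter api00 enroll
```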
We can combine scatter with lfit to show a scatterplot with fitted values.
[Figure: scatterplot of api 2000 against number of students, with the fitted values line]
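One way to overlay the two plot types is with twoway, shown here as a sketch that assumes the school data are in memory:

```stata
* Overlay the raw data points and the linear fit of api00 on enroll.
twoway (scatter api00 enroll) (lfit api00 enroll)
```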
If you use the mlabel(snum) option on the scatter command, you can see the school number for each point. This allows us to see, for example, that one of the outliers is school 2910.
[Figure: scatterplot of api 2000 against number of students, with each point labeled by its school number (snum) and the fitted values line]
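The labeled version of the plot can be sketched as follows, assuming the school data (including the school number variable snum) are in memory:

```stata
* mlabel(snum) labels each marker with the value of snum for that school.
twoway (scatter api00 enroll, mlabel(snum)) (lfit api00 enroll)
```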
2.2 Multiple Regression
Dependent variable: api00 (academic performance of the school)
Independent variables: ell (English language learners), meals (pct free meals), yr_rnd (year round school), mobility (pct 1st year in school), acs_k3 (avg class size k-3), acs_46 (avg class size 4-6), full (pct full credential), emer (pct emer credential), enroll (number of students)
Output to examine: F statistic, R-squared, adjusted R-squared, t values, coefficients
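A sketch of the corresponding command, assuming the school data are in memory and that all nine predictors listed above enter the model together:

```stata
* Multiple regression: the outcome comes first, followed by the predictors.
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
```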
But how do we compare the relative importance of the coefficients?
Use the regress command with the beta option, which reports standardized coefficients.
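As a sketch, the beta option is added after the comma. The predictor list here is assumed to be the same as in the multiple regression above:

```stata
* beta reports standardized coefficients in place of confidence intervals.
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, beta
```

Because the standardized coefficients are all measured in standard deviation units, their magnitudes can be compared across predictors.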
Let us compare the regress output with the listcoef output. You will notice that the values in the Coef., t, and P>|t| columns are the same in the two outputs.
The bStdX column gives the unit change in Y expected with a one standard deviation change in X.
The bStdY column gives the standard deviation change in Y expected with a one unit change in X.
The SDofX column gives the standard deviation of each predictor variable in the model.
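A sketch of how listcoef is typically used. Note that listcoef is a user-written command (part of the SPost package), not shipped with Stata, so this assumes it has already been installed:

```stata
* Assumption: the user-written listcoef command (SPost) is installed.
* listcoef works from the most recent estimation results.
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
listcoef
```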
2.3 Hypothesis Testing
Single coefficient, multiple coefficients
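Both kinds of test can be sketched with the test command after regress. The choice of ell for the single-coefficient test and of acs_k3 and acs_46 for the joint test is illustrative; the slides do not say which variables were tested:

```stata
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
* Single coefficient: null hypothesis that the coefficient on ell is zero.
test ell
* Multiple coefficients: null hypothesis that both class-size
* coefficients are zero at the same time.
test acs_k3 acs_46
```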
Correlation
As part of doing a multiple regression analysis you might be interested in seeing the correlations among the variables in the regression model.
You can use correlate command as shown below.
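You can also use pwcorr, which handles missing values pairwise (correlate drops any observation with a missing value on any listed variable). Useful option: sig, which reports significance levels.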
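A sketch of both commands, assuming the school data are in memory and using the variables from the multiple regression model. The obs option is an addition not mentioned in the slides:

```stata
* Correlation matrix using only observations complete on all variables.
correlate api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
* Pairwise correlations; obs shows the number of observations per pair
* and sig shows the significance level of each correlation.
pwcorr api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, obs sig
```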
2.4 Examine Distribution Assumption
The classical regression model assumes that the errors, and hence the outcome (dependent variable) given the predictors, are normally distributed.
In large samples, this assumption is less important because of the Central Limit Theorem.
In small samples, however, the distributional assumption can be relevant.
We will investigate issues concerning normality.
Here we check the normality of enroll. We start by making some graphs:
Histogram, kdensity
We can use the normal option to superimpose a normal curve on this graph and the bin(20) option to use 20 bins. The distribution looks skewed to the right.
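The histogram described above can be sketched as follows, assuming the school data are in memory:

```stata
* Histogram of enroll with 20 bins and an overlaid normal density curve.
histogram enroll, normal bin(20)
```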
An alternative to histograms is the kernel density plot, which approximates the probability density of the variable.
Kernel density plots have the advantage of being smooth and of being independent of the choice of origin, unlike histograms.
Stata implements kernel density plots with the kdensity command.
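A minimal sketch, assuming the school data are in memory. Adding the normal option here mirrors the histogram example and is a choice the slides do not state:

```stata
* Kernel density estimate of enroll, with a normal density overlaid
* for comparison.
kdensity enroll, normal
```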
Having concluded that enroll is not normally distributed, how should we address this problem?
We may try to transform enroll to make it more normally distributed. Potential transformations include taking the log, the square root or raising the variable to a power.
Stata includes the ladder and gladder commands to help select the right transformation. ladder reports numeric results and gladder produces a graphic display.
This indicates that the log transformation would help to make enroll more normally distributed.
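Both commands take the variable name directly, sketched here on the assumption that the school data are in memory:

```stata
* Numeric results: a normality test for each transformation of enroll
* on the ladder of powers.
ladder enroll
* Graphic display: a histogram of each transformation with a normal overlay.
gladder enroll
```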
Let's use the generate command with the log function to create the variable lenroll which will be the log of enroll.
Note that log in Stata gives the natural log, not log base 10. To get log base 10, use log10(var).
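A sketch of the transformation, assuming the school data are in memory. The variable name l10enroll and the follow-up histogram are illustrative additions, not taken from the slides:

```stata
* Natural log of enroll, stored in the new variable lenroll.
generate lenroll = log(enroll)
* Base-10 log, shown only for contrast (l10enroll is a hypothetical name).
generate l10enroll = log10(enroll)
* Illustrative check of whether the transformed variable looks more normal.
histogram lenroll, normal
```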
2.5 Summary
Simple regression, multiple regression, hypothesis testing, examining the normality assumption
Quiz I
Make graphs of api99: histogram, kdensity plot
What is the correlation between api99 and meals?
Regress api99 on meals. Create and list the fitted (predicted) values. Graph meals and api99 with and without the regression line.
Quiz II
Look at the correlations among the variables api99 meals ell avg_ed using the corr and pwcorr commands.
Perform a regression predicting api99 from meals and ell. Interpret the output.