Simple and Multiple Regression

2.1 Simple Linear Regression

Let's examine whether the size of a school is related to its academic performance.

For this example, api00 is the dependent variable and enroll is the predictor.

Dependent variable: api00/academic performance of the school

Independent variable: enroll/number of students
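A minimal sketch of the command that fits this model. It assumes the data set containing api00 and enroll is already loaded in memory; the source does not name the file.

```stata
* Assumption: the data set with api00 and enroll is already in memory.
* Fit a simple linear regression of api00 (outcome) on enroll (predictor).
regress api00 enroll
```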

F-test: F = 44.83, which means that the model is statistically significant.

R-squared: approximately 10% of the variance of api00 is accounted for by the model, in this case, enroll.

T-test: the t value for enroll equals -6.70 and is statistically significant, meaning that the regression coefficient for enroll is significantly different from zero.

Coefficient: the coefficient for enroll is -.1998674, or approximately -.2, meaning that for a one-unit increase in enroll, we would expect a .2-unit decrease in api00.
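The quantities discussed above can also be retrieved from the results Stata stores after estimation. The sketch below assumes regress api00 enroll has just been run. With a single predictor, the model F statistic is the square of that predictor's t statistic, so the two tests agree up to rounding.

```stata
* Assumption: "regress api00 enroll" was the last estimation command.
display "F statistic:  " e(F)
display "R-squared:    " e(r2)
display "b[enroll]:    " _b[enroll]
* t statistic = coefficient divided by its standard error
display "t for enroll: " _b[enroll] / _se[enroll]
```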

Predicted Value

After you run a regression, you can create a variable that contains the predicted values using the predict command.

For this example, our new variable will be named fv.
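A short sketch of this step, assuming the simple regression of api00 on enroll is the most recent estimation. Listing only the first 10 observations is an illustrative choice, not something the source specifies.

```stata
* Assumption: "regress api00 enroll" was the last estimation command.
* After regress, predict stores the fitted (predicted) values by default.
predict fv

* Compare observed and predicted values for the first 10 schools.
list api00 fv in 1/10
```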

Below we can show a scatterplot of the outcome variable, api00 and the predictor, enroll.

[Figure: scatterplot of api 2000 (y-axis) against number of students (x-axis).]
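A plot like this one can be drawn with a single command; a minimal sketch, again assuming the data are in memory:

```stata
* Outcome on the y-axis first, predictor on the x-axis second.
scatter api00 enroll
```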

We can combine scatter with lfit to show a scatterplot with fitted values.

[Figure: scatterplot of api 2000 with the fitted values line overlaid.]
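One way to overlay the two plot types is with twoway, sketched here under the same assumption that the data are in memory:

```stata
* Overlay the scatterplot and the linear fit of api00 on enroll.
twoway (scatter api00 enroll) (lfit api00 enroll)
```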

If you use the mlabel(snum) option on the scatter command, you can see the school number for each point. This allows us to see, for example, that one of the outliers is school 2910.

[Figure: scatterplot of api 2000 with each point labeled by its school number (snum) and the fitted values line overlaid.]
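A sketch of the labeled plot, followed by a quick look at the outlying school mentioned in the text. It assumes snum is stored as a numeric variable; if it were a string, the comparison would need quotes.

```stata
* Label each point with its school number and overlay the linear fit.
twoway (scatter api00 enroll, mlabel(snum)) (lfit api00 enroll)

* Assumption: snum is numeric. Inspect the outlying school.
list snum api00 enroll if snum == 2910
```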

2.2 Multiple Regression

Dependent variable: api00/academic performance of the school

Independent variables:

ell/english language learners

meals/pct free meals

yr_rnd/year round school

mobility/pct 1st year in school

acs_k3/avg class size k-3

acs_46/avg class size 4-6

full/pct full credential

emer/pct emer credential

enroll/number of students

In the output, look at the F statistic, R-squared and adjusted R-squared, the t values, and the coefficients.
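A minimal sketch of the multiple regression, assuming all nine independent variables listed above enter a single model and the data are in memory:

```stata
* Assumption: all nine listed predictors are used together in one model.
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
```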

But how do we compare the relative importance of the coefficients?

Use regress with the beta option.
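A sketch of the same model with standardized coefficients requested. The beta option replaces the confidence intervals in the output with standardized (beta) coefficients, which are on a common scale and so can be compared across predictors.

```stata
* Assumption: same nine-predictor model as above.
* beta reports standardized coefficients instead of confidence intervals.
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, beta
```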

Let us compare the regress output with the listcoef output. You will notice that the values listed in the Coef., t, and P>|t| columns are the same in the two outputs.

The bStdX column gives the unit change in Y expected with a one standard deviation change in X.

The bStdY column gives the standard deviation change in Y expected with a one unit change in X.

The SDofX column gives the standard deviation of each predictor variable in the model.
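The sketch below runs listcoef and also reproduces one bStdX value by hand, since bStdX is the coefficient multiplied by the standard deviation of the predictor. Note that listcoef is a user-written command rather than part of official Stata, so it may need to be installed first. The nine-predictor model is the same assumption as before.

```stata
* Assumption: same nine-predictor model as above.
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll

* Hand check for one predictor: bStdX = b * SD(x),
* using only the observations included in the regression.
quietly summarize enroll if e(sample)
display "bStdX for enroll: " _b[enroll] * r(sd)

* listcoef is user-written; if it is missing, "findit listcoef"
* helps locate and install it.
listcoef
```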

2.3 Hypothesis Testing

Single coefficient

Multiple coefficients
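Both kinds of test can be run with the test command after the regression. This is a sketch only: the source does not say which coefficients it tests, so the variables chosen here are illustrative assumptions.

```stata
* Assumption: the nine-predictor regression above was the last estimation.

* Single coefficient: H0 is that the coefficient on ell equals zero.
test ell

* Multiple coefficients: H0 is that the coefficients on acs_k3 and
* acs_46 are both zero (a joint F-test).
test acs_k3 acs_46
```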

Correlation

As part of doing a multiple regression analysis you might be interested in seeing the correlations among the variables in the regression model.

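You can use the correlate command as shown below.

You can also use pwcorr, which handles missing values pairwise (correlate drops any observation with a missing value on any listed variable). A useful option is sig, which reports the significance level of each correlation.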
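A sketch of the two commands side by side. The subset of variables is an illustrative assumption; any variables from the model could be listed.

```stata
* Listwise deletion: only observations complete on all listed variables.
correlate api00 ell meals full

* Pairwise deletion: each correlation uses all observations available
* for that pair. obs shows the number of observations per pair and
* sig shows the significance level.
pwcorr api00 ell meals full, obs sig
```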

2.4 Examine Distribution Assumption

The classical regression model assumes that the errors are normally distributed; this is often stated loosely as requiring the outcome (dependent variable) to be normally distributed.

In large samples, this assumption matters less because of the Central Limit Theorem.

In small samples, however, the distributional assumption can be relevant.

We will investigate issues concerning normality.

Here we check the normality of enroll. We start by making some graphs.

Histogram

Kernel density plot (kdensity)

We can use the normal option to superimpose a normal curve on this graph and the bin(20) option to use 20 bins. The distribution looks skewed to the right.
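A minimal sketch of the histogram described above, assuming the data are in memory:

```stata
* Histogram of enroll with 20 bins and a normal curve overlaid.
histogram enroll, normal bin(20)
```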

An alternative to histograms is the kernel density plot, which approximates the probability density of the variable.

Kernel density plots have the advantage of being smooth and of being independent of the choice of origin, unlike histograms.

Stata implements kernel density plots with the kdensity command.
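A sketch of the kernel density plot. The normal option, which overlays a normal density for comparison, is an addition here rather than something the source specifies.

```stata
* Kernel density estimate of enroll with a normal density overlaid.
kdensity enroll, normal
```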

Having concluded that enroll is not normally distributed, how should we address this problem?

We may try to transform enroll to make it more normally distributed. Potential transformations include taking the log, the square root or raising the variable to a power.

Stata includes the ladder and gladder commands to help select the right transformation. ladder reports numeric results and gladder produces a graphic display.

The ladder and gladder results indicate that the log transformation would help to make enroll more normally distributed.
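Both commands take the variable name directly; a minimal sketch:

```stata
* Numeric summary: a normality test for each candidate transformation.
ladder enroll

* Graphic display: a histogram for each candidate transformation.
gladder enroll
```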

Let's use the generate command with the log function to create the variable lenroll which will be the log of enroll.

Note that log() in Stata gives the natural log, not log base 10. To get log base 10, use log10(var).
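A sketch of the transformation step, followed by a quick visual check of the result. The variable name l10enroll is hypothetical and appears only to illustrate log10().

```stata
* Natural log of enroll, as described in the text.
generate lenroll = log(enroll)

* Check whether the transformed variable looks closer to normal.
histogram lenroll, normal bin(20)

* Hypothetical extra variable, only to illustrate the base-10 log.
generate l10enroll = log10(enroll)
```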

2.5 Summary

Simple Regression

Multiple Regression

Hypothesis Testing

Examine the normality assumption

Quiz I

Make graphs of api99: histogram, kdensity plot

What is the correlation between api99 and meals?

Regress api99 on meals. Create and list the fitted (predicted) values. Graph meals and api99 with and without the regression line.

Quiz II

Look at the correlations among the variables api99 meals ell avg_ed using the corr and pwcorr commands.

Perform a regression predicting api99 from meals and ell. Interpret the output.