SOC 206 Lecture 2

Logic of Multivariate Analysis: Multiple Regression


Page 1: SOC 206  Lecture 2

Logic of Multivariate Analysis: Multiple Regression

Page 2: SOC 206  Lecture 2

Why multivariate analysis? Nothing happens by a single cause.

If it did:
- it would imply perfect determinism
- it would imply perfect/divine measurement
- it would be impossible to separate cause from effect (where does the effect start and where does the cause end?)

Social reality is notoriously multi-causal, even more so than certain physical/chemical/biological processes. People are not just objects but also subjects of causal processes: reflexivity, agency, framing, etc. (Some of these are hard to capture in statistical models.)

Page 3: SOC 206  Lecture 2

#1. Empirical Association
#2. Appropriate Time Order
#3. Non-Spuriousness (excluding other forms of causation)

Mill tells us that even individual causal relationships cannot be established without multivariate analysis (#3).

Suppose we suspect X causes Y: Y = f(X, e). Suppose we establish that X is related to Y (#1) and that X precedes Y (#2). But what if both X and Y are the result of a third variable Z?

E.g. Academic Performance = f(Poverty, e). If that were true, redistributing income should help academic achievement.

But maybe both are the result of parents' education (a confounding factor).

[Path diagrams: Poverty → Academic Performance (−) with error e; versus Parents' Education as a common cause, lowering Poverty (−) and raising Academic Performance (+), with errors e1 and e2.]

Page 4: SOC 206  Lecture 2

Eliminating or "controlling for" other, confounding factors (Z)

Experiments: the treatment (X) is introduced by the researcher.
1. Physical control: excluding factors by physical design (physical control of the Zs).
2. Randomization: random assignment to treatment and control (randomized control of the Zs).

Observational research: no manipulation by the researcher.
3. Quasi-experiments: "found experiments," a choice of cases that are "minimum pairs": the same on most confounding factors (Zs) but different in the treatment (X).
4. Statistical manipulation: removing the effect of Z from the relationship between Y and X.
- Organizing the data into groups homogeneous in the control variable Z and looking at the relationship between treatment X and response Y within groups. If Y still moves together with X, it cannot be because they are moved by Z: Z is constant. If Z is the cause of Y and Z is constant, Y must be constant too.
- Residualizing X on Z, then residualizing Y on Z. That leaves us with the parts of X and Y that are unrelated to Z. If the two residualized variables still move together, that cannot be because they are moved by Z.

Page 5: SOC 206  Lecture 2

Remember: in a regression the error is always unrelated to the independent variable(s)

Residualizing – (we ‘take out,’ ‘eliminate’ Z from both Y and X)

Yi = a' + b'yz Zi + eyzi   (residualizing Y on Z)

Xi = a'' + b'xz Zi + exzi   (residualizing X on Z)

eyzi = a* + b* exzi + ei

where a* = 0 and b* = b1, the coefficient of X in the multiple regression Yi = a + b1 Xi + b2 Zi + ei.
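A minimal sketch of this residualizing logic in Stata, assuming the API dataset used in the later slides is in memory (API13 as Y, MEALS as X, AVG_ED as Z); by the Frisch-Waugh logic above, the final slope should match the MEALS coefficient from the multiple regression on Page 12.

. regress API13 AVG_ED
. predict e_y, resid
. regress MEALS AVG_ED
. predict e_x, resid
. regress e_y e_x
* The slope on e_x equals b1 for MEALS (.8187537) in: regress API13 AVG_ED MEALS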

Page 6: SOC 206  Lecture 2

The temporal position of Z vis-à-vis X

If Z is an antecedent variable (Z precedes both X and Y), the conditional effect of X on Y controlling for Z tells us:
- No change: Z is not a factor.
- Zero or statistically not significant: spurious association; X is not a factor (Z is their common cause).
- Weaker but statistically significant: X is a factor, but some of its original effect is spurious.
- Stronger than the unconditional effect: suppression.
- Uneven among the categories of Z: statistical interaction (X works differently depending on the values of Z).

If Z is an intervening variable (Z precedes Y but not X):
- No change: Z is not a factor.
- Zero or statistically not significant: explanation or chain relationship; X is a factor, but only through Z (X does not have a direct or independent effect).
- Weaker but statistically significant: X is a factor, and it affects Y both through Z and directly (or through other variables missing from the model).
- Stronger than the unconditional effect: suppression.
- Uneven among the categories of Z: statistical interaction (X works differently depending on the values of Z).

Page 7: SOC 206  Lecture 2

Yi=a+b1Xi+b2Zi+ei

or Yi=a+b1X1i+b2X2i+ei

To obtain a, b1, and b2 we first calculate β*1 and β*2 from the standardized regression.

Then we transform them into their metric equivalents. Finally we obtain a with the help of the means of Y, X1 and X2.

Page 8: SOC 206  Lecture 2

The standardized regression:

Zyi = β*1 Zx1i + β*2 Zx2i + ei

We multiply each side by Zx1i:

Zyi Zx1i = β*1 Zx1i Zx1i + β*2 Zx2i Zx1i + ei Zx1i

We sum across all cases and divide by n. (The mean product of two Z-scores is a correlation; Σ Zx1i Zx1i / n = 1; and Σ ei Zx1i / n = 0 because the error is unrelated to the independent variables.)

Σ Zyi Zx1i / n = β*1 Σ Zx1i Zx1i / n + β*2 Σ Zx2i Zx1i / n + Σ ei Zx1i / n

We get our first normal equation (for the correlation between Y and X1):

1. ryx1 = β*1 + β*2 rx1x2

We get an expression for β*1:

β*1 = ryx1 - β*2 rx1x2

We multiply each side by Zx2i and repeat what we did. We get our second normal equation (for the correlation between Y and X2):

2. ryx2 = β*1 rx1x2 + β*2

Plugging in for β*1:

ryx2 = (ryx1 - β*2 rx1x2) rx1x2 + β*2

Both standardized coefficients can be expressed in terms of the three correlations among Y, X1 and X2:

β*1 = (ryx1 - ryx2 rx1x2) / (1 - rx1x2²)
β*2 = (ryx2 - ryx1 rx1x2) / (1 - rx1x2²)
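As a numeric check (a sketch using the three correlations printed on Page 12: ryx1 = -.4743 for MEALS, ryx2 = .6706 for AVG_ED, rx1x2 = -.8178), these formulas reproduce the betas of the regression there, up to the rounding of the correlations:

. display (-.4743 - .6706*(-.8178)) / (1 - (-.8178)^2)
* ≈ .224, close to the printed beta of MEALS (.2235364)
. display (.6706 - (-.4743)*(-.8178)) / (1 - (-.8178)^2)
* ≈ .854, close to the printed beta of AVG_ED (.853387)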

Page 9: SOC 206  Lecture 2

We multiply each standardized coefficient by the ratio of the standard deviation of the dependent variable to that of the independent variable to which it belongs:

b1 = β*1 (SY / SX1)
b2 = β*2 (SY / SX2)

Take the two normal equations:

ryx1 = β*1 + β*2 rx1x2
ryx2 = β*2 + β*1 rx1x2

What do we learn from the normal equations?

If either β*2 = 0 or rx1x2 = 0, the unconditional effect does not change once we control for X2.

We get suppression only if β*2 ≠ 0 and rx1x2 ≠ 0, and the two are of opposite signs when the unconditional effect is positive, or of the same sign when the unconditional effect is negative.

The correlation (unconditional effect) of X1 or X2 on Y can be decomposed into two parts. Take X1:
- the direct (or net) effect of X1 on Y (β*1), controlling for X2,
- and something else: the product of the direct (or net) effect of X2 on Y (β*2) and the correlation between X1 and X2 (rx1x2), the measure of multicollinearity between the two independent variables.

Page 10: SOC 206  Lecture 2

http://www.miabella-llc.com/demo.html

Page 11: SOC 206  Lecture 2

AP = f(P, e1):   ZAP = β*'1 ZP + e1

AP = f(P, PE, e):   ZAP = β*1 ZP + β*2 ZPE + e

[Path diagrams: Poverty → Academic Performance with coefficient β*'1 and error e1 (bivariate model); Poverty (β*1) and Parents’ Education (β*2) → Academic Performance with error e (multivariate model).]

Page 12: SOC 206  Lecture 2

. correlate AVG_ED API13 MEALS, means (obs=10173)

    Variable |      Mean   Std. Dev.        Min       Max
-------------+--------------------------------------------
      AVG_ED |  2.781778     .758739          1         5
       API13 |   784.182    102.2096        311       999
       MEALS |  58.57338     27.9053          0       100

             |  AVG_ED    API13    MEALS
-------------+----------------------------
      AVG_ED |  1.0000
       API13 |  0.6706   1.0000
       MEALS | -0.8178  -0.4743   1.0000

. regress API13 AVG_ED MEALS, beta

      Source |       SS       df       MS              Number of obs =   10173
-------------+------------------------------           F(  2, 10170) = 4441.76
       Model |    49544993     2  24772496.5           Prob > F      =  0.0000
    Residual |  56719871.2 10170  5577.17514           R-squared     =  0.4662
-------------+------------------------------           Adj R-squared =  0.4661
       Total |   106264864 10172  10446.8014           Root MSE      =   74.68

       API13 |     Coef.   Std. Err.      t    P>|t|       Beta
-------------+---------------------------------------------------
      AVG_ED |  114.9596   1.695597    67.80   0.000    .853387
       MEALS |  .8187537   .0461029    17.76   0.000   .2235364
       _cons |  416.4326   7.135849    58.36   0.000          .
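This output also lets us check the three-step recipe from Page 7 by hand; a sketch using only the means, standard deviations, and betas printed above:

. display .853387*(102.2096/.758739)
* ≈ 114.96, the metric coefficient of AVG_ED
. display .2235364*(102.2096/27.9053)
* ≈ .8188, the metric coefficient of MEALS
. display 784.182 - 114.9596*2.781778 - .8187537*58.57338
* ≈ 416.43, the constant a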

Page 13: SOC 206  Lecture 2
Page 14: SOC 206  Lecture 2

[Residual-versus-fitted plot for the two-variable model: residuals (-400 to 400) against fitted values (600 to 1000).]

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of API13

chi2(1)     =  1332.01
Prob > chi2 =   0.0000
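The test rejects constant variance, so the classical standard errors on Page 12 are suspect. One standard remedy, sketched here and applied by the lecture itself on Page 23, is to refit with heteroskedasticity-robust standard errors:

. regress API13 AVG_ED MEALS, vce(hc3) beta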

Page 15: SOC 206  Lecture 2

[Path diagrams as on Page 11, with the estimates from Page 12 filled in; e1 and e are the errors of the bivariate and multivariate models.]

β*'1 = ryx1 = -.4743   (the bivariate standardized coefficient of Poverty/MEALS)
β*1 = .2235364
β*2 = .853387
rx1x2 = -.8178

ryx1 = β*1 + β*2 rx1x2

-.4743 ≈ .2235364 + .853387 × (-.8178) = .2235364 - .69789989

The second term, β*2 rx1x2 = -.6979, is the spurious indirect effect.

Page 16: SOC 206  Lecture 2

[Path diagrams: Parents’ Education → Academic Performance (bivariate, coefficient β*'2, error e1) and the multivariate model (coefficients β*1 and β*2, error e).]

β*'2 = ryx2 = .6706   (the bivariate standardized coefficient)
β*1 = .2235364
β*2 = .853387
rx1x2 = -.8178

ryx2 = β*2 + β*1 rx1x2

.6706 ≈ .853387 + .2235364 × (-.8178) = .853387 - .18280807

The second term, β*1 rx1x2 = -.1828, is the indirect effect.
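Both decompositions can be verified directly from the printed estimates; a quick sketch:

. display .2235364 + .853387*(-.8178)
* ≈ -.4744, reproducing ryx1 = -.4743
. display .853387 + .2235364*(-.8178)
* ≈ .6706, reproducing ryx2 = .6706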

Page 17: SOC 206  Lecture 2

Venn diagram: R-square = unique contribution of X1 + unique contribution of X2 + common contribution of both X1 and X2.

Multicollinearity: the unique contributions are small and statistically non-significant, yet R-square is large because the common contribution is large.

[Venn diagrams: Y overlapping X1 and X2, with a small versus a large overlap between X1 and X2.]
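A sketch of how one might check this in Stata: estat vif reports variance inflation factors after a regression, and with only two predictors VIF = 1/(1 - rx1x2²), about 3.02 here given rx1x2 = -.8178.

. regress API13 MEALS AVG_ED
. estat vif
* VIF ≈ 3.02 for both MEALS and AVG_ED; large values signal that unique contributions are hard to separate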

Page 18: SOC 206  Lecture 2

Comparing theories: how much a theory adds to an already existing one. We calculate the contribution of a set of variables with an F test on the change in R².

Here R²1 is the fit of the reduced/smaller model and R²2 is the fit of the full/complete model; K1 is the number of independent variables in the reduced model and K2 is the number of independent variables in the complete model; and N is the sample size.

Warning: you have to make sure you use the exact same cases for each model!

F(K2-K1, N-K2-1) = [(R²2 - R²1) / (K2 - K1)] / [(1 - R²2) / (N - K2 - 1)]
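A sketch of this comparison in Stata, using stored results rather than retyping the R-squares (e(r2) and e(N) are Stata's saved results; the variable names and if-condition are taken from the slides that follow):

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<6
. scalar r2_1 = e(r2)
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6
. scalar r2_2 = e(r2)
* F with (13-6, N-13-1) degrees of freedom; both runs must use the exact same cases
. display ((r2_2 - r2_1)/(13 - 6)) / ((1 - r2_2)/(e(N) - 13 - 1))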

Page 19: SOC 206  Lecture 2

Adding a new independent variable will always improve fit, even if it is unrelated to the dependent variable.

We have to consider the parsimony (number of independent variables) of the model relative to the sample size. For N=2, a simple regression will always have a perfect fit. General rule: N-1 independent variables will always result in an R-squared of 1, no matter what those variables are.

Adjusted R-square

R²adj = R² - K(1 - R²) / (N - K - 1)
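As a sketch, the same quantity can be recovered in Stata right after a regression from the saved results, where e(r2) is the R-square, e(N) the sample size, and e(df_m) the number of independent variables:

. display e(r2) - e(df_m)*(1 - e(r2))/(e(N) - e(df_m) - 1)
* after the 13-variable model on Page 22 this gives ≈ .6572, matching the printed Adj R-squared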

Page 20: SOC 206  Lecture 2

Yi=a+b1X1i+b2X2i+....+bkXki+ei

If we standardize Y, X1 … Xk, turning them into Z-scores, we can re-write the equation as

Zyi=β*1Zx1i+ β*2Zx2i+… +β*kZxki+ei

To find the coefficients we have to write out k normal equations, one for each correlation between an independent variable and the dependent variable:

ryx1=β*1+ β*2 rx1x2+…..+β*k rx1xk

ryx2= β*1rx1x2+ β*2+…..+β*k rx2xk

………………. ryxk= β*1rx1xk + β*2 rx2xk+…..+β*k

and solve k equations for k unknowns (β*1, β*2…. β*k)
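For more than two predictors, solving the k normal equations by hand is tedious; a sketch of the matrix solution in Stata, β* = Rxx⁻¹ rxy, using four of the variables from the next slide (the dependent variable must come first in the correlate list):

. correlate API13 MEALS AVG_ED DMOB
. matrix C = r(C)
. matrix Rxx = C[2..4, 2..4]      // correlations among the independent variables
. matrix rxy = C[2..4, 1]         // correlations of each independent variable with Y
. matrix bstar = invsym(Rxx)*rxy  // the standardized coefficients beta*
. matrix list bstar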

Page 21: SOC 206  Lecture 2

. correlate API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR
(obs=10082)

             |   API13   MEALS  AVG_ED    P_EL  P_GATE    EMER    DMOB  PCT_AA  PCT_AI  PCT_AS
-------------+----------------------------------------------------------------------------------
       API13 |  1.0000
       MEALS | -0.4876  1.0000
      AVG_ED |  0.6736 -0.8232  1.0000
        P_EL | -0.3039  0.6149 -0.6526  1.0000
      P_GATE |  0.2827 -0.1631  0.2126 -0.1564  1.0000
        EMER | -0.0987  0.0197 -0.0407 -0.0211 -0.0541  1.0000
        DMOB |  0.5413 -0.0693  0.2123  0.0231  0.2198 -0.0487  1.0000
      PCT_AA | -0.2215  0.1625 -0.1057 -0.0718  0.0334  0.1380 -0.1306  1.0000
      PCT_AI | -0.1388  0.0461 -0.0246 -0.1510 -0.0812  0.0180 -0.1138 -0.0684  1.0000
      PCT_AS |  0.3813 -0.3031  0.3946 -0.0954  0.2321 -0.0247  0.1620 -0.0475 -0.0902  1.0000
      PCT_FI |  0.1646 -0.1221  0.1687 -0.0526  0.1281  0.0007  0.1203  0.0578 -0.0788  0.2485
      PCT_HI | -0.4301  0.6923 -0.8007  0.7143 -0.1296 -0.0192 -0.0193 -0.0911 -0.1834 -0.3733
      PCT_PI | -0.0598  0.0533 -0.0228  0.0286  0.0091  0.0315 -0.0202  0.2195 -0.0311  0.0748
      PCT_MR |  0.1468 -0.3714  0.3933 -0.3322  0.0052  0.0102 -0.0928 -0.0053  0.0667  0.0904

             |  PCT_FI  PCT_HI  PCT_PI  PCT_MR
-------------+----------------------------------
      PCT_FI |  1.0000
      PCT_HI | -0.1488  1.0000
      PCT_PI |  0.2769 -0.0763  1.0000
      PCT_MR |  0.0928 -0.4700  0.0611  1.0000

API13    Academic Performance Index 2013
MEALS    Percent Free or Reduced Price Meal Eligible
AVG_ED   Average Parent Education Level (1-5)
P_EL     Percent English Learner
P_GATE   Percent in Gifted And Talented Education Program
EMER     Percent Teachers with Emergency Credentials
DMOB     Percent Students Enrolled in District w/o 30-Day Gap in Enrollment
PCT_AA   Percent African American
PCT_AI   Percent American Indian or Alaska Native
PCT_AS   Percent Asian
PCT_FI   Percent Filipino
PCT_HI   Percent Hispanic or Latino
PCT_PI   Percent Native Hawaiian or Pacific Islander
PCT_MR   Percent Mixed Race

Page 22: SOC 206  Lecture 2

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F(  6, 10075) = 2947.08
       Model |  65503313.6     6  10917218.9           Prob > F      =  0.0000
    Residual |  37321960.3 10075  3704.41293           R-squared     =  0.6370
-------------+------------------------------           Adj R-squared =  0.6368
       Total |   102825274 10081  10199.9081           Root MSE      =  60.864

       API13 |     Coef.   Std. Err.      t    P>|t|       Beta
-------------+---------------------------------------------------
       MEALS |  .1843877   .0394747     4.67   0.000   .0508435
      AVG_ED |  92.81476   1.575453    58.91   0.000   .6976283
        P_EL |  .6984374   .0469403    14.88   0.000   .1225343
      P_GATE |  .8179836   .0666113    12.28   0.000   .0769699
        EMER | -1.095043   .1424199    -7.69   0.000   -.046344
        DMOB |  4.715438   .0817277    57.70   0.000   .3746754
       _cons |  52.79082   8.491632     6.22   0.000          .

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F( 13, 10068) = 1488.01
       Model |    67627352    13     5202104           Prob > F      =  0.0000
    Residual |  35197921.9 10068  3496.01926           R-squared     =  0.6577
-------------+------------------------------           Adj R-squared =  0.6572
       Total |   102825274 10081  10199.9081           Root MSE      =  59.127

       API13 |     Coef.   Std. Err.      t    P>|t|       Beta
-------------+---------------------------------------------------
       MEALS |   .370891   .0395857     9.37   0.000   .1022703
      AVG_ED |  89.51041   1.851184    48.35   0.000   .6727917
        P_EL |  .2773577   .0526058     5.27   0.000   .0486598
      P_GATE |  .7084009   .0664352    10.66   0.000   .0666584
        EMER | -.7563048   .1396315    -5.42   0.000   -.032008
        DMOB |  4.398746   .0817144    53.83   0.000    .349512
      PCT_AA | -1.096513   .0651923   -16.82   0.000  -.1112841
      PCT_AI | -1.731408   .1560803   -11.09   0.000  -.0718944
      PCT_AS |  .5951273   .0585275    10.17   0.000   .0715228
      PCT_FI |  .2598189   .1650952     1.57   0.116   .0099543
      PCT_HI |  .0231088   .0445723     0.52   0.604   .0066676
      PCT_PI | -2.745531   .6295791    -4.36   0.000  -.0274142
      PCT_MR | -.8061266   .1838885    -4.38   0.000  -.0295927
       _cons |  96.52733   9.305661    10.37   0.000          .

F(K2-K1, N-K2-1) = [(R²2 - R²1) / (K2 - K1)] / [(1 - R²2) / (N - K2 - 1)]

F(13-6, 10082-13-1) = [(.6577 - .6370) / (13 - 6)] / [(1 - .6577) / (10082 - 13 - 1)]
                    = (.0207/7) / (.3423/10068) ≈ 86.98

Page 23: SOC 206  Lecture 2

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR, vce(hc3) beta

Linear regression                                       Number of obs =   10082
                                                        F( 13, 10068) = 1439.49
                                                        Prob > F      =  0.0000
                                                        R-squared     =  0.6577
                                                        Root MSE      =  59.127

             |             Robust HC3
       API13 |     Coef.   Std. Err.      t    P>|t|       Beta
-------------+---------------------------------------------------
       MEALS |   .370891   .0576739     6.43   0.000   .1022703
      AVG_ED |  89.51041   2.651275    33.76   0.000   .6727917
        P_EL |  .2773577   .0646176     4.29   0.000   .0486598
      P_GATE |  .7084009   .0624278    11.35   0.000   .0666584
        EMER | -.7563048   .2248352    -3.36   0.001   -.032008
        DMOB |  4.398746   .1645831    26.73   0.000    .349512
      PCT_AA | -1.096513   .0799674   -13.71   0.000  -.1112841
      PCT_AI | -1.731408   .2257328    -7.67   0.000  -.0718944
      PCT_AS |  .5951273   .0492148    12.09   0.000   .0715228
      PCT_FI |  .2598189   .1343712     1.93   0.053   .0099543
      PCT_HI |  .0231088   .0511823     0.45   0.652   .0066676
      PCT_PI | -2.745531   .7471198    -3.67   0.000  -.0274142
      PCT_MR | -.8061266   .2485255    -3.24   0.001  -.0295927
       _cons |  96.52733   16.89459     5.71   0.000          .


Notice that the coefficients, the betas, the R-squared, and the Root MSE are unchanged. The standard errors are different, and so are the t values; therefore the P values change as well. Look at PCT_FI: now it is almost significant at the .05 level (P = .053), whereas on the previous slide its P value was .116.

Page 24: SOC 206  Lecture 2

[Residual-versus-fitted plot for the 13-variable model: residuals (-400 to 600) against fitted values (200 to 1000).]

GOOD ONES
  Residual   Name                                       Tested/Enrolled
  506.0523   Muir Charter                               78/78
  488.5563   SIATech                                    65/66
  342.7693   Escuela Popular/Center for Training and    88/91
  280.2587   YouthBuild Charter School of California    78/78
  246.7804   Oakland Charter Academy                    238/238
  232.4897   Oakland Charter High                       146/146
  230.0739   Opportunities For Learning - Baldwin Par   1434/1442

BAD ONES
  Residual   Name                                       Tested/Enrolled
 -399.4998   Sierra Vista High (SD)                     14/15
 -342.2773   Baden High (Continuation)                  73/73
 -336.5667   Dover Bridge to Success                    84/88
 -322.1879   Millennium High Alternative                43/49
 -318.0444   Aurora High (Continuation)                 128/131
 -315.5069   Sunrise (Special Education)                34/34
 -311.1326   Nueva Vista High                           20/28

Page 25: SOC 206  Lecture 2

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6 [aweight = TESTED], beta
(sum of wgt is 9.0302e+06)

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F( 13, 10068) = 2324.54
       Model |  41089704.2    13  3160746.48           Prob > F      =  0.0000
    Residual |  13689769.3 10068  1359.73076           R-squared     =  0.7501
-------------+------------------------------           Adj R-squared =  0.7498
       Total |  54779473.6 10081   5433.9325           Root MSE      =  36.875

       API13 |     Coef.   Std. Err.      t    P>|t|       Beta
-------------+---------------------------------------------------
       MEALS |  .2401007    .032364     7.42   0.000   .0828479
      AVG_ED |  83.84621   1.444873    58.03   0.000   .8044588
        P_EL |  .1605591   .0405248     3.96   0.000   .0306712
      P_GATE |  .2649964   .0443791     5.97   0.000   .0317522
        EMER | -1.527603   .1503635   -10.16   0.000  -.0513386
        DMOB |  3.414537   .0834016    40.94   0.000   .2212861
      PCT_AA | -1.275241   .0583403   -21.86   0.000  -.1301146
      PCT_AI |  -1.96138   .2143326    -9.15   0.000  -.0499468
      PCT_AS |  .4787539   .0368303    13.00   0.000    .082836
      PCT_FI | -.0272983   .1113346    -0.25   0.806  -.0013581
      PCT_HI |  .0440935   .0351466     1.25   0.210   .0158328
      PCT_PI | -2.464109   .5116525    -4.82   0.000  -.0271533
      PCT_MR | -.5071886   .1678521    -3.02   0.003  -.0187953
       _cons |  220.2237   9.318893    23.63   0.000          .

Page 26: SOC 206  Lecture 2

Characteristics of OLS if the sample is a probability sample:
- Unbiased: E(b) = β; the mean sample value is the population value.
- Efficient: minimum Var(b); the sample values are as close to each other as possible.
- Consistent: as the sample size (n) approaches infinity, the sample value converges on the population value:

  lim(n→∞) Pr(|b - β| < ε) = 1

If the following assumptions are met:

The model is
- complete
- linear
- additive

Variables are measured at an interval or ratio scale, without error.

The regression error term
- is normally distributed
- has an expected value of 0
- errors are independent of each other
- homoscedasticity (constant variance)
- predictors are unrelated to the error
- in a system of interrelated equations, the errors are unrelated to each other
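A toy simulation sketch of these sampling properties (the data-generating process is assumed, not from the lecture): with y = 1 + 2x + e, the OLS slope should center on 2 and tighten around it as n grows.

. clear
. set seed 1234
. set obs 100000
. generate x = rnormal()
. generate y = 1 + 2*x + rnormal()
. regress y x
* b for x should be very close to 2; rerun with a smaller "set obs" to see more sampling noise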