Copyright Sharyn O'Halloran 1
Statistics and Quantitative Analysis U4320
Lecture 11 : Path Diagrams Prof. Sharyn O’Halloran
URL: http://www.columbia.edu/itc/sipa/U4320y-003/
Key Points
  Review Regression in Excel
  Slope Coefficient as a Multiplication Factor
  Path Diagram and Causal Models
  Direct and Indirect Effects
Regression in Excel
Example:
  Manatees are large, gentle sea creatures that live along the Florida coast.
  Many manatees are killed or injured by powerboats each year.
  The US Fish and Wildlife Service conducted a study on the impact of powerboat registration permits on the number of manatees killed.
  [Diagram: Number of Powerboats → Manatee Deaths]
Regression in Excel
These are the data collected:
  Year   Powerboat registration (1000)   Manatees killed
  1977   447                             13
  1978   460                             21
  1979   481                             24
  1980   498                             16
  1981   513                             24
  1982   512                             20
  1983   526                             15
  1984   559                             34
  1985   585                             33
  1986   614                             33
  1987   645                             39
  1988   675                             43
  1989   711                             50
  1990   719                             47
  1991   716                             53
  1992   716                             38
  1993   716                             35
  1994   735                             49
Descriptive Statistics
  Statistic                  Powerboat registration (1000)   Manatees killed
  Mean                       601.56                          32.61
  Standard Error             24.46                           3.02
  Median                     599.50                          33.50
  Mode                       716.00                          24.00
  Standard Deviation         103.79                          12.82
  Sample Variance            10773.32                        164.25
  Range                      288.00                          40.00
  Minimum                    447.00                          13.00
  Maximum                    735.00                          53.00
  Sum                        10828.00                        587.00
  Count                      18.00                           18.00
  Confidence Level (95.0%)   51.62                           6.37
Does the number of registered powerboats increase the number of manatees killed?
Regression in Excel (cont.)
Graph Data:
  [Figure: "Relation between Powerboat Registration (1000) and Manatee Deaths", a plot of the manatee data; axes: Registration, Manatees Killed]
Regression output:
                                  Coefficients   Standard Error   t Stat   P-value
  Intercept                       -35.18         7.70             -4.57    0.000314
  Powerboat registration (1000)   0.11           0.01             8.93     0.000000
For each additional 1,000 powerboats registered, we expect an increase of .11 manatee deaths.
  $\hat{Y} = -35.18 + .110\,X$
  t-statistics: (-4.57)* for the intercept, (8.93)* for the slope
Note: t-statistics in parentheses. * indicates p-value < 0.05.
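As a cross-check outside Excel, the same line can be fitted with the usual least-squares formulas. This is only a minimal sketch: the function name `fit_simple_regression` is our own choice, and the numbers are the 18 observations from the data table. The printed values should agree with the Excel coefficients after rounding.

```python
# Least-squares fit of manatee deaths on powerboat registrations (in 1000s),
# using the 18 yearly observations (1977-1994) from the data table.
registrations = [447, 460, 481, 498, 513, 512, 526, 559, 585,
                 614, 645, 675, 711, 719, 716, 716, 716, 735]
deaths = [13, 21, 24, 16, 24, 20, 15, 34, 33,
          33, 39, 43, 50, 47, 53, 38, 35, 49]


def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_simple_regression(registrations, deaths)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```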
Regression in Excel (cont.)
Hypothesis Testing
  $H_0: \beta_1 = 0$
  $H_a: \beta_1 \neq 0$
Calculate a 95% Confidence Interval
  $\beta_1 = b_1 \pm t_{.025,\,n-k} \cdot SE_b$
  $\beta_1 = 0.11 \pm 2.12 \times 0.01$
  $\beta_1 = 0.11 \pm 0.0212$
  $0.0888 \le \beta_1 \le 0.1312$
Reject or Fail to Reject Null Hypothesis
  The interval does not contain 0. Therefore, we reject the null hypothesis that $\beta_1 = 0$ in favor of the alternative that it is not equal to 0.
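The interval arithmetic can be checked with a few lines of code. This sketch assumes the rounded values from the regression output (b = 0.11, SE = 0.01) and the critical value t = 2.12 for n - k = 18 - 2 = 16 degrees of freedom; the helper name `confidence_interval` is hypothetical.

```python
# 95% confidence interval for the slope: estimate +/- t * standard error.
def confidence_interval(estimate, std_error, t_critical):
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


# Rounded values from the regression output; t_.025 with 16 df is 2.12.
lower, upper = confidence_interval(0.11, 0.01, 2.12)

# We reject H0 when the hypothesized value (0) lies outside the interval.
rejects_null = not (lower <= 0.0 <= upper)
print(f"{lower:.4f} <= beta <= {upper:.4f}; reject H0: {rejects_null}")
```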
Regression in Excel (cont.)
Interpretation
  Is the count of registered powerboats related to the number of manatees killed?
    Yes, each additional 1,000 powerboat registrations is associated with an additional .11 manatee deaths.
  If the Fish and Wildlife Service limited powerboat registration to 700 (thousand), what would be the expected number of manatee deaths?
    $\hat{Y} = -35.18 + .110(700) = 41.82$
  What if no powerboats were allowed?
    $\hat{Y} = -35.18 + .110(0) = -35.18$
    A negative number of deaths is impossible: X = 0 lies far outside the observed range of 447 to 735, so the fitted line cannot be trusted there.
Regression in Excel (cont.)
Policy Prescriptions??
  What are some of the costs associated with limiting powerboat registration?
  What are some of the benefits?
  Should powerboats be prohibited?
  If we want to maintain the current population of manatees, what level of registration would be allowed?
  What additional data would be necessary to make this decision?
Interpretation: Regression Coefficients as Multiplication Factors
Simple Regression
  Basic Equation
    Remember our basic one-variable regression equation is:
      Y = a + bX
    b is the slope of the regression line. It represents the change in Y corresponding to a unit change in X.
Interpretation: Regression Coefficients as Multiplication Factors (cont.)
Multiplication Factor
  We can also think of b as a multiplication factor:
    $\Delta Y = b\,\Delta X$
Example
  Take the first fertilizer equation:
    Y = 36 + .06 X
  Say we add 5 more pounds of fertilizer. Then the change in yield according to this equation will be:
    $\Delta Y = b\,\Delta X = .06\,(5) = .30$ bushels
Interpretation: Regression Coefficients as Multiplication Factors (cont.)
Multiple Regression: "Other Things Being Equal"
  Now consider the multiple regression equation:
    $Y = b_0 + b_1 X_1 + b_2 X_2$
  We can still think of the slopes as multiplication factors.
  But now they are multiplication factors if we change only one variable and keep all others constant.
Interpretation: Regression Coefficients as Multiplication Factors (cont.)
Say we change $X_1$ to $(X_1 + \Delta X_1)$. Then we can write:
  (initial)     $Y = b_0 + b_1 X_1 + b_2 X_2$
  (new)         $Y = b_0 + b_1 (X_1 + \Delta X_1) + b_2 X_2$
  (difference)  $\Delta Y = b_1 \Delta X_1$
If X1 changes while all others remain constant, then
  change in Y = b1 (change in X1)
Interpretation: Regression Coefficients as Multiplication Factors (cont.)
Examples
  Let's try an example. Say we have the following simple and multiple regression equations:
    YIELD = 36 + .059 FERTILIZER          (bivariate: gives the total effect of fertilizer)
    YIELD = 30 + 1.50 RAIN                (bivariate: gives the total effect of rain)
    YIELD = 28 + .038 FERT + .83 RAIN     (multivariate: gives the partial effects of fertilizer and rain)
Interpretation: Regression Coefficients as Multiplication Factors (cont.)
What is the change in yield if a farmer adds another 100 pounds of fertilizer?
Answer:
  Only the fertilizer will change, not the rain.
  So use the multiple regression equation:
    $\Delta Y = b_1 \Delta X_1 = (.038)(100) = 3.8$ bushels
Interpretation: Regression Coefficients as Multiplication Factors (cont.)
What is the expected change in yield if a farmer irrigates his fields with 3 inches of water?
Answer:
  Only the amount of water will change, not the fertilizer.
  So use the multiple regression equation:
    $\Delta Y = b_2 \Delta X_2 = (.83)(3) \approx 2.5$ bushels
Interpretation: Regression Coefficients as Multiplication Factors (cont.)
Say the farmer adds both 100 pounds of fertilizer and 3 inches of irrigation. Now what will the difference in yield be?
Answer:
  The change in yield will reflect the changes in both independent variables:
    $\Delta Y = b_1 \Delta X_1 + b_2 \Delta X_2$
    $\Delta Y = (.038)(100) + (.83)(3)$
    $\Delta Y = 3.8 + 2.5$
    $\Delta Y = 6.3$ bushels
Interpretation: Regression Coefficients as Multiplication Factors (cont.)
Now, say rainfall has increased 3 inches and we know that fertilizer is not held constant.
What would your best guess be as to the difference in yield?
Answer:
  Since fertilizer is not held constant, we should use the simple regression equation for rain:
    $\Delta Y = b\,\Delta X$
    $\Delta Y = (1.5)(3)$
    $\Delta Y = 4.5$ bushels
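The four yield questions above can be restated as a short calculation. This is a sketch only: the coefficients come from the regression equations in the examples, while the names `MULTI`, `partial_change`, and `total_change_rain` are our own.

```python
# Coefficients taken from the yield equations in the examples.
SIMPLE_RAIN = 1.50                       # YIELD = 30 + 1.50 RAIN
MULTI = {"fert": 0.038, "rain": 0.83}    # YIELD = 28 + .038 FERT + .83 RAIN


def partial_change(deltas):
    """Change in yield when only the listed variables change (others held constant)."""
    return sum(MULTI[name] * delta for name, delta in deltas.items())


def total_change_rain(delta_rain):
    """Change in yield for a change in rain when fertilizer is not held constant."""
    return SIMPLE_RAIN * delta_rain


fert_only = partial_change({"fert": 100})          # 100 more pounds of fertilizer
rain_only = partial_change({"rain": 3})            # 3 more inches of water
both = partial_change({"fert": 100, "rain": 3})    # both changes together
rain_total = total_change_rain(3)                  # rain up, fertilizer free to vary

print(f"{fert_only:.2f} {rain_only:.2f} {both:.2f} {rain_total:.2f}")
```

Note that the unrounded rain effect is 2.49 bushels, so the combined change is 6.29, which the slides round to 6.3.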
Path Analysis
Purpose:
  Develop a technique that allows us to disaggregate the effects caused directly by the increase in rainfall and indirectly by other factors.
  This is known as Path Analysis.
A path analysis is an ordered causal system that relates the effect that a change in X produces on Y.
This can be direct:
  Employment is a cause of earnings: people who get (lose) jobs increase (decrease) their earnings.
  Race is a cause of party identification: Blacks are more likely to become Democrats than are whites.
  [Diagram: Employment → Earnings]
This can also be indirect:
  Employment causes earnings via education: people with higher education get jobs and therefore have higher earnings.
  Race causes party via income: Blacks tend to have lower income and therefore are more likely to become Democrats than are whites.
  [Diagram: Employment and Earnings linked through Education]
Path Analysis (cont.)
Example 1: Fiji Women
  Say we have data on 4700 women from Fiji.
Basic Model
  We know for each woman:
    Age
    Years of education, and
    Number of children
Path Analysis (cont.)
Path Diagram
  We might think that a woman's age and education correlate with how many children she has.
  We can write a causal model that looks like this:
    Age → Children
    Education → Children
Path Analysis (cont.)
Estimates
  When we estimate these relationships, we get the results:
    CHILDREN = 3.4 + .059 AGE - .16 EDUC
  We can represent these results as follows:
    Age → Children: 0.059
    Education → Children: -0.16
Path Analysis (cont.)
Direct and Indirect Effects
  Now, let's say we think there might also be a relationship between a woman's age and education.
Estimated Equation
  If we estimate this regression, we get the result:
    EDUCATION = 7.6 - .032 AGE
  Older women have less education than younger women.
Path Analysis (cont.)
Path Diagram
  We now add this new information into the causal model:
    Age → Children: 0.059
    Education → Children: -0.16
    Age → Education: -0.032
  What is the change in the expected number of children due to 1 extra year of age, holding education constant?
  What is the change in the years of education from this same 1 extra year of age?
Path Analysis (cont.)
Direct and Indirect Effects
  Question: What's the change in number of children from one extra year of age, letting education change too?
  The change in age has two effects: a direct and an indirect effect.
Direct Effect (Multiple regression coefficient)
  The direct effect is captured in the coefficient leading from AGE to CHILDREN.
  This is the multiple regression coefficient, and it represents the expected extra number of children from one extra year of age, holding education constant.
    Age → Children: 0.059
Path Analysis (cont.)
Indirect Effect
  We know that an extra year of age corresponds with -.032 years of school.
  Each extra year of school corresponds with -.16 extra children.
  We get the indirect effect by multiplying along the arrows leading from AGE to CHILDREN through EDUCATION:
    Age → Education: -0.032
    Education → Children: -0.16
    (-.032) * (-.16) = +.005
Path Analysis (cont.)
Total Effect
  So the total effect of AGE on CHILDREN, letting EDUCATION vary too, is the sum of the direct and indirect effects:
    .059 + .005 = .064
    Age → Children: 0.059
    Age → Education: -0.032
    Education → Children: -0.16
  Question: What do you think would have happened if we ran a simple regression of CHILDREN on AGE? What would the coefficient have been?
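The direct, indirect, and total effects for the Fiji example can be computed directly from the three path coefficients. A minimal sketch, with constant names of our own choosing:

```python
# Path coefficients from the Fiji example.
AGE_TO_CHILDREN = 0.059     # direct effect (multiple regression coefficient)
EDUC_TO_CHILDREN = -0.16    # children per extra year of education
AGE_TO_EDUC = -0.032        # from EDUCATION = 7.6 - .032 AGE

direct = AGE_TO_CHILDREN
# Indirect effect: multiply along the arrows Age -> Education -> Children.
indirect = AGE_TO_EDUC * EDUC_TO_CHILDREN
# Total effect: one extra year of age, letting education change too.
total = direct + indirect

print(f"direct = {direct:.3f}, indirect = {indirect:.3f}, total = {total:.3f}")
```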
Path Analysis: In a Nutshell
A path is a route from Xi to Xj.
  Paths follow one-way arrows.
Paths have signs:
  They may be positive, negative, or zero.
Paths have sizes or magnitudes:
  Numbers that summarize the total impact on Xj of a unit change in Xi after it has rippled through the system.
Path Analysis: In a Nutshell
Illustration
  Assume:
    A year's experience raises income by $500.
    Each dollar of income raises one's measure of conservatism by .025 points.
    A point on the conservatism scale raises the percentage who vote Republican by .137 percentage points.
  What happens if workers gain 5 years of experience?
    Incomes increase by $2,500 (5 × $500 = $2,500).
    A $2,500 increase in income increases their conservatism by 62.5 points ($2,500 × .025 = 62.5).
    The 62.5-point increase in conservatism increases Republicanism by 8.56 points (62.5 × .137 = 8.56).
Multiplication Rule
  A one-year increase in experience would produce an 8.56 / 5 = 1.71 point shift,
  which could be found by multiplying the 3 coefficients together: $500 × .025 × .137 = 1.7125.
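The multiplication rule is easy to express in code. The sketch below uses the three assumed coefficients from the illustration; the `path` list and the `path_value` helper are hypothetical names, not part of any library.

```python
from math import prod

# One coefficient per arrow in the path, in the units assumed in the illustration.
path = [
    ("experience -> income", 500.0),              # dollars per year of experience
    ("income -> conservatism", 0.025),            # scale points per dollar
    ("conservatism -> Republican vote", 0.137),   # percentage points per scale point
]


def path_value(coefficients):
    """Value of a path: the product of the coefficients on its arrows."""
    return prod(coefficients)


per_year = path_value(coef for _, coef in path)   # shift per year of experience
five_years = 5 * per_year                         # shift for 5 years of experience
print(f"per year: {per_year:.4f}, five years: {five_years:.4f}")
```

Whatever units the intervening variables use, they cancel along the path, so the result is in percentage points of Republican vote per year of experience.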
Path Analysis (cont.)
Summary
  The value of a path (the change in Xj per unit change in Xi) is found by multiplying the coefficients of each arrow in the path.
  Regardless of the measurement units of the intervening variables, the result will come out in Xj units per one unit difference in Xi.
  A path diagram provides insight into the relationship between simple and multiple regression.
  Multiple regression gives us the direct (partial) effects of the independent variables on the dependent variable, holding all else constant.
  Simple regression gives us the total effect, which is the sum of the direct and the indirect effects.
Path Analysis (cont.)
The total relation of Y to X1 is equal to the direct plus indirect relation.
  $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
  $\sum x_1 y = b_1 \sum x_1^2 + b_2 \sum x_1 x_2$
If we divide this equation by $\sum x_1^2$:
  $\frac{\sum x_1 y}{\sum x_1^2} = b_1 + b_2 \frac{\sum x_1 x_2}{\sum x_1^2}$
  $\frac{\sum x_1 y}{\sum x_1^2}$ is the regression coefficient of Y against X1, that is, the total relation (the simple regression coefficient).
  $\frac{\sum x_1 x_2}{\sum x_1^2}$ is the regression coefficient of X2 against X1, which we denote b.
Total Effect = Direct + Indirect = $b_1 + b\,b_2$
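The identity total = b1 + b·b2 can be checked numerically. The sketch below uses small made-up numbers purely for illustration (they are not from the lecture), and solves the two slope normal equations by hand; all names are our own.

```python
# Hypothetical, made-up data used only to check the algebra.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]
y = [3.0, 4.0, 8.0, 7.0, 12.0, 15.0]


def centered_sum(u, v):
    """Sum of products of deviations from the means (e.g. sum of x1*y)."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))


s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)

# Multiple regression slopes: solve the two normal equations
#   s1y = b1*s11 + b2*s12   and   s2y = b1*s12 + b2*s22
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det

b = s12 / s11        # simple regression coefficient of X2 against X1
total = s1y / s11    # simple regression coefficient of Y against X1

print(f"total = {total:.6f}, direct + indirect = {b1 + b * b2:.6f}")
```

The two printed numbers agree for any data set in which X1 and X2 are not perfectly collinear, because the identity is just the first normal equation divided by the sum of squares of x1.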
Path Analysis: Example 2
Example 2: Brady, Cooper and Hurley
Defining Unity
  Party unity scores are calculated as: (% voting in the majority - % voting in the minority)
Building the Causal Model
  Two components to party unity: internal and external factors.
    External factors define how homogeneous the constituent base of the party is.
    Internal factors have to do with the strength of party leadership.
  So we can write a causal model like this:
    External → Unity
    Internal → Unity
Path Analysis: Example 2 (cont.)
However, it is also thought that external factors influence internal factors.
  That is, when legislators from a party are united on the issues, they are more likely to give their leaders power to get things done.
Thus we add another line to our model:
  External → Unity
  Internal → Unity
  External → Internal
Path Analysis: Example 2 (cont.)
Results
  When this model was estimated, the results were:
    PARTY STRENGTH = .61 INTERNAL + .58 EXTERNAL
    INTERNAL = .66 EXTERNAL
Question
  What is the effect of External factors on Party Unity?
    Direct Effect = .58
    Indirect Effect = (.66) * (.61) = .40
    Total Effect = .58 + .40 = .98
Path Analysis: Example 3
Example 3: Commie Model from Shapiro
  What determines people's attitudes towards whether communists should be allowed to teach college?
Hypothesis
  Our hypothesis is that attitudes towards teaching depend on attitudes towards communism in general, party ID, education, and age.
Path Analysis: Example 3 (cont.)
Define the Variables
  First, define and recode all variables:
    TeachCom is a dichotomous variable, coded 1 if the respondent thought it was OK for communists to teach college.
    Smarts is years of education.
    PartyOn is the respondent's party ID. 0 stands for strongly Democrat, 6 for strongly Republican.
    ComPhile is how you think about communism as a system of government. Higher values mean that it's a good system.
    Years is your age.
Path Analysis: Example 3 (cont.)
Second, Report Descriptive Statistics
  Means gives the mean of each variable.
  Stddev gives their standard deviation.
  N gives the number of valid observations.
  Corr gives the correlations between variables.
  Sig tells us the significance of each correlation.
  Variable   Mean     Standard Deviation
  Years      45.86    18.021
  Smarts     12.843   3.005
  PartyOn    2.84     2.119
  ComPhile   1.68     0.758
  TeachCom   0.541    0.499
  Number of Cases = 932
Path Analysis: Example 3 (cont.)
Third, Specify Causal Model
  Attitudes towards teaching are determined by all the other variables.
  Attitude towards the communist system depends on party ID, education, and age.
  Party ID depends on education and age.
  Finally, education depends on age.
  [Diagram: causal model linking Years, Smarts, PartyOn, ComPhile, and TeachCom]
Path Analysis: Example 3 (cont.)
Note: How to Build a Causal Model
  First of all, what constitutes a valid causal model?
    For now, the answer is: no cycles.
    That is, you shouldn't be able to start at a point, follow arrows, and end up back at the same point.
  Second, once you have a causal model, how do you know which regressions to run?
    For each variable, see what arrows are going into it.
    Then run a regression with those variables as the independent variables.
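Both rules can be applied mechanically once the arrows are written down. The following sketch encodes the Example 3 model as (cause, effect) pairs, as described in the causal model specification; the function names `has_cycle` and `regressions_to_run` are our own, and the code only lists the regressions rather than estimating them.

```python
# Arrows of the Example 3 causal model, written as (cause, effect) pairs.
ARROWS = [
    ("Years", "Smarts"),
    ("Years", "PartyOn"), ("Smarts", "PartyOn"),
    ("Years", "ComPhile"), ("Smarts", "ComPhile"), ("PartyOn", "ComPhile"),
    ("Years", "TeachCom"), ("Smarts", "TeachCom"),
    ("PartyOn", "TeachCom"), ("ComPhile", "TeachCom"),
]


def regressions_to_run(arrows):
    """Map each variable with incoming arrows to its independent variables."""
    incoming = {}
    for cause, effect in arrows:
        incoming.setdefault(effect, []).append(cause)
    return incoming


def has_cycle(arrows):
    """Return True if following arrows can lead back to the starting point."""
    children = {}
    for cause, effect in arrows:
        children.setdefault(cause, []).append(effect)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:      # reached a node already on the current route
            return True
        visiting.add(node)
        found = any(visit(child) for child in children.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(children))


print("valid (no cycles):", not has_cycle(ARROWS))
for dependent, regressors in regressions_to_run(ARROWS).items():
    print(f"regress {dependent} on {', '.join(regressors)}")
```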
Path Analysis: Example 3 (cont.)
Fourth, Estimate the Regression Model
Regression commands
  How we specify our causal model determines what regressions we run.
  For instance, TeachCom has arrows going into it from all other variables, so we run the regression with all the variables.
  Then we take ComPhile, and regress it on Years, Smarts and PartyOn.
  And so on down the line.
Path Analysis: Example 3 (cont.)
Correlation Matrix
  Smarts is negatively correlated with Years.
    That means that older people tend to have had fewer years of schooling.
  PartyOn is negatively correlated with Years.
    Since higher PartyOn values mean more Republican, older people tend to be more Democratic.
  Years is negatively correlated with ComPhile and TeachCom.
    Older people tend to have more negative attitudes toward the communist system and to be against communists teaching college.
  Note:
    One-tailed p-values are reported beneath the correlation coefficient.
    These simply report bivariate relations.
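Path Analysis: Example 3 (cont.)
Regression Results
  [Path diagram with the estimated coefficient on each arrow:]
    Years → Smarts: -.044
    Years → PartyOn: -.0047
    Smarts → PartyOn: .053
    Years → ComPhile: -.006
    Smarts → ComPhile: .032
    PartyOn → ComPhile: -.029
    Years → TeachCom: -.003
    Smarts → TeachCom: .035
    PartyOn → TeachCom: -.015
    ComPhile → TeachCom: .16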
Path Analysis: Example 3 (cont.)
What is the effect of Years on TeachCom?
  Indirect Effect via ComPhile
    ComPhile alone: (-.006)(.16) = -.00096
  Indirect Effect via PartyOn
    PartyOn alone: (-.0047)(-.015) = .0000705
    PartyOn and ComPhile: (-.0047)(-.029)(.16) = .0000218
  Indirect Effect via Smarts
    Smarts alone: (-.044)(.035) = -.00154
    Smarts & PartyOn: (-.044)(.053)(-.015) = .000035
    Smarts & ComPhile: (-.044)(.032)(.16) = -.000225
    Smarts, PartyOn & ComPhile: (-.044)(.053)(-.029)(.16) = .000011
  Total Indirect Effects = -.00259
Path Analysis: Example 3 (cont.)
The total effect of age on people's attitudes toward teachers having communist beliefs:
  Total Effects = Direct + Indirect
  Total Indirect Effect = -.0026
  Total Effect = -.003 - .0026 = -.0056
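With this many routes it is easy to drop a path or a sign, so it can help to enumerate the paths programmatically. The sketch below takes the ten arrow coefficients from the regression results and walks every one-way route from Years to TeachCom; `path_values` is a hypothetical helper and assumes the model has no cycles.

```python
# Estimated coefficient on each arrow of the causal model (cause, effect).
COEFFICIENTS = {
    ("Years", "Smarts"): -0.044,
    ("Years", "PartyOn"): -0.0047,
    ("Years", "ComPhile"): -0.006,
    ("Years", "TeachCom"): -0.003,
    ("Smarts", "PartyOn"): 0.053,
    ("Smarts", "ComPhile"): 0.032,
    ("Smarts", "TeachCom"): 0.035,
    ("PartyOn", "ComPhile"): -0.029,
    ("PartyOn", "TeachCom"): -0.015,
    ("ComPhile", "TeachCom"): 0.16,
}


def path_values(start, end, coefficients):
    """Yield (path, value) for every route of one-way arrows from start to end.

    Assumes the model is acyclic; a cycle would make the walk never finish.
    """
    def walk(node, path, value):
        if node == end:
            yield path, value
            return
        for (cause, effect), coef in coefficients.items():
            if cause == node:
                # Multiply along the arrow and continue from its head.
                yield from walk(effect, path + [effect], value * coef)

    yield from walk(start, [start], 1.0)


paths = list(path_values("Years", "TeachCom", COEFFICIENTS))
direct = sum(value for path, value in paths if len(path) == 2)
indirect = sum(value for path, value in paths if len(path) > 2)
total = direct + indirect

for path, value in paths:
    print(" -> ".join(path), f"{value:.7f}")
print(f"direct = {direct:.4f}, indirect = {indirect:.5f}, total = {total:.4f}")
```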
Path Analysis: Example 3 (cont.)
Fifth, Interpretation:
  For each additional year of age, people's support for allowing a communist to teach decreases by .0056 units.
    .003 of those units come from the direct impact of older people simply having different views.
    .0026 of those units come from the indirect impact of age via education level, partisan identification, and what one thinks of communism as a system of government.
Homework Recap
Your homework assignment is to specify a Path Diagram.
Issues in the Article
  There is a dispute between American and European researchers on the effectiveness of AZT.
  Americans say that it works, and Europeans say that there's not enough evidence.
Homework
The U.S. View
  The U.S. allowed AZT to be distributed to HIV-positive individuals on the basis of a study completed in 1989.
  Usually the FDA requires that, to release a drug, the experimenters show a direct link:
    Drug → Health: Positive
  Instead of a direct link, researchers showed an indirect link:
    AZT → Markers: Positive
    Markers → Health: Positive
  If both of these correlations are positive, then so should be the total effect from AZT to health:
    AZT → Health: Positive?
Homework
The European View
  The European researchers said that although it's true that AZT raised the level of CD-4 markers, these markers didn't indicate any long-term improvement in health.
  So they say that the model looks like this:
    AZT → Markers: Positive
    Markers → Health: No effect
    AZT → Health: No effect
  If there's no link between CD-4 and health, then the overall link between AZT and health is also 0 on the basis of the information presented so far.
Homework
How to Resolve the Dispute
  What kind of evidence would they need to resolve this dispute?
    First, they could do studies to show that AZT has a direct effect on health. These studies take longer, but their conclusions are more reliable since they show a direct link.
    Or, they could find another marker, that is, another intermediate substance that AZT affects and that affects health.
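Notes on Multiple Regression
General Linear Model
  $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
  Takes into account not only the impact of X1 on Y but also the interaction between X1 and X2.
We apply the OLS criterion:
  Minimize $\sum d^2 = \sum (Y - \hat{Y})^2$
Yields 3 normal equations:
  $b_0 = \bar{Y} - b_1 \bar{X}_1 - b_2 \bar{X}_2$
  $\sum x_1 y = b_1 \sum x_1^2 + b_2 \sum x_1 x_2$
  $\sum x_2 y = b_1 \sum x_1 x_2 + b_2 \sum x_2^2$
  where $x_1 = X_1 - \bar{X}_1$, $x_2 = X_2 - \bar{X}_2$, $y = Y - \bar{Y}$
Notes on Multiple Regression (cont.)
Deriving the formula for b0:
  $Y = b_0 + b_1 X_1 + b_2 X_2$   (1)
Summing all the observations and dividing by n yields:
  $\frac{\sum Y_i}{n} = \frac{n b_0}{n} + b_1 \frac{\sum X_1}{n} + b_2 \frac{\sum X_2}{n}$
Which simplifies to:
  $\bar{Y} = b_0 + b_1 \bar{X}_1 + b_2 \bar{X}_2$   (2)
  $b_0 = \bar{Y} - b_1 \bar{X}_1 - b_2 \bar{X}_2$   (3)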
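Notes on Multiple Regression (cont.)
Deriving the formula for b1:
1. Multiply both sides of Eq. (1) by $X_1$:
  $X_1 Y = b_0 X_1 + b_1 X_1^2 + b_2 X_1 X_2$
2. Summing all the observations and dividing by n:
  $\frac{\sum X_1 Y}{n} = \frac{b_0 \sum X_1}{n} + \frac{b_1 \sum X_1^2}{n} + \frac{b_2 \sum X_1 X_2}{n}$
  $\frac{\sum X_1 Y}{n} = b_0 \bar{X}_1 + \frac{b_1 \sum X_1^2}{n} + \frac{b_2 \sum X_1 X_2}{n}$   (4)
3. Multiply Equation (2) by $\bar{X}_1$:
  $\bar{X}_1 \bar{Y} = b_0 \bar{X}_1 + b_1 \bar{X}_1^2 + b_2 \bar{X}_1 \bar{X}_2$   (5)
4. Subtract Equation (5) from Equation (4):
  $\frac{1}{n} \sum X_1 Y - \bar{X}_1 \bar{Y} = b_1 \left( \frac{1}{n} \sum X_1^2 - \bar{X}_1^2 \right) + b_2 \left( \frac{1}{n} \sum X_1 X_2 - \bar{X}_1 \bar{X}_2 \right)$
Notes on Multiple Regression (cont.)
Multiplying both sides by n:
  $\sum X_1 Y - n \bar{X}_1 \bar{Y} = b_1 \left( \sum X_1^2 - n \bar{X}_1^2 \right) + b_2 \left( \sum X_1 X_2 - n \bar{X}_1 \bar{X}_2 \right)$
In general,
  $\sum x^2 = \sum (X_i - \bar{X})^2$
  $= \sum (X_i^2 - 2 X_i \bar{X} + \bar{X}^2)$
  $= \sum X_i^2 - 2 \bar{X} \sum X_i + n \bar{X}^2$
  $= \sum X_i^2 - 2 n \bar{X}^2 + n \bar{X}^2$
  $= \sum X_i^2 - n \bar{X}^2$
  The same expansion gives $\sum x_1 y = \sum X_1 Y - n \bar{X}_1 \bar{Y}$ and $\sum x_1 x_2 = \sum X_1 X_2 - n \bar{X}_1 \bar{X}_2$.
Therefore we can rewrite:
  $\sum x_1 y = b_1 \sum x_1^2 + b_2 \sum x_1 x_2$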
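The sum-of-squares shortcut used in this derivation can be confirmed numerically. As an illustration only, the sketch below reuses the manatee data from the start of the lecture, treating registrations as X and deaths as Y; the variable names are our own.

```python
# Manatee data from the first example: X = registrations (1000s), Y = deaths.
registrations = [447, 460, 481, 498, 513, 512, 526, 559, 585,
                 614, 645, 675, 711, 719, 716, 716, 716, 735]
deaths = [13, 21, 24, 16, 24, 20, 15, 34, 33,
          33, 39, 43, 50, 47, 53, 38, 35, 49]

n = len(registrations)
x_bar = sum(registrations) / n
y_bar = sum(deaths) / n

# Sum of squared deviations, computed both ways.
deviation_form = sum((x - x_bar) ** 2 for x in registrations)
shortcut_form = sum(x * x for x in registrations) - n * x_bar ** 2

# Sum of cross-products of deviations, computed both ways.
cross_deviation = sum((x - x_bar) * (y - y_bar)
                      for x, y in zip(registrations, deaths))
cross_shortcut = sum(x * y for x, y in zip(registrations, deaths)) - n * x_bar * y_bar

print(f"{deviation_form:.3f} vs {shortcut_form:.3f}")
print(f"{cross_deviation:.3f} vs {cross_shortcut:.3f}")
```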
Path Analysis: Example 3 (cont.)
Via ComPhile
  ComPhile alone: (-.006)(.16) = -.00096
Via PartyOn
  PartyOn alone: (-.0047)(-.015) = .0000705
  PartyOn and ComPhile: (-.0047)(-.029)(.16) = .0000218
Via Smarts
  Smarts alone: (-.044)(.035) = -.00154
  Smarts & PartyOn: (-.044)(.053)(-.015) = .000035
  Smarts & ComPhile: (-.044)(.032)(.16) = -.000225
  Smarts, PartyOn & ComPhile: (-.044)(.053)(-.029)(.16) = .000011
Total Indirect Effect = -.0026
Total Effect = -.003 - .0026 = -.0056