TRANSCRIPT
PUBL0055: Introduction to Quantitative Methods
Lecture 6: Regression (Causality)
Jack Blumenau and Benjamin Lauderdale
1 / 49
Motivation
Does health insurance improve health outcomes? (Revisited)
Is there a causal effect of health insurance on actual levels of health? In week 2, we used this example to show that randomized experiments represent a “gold standard” for causal inference, and that drawing causal conclusions from observational data is complicated by “confounding” relationships.
• Y (Dependent variable): health
• “Would you say your health in general is excellent (5), very good (4), good (3), fair (2), or poor (1)?”
• X (Independent variable): insured
• “Do you have health insurance?” TRUE = Insured, FALSE = Not insured
2 / 49
Data sources
Observational data
National Health Interview Survey (NHIS, N = 19,996): an annual survey of the US population that asks questions about health and health insurance.

Experimental data
RAND Health Insurance Experiment (RAND, N = 2,702): an experiment conducted between 1974 and 1982 in the US. In this experiment, researchers randomly allocated individuals to receive health insurance.

In both cases we also have information from some of the other questions on the survey (gender, income, race, etc).
3 / 49
Lecture Outline
Regression and randomized experiments
Confounding
Regression discontinuity designs
Conclusion
4 / 49
Regression and randomized experiments
Randomization and regression
• Revision (1): The difference in means provides an unbiased estimate of the causal effect when our treatment variable is randomly assigned to units (week 2).
• Revision (2): The coefficient associated with a binary variable in a simple linear regression is equal to the difference in means estimate (week 4).
• Implication: When treatment is randomized, the linear regression coefficient provides an unbiased estimate of the causal effect!
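As a sketch of this equivalence (using simulated data, not the RAND data; variable names are ours), we can check that the OLS slope on a randomly assigned binary treatment matches the difference in means exactly:

```r
# Simulated illustration: with a binary treatment, the regression slope
# equals the difference in group means.
set.seed(1)
n <- 1000
treated <- sample(c(TRUE, FALSE), n, replace = TRUE) # random assignment
outcome <- 3 + 0.5 * treated + rnorm(n)              # true effect = 0.5

diff_in_means <- mean(outcome[treated]) - mean(outcome[!treated])
slope <- unname(coef(lm(outcome ~ treated))["treatedTRUE"])

all.equal(slope, diff_in_means) # TRUE: the two estimates coincide
```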
5 / 49
Regression and randomized experiments (example)
Let’s calculate the difference in means using the experimental data:

## Mean health level for insured and uninsured individuals
mean_health_insured <- mean(rand$health[rand$insured == TRUE])
mean_health_uninsured <- mean(rand$health[rand$insured == FALSE])
mean_health_insured - mean_health_uninsured
## [1] -0.01895885
## Regression of health on insurance status
lm(health ~ insured, rand)

...
## (Intercept) insuredTRUE
##     3.40702    -0.01896
...
Implication: The causal effect of insurance on health is very close to zero.
6 / 49
Benefits of using regression to analyse experiments
1. Heterogeneous treatment effects
• Do the effects of the treatment vary by type of unit?
• You can already do this: interactions!
2. Non-binary treatments
• Is the treatment you care about continuous? Or categorical?
• You can already do this: factor/continuous variables in regression!
3. Increasing the “precision” of our estimates
• Controlling for other factors that determine the outcome can make the estimates of the treatment effects more precise
• We will cover this in future weeks!
7 / 49
Heterogeneous treatment effects
In week 2, we were concerned with estimating the average treatment effect
• What is the average difference in potential outcomes under treatment and control across all units in our sample?
However, we may care about whether the treatment has different effects for different types of individuals.

• Does insurance status matter more for low-income than high-income individuals?

Fortunately, we can use interactions between explanatory variables to answer this.
8 / 49
Interactions (revision)
• An interaction exists when the relationship between one explanatory variable (𝑋1) and our dependent variable (𝑌) depends on the value of another explanatory variable (𝑋2)
• We can add interactions into the linear regression model by including the product of two explanatory variables in our model:
𝑌𝑖 = 𝛼 + 𝛽1𝑋1 + 𝛽2𝑋2 + 𝛽3(𝑋1 ∗ 𝑋2) + 𝜖𝑖
• We can interpret interactions by calculating the fitted values for various cases of interest
9 / 49
Heterogeneous treatment effects
heterogeneous_effects_model <- lm(health ~ insured * income, rand)
summary(heterogeneous_effects_model)

...
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)         3.195337   0.072245  44.229  < 2e-16 ***
## insuredTRUE         0.098969   0.079757   1.241 0.214754
## income              0.006551   0.001963   3.338 0.000855 ***
## insuredTRUE:income -0.003536   0.002181  -1.621 0.105126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.767 on 2698 degrees of freedom
## Multiple R-squared: 0.007878, Adjusted R-squared: 0.006775
## F-statistic: 7.141 on 3 and 2698 DF, p-value: 8.932e-05
...
10 / 49
Heterogeneous treatment effects
# Treatment effect for low-income individuals
predict(heterogeneous_effects_model,
        newdata = data.frame(insured = c(F, T), income = 10)) # income in thousands

##       1       2
## 3.26085 3.32446

# Treatment effect for high-income individuals
predict(heterogeneous_effects_model,
        newdata = data.frame(insured = c(F, T), income = 30)) # income in thousands

##        1        2
## 3.391874 3.384768
Some suggestion that the treatment slightly increases reported health for low-income individuals, but not for high-income individuals.
11 / 49
Non-binary treatments
We have generally been focused on treatments that are binary – i.e. where units are either assigned to treatment or to control.

However, there are many examples where our (randomly assigned) treatment may not be binary. For instance:

1. Effect of different spending levels on budget approval (continuous)
2. Effect of types of campaign materials on voter turnout (categorical)

In our healthcare example, the treatment in the experiment was actually categorical.
12 / 49
Non-binary treatments
table(rand$insured, rand$plantype)

##
##       Catastrophic Coinsurance Deductible Free
## FALSE          491           0          0    0
## TRUE             0         727        593  891

• “Catastrophic” – Individuals pay for all health costs
• “Coinsurance” – Individuals pay 25-50% of costs
• “Deductible” – Costs capped at $150
• “Free” – Individuals pay nothing

Question: Are the causal effects the same for all three treatment conditions?
13 / 49
Non-binary treatments (factor)
categorical_treatment_model <- lm(health ~ plantype, rand)
summary(categorical_treatment_model)

...
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          3.407016   0.034684  98.231   <2e-16 ***
## plantypeCoinsurance  0.044112   0.044893   0.983   0.3259
## plantypeDeductible  -0.009908   0.046893  -0.211   0.8327
## plantypeFree        -0.076444   0.043196  -1.770   0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7685 on 2698 degrees of freedom
## Multiple R-squared: 0.003769, Adjusted R-squared: 0.002661
## F-statistic: 3.403 on 3 and 2698 DF, p-value: 0.01701
...
14 / 49
Non-binary treatments (factor)
# Treatment effects for different plans
predict(categorical_treatment_model,
        newdata = data.frame(plantype = c("Catastrophic", "Coinsurance",
                                          "Deductible", "Free")))

##        1        2        3        4
## 3.407016 3.451128 3.397108 3.330572

The average outcome is similar for all three treatment groups, which are similar to the control group.
15 / 49
Randomization and regression
Using regression to analyse the RAND health experiment, we have seen that:

1. The average causal effect of health insurance on health outcomes is very small
2. The causal effect is somewhat more positive for low-income individuals, though it is still small
3. There is little difference in the estimated causal effects of different insurance plan-types

Sidenote: Despite these modest effects, the RAND experiment showed much larger effects in terms of use of health care services. Self-reported health may not be the most important health care outcome!
16 / 49
Confounding
What happens when we don’t have experimental data?
So far, we have focused on how we can use regression to analyse experimental data.

But most data we have in the social sciences is not experimental; it is observational.

Regression also provides a framework for controlling for confounding factors that arise in observational data.
17 / 49
Confounding recap
Confounding
Confounding exists when there are differences, other than the treatment, between treatment and control groups which may affect the outcome.

Imagine that we observe a positive (bivariate) relationship between the following variables:

• National chocolate consumption → Number of Nobel prizes
• Job market training → Securing employment
Question: Can you think of any potentially confounding variables?
18 / 49
19 / 49
Multiple Linear Regression & OVB
In the context of regression, failing to adjust for potentially confounding factors is usually known as omitted variable bias (OVB)

• OVB occurs when two conditions are met:

1. When our X variable (𝑋1) is correlated with another X variable (𝑋2) that has not been included in the analysis
2. When the omitted variable (𝑋2) is also correlated with our Y variable

• We call an omitted variable that leads to such bias a confounding variable

Intuition: The association that we observe between 𝑋1 and 𝑌 might be more meaningfully attributed to 𝑋2
20 / 49
Multiple Linear Regression & OVB
If OVB is present, and we do nothing to address it, then our estimate of 𝛽1 will be biased (wrong)

                     𝑐𝑜𝑟(𝑋1, 𝑋2) > 0    𝑐𝑜𝑟(𝑋1, 𝑋2) < 0
𝑐𝑜𝑟(𝑋2, 𝑌) > 0      ̂𝛽1 too big          ̂𝛽1 too small
𝑐𝑜𝑟(𝑋2, 𝑌) < 0      ̂𝛽1 too small        ̂𝛽1 too big
• 𝑐𝑜𝑟(𝑋1, 𝑋2) is the correlation between 𝑋1 and 𝑋2
• 𝑐𝑜𝑟(𝑋2, 𝑌 ) is the correlation between 𝑋2 and 𝑌
Implication: Depending on the relationship that 𝑋2 has with 𝑋1 and 𝑌, ̂𝛽1 could be either too big or too small!
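The direction of the bias can be checked with a small simulation (hypothetical data; here 𝑐𝑜𝑟(𝑋1, 𝑋2) > 0 and 𝑐𝑜𝑟(𝑋2, 𝑌) > 0, so the top-left cell of the table applies and the omitted-variable estimate should come out too big):

```r
# Simulated illustration of OVB: x2 is positively correlated with both
# x1 and y, so omitting x2 biases the estimate of beta_1 upwards.
set.seed(1)
n <- 10000
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n)             # cor(x1, x2) > 0
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true beta_1 = 2; cor(x2, y) > 0

coef(lm(y ~ x1))["x1"]       # omitting x2: estimate well above 2
coef(lm(y ~ x1 + x2))["x1"]  # controlling for x2: estimate close to 2
```

Flipping the sign of either correlation in the simulation flips the direction of the bias, matching the other cells of the table.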
21 / 49
Controlling for confounders of health insurance
Omitted variable bias is a problem when
1. The independent variable is correlated with other factors
and
2. The dependent variable is also correlated with those factors
             health    age  female  non_white  years_educ  income
Uninsured       3.6   41.0    46.4       17.8        11.6    42.9
Insured         4.0   43.3    49.2       15.3        14.3   101.3
Difference      0.3    2.3     2.8       -2.4         2.7    58.4
                          health
               (1)      (2)      (3)     (4)     (5)
age          −0.02
female                −0.001
non_white                      −0.13
years_educ                              0.08
income                                         0.005
Constant      4.67     3.89     3.91    2.75    3.46

Observations 19,996   19,996   19,996  19,996  19,996
R2            0.03    0.0000   0.003    0.07    0.07
Both conditions are met in this example.

Question: What can we do about it?
22 / 49
“Controlling for” confounders
One strategy for dealing with confounding is to compare average outcomes between treatment and control units while controlling for potential confounders.

Example (subclassification): calculate differences between treatment and control within levels of a confounding variable:

## Mean health level insured v uninsured (all)
mean_health_insured <- mean(nhis$health[nhis$insured == TRUE])
mean_health_uninsured <- mean(nhis$health[nhis$insured == FALSE])
mean_health_insured - mean_health_uninsured
## [1] 0.3262003
## Mean health for insured v uninsured (low income)
insured_low_mean <- mean(nhis$health[nhis$insured == T &
                                       nhis$income_cat == "Low"])
uninsured_low_mean <- mean(nhis$health[nhis$insured == F &
                                         nhis$income_cat == "Low"])
insured_low_mean - uninsured_low_mean
## [1] 0.1010293
## Mean health for insured v uninsured (high income)
insured_high_mean <- mean(nhis$health[nhis$insured == T &
                                        nhis$income_cat == "High"])
uninsured_high_mean <- mean(nhis$health[nhis$insured == F &
                                          nhis$income_cat == "High"])
insured_high_mean - uninsured_high_mean
## [1] 0.1542666
Consequence: The effects of insurance on health are much smaller when we control for income.
23 / 49
“Controlling for” confounders with regression
Regression also allows us to control for other variables, but is more flexible than subclassification:

• If we believe that the association between 𝑋1 and 𝑌 is confounded by 𝑋2, and we are able to measure 𝑋2, we can control for 𝑋2
• We can hold the value of 𝑋2 constant while estimating the association between 𝑋1 and 𝑌
• If we believe that the association between 𝑋1 and 𝑌 is confounded by other variables, we can also control for those variables!

Intuition: The idea is the same as for subclassification, but regression allows us to calculate differences between treatment and control while holding multiple variables constant.
24 / 49
Controlling for confounders of health insurance
# Naive model
nhis_model <- lm(health ~ insured, data = nhis)

# Model controlling for income
nhis_model_with_income <- lm(health ~ insured + income, data = nhis)

# Model controlling for many variables
nhis_model_with_covariates <- lm(health ~ insured + income +
                                   age + female +
                                   non_white + years_educ,
                                 data = nhis)
25 / 49
Controlling for confounders of health insurance
                      health
               (1)      (2)      (3)
insured       0.33     0.06     0.02
income                0.004    0.004
age                            −0.02
female                         −0.05
non_white                      −0.14
years_educ                      0.05
Constant      3.62     3.43     3.81

Observations 19,996   19,996   19,996
R2            0.02     0.07     0.13
1. The ̂𝛽 coefficient on insured decreases when controlling for other variables
2. ̂𝛽insured in model 3 is much closer to the experimental estimate

Question: Does ̂𝛽insured represent the causal effect of insurance on health?

Answer: Only if we are willing to assume that we are controlling for all confounders/omitted variables.

Example: One potential missing control variable here is baseline health levels. People who were less healthy in the past may be less healthy now and more likely to take out insurance!
26 / 49
Controlling for confounders of health insurance (experimental data)
Note that the same does not apply with experimental data!
                      health
               (1)      (2)      (3)
insured      −0.02    −0.01    −0.02
income                0.004    0.003
age                            −0.01
female                         −0.03
non_white                      −0.13
years_educ                      0.04
Constant      3.41     3.29     3.24

Observations  2,702    2,702    2,702
R2           0.0001     0.01     0.09
• The coefficient for insured is essentially the same in model 1 and model 2. Why?
• OVB is present when omitted variables are correlated with both independent and dependent variables.
• Insurance status is randomly assigned, so cannot be correlated with other factors.
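This stability can be seen in a quick simulation (hypothetical data, not the RAND survey; variable names are ours): when treatment is randomized, adding a covariate to the model barely moves the treatment coefficient.

```r
# Simulated illustration: a randomly assigned treatment is uncorrelated
# (in expectation) with background covariates, so controlling for them
# leaves the treatment coefficient essentially unchanged.
set.seed(1)
n <- 5000
income  <- rnorm(n, mean = 50, sd = 10)               # pre-existing covariate
insured <- sample(c(TRUE, FALSE), n, replace = TRUE)  # randomized treatment
health  <- 3 + 0.2 * insured + 0.01 * income + rnorm(n)

coef(lm(health ~ insured))["insuredTRUE"]           # roughly 0.2
coef(lm(health ~ insured + income))["insuredTRUE"]  # nearly identical
```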
27 / 49
Controlling for confounders with regression
Confounders, control, and causal inference
In order to claim that estimates based on observational data represent causal differences, you have to assume that you have controlled for all possible confounding variables.
This is difficult because you may not:
• know what all the confounders are
• be able to measure some confounders
• be able to observe some confounders
28 / 49
Task
Imagine that you would like to estimate the causal effect of school class sizes (X) on student test scores in maths (Y). You cannot run an experiment, but you can collect observational data from all the schools in a city like London.

You decide to use regression to analyse your data. Speak to your neighbour about which variables you think you should control for in your regression.
We will share our lists after the break.
29 / 49
Break
30 / 49
Regression discontinuity designs
Do smaller classes cause better grades?
Example 1: The STAR Experiment
The STAR project was a randomized experiment designed to test the causal effects of class sizes on learning. Students in Tennessee schools were randomly assigned either to regular sized classes (22-25 students, the control group) or to smaller classes (15-17 students, the treatment group). We can use the results of this experiment to see the effect of small class sizes on student learning.
• Y (Dependent variable): grade
• Student grade on a standardised math test (0 to 100)
• X (Independent variable): small_class
• TRUE = student in small class, FALSE = student in regular sized class
31 / 49
Experimental results
# Treated mean grade
mean_grade_treated <- mean(star$grade[star$small_class == TRUE])
mean_grade_treated
## [1] 48.04885
# Control mean grade
mean_grade_control <- mean(star$grade[star$small_class == FALSE])
mean_grade_control
## [1] 45.82546
# Difference in means
mean_grade_treated - mean_grade_control
## [1] 2.22339
The results suggest that small classes cause modest improvements in student grades. We will use this as a rough benchmark for the observational analysis.
32 / 49
Do smaller classes cause better grades?
Example 2: Data from Israeli schools
Angrist and Lavy (1999) collect observational data from schools in Israel. Looking at data on 646 schools, they record class size, average student score on a standardised maths test, and the percentage of students in the school from disadvantaged backgrounds. We will ignore differences in context between Israel and Tennessee and try to replicate the experimental finding of the STAR experiment using this observational data.
• Y (Dependent variable): grade
• Average student grade on a standardised math test (0 to 100).
• 𝑋1 (Independent variable): class_size
• Number of students in the class
• 𝑋2 (Independent variable): pct_disadvantaged
• % of students from disadvantaged background
33 / 49
Observational data
[Figure: scatterplot, “Class size and average grade”; x-axis: class size (20-40); y-axis: average grade (20-50)]
34 / 49
Observational results
simple_model <- lm(grade ~ class_size, schools)
              grade
class_size     0.15
Constant      33.63

Observations    646
R2             0.02
Question: What is the effect of aone-student increase in class size?
• One student more is associated with a 0.15 increase in average grades
• This is the opposite of what the experimental results suggest!

What are some potentially confounding factors here?
35 / 49
Observational results
simple_model <- lm(grade ~ class_size, schools)
multiple_model <- lm(grade ~ class_size + pct_disadvantaged, schools)

                         grade
                     (1)      (2)
class_size          0.15     0.01
pct_disadvantaged           −0.19
Constant           33.63    41.02

Observations         646      640
R2                  0.02     0.31
• The effect is much weaker, very close to zero, controlling for pct_disadvantaged
• This is still not what the experimental results suggest
• Have we controlled for all confounding factors?
36 / 49
Observational results
Can we get closer to the experimental estimate by controlling for other features of schools/pupils?

No! We don’t have any other variables to use as controls.

Central problem with using regression “control” strategies for causal inference: if you can’t control for relevant factors, you cannot eliminate confounding.
37 / 49
An alternative approach
Problems:
1. Observational estimates are different from experimental estimates
2. The observational estimate is probably subject to confounding
3. We cannot control for lots of relevant confounding variables
Solutions:
1. Use over-time variation in our outcome variable to hold unobserved factors constant (next week)
2. Look for something that approximates random variation in our explanatory variable (this week)
38 / 49
Regression discontinuity designs
In a regression discontinuity design (RDD) we look for a subset of units where we believe that our explanatory variable occurs as if at random.

If we believe that the treatment is randomly assigned for some units, we can compare the outcomes for those units to estimate the causal effect!

In our example, for which students might class size be assigned “as if at random”?
39 / 49
Some detail about class sizes in Israel
• Class size in Israeli schools is capped at 40 students
• As soon as a class gets to 41 students, it is split into two classes (of 20 and 21)
• → In a year group with 40 students, each will be in a class of 40
• → In a year group with 41 students, each will be in a class of 20/21
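The splitting rule above (a simplified sketch of the “Maimonides’ rule” that Angrist and Lavy exploit; the function name is ours) can be written down directly: a cohort is divided into the smallest number of classes that keeps each class at or below 40 students.

```r
# Sketch of the class-size rule (simplified): predicted average class size
# when a cohort is split into the fewest classes of at most 40 students.
predicted_class_size <- function(cohort_size) {
  n_classes <- ceiling(cohort_size / 40) # one more class each time 40 is passed
  cohort_size / n_classes                # average class size after the split
}

predicted_class_size(40) # 40   (one full class)
predicted_class_size(41) # 20.5 (split into classes of 20 and 21)
predicted_class_size(81) # 27   (three classes)
```

The sharp drop from 40 to 20.5 as the cohort grows by one student is exactly the discontinuity the design relies on.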
40 / 49
Some detail about class sizes in Israel
[Figure: scatterplot, “Year group size and class size”; x-axis: year group size (20-80); y-axis: class size (20-40)]

• Observation: A small change in year group size leads to a large change in class size
• Assumption 1: In the area around 40, class size is “as good as” randomly assigned
• Assumption 2: Nothing other than class size changes in the area around 40 students
Intuition: Schools with 40 students will be very like those with 41. We can compare average grades of schools just above/below 40 to calculate the causal effect.
41 / 49
RDD visualization
[Figure: scatterplot, “Year group size and average grade”; x-axis: year group size (20-80); y-axis: average grade (25-50); a red vertical line marks the 40-student discontinuity]

• Note that the x-axis is cohort size, not class size
• Units just below the red line have large classes
• Units just above the red line have small classes
• In an RDD, we compare the outcomes for these two groups
42 / 49
Regression discontinuity designs
The key intuition of the RDD is to compare units just above and just below a “discontinuity”. We can use regression to calculate the average value of the outcome for these units.
Procedure:
1. Fit a regression for the observations on one side of the discontinuity
2. Fit a regression for the observations on the other side of the discontinuity
3. Calculate the fitted values for both models at the point of the discontinuity
4. The difference in fitted values is the causal effect!
43 / 49
RDD visualization
[Figure: scatterplot, “Year group size and average grade”; x-axis: year group size (20-80); y-axis: average grade (25-50); separate regression lines are fitted on each side of the 40-student discontinuity, and the gap between the fitted lines at 40 is labelled “Causal effect”]
44 / 49
RDD code
1. Fit a regression for one side of the discontinuity

below_model <- lm(grade ~ cohort_size, schools[schools$cohort_size <= 40, ])

2. Fit a regression for the other side of the discontinuity

above_model <- lm(grade ~ cohort_size, schools[schools$cohort_size > 40, ])

3. Calculate the fitted values for both models at the discontinuity

fitted_a <- predict(below_model, newdata = data.frame(cohort_size = 40))
fitted_b <- predict(above_model, newdata = data.frame(cohort_size = 40))

4. The difference in fitted values is the causal effect!

fitted_b - fitted_a
## 1## 1.874213
Note that the estimate is positive, implying that smaller classes improve test scores.
45 / 49
Summing up RDDs
• Main advantage of RDD: high internal validity
• So long as nothing else happens at the discontinuity, the RDD estimate gives us a causal effect
• We can estimate causal effects even without having all the possible confounding covariates

• Main disadvantage of RDD: low external validity

• This design really only tells us the effect of treatment at the point of the discontinuity, which may not be what we care about!
• e.g. the effect of smaller classes for school years above and below 40 students

• Good research designs require a lot of substantive knowledge!

• We needed to know specifics about Israeli education for our RDD
• Pay attention in your substantive classes!
46 / 49
Conclusion
Regression and causality
When can we interpret a regression coefficient causally?
1. Randomized experiments

• The coefficient on a binary treatment is an estimate of the average treatment effect

2. Observational studies

• We can only interpret coefficients causally when we have controlled for all confounders as additional X variables
• Confounders: variables that cause both the treatment and the outcome

3. Regression discontinuity designs

• With a suitable treatment discontinuity, you may be able to use regression for causal inference even without controlling for all confounders
47 / 49
Next week
1. Regression for difference-in-differences
2. Panel data
3. More on regression specification
48 / 49
Seminar
In seminars this week, you will learn to …
1. …analyse experimental data using regression.
2. …argue about whether regression coefficients are good estimates of causal effects.
3. …implement regression discontinuity designs.
49 / 49