TRANSCRIPT
PUBL0055: Introduction to Quantitative Methods
Lecture 6: Regression (Causality)
Jack Blumenau and Benjamin Lauderdale
1 / 49
Motivation
Does health insurance improve health outcomes? (Revisited)
Is there a causal effect of health insurance on actual levels of health? In week 2, we used this example to show that randomized experiments represent a “gold standard” for causal inference, and that drawing causal conclusions from observational data is complicated by “confounding” relationships.
• Y (Dependent variable): health
• “Would you say your health in general is excellent (5), very good (4), good (3), fair (2), or poor (1)?”
• X (Independent variable): insured
• “Do you have health insurance?” TRUE = Insured, FALSE = Not insured
2 / 49
Data sources
Observational data
National Health Interview Survey (NHIS, N = 19,996): an annual survey of the US population that asks questions about health and health insurance.

Experimental data
RAND Health Insurance Experiment (RAND, N = 2,702): an experiment conducted between 1974 and 1982 in the US. In this experiment, researchers randomly allocated individuals to receive health insurance.

In both cases we also have information from some of the other questions on the survey (gender, income, race, etc).
3 / 49
Lecture Outline
Regression and randomized experiments
Confounding
Regression discontinuity designs
Conclusion
4 / 49
Regression and randomized experiments
Randomization and regression
• Revision (1): The difference in means provides an unbiased estimate of the causal effect when our treatment variable is randomly assigned to units (week 2).
• Revision (2): The coefficient associated with a binary variable in a simple linear regression is equal to the difference in means estimate (week 4).
• Implication: When treatment is randomized, the linear regression coefficient provides an unbiased estimate of the causal effect!
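As a sketch of this equivalence (using simulated data, not the RAND data; variable names are ours), we can check that the OLS slope on a randomly assigned binary treatment matches the difference in means exactly:

```r
# Simulated illustration: with a binary treatment, the regression slope
# equals the difference in group means.
set.seed(1)
n <- 1000
treated <- sample(c(TRUE, FALSE), n, replace = TRUE) # random assignment
outcome <- 3 + 0.5 * treated + rnorm(n)              # true effect = 0.5

diff_in_means <- mean(outcome[treated]) - mean(outcome[!treated])
slope <- unname(coef(lm(outcome ~ treated))["treatedTRUE"])

all.equal(slope, diff_in_means) # TRUE: the two estimates coincide
```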
5 / 49
Regression and randomized experiments (example)
Let’s calculate the difference in means using the experimental data:

## Mean health level for insured and uninsured individuals
mean_health_insured <- mean(rand$health[rand$insured == TRUE])
mean_health_uninsured <- mean(rand$health[rand$insured == FALSE])
mean_health_insured - mean_health_uninsured
## [1] -0.01895885
## Regression of health on insurance status
lm(health ~ insured, rand)

...
## (Intercept) insuredTRUE
##     3.40702    -0.01896
...
Implication: The causal effect of insurance on health is very close to zero.
6 / 49
Benefits of using regression to analyse experiments
1. Heterogeneous treatment effects
• Do the effects of the treatment vary by type of unit?
• You can already do this: interactions!
2. Non-binary treatments
• Is the treatment you care about continuous? Or categorical?
• You can already do this: factor/continuous variables in regression!
3. Increasing the “precision” of our estimates
• Controlling for other factors that determine the outcome can make the estimates of the treatment effects more precise
• We will cover this in future weeks!
7 / 49
Heterogeneous treatment effects
In week 2, we were concerned with estimating the average treatment effect
• What is the average difference in potential outcomes under treatment and control across all units in our sample?
However, we may care about whether the treatment has different effects for different types of individuals.

• Does insurance status matter more for low-income than high-income individuals?

Fortunately, we can use interactions between explanatory variables to answer this.
8 / 49
Interactions (revision)
• An interaction exists when the relationship between one explanatory variable (𝑋1) and our dependent variable (𝑌) depends on the value of another explanatory variable (𝑋2)
• We can add interactions into the linear regression model by including the product of two explanatory variables in our model:
𝑌𝑖 = 𝛼 + 𝛽1𝑋1 + 𝛽2𝑋2 + 𝛽3(𝑋1 ∗ 𝑋2) + 𝜖𝑖
• We can interpret interactions by calculating the fitted values for various cases of interest
9 / 49
Heterogeneous treatment effects
heterogeneous_effects_model <- lm(health ~ insured * income, rand)
summary(heterogeneous_effects_model)

...
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)         3.195337   0.072245  44.229  < 2e-16 ***
## insuredTRUE         0.098969   0.079757   1.241 0.214754
## income              0.006551   0.001963   3.338 0.000855 ***
## insuredTRUE:income -0.003536   0.002181  -1.621 0.105126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.767 on 2698 degrees of freedom
## Multiple R-squared: 0.007878, Adjusted R-squared: 0.006775
## F-statistic: 7.141 on 3 and 2698 DF, p-value: 8.932e-05
...
10 / 49
Heterogeneous treatment effects
# Treatment effect for low-income individuals
predict(heterogeneous_effects_model,
        newdata = data.frame(insured = c(F, T), income = 10)) # income in thousands

##       1       2
## 3.26085 3.32446

# Treatment effect for high-income individuals
predict(heterogeneous_effects_model,
        newdata = data.frame(insured = c(F, T), income = 30)) # income in thousands

##        1        2
## 3.391874 3.384768
Some suggestion that the treatment slightly increases reported health for low-income individuals, but not for high-income individuals.
11 / 49
Non-binary treatments
We have generally been focused on treatments that are binary – i.e. where units are either assigned to treatment or to control.

However, there are many examples where our (randomly assigned) treatment may not be binary. For instance:

1. Effect of different spending levels on budget approval (continuous)
2. Effect of types of campaign materials on voter turnout (categorical)

In our healthcare example, the treatment in the experiment was actually categorical.
12 / 49
Non-binary treatments
table(rand$insured, rand$plantype)

##
##       Catastrophic Coinsurance Deductible Free
## FALSE          491           0          0    0
## TRUE             0         727        593  891

• “Catastrophic” – Individuals pay for all health costs
• “Coinsurance” – Individuals pay 25-50% of costs
• “Deductible” – Costs capped at $150
• “Free” – Individuals pay nothing

Question: Are the causal effects the same for all three treatment conditions?
13 / 49
Non-binary treatments (factor)
categorical_treatment_model <- lm(health ~ plantype, rand)
summary(categorical_treatment_model)

...
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          3.407016   0.034684  98.231   <2e-16 ***
## plantypeCoinsurance  0.044112   0.044893   0.983   0.3259
## plantypeDeductible  -0.009908   0.046893  -0.211   0.8327
## plantypeFree        -0.076444   0.043196  -1.770   0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7685 on 2698 degrees of freedom
## Multiple R-squared: 0.003769, Adjusted R-squared: 0.002661
## F-statistic: 3.403 on 3 and 2698 DF, p-value: 0.01701
...
14 / 49
Non-binary treatments (factor)
# Treatment effects for different plans
predict(categorical_treatment_model,
        newdata = data.frame(plantype = c("Catastrophic", "Coinsurance",
                                          "Deductible", "Free")))

##        1        2        3        4
## 3.407016 3.451128 3.397108 3.330572

The average outcome is similar for all three treatment groups, which are similar to the control group.
15 / 49
Randomization and regression
Using regression to analyse the RAND health experiment, we have seen that:

1. The average causal effect of health insurance on health outcomes is very small
2. The causal effect is somewhat more positive for low-income individuals, though it is still small
3. There is little difference in the estimated causal effects of different insurance plan-types

Sidenote: Despite these modest effects, the RAND experiment showed much larger effects in terms of use of health care services. Self-reported health may not be the most important health care outcome!
16 / 49
Confounding
What happens when we don’t have experimental data?
So far, we have focused on how we can use regression to analyse experimental data.

But most data we have in the social sciences is not experimental; it is observational.

Regression also provides a framework for controlling for confounding factors that arise in observational data.
17 / 49
Confounding recap
Confounding
Confounding exists when there are differences, other than the treatment, between treatment and control groups which may affect the outcome.

Imagine that we observe a positive (bivariate) relationship between the following variables:

• National chocolate consumption → Number of Nobel prizes
• Job market training → Securing employment
Question: Can you think of any potentially confounding variables?
18 / 49
19 / 49
Multiple Linear Regression & OVB
In the context of regression, failing to adjust for potentially confounding factors is usually known as omitted variable bias (OVB)

• OVB occurs when two conditions are met:

1. When our X variable (𝑋1) is correlated with another X variable (𝑋2) that has not been included in the analysis
2. When the omitted variable (𝑋2) is also correlated with our Y variable

• We call an omitted variable that leads to such bias a confounding variable

Intuition: The association that we observe between 𝑋1 and 𝑌 might be more meaningfully attributed to 𝑋2
20 / 49
Multiple Linear Regression & OVB
If OVB is present, and we do nothing to address it, then our estimate of 𝛽1 will be biased (wrong)

                     𝑐𝑜𝑟(𝑋1, 𝑋2) > 0    𝑐𝑜𝑟(𝑋1, 𝑋2) < 0
𝑐𝑜𝑟(𝑋2, 𝑌) > 0      ̂𝛽1 too big          ̂𝛽1 too small
𝑐𝑜𝑟(𝑋2, 𝑌) < 0      ̂𝛽1 too small        ̂𝛽1 too big
• 𝑐𝑜𝑟(𝑋1, 𝑋2) is the correlation between 𝑋1 and 𝑋2
• 𝑐𝑜𝑟(𝑋2, 𝑌 ) is the correlation between 𝑋2 and 𝑌
Implication: Depending on the relationship that 𝑋2 has with 𝑋1 and 𝑌, ̂𝛽1 could be either too big or too small!
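The direction of the bias can be checked with a small simulation (hypothetical data; here 𝑐𝑜𝑟(𝑋1, 𝑋2) > 0 and 𝑐𝑜𝑟(𝑋2, 𝑌) > 0, so the top-left cell of the table applies and the omitted-variable estimate should come out too big):

```r
# Simulated illustration of OVB: x2 is positively correlated with both
# x1 and y, so omitting x2 biases the estimate of beta_1 upwards.
set.seed(1)
n <- 10000
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n)             # cor(x1, x2) > 0
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true beta_1 = 2; cor(x2, y) > 0

coef(lm(y ~ x1))["x1"]       # omitting x2: estimate well above 2
coef(lm(y ~ x1 + x2))["x1"]  # controlling for x2: estimate close to 2
```

Flipping the sign of either correlation in the simulation flips the direction of the bias, matching the other cells of the table.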
21 / 49
Controlling for confounders of health insurance
Omitted variable bias is a problem when
1. The independent variable is correlated with other factors
and
2. The dependent variable is also correlated with those factors
             health    age  female  non_white  years_educ  income
Uninsured       3.6   41.0    46.4       17.8        11.6    42.9
Insured         4.0   43.3    49.2       15.3        14.3   101.3
Difference      0.3    2.3     2.8       -2.4         2.7    58.4
                          health
               (1)      (2)      (3)     (4)     (5)
age          −0.02
female                −0.001
non_white                      −0.13
years_educ                              0.08
income                                         0.005
Constant      4.67     3.89     3.91    2.75    3.46

Observations 19,996   19,996   19,996  19,996  19,996
R2            0.03    0.0000   0.003    0.07    0.07
Both conditions are met in this example.

Question: What can we do about it?
22 / 49
“Controlling for” confounders
One strategy for dealing with confounding is to compare average outcomes between treatment and control units while controlling for potential confounders.

Example (subclassification): calculate differences between treatment and control within levels of a confounding variable:

## Mean health level insured v uninsured (all)
mean_health_insured <- mean(nhis$health[nhis$insured == TRUE])
mean_health_uninsured <- mean(nhis$health[nhis$insured == FALSE])
mean_health_insured - mean_health_uninsured
## [1] 0.3262003
## Mean health for insured v uninsured (low income)
insured_low_mean <- mean(nhis$health[nhis$insured == T &
                                       nhis$income_cat == "Low"])
uninsured_low_mean <- mean(nhis$health[nhis$insured == F &
                                         nhis$income_cat == "Low"])
insured_low_mean - uninsured_low_mean
## [1] 0.1010293
## Mean health for insured v uninsured (high income)
insured_high_mean <- mean(nhis$health[nhis$insured == T &
                                        nhis$income_cat == "High"])
uninsured_high_mean <- mean(nhis$health[nhis$insured == F &
                                          nhis$income_cat == "High"])
insured_high_mean - uninsured_high_mean
## [1] 0.1542666
Consequence: The effects of insurance on health are much smaller when we control for income.
23 / 49
“Controlling for” confounders with regression
Regression also allows us to control for other variables, but is more flexible than subclassification:

• If we believe that the association between 𝑋1 and 𝑌 is confounded by 𝑋2, and we are able to measure 𝑋2, we can control for 𝑋2
• We can hold the value of 𝑋2 constant while estimating the association between 𝑋1 and 𝑌
• If we believe that the association between 𝑋1 and 𝑌 is confounded by other variables, we can also control for those variables!

Intuition: The idea is the same as for subclassification, but regression allows us to calculate differences between treatment and control while holding multiple variables constant.
24 / 49
Controlling for confounders of health insurance
# Naive model
nhis_model <- lm(health ~ insured, data = nhis)

# Model controlling for income
nhis_model_with_income <- lm(health ~ insured + income, data = nhis)

# Model controlling for many variables
nhis_model_with_covariates <- lm(health ~ insured + income +
                                   age + female +
                                   non_white + years_educ,
                                 data = nhis)
25 / 49
Controlling for confounders of health insurance
                      health
               (1)      (2)      (3)
insured       0.33     0.06     0.02
income                0.004    0.004
age                            −0.02
female                         −0.05
non_white                      −0.14
years_educ                      0.05
Constant      3.62     3.43     3.81

Observations 19,996   19,996   19,996
R2            0.02     0.07     0.13
1. The ̂𝛽 coefficient on insured decreases when controlling for other variables
2. ̂𝛽insured in model 3 is much closer to the experimental estimate

Question: Does ̂𝛽insured represent the causal effect of insurance on health?

Answer: Only if we are willing to assume that we are controlling for all confounders/omitted variables.

Example: One potential missing control variable here is baseline health levels. People who were less healthy in the past may be less healthy now and more likely to take out insurance!
26 / 49
Controlling for confounders of health insurance (experimental data)
Note that the same does not apply with experimental data!
                      health
               (1)      (2)      (3)
insured      −0.02    −0.01    −0.02
income                0.004    0.003
age                            −0.01
female                         −0.03
non_white                      −0.13
years_educ                      0.04
Constant      3.41     3.29     3.24

Observations  2,702    2,702    2,702
R2           0.0001     0.01     0.09
• The coefficient for insured is essentially the same in model 1 and model 2. Why?
• OVB is present when omitted variables are correlated with both independent and dependent variables.
• Insurance status is randomly assigned, so cannot be correlated with other factors.
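This stability can be seen in a quick simulation (hypothetical data, not the RAND survey; variable names are ours): when treatment is randomized, adding a covariate to the model barely moves the treatment coefficient.

```r
# Simulated illustration: a randomly assigned treatment is uncorrelated
# (in expectation) with background covariates, so controlling for them
# leaves the treatment coefficient essentially unchanged.
set.seed(1)
n <- 5000
income  <- rnorm(n, mean = 50, sd = 10)               # pre-existing covariate
insured <- sample(c(TRUE, FALSE), n, replace = TRUE)  # randomized treatment
health  <- 3 + 0.2 * insured + 0.01 * income + rnorm(n)

coef(lm(health ~ insured))["insuredTRUE"]           # roughly 0.2
coef(lm(health ~ insured + income))["insuredTRUE"]  # nearly identical
```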
27 / 49
Controlling for confounders with regression
Confounders, control, and causal inference
In order to claim that estimates based on observational data represent causal differences, you have to assume that you have controlled for all possible confounding variables.
This is difficult because you may not:
• know what all the confounders are
• be able to measure some confounders
• be able to observe some confounders
28 / 49
Task
Imagine that you would like to estimate the causal effect of school class sizes (X) on student test scores in maths (Y). You cannot run an experiment, but you can collect observational data from all the schools in a city like London.

You decide to use regression to analyse your data. Speak to your neighbour about which variables you think you should control for in your regression.
We will share our lists after the break.
29 / 49
Break
30 / 49
Regression discontinuity designs
Do smaller classes cause better grades?
Example 1: The STAR Experiment
The STAR project was a randomized experiment designed to test the causal effects of class sizes on learning. Students in Tennessee schools were randomly assigned either to regular sized classes (22-25 students, the control group) or to smaller classes (15-17 students, the treatment group). We can use the results of this experiment to see the effect of small class sizes on student learning.
• Y (Dependent variable): grade
• Student grade on a standardised math test (0 to 100)
• X (Independent variable): small_class
• TRUE = student in small class, FALSE = student in regular sized class
31 / 49
Experimental results
# Treated mean grade
mean_grade_treated <- mean(star$grade[star$small_class == TRUE])
mean_grade_treated
## [1] 48.04885
# Control mean grade
mean_grade_control <- mean(star$grade[star$small_class == FALSE])
mean_grade_control
## [1] 45.82546
# Difference in means
mean_grade_treated - mean_grade_control
## [1] 2.22339
The results suggest that small classes cause modest improvements in student grades. We will use this as a rough benchmark for the observational analysis.
32 / 49
Do smaller classes cause better grades?
Example 2: Data from Israeli schools
Angrist and Lavy (1999) collect observational data from schools in Israel. Looking at data on 646 schools, they record class size, average student score on a standardised maths test, and the percentage of students in the school from disadvantaged backgrounds. We will ignore differences in context between Israel and Tennessee and try to replicate the experimental finding of the STAR experiment using this observational data.
• Y (Dependent variable): grade
• Average student grade on a standardised math test (0 to 100).
• 𝑋1 (Independent variable): class_size
• Number of students in the class
• 𝑋2 (Independent variable): pct_disadvantaged
• % of students from disadvantaged background
33 / 49
Observational data
[Figure: scatterplot, “Class size and average grade”; x-axis: class size (20-40); y-axis: average grade (20-50)]
34 / 49
Observational results
simple_model <- lm(grade ~ class_size, schools)
              grade
class_size     0.15
Constant      33.63

Observations    646
R2             0.02
Question: What is the effect of aone-student increase in class size?
• One student more is associated with a 0.15 increase in average grades
• This is the opposite of what the experimental results suggest!

What are some potentially confounding factors here?
35 / 49
Observational results
simple_model <- lm(grade ~ class_size, schools)
multiple_model <- lm(grade ~ class_size + pct_disadvantaged, schools)

                         grade
                     (1)      (2)
class_size          0.15     0.01
pct_disadvantaged           −0.19
Constant           33.63    41.02

Observations         646      640
R2                  0.02     0.31
• The effect is much weaker, very close to zero, controlling for pct_disadvantaged
• This is still not what the experimental results suggest
• Have we controlled for all confounding factors?
36 / 49
Observational results
Can we get closer to the experimental estimate by controlling for other features of schools/pupils?

No! We don’t have any other variables to use as controls.

Central problem with using regression “control” strategies for causal inference: if you can’t control for relevant factors, you cannot eliminate confounding.
37 / 49
An alternative approach
Problems:
1. Observational estimates are different from experimental estimates
2. The observational estimate is probably subject to confounding
3. We cannot control for lots of relevant confounding variables
Solutions:
1. Use over-time variation in our outcome variable to hold unobserved factors constant (next week)
2. Look for something that approximates random variation in our explanatory variable (this week)
38 / 49
Regression discontinuity designs
In a regression discontinuity design (RDD) we look for a subset of units where we believe that our explanatory variable occurs as if at random.

If we believe that the treatment is randomly assigned for some units, we can compare the outcomes for those units to estimate the causal effect!

In our example, for which students might class size be assigned “as if at random”?
39 / 49
Some detail about class sizes in Israel
• Class size in Israeli schools is capped at 40 students
• As soon as a class gets to 41 students, it is split into two classes (of 20 and 21)
• → In a year group with 40 students, each will be in a class of 40
• → In a year group with 41 students, each will be in a class of 20/21
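The splitting rule above (a simplified sketch of the “Maimonides’ rule” that Angrist and Lavy exploit; the function name is ours) can be written down directly: a cohort is divided into the smallest number of classes that keeps each class at or below 40 students.

```r
# Sketch of the class-size rule (simplified): predicted average class size
# when a cohort is split into the fewest classes of at most 40 students.
predicted_class_size <- function(cohort_size) {
  n_classes <- ceiling(cohort_size / 40) # one more class each time 40 is passed
  cohort_size / n_classes                # average class size after the split
}

predicted_class_size(40) # 40   (one full class)
predicted_class_size(41) # 20.5 (split into classes of 20 and 21)
predicted_class_size(81) # 27   (three classes)
```

The sharp drop from 40 to 20.5 as the cohort grows by one student is exactly the discontinuity the design relies on.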
40 / 49
Some detail about class sizes in Israel
[Figure: scatterplot, “Year group size and class size”; x-axis: year group size (20-80); y-axis: class size (20-40)]

• Observation: A small change in year group size leads to a large change in class size
• Assumption 1: In the area around 40, class size is “as good as” randomly assigned
• Assumption 2: Nothing other than class size changes in the area around 40 students
Intuition: Schools with 40 students will be very like those with 41. We can compare average grades of schools just above/below 40 to calculate the causal effect.
41 / 49
RDD visualization
[Figure: scatterplot, “Year group size and average grade”; x-axis: year group size (20-80); y-axis: average grade (25-50); a red vertical line marks the 40-student discontinuity]

• Note that the x-axis is cohort size, not class size
• Units just below the red line have large classes
• Units just above the red line have small classes
• In an RDD, we compare the outcomes for these two groups
42 / 49
Regression discontinuity designs
The key intuition of the RDD is to compare units just above and just below a “discontinuity”. We can use regression to calculate the average value of the outcome for these units.
Procedure:
1. Fit a regression for the observations on one side of the discontinuity
2. Fit a regression for the observations on the other side of the discontinuity
3. Calculate the fitted values for both models at the point of the discontinuity
4. The difference in fitted values is the causal effect!
43 / 49
RDD visualization
[Figure: scatterplot, “Year group size and average grade”; x-axis: year group size (20-80); y-axis: average grade (25-50); separate regression lines are fitted on each side of the 40-student discontinuity, and the gap between the fitted lines at 40 is labelled “Causal effect”]
44 / 49
RDD code
1. Fit a regression for one side of the discontinuity

below_model <- lm(grade ~ cohort_size, schools[schools$cohort_size <= 40, ])

2. Fit a regression for the other side of the discontinuity

above_model <- lm(grade ~ cohort_size, schools[schools$cohort_size > 40, ])

3. Calculate the fitted values for both models at the discontinuity

fitted_a <- predict(below_model, newdata = data.frame(cohort_size = 40))
fitted_b <- predict(above_model, newdata = data.frame(cohort_size = 40))

4. The difference in fitted values is the causal effect!

fitted_b - fitted_a
## 1## 1.874213
Note that the estimate is positive, implying that smaller classes improve test scores.
45 / 49
Summing up RDDs
• Main advantage of RDD: high internal validity
• So long as nothing else happens at the discontinuity, the RDD estimate gives us a causal effect
• We can estimate causal effects even without having all the possible confounding covariates

• Main disadvantage of RDD: low external validity

• This design really only tells us the effect of treatment at the point of the discontinuity, which may not be what we care about!
• e.g. the effect of smaller classes for school years above and below 40 students

• Good research designs require a lot of substantive knowledge!

• We needed to know specifics about Israeli education for our RDD
• Pay attention in your substantive classes!
46 / 49
Conclusion
Regression and causality
When can we interpret a regression coefficient causally?
1. Randomized experiments

• The coefficient on a binary treatment is an estimate of the average treatment effect

2. Observational studies

• We can only interpret coefficients causally when we have controlled for all confounders as additional X variables
• Confounders: variables that cause both the treatment and the outcome

3. Regression discontinuity designs

• With a suitable treatment discontinuity, you may be able to use regression for causal inference even without controlling for all confounders
47 / 49
Next week
1. Regression for difference-in-differences
2. Panel data
3. More on regression specification
48 / 49
Seminar
In seminars this week, you will learn to …
1. …analyse experimental data using regression.
2. …argue about whether regression coefficients are good estimates of causal effects.
3. …implement regression discontinuity designs.
49 / 49