TRANSCRIPT
EPSE 581C: Causal Inference for Applied Researchers
Ed Kroc
University of British Columbia
May 27, 2019
Ed Kroc (UBC) Causal Inference May 27, 2019 1 / 50
Last time
More model misspecification and (some of) its effects
Consistency and unbiasedness of estimators
Today
Even more model misspecification and (some of) its effects
The Neyman-Rubin causal model
Introduction to matching
Model misspecification: Ex. 1
Misspecified model on LEFT; properly specified model on RIGHT:
[Figure: scatterplots of y vs. x with fitted models; misspecified (left), properly specified (right)]
Model misspecification: Ex. 1
Misspecified model on LEFT; properly specified model on RIGHT:
[Figure: residuals vs. fitted values for the misspecified model mod.w (left) and the properly specified model mod.r (right)]
Clear evidence of model misspecification in the residuals vs. fitted plot!
Model misspecification: Ex. 1
Misspecified model on LEFT; properly specified model on RIGHT:
[Figure: scatterplots of y vs. x with fitted models; misspecified (left), properly specified (right)]

ACE-hat_wrong(X = 0.5) = β̂_T = 0.515
ACE-hat_right(X = 0.5) = β̂_T + 0.5·β̂_{TX} = 0.008 + 0.5·0.999 = 0.508

Not too bad..., but what if the misspecification was worse?
Model misspecification: Ex. 2
Misspecified model on LEFT; properly specified model on RIGHT:
[Figure: scatterplots of y vs. x with fitted models; misspecified (left), properly specified (right)]
Model misspecification: Ex. 2
Misspecified model on LEFT; properly specified model on RIGHT:
[Figure: residuals vs. fitted values for the misspecified model mod.w2 (left) and the properly specified model mod.r (right)]
Clear evidence of model misspecification in the residuals vs. fitted plot!
Model misspecification: Ex. 2
Misspecified model on LEFT; properly specified model on RIGHT:
[Figure: scatterplots of y vs. x with fitted models; misspecified (left), properly specified (right)]

ACE-hat_wrong(X = 0.5) = β̂_T = −0.152
ACE-hat_right(X = 0.5) = β̂_T + 0.5·β̂_{TX} + (0.5)²·β̂_{TX²} = −0.527

The misspecified model's ACE estimate is about 3.5 times too small in magnitude.
Model misspecification: ignore fit statistics
Notice: fit statistics are useless here.
That is, misspecified models can still “fit” the data very well.
Good enough for explanatory modelling.
Not good enough for causal modelling!
Ignore all fit statistics when performing causal modelling, including:
Goodness-of-fit F -tests
R2 statistics
Information criterion statistics (AIC, BIC, DIC, etc.)
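A minimal simulation sketch of this point, using a hypothetical data-generating process (y = x + x² + noise, not an example from the lecture): a badly misspecified straight-line model still attains an excellent R² while its slope estimate is far from the true first-order coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 1, n)
# True data-generating process has curvature: y = x + x^2 + noise
y = x + x**2 + rng.normal(0, 0.05, n)

# Misspecified model: straight line y = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
r2 = 1 - resid.var() / y.var()

print(round(r2, 3))  # R^2 is excellent despite the wrong functional form
print(round(b1, 2))  # slope is badly biased: the true first-order coefficient is 1
```

The fit statistic looks great precisely because the line tracks the curved trend closely over this range; it says nothing about whether the coefficients have a causal interpretation.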
Model misspecification: ignore statistical significance
Notice: statistical significance of model coefficient estimates is irrelevant here.
Recall numerical Ex. 1:
All estimates significant for the misspecified model
In the properly specified model, the intercept (β̂_0) and marginal treatment (β̂_T) estimates are not statistically significant.
Recall numerical Ex. 2:
All estimates significant for the misspecified model
In the properly specified model, the intercept (β̂_0) and marginal first-order treatment (β̂_T) estimates are not statistically significant.
Model misspecification: bigger sample size will never fix the problem
It is common “wisdom” that the more data you have, the better you will be able to quantify your effects of interest.
This is true for explanatory/descriptive and predictive modelling, but false for causal modelling.
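To illustrate, here is a small sketch under an assumed data-generating process (y = x + x² + noise, chosen for illustration): as n grows, the misspecified straight-line slope converges, but to the wrong value, so no sample size rescues the causal interpretation.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitted_slope(n):
    # Fit the misspecified straight line to data from y = x + x^2 + noise
    x = rng.uniform(0, 1, n)
    y = x + x**2 + rng.normal(0, 0.05, n)
    slope, _ = np.polyfit(x, y, 1)
    return slope

# The slope estimate stabilizes as n grows -- but around the wrong value
# (about 2 for x uniform on [0, 1], not the true first-order coefficient 1)
for n in (100, 10_000, 1_000_000):
    print(n, round(fitted_slope(n), 2))
```

This is consistency toward the wrong estimand: more data buys precision, not accuracy, once the functional form is wrong.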
Unbiasedness of estimators
Generally, an estimator θ̂ for some population parameter θ of a random variable of interest X is called unbiased if:

E(θ̂) = θ

In words, an estimator is unbiased for its estimand (what it is trying to estimate) if, on average, the estimator equals the estimand.

Example: In a random sample, the sample mean, θ̂ = (1/n) Σ_{i=1}^n X_i, is an unbiased estimator of the population mean, θ = E(X).
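A quick pure-Python check of the sample-mean example (the Normal(3, 1) population and sample size 10 are illustrative assumptions, not from the lecture): averaging the sample mean over many repeated samples recovers the population mean.

```python
import random

random.seed(0)
mu = 3.0  # population mean of a hypothetical X ~ Normal(3, 1)

# Average the size-10 sample mean over many repeated random samples
reps = 20_000
avg_of_means = sum(
    sum(random.gauss(mu, 1) for _ in range(10)) / 10
    for _ in range(reps)
) / reps
print(round(avg_of_means, 2))  # close to mu = 3.0
```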
Consistency of estimators
Generally, an estimator θ̂ is called consistent if, as the sample size increases without bound, the sample value of θ̂ approaches a single number, a:

for all ε > 0,  lim_{n→∞} Pr(|θ̂ − a| > ε | S_n) = 0,

where S_n denotes a random sample of size n.

If an estimator is both unbiased and consistent, then not only does its average value equal the true estimand of interest, but as we increase the sample size, the estimator becomes more and more precise about this true value.

That is, such an estimator is both accurate and precise as sample size increases.

(ML, OLS) Estimates of regression parameters are always consistent, but are also always biased when the functional form of the regression model is misspecified.
Model misspecification: omitted variables
So far, we have only focused on model misspecification where the functional form of the covariates is misspecified, but our models always contained all explanatory variables.
In practical non-experimental research, we will always be missing some confounders; we can’t measure everything, or even know everything we should be measuring!
Detecting important omitted variables can be very difficult.
Residual plots are still the way to go, but they will not always suggest omitted variable bias.
Hence why the exchangeability of treatment is so important in an RD design: treatment is “as good as” randomly assigned near the threshold; thus, biasing effects of omitted variables should be negligible (near the threshold).
Model misspecification with omitted variables
True data-generating process, where Corr(X, W) = 0.42:

Y = 10 + X + X² + X·W + δ

Misspecified model 1: Y = β_0 + β_X·X + ε

Model fit:
Model misspecification with omitted variables
True data-generating process, where Corr(X, W) = 0.42:

Y = 10 + X + X² + X·W + δ

Misspecified model 1: Y = β_0 + β_X·X + ε

[Figure: misspecified first-order model — scatterplot of y vs. x with fit (left) and residuals vs. fitted values for mod1 (right)]
Model misspecification with omitted variables
True data-generating process, where Corr(X, W) = 0.42:

Y = 10 + X + X² + X·W + δ

Obvious missing curvature in plots of model 1.

Misspecified model 2:

Y = β_0 + β_X·X + β_{X²}·X² + ε

Model fit:
Model misspecification with omitted variables
True data-generating process, where Corr(X, W) = 0.42:

Y = 10 + X + X² + X·W + δ

Misspecified model 2:

Y = β_0 + β_X·X + β_{X²}·X² + ε

[Figure: misspecified second-order model — scatterplot of y vs. x with fit (left) and residuals vs. fitted values for mod2 (right)]
Model misspecification with omitted variables
True data-generating process, where Corr(X, W) = 0.42:

Y = 10 + X + X² + X·W + δ

Obvious heteroskedasticity in plots of model 2 is actually caused by the omitted variable W.

Misspecified model 3:

Y = β_0 + β_X·X + β_W·W + β_{XW}·X·W + ε

Model fit:
Model misspecification with omitted variables
True data-generating process, where Corr(X, W) = 0.42:

Y = 10 + X + X² + X·W + δ

Misspecified model 3:

Y = β_0 + β_X·X + β_W·W + β_{XW}·X·W + ε

[Figure: scatterplot of covariates w vs. x (left) and residuals vs. fitted values for mod3, the misspecified first-order interaction model (right)]
Model misspecification with omitted variables
True data-generating process, where Corr(X, W) = 0.42:

Y = 10 + X + X² + X·W + δ

Misspecified model 3:

Y = β_0 + β_X·X + β_W·W + β_{XW}·X·W + ε

[Figure: 3D plot of x, w vs. y]
Model misspecification with omitted variables
True data-generating process, where Corr(X, W) = 0.42:

Y = 10 + X + X² + X·W + δ

Still seeing indication of missing curvature, at least near extreme fitted values.

Correctly specified (overspecified) model:

Y = β_0 + β_X·X + β_{X²}·X² + β_W·W + β_{XW}·X·W + ε

Model fit:
Model misspecification with omitted variables
True data-generating process, where Corr(X, W) = 0.42:

Y = 10 + X + X² + X·W + δ

Correctly specified (overspecified) model:

Y = β_0 + β_X·X + β_{X²}·X² + β_W·W + β_{XW}·X·W + ε

[Figure: residuals vs. fitted values for the correctly specified model (mod4)]
Model misspecification
Often impossible to assess model misspecification in the published literature.
Notice how wildly the model estimates varied between the various misspecified models; this is often a good way to tell if model misspecification is of concern.
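A sketch of that instability, using a data-generating process of the same form as the slides' Y = 10 + X + X² + X·W + δ (the covariate ranges and the exact X–W correlation here are my illustrative assumptions, not the lecture's 0.42): the coefficient on X swings wildly across the misspecified models and only settles near the true value 1 under the correct specification.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(0, 6, n)
w = 0.7 * x + rng.normal(0, 1.5, n)          # omitted variable, correlated with x
y = 10 + x + x**2 + x * w + rng.normal(0, 1, n)

def coef_on_x(columns):
    # OLS via least squares; return the coefficient on the x column
    X = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

print(round(coef_on_x([x]), 1))                    # model 1: x only
print(round(coef_on_x([x, x**2]), 1))              # model 2: adds x^2
print(round(coef_on_x([x, w, x * w]), 1))          # model 3: adds w, x*w
print(round(coef_on_x([x, x**2, w, x * w]), 1))    # correct specification
```

Comparing the four printed coefficients side by side is exactly the instability check the slide describes.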
Model misspecification: ugly example
True data-generating process, where the W_i correlate with X:

Y = 10 + X + X² + W_1 + 0.02·X·W_1 + 0.01·W_2 + 0.02·W_3·W_4 + 0.01·W_3·X² + δ

Misspecified model:

Y = β_0 + β_X·X + β_{X²}·X² + β_W·W_1 + β_{XW}·X·W_1 + ε

[Figure: residuals vs. fitted values for the ugly-data model (modu)]
Model misspecification: ugly example
True data-generating process, where the W_i correlate with X:

Y = 10 + X + X² + W_1 + 0.02·X·W_1 + 0.01·W_2 + 0.02·W_3·W_4 + 0.01·W_3·X² + δ

Maybe some evidence of misspecification? The residuals don’t really look that bad, though.

Model fit:

Parameter estimates still quite biased.
Regression discontinuity design
Recall the RD design:
Recall in an RD design, we can estimate the ACE at the fixed threshold.
Recall model misspecification is an issue here, since we fit separate regression models on either side of the threshold in order to construct an estimate of the ACE at the threshold.
The major challenge with RD designs is how to specify the proper model(s).
Very common advice: use nonparametric methods that allow the data to determine the functional form of the model.
However, major problems with this:
Nonparametric methods prone to overfitting.
Overreliance on (possibly confounded) response data away from the threshold.
Regression discontinuity design
Best advice for fitting RD design:
Alternative:
Gelman & Imbens, JBusEconStat (2018): use only first- or second-order parametric models.
Criticism: still an overreliance on (possibly confounded) response data away from the threshold.
My advice:
Recall: have exchangeability of units over treatment near the fixed threshold.
So fit your regression model as locally as possible.
First- or second-order models will usually be sufficient locally (Taylor series).
Rely only (mostly) on unconfounded response data near the threshold.
Problem: need a lot more data near the threshold in order to fit the model.
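The local-fitting advice can be sketched as follows, on a hypothetical RD setup of my own construction (running variable on [−1, 1], threshold at 0, smooth confounded trend sin(2x), and an assumed true effect of 0.5 at the cutoff): fit separate lines within a narrow bandwidth on each side and take the difference of the two intercepts at the threshold.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(-1, 1, n)          # running variable, threshold at 0
tau = 0.5                          # assumed true effect at the cutoff
y = np.sin(2 * x) + tau * (x >= 0) + rng.normal(0, 0.1, n)

def rd_estimate(bandwidth):
    # Fit separate first-order models just left and right of the threshold
    left = (x < 0) & (x > -bandwidth)
    right = (x >= 0) & (x < bandwidth)
    b_left = np.polyfit(x[left], y[left], 1)
    b_right = np.polyfit(x[right], y[right], 1)
    # Difference of the two fits evaluated at the threshold
    return np.polyval(b_right, 0) - np.polyval(b_left, 0)

# A narrow bandwidth keeps the local linear (Taylor) approximation honest
print(round(rd_estimate(0.2), 2))
```

The trade-off in the last bullet shows up directly: shrinking the bandwidth reduces functional-form bias but leaves fewer points with which to fit each line.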
The Neyman-Rubin causal model
We have relied on mathematizing the problem of causal inference using the language of counterfactuals (potential outcomes).
Such a formulation was introduced by Neyman in 1923.
Subsequently, Rubin, Holland, and others extended Neyman’s ideas beyond the framework of randomized, controlled experiments.
Today, the Neyman-Rubin causal model is the industry standard in most areas of social and health science (although Pearl’s framework is gaining more popularity).
The Neyman-Rubin causal model
In the counterfactual formulation of causality, we imagine that every individual i has a pair of hypothetical “potential outcomes”:
Y_i(1) is the outcome for sample unit i if they were given the treatment (T = 1),
Y_i(0) is the outcome for sample unit i if they are not given the treatment (T = 0).
Fundamental problem of (counterfactual) causal inference: for any individual i, we can never simultaneously observe Y_i(0) and Y_i(1).
In particular, we can never know the individual causal effect of treatment:

ICE_i(T) := Y_i(1) − Y_i(0).
The Neyman-Rubin causal model
Since it is never possible to observe the ICE, we instead focus on the average causal effect:

ACE(T) := E(Y_i(1) − Y_i(0)) = E(Y_i(1)) − E(Y_i(0)).

If assignment to treatment depends on some covariate(s) X, we may write

ACE(T | X) := E(Y_i(1) − Y_i(0) | X).

It is then possible to estimate the ACE (unconditional or conditional) if all sample units are exchangeable over treatment; i.e. if sample units only differ by what treatment they receive, on average.
Random assignment to treatment will ensure such exchangeability.
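A simulation sketch of why randomization works (the potential-outcome distributions and the effect size of −1.17 are hypothetical choices, echoing the numbers used later in the slides): each unit reveals only one potential outcome, yet the difference in observed group means recovers the ACE.

```python
import random

random.seed(2)
n = 100_000

# Hypothetical potential outcomes: treatment lowers Y by 1.17 on average
y0 = [random.gauss(0, 1) for _ in range(n)]
y1 = [a - 1.17 + random.gauss(0, 0.5) for a in y0]
true_ace = sum(b - a for a, b in zip(y0, y1)) / n

# Random assignment: each unit reveals only one of its two potential outcomes
t = [random.random() < 0.5 for _ in range(n)]
treated = [y1[i] for i in range(n) if t[i]]
control = [y0[i] for i in range(n) if not t[i]]
ace_hat = sum(treated) / len(treated) - sum(control) / len(control)

print(round(true_ace, 2), round(ace_hat, 2))  # both near -1.17
```

Randomization makes the treated and control groups exchangeable, so the two one-sided averages stand in for the unobservable counterfactual means.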
The stable unit treatment value assumption (SUTVA)
One key assumption of the NRCM is called the stable unit treatment value assumption (SUTVA):
The counterfactual (potential outcome) of one sample unit should be unaffected by the particular assignment of treatments to the other sample units.
More simply, whatever treatment one sample unit receives should not affect the outcome of whatever treatment another sample unit receives.
Such an assumption should hold in a tightly controlled experiment, e.g. where sample units are not allowed to interact with each other.
But the SUTVA assumption may fail if sample units are not isolated.
The stable unit treatment value assumption (SUTVA)
Example where the SUTVA assumption fails:
We randomly assign treatment or placebo (double-blinded) to 100 high blood pressure patients.
However, suppose two of these patients (Boris and Doris) live in the same household.
Unbeknownst to anyone, Boris receives the treatment, while Doris receives the placebo.
Suppose an unknown side effect of the drug is that it tends to make patients very tired.
Boris becomes very tired on the treatment, causing Doris to have to do more for both of them (e.g. cook, clean, shop, etc.)
The extra stress causes Doris’ blood pressure to rise.
Thus, our estimate of the ACE of treatment may be inflated: the effect of placebo on Doris (moderated by the effect of the drug on Boris) caused Doris’ blood pressure to go up.
The stable unit treatment value assumption (SUTVA)
Three possible fixes to the Boris and Doris problem:
(1) Ensure that all sample units are isolated by design (e.g. do not enrolpatients who know each other).
(2) Decompose causal effect of treatment into two parts:YBorisp1 | TDoris “ 1q vs. YBorisp0 | TDoris “ 1q andYBorisp1 | TDoris “ 0q vs. YBorisp0 | TDoris “ 0q.
(3) Control for confounding variable: Yi p1 | i „ jq vs. Yi p0 | i „ jq.
(1) is by far the best choice.
(2) and (3) resolve the issue by complicating the analysis; they willalso require considerably more data since we have more effects(dependencies) to disentangle.
Ed Kroc (UBC) Causal Inference May 27, 2019 37 / 50
The fundamental problem of causal inference
Rubin (and Holland) call the fact that we can never simultaneously observe Y_i(0) and Y_i(1) for any individual i the fundamental problem of causal inference.
Rubin goes one step further and characterizes this as a problem of missing data.

Subject     Y_i(0)   Y_i(1)   ICE = Y_i(1) − Y_i(0)
Anya           ?       -2       ?
Boris          ?       -5       ?
Doris          5        ?       ?
Natasha       -1        ?       ?
Pyotr          ?        1       ?
Vladimir       1        ?       ?
The fundamental problem of causal inference
These data allow us to estimate the following conditional expectations:

E(Y_i | T = 1),  E(Y_i | T = 0).

For our data below, we have

Ê(Y_i | T = 0) = 1.67,  Ê(Y_i | T = 1) = −2.

Subject     Y_i(0)   Y_i(1)   ICE = Y_i(1) − Y_i(0)
Anya           ?       -2       ?
Boris          ?       -5       ?
Doris          5        ?       ?
Natasha       -1        ?       ?
Pyotr          ?        1       ?
Vladimir       1        ?       ?
The fundamental problem of causal inference
Recall that we really want the ACE:

ACE = E(Y_i(1)) − E(Y_i(0))

If treatment is randomly assigned to patients, and patients stay on their assigned treatment, and the SUTVA holds, then:

ACE-hat = Ê(Y_i | T = 1) − Ê(Y_i | T = 0) = −2 − 1.67 = −3.67

Subject     Y_i(0)   Y_i(1)   ICE = Y_i(1) − Y_i(0)
Anya           ?       -2       ?
Boris          ?       -5       ?
Doris          5        ?       ?
Natasha       -1        ?       ?
Pyotr          ?        1       ?
Vladimir       1        ?       ?
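The arithmetic above can be checked directly from the observed cells of the table (the `?` entries are unobservable counterfactuals and play no role):

```python
# Observed outcomes from the table; t = 1 treated, t = 0 control
observed = {
    "Anya":     {"t": 1, "y": -2},
    "Boris":    {"t": 1, "y": -5},
    "Doris":    {"t": 0, "y": 5},
    "Natasha":  {"t": 0, "y": -1},
    "Pyotr":    {"t": 1, "y": 1},
    "Vladimir": {"t": 0, "y": 1},
}

treated = [v["y"] for v in observed.values() if v["t"] == 1]
control = [v["y"] for v in observed.values() if v["t"] == 0]
ace_hat = sum(treated) / len(treated) - sum(control) / len(control)
print(round(ace_hat, 2))  # -3.67
```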
The assignment to treatment mechanism
We have already argued that randomization of treatment allows for an unconfounded estimate of the ACE (assuming patients stay on their assigned treatment; we will later see a technique that relaxes this assumption: instrumental variables).
We have also seen that a deterministic assignment of treatment can allow for an unconfounded estimate of the ACE locally (RD design).
In practice, true randomization can be very difficult to achieve: blinding of patients and researchers/doctors helps.
The assignment to treatment mechanism
Many ways the assignment to treatment mechanism can become confounded: e.g. side effects of treatment affect women more than men, older people susceptible to more side effects, busier people less likely to follow our directions, etc.
Note: self-selection or preferential sampling in a study are not examples of confounded assignment to treatment mechanisms. They certainly compromise the generalizability of our inferences, but they do not directly affect the quality of the causal inferences we can make on the sample.
Oftentimes, the assignment to treatment mechanism will be confounded due to ethical considerations.
The assignment to treatment mechanism
Classic example (Rubin): “The perfect doctor”.
The perfect doctor knows (a priori) what the best treatment is for every individual patient.
Thus, the perfect doctor assigns the best treatment to each individual patient. The perfect doctor has complete counterfactual information:

Subject     Y_i(0)   Y_i(1)   ICE = Y_i(1) − Y_i(0)
Anya          -1       -2      -1
Boris         -2       -5      -3
Doris          5        2      -3
Natasha       -1        0       1
Pyotr          0        1       1
Vladimir       1       -1      -2
Average      0.33    -0.83    ACE_true = −1.17
The assignment to treatment mechanism
Classic example (Rubin): “The perfect doctor”.
Based on this total information, the perfect doctor makes the following assignments to treatment:

Subject     Y_i(0)   Y_i(1)   ICE = Y_i(1) − Y_i(0)
Anya           ?       -2       ?
Boris          ?       -5       ?
Doris          ?        2       ?
Natasha       -1        ?       ?
Pyotr          0        ?       ?
Vladimir       ?       -1       ?
Average      -0.5     -1.5    ACE-hat = −1

Notice: the estimate of the ACE is now distorted (biased). The perfect doctor is great for individual patients, but bad for science/inference.
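The perfect-doctor bias can be reproduced from the full counterfactual table: the doctor gives each patient whichever treatment yields the lower (better) outcome, and the resulting difference in observed group means no longer equals the true ACE.

```python
# Full counterfactual table, known only to the "perfect doctor": name: (Y(0), Y(1))
po = {
    "Anya": (-1, -2), "Boris": (-2, -5), "Doris": (5, 2),
    "Natasha": (-1, 0), "Pyotr": (0, 1), "Vladimir": (1, -1),
}

true_ace = sum(y1 - y0 for y0, y1 in po.values()) / len(po)

# The doctor assigns each patient whichever treatment gives the lower outcome
treated = [y1 for y0, y1 in po.values() if y1 < y0]
control = [y0 for y0, y1 in po.values() if y0 <= y1]
ace_hat = sum(treated) / len(treated) - sum(control) / len(control)

print(round(true_ace, 2), round(ace_hat, 2))  # true -1.17 vs. estimated -1.0
```

The bias arises because assignment now depends on the potential outcomes themselves, destroying exchangeability.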
Matching
In observational or quasi-experimental research designs, we usually do not have the ability to assign treatment at all; we simply observe outcomes based on assignment to treatment mechanisms that we cannot control/manipulate (e.g. self-selection into treatment, or comorbidities (covariates) increasing the likelihood of assignment to treatment).
The most widely used way to attempt to correct for this confounding of the assignment to treatment mechanism is by some kind of matching of sample units.
Generally speaking, the idea is to match up sample units from one treatment group to another based on how similar they are over all measured covariates.
Matched sample units of different treatments can then mimic counterfactuals, assuming no omitted confounders.
Matching: example
Revisit the blood pressure example, but this time assignment to treatment is not randomized: the first three patients to enrol receive the drug, the last three receive the placebo.

Subject     Sex   Baseline blood pressure   Y_i(0)   Y_i(1)   ICE = Y_i(1) − Y_i(0)
Anya         F          150                    ?       -2       ?
Boris        M          170                    ?       -5       ?
Doris        F          180                    ?        2       ?
Natasha      F          150                   -1        ?       ?
Pyotr        M          170                    0        ?       ?
Vladimir     M          180                    1        ?       ?
Average                166.7                   0      -1.67   ACE-hat = −1.67
Matching: example
Subject     Sex   Baseline blood pressure   Y_i(0)   Y_i(1)   ICE = Y_i(1) − Y_i(0)
Anya         F          150                    ?       -2       ?
Boris        M          170                    ?       -5       ?
Doris        F          180                    ?        2       ?
Natasha      F          150                   -1        ?       ?
Pyotr        M          170                    0        ?       ?
Vladimir     M          180                    1        ?       ?
Average                166.7                   0      -1.67   ACE-hat = −1.67

Can match (drug, placebo) over baseline blood pressure, but not over sex. Are females more likely to enrol before males? Other possible (omitted) confounders?
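The matching step for this table can be sketched as follows: pair each treated patient with the control patient whose baseline blood pressure is closest (here every match is exact), then average the matched outcome differences as mimicked ICEs.

```python
# Treated and control units from the table: name: (baseline BP, observed Y)
treated = {"Anya": (150, -2), "Boris": (170, -5), "Doris": (180, 2)}
control = {"Natasha": (150, -1), "Pyotr": (170, 0), "Vladimir": (180, 1)}

diffs = []
for bp_t, y_t in treated.values():
    # Nearest-neighbour match on baseline blood pressure (exact here)
    bp_c, y_c = min(control.values(), key=lambda c: abs(c[0] - bp_t))
    diffs.append(y_t - y_c)

ace_hat = sum(diffs) / len(diffs)
print(round(ace_hat, 2))  # -1.67
```

Note the matched estimate still ignores sex, which cannot be balanced in this sample; that is exactly the residual confounding the slide warns about.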
Matching: problems
Three fundamental problems of matching:
(1) Can never measure all possible confounders, so always imperfect corrections.
(2) Matching on many covariates simultaneously requires a lot of data (curse of dimensionality).
(3) Matching on multiple covariates requires that we observe enough sample units in each possible subcategory/strata so that we can find a “match”.
Matching: problems
Three fundamental problems of matching:
(1) Can never measure all possible confounders, so always imperfect corrections.
Issue (1) can never be addressed by matching.
(2) Matching on many covariates simultaneously requires a lot of data (curse of dimensionality).
Propensity score matching addresses problem (2), but relies on a regression framework that is susceptible to all the usual issues with model misspecification.
(3) Matching on multiple covariates requires that we observe enough sample units in each possible subcategory/strata so that we can find a “match”.
Fix that’s not a fix for (3): restrict inferences to subgroups where enough data exist.
Next time
Propensity scores and propensity score matching