Empirical Methods for Public Policy Analysis
Lecture notes
Dan Anderberg
Royal Holloway University of London
January 2007
1 Introduction to Treatment Evaluation
In this first part of the course we will go through some econometric techniques that
have become increasingly popular in public economics. In particular, I will focus on
what has become known as treatment evaluation. Treatment evaluation is concerned
with measuring the impact of interventions on outcomes of interest. The approach and
the terminology originate from medical research, where an intervention frequently means
exposing someone to some form of treatment.
In public economics the techniques can be used to study the effect of, e.g., welfare and
social insurance programmes on various aspects of behaviour, including labour supply,
unemployment duration, family structure, etc. We will encounter these methods frequently
during the course, which is why I want to start by spending some time providing
an overview of them.
• "Treatment" can mean just about anything (being exposed to a more generous
welfare system, getting training, smoking, having good neighbours, etc.)
1.1 Examples
Hormone Replacement Therapy
Consider first an example from medicine. The onset of menopause means that the
body produces less of hormones such as estrogen, and menopause is perceived to increase
the risk of osteoporosis, Alzheimer's disease and heart disease. Given this, a possible
treatment is "hormone replacement therapy" (HRT). Indeed, in 2001, some 20 percent of
US women in menopause were receiving HRT, at an expenditure of US$2.75 billion. The
question, however, is whether HRT was effective.
Evaluating the effectiveness of HRT is complicated by the fact that women who choose
to go on HRT after the onset of menopause differ from women who choose not to:
they have higher levels of HDL (good cholesterol), lower blood pressure, engage more in
physical activity, have lower weight, are more educated, etc. Controlling for all such
differences is not easily done.
To settle the issue a randomized trial was organized: the Women's Health Initiative
trial, in which 27,000 women aged 50-79 were followed for 9 years. HRT was randomly
allocated among the women. However, the trial was discontinued when
evidence mounted that HRT increases the risk of heart disease and stroke! Hence one
conclusion was that the initially perceived beneficial effect of HRT on reducing
the risk of heart disease and stroke must have been driven by selection effects.
The Tennessee STAR Experiment
The Tennessee STAR (Student/Teacher Achievement Ratio) experiment sought to determine
the effect of class size on educational outcomes. 79 schools were randomly selected
for treatment. The treatment involved forming classes, in terms of student numbers
and teacher-student ratio, for students in grades K-3 according to one of three designs:
small (13-17 students), regular (22-25), or regular with a teacher aide, with both students and teachers
randomly allocated to class types. The outcomes of interest were (i) standardized tests in
grades K-8 (short-run outcomes), and (ii) participation and scores in the ACT/SAT college
admission tests in the final year of high school. See Krueger and Whitmore (2001) for an
analysis of the Tennessee STAR experiment.
1.2 Treatment Effect
The policy relevance of treatment evaluation should be immediate: it can help us identify
potential improvements in policy. We will focus on the problem of estimating an
average treatment effect (ATE). Formally, the ATE is the average partial effect of a
binary explanatory variable. What does this mean? Partial effect means the effect of
the treatment holding other factors constant (as in a standard multiple regression). By
binary we mean that an individual either (i) gets the treatment, or (ii) does not get the
treatment; we are not, for example, studying treatments that vary in intensity. Thus we can
think of treatment as a dummy variable

    w_i = 1 if individual i receives "treatment"
    w_i = 0 if individual i does not receive "treatment"     (1)
A natural approach to estimating the effect of treatment on an outcome y would then
be to simply include the treatment dummy w_i in a standard linear regression,

    y_i = α + β w_i + ε_i,     (2)

or, if we expand the approach to allow for other "explanatory variables" x_i,

    y_i = α + β w_i + δ x_i + ε_i.     (3)
So what's wrong with this approach? The answer is (as we will see): "Sometimes there
is nothing wrong with this approach!" However, there are two major concerns
which motivate a more general approach. First, we would generally be
concerned that the receipt-of-treatment dummy variable w_i might be correlated with the
error term, leading to a problem of "endogeneity". Indeed, many of the methods put
forward in the treatment evaluation literature are motivated precisely as ways of tackling
this potential endogeneity problem. Second, note that the above formulation
implicitly assumes that treatment has the same effect on the outcome for all individuals.
This would be a strong assumption: in many cases we would expect people to respond
to "treatment" in different ways; indeed, in general we would be concerned that there
might be unobserved heterogeneity in how individuals respond to treatment. Hence we
would like to develop a more general and rigorous framework in which we can address
some of these issues.
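To fix ideas, the dummy-variable regression in equation (3) can be sketched in a few lines of code. The data below are simulated, and all parameter values (α = 1, β = 2, δ = 0.5, the sample size) are hypothetical choices for the illustration, not estimates from any real programme:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated data: randomly assigned binary treatment plus one covariate
x = rng.normal(size=n)                 # observed characteristic x_i
w = rng.integers(0, 2, size=n)         # treatment dummy w_i in {0, 1}
eps = rng.normal(size=n)               # error term
y = 1.0 + 2.0 * w + 0.5 * x + eps      # outcome generated as in equation (3)

# OLS of y on a constant, w and x
X = np.column_stack([np.ones(n), w, x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat, delta_hat = coef
```

Because treatment is assigned at random in this simulation, beta_hat recovers the true effect; the concerns raised above bite when w is instead correlated with the error term.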
2 Causal Inference and Counterfactuals
Most of our discussion of treatment evaluation will be carried out in the context of the
Holland-Rubin causal model (Holland, 1986, Rubin, 1974).
2.1 The Causal Model and Measures of Treatment Effects
Treatment evaluation methods are concerned with identifying the causal effect of treatment
on an outcome of interest: What was the effect on individual i of receiving treatment?
Note that we are implicitly comparing individual i with herself in the alternative
scenario where she does not get treatment. Hence causal inference is based on the notion
of counterfactuals. A natural formulation is to say that each individual has two
potential outcomes: one outcome with treatment and one outcome without treatment.
We will denote the outcome of individual i with treatment as y_i1 and the outcome without
treatment as y_i0. The causal effect (or "treatment effect") for individual i is then
y_i1 − y_i0. The problem is that we will only ever observe one of the two potential outcomes: no
individual can both receive treatment and not receive treatment. Hence for individual i
there will be one observed outcome, which we will denote y_i. If the individual does receive
treatment, w_i = 1, then the observed outcome is the one with treatment, i.e. y_i = y_i1. On
the other hand, if the individual does not receive treatment, w_i = 0, then the observed
outcome is the one without treatment, y_i = y_i0. The outcome that remains unobserved
is the counterfactual.
It could be that everyone experiences the same treatment effect. In this special case,
which we will generally refer to as the "homogeneous treatment effects" case,

    y_i1 − y_i0 = β for all i.     (4)

However, in most cases this seems like an implausibly strong assumption. Hence we want
to allow for the possibility that individuals react differently to treatment.
Our definition of the treatment effect for individual i implicitly assumes that this
effect is independent of who else received treatment or, equivalently, that treatment of
individual i only affects individual i; in the literature this is commonly referred to as the
stable unit treatment value assumption (SUTVA) (Neyman, 1923, Rubin, 1980). This
assumption rules out, e.g., peer effects, cross effects and general equilibrium effects.
In the following we will adopt the assumption of random sampling. In particular,
we will refer to the potential outcomes model as our general “population model” and
assume that an independent, identically distributed (i.i.d.) sample can be drawn from the
population.1 In describing the population model we will dispense with the subscript i
for the individual and simply write y1 for the "treated" outcome, y0 for the "untreated"
outcome and y1 − y0 for the treatment effect.
Since the treatment effect can, in general, be expected to vary in the population, we
can think of it as a random variable. (If we were to randomly pick an individual, that
individual's treatment effect is a random variable.) We will then have to decide which moments of
its distribution we are interested in. A natural measure of the impact of treatment
to focus on is the average treatment effect in the population. This answers the following
question: If we randomly pick out an individual, what is the expected value of his/her
treatment effect? We can define this formally:
Definition 1 (Average Treatment Effect). The average treatment effect ATE is the
expected value of the treatment effect in the population,

    ATE ≡ E[y1 − y0].
What if we observe some individual characteristics like age, gender, etc.? It could,
for example, be that the average treatment effect among men differs from that among women.
How do we take this into account? A natural way is to collect an individual's
characteristics in a vector x, so that x could be, e.g., (male, age = 29, ..., white). How
would we then denote the average treatment effect among individuals with the specific
characteristics x? To do this we can use the notion of the conditional expected value,

    ATE(x) ≡ E[y1 − y0 | x].

ATE(x) should thus be interpreted as the average treatment effect among individuals
with characteristics x.
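A short "oracle" simulation may help fix the distinction between ATE and ATE(x): here we generate both potential outcomes for everyone (something we can never do with real data), so that the population averages can be computed directly. The group structure and effect sizes (an effect of 1.0 when x = 0 and 3.0 when x = 1) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Oracle world: both y0 and y1 are visible for every individual
x = rng.integers(0, 2, size=n)               # a single binary characteristic
y0 = rng.normal(size=n)                      # untreated potential outcome
y1 = y0 + np.where(x == 1, 3.0, 1.0)         # treated outcome: effect depends on x

ate = (y1 - y0).mean()                       # ATE = E[y1 - y0]
ate_x0 = (y1 - y0)[x == 0].mean()            # ATE(x = 0)
ate_x1 = (y1 - y0)[x == 1].mean()            # ATE(x = 1)
```

The population ATE is simply the average of the conditional effects ATE(x), weighted by the distribution of x.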
A second measure of the impact of treatment that we can estimate is the average
effect of treatment on the treated, which we denote ATT:
1There are cases where the i.i.d. assumption is not strictly valid, e.g. where data come in the form
of repeated cross-sections (where samples are obtained from the population at different points in time) or
of panel data (which consist of repeated observations on the same cross-section of individuals).
In these cases we will assume random sampling in the cross-sectional dimension.
Definition 2 (Average Treatment Effect on the Treated). The average effect of
treatment on the treated ATT is the expected value of the treatment effect among those
who would receive treatment,

    ATT ≡ E[y1 − y0 | w = 1].     (5)

Indeed, ATT is often more interesting than ATE; consider, e.g., the case of a programme
where a policy-maker chooses the eligible population. ATT then, by focusing on
the programme participants, determines the realized gross return from the programme,
which can then be compared to the costs in order to evaluate whether the programme
was successful or not (Heckman, LaLonde and Smith, 2000).
Note that ATE and ATT will, in general, not be the same. This could be due, e.g.,
to treatment being allocated to observable subgroups of the population who are expected to
benefit more from treatment. Or it could be the result of "self-selection": if individuals
can choose whether or not to participate, individual participation can be expected to
depend on the individual treatment effect. Indeed, the only case where ATE and ATT
can generally be expected to coincide is when the allocation of treatment is the outcome
of a randomized experiment (see below).
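The self-selection point can be illustrated with another oracle simulation: individual gains are heterogeneous, and individuals with larger gains are more likely to take up treatment. All the distributions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Heterogeneous treatment effects with population mean 1.0
gain = rng.normal(loc=1.0, scale=1.0, size=n)       # individual y1 - y0

# Self-selection: take-up is more likely when the individual's own gain is large
w = (gain + rng.normal(size=n) > 1.0).astype(int)

ate = gain.mean()            # E[y1 - y0], about 1.0
att = gain[w == 1].mean()    # E[y1 - y0 | w = 1], pushed above ATE by selection
```

Re-running the sketch with w drawn independently of gain (pure randomization) would make att and ate coincide up to sampling error.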
Later on we will also encounter a third measure of the effect of treatment; this concept,
introduced by Imbens and Angrist (1994), is known as a Local Average Treatment
Effect (LATE). LATE can be estimated using instrumental variables under weak
conditions. However, LATE has two drawbacks: (i) it measures the effect of treatment
on a generally unidentifiable subpopulation, and (ii) the definition itself depends on the
particular instrumental variable that one has available. LATE relies on the existence of
a variable that only affects an individual's outcome through the participation decision
w. In intuitive terms, LATE then measures the average impact of treatment on the
subpopulation whose participation is affected by variation in the instrumental variable.
2.2 The Observed Outcome: The Switching Equation
We noted above that, for any one individual, there are two potential outcomes, y1 and
y0. However, we only observe one of these two potential outcomes. In particular,

    w = 1 ⇒ y = y1 (the individual is treated)
    w = 0 ⇒ y = y0 (the individual is not treated)     (6)

A useful way of writing this is as follows:

    y = w y1 + (1 − w) y0 = y0 + w (y1 − y0)     (7)

The last formulation is revealing: it is as if the individual has a "base outcome" y0 to
which we add the treatment effect if the individual indeed gets treatment. This formulation
will come in handy: we will refer to it as the "switching equation".
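In code, the switching equation is a single line; the small simulated sample and the constant effect of 2.0 below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8

y0 = rng.normal(size=n)            # untreated potential outcomes
y1 = y0 + 2.0                      # treated potential outcomes
w = rng.integers(0, 2, size=n)     # treatment indicators

# Switching equation (7): the data reveal only one potential outcome per person
y = y0 + w * (y1 - y0)
```

Both algebraic forms agree: y0 + w*(y1 − y0) equals w*y1 + (1 − w)*y0 element by element, and the unobserved entry of each (y0, y1) pair is the counterfactual.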
2.3 Observed and Unobserved Characteristics
Above we introduced the notion of observed characteristics x that may affect both the effect of
treatment and the allocation of treatment. However, there may also exist characteristics
that we cannot measure. This is a common problem in econometrics. Consider, e.g., the
use of a traditional Mincerian wage equation for estimating the rate of return to education:
individual years of schooling are likely to be correlated with unobserved ability. The
problem of unobserved characteristics such as ability, motivation, self-esteem, etc. will
come up frequently in the treatment evaluation problem.
3 Randomized Experiments
In randomized experiments treatment is randomly assigned. Randomized experiments
in economics are rare; they are nevertheless interesting to consider for four reasons.
First, they are considered to be the "gold standard" of evaluation; hence, as they provide
a benchmark for other evaluation methods, it is useful to understand how they work.
Second, evidence based on randomized experiments is usually highly influential; hence
it is important to be aware of their potential limitations and of what can undermine their
validity. Third, "experiment-like situations" can occur by chance, in which case we talk
about “natural experiments”. Fourth, many of the methods build on the analytics of
randomized experiments.
An Example: Vitamin C and Cancer
One experiment gave vitamin C to 100 patients believed to be terminally ill with cancer
(see Rosenbaum (2002), Ch. 1). A comparison group was constructed by matched sampling:
for each treated patient, 10 other patients were randomly chosen from historical records
with the same type of cancer and similar characteristics in terms of age, gender, etc. It was
found that patients receiving vitamin C lived about 4 times longer than controls, a highly
significant difference. However, a carefully randomized experiment later conducted at the Mayo
Clinic overturned the result: patients were randomly assigned to receive vitamin C or a
placebo. With the more careful research design there was no evidence that vitamin C
helped prolong survival among cancer patients.
3.1 Basic Idea and Intuition
Suppose we have a target population and that we want to determine the average
effect of some specific "treatment". How might we go about doing this? Borrowing
ideas from medicine, a good way (at least from a statistical point of view)
would be to randomly split people into two groups: one group that will receive
the treatment and one that will not. The two groups are commonly referred to as the
"treated" group and the "untreated" (or "control") group.
The idea behind randomizing who gets treatment is that doing so guarantees that
there will be no systematic differences between the treated and untreated groups
– both groups will be representative of the population. Specifically, since the allocation
of treatment is completely randomized, who gets treatment cannot be related to the individual
effect of treatment or to any other individual characteristics, whether observable or
unobservable. Hence it cannot be, for example, that those who benefit more from the treatment
are more likely to get it.
This fact makes it extremely simple to identify the average effect of treatment: we
can simply compare the average outcome in the treated group with the average outcome
in the untreated group.
• The average outcome among those who did receive treatment should be representative
of what would be the average outcome in the whole population if everyone
had received treatment.
• The average outcome among those who did not receive treatment should be representative
of what would be the average outcome in the whole population if no one had received
treatment.
Suppose then that the allocation of treatment has been randomized and that we have a
random sample consisting of N1 individuals who received treatment and N0 individuals
who did not. A natural way to estimate the effect of treatment
is then to compare the means in the two groups. Let

    ȳ1 ≡ (1/N1) Σ_{i=1}^{N1} y_i   and   ȳ0 ≡ (1/N0) Σ_{i=1}^{N0} y_i,     (8)

be the average outcomes in the treated and the untreated group respectively, and consider
the following difference-in-means estimator of the average treatment effect,

    ÂTE = ȳ1 − ȳ0.
The question is: What are the properties of this estimator?
3.2 Theory
Let's look at some of the theory behind this. The starting point is that we have a
random draw from the population; thus {w_i, y_i0, y_i1}_{i=1}^N are treated as i.i.d. random
variables. For each individual we observe the outcome y_i = y_i0 + w_i (y_i1 − y_i0).
The crucial feature of randomization is that it ensures that the allocation of treatment
is (statistically) independent of the potential outcomes; stated in terms of the population
model, the assumption is:

Assumption 1 The allocation of treatment is independent of the potential outcomes,

    (y0, y1) ⊥ w.
The first thing to note is that, under randomized allocation of treatment, ATE = ATT,
since the treated individuals are, by construction, representative of the entire population.2
Formally, due to randomization,

    E(y_j | w = k) = E(y_j),  j, k = 0, 1     (9)

which implies that

    ATT = E(y1 − y0 | w = 1) = E(y1 − y0) = ATE.     (10)
Consider now the difference-in-means estimator; to do this we can use conditional expectations.
Since the allocation of treatment is completely randomized in the population,
w (the treatment dummy) must be (statistically) independent of the potential outcomes
y0 and y1. In particular, it follows that

    E[y1 | w = 1] = E[y1]     (11)

Note what this says: the expected outcome among those actually treated must equal the
expected "treated" outcome in the overall population. This follows from the fact that,
due to randomization, the treated individuals are representative of the overall population.
Similarly, it must be that

    E[y0 | w = 0] = E[y0]     (12)

This says that the expected outcome among those not treated equals the expected "untreated"
outcome in the overall population.
The above equalities were stated in terms of the potential outcomes y0 and y1. What
about the observed outcome y? What is the average value of the observed outcome y
among those who receive treatment? Formally this is E[y | w = 1]. To see exactly
what this is we can use the switching equation (7) to substitute for y; we then obtain

    E[y | w = 1] = E[y0 + w (y1 − y0) | w = 1] = E[y1 | w = 1] = E[y1]     (13)

The first equality comes from substituting for y using the switching equation (7); the
second equality follows from conditioning on w = 1; and the third
2Formally, E[y_k | w = j] = E[y_k] for k, j = 0, 1, whereby E[y1 − y0 | w = 1] = E[y1 − y0].
equality simply reiterates (11). Hence the expected outcome among those treated is the
same as the expected treated outcome in the overall population. Similarly, using the
switching equation we can get more insight into the average outcome among those who
are untreated. Using the same reasoning we obtain

    E[y | w = 0] = E[y0 + w (y1 − y0) | w = 0] = E[y0 | w = 0] = E[y0]     (14)

Hence the expected outcome among the untreated individuals is the expected untreated
outcome in the overall population.
Putting (13) and (14) together we obtain

    E[y | w = 1] − E[y | w = 0] = E[y1 − y0] = ATE.

But E[y | w = 1] is naturally estimated by the mean outcome among the treated, ȳ1.
Similarly, E[y | w = 0] is naturally estimated by the mean outcome among the untreated, ȳ0.
In particular, randomization of the allocation of treatment implies that the difference-in-means
estimator is unbiased, consistent and asymptotically normal.
We can now make use of the weak law of large numbers, which implies that the sample
mean ȳ1 converges (in probability) to E[y | w = 1] = E[y1], while the sample mean ȳ0
converges (in probability) to E[y | w = 0] = E[y0].3 Hence the difference-in-means estimator
ȳ1 − ȳ0 converges in probability to ATE; in other words, it is a consistent estimator for
ATE. We also say that ATE is "identified".
Hence we have managed to partially overcome the fact that the counterfactual is not
observed: randomization ensures that the outcomes in the control group will mimic what
would have happened in the treated group had they remained untreated. We will not
be able to identify the individual treatment effects, since we will not be able to observe
any one individual in more than one state. However, we will have a chance of identifying
the average treatment effect in the population by looking at the averages within the
treated group and the control group.
3Loosely stated, the weak law of large numbers says that, under the i.i.d. assumption, the sample
mean converges (in probability) to the population mean: let y_i, i = 1, 2, ... be a sequence of independent,
identically distributed random variables; then the sample average N^{-1} Σ_{i=1}^{N} y_i →p E(y_i) as the sample
size N grows to infinity.
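The consistency argument can be checked by simulation: as the sample grows, the difference-in-means estimate settles down at the true ATE. The data-generating process below (a true ATE of 1.5, heterogeneous individual effects) is a hypothetical choice for the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

def diff_in_means(n, true_ate=1.5):
    """One simulated randomized experiment of size n; returns ybar1 - ybar0."""
    y0 = rng.normal(size=n)
    y1 = y0 + true_ate + rng.normal(scale=0.5, size=n)  # heterogeneous effects
    w = rng.integers(0, 2, size=n)    # randomization: (y0, y1) independent of w
    y = y0 + w * (y1 - y0)            # switching equation
    return y[w == 1].mean() - y[w == 0].mean()

estimate_small = diff_in_means(100)
estimate_large = diff_in_means(1_000_000)
```

The large-sample estimate sits much closer to 1.5 than the small-sample one typically does, which is the weak law of large numbers at work.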
Using OLS
A point to note is that the randomized-experiment setting is one case where the
dummy-variable approach in a linear regression generates the "right answer": suppose
that we were simply to estimate a linear model where the treatment indicator w is the
only regressor,

    y = α + β w + ε.     (15)

As is well known, estimating this equation by OLS gives as the estimate of α the
mean outcome among the "untreated", ȳ0, and as the estimate of β the difference in means
between the "treated" and the "untreated", ȳ1 − ȳ0. In other words, the OLS estimator
of β is precisely the difference-in-means estimator that, under a randomized experiment,
provides a consistent estimate of ATE (or, equivalently, ATT).
Suppose we also observe some individual characteristics x. Do we then need to somehow
"control" for these? The answer is that, as long as we are interested in the average
treatment effect in the overall population, this should not be necessary: since the allocation
of treatment is randomized, there should be no systematic differences in observed
characteristics between the treated and untreated groups.
However, observing individual characteristics (other than the treatment indicator w)
is nevertheless useful for two reasons. First, they can be used to check the validity
of the assumption that the allocation of treatment is purely random: if the allocation of
treatment is indeed truly random, then the treatment indicator should not be correlated
with any of the observed individual characteristics. Second, observing individual characteristics
can help in establishing the statistical significance of the estimated treatment
effect. Suppose, e.g., that we are estimating the linear model using OLS. If we do not
include x in the regression, then we are effectively "leaving it in the error term".
This is not a problem for the consistency of the estimate of β (since it does not make
the error term correlated with w); however, it tends to make the standard error of the
estimate large. If we instead include the characteristics x in the regression we can reduce
the variance of the error term and hence also the standard error of the β estimate.
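Both points can be verified numerically: in a simulated randomized experiment, OLS with and without the covariate x gives (nearly) the same estimate of β, but including x shrinks its standard error. All parameter values below are hypothetical, and the standard errors are the conventional homoskedastic ones:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

x = rng.normal(size=n)                        # observed characteristic
w = rng.integers(0, 2, size=n)                # randomized => independent of x
y = 1.0 + 1.0 * w + 3.0 * x + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients and conventional (homoskedastic) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return b, np.sqrt(np.diag(sigma2 * XtX_inv))

b_short, se_short = ols(np.column_stack([np.ones(n), w]), y)     # y on w only
b_long, se_long = ols(np.column_stack([np.ones(n), w, x]), y)    # y on w and x
```

Leaving x out moves it into the error term, inflating the error variance and hence the standard error on the β estimate, without biasing it.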
3.3 Examples
The Canadian Self-Sufficiency Project
The Programme The Canadian Self-Sufficiency Project (SSP) is an example of
an in-work benefit scheme: it offers an earnings subsidy to long-term welfare recipients.
The aim of the programme is to support low-income households by "making work pay".
The SSP has three key features: (i) a substantial financial incentive for work relative to
non-work, (ii) a relatively low marginal tax rate on the earnings of those who work, and
(iii) a "full-time" work requirement of 30 hrs/week.
• Check the web for an up-to-date description...
Assuming that the 30-hour work requirement is met, the SSP benefit is equal to
half the difference between a target earnings level and the participant's gross labour earnings.
(Unearned income does not affect the SSP payment and the supplement is also not
dependent on family size.)
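As a sketch, the supplement rule just described can be written as a small function; the target earnings level and the example inputs below are hypothetical numbers, not the actual SSP parameters:

```python
def ssp_supplement(gross_earnings, hours_per_week, target_earnings=30_000.0):
    """SSP payment: half the gap between the target level and gross earnings,
    paid only when the 30-hour full-time requirement is met."""
    if hours_per_week < 30:
        return 0.0                               # fails the full-time test
    return max(0.0, 0.5 * (target_earnings - gross_earnings))

part_time = ssp_supplement(12_000.0, 25)         # 0.0: under 30 hours
full_time = ssp_supplement(12_000.0, 35)         # 0.5 * (30000 - 12000) = 9000.0
```

Written this way, each extra dollar earned (while eligible) reduces the supplement by 50 cents, which is the implicit 50 percent withdrawal rate noted below.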
The existence of the SSP fundamentally changes the budget constraint of an individual.
The basic welfare programme targeted at low-income households in Canada is
known as Income Assistance (IA).4 The IA, however, by reducing the benefit one-for-one
as the individual's earnings grow, implies an implicit 100 percent marginal tax rate (after
a modest earnings disregard).
The impact of the SSP on an individual's budget constraint can be seen, e.g., in the
hypothetical example shown in the figure below (Card and Robins, 1996, Fig 1).
FIG
Under IA there is first a short upward-sloping segment representing the earnings
disregard; the slope is then zero, representing the one-for-one benefit reduction, up until
the benefit is fully withdrawn. The SSP introduces a notch (a vertical jump) in the
budget constraint at 30 hrs and, moreover, by being withdrawn at only 50 percent as the
individual's earnings grow, generates a positively sloped segment.
4In practice each province operates its own IA programme; however, the provincial IA systems share
many of the key features that are important for our purposes, e.g. benefits being offset by income from
employment and other sources.
The Randomized Experiment Before being adopted as a national policy [check
this], an experimental version of the SSP was constructed. The full SSP evaluation entails
a five-year follow-up of some 6,000 families. Card and Robins (1996) provide an early
evaluation of some 2,000 families followed over the first 18-24 months of the experiment.
Eligibility in the experiment was limited to single parents who had been on IA for at
least 12 of the previous 13 months. People assigned to the programme were given up to 12
months to obtain a full-time job and initiate a first SSP payment. Those who initiated an
SSP payment would be eligible for SSP supplements for the next three years (whenever
satisfying the 30-hour requirement at at least the minimum wage). Those who did not initiate
an SSP payment within the initial 12-month period lost any further entitlement.
From economic theory, the main impact of the SSP is that it would induce some
people who otherwise would have remained on IA and worked less than 30 hours per week
to move from welfare to full-time employment (see Card and Robins, 1996, for a full
discussion).
The Evaluation Card and Robins (1996) explored the impact of the SSP on a sample of
single parents, over 18 years of age, residing in British Columbia and New Brunswick,
who had received IA payments for at least 12 of the past 13 months. The randomization
in the experimental phase of the SSP was carried out as follows. Sample members were
informed that they had been selected to participate in a research project involving the
possibility of a wage supplement. They were asked to sign a consent form (with roughly
90 percent of the selected individuals agreeing to participate), after which the sample
members were randomly allocated to a treatment group (1,066 individuals) and a control
group (1,056 individuals). Individual outcomes in terms of labour force participation,
hours of work, earnings, etc. were recorded from the start of the programme and also
retrospectively for one year prior to the experiment. Card and Robins use information
from the first 18-24 months of the experiment.
After verifying that the treatment and control groups have the same observable characteristics
(as they should, given that the randomization was properly done), Card and
Robins present their basic results. Since for each group the outcomes are recorded on a
monthly basis, it is possible to trace the treatment effect by month after the start of the
experiment. Moreover, since the outcomes were also recorded for the year prior to the
programme, it is possible to check that there are no differences in the outcome variables
between the two groups prior to the experiment.
The results can be conveniently represented graphically by plotting the time series of
the average monthly outcomes for the treatment and the control group, along with the
implied estimate of the monthly impact (i.e. the simple difference in means). The next
figure shows the estimated impact on monthly earnings (Card and Robins, Fig 5).
FIG
The figure shows that there were indeed no differences in the outcome variable between
the two groups in the pre-programme period; however, soon after the introduction of the
programme the earnings of the individuals in the treatment group exceeded the earnings
of those in the control group, with the difference – i.e. the treatment effect – being
statistically significant from month 5 onwards.5 The impact on the rate of full-time
employment follows a similar pattern. See the next figure (Card and Robins, Fig 7).
FIG
The estimated impact of the SSP on the full-time employment rate peaks at 14 percentage
points after 14 months before dropping back slightly to about 10 percentage
points after 17 months. Relatedly, Card and Robins report on the impact of the SSP on
the probability of being off IA, showing that the SSP markedly increases the probability of
being a non-IA recipient.
Hence the early programme evaluation indicated that a significant number of single
parents responded to the financial incentives provided by the programme; indeed, a little
over a year after programme enrollment the full-time employment rate among those
initial welfare recipients who were offered the SSP programme was nearly twice that of
those in the control group. However, Card and Robins also note that the recipients
tended to be taking jobs with relatively low pay.
5The jump that occurs at the introduction of the programme is argued to stem from the fact that
pre- and post-treatment outcomes were obtained from different surveys.
Other Examples of Randomized Experiments in Public Economics
The Evaluation of Training: The LaLonde Study In a highly influential study,
LaLonde (1986) used an experimental dataset to compare experimentally and non-experimentally
determined results, and to compare different types of non-experimental estimation
methodologies. The study is based on a programme called the National Supported
Work Demonstration (NSWD). This programme was operated during the mid-1970s in 10
sites across the US and was designed to help disadvantaged workers, in particular women
in receipt of cash welfare benefits (AFDC), ex-drug-addicts, ex-criminal offenders and
high-school drop-outs. In the programme, qualified applicants were randomly assigned
to treatment, which comprised a guaranteed job for 9 to 18 months. A total of 6,616
individuals were included in the study. Data were collected on both pre-programme earnings
and post-programme earnings, as well as a number of pre-programme variables (such as age,
education, ethnicity, etc.). Eligible candidates were randomized into the programme. Due
to the randomization, the effect of the programme could be evaluated without bias using
a simple difference in means.
Given that the allocation of treatment was random, the individuals
who received treatment and those (eligible) who did not should have the same
observed characteristics. Indeed, the following figure, which reproduces LaLonde's Table
1, shows that there were no differences between the treatment group and the control group
in terms of observable characteristics.
FIG
The following figure shows the earnings evolution for treatments and controls from a
pre-programme year (1975), through the treatment period (1976-77), to the post-programme
period (1978) (an excerpt from LaLonde's Table 3).
[Figure: excerpt from LaLonde's Table 3]
The results indicate that the earnings of the treated and the untreated individuals
were indeed very similar before the programme started. During the programme the
earnings of the treated were substantially higher, and after the programme had finished
the gap narrowed somewhat. Nevertheless, using the simple difference in means after the
programme as the estimate of the treatment effect, the estimated effect would be $886. If
we instead construct a (difference-in-differences) estimator that subtracts the difference in
earnings between the two groups before the programme, the estimate would be $847.
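The two estimates quoted above can be reproduced mechanically from four group means. The sketch below uses hypothetical means chosen only so that the arithmetic matches the $886 and $847 figures; they are not taken from LaLonde's tables:

```python
# Hypothetical mean earnings (pre = 1975, post = 1978); the numbers are made
# up, chosen only so the two estimators reproduce the $886 and $847 figures.
pre_treat, pre_control = 3100.0, 3061.0
post_treat, post_control = 6000.0, 5114.0

# Simple post-programme difference in means.
diff_in_means = post_treat - post_control                          # 886.0

# Difference-in-differences: net out the pre-programme gap between groups.
did = (post_treat - post_control) - (pre_treat - pre_control)      # 847.0

print(diff_in_means, did)
```

The difference-in-differences estimator nets out any level difference between the groups that was already present before treatment.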
The second aim of the LaLonde paper is to explore how well the non-experimental methods
that would typically be used actually perform. The idea is that, since we have
experimental data, we know the “right answer”; we can then ignore the control group
in the data and use other constructed control groups (from external data sources) along
with various regression methods to try to recover the treatment effect. LaLonde uses
data from the PSID and the CPS to generate alternative control groups. He then uses a
variety of methods including (i) simple regression using post-treatment data that controls
for differences in observed characteristics, (ii) difference-in-differences methods (with or
without controls for observed differences in characteristics), and (iii) a Heckman (two-step)
selection model. LaLonde concludes that the non-experimental methods perform very
poorly.
The US Re-Employment Bonus Experiments After a decade of increasing unemployment
in the US since the mid-1970s, there was an increased interest in identifying
reforms of the unemployment insurance (UI) system that would get the unemployed
back to work faster, and hence reduce the financial pressure on the UI system. A
number of states implemented randomized experiments in which unemployed workers were
given cash bonuses for finding jobs quickly. (See Meyer, 1995, for a survey.) One example
was the Illinois re-employment bonus experiment. This involved a $500 cash bonus for
finding a job within 11 weeks (and keeping it for at least 4 months). The outcome of the
Illinois re-employment bonus experiment was analyzed by Woodbury and Spiegelman
(1987) and will be considered in more detail later on in the course.
The Negative Income Tax Experiments
• Next time...
3.4 Natural Experiments
Randomized experiments in economics are rare. However, situations can occur in which
some event naturally creates an experiment-like situation, in the sense that
different individuals are exposed to different “treatments” in a way that is effectively
random. Situations like these are generally referred to as “natural experiments” to mark
that the randomization was not planned but rather occurred “naturally”. Insofar as
the allocation of treatment was indeed effectively random, the same analysis as
under planned randomized experiments applies.
A neat example of an analysis based on a natural experiment is Gould, Lavy and
Paserman (2004). The authors consider an event in which a large number of Ethiopian
Jews were, due to political instability, relocated to Israel over a very short period of
time (a few days). The families arriving from Ethiopia were, almost immediately upon
arrival, randomly allocated to destinations in Israel. Hence, some families would be
allocated to urban areas whereas other families would be allocated to rural areas etc. In
particular, the children of the immigrating families were, by chance, randomly allocated
to schools of different qualities. The randomization created by the event allowed the
authors to consider the impact of school quality on the children's achievements.
Unfortunately, natural experiments are also rare in the context of public economics.
3.5 Potential Problems with Randomized Experiments
Even though randomized experiments are commonly thought of as the gold standard
among evaluation methods, it should be noted that even this approach has some potential
problems (some of which also apply to other evaluation methods).
First, randomized experiments are generally very costly to implement. Often the financial
costs are large. Moreover, there are often “ethical costs” that make randomized
experiments infeasible: if some “treatment” is available that we believe might help some
individuals, it can be considered unethical to randomly allocate the treatment across
individuals rather than give it to those who are most in need. Similarly, there are certain
“treatments” that would be completely infeasible to allocate randomly. Consider e.g.
the treatment “growing up in a poor neighborhood”: it would be completely unthinkable
to construct a randomized scheme whereby children would be randomly re-allocated
to new homes and neighborhoods.
Second, there are threats to internal validity. As is obvious from the above analysis,
it is important that the randomization is correctly implemented, so that the treated
and untreated populations are indeed identical in all respects except for the receipt of
treatment. Further, it is important that the initial randomization is adhered to: that all
people who were initially assigned to receive treatment do indeed receive treatment and
do not fail to take it up.6 Another potential problem is that the individuals included in
the experiment may change their behaviour due to the mere fact of being in an experiment
(“Hawthorne effects”). These problems relate to the so-called internal validity of
the evaluation results.
However, there can also be threats to the external validity of the evaluation that
compromise our ability to generalize the results of the experiment to other populations
and settings. One threat to external validity is “experiment specificity”: when the
experimental sample is not representative of the population to which the policy might
be extended, or when the policy applied to the experimental sample is not representative
of the policy that would be implemented on a broader population. A second threat
to external validity is “limited duration”: the duration of an experiment may be too
short to identify the long-run responses that would obtain if the policy were permanently
adopted. Relatedly, adopting the programme as a wide-spread permanent policy could
have “general equilibrium effects” that are large enough that the results from the experiment
cannot be generalized.
4 When We Don’t Have a Randomized Experiment
Most often we have to get by with less “clean” evidence than would be offered by randomized
experiments. Non-randomization in the public policy context tends to occur for two
reasons:
1. The policy change (i.e. the treatment), when introduced, affected some individuals
but not others; however, the two groups differed systematically. Hence we have a
treated group and an untreated group, but the two groups cannot be expected
to be identical.

6 As we will see below, if this is not the case, e.g. due to some degree of self-selection, but the original assignment of treatment is known, the original assignment can be used as an instrumental variable.
2. The individuals themselves partly determine whether they receive treatment. In
other words, there is self-selection into treatment.
As an example of the first, suppose the government introduces a policy that is available
to people under the age of 25. Hence those under the age of 25 can be considered
as “treated” (i.e. exposed to the policy) while those above 25 are “untreated” (i.e.
unaffected by the policy). However, it is quite clear that the two groups are not identical.
Or, similarly, suppose that a policy is introduced in one area but not in another; the two
areas might not have identical compositions.
As an example of the second problem, take e.g. a government training programme. If
people can choose whether or not to participate, then we might suspect that those who
would gain the most from participating are also most likely to actually participate. In
this case it is easy to see why, if we are not careful, we could easily get a wrong estimate
of the effect of the programme.
4.1 A Useful Decomposition
To see how biases can easily occur it is useful to decompose the individual's treatment
effect. Recall that E(y₁) is the average outcome, with treatment, in the population. For
concreteness, think of “treatment” as participating in a particular training programme and
think of the outcome of interest y as earnings. Then y₁ is the individual's earnings after
participating in training.
Let's introduce some short-hand: define

μ₁ ≡ E(y₁)    (16)

to be the average outcome, with treatment, in the population. Then we can decompose the
individual's treated outcome as

y₁ = μ₁ + ν₁.    (17)
This has a simple interpretation: the individual's earnings with training is the average
earnings (with training) in the population, μ₁, plus an individual-specific component ν₁
which, by construction, has zero mean in the population, E(ν₁) = 0. Similarly, we can
define

μ₀ ≡ E(y₀)    (18)

as the average outcome in the population when no one gets treatment. If the specific
individual does not get any training her earnings will be y₀, which we can decompose
into the population average μ₀ and an individual-specific component of earnings without
training, ν₀:

y₀ = μ₀ + ν₀.    (19)
Note that

μ₁ − μ₀ = E(y₁) − E(y₀) = E(y₁ − y₀)    (20)

is nothing but the average treatment effect in the population, ATE. The individual's
treatment effect, y₁ − y₀, on the other hand is, by substitution,

y₁ − y₀ = (μ₁ + ν₁) − (μ₀ + ν₀) = (μ₁ − μ₀) + (ν₁ − ν₀).    (21)

Using the above decompositions we see that an individual's treatment effect can be thought
of as having two components: the average treatment effect ATE plus an individual-specific
component ν₁ − ν₀; note that, since E(ν₁) = E(ν₀) = 0, the individual-specific
component is, on average, zero in the population, E(ν₁ − ν₀) = 0.
Consider then a particular individual: if ν₁ > ν₀ the individual gains more from
participating in training than the average person, as she has a larger individual earnings
component when participating than when not participating.
Suppose now that we take expectations in (21) conditional on receiving treatment,
w = 1; this gives us the average treatment effect on the treated,

ATT = E(y₁ − y₀ | w = 1) = (μ₁ − μ₀) + E(ν₁ − ν₀ | w = 1).    (22)

In other words,

ATT = ATE + E(ν₁ − ν₀ | w = 1).    (23)
Hence, unless the last term is zero, the average treatment effect on the treated
is not the same as the average treatment effect in the population, ATT ≠ ATE. Suppose
e.g. that, due to self-selection, the participating individuals have, on average, a positive
individual-specific treatment effect component, E(ν₁ − ν₀ | w = 1) > 0. It then follows
that the average treatment effect on the treated is larger than the average treatment
effect in the population, ATT > ATE.
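A tiny worked example (with made-up numbers) makes the gap between ATT and ATE concrete: if the individuals with the largest gains self-select into training, the average gain among the treated exceeds the average gain in the population.

```python
# A toy population of four individuals, each a tuple (y0, y1, w), where w = 1
# means the individual self-selects into training. The numbers are made up;
# those with the largest gains choose to enrol.
people = [
    (10.0, 16.0, 1),  # gain 6, enrols
    (10.0, 14.0, 1),  # gain 4, enrols
    (10.0, 12.0, 0),  # gain 2, stays out
    (10.0, 10.0, 0),  # gain 0, stays out
]

gains = [y1 - y0 for y0, y1, w in people]
treated_gains = [y1 - y0 for y0, y1, w in people if w == 1]

ate = sum(gains) / len(gains)                  # (6+4+2+0)/4 = 3.0
att = sum(treated_gains) / len(treated_gains)  # (6+4)/2     = 5.0

print(ate, att)  # ATT > ATE because of self-selection on the gains
```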
It is then also easy to see why the simple difference-in-means estimator will generally
not estimate any treatment effect of interest; in general the difference-in-means estimator
estimates the expected difference in outcome between the treated and the untreated, which
is

E(y₁ | w = 1) − E(y₀ | w = 0) = (μ₁ − μ₀) + E(ν₁ | w = 1) − E(ν₀ | w = 0).    (24)

But this is, in general, neither ATE nor ATT. This is intuitive: suppose that those
who gain the most from participating in training are more likely to actually participate;
conversely, those who benefit the least are less likely to participate. Then it is clear that:

• The average earnings among those who actually participate is not a good estimate
of what the earnings in the whole population would have been if everyone had participated!

• The average earnings among those who do not participate is not a good estimate of
what the earnings in the whole population would have been if no one had participated.

Simply put, neither group is representative of the population. Hence if we were to use
a difference-in-means approach, we would be subtracting something unrepresentative
from something unrepresentative; it is then quite clear that we should not expect
to uncover the average treatment effect.
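A small simulation illustrates the bias. This is an illustrative sketch with made-up parameters: the true ATE is 5, and participation depends only on the individual gain ν₁ − ν₀, so here the difference in means recovers roughly the ATT of about 9 rather than the ATE; with selection on ν₀ as well it would in general recover neither.

```python
import random

random.seed(1)

N = 100_000
# Potential outcomes: baseline y0 and an individual gain nu1 - nu0.
y0 = [random.gauss(100.0, 10.0) for _ in range(N)]
gain = [random.gauss(5.0, 5.0) for _ in range(N)]  # ATE = 5 by construction
y1 = [a + g for a, g in zip(y0, gain)]

# Self-selection: only those with an above-average gain participate.
w = [g > 5.0 for g in gain]

treated = [y1[i] for i in range(N) if w[i]]
untreated = [y0[i] for i in range(N) if not w[i]]

diff_in_means = sum(treated) / len(treated) - sum(untreated) / len(untreated)
att = sum(gain[i] for i in range(N) if w[i]) / len(treated)

print(round(diff_in_means, 2), round(att, 2))
# both are roughly 9: far above the population ATE of 5
```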
4.2 Roadmap to Methodologies for Non-Experimental Data
The available menu of empirical methodologies for non-experimental data depends on two
main factors. The first factor is the timing-structure of the data: is the data a panel (or,
possibly, a repeated cross-section) or a single cross-section? A range of methodologies
can be applied to pure cross-sectional data; these include matching estimators, regression
discontinuity designs, the (Heckman) selection model, and instrumental variables methods.
If, on the other hand, the data are in longitudinal or repeated cross-section format, the
difference-in-differences approach can be applied. Moreover, various methods are, as we
will see, often combined.
The second factor is the richness of the data: are we measuring all factors that are
relevant for the selection into treatment? If we believe that we have good measures of
all factors that affect the allocation of treatment, we say that there is “selection on
observables” (only). If, on the other hand, we believe that some unmeasured individual
characteristics (e.g. intrinsic motivation, self-esteem etc.) affect who receives treatment,
then we have to acknowledge that there may be some “selection on unobservables”.
The main class of models that rely on the assumption of selection on observables is
the class of matching models. In contrast, the main models allowing for selection on
unobservables are the (Heckman) selection model and the IV methods.
5 Selection on Observables and Matching
Suppose then that we do not have a perfectly randomized experiment. Indeed, “natural”
experiments, by not being controlled randomizations, frequently give rise to treatment
and control groups that may be quite different from each other in terms of their observable
characteristics. This raises the question of how to control for these observable
differences. Matching provides a way forward for the case where we only have cross-sectional
information available; however, as we will see, it provides this way forward by
making fairly strong assumptions about our knowledge of the treatment allocation process.
Hence suppose that we have cross-sectional data where some individuals have received
treatment and some others have not, but the two groups are not identical in their
characteristics. A simple difference-in-means estimator is then likely to confound the effects
of the treatment with the effects of the differences between the groups. What can we do
then? A natural way forward is to try to understand how treatment is allocated. After all,
we figured out that it was not completely random. Specifically, suppose that we observe
a number of individual characteristics x. To make things easy, suppose for now that
the observed characteristics can only take on a finite number of possible values; in other
words, suppose that x is discrete and has some distribution in the population, which we
can represent by a probability density function φ(x) (with some support X). Moreover,
suppose that the observed characteristics x capture everything that is relevant to the
allocation of treatment. In other words, suppose we believe that the treatment depends
on the potential outcomes only through x. That is to say that the allocation of treatment
w is independent of (y₀, y₁) once we condition on any particular value of x.
Suppose e.g. that x = {gender, age}. If we compare a 30 year old man with a 25 year
old woman, it may be that the man is more likely to participate. However, the main
point is that all men of the same age are exactly equally likely to participate! No other
information about these men, no matter how hard or how easy to obtain, would help
us predict who in the group receives treatment. If the measured characteristics x truly
capture everything that is relevant to the allocation of treatment, then, if we focus on a
group of individuals with the same characteristics x, who gets treatment and who does
not should be purely random! But that means that, as long as we focus on individuals
with the same characteristics, it is as if we have a randomized experiment. That will
clearly give us a way forward.
The assumption that the allocation of treatment is independent of the potential outcomes
once we condition on a set of observable variables, introduced by Rosenbaum and
Rubin (1983), goes under various names in the literature: it is commonly referred to as the
“ignorability of treatment”, the “unconfoundedness”, or the “selection on observables”
assumption.
Theory: A Formalization
Suppose then that we observe the outcome variable y for a random sample from
the population (where the observed outcome is either y₁ or y₀ depending on whether
or not the individual has received treatment, as described above); we also observe who
receives treatment, as indicated by the variable w; finally, we observe a vector of individual
characteristics x. Our key assumption will be that x contains everything that is relevant
to the allocation of treatment. Hence, once we condition on x, the allocation of treatment
will no longer be correlated with the potential outcomes.
Assumption 2 Conditional Independence Assumption. Conditional on x, w and
(y₀, y₁) are independent: (y₀, y₁) ⊥ w | x.
This means that, if we take two individuals with the same observed characteristics x,
they will be equally likely to receive treatment: specifically, the allocation of treatment
cannot be related to any other factors, including the potential outcomes. The fact that
the allocation of treatment in this sense “ignores” the outcomes has motivated the name
“ignorability assumption”.
Another way of looking at the assumption of conditional independence is to note that it
allows the allocation of treatment w to be correlated with the potential outcomes (y₀, y₁),
but that the correlation disappears once we partial out the observed characteristics x.
One immediate effect of this is that we can unequivocally talk about the conditional
probability of participation given x; to see this, note that

Pr(w = 1 | x) = Pr(w = 1 | x, y₀, y₁),    (25)

since, given that we are already conditioning on x, w is independent of the potential
outcomes. Hence we can write an individual's probability of receiving treatment as a
function of her characteristics. Formally, for all x ∈ X, define

p(x) ≡ Pr(w = 1 | x).    (26)

The function p(x) is, in the treatment evaluation literature, commonly known as the
propensity score function.
For future reference we can also consider a second useful assumption; this assumption
states that, for every possible value of x (in the support of x), there are both treated and
untreated individuals.

Assumption 3 The Overlap Assumption. p(x) ∈ (0, 1) for all x ∈ X.

A simple violation of the overlap assumption occurs e.g. when treatment is given to all
men and no women: in this case p(male) = 1 and p(female) = 0. As we will see, the
overlap assumption is critical for the feasibility of matching techniques.
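The overlap condition can be checked directly in data by computing the empirical propensity score cell by cell. A minimal sketch with a made-up sample in which no untreated men are observed, so that p(male) = 1 and that cell fails overlap:

```python
from collections import defaultdict

# Hypothetical sample of (x, w) pairs in which no untreated men are observed.
sample = [("male", 1)] * 40 + [("female", 1)] * 10 + [("female", 0)] * 50

counts = defaultdict(lambda: [0, 0])  # x -> [number untreated, number treated]
for x, w in sample:
    counts[x][w] += 1

for x, (n0, n1) in sorted(counts.items()):
    p = n1 / (n0 + n1)   # empirical propensity score p(x) = Pr(w = 1 | x)
    ok = 0.0 < p < 1.0   # overlap requires both groups in the cell
    print(x, round(p, 2), "overlap holds" if ok else "OVERLAP FAILS")
```

In the failing cell there is no untreated observation to compare against, so no treatment effect can be estimated there.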
A problem with the assumption that x captures everything that is relevant to the
allocation of treatment is that it is virtually impossible to verify. Hence, for it to be
convincing, we really should have a good set of observed characteristics. Nevertheless,
let's proceed on faith and consider how the average treatment effect can be estimated
when selection into treatment is on observable variables.
5.1 Simple Matching
We noted above that if the observed characteristics x capture everything that is relevant
to the allocation of treatment, then, if we focus on a group of individuals with exactly the
same characteristics, who gets treatment and who does not is effectively random: it is as
if, within that group, there is a randomized experiment. But in that case we know how
to proceed. We can simply compare the average outcome of those who receive treatment
with the average outcome of those who don't.
Thus e.g. if x = {gender, age} we would pick out all women aged 35 (say) and, within
this group, compute the mean outcome among those who receive treatment, ȳ₁, and the
mean outcome among those who don't, ȳ₀. In order to emphasize that we are looking
only at individuals with the specific characteristics x = {female, age = 35}, we can write
these (conditional) sample averages as ȳ₁(x) and ȳ₀(x). We could then take the difference
ȳ₁(x) − ȳ₀(x) as a natural estimate of the average treatment effect among individuals
with these specific characteristics. Recall that we denoted this by ATE(x) above. Thus
we again use the simple difference-in-means estimator, only this time on individuals with
the same characteristics:

ÂTE(x) = ȳ₁(x) − ȳ₀(x).    (27)
What if we wanted to estimate the average treatment effect in the whole population?
We would need to determine what fraction of the population is of each type and take the
weighted average of the group-specific average treatment effects. Hence, if we want to
estimate ATE, what we need to do is:

• Obtain the estimate ÂTE(x) for all possible values of x, using the difference-in-means
estimator within each group x ∈ X.

• For each possible value x ∈ X, estimate the fraction of the population with those
particular characteristics; the obvious estimator is the corresponding fraction in the
sample, which we can denote f(x).

• Take the weighted average across all groups to obtain the estimated ATE,

ÂTE = Σ_{x∈X} f(x) ÂTE(x).    (28)
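The three steps above can be sketched directly in code. This is a toy illustration with a made-up discrete characteristic; the cell means, sample fractions, and weighted average follow equations (27) and (28):

```python
from collections import defaultdict

# Hypothetical data: tuples (x, w, y) with a single discrete characteristic x.
data = [
    ("young", 1, 12.0), ("young", 1, 14.0), ("young", 0, 10.0), ("young", 0, 10.0),
    ("old",   1, 21.0), ("old",   0, 20.0), ("old",   0, 19.0), ("old",   0, 21.0),
]

cells = defaultdict(lambda: {0: [], 1: []})
for x, w, y in data:
    cells[x][w].append(y)

N = len(data)
ate_hat = 0.0
for x, groups in cells.items():
    mean1 = sum(groups[1]) / len(groups[1])      # sample mean of treated, cell x
    mean0 = sum(groups[0]) / len(groups[0])      # sample mean of untreated
    ate_x_hat = mean1 - mean0                    # cell-level difference in means
    f_x = (len(groups[0]) + len(groups[1])) / N  # sample fraction of the cell
    ate_hat += f_x * ate_x_hat                   # weighted average over cells

print(ate_hat)  # (young: 3.0, old: 1.0, equal weights) -> 2.0
```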
Theory
Let's look a bit more in detail at the theory behind this. Recall that we defined ATE(x)
as the average treatment effect among the part of the population that has characteristics
x,

ATE(x) ≡ E(y₁ − y₀ | x).    (29)

The average treatment effect in the population, ATE, is simply the weighted average of
the average treatment effects in the subpopulations,

ATE = Σ_{x∈X} φ(x) ATE(x).    (30)
The key to the analysis is the assumption that, conditional on x, treatment status
w is independent of (y₀, y₁). This means that the expected outcome for those actually
treated is representative of the treated outcome among everyone with the characteristics
x; formally, Assumption 2 implies that

E(y₁ | x, w = 1) = E(y₁ | x).    (31)

Similarly, the expected outcome for those who were not treated is representative of the
untreated outcome among everyone with the characteristics x,

E(y₀ | x, w = 0) = E(y₀ | x).    (32)

Turning from the potential outcomes (y₀, y₁) to the actual outcomes y, we then have
that

E(y | x, w = 1) = E[y₀ + w(y₁ − y₀) | x, w = 1] = E[y₀ + 1·(y₁ − y₀) | x, w = 1]
= E(y₁ | x, w = 1) = E(y₁ | x).    (33)
The first equality follows from replacing the actual outcome y using the switching equation
(7), the second equality follows from plugging in that w = 1, the third equality follows
by direct simplification, while the last equality reiterates equation (31).
By an analogous argument we have that

E[y | x, w = 0] = E[y₀ + w(y₁ − y₀) | x, w = 0] = E[y₀ + 0·(y₁ − y₀) | x, w = 0]
= E[y₀ | x, w = 0] = E[y₀ | x].    (34)

Hence

E[y | x, w = 1] − E[y | x, w = 0] = E[y₁ | x] − E[y₀ | x] = E[y₁ − y₀ | x] = ATE(x).    (35)

Proceeding exactly as in the case of randomized experiments, we use that, due
to random sampling, the corresponding sample means are consistent estimators of the
corresponding population means.
Let ȳ₁(x) denote the sample mean outcome among the treated individuals with characteristics
x and let ȳ₀(x) denote the sample mean outcome among the untreated individuals
with characteristics x. Again, using the weak law of large numbers, the probability limit
of ȳ₁(x) is E[y | x, w = 1] and the probability limit of ȳ₀(x) is E[y | x, w = 0]. Hence
ȳ₁(x) − ȳ₀(x) is a consistent estimator of ATE(x).
Moreover, the observed fraction of individuals in the sample who have the characteristics
x,

f(x) ≡ (1/N) Σ_{i=1}^N I(x_i = x)    (36)

(where I(·) is the indicator function that is one if the statement in the brackets is true and
zero otherwise) is a consistent estimator of the corresponding fraction in the population,
i.e. the probability limit of f(x) is φ(x).7 The matching estimator

ÂTE = Σ_{x∈X} f(x) [ȳ₁(x) − ȳ₀(x)]    (37)

is therefore (using the Slutsky Theorem) a consistent estimator of ATE.

7 Recall that we are assuming that x is discrete.
Potential Problems
Although this sounds straightforward, three complications have to be tackled.

1. The number of possible combinations of characteristics tends to grow very quickly.

2. Some characteristics may be continuous.

3. For some values of x there may be no treated (or no untreated) individuals.
The first problem is known as the curse of dimensionality. To see the problem,
suppose that we initially have x = {gender, age} and that age can take on the values
20, 21, 22, ..., 50. Then there are already 2 × 31 = 62 possible x-vectors. Suppose
that we add another dimension; e.g. suppose we add years of schooling, which can take
on 10 different values (say). Then the number of possible x-vectors quickly increases to
620! Hence we see that simple matching quickly runs into problems as we add more
variables – the number of groups simply tends to grow very, very quickly. Note that this
is a problem since, for the analysis to be convincing, we need to make the case that we
are including in x all variables that are relevant to the allocation of treatment. But as
the possible number of x-vectors grows we also need the number of observations to grow,
so that we can estimate with precision the group-specific means ȳ₀(x) and ȳ₁(x) at every
x.
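The cell counts discussed above are easy to verify, and dividing a hypothetical sample size (the 5,000 below is made up) by the number of cells shows how thinly the data get spread:

```python
# Cell counts from the text: gender (2) x age 20..50 (31 values) = 62 cells;
# adding 10 schooling levels multiplies this to 620 cells.
genders, ages, schooling_levels = 2, 31, 10

cells_two = genders * ages                       # 62
cells_three = genders * ages * schooling_levels  # 620

# With a hypothetical sample of 5,000 observations, the average cell is thin:
obs = 5000
print(cells_three, round(obs / cells_three, 1))  # 620 cells, ~8 obs per cell
```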
The second problem is similar: any continuous variable can, in principle, take on an
infinite number of values, which makes it impossible to estimate ATE(x) at all possible
values of x: for one, we cannot even list all possible values of x and, moreover, we cannot
expect to find individuals with exactly the same characteristics. One way to “solve” this
problem is to “discretize” the continuous variables, stratifying the sample into bins or
cells.8
The third problem points to the importance of the overlap assumption: problem three
obtains when that assumption is violated. For some value of x, either ȳ₁(x) or ȳ₀(x)
cannot be computed – there is simply no one to average over! Hence we cannot estimate
ATE(x) – the average treatment effect at that specific x is simply not identified (and
hence, neither is ATE).

8 Another way to proceed is to accept “inexact” matching; e.g. one can compare each treated individual with the untreated individual with the “most similar” characteristics (where “most similar” is defined using some pre-specified distance measure, e.g. the Euclidean metric). If one further assumes that ATE(x) is continuous in x, a range of flexible non-parametric methods are available.
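The “inexact” matching mentioned in footnote 8 can be sketched as follows: a toy nearest-neighbour match on the Euclidean metric with made-up data, where each treated individual borrows the outcome of the closest untreated individual as a stand-in for the missing y₀.

```python
# "Inexact" nearest-neighbour matching on the Euclidean metric, as sketched in
# footnote 8. The data are made up: each unit is ((age, schooling), earnings).
treated = [((25, 12), 15.0), ((40, 16), 22.0)]
untreated = [((24, 12), 11.0), ((41, 15), 20.0), ((30, 10), 12.0)]

def dist(a, b):
    """Euclidean distance between two characteristic vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

effects = []
for x_t, y_t in treated:
    # The closest untreated unit stands in for the treated unit's missing y0.
    x_c, y_c = min(untreated, key=lambda u: dist(x_t, u[0]))
    effects.append(y_t - y_c)

att_hat = sum(effects) / len(effects)  # matching estimate of the ATT
print(att_hat)  # (15-11) and (22-20) averaged -> 3.0
```

Since only treated units are matched, this version estimates the ATT rather than the ATE.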
An Example: Earnings by Veteran Status
Angrist (1998) considers the effect of voluntary military service on the earnings and
employment status of veterans. A simple comparison of veterans to non-veterans can be
expected to be misleading for two reasons:

• There is likely to be self-selection in applying for the military.

• There is effective screening of applicants by the military.

To tackle these problems, Angrist's comparison by veteran status is restricted to a
sample of applicants to the military, only about half of whom actually enlist. Moreover,
the data contains most of the variables that the military uses to screen applicants (age,
schooling, the Armed Forces Qualification Test (AFQT) score, application year). Angrist
compares three different estimators:

• Difference in means between veterans and non-veterans.

• Matching on the observed variables.

• Regression estimates with controls for the observed variables.

Angrist merges military data (from the Defense Manpower Data Center) with earnings
data (from the Social Security Administration) using social security numbers. Knowledge
of the military selection process suggests that the recorded characteristics matter for
the entry decision. In order to perform matching, Angrist defines cells using the year
of application (1979–1982, 4 categories), AFQT score (5 categories), schooling level at
the time of application (6 categories), year of birth (1954–1965, 12 categories), and race (2
categories), generating a total of 2,880 categories.
• NEED TO COMPLETE THIS EXAMPLE >>>
5.2 Linear Regression
Early on we asked what's wrong with the first-year econometrics solution of simply using
the treatment indicator w as a dummy variable in an OLS regression. We can now have
a look at the answer to this question. Hence, consider the simple formulation

y = α + βw + δx + ε.    (38)

We want to know if, under some circumstances, the coefficient on the treatment
dummy, β, measures something meaningful like the ATE.
Note first that if we do not include the observable covariates x then we are almost
surely in trouble. Running the regression, using OLS, with the single dummy variable
as regressor would give an estimated value of β equal to the difference in sample means
between the treated and the untreated individuals, β̂ = ȳ₁ − ȳ₀ – i.e. it would be the
simple difference-in-means estimator. From our earlier analysis we know that the
difference-in-means estimator is a consistent estimator of

E(y₁ | w = 1) − E(y₀ | w = 0) = ATE + E(ν₁ | w = 1) − E(ν₀ | w = 0),    (39)

which is neither the ATE nor the ATT. Only in the case of pure randomization will this
approach in general consistently estimate ATE (and ATT).
However, maybe by including the observed covariates x we can rid the regression of
the problem that the allocation of treatment is correlated with the error term. Hence,
let's investigate under which conditions estimating this equation by OLS will give us
the answer we are looking for, namely the ATE. Based on the above intuition, we can
now show that applying OLS to (38) will provide a consistent estimate of ATE under three
assumptions.

1. We maintain the assumption that the characteristics in x capture everything that
is relevant to the allocation of treatment, i.e. we maintain the conditional independence
assumption 2.

2. We assume that the average treatment effect does not vary across groups: ATE(x)
is the same for all x.
3. The conditional average outcome in the absence of treatment is linear in x: E[y₀ | x] = α + xδ.
The above assumptions are clearly quite strong; nevertheless, there are plenty of
examples of OLS estimates in the literature. The thing to take away from this is that
OLS can make sense; however, the conditions under which it is perfectly valid are quite
stringent.
Theory*
Let's prove that OLS makes sense under the above assumptions. Let's start by formalizing
the second assumption:

ATE(x) = E[y₁ − y₀ | x] = β for all x ∈ X,    (40)

where β is a constant. Hence ATE(x) = ATE = β for all groups x. Recall then the
decomposition we generated above,

y_k = μ_k + ν_k with μ_k = E[y_k], k = 0, 1.    (41)

We noted that the individual treatment effect can then be written as

y₁ − y₀ = (μ₁ − μ₀) + (ν₁ − ν₀),    (42)

where μ₁ − μ₀ = ATE and where ν₁ − ν₀ is the individual-specific component of the
treatment effect. Taking the average within a given group x then yields

E[y₁ − y₀ | x] = ATE + E[ν₁ − ν₀ | x].    (43)

The left hand side is ATE(x). Hence, since ATE(x) = ATE for all x, it must be that

E[ν₁ − ν₀ | x] = 0 for all x ∈ X.    (44)

This of course simply reflects that, conditional on x, treatment is as if randomized.
Next we turn to the observed outcome y, given by the switching equation (7). Substituting
for y₁ and y₀ using the decomposition we see that

y = w(μ₁ + ν₁) + (1 − w)(μ₀ + ν₀)
= μ₀ + (μ₁ − μ₀)w + ν₀ + w(ν₁ − ν₀).    (45)
Now use that we observe the characteristics x; taking expectations of the observed out-
come conditional on x and w (the treatment indicator),
E [yjx; w] = E [¹0 + À0 + (¹1 ¡ ¹0)w + w (À1 ¡ À0) jx; w] (46)
In order to simplify this we use that ¹0 and (¹1 ¡ ¹0) = ATE are constants and that(trivially) E [wjx; w] = w. Thus
E [yjx; w] = ¹0 +ATE ¢ w + E [À0jx; w] + E [w (À1 ¡ À0) jx; w] (47)
However, when we consider the last term we see that it simplifies:

E[w(ν1 − ν0) | x, w] = w E[ν1 − ν0 | x, w] = w E[ν1 − ν0 | x] = 0     (48)
where the first equality follows from the fact that we are conditioning on w (so we can
treat it as a constant); the second equality follows from the conditional independence
assumption, which states that, once we condition on x, the potential outcomes are
independent of the treatment allocation w; finally, the last equality comes from equation
(44). Hence we have that

E[y | x, w] = μ0 + ATE · w + E[ν0 | x]     (49)
where we also used that, due to selection on observables, E[ν0 | x, w] = E[ν0 | x]. Finally, by the linearity assumption,

E[y0 | x] = E[μ0 + ν0 | x] = μ0 + E[ν0 | x] = α + xδ     (50)
Hence we have that

E[y | x, w] = α + ATE · w + xδ     (51)
Hence, under these specific assumptions, the linear structure holds and the coefficient on
the treatment dummy is the ATE. It can also easily be shown that w and the disturbance
ε are uncorrelated; this follows naturally from the assumption that the allocation of
treatment is related only to x, not to the individual potential outcomes. Hence the ATE
can, in this case, be consistently estimated using OLS.
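The result can be checked in a small simulation: generate potential outcomes satisfying assumptions 1-3, apply the switching equation, and read the ATE off the coefficient on the treatment dummy. All parameter values below are invented for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
alpha, delta, beta = 1.0, 0.5, 2.0        # illustrative: E[y0|x] = alpha + delta*x, ATE = beta

x = rng.integers(0, 4, n).astype(float)   # a single discrete covariate
y0 = alpha + delta * x + rng.normal(0.0, 1.0, n)   # mu0 + nu0
y1 = y0 + beta + rng.normal(0.0, 1.0, n)           # ATE(x) = beta for every x

p = 0.2 + 0.15 * x                        # treatment depends on x only (conditional independence)
w = (rng.random(n) < p).astype(float)
y = w * y1 + (1.0 - w) * y0               # the switching equation

# OLS of y on a constant, the treatment dummy w, and x
X = np.column_stack([np.ones(n), w, x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ate_hat = coef[1]
print(round(ate_hat, 2))                  # close to beta = 2.0
```

Note that the treatment allocation here is deliberately non-random overall (treated units have higher x), yet the coefficient on w still recovers the ATE because w is as good as random conditional on x.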
5.3 Matching on the Propensity Score

One advantage of simple matching was that it allows us to explore how the average
treatment effect varies with the individuals' characteristics; specifically, we could estimate
ATE(x) at the various values of x. The main problem with the matching method is that
it tends to run into the curse of dimensionality: with a rich dataset there are so many
variables available for categorising and grouping that the data gets split into a large
number of cells, each containing few observations. In short, if we exploit the richness of
the data to compare like with like, we are likely to end up with few comparable
observations and correspondingly imprecise estimates.
It turns out that there is another way to exploit the assumption that the observed
variables x capture everything that is relevant to the allocation of treatment. The basic
idea is that it may be possible to summarize x in a "lower dimension"; specifically, it
may be possible to control for the allocation of treatment directly rather than for all
dimensions of x. This is the basic idea underlying propensity score matching.
Above we used the notation p (x) to highlight that the probability of receiving treatment
depends on (and only on) the observed characteristics x. The probability of receiving
treatment p (x) is sometimes also known as the propensity score. The propensity score
function p (x) thus summarizes the process by which treatment is allocated. Note that
p (x) is always a number in the unit interval, so it is a particular way of summarizing x
in a single dimension. The question is: is it of any use? It turns out that it is. Let’s
elaborate a bit on this.
Recall the key assumption that x captures everything that is relevant to the allocation
of treatment. This was the reason why simple matching on x was valid. By focusing on
individuals with the same characteristics x we managed to get a handle on the counter-
factual: within the group of individuals with the same value of x, the average outcome
of those that were not treated should not be systematically di¤erent from the average
outcome that would have obtained for the treated had they in fact not been treated.
However, as the Angrist example showed, the number of categories can grow very
rapidly. Hence it would be useful if the information about the treatment allocation process
could be summarized in a lower dimension. Rosenbaum and Rubin (1983) showed that
rather than matching on the characteristics x one can match on the propensity score
p (x).9 Specifically, they showed that the key conditional independence assumption 2
implies that conditional independence also holds once we condition on the propensity
score (rather than the full vector x). Formally, assumption 2 implies that (y0, y1) ⊥ w | p(x); we prove this below.
To see the usefulness of this, it is worthwhile to go back to the logic underlying simple
matching on characteristics. The logic there was as follows: Take a group of individuals
with the same characteristics x. Since x fully determines the probability of being treated,
those who were in fact treated must be representative, in terms of potential outcomes,
of the whole group of individuals with characteristics x. Similarly, those who were not
treated must be representative, in terms of potential outcomes, of the whole group of
individuals with characteristics x. Hence the untreated group form a valid control group
for the treated group in the sense that the expected outcome in the untreated group
corresponds to the expected outcome in the treated group that would have obtained had
the latter not been treated.
Now do the same thought experiment, only now focus on a group of individuals with the
same propensity score, i.e. the same value of p(x), say p0 (a number between zero and
one). The individuals within this group generally have di¤erent observed characteristics;
indeed, the group consists of all individuals with any characteristics x such that p (x) =
p0. However, they do share one crucial feature: they are equally likely to be selected for
treatment. Hence those within this group who were in fact treated must be representative,
in terms of potential outcomes, of the whole group of individuals with propensity score
p (x) = p0. Similarly, those who were not treated must be representative, in terms of
potential outcomes, of the whole group of individuals with propensity score p (x) = p0.
Hence the untreated group form a valid control group for the treated group in the sense
that the expected outcome in the untreated group corresponds to the expected outcome
in the treated group that would have obtained had the latter not been treated.
Hence matching individuals by their propensity scores should in principle be feasible.
In practice, the propensity score function will not be known but rather must be estimated
e.g. using a Logit or Probit model; after doing so we can proceed by matching individuals
9See also Rosenbaum and Rubin (1984) and Heckman, Ichimura and Todd (1998).
using the estimated propensity score p̂(x).10 Suppose then that, within a random sample, we can determine each individual's probability of receiving treatment. Then
we can match the individuals into groups according to their propensity scores. Thus,
suppose that we have collected all the individuals in the sample who had the specific
probability p0 (a number between 0 and 1) of receiving treatment. Proceeding as in the
case with simple matching, we can then compute the sample average outcome ȳ1 among
those within this group who were actually treated. To emphasize that we are doing this
for the group of individuals with propensity score p0, we can write ȳ1(p0). Since the
people we are averaging over are representative, in terms of their potential outcomes, of
all individuals with propensity score p0, ȳ1(p0) estimates

E[y1 | p(x) = p0]
Similarly, we compute the average outcome among those within the group who were not
treated; the resulting average ȳ0(p0) estimates

E[y0 | p(x) = p0]
Hence the difference ȳ1(p0) − ȳ0(p0) estimates the average treatment effect among
the portion of the population that has the specific propensity score p0. In line with
our previous notation, we can denote this conditional average treatment effect ATE(p0),
defined as

ATE(p0) ≡ E[y1 − y0 | p(x) = p0]
The idea is then straightforward: we can try to proceed as we did with simple matching:

• We estimate ATE(p) at "every value" of p between zero and one.

• We also work out, for each probability p, how large a fraction of the population has this specific probability of receiving treatment.
10 Once we have estimated the propensity score function, we can (in principle) estimate the distribution
of propensity scores, F(p). (Note the interpretation: F(p) is the fraction of the population that has a
probability of being treated less than or equal to p.)
• For the population average treatment effect, ATE, we then take the weighted average across all values of p.
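The steps above can be carried out exactly when the covariate is discrete, since the propensity score then takes only a few values. A minimal sketch with a known propensity score; the heterogeneous effect ATE(x) = 1 + 0.5x and all other numbers are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
x = rng.integers(0, 4, n).astype(float)
p = 0.2 + 0.15 * x                         # known propensity score, four distinct values
w = rng.random(n) < p
y0 = x + rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0 + 0.5 * x                    # ATE(x) = 1 + 0.5x, so ATE = 1 + 0.5*E[x] = 1.75
y = np.where(w, y1, y0)

ate_hat = 0.0
for p0 in np.unique(p):
    grp = (p == p0)
    share = grp.mean()                               # fraction of the population with this score
    ate_p0 = y[grp & w].mean() - y[grp & ~w].mean()  # estimate of ATE(p0)
    ate_hat += share * ate_p0                        # weighted average across scores
print(round(ate_hat, 2))                   # close to the true ATE of 1.75
```

The loop mirrors the three bullet points: estimate ATE(p) at each value of p, work out the population share at that value, and take the weighted average.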
To summarize, in the propensity score matching method we thus model the allocation
of treatment as a function of observable variables and predict the probability of treatment
for both the treated and the untreated groups. The method then proceeds by comparing
the outcomes across treated and untreated individuals within groups of individuals that
have a very similar probability of receiving the treatment.
There are, however, a couple of practical problems with this approach.
1. We need to know each individual's probability of receiving treatment; specifically,
we need to know the propensity score function p(x).

2. The probability p0 is a continuous variable, so there is in principle an infinite number
of probabilities p0 between zero and one.
The first problem is, as noted above, usually handled by estimating, in a first step, the
function p(x) in a straightforward way (typically using a Logit or a Probit model) and
using the predicted probability p̂(x) for each individual. The second problem implies that
we are unlikely to find many treated and untreated individuals with exactly the same
probability of treatment, that is, we are unlikely to find many exact matches on the
propensity score. Moreover, there is in principle an infinite number of possible values of
p; hence we cannot possibly hope to estimate ATE(p) at every possible probability p.
There are two basic methods used to overcome this latter problem. One involves
matching each treated individual within the sample with an untreated individual who is
the "nearest neighbour" (according to some pre-specified criterion) in terms of the propensity
score.11 A second approach involves discretizing the population based on the (predicted)
propensity score.
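The nearest-neighbour idea can be sketched in a few lines: pair each treated unit with the untreated unit whose score is closest, and average the outcome differences (this estimates the average effect on the treated). The data below are simulated and the constant effect of 2.0 is an assumption of the example, not a claim from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(0.0, 1.0, n)
p = 1.0 / (1.0 + np.exp(-(0.8 * x - 0.5)))   # propensity score, here treated as known
w = rng.random(n) < p
y0 = x + rng.normal(0.0, 1.0, n)
y = np.where(w, y0 + 2.0, y0)                # constant treatment effect of 2.0

p_t, y_t = p[w], y[w]                        # treated units
order = np.argsort(p[~w])
p_c, y_c = p[~w][order], y[~w][order]        # untreated units, sorted by score

# for each treated unit, pick the untreated neighbour with the nearest score
pos = np.clip(np.searchsorted(p_c, p_t), 1, len(p_c) - 1)
left_closer = (p_t - p_c[pos - 1]) < (p_c[pos] - p_t)
match = np.where(left_closer, pos - 1, pos)
att_hat = np.mean(y_t - y_c[match])
print(round(att_hat, 1))                     # close to 2.0
```

Sorting the untreated scores lets `np.searchsorted` find each nearest neighbour without an explicit loop; note that a single control unit may serve as the match for several treated units.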
If we adopt the approach of discretizing the estimated propensity score we should also
check for "balance" within each bin, by which we mean that we should check that, once
we focus on individuals with the same (or similar) propensity scores, who within the group
11Stata programs that do propensity score matching are available on the web. In Stata type “net
search propensity score”.
receives treatment should not be related to the observed characteristics x. Hence, if within
a block we find that we do not have balance, this suggests that we are lumping together
individuals who are not close enough in terms of propensity scores. That would mean
that, within the block, those who received treatment appear to be systematically different
from those who did not, implying that the latter are not a valid control group for the former.
What can we do? One simple remedy is to stratify again (within the problematic block)
so as to ensure that we are only comparing individuals with sufficiently similar propensity
scores. To check for balance within a block one can simply compare the means of each
characteristic x between the treated and the untreated individuals.
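Such a comparison of means is a two-sample t-test. A minimal sketch, using a simulated design in which two covariates enter the score symmetrically, so that within a score block the first covariate varies but should be unrelated to treatment (all numbers are illustrative):

```python
import numpy as np

def balance_tstat(z, w):
    """Two-sample t-statistic comparing the mean of covariate z for treated vs untreated."""
    zt, zc = z[w], z[~w]
    se = np.sqrt(zt.var(ddof=1) / len(zt) + zc.var(ddof=1) / len(zc))
    return (zt.mean() - zc.mean()) / se

rng = np.random.default_rng(3)
n = 100_000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
p = 0.2 + 0.2 * (x1 + x2)            # p in {0.2, 0.4, 0.6}; (x1,x2)=(0,1) and (1,0) share p = 0.4
w = rng.random(n) < p

# within the p = 0.4 block, x1 differs across units but should be unrelated to w
block = (x1 + x2 == 1)               # the p = 0.4 stratum
t = balance_tstat(x1[block], w[block])
print(round(t, 2))                   # balance holds, so |t| should be small
```

If such a t-statistic came out significant, the remedy suggested in the text applies: split the block further so that only individuals with sufficiently similar scores are compared.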
Theory*
For now let's suppose that the propensity score p(x) is a known function. Note also that,
since the treatment indicator dummy w is binary,

p(x) = E(w | x)     (52)
We want to show that the conditional independence assumption 2 implies that we
also have conditional independence when we condition, not on the full vector x, but on
the summary function which is the propensity score function. In particular we want to
show the following:
Proposition 1 Suppose that Assumption 2 holds, so that (y0, y1) ⊥ w | x, and suppose also that Assumption 3 holds, so that p(x) ∈ (0, 1) for all x ∈ X. Then (y0, y1) ⊥ w | p(x).
We will show that Pr(w = 1 | y0, y1, p(x)) = Pr(w = 1 | p(x)) = p(x), which implies that w is independent of (y0, y1) conditional on p(x). The proof uses the law of iterated
expectations. First, note that since w is binary, w = 0, 1,

Pr(w = 1 | (y0, y1), p(x)) = E(w | (y0, y1), p(x))     (53)
Then expand the right hand side by using the law of iterated expectations,

Pr(w = 1 | (y0, y1), p(x)) = E[E(w | (y0, y1), p(x), x) | (y0, y1), p(x)]     (54)
Then simplify the right hand side using Assumption 2 (which implies that the conditioners
other than x in the inner expectation are superfluous),

Pr(w = 1 | (y0, y1), p(x)) = E[E(w | x) | (y0, y1), p(x)]     (55)
Then use that E(w | x) is, by definition of the propensity score, equal to p(x). Hence

Pr(w = 1 | (y0, y1), p(x)) = E[p(x) | (y0, y1), p(x)]     (56)
But then, trivially,

Pr(w = 1 | (y0, y1), p(x)) = p(x)     (57)
Moreover, by a similar argument,

Pr(w = 1 | p(x)) = E(w | p(x)) = E[E(w | p(x), x) | p(x)]     (58)
                 = E[E(w | x) | p(x)] = E[p(x) | p(x)] = p(x)     (59)
Hence we have that

Pr(w = 1 | (y0, y1), p(x)) = Pr(w = 1 | p(x))     (60)
since both equal p (x). Thus it follows that w is independent of (y0; y1) given that we
condition on p (x).
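The proposition can also be illustrated numerically: build a design in which two different covariate values share the same propensity score, and check that within that score group (i) the treated share equals the score and (ii) treated and untreated units have the same average y0 (observable here only because the simulation generates both potential outcomes). All numbers are invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
x = rng.integers(0, 4, n)
p = np.array([0.3, 0.5, 0.3, 0.5])[x]   # x = 0 and x = 2 share p = 0.3; x = 1 and x = 3 share p = 0.5
w = rng.random(n) < p
y0 = x + rng.normal(0.0, 1.0, n)        # untreated potential outcome depends on x

group = (p == 0.3)                      # one propensity score stratum, mixing x = 0 and x = 2
treated_share = w[group].mean()         # close to 0.3: Pr(w = 1 | p(x)) = p(x)
gap = y0[group & w].mean() - y0[group & ~w].mean()   # close to 0: y0 independent of w given p(x)
print(round(treated_share, 2), round(gap, 2))
```

Note that within the p = 0.3 stratum the treated and untreated subgroups each mix x = 0 and x = 2 in the same proportions, which is exactly why the y0 gap vanishes.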
Another implication of the conditional independence assumption 2 is that, once we
condition on p(x), x will be independent of w, i.e. x ⊥ w | p(x). This is sometimes known
as the "balancing score property". Intuitively, different combinations of the covariates
x can generate the same value of the propensity score; however, once we condition on
x being in the set {x | p(x) = p0} for any given value of p0, x will not be related in any
further way to the allocation of treatment. The implication of this is that, in a regression
of w on x and p(x), the coefficients on x should be zero (or not significantly different
from zero).
Proposition 1 proves that the conditional independence assumption 2 extends to the
propensity score: once we restrict our attention to individuals with the same value of the
propensity score it is as if the allocation of treatment was random within this group. If
we know the propensity score function then we can proceed as in the case of matching
(ignoring for a second the two complications that p(x) is a continuous variable and that
it is also unknown). For "each value" of the propensity score p(x), compute the sample
average of those treated, denoted ȳ1(p(x)), and of those untreated, denoted ȳ0(p(x)).
The first sample mean consistently estimates E[y | p(x), w = 1] while the latter
consistently estimates E[y | p(x), w = 0]. Estimate the density of p(x) to determine the
fraction of individuals at each possible value of p(x), denoted f(p(x)), which consistently
estimates the corresponding population fraction φ(p(x)). Then take the weighted average
over the possible values of the propensity score to obtain a consistent estimate of the ATE,

ÂTE = Σ_{p(x)} f(p(x)) [ȳ1(p(x)) − ȳ0(p(x))]     (61)
The two problems here are, as noted above, (i) that the propensity score is initially
unknown, and (ii) that the propensity score p(x) is generally a continuous variable.
The first problem can be handled if we can find a way of consistently estimating the
propensity score function p(·); in that case consistency carries over to the case where
we use the estimated propensity scores p̂(x). The second problem is typically handled
by partitioning the range of the propensity score function, i.e. the unit interval, into
subintervals, or by using different versions of "nearest neighbour" matching.
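The first-stage estimation can be sketched with a hand-rolled logit fitted by Newton-Raphson; in practice one would use a canned Logit or Probit routine, and the true coefficients below are invented for the simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(0.0, 1.0, n)
b_true = np.array([-1.0, 0.8])                  # illustrative logit coefficients
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-X @ b_true))
w = (rng.random(n) < p_true).astype(float)

# Newton-Raphson for the logit log-likelihood
b = np.zeros(2)
for _ in range(25):
    p_hat = 1.0 / (1.0 + np.exp(-X @ b))
    grad = X.T @ (w - p_hat)                              # score vector
    info = X.T @ (X * (p_hat * (1.0 - p_hat))[:, None])   # information matrix
    step = np.linalg.solve(info, grad)
    b += step
    if np.abs(step).max() < 1e-10:
        break

p_hat = 1.0 / (1.0 + np.exp(-X @ b))            # estimated propensity scores
print(np.round(b, 2))                            # close to the true coefficients (-1.0, 0.8)
```

The fitted values p̂(x) would then feed into either of the two matching schemes just described: nearest-neighbour pairing or stratification on subintervals of the unit interval.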
Examples
The Evaluation of a Training Program Above we discussed the paper by LaLonde
(1986); that paper used experimental data to explore how well standard non-experimental
estimators succeed in correctly estimating treatment effects. Dehejia and Wahba (1998)
use the same data to explore how well the propensity score matching approach fares.
They conclude that the propensity score matching approach works much better than
the non-experimental approaches considered by LaLonde (1986) and seems to frequently
come close to the experimental results.12
The study by Dehejia and Wahba is a nice example of the practical implementation
of propensity score matching. Recall that LaLonde studied experimental data on the
12See also Heckman and Smith (1995) for a discussion.
National Supported Work (NSW) Demonstration. In addition to using the treatment
and control group from the NSW experimental data, LaLonde also constructed alternative
"control groups" from other external data sources (PSID, CPS) in order to check how
other standard non-experimental approaches would fare. Dehejia and Wahba take the
data from LaLonde to examine whether the propensity score method fares better than the
non-experimental approaches considered by LaLonde. To do this, DW combine the data on
the treated individuals from the NSW experimental data with the artificially constructed
control groups from the PSID and the CPS (which contain the same background
information).
They then proceed in two steps. First, they estimate the propensity score p(x) using
the pre-treatment variables x (observed both for the individuals in the NSW data and
for the artificial control groups); to do this they use a standard logistic probability model.
They then group the observations into strata based on the estimated propensity score
and check for balancing of the pre-treatment variables within each stratum; that is, they
use statistical tests to check whether, within each stratum, the distribution of the pre-
treatment variables x is the same for the treated and the untreated individuals (as it
should be, due to the balancing score property w ⊥ x | p(x)). If there are no significant
differences in the distribution of x between the