Empirical Methods for Public Policy Analysis
Lecture notes
Dan Anderberg
Royal Holloway University of London
January 2007
1 Introduction to Treatment Evaluation
In this first part of the course we will go through some econometric techniques that
have become increasingly popular in public economics. In particular, I will focus on
what has become known as treatment evaluation. Treatment evaluation is concerned
with measuring the impact of interventions on outcomes of interest. The approach and
the terminology originate from medical research, where an intervention frequently means
exposing someone to some form of treatment.
In public economics the techniques can be used to study the effect of, e.g., welfare and
social insurance programmes on various aspects of behaviour, including labour supply,
unemployment duration, family structure, etc. We will encounter these methods frequently
during the course, which is why I want to start by spending some time providing
an overview of them.
• "Treatment" can mean just about anything (being exposed to a more generous
welfare system, getting training, smoking, having good neighbours, etc.)
1.1 Examples
Hormone Replacement Therapy
Consider first an example from medicine. The onset of menopause means that the
body produces less of hormones such as estrogen, and menopause is perceived to increase
the risk of osteoporosis, Alzheimer's disease and heart disease. Given this, a possible
treatment is "hormone replacement therapy" (HRT). Indeed, in 2001, some 20 percent of
US women in menopause were receiving HRT, at an expenditure of US$2.75 billion. The
question, however, is whether HRT was effective.
Evaluating the effectiveness of HRT is complicated by the fact that women who choose
to go on HRT after the onset of menopause differ from women who choose not to:
they have higher levels of HDL (good cholesterol), lower blood pressure, engage more in
physical activity, have lower weight, are more educated, etc. Controlling for all such
differences is not easily done.
To settle the issue a randomized trial was organized: the Women's Health Initiative
trial, in which 27,000 women aged 50-79 were followed for 9 years. HRT was randomly
allocated among the women. However, the trial was discontinued when
evidence mounted that HRT increases the risk of heart disease and stroke! Hence one
conclusion was that the initially perceived beneficial effect of HRT on reducing
the risk of heart disease and stroke must have been driven by selection effects.
The Tennessee STAR Experiment
The Tennessee STAR (Student/Teacher Achievement Ratio) experiment sought to determine
the effect of class size on educational outcomes. 79 schools were randomly selected
for treatment. The treatment involved forming classes, in terms of student numbers
and teacher-student ratio, for students in grades K-3 according to one of three designs:
small (13-17 students), regular (22-25), or regular with a teacher aide, with both students and teachers
randomly allocated to class types. The outcomes of interest were (i) standardized tests in
grades K-8 (short-run outcomes), and (ii) participation and scores in the ACT/SAT college
admission tests in the final year of high school. See Krueger and Whitmore (2001) for an
analysis of the Tennessee STAR experiment.
1.2 Treatment Effect
The policy relevance of treatment evaluation should be immediate: it can help us identify
potential improvements in policy. We will focus on the problem of estimating an
average treatment effect (ATE). Formally, the ATE is the average partial effect of a
binary explanatory variable. What does this mean? Partial effect means the effect of
the treatment holding other factors constant (as in a standard multiple regression). By
binary we mean that an individual either (i) gets the treatment, or (ii) does not get the
treatment; we are not, for example, studying treatments that vary in intensity. Thus we can
think of treatment as a dummy variable

    w_i = 1 if individual i receives "treatment"
    w_i = 0 if individual i does not receive "treatment"     (1)
A natural approach to estimating the effect of treatment on an outcome y would then
be to simply include the treatment dummy w_i in a standard linear regression,

    y_i = α + β w_i + ε_i,     (2)

or, if we expand the approach to allow for other "explanatory variables" x_i,

    y_i = α + β w_i + δ x_i + ε_i.     (3)
So what's wrong with this approach? The answer is (as we will see): "Sometimes there
is nothing wrong with this approach!" However, there are two major concerns
which motivate a more general approach. First, we would generally be
concerned that the receipt-of-treatment dummy variable w_i might be correlated with the
error term, leading to a problem of "endogeneity". Indeed, many of the methods put
forward in the treatment evaluation literature are motivated precisely as ways of tackling
this potential endogeneity problem. Second, note that the above formulation
implicitly assumes that treatment has the same effect on the outcome for all individuals.
This would be a strong assumption: in many cases we would expect people to respond
to "treatment" in different ways; indeed, in general we would be concerned that there
might be unobserved heterogeneity in how individuals respond to treatment. Hence we
would like to develop a more general and rigorous framework in which we can address
some of these issues.
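To fix ideas, the dummy-variable regression in equation (3) can be sketched in a few lines of code. The data below are simulated, and all parameter values (α = 1, β = 2, δ = 0.5, the sample size) are hypothetical choices for the illustration, not estimates from any real programme:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated data: randomly assigned binary treatment plus one covariate
x = rng.normal(size=n)                 # observed characteristic x_i
w = rng.integers(0, 2, size=n)         # treatment dummy w_i in {0, 1}
eps = rng.normal(size=n)               # error term
y = 1.0 + 2.0 * w + 0.5 * x + eps      # outcome generated as in equation (3)

# OLS of y on a constant, w and x
X = np.column_stack([np.ones(n), w, x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat, delta_hat = coef
```

Because treatment is assigned at random in this simulation, beta_hat recovers the true effect; the concerns raised above bite when w is instead correlated with the error term.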
2 Causal Inference and Counterfactuals
Most of our discussion of treatment evaluation will be carried out in the context of the
Holland-Rubin causal model (Holland, 1986, Rubin, 1974).
2.1 The Causal Model and Measures of Treatment Effects
Treatment evaluation methods are concerned with identifying the causal effect of treatment
on an outcome of interest: What was the effect on individual i of receiving treatment?
Note that we are implicitly comparing individual i with herself in the alternative
scenario where she does not get treatment. Hence causal inference is based on the notion
of counterfactuals. A natural formulation is to say that each individual has two
potential outcomes: one outcome with treatment and one outcome without treatment.
We will denote the outcome of individual i with treatment as y_i1 and the outcome without
treatment as y_i0. The causal effect (or "treatment effect") for individual i is then
y_i1 − y_i0. The problem is that we will only ever observe one of the two potential outcomes: no
individual can both receive treatment and not receive treatment. Hence for individual i
there will be one observed outcome, which we will denote y_i. If the individual does receive
treatment, w_i = 1, then the observed outcome is the one with treatment, i.e. y_i = y_i1. On
the other hand, if the individual does not receive treatment, w_i = 0, then the observed
outcome is the one without treatment, y_i = y_i0. The outcome that remains unobserved
is the counterfactual.
It could be that everyone experiences the same treatment effect. In this special case,
which we will generally refer to as the "homogeneous treatment effects" case,

    y_i1 − y_i0 = β for all i.     (4)

However, in most cases this seems like an implausibly strong assumption. Hence we want
to allow for the possibility that individuals react differently to treatment.
Our definition of the treatment effect for individual i implicitly assumes that this
effect is independent of who else received treatment or, equivalently, that treatment of
individual i only affects individual i; in the literature this is commonly referred to as the
stable unit treatment value assumption (SUTVA) (Neyman, 1923, Rubin, 1980). This
assumption rules out, e.g., peer effects, cross effects and general equilibrium effects.
In the following we will adopt the assumption of random sampling. In particular,
we will refer to the potential outcomes model as our general “population model” and
assume that an independent, identically distributed (i.i.d.) sample can be drawn from the
population.1 In describing the population model we will dispense with the subscript i
for the individual and simply write y1 for the "treated" outcome, y0 for the "untreated"
outcome and y1 − y0 for the treatment effect.
Since the treatment effect can, in general, be expected to vary in the population, we
can think of it as a random variable. (If we were to randomly pick an individual, that
individual's treatment effect is a random variable.) We will then have to decide which moments of
its distribution we are interested in. A natural measure of the impact of treatment
to focus on is the average treatment effect in the population. This answers the following
question: If we randomly pick out an individual, what is the expected value of his/her
treatment effect? We can define this formally:
Definition 1 (Average Treatment Effect). The average treatment effect ATE is the
expected value of the treatment effect in the population,

    ATE ≡ E[y1 − y0].
What if we observe some individual characteristics like age, gender, etc.? It could,
for example, be that the average treatment effect among men differs from that among women.
How do we take this into account? A natural way is to collect an individual's
characteristics in a vector x, so that x could be, e.g., (male, age = 29, ..., white). How
would we then denote the average treatment effect among individuals with the specific
characteristics x? To do this we can use the notion of the conditional expected value,

    ATE(x) ≡ E[y1 − y0 | x].

ATE(x) should thus be interpreted as the average treatment effect among individuals
with characteristics x.
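A short "oracle" simulation may help fix the distinction between ATE and ATE(x): here we generate both potential outcomes for everyone (something we can never do with real data), so that the population averages can be computed directly. The group structure and effect sizes (an effect of 1.0 when x = 0 and 3.0 when x = 1) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Oracle world: both y0 and y1 are visible for every individual
x = rng.integers(0, 2, size=n)               # a single binary characteristic
y0 = rng.normal(size=n)                      # untreated potential outcome
y1 = y0 + np.where(x == 1, 3.0, 1.0)         # treated outcome: effect depends on x

ate = (y1 - y0).mean()                       # ATE = E[y1 - y0]
ate_x0 = (y1 - y0)[x == 0].mean()            # ATE(x = 0)
ate_x1 = (y1 - y0)[x == 1].mean()            # ATE(x = 1)
```

The population ATE is simply the average of the conditional effects ATE(x), weighted by the distribution of x.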
A second measure of the impact of treatment that we can estimate is the average
effect of treatment on the treated, which we denote ATT:
1There are cases where the i.i.d. assumption is not strictly valid, e.g. where data come in the form
of repeated cross-sections (where samples are obtained from the population at different points in time) or
of panel data (which consist of repeated observations on the same cross-section of individuals).
In these cases we will assume random sampling in the cross-sectional dimension.
Definition 2 (Average Treatment Effect on the Treated). The average effect of
treatment on the treated ATT is the expected value of the treatment effect among those
who would receive treatment,

    ATT ≡ E[y1 − y0 | w = 1].     (5)

Indeed, ATT is often more interesting than ATE; consider, e.g., the case of a programme
where a policy-maker chooses the eligible population. ATT then, by focusing on
the programme participants, determines the realized gross return from the programme,
which can then be compared to the costs in order to evaluate whether the programme
was successful or not (Heckman, LaLonde and Smith, 2000).
Note that ATE and ATT will, in general, not be the same. This could be due, e.g.,
to treatment being allocated to observable subgroups of the population who are expected to
benefit more from treatment. Or it could be the result of "self-selection": if individuals
can choose whether or not to participate, individual participation can be expected to
depend on the individual treatment effect. Indeed, the only case where ATE and ATT
can generally be expected to coincide is when the allocation of treatment is the outcome
of a randomized experiment (see below).
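The self-selection point can be illustrated with another oracle simulation: individual gains are heterogeneous, and individuals with larger gains are more likely to take up treatment. All the distributions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Heterogeneous treatment effects with population mean 1.0
gain = rng.normal(loc=1.0, scale=1.0, size=n)       # individual y1 - y0

# Self-selection: take-up is more likely when the individual's own gain is large
w = (gain + rng.normal(size=n) > 1.0).astype(int)

ate = gain.mean()            # E[y1 - y0], about 1.0
att = gain[w == 1].mean()    # E[y1 - y0 | w = 1], pushed above ATE by selection
```

Re-running the sketch with w drawn independently of gain (pure randomization) would make att and ate coincide up to sampling error.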
Later on we will also encounter a third measure of the effect of treatment; this concept,
introduced by Imbens and Angrist (1994), is known as a Local Average Treatment
Effect (LATE). LATE can be estimated using instrumental variables under weak
conditions. However, LATE has two drawbacks: (i) it measures the effect of treatment
on a generally unidentifiable subpopulation, and (ii) the definition itself depends on the
particular instrumental variable that one has available. LATE relies on the existence of
a variable that only affects an individual's outcome through the participation decision
w. In intuitive terms, LATE then measures the average impact of treatment on the
subpopulation whose participation is affected by variation in the instrumental variable.
2.2 The Observed Outcome: The Switching Equation
We noted above that, for any one individual, there are two potential outcomes, y1 and
y0. However, we only observe one of these two potential outcomes. In particular,

    w = 1 ⇒ y = y1 (the individual is treated)
    w = 0 ⇒ y = y0 (the individual is not treated)     (6)

A useful way of writing this is as follows:

    y = w y1 + (1 − w) y0 = y0 + w (y1 − y0)     (7)

The last formulation is revealing: it is as if the individual has a "base outcome" y0 to
which we add the treatment effect if the individual indeed gets treatment. This formulation
will come in handy: we will refer to it as the "switching equation".
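In code, the switching equation is a single line; the small simulated sample and the constant effect of 2.0 below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8

y0 = rng.normal(size=n)            # untreated potential outcomes
y1 = y0 + 2.0                      # treated potential outcomes
w = rng.integers(0, 2, size=n)     # treatment indicators

# Switching equation (7): the data reveal only one potential outcome per person
y = y0 + w * (y1 - y0)
```

Both algebraic forms agree: y0 + w*(y1 − y0) equals w*y1 + (1 − w)*y0 element by element, and the unobserved entry of each (y0, y1) pair is the counterfactual.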
2.3 Observed and Unobserved Characteristics
Above we introduced the notion of observed characteristics x that may affect both the effect of
treatment and the allocation of treatment. However, there may also exist characteristics
that we cannot measure. This is a common problem in econometrics. Consider, e.g., the
use of a traditional Mincerian wage equation for estimating the rate of return to education:
individual years of schooling are likely to be correlated with unobserved ability. The
problem of unobserved characteristics such as ability, motivation, self-esteem, etc. will
come up frequently in the treatment evaluation problem.
3 Randomized Experiments
In randomized experiments treatment is randomly assigned. Randomized experiments
in economics are rare; they are nevertheless interesting to consider for four reasons.
First, they are considered to be the "gold standard" of evaluation; hence, as they provide
a benchmark for other evaluation methods, it is useful to understand how they work.
Second, evidence based on randomized experiments is usually highly influential; hence
it is important to be aware of their potential limitations and of what can undermine their
validity. Third, "experiment-like situations" can occur by chance, in which case we talk
about “natural experiments”. Fourth, many of the methods build on the analytics of
randomized experiments.
An Example: Vitamin C and Cancer
One experiment gave vitamin C to 100 patients believed to be terminally ill with cancer
(see Rosenbaum (2002), Ch. 1). A comparison group was constructed by matched sampling:
for each treated patient, 10 other patients were randomly chosen from historical records
with the same type of cancer and similar characteristics in terms of age, gender, etc. It was
found that patients receiving vitamin C lived about 4 times longer than controls, a highly
significant difference. However, a carefully randomized experiment later conducted at the Mayo
Clinic overturned the result: patients were randomly assigned to receive vitamin C or a
placebo. With the more careful research design there was no evidence that vitamin C
helped prolong survival among cancer patients.
3.1 Basic Idea and Intuition
Suppose we have a target population and that we want to determine the average
effect of some specific "treatment". How might we go about doing this? Borrowing
ideas from medicine, a good way (at least from a statistical point of view)
would be to randomly split people into two groups: one group that will receive
the treatment and one that will not. The two groups are commonly referred to as the
"treated" group and the "untreated" (or "control") group.
The idea behind randomizing who gets treatment is that doing so guarantees that
there will be no systematic differences between the treated and untreated groups
– both groups will be representative of the population. Specifically, since the allocation
of treatment is completely randomized, who gets treatment cannot be related to the individual
effect of treatment or to any other individual characteristics, whether observable or
unobservable. Hence it cannot be, for example, that those who benefit more from the treatment
are more likely to get it.
This fact makes it extremely simple to identify the average effect of treatment: we
can simply compare the average outcome in the treated group with the average outcome
in the untreated group.
• The average outcome among those who did receive treatment should be representative
of what would be the average outcome in the whole population if everyone
had received treatment.
• The average outcome among those who did not receive treatment should be representative
of what would be the average outcome in the whole population if no one had received
treatment.
Suppose then that the allocation of treatment has been randomized and that we have a
random sample consisting of N1 individuals who received treatment and N0 individuals
who did not. A natural way to estimate the effect of treatment
is then to compare the means in the two groups. Let

    ȳ1 ≡ (1/N1) Σ_{i=1}^{N1} y_i   and   ȳ0 ≡ (1/N0) Σ_{i=1}^{N0} y_i,     (8)

be the average outcomes in the treated and the untreated group respectively, and consider
the following difference-in-means estimator of the average treatment effect,

    ÂTE = ȳ1 − ȳ0.
The question is: What are the properties of this estimator?
3.2 Theory
Let's look at some of the theory behind this. The starting point is that we have a
random draw from the population; thus {w_i, y_i0, y_i1}_{i=1}^N are treated as i.i.d. random
variables. For each individual we observe the outcome y_i = y_i0 + w_i (y_i1 − y_i0).
The crucial feature of randomization is that it ensures that the allocation of treatment
is (statistically) independent of the potential outcomes; stated in terms of the population
model, the assumption is:

Assumption 1 The allocation of treatment is independent of the potential outcomes,

    (y0, y1) ⊥ w.
The first thing to note is that, under randomized allocation of treatment, ATE = ATT,
since the treated individuals are, by construction, representative of the entire population.2
Formally, due to randomization,

    E(y_j | w = k) = E(y_j),  j, k = 0, 1     (9)

which implies that

    ATT = E(y1 − y0 | w = 1) = E(y1 − y0) = ATE.     (10)
Consider now the difference-in-means estimator; to do this we can use conditional expectations.
Since the allocation of treatment is completely randomized in the population,
w (the treatment dummy) must be (statistically) independent of the potential outcomes
y0 and y1. In particular, it follows that

    E[y1 | w = 1] = E[y1]     (11)

Note what this says: the expected outcome among those actually treated must equal the
expected "treated" outcome in the overall population. This follows from the fact that,
due to randomization, the treated individuals are representative of the overall population.
Similarly, it must be that

    E[y0 | w = 0] = E[y0]     (12)

This says that the expected outcome among those not treated equals the expected "untreated"
outcome in the overall population.
The above equalities were stated in terms of the potential outcomes y0 and y1. What
about the observed outcome y? What is the average value of the observed outcome y
among those who receive treatment? Formally this is E[y | w = 1]. To see exactly
what this is we can use the switching equation (7) to substitute for y; we then obtain

    E[y | w = 1] = E[y0 + w (y1 − y0) | w = 1] = E[y1 | w = 1] = E[y1]     (13)

The first equality comes from substituting for y using the switching equation (7); the
second equality follows from conditioning on w = 1; and the third
2Formally, E[y_k | w = j] = E[y_k] for k, j = 0, 1, whereby E[y1 − y0 | w = 1] = E[y1 − y0].
equality simply reiterates (11). Hence the expected outcome among those treated is the
same as the expected treated outcome in the overall population. Similarly, using the
switching equation we can get more insight into the average outcome among those who
are untreated. Using the same reasoning we obtain

    E[y | w = 0] = E[y0 + w (y1 − y0) | w = 0] = E[y0 | w = 0] = E[y0]     (14)

Hence the expected outcome among the untreated individuals is the expected untreated
outcome in the overall population.
Putting (13) and (14) together we obtain

    E[y | w = 1] − E[y | w = 0] = E[y1 − y0] = ATE.

But E[y | w = 1] is naturally estimated by the mean outcome among the treated, ȳ1.
Similarly, E[y | w = 0] is naturally estimated by the mean outcome among the untreated, ȳ0.
In particular, randomization of the allocation of treatment implies that the difference-in-means
estimator is unbiased, consistent and asymptotically normal.
We can now make use of the weak law of large numbers, which implies that the sample
mean ȳ1 converges (in probability) to E[y | w = 1] = E[y1], while the sample mean ȳ0
converges (in probability) to E[y | w = 0] = E[y0].3 Hence the difference-in-means estimator
ȳ1 − ȳ0 converges in probability to ATE; in other words, it is a consistent estimator for
ATE. We also say that ATE is "identified".
Hence we have managed to partially overcome the fact that the counterfactual is not
observed: randomization ensures that the outcomes in the control group will mimic what
would have happened in the treated group had they remained untreated. We will not
be able to identify the individual treatment effects, since we will not be able to observe
any one individual in more than one state. However, we will have a chance of identifying
the average treatment effect in the population by looking at the averages within the
treated group and the control group.
3Loosely stated, the weak law of large numbers says that, under the i.i.d. assumption, the sample
mean converges (in probability) to the population mean: let y_i, i = 1, 2, ... be a sequence of independent,
identically distributed random variables; then the sample average N^{-1} Σ_{i=1}^{N} y_i →p E(y_i) as the sample
size N grows to infinity.
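The consistency argument can be checked by simulation: as the sample grows, the difference-in-means estimate settles down at the true ATE. The data-generating process below (a true ATE of 1.5, heterogeneous individual effects) is a hypothetical choice for the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

def diff_in_means(n, true_ate=1.5):
    """One simulated randomized experiment of size n; returns ybar1 - ybar0."""
    y0 = rng.normal(size=n)
    y1 = y0 + true_ate + rng.normal(scale=0.5, size=n)  # heterogeneous effects
    w = rng.integers(0, 2, size=n)    # randomization: (y0, y1) independent of w
    y = y0 + w * (y1 - y0)            # switching equation
    return y[w == 1].mean() - y[w == 0].mean()

estimate_small = diff_in_means(100)
estimate_large = diff_in_means(1_000_000)
```

The large-sample estimate sits much closer to 1.5 than the small-sample one typically does, which is the weak law of large numbers at work.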
Using OLS
A point to note is that the randomized-experiment setting is one case where the
dummy-variable approach in a linear regression generates the "right answer": suppose
that we were simply to estimate a linear model where the treatment indicator w is the
only regressor,

    y = α + β w + ε.     (15)

As is well known, estimating this equation by OLS gives as the estimate of α the
mean outcome among the "untreated", ȳ0, and as the estimate of β the difference in means
between the "treated" and the "untreated", ȳ1 − ȳ0. In other words, the OLS estimator
of β is precisely the difference-in-means estimator that, under a randomized experiment,
provides a consistent estimate of ATE (or, equivalently, ATT).
Suppose we also observe some individual characteristics x. Do we then need to somehow
"control" for these? The answer is that, as long as we are interested in the average
treatment effect in the overall population, this should not be necessary: since the allocation
of treatment is randomized, there should be no systematic differences in observed
characteristics between the treated and untreated groups.
However, observing individual characteristics (other than the treatment indicator w)
is nevertheless useful for two reasons. First, they can be used to check the validity
of the assumption that the allocation of treatment is purely random: if the allocation of
treatment is indeed truly random, then the treatment indicator should not be correlated
with any of the observed individual characteristics. Second, observing individual characteristics
can help in establishing the statistical significance of the estimated treatment
effect. Suppose, e.g., that we are estimating the linear model using OLS. If we do not
include x in the regression, then we are effectively "leaving it in the error term".
This is not a problem for the consistency of the estimate of β (since it does not make
the error term correlated with w); however, it tends to make the standard error of the
estimate large. If we instead include the characteristics x in the regression we can reduce
the variance of the error term and hence also the standard error of the β estimate.
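Both points can be verified numerically: in a simulated randomized experiment, OLS with and without the covariate x gives (nearly) the same estimate of β, but including x shrinks its standard error. All parameter values below are hypothetical, and the standard errors are the conventional homoskedastic ones:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

x = rng.normal(size=n)                        # observed characteristic
w = rng.integers(0, 2, size=n)                # randomized => independent of x
y = 1.0 + 1.0 * w + 3.0 * x + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients and conventional (homoskedastic) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return b, np.sqrt(np.diag(sigma2 * XtX_inv))

b_short, se_short = ols(np.column_stack([np.ones(n), w]), y)     # y on w only
b_long, se_long = ols(np.column_stack([np.ones(n), w, x]), y)    # y on w and x
```

Leaving x out moves it into the error term, inflating the error variance and hence the standard error on the β estimate, without biasing it.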
3.3 Examples
The Canadian Self-Sufficiency Project
The Programme The Canadian Self-Sufficiency Project (SSP) is an example of
an in-work benefit scheme: it offers an earnings subsidy to long-term welfare recipients.
The aim of the programme is to support low-income households by "making work pay".
The SSP has three key features: (i) a substantial financial incentive for work relative to
non-work, (ii) a relatively low marginal tax rate on the earnings of those who work, and
(iii) a "full-time" work requirement of 30 hrs/week.
• Check the web for an up-to-date description...
Assuming that the 30-hour work requirement is met, the SSP benefit is equal to
half the difference between a target earnings level and the participant's gross labour earnings.
(Unearned income does not affect the SSP payment and the supplement is also not
dependent on family size.)
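As a sketch, the supplement rule just described can be written as a small function; the target earnings level and the example inputs below are hypothetical numbers, not the actual SSP parameters:

```python
def ssp_supplement(gross_earnings, hours_per_week, target_earnings=30_000.0):
    """SSP payment: half the gap between the target level and gross earnings,
    paid only when the 30-hour full-time requirement is met."""
    if hours_per_week < 30:
        return 0.0                               # fails the full-time test
    return max(0.0, 0.5 * (target_earnings - gross_earnings))

part_time = ssp_supplement(12_000.0, 25)         # 0.0: under 30 hours
full_time = ssp_supplement(12_000.0, 35)         # 0.5 * (30000 - 12000) = 9000.0
```

Written this way, each extra dollar earned (while eligible) reduces the supplement by 50 cents, which is the implicit 50 percent withdrawal rate noted below.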
The existence of the SSP fundamentally changes the budget constraint of an individual.
The basic welfare programme targeted at low-income households in Canada is
known as Income Assistance (IA).4 The IA, however, by reducing the benefit one-for-one
as the individual's earnings grow, implies an implicit 100 percent marginal tax rate (after
a modest earnings disregard).
The impact of the SSP on an individual's budget constraint can be seen, e.g., in the
hypothetical example shown in the figure below (Card and Robins, 1996, Fig 1).
FIG
Under IA there is first a short upward-sloping segment representing the earnings
disregard; the slope is then zero, representing the one-for-one benefit reduction, up until
the benefit is fully withdrawn. The SSP introduces a notch (a vertical jump) in the
budget constraint at 30 hrs and, moreover, by being withdrawn at only 50 percent as the
individual's earnings grow, generates a positively sloped segment.
4In practice each province operates its own IA programme; however, the provincial IA systems share
many of the key features that are important for our purposes, e.g. benefits being offset by income from
employment and other sources.
The Randomized Experiment Before being adopted as a national policy [check
this], an experimental version of the SSP was constructed. The full SSP evaluation entails
a five-year follow-up of some 6,000 families. Card and Robins (1996) provide an early
evaluation of some 2,000 families followed over the first 18-24 months of the experiment.
Eligibility in the experiment was limited to single parents who had been on IA for at
least 12 of the previous 13 months. People assigned to the programme were given up to 12
months to obtain a full-time job and initiate a first SSP payment. Those who initiated an
SSP payment would be eligible for SSP supplements for the next three years (whenever
satisfying the 30-hour requirement at at least the minimum wage). Those who did not initiate
an SSP payment within the initial 12-month period lost any further entitlement.
From economic theory, the main impact of the SSP is that it would induce some
people who otherwise would have remained on IA and worked less than 30 hours per week
to move from welfare to full-time employment (see Card and Robins, 1996, for a full
discussion).
The Evaluation Card and Robins (1996) explored the impact of the SSP on a sample of
single parents, over 18 years of age, residing in British Columbia and New Brunswick,
who had received IA payments for at least 12 of the past 13 months. The randomization
in the experimental phase of the SSP was carried out as follows. Sample members were
informed that they had been selected to participate in a research project involving the
possibility of a wage supplement. They were asked to sign a consent form (with roughly
90 percent of the selected individuals agreeing to participate), after which the sample
members were randomly allocated to a treatment group (1,066 individuals) and a control
group (1,056 individuals). Individual outcomes in terms of labour force participation,
hours of work, earnings, etc. were recorded from the start of the programme and also
retrospectively for one year prior to the experiment. Card and Robins use information
from the first 18-24 months of the experiment.
After verifying that the treatment and control groups have the same observable characteristics
(as they should, given that the randomization was properly done), Card and
Robins present their basic results. Since for each group the outcomes are recorded on a
monthly basis, it is possible to trace the treatment effect by month after the start of the
experiment. Moreover, since the outcomes were also recorded for the year prior to the
programme, it is possible to check that there are no differences in the outcome variables
between the two groups prior to the experiment.
The results can be conveniently represented graphically by plotting the time series of
the average monthly outcomes for the treatment and the control group, along with the
implied estimate of the monthly impact (i.e. the simple difference in means). The next
figure shows the estimated impact on monthly earnings (Card and Robins, Fig 5).
FIG
The figure shows that there were indeed no differences in the outcome variable between
the two groups in the pre-programme period; however, soon after the introduction of the
programme the earnings of the individuals in the treatment group exceeded the earnings
of those in the control group, with the difference – i.e. the treatment effect – being
statistically significant from month 5 onwards.5 The impact on the rate of full-time
employment follows a similar pattern. See the next figure (Card and Robins, Fig 7).
FIG
The estimated impact of the SSP on the full-time employment rate peaks at 14 percentage
points after 14 months before dropping back slightly to about 10 percentage
points after 17 months. Relatedly, Card and Robins report on the impact of the SSP on
the probability of being off IA, showing that the SSP markedly increases the probability of
being a non-IA recipient.
Hence the early programme evaluation indicated that a significant number of single
parents responded to the financial incentives provided by the programme; indeed, a little
over a year after programme enrollment the full-time employment rate among those
initial welfare recipients who were offered the SSP programme was nearly twice that of
those in the control group. However, Card and Robins also note that the recipients
tended to be taking jobs with relatively low pay.
5The jump that occurs at the introduction of the programme is argued to stem from the fact that
pre- and post-treatment outcomes were obtained from different surveys.
Other Examples of Randomized Experiments in Public Economics
The Evaluation of Training: The LaLonde Study In a highly influential study,
LaLonde (1986) used an experimental dataset to compare experimentally and non-experimentally
determined results, and to compare different types of non-experimental estimation
methodologies. The study is based on a programme called the National Supported
Work Demonstration (NSWD). This programme was operated during the mid-1970s in 10
sites across the US and was designed to help disadvantaged workers, in particular women
in receipt of cash welfare benefits (AFDC), ex-drug-addicts, ex-criminal offenders and
high-school drop-outs. In the programme, qualified applicants were randomly assigned
to treatment, which comprised a guaranteed job for 9 to 18 months. A total of 6,616
individuals were included in the study. Data were collected on both pre-programme earnings
and post-programme earnings, as well as a number of pre-programme variables (such as age,
education, ethnicity, etc.). Eligible candidates were randomized into the programme. Due
to the randomization, the effect of the programme could be evaluated without bias using
a simple difference in means.
Given that the allocation of treatment was random, the individuals
who received treatment and those (eligible) who did not should have the same
observed characteristics. Indeed, the following figure, which reproduces LaLonde's Table
1, shows that there were no differences between the treatment group and the control group
in terms of observable characteristics.
FIG
The following figure shows the earnings evolution for treatments and controls from a
pre-programme year (1975), through the treatment period (1976-77), to the post-programme
period (1978) (an excerpt from LaLonde's Table 3).
[Figure: excerpt from LaLonde's Table 3]
The results indicate that the earnings of the treated and the untreated individuals
were indeed very similar before the programme started. During the programme the
earnings of the treated were substantially higher, and after the programme had finished
the gap narrowed somewhat. Nevertheless, using the simple difference in means after the
programme as the estimate of the treatment effect, the estimated effect would be $886. If
we instead construct a (difference-in-differences) estimator that subtracts the difference in
earnings between the two groups before the programme, the estimate would be $847.
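The two estimates quoted above can be reproduced mechanically from four group means. The sketch below uses hypothetical means chosen only so that the arithmetic matches the $886 and $847 figures; they are not taken from LaLonde's tables:

```python
# Hypothetical mean earnings (pre = 1975, post = 1978); the numbers are made
# up, chosen only so the two estimators reproduce the $886 and $847 figures.
pre_treat, pre_control = 3100.0, 3061.0
post_treat, post_control = 6000.0, 5114.0

# Simple post-programme difference in means.
diff_in_means = post_treat - post_control                          # 886.0

# Difference-in-differences: net out the pre-programme gap between groups.
did = (post_treat - post_control) - (pre_treat - pre_control)      # 847.0

print(diff_in_means, did)
```

The difference-in-differences estimator nets out any level difference between the groups that was already present before treatment.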
The second aim of the LaLonde paper is to explore how well the non-experimental methods
that would typically be used actually perform. The idea is that, since we have
experimental data, we know the “right answer”; we can then ignore the control group
in the data and use other constructed control groups (from external data sources) along
with various regression methods to try to recover the treatment effect. LaLonde uses
data from the PSID and the CPS to generate alternative control groups. He then uses a
variety of methods including (i) simple regression using post-treatment data that controls
for differences in observed characteristics, (ii) difference-in-differences methods (with or
without controls for observed differences in characteristics), and (iii) a Heckman (two-step)
selection model. LaLonde concludes that the non-experimental methods perform very
poorly.
The US Re-Employment Bonus Experiments After a decade of increasing unemployment
in the US since the mid-1970s, there was an increased interest in identifying
reforms of the unemployment insurance (UI) system that would get the unemployed
back to work faster, and hence reduce the financial pressure on the UI system. A
number of states implemented randomized experiments in which unemployed workers were
given cash bonuses for finding jobs quickly. (See Meyer, 1995, for a survey.) One example
was the Illinois re-employment bonus experiment. This involved a $500 cash bonus for
finding a job within 11 weeks (and keeping it for at least 4 months). The outcome of the
Illinois re-employment bonus experiment was analyzed by Woodbury and Spiegelman
(1987) and will be considered in more detail later on in the course.
The Negative Income Tax Experiments
• Next time...
3.4 Natural Experiments
Randomized experiments in economics are rare. However, situations can occur in which
some event naturally creates an experiment-like situation, in the sense that
different individuals are exposed to different “treatments” in a way that is effectively
random. Situations like these are generally referred to as “natural experiments” to mark
that the randomization was not planned but rather occurred “naturally”. Insofar as
the allocation of treatment was indeed effectively random, the same analysis as
under planned randomized experiments applies.
A neat example of an analysis based on a natural experiment is Gould, Lavy and
Paserman (2004). The authors consider an event in which a large number of Ethiopian
Jews were, due to political instability, relocated to Israel over a very short period of
time (a few days). The families arriving from Ethiopia were, almost immediately upon
arrival, randomly allocated to destinations in Israel. Hence, some families would be
allocated to urban areas whereas other families would be allocated to rural areas etc. In
particular, the children of the immigrating families were, by chance, randomly allocated
to schools of different qualities. The randomization created by the event allowed the
authors to consider the impact of school quality on the children's achievements.
Unfortunately, natural experiments are also rare in the context of public economics.
3.5 Potential Problems with Randomized Experiments
Even though randomized experiments are commonly thought of as the gold standard
among evaluation methods, it should be noted that even this approach has some potential
problems (some of which also apply to other evaluation methods).
First, randomized experiments are generally very costly to implement. Often the financial
costs are large. Moreover, there are often “ethical costs” that make randomized
experiments infeasible: if some “treatment” is available that we believe might help some
individuals, it can be considered unethical to randomly allocate the treatment across
individuals rather than give it to those who are most in need. Similarly, there are certain
“treatments” that would be completely infeasible to allocate randomly. Consider e.g.
the treatment “growing up in a poor neighborhood”: it would be completely unthinkable
to construct a randomized scheme whereby children would be randomly re-allocated
to new homes and neighborhoods.
Second, there are threats to internal validity. As is obvious from the above analysis,
it is important that the randomization is correctly implemented, so that the treated
and untreated populations are indeed identical in all respects except for the receipt of
treatment. Further, it is important that the initial randomization is adhered to: that all
people who were initially assigned to receive treatment do indeed receive treatment and
do not fail to take it up.6 Another potential problem is that the individuals included in
the experiment may change their behaviour due to the mere fact of being in an experiment
(“Hawthorne effects”). These problems relate to the so-called internal validity of
the evaluation results.
However, there can also be threats to the external validity of the evaluation that
compromise our ability to generalize the results of the experiment to other populations
and settings. One threat to external validity is “experiment specificity”: when the
experimental sample is not representative of the population to which the policy might
be extended, or when the policy applied to the experimental sample is not representative
of the policy that would be implemented on a broader population. A second threat
to external validity is “limited duration”: the duration of an experiment may be too
short to identify the long-run responses that would obtain if the policy were permanently
adopted. Relatedly, adopting the programme as a wide-spread permanent policy could
have “general equilibrium effects” that are large enough that the results from the experiment
cannot be generalized.
4 When We Don’t Have a Randomized Experiment
Most often we have to get by with less “clean” evidence than would be offered by randomized
experiments. Non-randomization in the public policy context tends to occur for two
reasons:
1. The policy change (i.e. the treatment), when introduced, affected some individuals
but not others; however, the two groups differed systematically. Hence we have a
treated group and an untreated group, but the two groups cannot be expected
to be identical.

6 As we will see below, if this is not the case, e.g. due to some degree of self-selection, but the original assignment of treatment is known, the original assignment can be used as an instrumental variable.
2. The individuals themselves partly determine whether they receive treatment. In
other words, there is self-selection into treatment.
As an example of the first, suppose the government introduces a policy that is available
to people under the age of 25. Hence those under the age of 25 can be considered
as “treated” (i.e. exposed to the policy) while those above 25 are “untreated” (i.e.
unaffected by the policy). However, it is quite clear that the two groups are not identical.
Or, similarly, suppose that a policy is introduced in one area but not in another; the two
areas might not have identical compositions.
As an example of the second problem, take e.g. a government training programme. If
people can choose whether or not to participate, then we might suspect that those who
would gain the most from participating are also most likely to actually participate. In
this case it is easy to see why, if we are not careful, we could easily get a wrong estimate
of the effect of the programme.
4.1 A Useful Decomposition
To see how biases can easily occur it is useful to decompose the individual's treatment
effect. Recall that E(y₁) is the average outcome, with treatment, in the population. For
concreteness, think of “treatment” as participating in a particular training programme and
think of the outcome of interest y as earnings. Then y₁ is the individual's earnings after
participating in training.
Let's introduce some short-hand: define

μ₁ ≡ E(y₁)    (16)

to be the average outcome, with treatment, in the population. Then we can decompose the
individual's treated outcome as

y₁ = μ₁ + ν₁.    (17)
This has a simple interpretation: the individual's earnings with training is the average
earnings (with training) in the population, μ₁, plus an individual-specific component ν₁
which, by construction, has zero mean in the population, E(ν₁) = 0. Similarly, we can
define

μ₀ ≡ E(y₀)    (18)

as the average outcome in the population when no one gets treatment. If the specific
individual does not get any training her earnings will be y₀, which we can decompose
into the population average μ₀ and an individual-specific component of earnings without
training, ν₀:

y₀ = μ₀ + ν₀.    (19)
Note that

μ₁ − μ₀ = E(y₁) − E(y₀) = E(y₁ − y₀)    (20)

is nothing but the average treatment effect in the population, ATE. The individual's
treatment effect, y₁ − y₀, on the other hand is, by substitution,

y₁ − y₀ = (μ₁ + ν₁) − (μ₀ + ν₀) = (μ₁ − μ₀) + (ν₁ − ν₀).    (21)

Using the above decompositions we see that an individual's treatment effect can be thought
of as having two components: the average treatment effect ATE plus an individual-specific
component ν₁ − ν₀; note that, since E(ν₁) = E(ν₀) = 0, the individual-specific
component is, on average, zero in the population, E(ν₁ − ν₀) = 0.
Consider then a particular individual: if ν₁ > ν₀ the individual gains more from
participating in training than the average person, as she has a larger individual earnings
component when participating than when not participating.
Suppose now that we take expectations in (21) conditional on receiving treatment,
w = 1; this gives us the average treatment effect on the treated,

ATT = E(y₁ − y₀ | w = 1) = (μ₁ − μ₀) + E(ν₁ − ν₀ | w = 1).    (22)

In other words,

ATT = ATE + E(ν₁ − ν₀ | w = 1).    (23)
Hence, unless the last term is zero, the average treatment effect on the treated
is not the same as the average treatment effect in the population, ATT ≠ ATE. Suppose
e.g. that, due to self-selection, the participating individuals have, on average, a positive
individual-specific treatment effect component, E(ν₁ − ν₀ | w = 1) > 0. It then follows
that the average treatment effect on the treated is larger than the average treatment
effect in the population, ATT > ATE.
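A tiny worked example (with made-up numbers) makes the gap between ATT and ATE concrete: if the individuals with the largest gains self-select into training, the average gain among the treated exceeds the average gain in the population.

```python
# A toy population of four individuals, each a tuple (y0, y1, w), where w = 1
# means the individual self-selects into training. The numbers are made up;
# those with the largest gains choose to enrol.
people = [
    (10.0, 16.0, 1),  # gain 6, enrols
    (10.0, 14.0, 1),  # gain 4, enrols
    (10.0, 12.0, 0),  # gain 2, stays out
    (10.0, 10.0, 0),  # gain 0, stays out
]

gains = [y1 - y0 for y0, y1, w in people]
treated_gains = [y1 - y0 for y0, y1, w in people if w == 1]

ate = sum(gains) / len(gains)                  # (6+4+2+0)/4 = 3.0
att = sum(treated_gains) / len(treated_gains)  # (6+4)/2     = 5.0

print(ate, att)  # ATT > ATE because of self-selection on the gains
```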
It is then also easy to see why the simple difference-in-means estimator will generally
not estimate any treatment effect of interest; in general the difference-in-means estimator
estimates the expected difference in outcome between the treated and the untreated, which
is

E(y₁ | w = 1) − E(y₀ | w = 0) = (μ₁ − μ₀) + E(ν₁ | w = 1) − E(ν₀ | w = 0).    (24)

But this is, in general, neither ATE nor ATT. This is intuitive: suppose that those
who gain the most from participating in training are more likely to actually participate;
conversely, those who benefit the least are less likely to participate. Then it is clear that:

• The average earnings among those who actually participate is not a good estimate
of what the earnings in the whole population would have been if everyone had participated!

• The average earnings among those who do not participate is not a good estimate of
what the earnings in the whole population would have been if no one had participated.

Simply put, neither group is representative of the population. Hence if we were to use
a difference-in-means approach, we would be subtracting something unrepresentative
from something unrepresentative; it is then quite clear that we should not expect
to uncover the average treatment effect.
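A small simulation illustrates the bias. This is an illustrative sketch with made-up parameters: the true ATE is 5, and participation depends only on the individual gain ν₁ − ν₀, so here the difference in means recovers roughly the ATT of about 9 rather than the ATE; with selection on ν₀ as well it would in general recover neither.

```python
import random

random.seed(1)

N = 100_000
# Potential outcomes: baseline y0 and an individual gain nu1 - nu0.
y0 = [random.gauss(100.0, 10.0) for _ in range(N)]
gain = [random.gauss(5.0, 5.0) for _ in range(N)]  # ATE = 5 by construction
y1 = [a + g for a, g in zip(y0, gain)]

# Self-selection: only those with an above-average gain participate.
w = [g > 5.0 for g in gain]

treated = [y1[i] for i in range(N) if w[i]]
untreated = [y0[i] for i in range(N) if not w[i]]

diff_in_means = sum(treated) / len(treated) - sum(untreated) / len(untreated)
att = sum(gain[i] for i in range(N) if w[i]) / len(treated)

print(round(diff_in_means, 2), round(att, 2))
# both are roughly 9: far above the population ATE of 5
```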
4.2 Roadmap to Methodologies for Non-Experimental Data
The available menu of empirical methodologies for non-experimental data depends on two
main factors. The first factor is the timing-structure of the data: is the data a panel (or,
possibly, a repeated cross-section) or a single cross-section? A range of methodologies
can be applied to pure cross-sectional data; these include matching estimators, regression
discontinuity designs, the (Heckman) selection model, and instrumental variables methods.
If, on the other hand, the data are in longitudinal or repeated cross-section format, the
difference-in-differences approach can be applied. Moreover, various methods are, as we
will see, often combined.
The second factor is the richness of the data: are we measuring all factors that are
relevant for the selection into treatment? If we believe that we have good measures of
all factors that affect the allocation of treatment, we say that there is “selection on
observables” (only). If, on the other hand, we believe that some unmeasured individual
characteristics (e.g. intrinsic motivation, self-esteem etc.) affect who receives treatment,
then we have to acknowledge that there may be some “selection on unobservables”.
The main class of models that rely on the assumption of selection on observables is
the class of matching models. In contrast, the main models allowing for selection on
unobservables are the (Heckman) selection model and the IV methods.
5 Selection on Observables and Matching
Suppose then that we do not have a perfectly randomized experiment. Indeed, “natural”
experiments, by not being controlled randomizations, frequently give rise to treatment
and control groups that may be quite different from each other in terms of their observable
characteristics. This raises the question of how to control for these observable
differences. Matching provides a way forward for the case where we only have cross-sectional
information available; however, as we will see, it provides this way forward by
making fairly strong assumptions about our knowledge of the treatment allocation process.
Hence suppose that we have cross-sectional data where some individuals have received
treatment and some others have not, but the two groups are not identical in their
characteristics. A simple difference-in-means estimator is then likely to confound the effects
of the treatment with the effects of the differences between the groups. What can we do
then? A natural way forward is to try to understand how treatment is allocated. After all,
we figured out that it was not completely random. Specifically, suppose that we observe
a number of individual characteristics x. To make things easy, suppose for now that
the observed characteristics can only take on a finite number of possible values; in other
words, suppose that x is discrete and has some distribution in the population, which we
can represent by a probability density function φ(x) (with some support X). Moreover,
suppose that the observed characteristics x capture everything that is relevant to the
allocation of treatment. In other words, suppose we believe that the treatment depends
on the potential outcomes only through x. That is to say that the allocation of treatment
w is independent of (y₀, y₁) once we condition on any particular value of x.
Suppose e.g. that x = {gender, age}. If we compare a 30 year old man with a 25 year
old woman, it may be that the man is more likely to participate. However, the main
point is that all men of the same age are exactly equally likely to participate! No other
information about these men, no matter how hard or how easy to obtain, would help
us predict who in the group receives treatment. If the measured characteristics x truly
capture everything that is relevant to the allocation of treatment, then, if we focus on a
group of individuals with the same characteristics x, who gets treatment and who does
not should be purely random! But that means that, as long as we focus on individuals
with the same characteristics, it is as if we have a randomized experiment. That will
clearly give us a way forward.
The assumption that the allocation of treatment is independent of the potential outcomes
once we condition on a set of observable variables, introduced by Rosenbaum and
Rubin (1983), goes under various names in the literature: it is commonly referred to as the
“ignorability of treatment”, the “unconfoundedness”, or the “selection on observables”
assumption.
Theory: A Formalization
Suppose then that we observe the outcome variable y for a random sample from
the population (where the observed outcome is either y₁ or y₀ depending on whether
or not the individual has received treatment, as described above); we also observe who
receives treatment, as indicated by the variable w; finally, we observe a vector of individual
characteristics x. Our key assumption will be that x contains everything that is relevant
to the allocation of treatment. Hence, once we condition on x, the allocation of treatment
will no longer be correlated with the potential outcomes.
Assumption 2 Conditional Independence Assumption. Conditional on x, w and
(y₀, y₁) are independent: (y₀, y₁) ⊥ w | x.
This means that, if we take two individuals with the same observed characteristics x,
they will be equally likely to receive treatment: specifically, the allocation of treatment
cannot be related to any other factors, including the potential outcomes. The fact that
the allocation of treatment in this sense “ignores” the outcomes has motivated the name
“ignorability assumption”.
Another way of looking at the assumption of conditional independence is to note that it
allows the allocation of treatment w to be correlated with the potential outcomes (y₀, y₁),
but that the correlation disappears once we partial out the observed characteristics x.
One immediate effect of this is that we can unequivocally talk about the conditional
probability of participation given x; to see this, note that

Pr(w = 1 | x) = Pr(w = 1 | x, y₀, y₁),    (25)

since, given that we are already conditioning on x, w is independent of the potential
outcomes. Hence we can write an individual's probability of receiving treatment as a
function of her characteristics. Formally, for all x ∈ X, define

p(x) ≡ Pr(w = 1 | x).    (26)

The function p(x) is, in the treatment evaluation literature, commonly known as the
propensity score function.
For future reference we can also consider a second useful assumption; this assumption
states that, for every possible value of x (in the support of x), there are both treated and
untreated individuals.

Assumption 3 The Overlap Assumption. p(x) ∈ (0, 1) for all x ∈ X.

A simple violation of the overlap assumption occurs e.g. when treatment is given to all
men and no women: in this case p(male) = 1 and p(female) = 0. As we will see, the
overlap assumption is critical for the feasibility of matching techniques.
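The overlap condition can be checked directly in data by computing the empirical propensity score cell by cell. A minimal sketch with a made-up sample in which no untreated men are observed, so that p(male) = 1 and that cell fails overlap:

```python
from collections import defaultdict

# Hypothetical sample of (x, w) pairs in which no untreated men are observed.
sample = [("male", 1)] * 40 + [("female", 1)] * 10 + [("female", 0)] * 50

counts = defaultdict(lambda: [0, 0])  # x -> [number untreated, number treated]
for x, w in sample:
    counts[x][w] += 1

for x, (n0, n1) in sorted(counts.items()):
    p = n1 / (n0 + n1)   # empirical propensity score p(x) = Pr(w = 1 | x)
    ok = 0.0 < p < 1.0   # overlap requires both groups in the cell
    print(x, round(p, 2), "overlap holds" if ok else "OVERLAP FAILS")
```

In the failing cell there is no untreated observation to compare against, so no treatment effect can be estimated there.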
A problem with the assumption that x captures everything that is relevant to the
allocation of treatment is that it is virtually impossible to verify. Hence, for it to be
convincing, we really should have a good set of observed characteristics. Nevertheless,
let's proceed on faith and consider how the average treatment effect can be estimated
when selection into treatment is on observable variables.
5.1 Simple Matching
We noted above that if the observed characteristics x capture everything that is relevant
to the allocation of treatment, then, if we focus on a group of individuals with exactly the
same characteristics, who gets treatment and who does not is effectively random: it is as
if, within that group, there is a randomized experiment. But in that case we know how
to proceed. We can simply compare the average outcome of those who receive treatment
with the average outcome of those who don't.
Thus e.g. if x = {gender, age} we would pick out all women aged 35 (say) and, within
this group, compute the mean outcome among those who receive treatment, ȳ₁, and the
mean outcome among those who don't, ȳ₀. In order to emphasize that we are looking
only at individuals with the specific characteristics x = {female, age = 35}, we can write
these (conditional) sample averages as ȳ₁(x) and ȳ₀(x). We could then take the difference
ȳ₁(x) − ȳ₀(x) as a natural estimate of the average treatment effect among individuals
with these specific characteristics. Recall that we denoted this by ATE(x) above. Thus
we again use the simple difference-in-means estimator, only this time on individuals with
the same characteristics:

ÂTE(x) = ȳ₁(x) − ȳ₀(x).    (27)
What if we wanted to estimate the average treatment effect in the whole population?
We would need to determine what fraction of the population is of each type and take the
weighted average of the group-specific average treatment effects. Hence, if we want to
estimate ATE, what we need to do is:

• Obtain the estimate ÂTE(x) for all possible values of x, using the difference-in-means
estimator within each group x ∈ X.

• For each possible value x ∈ X, estimate the fraction of the population with those
particular characteristics; the obvious estimator is the corresponding fraction in the
sample, which we can denote f(x).

• Take the weighted average across all groups to obtain the estimated ATE,

ÂTE = Σ_{x∈X} f(x) ÂTE(x).    (28)
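The three steps above can be sketched directly in code. This is a toy illustration with a made-up discrete characteristic; the cell means, sample fractions, and weighted average follow equations (27) and (28):

```python
from collections import defaultdict

# Hypothetical data: tuples (x, w, y) with a single discrete characteristic x.
data = [
    ("young", 1, 12.0), ("young", 1, 14.0), ("young", 0, 10.0), ("young", 0, 10.0),
    ("old",   1, 21.0), ("old",   0, 20.0), ("old",   0, 19.0), ("old",   0, 21.0),
]

cells = defaultdict(lambda: {0: [], 1: []})
for x, w, y in data:
    cells[x][w].append(y)

N = len(data)
ate_hat = 0.0
for x, groups in cells.items():
    mean1 = sum(groups[1]) / len(groups[1])      # sample mean of treated, cell x
    mean0 = sum(groups[0]) / len(groups[0])      # sample mean of untreated
    ate_x_hat = mean1 - mean0                    # cell-level difference in means
    f_x = (len(groups[0]) + len(groups[1])) / N  # sample fraction of the cell
    ate_hat += f_x * ate_x_hat                   # weighted average over cells

print(ate_hat)  # (young: 3.0, old: 1.0, equal weights) -> 2.0
```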
Theory
Let's look a bit more in detail at the theory behind this. Recall that we defined ATE(x)
as the average treatment effect among the part of the population that has characteristics
x,

ATE(x) ≡ E(y₁ − y₀ | x).    (29)

The average treatment effect in the population, ATE, is simply the weighted average of
the average treatment effects in the subpopulations,

ATE = Σ_{x∈X} φ(x) ATE(x).    (30)
The key to the analysis is the assumption that, conditional on x, treatment status
w is independent of (y₀, y₁). This means that the expected outcome for those actually
treated is representative of the treated outcome among everyone with the characteristics
x; formally, Assumption 2 implies that

E(y₁ | x, w = 1) = E(y₁ | x).    (31)

Similarly, the expected outcome for those who were not treated is representative of the
untreated outcome among everyone with the characteristics x,

E(y₀ | x, w = 0) = E(y₀ | x).    (32)

Turning from the potential outcomes (y₀, y₁) to the actual outcomes y, we then have
that

E(y | x, w = 1) = E[y₀ + w(y₁ − y₀) | x, w = 1] = E[y₀ + 1·(y₁ − y₀) | x, w = 1]
= E(y₁ | x, w = 1) = E(y₁ | x).    (33)
The first equality follows from replacing the actual outcome y using the switching equation
(7), the second equality follows from plugging in that w = 1, the third equality follows
by direct simplification, while the last equality reiterates equation (31).
By an analogous argument we have that

E[y | x, w = 0] = E[y₀ + w(y₁ − y₀) | x, w = 0] = E[y₀ + 0·(y₁ − y₀) | x, w = 0]
= E[y₀ | x, w = 0] = E[y₀ | x].    (34)

Hence

E[y | x, w = 1] − E[y | x, w = 0] = E[y₁ | x] − E[y₀ | x] = E[y₁ − y₀ | x] = ATE(x).    (35)

Proceeding exactly as in the case of randomized experiments, we use that, due
to random sampling, the corresponding sample means are consistent estimators of the
corresponding population means.
Let ȳ₁(x) denote the sample mean outcome among the treated individuals with characteristics
x and let ȳ₀(x) denote the sample mean outcome among the untreated individuals
with characteristics x. Again, using the weak law of large numbers, the probability limit
of ȳ₁(x) is E[y | x, w = 1] and the probability limit of ȳ₀(x) is E[y | x, w = 0]. Hence
ȳ₁(x) − ȳ₀(x) is a consistent estimator of ATE(x).
Moreover, the observed fraction of individuals in the sample who have the characteristics
x,

f(x) ≡ (1/N) Σ_{i=1}^N I(x_i = x)    (36)

(where I(·) is the indicator function that is one if the statement in the brackets is true and
zero otherwise) is a consistent estimator of the corresponding fraction in the population,
i.e. the probability limit of f(x) is φ(x).7 The matching estimator

ÂTE = Σ_{x∈X} f(x) [ȳ₁(x) − ȳ₀(x)]    (37)

is therefore (using the Slutsky Theorem) a consistent estimator of ATE.

7 Recall that we are assuming that x is discrete.
Potential Problems
Although this sounds straightforward, three complications have to be tackled.

1. The number of possible combinations of characteristics tends to grow very quickly.

2. Some characteristics may be continuous.

3. For some values of x there may be no treated (or no untreated) individuals.
The first problem is known as the curse of dimensionality. To see the problem,
suppose that we initially have x = {gender, age} and that age can take on the values
20, 21, 22, ..., 50. Then there are already 2 × 31 = 62 possible x-vectors. Suppose
that we add another dimension; e.g. suppose we add years of schooling, which can take
on 10 different values (say). Then the number of possible x-vectors quickly increases to
620! Hence we see that simple matching quickly runs into problems as we add more
variables – the number of groups simply tends to grow very, very quickly. Note that this
is a problem since, for the analysis to be convincing, we need to make the case that we
are including in x all variables that are relevant to the allocation of treatment. But as
the possible number of x-vectors grows we also need the number of observations to grow,
so that we can estimate with precision the group-specific means ȳ₀(x) and ȳ₁(x) at every
x.
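The cell counts discussed above are easy to verify, and dividing a hypothetical sample size (the 5,000 below is made up) by the number of cells shows how thinly the data get spread:

```python
# Cell counts from the text: gender (2) x age 20..50 (31 values) = 62 cells;
# adding 10 schooling levels multiplies this to 620 cells.
genders, ages, schooling_levels = 2, 31, 10

cells_two = genders * ages                       # 62
cells_three = genders * ages * schooling_levels  # 620

# With a hypothetical sample of 5,000 observations, the average cell is thin:
obs = 5000
print(cells_three, round(obs / cells_three, 1))  # 620 cells, ~8 obs per cell
```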
The second problem is similar: any continuous variable can, in principle, take on an
infinite number of values, which makes it impossible to estimate ATE(x) at all possible
values of x: for one, we cannot even list all possible values of x and, moreover, we cannot
expect to find individuals with exactly the same characteristics. One way to “solve” this
problem is to “discretize” the continuous variables, stratifying the sample into bins or
cells.8
The third problem points to the importance of the overlap assumption: problem three
obtains when that assumption is violated. For some value of x, either ȳ₁(x) or ȳ₀(x)
cannot be computed – there is simply no one to average over! Hence we cannot estimate
ATE(x) – the average treatment effect at that specific x is simply not identified (and
hence, neither is ATE).

8 Another way to proceed is to accept “inexact” matching; e.g. one can compare each treated individual with the untreated individual with the “most similar” characteristics (where “most similar” is defined using some pre-specified distance measure, e.g. the Euclidean metric). If one further assumes that ATE(x) is continuous in x, a range of flexible non-parametric methods are available.
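The “inexact” matching mentioned in footnote 8 can be sketched as follows: a toy nearest-neighbour match on the Euclidean metric with made-up data, where each treated individual borrows the outcome of the closest untreated individual as a stand-in for the missing y₀.

```python
# "Inexact" nearest-neighbour matching on the Euclidean metric, as sketched in
# footnote 8. The data are made up: each unit is ((age, schooling), earnings).
treated = [((25, 12), 15.0), ((40, 16), 22.0)]
untreated = [((24, 12), 11.0), ((41, 15), 20.0), ((30, 10), 12.0)]

def dist(a, b):
    """Euclidean distance between two characteristic vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

effects = []
for x_t, y_t in treated:
    # The closest untreated unit stands in for the treated unit's missing y0.
    x_c, y_c = min(untreated, key=lambda u: dist(x_t, u[0]))
    effects.append(y_t - y_c)

att_hat = sum(effects) / len(effects)  # matching estimate of the ATT
print(att_hat)  # (15-11) and (22-20) averaged -> 3.0
```

Since only treated units are matched, this version estimates the ATT rather than the ATE.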
An Example: Earnings by Veteran Status
Angrist (1998) considers the effect of voluntary military service on the earnings and
employment status of veterans. A simple comparison of veterans to non-veterans can be
expected to be misleading for two reasons:

• There is likely to be self-selection in applying for the military.

• There is effective screening of applicants by the military.

To tackle these problems, Angrist's comparison by veteran status is restricted to a
sample of applicants to the military, only about half of whom actually enlist. Moreover,
the data contains most of the variables that the military uses to screen applicants (age,
schooling, the Armed Forces Qualification Test (AFQT) score, application year). Angrist
compares three different estimators:

• Difference in means between veterans and non-veterans.

• Matching on the observed variables.

• Regression estimates with controls for the observed variables.

Angrist merges military data (from the Defense Manpower Data Center) with earnings
data (from the Social Security Administration) using social security numbers. Knowledge
of the military selection process suggests that the recorded characteristics matter for
the entry decision. In order to perform matching, Angrist defines cells using the year
of application (1979–1982, 4 categories), AFQT score (5 categories), schooling level at
the time of application (6 categories), year of birth (1954–1965, 12 categories), and race (2
categories), generating a total of 2,880 categories.
• NEED TO COMPLETE THIS EXAMPLE >>>
5.2 Linear Regression
Early on we asked what's wrong with the first-year econometrics solution of simply using
the treatment indicator w as a dummy variable in an OLS regression. We can now have
a look at the answer to this question. Hence, consider the simple formulation

y = α + βw + δx + ε.    (38)

We want to know if, under some circumstances, the coefficient on the treatment
dummy, β, measures something meaningful like the ATE.
Note first that if we do not include the observable covariates x then we are almost
surely in trouble. Running the regression, using OLS, with the single dummy variable
as regressor would give an estimated value of β equal to the difference in sample means
between the treated and the untreated individuals, β̂ = ȳ₁ − ȳ₀ – i.e. it would be the
simple difference-in-means estimator. From our earlier analysis we know that the
difference-in-means estimator is a consistent estimator of

E(y₁ | w = 1) − E(y₀ | w = 0) = ATE + E(ν₁ | w = 1) − E(ν₀ | w = 0),    (39)

which is neither the ATE nor the ATT. Only in the case of pure randomization will this
approach in general consistently estimate ATE (and ATT).
However, maybe by including the observed covariates x we can rid the regression of
the problem that the allocation of treatment is correlated with the error term. Hence,
let's investigate under which conditions estimating this equation by OLS will give us
the answer we are looking for, namely the ATE. Based on the above intuition, we can
now show that applying OLS to (38) will provide a consistent estimate of ATE under three
assumptions.

1. We maintain the assumption that the characteristics in x capture everything that
is relevant to the allocation of treatment, i.e. we maintain the conditional independence
assumption 2.

2. We assume that the average treatment effect does not vary across groups: ATE(x)
is the same for all x.
3. The conditional average outcome in the absence of treatment is linear in x: E[y₀ | x] = α + xδ.
The above assumptions are clearly quite strong; nevertheless, there are plenty of
examples of OLS estimates in the literature. The thing to take away from this is that
OLS can make sense; however, the conditions under which it is perfectly valid are quite
stringent.
Theory*
Let's prove that OLS makes sense under the above assumptions. Let's start by formalizing
the second assumption:

ATE(x) = E[y₁ − y₀ | x] = β for all x ∈ X,    (40)

where β is a constant. Hence ATE(x) = ATE = β for all groups x. Recall then the
decomposition we generated above,

y_k = μ_k + ν_k with μ_k = E[y_k], k = 0, 1.    (41)

We noted that the individual treatment effect can then be written as

y₁ − y₀ = (μ₁ − μ₀) + (ν₁ − ν₀),    (42)

where μ₁ − μ₀ = ATE and where ν₁ − ν₀ is the individual-specific component of the
treatment effect. Taking the average within a given group x then yields

E[y₁ − y₀ | x] = ATE + E[ν₁ − ν₀ | x].    (43)

The left hand side is ATE(x). Hence, since ATE(x) = ATE for all x, it must be that

E[ν₁ − ν₀ | x] = 0 for all x ∈ X.    (44)

This of course simply reflects that, conditional on x, treatment is as if randomized.
Next we turn to the observed outcome y, given by the switching equation (7). Substituting
for y₁ and y₀ using the decomposition we see that

y = w(μ₁ + ν₁) + (1 − w)(μ₀ + ν₀)
= μ₀ + (μ₁ − μ₀)w + ν₀ + w(ν₁ − ν₀).    (45)
Now use that we observe the characteristics x; taking expectations of the observed out-
come conditional on x and w (the treatment indicator),
E [yjx; w] = E [¹0 + À0 + (¹1 ¡ ¹0)w + w (À1 ¡ À0) jx; w] (46)
In order to simplify this we use that ¹0 and (¹1 ¡ ¹0) = ATE are constants and that(trivially) E [wjx; w] = w. Thus
E [yjx; w] = ¹0 +ATE ¢ w + E [À0jx; w] + E [w (À1 ¡ À0) jx; w] (47)
However, when we consider the last term we see that it simplifies:

E[w(ν1 − ν0) | x, w] = w E[ν1 − ν0 | x, w] = w E[ν1 − ν0 | x] = 0     (48)
where the first equality follows from the fact that we are conditioning on w (so we can
treat it as a constant); the second equality follows from the conditional independence
assumption, which states that, once we condition on x, the potential outcomes are
independent of the treatment allocation w; finally, the last equality comes from equation
(44). Hence we have that

E[y | x, w] = μ0 + ATE · w + E[ν0 | x]     (49)
where we also used that, due to selection on observables, E[ν0 | x, w] = E[ν0 | x]. Finally, by the linearity assumption,

E[y0 | x] = E[μ0 + ν0 | x] = μ0 + E[ν0 | x] = α + xδ     (50)
Hence we have that

E[y | x, w] = α + ATE · w + xδ     (51)
Hence, under these specific assumptions, the linear structure holds and the coefficient on
the treatment dummy is the ATE. It can also easily be shown that w and the disturbance
ε are uncorrelated; this follows naturally from the assumption that the allocation of
treatment is related only to x, not to the individual potential outcomes. Hence the ATE
can, in this case, be consistently estimated using OLS.
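The result can be checked in a small simulation: generate potential outcomes satisfying assumptions 1-3, apply the switching equation, and read the ATE off the coefficient on the treatment dummy. All parameter values below are invented for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
alpha, delta, beta = 1.0, 0.5, 2.0        # illustrative: E[y0|x] = alpha + delta*x, ATE = beta

x = rng.integers(0, 4, n).astype(float)   # a single discrete covariate
y0 = alpha + delta * x + rng.normal(0.0, 1.0, n)   # mu0 + nu0
y1 = y0 + beta + rng.normal(0.0, 1.0, n)           # ATE(x) = beta for every x

p = 0.2 + 0.15 * x                        # treatment depends on x only (conditional independence)
w = (rng.random(n) < p).astype(float)
y = w * y1 + (1.0 - w) * y0               # the switching equation

# OLS of y on a constant, the treatment dummy w, and x
X = np.column_stack([np.ones(n), w, x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ate_hat = coef[1]
print(round(ate_hat, 2))                  # close to beta = 2.0
```

Note that the treatment allocation here is deliberately non-random overall (treated units have higher x), yet the coefficient on w still recovers the ATE because w is as good as random conditional on x.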
5.3 Matching on the Propensity Score

One advantage of simple matching was that it allows us to explore how the average
treatment effect varies with the individuals' characteristics; specifically, we could estimate
ATE(x) at the various values of x. The main problem with the matching method is that
it tends to run into the curse of dimensionality: with a rich dataset there are so many
variables available for categorising and grouping that the data gets split into a large
number of cells, each containing few observations. In short, if we exploit the richness of
the data to compare like with like, we are likely to end up with few comparable
observations and correspondingly imprecise estimates.
It turns out that there is another way to exploit the assumption that the observed
variables x capture everything that is relevant to the allocation of treatment. The basic
idea is that it may be possible to summarize x in a "lower dimension"; specifically, it
may be possible to control for the allocation of treatment directly rather than for all
dimensions of x. This is the basic idea underlying propensity score matching.
Above we used the notation p (x) to highlight that the probability of receiving treatment
depends on (and only on) the observed characteristics x. The probability of receiving
treatment p (x) is sometimes also known as the propensity score. The propensity score
function p (x) thus summarizes the process by which treatment is allocated. Note that
p (x) is always a number in the unit interval, so it is a particular way of summarizing x
in a single dimension. The question is: is it of any use? It turns out that it is. Let’s
elaborate a bit on this.
Recall the key assumption that x captures everything that is relevant to the allocation
of treatment. This was the reason why simple matching on x was valid. By focusing on
individuals with the same characteristics x we managed to get a handle on the counter-
factual: within the group of individuals with the same value of x, the average outcome
of those that were not treated should not be systematically di¤erent from the average
outcome that would have obtained for the treated had they in fact not been treated.
However, as the Angrist example showed, the number of categories can grow very
rapidly. Hence it would be useful if the information about the treatment allocation process
could be summarized in a lower dimension. Rosenbaum and Rubin (1983) showed that
rather than matching on the characteristics x one can match on the propensity score
p (x).9 Specifically, they showed that the key conditional independence assumption 2
implies that conditional independence also holds once we condition on the propensity
score (rather than the full vector x). Formally, assumption 2 implies that (y0, y1) ⊥ w | p(x); we prove this below.
To see the usefulness of this, it is worthwhile to go back to the logic underlying simple
matching on characteristics. The logic there was as follows: Take a group of individuals
with the same characteristics x. Since x fully determines the probability of being treated,
those who were in fact treated must be representative, in terms of potential outcomes,
of the whole group of individuals with characteristics x. Similarly, those who were not
treated must be representative, in terms of potential outcomes, of the whole group of
individuals with characteristics x. Hence the untreated group form a valid control group
for the treated group in the sense that the expected outcome in the untreated group
corresponds to the expected outcome in the treated group that would have obtained had
the latter not been treated.
Now do the same thought experiment, only now focus on a group of individuals with the
same propensity score, i.e. the same value of p(x), say p0 (a number between zero and
one). The individuals within this group generally have di¤erent observed characteristics;
indeed, the group consists of all individuals with any characteristics x such that p (x) =
p0. However, they do share one crucial feature: they are equally likely to be selected for
treatment. Hence those within this group who were in fact treated must be representative,
in terms of potential outcomes, of the whole group of individuals with propensity score
p (x) = p0. Similarly, those who were not treated must be representative, in terms of
potential outcomes, of the whole group of individuals with propensity score p (x) = p0.
Hence the untreated group form a valid control group for the treated group in the sense
that the expected outcome in the untreated group corresponds to the expected outcome
in the treated group that would have obtained had the latter not been treated.
Hence matching individuals by their propensity scores should in principle be feasible.
In practice, the propensity score function will not be known but rather must be estimated
e.g. using a Logit or Probit model; after doing so we can proceed by matching individuals
9See also Rosenbaum and Rubin (1984) and Heckman, Ichimura and Todd (1998).
using the estimated propensity score p̂(x).10 Suppose then that, within a random sample, we can determine each individual's probability of receiving treatment. Then
we can match the individuals into groups according to their propensity scores. Thus,
suppose that we have collected all the individuals in the sample who had the specific
probability p0 (a number between 0 and 1) of receiving treatment. Proceeding as in the
case with simple matching, we can then compute the sample average outcome ȳ1 among
those within this group who were actually treated. To emphasize that we are doing this
for the group of individuals with propensity score p0, we can write ȳ1(p0). Since the
people we are averaging over are representative, in terms of their potential outcomes, of
all individuals with propensity score p0, ȳ1(p0) estimates

E[y1 | p(x) = p0]
Similarly, we compute the average outcome among those within the group who were not
treated; the resulting average ȳ0(p0) estimates

E[y0 | p(x) = p0]
Hence the difference ȳ1(p0) − ȳ0(p0) estimates the average treatment effect among
the portion of the population that has the specific propensity score p0. In line with
our previous notation, we can denote this conditional average treatment effect ATE(p0),
defined as

ATE(p0) ≡ E[y1 − y0 | p(x) = p0]
The idea is then straightforward: we can try to proceed as we did with simple matching:

• We estimate ATE(p) at "every value" of p between zero and one.

• We also work out, for each probability p, how large a fraction of the population has this specific probability of receiving treatment.
10 Once we have estimated the propensity score function, we can (in principle) estimate the distribution
of propensity scores, F(p). (Note the interpretation: F(p) is the fraction of the population that has a
probability of being treated less than or equal to p.)
• For the population average treatment effect, ATE, we then take the weighted average across all values of p.
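The steps above can be carried out exactly when the covariate is discrete, since the propensity score then takes only a few values. A minimal sketch with a known propensity score; the heterogeneous effect ATE(x) = 1 + 0.5x and all other numbers are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
x = rng.integers(0, 4, n).astype(float)
p = 0.2 + 0.15 * x                         # known propensity score, four distinct values
w = rng.random(n) < p
y0 = x + rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0 + 0.5 * x                    # ATE(x) = 1 + 0.5x, so ATE = 1 + 0.5*E[x] = 1.75
y = np.where(w, y1, y0)

ate_hat = 0.0
for p0 in np.unique(p):
    grp = (p == p0)
    share = grp.mean()                               # fraction of the population with this score
    ate_p0 = y[grp & w].mean() - y[grp & ~w].mean()  # estimate of ATE(p0)
    ate_hat += share * ate_p0                        # weighted average across scores
print(round(ate_hat, 2))                   # close to the true ATE of 1.75
```

The loop mirrors the three bullet points: estimate ATE(p) at each value of p, work out the population share at that value, and take the weighted average.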
To summarize, in the propensity score matching method we thus model the allocation
of treatment as a function of observable variables and predict the probability of treatment
for both the treated and the untreated groups. The method then proceeds by comparing
the outcomes across treated and untreated individuals within groups of individuals that
have a very similar probability of receiving the treatment.
There are, however, a couple of practical problems with this approach.
1. We need to know each individual's probability of receiving treatment; specifically,
we need to know the propensity score function p(x).

2. The probability p0 is a continuous variable, so there is in principle an infinite number
of probabilities p0 between zero and one.
The first problem is, as noted above, usually handled by estimating, in a first step, the
function p(x) in a straightforward way (typically using a Logit or a Probit model) and
using the predicted probability p̂(x) for each individual. The second problem implies that
we are unlikely to find many treated and untreated individuals with exactly the same
probability of treatment, that is, we are unlikely to find many exact matches on the
propensity score. Moreover, there is in principle an infinite number of possible values of
p; hence we cannot possibly hope to estimate ATE(p) at every possible probability p.
There are two basic methods used to overcome this latter problem. One involves
matching each treated individual within the sample with an untreated individual who is
the "nearest neighbour" (according to some pre-specified criterion) in terms of the propensity
score.11 A second approach involves discretizing the population based on the (predicted)
propensity score.
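The nearest-neighbour idea can be sketched in a few lines: pair each treated unit with the untreated unit whose score is closest, and average the outcome differences (this estimates the average effect on the treated). The data below are simulated and the constant effect of 2.0 is an assumption of the example, not a claim from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(0.0, 1.0, n)
p = 1.0 / (1.0 + np.exp(-(0.8 * x - 0.5)))   # propensity score, here treated as known
w = rng.random(n) < p
y0 = x + rng.normal(0.0, 1.0, n)
y = np.where(w, y0 + 2.0, y0)                # constant treatment effect of 2.0

p_t, y_t = p[w], y[w]                        # treated units
order = np.argsort(p[~w])
p_c, y_c = p[~w][order], y[~w][order]        # untreated units, sorted by score

# for each treated unit, pick the untreated neighbour with the nearest score
pos = np.clip(np.searchsorted(p_c, p_t), 1, len(p_c) - 1)
left_closer = (p_t - p_c[pos - 1]) < (p_c[pos] - p_t)
match = np.where(left_closer, pos - 1, pos)
att_hat = np.mean(y_t - y_c[match])
print(round(att_hat, 1))                     # close to 2.0
```

Sorting the untreated scores lets `np.searchsorted` find each nearest neighbour without an explicit loop; note that a single control unit may serve as the match for several treated units.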
If we adopt the approach of discretizing the estimated propensity score we should also
check for "balance" within each bin, by which we mean that we should check that, once
we focus on individuals with the same (or similar) propensity scores, who within the group
11Stata programs that do propensity score matching are available on the web. In Stata type “net
search propensity score”.
receives treatment should not be related to the observed characteristics x. Hence, if within
a block we find that we do not have balance, this suggests that we are lumping together
individuals who are not close enough in terms of propensity scores. That would mean
that, within the block, those who received treatment appear to be systematically different
from those who did not, implying that the latter are not a valid control group for the former.
What can we do? One simple remedy is to stratify again (within the problematic block)
so as to ensure that we are only comparing individuals with sufficiently similar propensity
scores. To check for balance within a block one can simply compare the means of each
characteristic x between the treated and the untreated individuals.
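Such a comparison of means is a two-sample t-test. A minimal sketch, using a simulated design in which two covariates enter the score symmetrically, so that within a score block the first covariate varies but should be unrelated to treatment (all numbers are illustrative):

```python
import numpy as np

def balance_tstat(z, w):
    """Two-sample t-statistic comparing the mean of covariate z for treated vs untreated."""
    zt, zc = z[w], z[~w]
    se = np.sqrt(zt.var(ddof=1) / len(zt) + zc.var(ddof=1) / len(zc))
    return (zt.mean() - zc.mean()) / se

rng = np.random.default_rng(3)
n = 100_000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
p = 0.2 + 0.2 * (x1 + x2)            # p in {0.2, 0.4, 0.6}; (x1,x2)=(0,1) and (1,0) share p = 0.4
w = rng.random(n) < p

# within the p = 0.4 block, x1 differs across units but should be unrelated to w
block = (x1 + x2 == 1)               # the p = 0.4 stratum
t = balance_tstat(x1[block], w[block])
print(round(t, 2))                   # balance holds, so |t| should be small
```

If such a t-statistic came out significant, the remedy suggested in the text applies: split the block further so that only individuals with sufficiently similar scores are compared.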
Theory*
For now let's suppose that the propensity score p(x) is a known function. Note also that,
since the treatment indicator dummy w is binary,

p(x) = E(w | x)     (52)
We want to show that the conditional independence assumption 2 implies that we
also have conditional independence when we condition, not on the full vector x, but on
the summary function which is the propensity score function. In particular we want to
show the following:
Proposition 1 Suppose that Assumption 2 holds, so that (y0, y1) ⊥ w | x, and suppose also that Assumption 3 holds, so that p(x) ∈ (0, 1) for all x ∈ X. Then (y0, y1) ⊥ w | p(x).
We will show that Pr(w = 1 | y0, y1, p(x)) = Pr(w = 1 | p(x)) = p(x), which implies that w is independent of (y0, y1) conditional on p(x). The proof uses the law of iterated
expectations. First, note that since w is binary, w = 0, 1,

Pr(w = 1 | (y0, y1), p(x)) = E(w | (y0, y1), p(x))     (53)
Then expand the right hand side by using the law of iterated expectations,

Pr(w = 1 | (y0, y1), p(x)) = E[E(w | (y0, y1), p(x), x) | (y0, y1), p(x)]     (54)
Then simplify the right hand side using Assumption 2 (which implies that the conditioners
other than x in the inner expectation are superfluous),

Pr(w = 1 | (y0, y1), p(x)) = E[E(w | x) | (y0, y1), p(x)]     (55)
Then use that E(w | x) is, by definition of the propensity score, equal to p(x). Hence

Pr(w = 1 | (y0, y1), p(x)) = E[p(x) | (y0, y1), p(x)]     (56)
But then, trivially,

Pr(w = 1 | (y0, y1), p(x)) = p(x)     (57)
Moreover, by a similar argument,

Pr(w = 1 | p(x)) = E(w | p(x)) = E[E(w | p(x), x) | p(x)]     (58)
                 = E[E(w | x) | p(x)] = E[p(x) | p(x)] = p(x)     (59)
Hence we have that

Pr(w = 1 | (y0, y1), p(x)) = Pr(w = 1 | p(x))     (60)
since both equal p (x). Thus it follows that w is independent of (y0; y1) given that we
condition on p (x).
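The proposition can also be illustrated numerically: build a design in which two different covariate values share the same propensity score, and check that within that score group (i) the treated share equals the score and (ii) treated and untreated units have the same average y0 (observable here only because the simulation generates both potential outcomes). All numbers are invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
x = rng.integers(0, 4, n)
p = np.array([0.3, 0.5, 0.3, 0.5])[x]   # x = 0 and x = 2 share p = 0.3; x = 1 and x = 3 share p = 0.5
w = rng.random(n) < p
y0 = x + rng.normal(0.0, 1.0, n)        # untreated potential outcome depends on x

group = (p == 0.3)                      # one propensity score stratum, mixing x = 0 and x = 2
treated_share = w[group].mean()         # close to 0.3: Pr(w = 1 | p(x)) = p(x)
gap = y0[group & w].mean() - y0[group & ~w].mean()   # close to 0: y0 independent of w given p(x)
print(round(treated_share, 2), round(gap, 2))
```

Note that within the p = 0.3 stratum the treated and untreated subgroups each mix x = 0 and x = 2 in the same proportions, which is exactly why the y0 gap vanishes.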
Another implication of the conditional independence assumption 2 is that, once we
condition on p(x), x will be independent of w, i.e. x ⊥ w | p(x). This is sometimes known
as the "balancing score property". Intuitively, different combinations of the covariates
x can generate the same value of the propensity score; however, once we condition on
x being in the set {x | p(x) = p0} for any given value of p0, x will not be related in any
further way to the allocation of treatment. The implication of this is that, in a regression
of w on x and p(x), the coefficients on x should be zero (or not significantly different
from zero).
Proposition 1 proves that the conditional independence assumption 2 extends to the
propensity score: once we restrict our attention to individuals with the same value of the
propensity score it is as if the allocation of treatment was random within this group. If
we know the propensity score function then we can proceed as in the case of matching
(ignoring for a second the two complications that p(x) is a continuous variable and that
it is also unknown). For "each value" of the propensity score p(x), compute the sample
average of those treated, denoted ȳ1(p(x)), and of those untreated, denoted ȳ0(p(x)).
The first sample mean consistently estimates E[y | p(x), w = 1] while the latter
consistently estimates E[y | p(x), w = 0]. Estimate the density of p(x) to determine the
fraction of individuals at each possible value of p(x), denoted f(p(x)), which consistently
estimates the corresponding population fraction φ(p(x)). Then take the weighted average
over the possible values of the propensity score to obtain a consistent estimate of the ATE,

ÂTE = Σ_{p(x)} f(p(x)) [ȳ1(p(x)) − ȳ0(p(x))]     (61)
The two problems here are, as noted above, (i) that the propensity score is initially
unknown, and (ii) that the propensity score p(x) is generally a continuous variable.
The first problem can be handled if we can find a way of consistently estimating the
propensity score function p(·); in that case consistency carries over to the case where
we use the estimated propensity scores p̂(x). The second problem is typically handled
by partitioning the range of the propensity score function, i.e. the unit interval, into
subintervals, or by using different versions of "nearest neighbour" matching.
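The first-stage estimation can be sketched with a hand-rolled logit fitted by Newton-Raphson; in practice one would use a canned Logit or Probit routine, and the true coefficients below are invented for the simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(0.0, 1.0, n)
b_true = np.array([-1.0, 0.8])                  # illustrative logit coefficients
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-X @ b_true))
w = (rng.random(n) < p_true).astype(float)

# Newton-Raphson for the logit log-likelihood
b = np.zeros(2)
for _ in range(25):
    p_hat = 1.0 / (1.0 + np.exp(-X @ b))
    grad = X.T @ (w - p_hat)                              # score vector
    info = X.T @ (X * (p_hat * (1.0 - p_hat))[:, None])   # information matrix
    step = np.linalg.solve(info, grad)
    b += step
    if np.abs(step).max() < 1e-10:
        break

p_hat = 1.0 / (1.0 + np.exp(-X @ b))            # estimated propensity scores
print(np.round(b, 2))                            # close to the true coefficients (-1.0, 0.8)
```

The fitted values p̂(x) would then feed into either of the two matching schemes just described: nearest-neighbour pairing or stratification on subintervals of the unit interval.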
Examples
The Evaluation of a Training Program Above we discussed the paper by LaLonde
(1986); that paper used experimental data to explore how well standard non-experimental
estimators succeed in correctly estimating treatment effects. Dehejia and Wahba (1998)
use the same data to explore how well the propensity score matching approach fares.
They conclude that the propensity score matching approach works much better than
the non-experimental approaches considered by LaLonde (1986) and seems to frequently
come close to the experimental results.12
The study by Dehejia and Wahba is a nice example of the practical implementation
of propensity score matching. Recall that LaLonde studied experimental data on the
12See also Heckman and Smith (1995) for a discussion.
National Supported Work (NSW) Demonstration. In addition to using the treatment
and control group from the NSW experimental data, LaLonde also constructed alternative
"control groups" from other external data sources (PSID, CPS) in order to check how
other standard non-experimental approaches would fare. Dehejia and Wahba take the
data from LaLonde to examine whether the propensity score method fares better than the
non-experimental approaches considered by LaLonde. To do this, DW combine the data on
the treated individuals from the NSW experimental data with the artificially constructed
control groups from the PSID and the CPS (which contain the same background
information).
They then proceed in two steps. First, they estimate the propensity score p(x) using
the pre-treatment variables x (observed both for the individuals in the NSW data and
for the artificial control groups); to do this they use a standard logistic probability model.
They then group the observations into strata based on the estimated propensity score
and check for balancing of the pre-treatment variables within each stratum; that is, they
use statistical tests to check whether, within each stratum, the distribution of the pre-
treatment variables x is the same for the treated and the untreated individuals (as it
should be, due to the balancing score property w ⊥ x | p(x)). If there are no significant
differences in the distribution of x between the