
Empirical Methods for Public Policy Analysis
Lecture notes

    Dan Anderberg

    Royal Holloway University of London

    January 2007

    1 Introduction to Treatment Evaluation

In this first part of the course we will go through some econometric techniques that have become increasingly popular in public economics. In particular, I will focus on what has become known as treatment evaluation. Treatment evaluation is concerned with measuring the impact of interventions on outcomes of interest. The approach and the terminology originate from medical research, where an intervention frequently means exposing someone to some form of treatment.

In public economics the techniques can be used to study the effect of e.g. welfare and social insurance programmes on various aspects of behaviour, including labour supply, unemployment duration, family structure etc. We will encounter these methods frequently during the course, which is why I want to start by spending some time providing an overview of them.

• “Treatment” can mean just about anything (being exposed to a more generous welfare system, getting training, smoking, having good neighbours etc.)

    1.1 Examples

    Hormone Replacement Therapy

Consider first an example from medicine. The onset of menopause implies that the body produces less of hormones like estrogen, and it is perceived that menopause increases the risk of osteoporosis, Alzheimer’s disease and heart disease. Given this, a possible treatment is “hormone replacement therapy” (HRT). Indeed, in 2001 some 20 percent of US women in menopause were receiving HRT, at an expenditure of US$2.75 billion. The question, however, is whether HRT was effective.

Evaluating the effectiveness of HRT is complicated by the fact that women who choose to go on HRT after the onset of menopause differ from women who choose not to: they have higher levels of HDL (good cholesterol), lower blood pressure, engage more in physical activity, have lower weight, are more educated etc. Controlling for all such differences is not easily done.

To settle the issue a randomized trial was organized: the Women’s Health Initiative trial, in which 27,000 women aged 50-79 were followed for 9 years. Among these women, HRT was randomly allocated. However, the trial was discontinued when evidence mounted that HRT increases the risk of heart disease and stroke! Hence one conclusion from this was that the initially perceived beneficial effect of HRT on reducing the risk of heart disease and stroke must have been driven by a selection effect.

    The Tennessee STAR Experiment

The Tennessee STAR (Student/Teacher Achievement Ratio) experiment sought to determine the effect of class size on educational outcomes. 79 schools were randomly selected for treatment. The treatment involved the formation of classes, in terms of student numbers and teacher-student ratio, for students in grades K-3 with three possible designs: small (13-17 students), regular (22-25 students), and regular with a teacher aide, with both students and teachers randomly allocated to class types. The outcomes of interest were (i) standardized tests in grades K-8 (short-run outcomes), and (ii) participation and scores in ACT/SAT college admission tests in the final year of high school. See Krueger and Whitmore (2001) for an analysis of the Tennessee STAR experiment.

1.2 Treatment Effect

The policy relevance of treatment evaluation should be immediate: it can help us identify potential improvements in policy. We will focus on the problem of estimating an average treatment effect (ATE). Formally, ATE is the average partial effect of a binary explanatory variable. What does this mean? Partial effect means the effect of the treatment holding other factors constant (as in a standard multiple regression). By binary we mean that an individual either (i) gets the treatment, or (ii) does not get the treatment. We are e.g. not studying treatments that vary in intensity. Thus we can think of treatment as a dummy variable

$$w_i \equiv \begin{cases} 1 & \text{if individual } i \text{ receives “treatment”} \\ 0 & \text{if individual } i \text{ does not receive “treatment”} \end{cases} \qquad (1)$$

A natural approach to estimating the effect of treatment on an outcome y would then be to simply include the treatment dummy $w_i$ in a standard linear regression,

$$y_i = \alpha + \beta w_i + \varepsilon_i, \qquad (2)$$

or, if we expand the approach to allow for other “explanatory variables” $x_i$,

$$y_i = \alpha + \beta w_i + \delta x_i + \varepsilon_i. \qquad (3)$$

So what’s wrong with this approach? The answer is (as we will see): “Sometimes there is nothing wrong with this approach!” However, there are two major concerns that motivate a more general approach. First, we would generally be concerned that the receipt-of-treatment dummy variable $w_i$ might be correlated with the error term, leading to a problem of “endogeneity”. Indeed, many of the methods put forward in the treatment evaluation literature are motivated precisely as ways of tackling the potential endogeneity problem. Second, note e.g. that the above formulation seems to implicitly assume that treatment has the same effect on the outcome for all individuals. This would be a strong assumption: in many cases we would expect people to respond to “treatment” in different ways; indeed, in general we would be concerned that there might be unobserved heterogeneity in how individuals respond to treatment. Hence we would like to develop a more general and rigorous framework in which we can address some of these issues.
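As a concrete illustration of the naive approach in (3), here is a minimal sketch on simulated data; the coefficient values and the randomized assignment are assumptions made purely for the example (this is the benign case where the dummy regression works):

```python
# A minimal, illustrative sketch of the naive regression (3) on simulated
# data; all parameter values here are assumptions made for the example.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)                       # an observed characteristic
w = rng.binomial(1, 0.5, size=n)             # treatment dummy, randomized here
y = 1.0 + 2.0 * w + 0.5 * x + rng.normal(size=n)   # true alpha=1, beta=2, delta=0.5

X = sm.add_constant(np.column_stack([w, x]))
fit = sm.OLS(y, X).fit()
print(fit.params)                            # roughly [1.0, 2.0, 0.5]
```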

    2 Causal Inference and Counterfactuals

    Most of our discussion of treatment evaluation will be carried out in the context of the

    Holland-Rubin causal model (Holland, 1986, Rubin, 1974).


2.1 The Causal Model and Measures of Treatment Effects

Treatment evaluation methods are concerned with identifying the causal effect of treatment on an outcome of interest: What was the effect on individual i of receiving treatment? Note that we are implicitly comparing individual i with herself in the alternative scenario where she does not get treatment. Hence causal inference is based on the notion of counterfactuals. A natural formulation of this is to say that each individual has two potential outcomes: one outcome with treatment and one outcome without treatment. We will denote the outcome of individual i with treatment as $y_{i1}$ and the outcome without treatment as $y_{i0}$. The causal effect (or “treatment effect”) for individual i is then $y_{i1} - y_{i0}$. The problem is that we will only ever observe one of the two potential outcomes: no individual can both receive treatment and not receive treatment. Hence for individual i there will be one observed outcome, which we will denote $y_i$. If the individual does receive treatment, $w_i = 1$, then the observed outcome is the one with treatment, i.e. $y_i = y_{i1}$. On the other hand, if the individual does not receive treatment, $w_i = 0$, then the observed outcome is the one without treatment, $y_i = y_{i0}$. The outcome that remains unobserved is the counterfactual.

It could be that everyone experiences the same treatment effect. In this special case, which we will generally refer to as the “homogeneous treatment effects” case,

$$y_{i1} - y_{i0} = \beta \quad \text{for all } i. \qquad (4)$$

However, in most cases this seems like an implausibly strong assumption. Hence we want to allow for the possibility that individuals react differently to treatment.

Our definition of the treatment effect for individual i implicitly assumes that this effect is independent of who else receives treatment or, equivalently, that treatment of individual i only affects individual i; in the literature this is commonly referred to as the stable unit treatment value assumption (SUTVA) (Neyman, 1923, Rubin, 1980). This assumption rules out e.g. peer effects, cross effects and general equilibrium effects.

• ADD REFS

In the following we will adopt the assumption of random sampling. In particular, we will refer to the potential outcomes model as our general “population model” and assume that an independent, identically distributed (i.i.d.) sample can be drawn from the population.[1] In describing the population model we will dispense with the subscript i for the individual and simply write $y_1$ for the “treated” outcome, $y_0$ for the “untreated” outcome, and $y_1 - y_0$ for the treatment effect.

Since the treatment effect can, in general, be expected to vary in the population, we can think of it as a random variable. (If we were to randomly pick an individual, his/her treatment effect would be a random variable.) We will then have to decide which moments of its distribution we are interested in. A natural measure of the impact of treatment to focus on is the average treatment effect in the population. This answers the following question: If we randomly pick an individual, what is the expected value of his/her treatment effect? We can define this formally:

Definition 1 Average Treatment Effect. The average treatment effect ATE is the expected value of the treatment effect in the population,

$$ATE \equiv E[y_1 - y_0].$$

What if we observe some individual characteristics like age, gender etc.? It could e.g. be that the average treatment effect among men is different from that among women. How do we take this into account? A natural way to do this is to collect an individual’s characteristics in a vector x, so e.g. x could be (male, age = 29, ..., white). How would we then denote the average treatment effect among individuals with the specific characteristics x? To do this we can use the notion of the conditional expected value,

$$ATE(x) \equiv E[y_1 - y_0 \mid x].$$

ATE(x) should thus be interpreted as the average treatment effect among individuals with characteristics x.

A second measure of the impact of treatment that we can estimate is the average effect of treatment on the treated, which we can denote ATT:

[1] There are cases where the i.i.d. assumption is not strictly valid, e.g. where the data are a repeated cross-section (where samples are obtained from the population at different points in time) or panel data (which consist of repeated observations on the same cross-section of individuals). In these cases we will assume random sampling in the cross-section dimension.

Definition 2 Average Treatment Effect on the Treated. The average effect of treatment on the treated ATT is the expected value of the treatment effect among those who would receive treatment,

$$ATT = E[y_1 - y_0 \mid w = 1]. \qquad (5)$$

Indeed, ATT is often more interesting than ATE; consider e.g. the case of a programme where a policy-maker chooses the eligible population. ATT then, by focusing on the programme participants, determines the realized gross return from the programme, which can then be compared to the costs in order to evaluate whether the programme was successful or not (Heckman, LaLonde and Smith, 2000).

Note that ATE and ATT will, in general, not be the same. This could be e.g. due to treatment being allocated to observable subgroups of the population who are expected to benefit more from treatment. Or it could be the result of “self-selection”: if individuals can choose whether or not to participate, individual participation can be expected to depend on the individual treatment effect. Indeed, the only case where ATE and ATT can generally be expected to coincide is when the allocation of treatment is the outcome of a randomized experiment (see below).

Later on we will also encounter a third measure of the effect of treatment; this concept, introduced by Imbens and Angrist (1994), is known as a Local Average Treatment Effect (LATE). LATE can be estimated using instrumental variables under weak conditions. However, LATE has two drawbacks: (i) it measures the effect of treatment on a generally unidentifiable subpopulation, and (ii) the definition itself depends on the particular instrumental variable that one has available. LATE exploits the existence of a variable that only affects an individual’s outcome through the participation decision w. In intuitive terms, LATE then measures the average impact of treatment on the subpopulation whose participation is affected by variation in the instrumental variable.

2.2 The Observed Outcome: The Switching Equation

We noted above that, for any one individual, there are two potential outcomes, $y_1$ and $y_0$. However, we only observe one of these two potential outcomes. In particular,

$$w = 1 \Rightarrow y = y_1 \text{ (the individual is treated)}, \quad w = 0 \Rightarrow y = y_0 \text{ (the individual is not treated)}. \qquad (6)$$

A useful way of writing this is as follows:

$$y = w y_1 + (1 - w) y_0 = y_0 + w (y_1 - y_0). \qquad (7)$$

The last formulation is revealing: it is as if the individual has a “base outcome” $y_0$ to which the treatment effect is added if the individual indeed gets treatment. This formulation will come in handy: we will refer to it as the “switching equation”.

    2.3 Observed and Unobserved Characteristics

Above we introduced the notion of observed characteristics x that may affect both the effect of treatment and the allocation of treatment. However, there may also exist characteristics that we cannot measure. This is a common problem in econometrics. Consider e.g. the use of a traditional Mincerian wage equation for estimating the rate of return to education: individual years of schooling are likely to be correlated with unobserved ability. The problem of unobserved characteristics such as ability, motivation, self-esteem etc. will come up frequently in the treatment evaluation problem.

    3 Randomized Experiments

In randomized experiments treatment is randomly assigned. Randomized experiments in economics are rare; they are nevertheless interesting to consider, for four reasons. First, they are considered to be the “gold standard” of evaluation; hence, as they provide a benchmark for other evaluation methods, it is useful to understand how they work. Second, evidence based on randomized experiments is usually highly influential; hence it is important to be aware of their potential limitations and of what can undermine their validity. Third, “experiment-like situations” can occur by chance, in which case we talk about “natural experiments”. Fourth, many of the methods build on the analytics of randomized experiments.

    An Example: Vitamin C and Cancer

One experiment gave vitamin C to 100 patients believed to be terminally ill with cancer (see Rosenbaum (2002), Ch. 1). A comparison group was constructed by matched sampling: for each treated patient, 10 other patients were randomly chosen from historical records with the same type of cancer and similar characteristics in terms of age, gender etc. It was found that patients receiving vitamin C lived about 4 times longer than controls, a highly significant difference. However, a carefully randomized experiment later conducted at the Mayo Clinic overturned the result: patients were randomly assigned to receive vitamin C or a placebo. With the more careful research design there was no evidence that vitamin C helped prolong survival among cancer patients.

    3.1 Basic Idea and Intuition

Suppose we have a target population and that we want to determine the average effect of some specific “treatment”. How might we go about doing this? Borrowing some ideas from medicine, a good way (at least from a statistical point of view) to go about this would be to randomly split people into two groups: one group that will receive the treatment and one that will not. The two groups are commonly referred to as the “treated” group and the “untreated” (or “control”) group.

The idea behind randomizing who gets treatment is that doing so guarantees that there will be no systematic differences between the treated and the untreated groups: both groups will be representative of the population. Specifically, since the allocation of treatment is completely randomized, who gets treatment cannot be related to the individual effect of treatment or to any other individual characteristics, whether observable or unobservable. Hence it cannot be, e.g., that those who benefit more from the treatment are more likely to get it.

This fact makes it extremely simple to identify the average effect of treatment: we can simply compare the average outcome in the treated group with the average outcome in the untreated group.

• The average outcome among those who did receive treatment should be representative of what the average outcome in the whole population would have been if everyone had received treatment.

• The average outcome among those who did not receive treatment should be representative of what the average outcome in the whole population would have been if no one had received treatment.

Suppose then that the allocation of treatment has been randomized and that we have a random sample consisting of $N_1$ individuals who received treatment and $N_0$ individuals who did not. A natural way to estimate the effect of treatment is then to compare the means in the two groups. Let

$$\bar{y}_1 \equiv \frac{1}{N_1} \sum_{i=1}^{N_1} y_i \quad \text{and} \quad \bar{y}_0 \equiv \frac{1}{N_0} \sum_{i=1}^{N_0} y_i \qquad (8)$$

be the average outcomes in the treated and the untreated group respectively (each sum taken over the respective group), and consider the following difference-in-means estimator of the average treatment effect,

$$\widehat{ATE} = \bar{y}_1 - \bar{y}_0.$$

The question is: What are the properties of this estimator?
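Before turning to the theory, a small simulation sketch (illustrative distributions and effect sizes, not from the notes) shows the estimator at work; the observed outcomes are constructed with the switching equation (7):

```python
# An illustrative simulation (assumed distributions): under randomization
# the difference-in-means estimator recovers the ATE.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y0 = rng.normal(0.0, 1.0, size=n)            # untreated potential outcomes
y1 = y0 + rng.normal(2.0, 1.0, size=n)       # heterogeneous effects, ATE = 2
w = rng.binomial(1, 0.5, size=n)             # randomized assignment
y = y0 + w * (y1 - y0)                       # switching equation (7)

ate_hat = y[w == 1].mean() - y[w == 0].mean()
print(ate_hat)                               # close to 2 in a large sample
```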

    3.2 Theory

Let’s look at some of the theory behind this. The starting point is that we have a random draw from the population; thus $\{w_i, y_{i0}, y_{i1}\}_{i=1}^{N}$ are treated as i.i.d. random variables. For each individual we observe the outcome $y_i = y_{i0} + w_i (y_{i1} - y_{i0})$. The crucial feature of randomization is that it ensures that the allocation of treatment is (statistically) independent of the potential outcomes; stated in terms of the population model, the assumption is:

Assumption 1 The allocation of treatment is independent of the potential outcomes, $(y_0, y_1) \perp w$.

The first thing to note is that, under randomized allocation of treatment, ATE = ATT, since the treated individuals are, by construction, representative of the entire population.[2] Formally, due to randomization,

$$E(y_j \mid w = k) = E(y_j), \quad j, k = 0, 1, \qquad (9)$$

which implies that

$$ATT = E(y_1 - y_0 \mid w = 1) = E(y_1 - y_0) = ATE. \qquad (10)$$

[2] Formally, $E[y_k \mid w = j] = E[y_k]$ for $k, j = 0, 1$, whereby $E[y_1 - y_0 \mid w = 1] = E[y_1 - y_0]$.

Consider now the difference-in-means estimator; to analyze it we can use conditional expectations. Since the allocation of treatment is completely randomized in the population, w (the treatment dummy) must be (statistically) independent of the potential outcomes $y_0$ and $y_1$. In particular, it follows that

$$E[y_1 \mid w = 1] = E[y_1]. \qquad (11)$$

Note what this says: the expected outcome among those actually treated must equal the expected “treated” outcome in the overall population. This comes from the fact that, due to randomization, the treated individuals are representative of the overall population. Similarly, it must be that

$$E[y_0 \mid w = 0] = E[y_0]. \qquad (12)$$

This says that the expected outcome among those not treated equals the expected “untreated” outcome in the overall population.

The above equalities were stated in terms of the potential outcomes $y_0$ and $y_1$. What about the observed outcome y? What is the average value of the observed outcome among those who receive treatment? Formally this is $E[y \mid w = 1]$. To see exactly what this is we can use the switching equation (7) to substitute for y; we then obtain

$$E[y \mid w = 1] = E[y_0 + w (y_1 - y_0) \mid w = 1] = E[y_1 \mid w = 1] = E[y_1]. \qquad (13)$$

The first equality comes from substituting for y using the switching equation (7); the second equality follows from the fact that we are conditioning on w = 1; and the third equality simply reiterates (11). Hence the expected outcome among those treated is the same as the expected treated outcome in the overall population. Similarly, using the switching equation we can get more insight into the average outcome among those who are untreated. By the same reasoning we obtain

$$E[y \mid w = 0] = E[y_0 + w (y_1 - y_0) \mid w = 0] = E[y_0 \mid w = 0] = E[y_0]. \qquad (14)$$

Hence the expected outcome among the untreated individuals is the expected untreated outcome in the overall population.

Putting (13) and (14) together we obtain

$$E[y \mid w = 1] - E[y \mid w = 0] = E[y_1 - y_0] = ATE.$$

But $E[y \mid w = 1]$ is naturally estimated by the mean outcome among the treated, $\bar{y}_1$; similarly, $E[y \mid w = 0]$ is naturally estimated by the mean outcome among the untreated, $\bar{y}_0$. In particular, randomization of the allocation of treatment implies that the difference-in-means estimator is unbiased, consistent and asymptotically normal.

We can now make use of the weak law of large numbers, which implies that the sample mean $\bar{y}_1$ converges (in probability) to $E[y \mid w = 1] = E[y_1]$, while the sample mean $\bar{y}_0$ converges (in probability) to $E[y \mid w = 0] = E[y_0]$.[3] Hence the difference-in-means estimator $\bar{y}_1 - \bar{y}_0$ converges in probability to ATE; in other words, it is a consistent estimator of ATE. We also say that ATE is “identified”.

Hence we have managed to partially overcome the fact that the counterfactual is not observed: randomization ensures that the outcomes in the control group mimic what would have happened in the treated group had they remained untreated. We will not be able to identify the individual treatment effects, since we will never observe any individual in more than one state. However, we will have a chance of identifying the average treatment effect in the population by looking at the averages within the treated group and the control group.

[3] Loosely stated, the weak law of large numbers says that, under the i.i.d. assumption, the sample mean converges (in probability) to the population mean: let $y_i$, $i = 1, 2, \ldots$ be a sequence of independent, identically distributed random variables; then the sample average $N^{-1} \sum_{i=1}^{N} y_i \stackrel{p}{\to} E(y_i)$ as the sample size $N$ grows to infinity.

Using OLS

A point to note is that the randomized experiment setting is one case where using a dummy-variable approach in a linear regression generates the “right answer”: suppose that we were simply to estimate a linear model where the treatment indicator w is the only regressor,

$$y = \alpha + \beta w + \varepsilon. \qquad (15)$$

As is well known, estimating this equation by OLS gives as the estimate of $\alpha$ the mean outcome among the “untreated”, $\bar{y}_0$, and as the estimate of $\beta$ the difference in means between the “treated” and the “untreated”, $\bar{y}_1 - \bar{y}_0$. In other words, the OLS estimator of $\beta$ is precisely the difference-in-means estimator that, under a randomized experiment, provides a consistent estimate of ATE (or, equivalently, ATT).

Suppose we also observe some individual characteristics x. Do we then need to somehow “control” for these? The answer is that, as long as we are interested in the average treatment effect in the overall population, this should not be necessary: since the allocation of treatment is randomized, there should be no systematic differences in observed characteristics between the treated and the untreated groups.

However, observing individual characteristics (other than the treatment indicator w) is nevertheless useful for two reasons. First, they can be used to check the validity of the assumption that the allocation of treatment is purely random: if the allocation of treatment is indeed truly random, then the treatment indicator should not be correlated with any of the observed individual characteristics. Second, observing individual characteristics can help in establishing the statistical significance of the estimated treatment effect. Suppose e.g. that we are estimating the linear model using OLS. If we do not include x in the regression, then we are effectively “leaving them in the error term”. This is not a problem for the consistency of the estimate of $\beta$ (since it does not make the error term correlated with w); however, it tends to make the standard error of the estimate large. If we instead include the characteristics x in the regression, we can reduce the variance of the error term and hence also the standard error of the $\beta$ estimate.
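A hedged sketch of this point (simulated data, assumed coefficients): both regressions estimate roughly the same $\beta$, but including x shrinks its standard error.

```python
# An illustrative sketch (simulated data, assumed coefficients): with
# randomized w, adding x barely moves the beta estimate but shrinks its
# standard error by soaking up residual variance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2_000
x = rng.normal(size=n)
w = rng.binomial(1, 0.5, size=n)             # randomized, so w uncorrelated with x
y = 1.0 + 2.0 * w + 3.0 * x + rng.normal(size=n)

fit_short = sm.OLS(y, sm.add_constant(w)).fit()
fit_long = sm.OLS(y, sm.add_constant(np.column_stack([w, x]))).fit()
print(fit_short.params[1], fit_short.bse[1])   # beta ~ 2, larger SE
print(fit_long.params[1], fit_long.bse[1])     # beta ~ 2, smaller SE
```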


3.3 Examples

The Canadian Self-Sufficiency Project

The Programme The Canadian Self-Sufficiency Project (SSP) is an example of an in-work benefit scheme: it offers an earnings subsidy to long-term welfare recipients. The aim of the programme is to support low-income households by “making work pay”. The SSP has three key features: (i) a substantial financial incentive for work relative to non-work, (ii) a relatively low marginal tax rate on the earnings of those who work, and (iii) a “full-time” work requirement of 30 hrs/week.

• Check the web for an up-to-date description...

Assuming that the 30-hour work requirement is met, the SSP benefit is equal to half the difference between a participant’s gross labour earnings and a target earnings level. (Unearned income does not affect the SSP payment, and the supplement is also not dependent on family size.)

The existence of the SSP fundamentally changes the budget constraint of an individual. The basic welfare programme targeted at low-income households in Canada is known as Income Assistance (IA).[4] The IA, however, by reducing the benefit one-for-one as the individual’s earnings grow, implies an implicit 100 percent marginal tax rate (after a modest earnings disregard).

The impact of the SSP on an individual’s budget constraint can be seen e.g. in the hypothetical example shown in the figure below (Card and Robins, 1996, Fig 1).

[Figure: hypothetical budget constraint under IA and SSP (Card and Robins, 1996, Fig 1)]

Under IA there is first a short upward-sloping segment representing the earnings disregard; the slope is then zero, representing the one-for-one benefit reduction, up until the benefit is fully withdrawn. The SSP introduces a notch (a vertical jump) in the budget constraint at 30 hrs and, moreover, by being withdrawn at only 50 percent as the individual’s earnings grow, generates a positively sloped segment.

[4] In practice each province operates its own IA programme; however, the provincial IA systems share many of the key features that are important for our purposes, e.g. benefits being offset by income from employment and other sources.

The Randomized Experiment Before being adopted as a national policy [check this], an experimental version of the SSP was constructed. The full SSP evaluation entails a five-year follow-up of some 6,000 families. Card and Robins (1996) provide an early evaluation of some 2,000 families followed over the first 18-24 months of the experiment.

Eligibility in the experiment was limited to single parents who had been on IA for at least 12 of the previous 13 months. People assigned to the programme were given up to 12 months to obtain a full-time job and initiate a first SSP payment. Those who initiated an SSP payment would be eligible for SSP supplements for the next three years (whenever satisfying the 30-hour requirement at at least the minimum wage). Those who did not initiate an SSP payment within the initial 12-month period lost any further entitlement.

From economic theory, the main impact of the SSP is that it would induce some people who otherwise would have remained on IA and worked less than 30 hours per week to move from welfare to full-time employment (see Card and Robins, 1996, for a full discussion).

The Evaluation Card and Robins (1996) explored the impact of the SSP on a sample of single parents, over 18 years of age, who had received IA payments for at least 12 of the past 13 months and who resided in British Columbia or New Brunswick. The randomization in the experimental phase of the SSP was carried out as follows. Sample members were informed that they had been selected to participate in a research project involving the possibility of a wage supplement. They were asked to sign a consent form (with roughly 90 percent of the selected individuals agreeing to participate), after which the sample members were randomly allocated to a treatment group (1,066 individuals) and a control group (1,056 individuals). Individual outcomes in terms of labour force participation, hours of work, earnings etc. were recorded from the start of the programme and also retrospectively for one year prior to the experiment. Card and Robins use information from the first 18-24 months of the experiment.

After verifying that the treatment and control groups have the same observable characteristics (as they should, given that the randomization was properly done), Card and Robins present their basic results. Since for each group the outcomes are recorded on a monthly basis, it is possible to trace the treatment effect by month after the start of the experiment. Moreover, since the outcomes were also recorded for the year prior to the programme, it is possible to check that there are no differences in the outcome variables between the two groups prior to the experiment.

The results can be conveniently represented graphically by plotting the time series of the average monthly outcomes for the treatment and the control group, along with the implied estimate of the monthly impact (i.e. the simple difference in means). The next figure shows the estimated impact on monthly earnings (Card and Robins, Fig 5).

[Figure: estimated impact of the SSP on monthly earnings (Card and Robins, Fig 5)]

The figure shows that there were indeed no differences in the outcome variable between the two groups in the pre-programme period; however, soon after the introduction of the programme the earnings of the individuals in the treatment group exceeded the earnings of those in the control group, with the difference – i.e. the treatment effect – being statistically significant from month 5 onwards.[5] The impact on the rate of full-time employment follows a similar pattern; see the next figure (Card and Robins, Fig 7).

[Figure: estimated impact of the SSP on the full-time employment rate (Card and Robins, Fig 7)]

The estimated impact of the SSP on the full-time employment rate peaks at 14 percentage points after 14 months before dropping back slightly to about 10 percentage points after 17 months. Relatedly, Card and Robins report on the impact of the SSP on the probability of being off IA, showing that the SSP markedly increases the probability of being a non-IA recipient.

Hence the early programme evaluation indicated that a significant number of single parents responded to the financial incentives provided by the programme; indeed, little over a year after programme enrollment the full-time employment rate among those initial welfare recipients who were offered the SSP programme was nearly twice that of those in the control group. However, Card and Robins also note that there seemed to be a tendency for the recipients to take jobs with relatively low pay.

[5] The jump that occurs at the introduction of the programme is argued to stem from the fact that pre- and post-treatment outcomes were obtained from different surveys.

Other Examples of Randomized Experiments in Public Economics

The Evaluation of Training: The LaLonde Study In a highly influential study, LaLonde (1986) used an experimental dataset to compare experimentally and non-experimentally determined results, and to compare different types of non-experimental estimation methodologies. The study is based on a programme called the National Supported Work Demonstration (NSWD). This programme was operated during the mid-1970s in 10 sites across the US and was designed to help disadvantaged workers, in particular women in receipt of cash welfare benefits (AFDC), ex-drug-addicts, ex-criminal offenders and high-school drop-outs. In the programme, qualified applicants were randomly assigned to treatment, which comprised a guaranteed job for 9 to 18 months. A total of 6,616 individuals were included in the study. Data were collected on both pre-programme and post-programme earnings, as well as a number of pre-programme variables such as age, education, ethnicity etc. Eligible candidates were randomized into the programme. Due to the randomization, the effect of the programme could be evaluated without bias using simple differences in means.

Given that the allocation of treatment was random, it should be the case that the individuals who received treatment and those (eligible) who did not have the same observed characteristics. Indeed, the following figure, which reproduces LaLonde’s Table 1, shows that there were no differences between the treatment group and the control group in terms of observable characteristics.

[Figure: reproduction of LaLonde’s Table 1 – observable characteristics of treatments and controls]

The following figure shows the earnings evolution for treatments and controls from a pre-programme year (1975), through the treatment period (1976-77), until the post-programme period (1978) (an excerpt from LaLonde’s Table 3).

[Figure: earnings evolution for treatments and controls, 1975-1978 (excerpt from LaLonde’s Table 3)]

The results indicate that the earnings of the treated and the untreated individuals were indeed very similar before the programme started. During the programme the earnings of the treated were substantially higher, and after the programme had finished the gap narrowed somewhat. Nevertheless, using the simple difference in means after the programme as the estimate of the treatment effect, the estimated effect would be $886. If we construct a (difference-in-difference) estimator that subtracts the difference in earnings between the two groups before the programme, the estimate would be $847.

The second aim of the LaLonde paper is to explore how well the non-experimental methods that would typically be used perform. The idea is to exploit the fact that, since we have experimental data, we have the “right answer”; we can then ignore the control group in the data and use other constructed control groups (from external data sources) along with various regression methods to try to recover the treatment effect. LaLonde uses data from the PSID and the CPS to generate alternative control groups. He then uses a variety of methods including (i) simple regression using post-treatment data that controls for differences in observed characteristics, (ii) difference-in-difference methods (with or without controls for observed differences in characteristics), and (iii) a Heckman (two-step) selection model. LaLonde concludes that the non-experimental methods perform very poorly.

The US Re-Employment Bonus Experiments After a decade of increasing unemployment in the US since the mid-70s, there was an increased interest in identifying reforms of the unemployment insurance (UI) system in order to get the unemployed back to work faster, and hence to reduce the financial pressure on the UI system. A number of states implemented randomized experiments where unemployed workers were given cash bonuses for finding jobs quickly (see Meyer, 1995, for a survey). One example was the Illinois re-employment bonus experiment. This involved a $500 cash bonus for finding a job within 11 weeks (and keeping it for at least 4 months). The outcome of the Illinois re-employment bonus experiment was analyzed by Woodbury and Spiegelman (1987) and will be considered in more detail later on in the course.

    The Negative Income Tax Experiments

• Next time...

3.4 Natural Experiments

Randomized experiments in economics are rare. However, situations can occur where some event naturally creates an experiment-like situation, in the sense that different individuals are exposed to different “treatments” in a way that is effectively random. Situations like these are generally referred to as “natural experiments”, to mark that it was not a planned randomization but rather one that occurred “naturally”. Insofar as the allocation of treatment was indeed effectively random, the same analysis as for planned randomized experiments applies.

A neat example of an analysis based on a natural experiment is Gould, Lavy and Paserman (2004). The authors consider an event in which a large number of Ethiopian Jews were, due to political instability, relocated to Israel over a very short period of time (a few days). The families arriving from Ethiopia were, almost immediately upon arrival, randomly allocated to destinations in Israel. Hence some families were allocated to urban areas whereas other families were allocated to rural areas etc. In particular, the children of the immigrating families were, by chance, randomly allocated to schools of different qualities. Using the randomization created by the event allowed the authors to consider the impact of school quality on the children’s achievements.

    Unfortunately, natural experiments are also rare in the context of public economics.

    3.5 Potential Problems with Randomized Experiments

Even though randomized experiments are commonly thought of as the gold standard among evaluation methods, it should be noted that even this approach has some potential problems (some of which also apply to other evaluation methods).

First, randomized experiments are generally very costly to implement. Often the financial costs are large. Moreover, there are often “ethical costs” that make randomized experiments unfeasible: if some “treatment” is available that we believe might help some individuals, it can be considered unethical to randomly allocate the treatment across individuals rather than give it to those who are most in need. Similarly, there are certain “treatments” that would be completely unfeasible to allocate randomly. Consider e.g. the treatment “growing up in a poor neighborhood”: it would be completely unthinkable to construct a randomized scheme whereby children would be randomly re-allocated to new homes and neighborhoods.

Second, there are threats to internal validity. As is obvious from the above analysis, it is important that the randomization is correctly implemented, so that the treated and untreated populations are indeed identical in all respects except for the receipt of treatment. Further, it is important that the initial randomization is adhered to: that all people who were initially assigned to receive treatment indeed do receive treatment and do not fail to take it up.[6] Another potential problem is that the individuals included in the experiment may change their behaviour due to the mere fact of being in an experiment (“Hawthorne effects”). These problems relate to the so-called internal validity of the evaluation results.

However, there can also be threats to the external validity of the evaluation that compromise the ability to generalize the results of the experiment to other populations and settings. One threat to external validity is “experiment specificity”: when the experimental sample is not representative of the population to which the policy might be extended, or when the policy applied to the experimental sample is not representative of the policy that would be implemented on a broader population. A second threat to external validity is “limited duration”: the duration of an experiment may be too short to identify the long-run responses that would obtain if the policy were permanently adopted. Relatedly, adopting the programme as a widespread permanent policy could have “general equilibrium effects” that are large enough that results from the experiment cannot be generalized.

    4 When We Don’t Have a Randomized Experiment

Most often we have to get by with less “clean” evidence than would be offered by randomized experiments. Non-randomization in the public policy context tends to occur for two reasons:

1. The policy change (i.e. the treatment), when introduced, affected some individuals but not others; however, the two groups differed systematically. Hence we have a treated group and an untreated group, but the two groups cannot be expected to be identical.

2. The individuals themselves partly determine whether they receive treatment. In other words, there is self-selection into treatment.

[6] As we will see below, if this is not the case, e.g. due to some degree of self-selection, but the original assignment of treatment is known, the original assignment can be used as an instrumental variable.

As an example of the first, suppose the government introduces a policy that is available to people under the age of 25. Then those under the age of 25 can be considered “treated” (i.e. exposed to the policy) while those above 25 are “untreated” (i.e. unaffected by the policy). However, it is quite clear that the two groups are not identical. Or, similarly, suppose that a policy is introduced in one area but not in another; the two areas might not have identical compositions.

As an example of the second problem, take e.g. a government training programme. If people can choose whether or not to participate, then we might suspect that those who would gain the most from participating are also most likely to actually participate. In this case it is easy to see why, if we are not careful, we could easily get a wrong estimate of the effect of the programme.

    4.1 A Useful Decomposition

To see how biases can easily occur, it is useful to decompose the individual’s treatment effect. Recall that $E(y_1)$ is the average outcome, with treatment, in the population. For concreteness, think of “treatment” as participating in a particular training programme and think of the outcome of interest y as earnings. Then $y_1$ is the individual’s earnings after participating in training.

Let’s introduce some shorthand: define

$$\mu_1 \equiv E(y_1) \qquad (16)$$

to be the average outcome, with treatment, in the population. Then we can decompose the individual’s treated outcome as

$$y_1 = \mu_1 + \nu_1. \qquad (17)$$

This has a simple interpretation: the individual’s earnings with training are the average earnings (with training) in the population, $\mu_1$, plus an individual-specific component $\nu_1$ which, by construction, has zero mean in the population, $E(\nu_1) = 0$. Similarly, we can define

$$\mu_0 \equiv E(y_0) \qquad (18)$$

as the average outcome in the population when no one gets treatment. If the specific individual does not get any training, her earnings will be $y_0$, which we can decompose into the population average $\mu_0$ and an individual-specific component of earnings without training, $\nu_0$,

$$y_0 = \mu_0 + \nu_0. \qquad (19)$$

Note that

$$\mu_1 - \mu_0 = E(y_1) - E(y_0) = E(y_1 - y_0) \qquad (20)$$

is nothing but the average treatment effect in the population, ATE. The individual’s treatment effect, $y_1 - y_0$, on the other hand, is, by substitution,

$$y_1 - y_0 = (\mu_1 + \nu_1) - (\mu_0 + \nu_0) = (\mu_1 - \mu_0) + (\nu_1 - \nu_0). \qquad (21)$$

Using the above decompositions we see that an individual’s treatment effect can be thought of as having two components: the average treatment effect ATE plus an individual-specific component $\nu_1 - \nu_0$; note that, since $E(\nu_1) = E(\nu_0) = 0$, the individual-specific component is, on average, zero in the population, $E(\nu_1 - \nu_0) = 0$.

Consider then a particular individual: if $\nu_1 > \nu_0$ the individual gains more from participating in training than the average person, as she has a larger individual earnings component when participating than when not participating.

Suppose now that we take expectations in (21) conditional on receiving treatment, w = 1; this will give us the average treatment effect on the treated,

$$ATT = E(y_1 - y_0 \mid w = 1) = (\mu_1 - \mu_0) + E(\nu_1 - \nu_0 \mid w = 1). \qquad (22)$$

In other words,

$$ATT = ATE + E(\nu_1 - \nu_0 \mid w = 1). \qquad (23)$$

Hence, unless the last term is zero, the average treatment effect on the treated is not the same as the average treatment effect in the population, $ATT \neq ATE$. Suppose e.g. that, due to self-selection, the participating individuals have, on average, a positive individual-specific treatment effect component, $E(\nu_1 - \nu_0 \mid w = 1) > 0$. It then follows that the average treatment effect on the treated is larger than the average treatment effect in the population, ATT > ATE.

It is then also easy to see why the simple difference-in-means estimator will generally not estimate any treatment effect of interest; in general the difference-in-means estimator will estimate the expected difference in outcomes between the treated and the untreated, which is

$$E(y_1 \mid w = 1) - E(y_0 \mid w = 0) = (\mu_1 - \mu_0) + E(\nu_1 \mid w = 1) - E(\nu_0 \mid w = 0). \qquad (24)$$

But this is, in general, neither ATE nor ATT. This is intuitive: suppose that those who gain the most from participating in training are more likely to actually participate; conversely, those who benefit the least are less likely to participate. Then it is clear that:

• The average earnings among those who actually participate are not a good estimate of what the earnings in the whole population would have been if everyone had participated!

• The average earnings among those who do not participate are not a good estimate of what the earnings in the whole population would have been if no one had participated.

Simply put, neither group is representative of the population. Hence if we were to use a difference-in-means approach, we’d be subtracting something that is unrepresentative from something that is unrepresentative; it is then quite clear that we should not expect to uncover the average treatment effect.
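A small simulation sketch can make the decomposition concrete (all values here are illustrative assumptions): when participation is driven by the individual-specific components, the difference-in-means estimator lands away from both ATE and ATT.

```python
# An illustrative simulation (assumed selection rule): participation depends
# on the individual-specific components v1 and v0, so the difference-in-means
# estimator in (24) recovers neither ATE nor ATT.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
v0 = rng.normal(size=n)
v1 = rng.normal(size=n)
y0 = 10.0 + v0                                # mu0 = 10
y1 = 12.0 + v1                                # mu1 = 12, so ATE = 2
w = (v1 - 2.0 * v0 + rng.normal(size=n) > 0).astype(int)  # self-selection
y = y0 + w * (y1 - y0)                        # switching equation

print((y1 - y0).mean())                       # ATE, about 2
print((y1 - y0)[w == 1].mean())               # ATT, noticeably above 2
print(y[w == 1].mean() - y[w == 0].mean())    # diff-in-means: neither of the above
```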

    4.2 Roadmap to Methodologies for Non-Experimental Data

The available menu of empirical methodologies for non-experimental data depends on two main factors. The first factor is the timing structure of the data: Are the data a panel (or, possibly, a repeated cross-section) or a single cross-section? A range of methodologies can be applied to pure cross-sectional data; these include matching estimators, regression discontinuity designs, the (Heckman) selection model, and instrumental variables (IV) methods. If, on the other hand, the data are in longitudinal or repeated cross-section format, the difference-in-difference approach can be applied. Moreover, various methods are, as we will see, often combined.

The second factor is the richness of the data: Are we measuring all factors that are relevant for selection into treatment? If we believe that we have good measures of all factors that affect the allocation of treatment, we say that there is “selection on observables” (only). If, on the other hand, we believe that some unmeasured individual characteristics (e.g. intrinsic motivation, self-esteem etc.) affect who receives treatment, then we have to acknowledge that there may be some “selection on unobservables”. The main class of models that rely on the assumption of selection on observables is the class of matching models. In contrast, the main models allowing for selection on unobservables are the (Heckman) selection model and the IV methods.

    5 Selection on Observables and Matching

Suppose then that we do not have a perfectly randomized experiment. Indeed, “natural” experiments, not being controlled randomizations, frequently give rise to treatment and control groups that may be quite different from each other in terms of their observable characteristics. This begs the question of how to control for these observable differences. Matching provides a way forward for the case where we only have cross-sectional information available; however, as we will see, it does so by making fairly strong assumptions about our knowledge of the treatment allocation process.

Hence suppose that we have cross-sectional data where some individuals have received treatment and others have not, but the two groups are not identical in their characteristics. A simple difference-in-means estimator is then likely to confound the effects of the treatment with the effects of the differences between the groups. What can we do then? A natural way forward is to try to understand how treatment is allocated. After all, we figured out that it was not completely random. Specifically, suppose that we observe a number of individual characteristics x. To make things easy, suppose for now that the observed characteristics can only take on a finite number of possible values; in other words, suppose that x is discrete and has some distribution in the population which we can represent by a probability density function $\phi(x)$ (with some support X). Moreover, suppose that the observed characteristics x capture everything that is relevant to the allocation of treatment. In other words, suppose we believe that treatment depends on the potential outcomes only through x. That is to say that the allocation of treatment w is independent of $(y_0, y_1)$ once we condition on any particular value of x.

Suppose e.g. that x = {gender, age}. If we compare a 30-year-old man with a 25-year-old woman, it may e.g. be that the man is more likely to participate. However, the main point is that all men of the same age are exactly equally likely to participate! No other information about these men, no matter how hard or how easy to obtain, would help us predict who in the group receives treatment. If the measured characteristics x truly capture everything that is relevant to the allocation of treatment, then if we focus on a group of individuals with the same characteristics x, who gets treatment and who does not should be purely random! But that means that, as long as we focus on individuals with the same characteristics, it is as if we have a randomized experiment. That will clearly give us a way forward.

The assumption that the allocation of treatment is independent of the potential outcomes once we condition on a set of observable variables, introduced by Rosenbaum and Rubin (1983), goes under various names in the literature: it is commonly referred to as the “ignorability of treatment”, “unconfoundedness”, or “selection on observables” assumption.

    Theory: A Formalization

Suppose then that we observe the outcome variable y for a random sample from the population (where the observed outcome is either $y_1$ or $y_0$ depending on whether or not the individual has received treatment, as described above); we also observe who receives treatment, as indicated by the variable w; finally, we observe a vector of individual characteristics x. Our key assumption will be that x contains everything that is relevant to the allocation of treatment. Hence, once we condition on x, the allocation of treatment will no longer be correlated with the potential outcomes.

Assumption 2 Conditional Independence Assumption. Conditional on x, w and $(y_0, y_1)$ are independent: $(y_0, y_1) \perp w \mid x$.

This means that, if we take two individuals with the same observed characteristics x, they will be equally likely to receive treatment: specifically, the allocation of treatment cannot be related to any other factors, including the potential outcomes. The fact that the allocation of treatment in this sense “ignores” the outcomes has motivated the name “ignorability assumption”.

Another way of looking at the assumption of conditional independence is to note that it allows the allocation of treatment w to be correlated with the potential outcomes $(y_0, y_1)$, but the correlation disappears once we partial out the observed characteristics x. One immediate implication is that we can unequivocally talk about the conditional probability of participation given x; to see this, note that

$$\Pr(w = 1 \mid x) = \Pr(w = 1 \mid x, y_0, y_1), \qquad (25)$$

since, given that we are already conditioning on x, w is independent of the potential outcomes. Hence we can write an individual’s probability of receiving treatment as a function of her characteristics. Formally, for all $x \in X$, define

$$p(x) \equiv \Pr(w = 1 \mid x). \qquad (26)$$

The function $p(x)$ is commonly known in the treatment evaluation literature as the propensity score function.

For future reference we can also state a second useful assumption; it says that, for every possible value of x (in the support of x), there are both treated and untreated individuals.

Assumption 3 The Overlap Assumption. $p(x) \in (0, 1)$ for all $x \in X$.

A simple violation of the overlap assumption occurs e.g. when treatment is given to all men and to no women: in this case p(male) = 1 and p(female) = 0. As we will see, the overlap assumption is critical for the feasibility of matching techniques.
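As an illustration (not from the notes), one might estimate the propensity score by a logistic regression and inspect overlap by comparing the estimated scores across the two groups; the sketch below uses simulated data and an assumed selection rule.

```python
# An illustrative sketch (simulated data, assumed selection-on-observables
# rule): estimate p(x) by logistic regression and compare the ranges of the
# estimated scores across treated and untreated units as a crude overlap check.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 10_000
x = rng.normal(size=(n, 2))                        # observed characteristics
true_p = 1.0 / (1.0 + np.exp(-(0.5 * x[:, 0] - x[:, 1])))
w = rng.binomial(1, true_p)                        # treatment depends on x only

model = LogisticRegression().fit(x, w)
p_hat = model.predict_proba(x)[:, 1]               # estimated propensity scores

print(p_hat[w == 1].min(), p_hat[w == 1].max())    # both groups should span
print(p_hat[w == 0].min(), p_hat[w == 0].max())    # similar ranges of p(x)
```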


A problem with the assumption that x captures everything that is relevant to the allocation of treatment is that it is virtually impossible to verify. Hence for this to be convincing, we really should have a good set of observed characteristics. Nevertheless, let’s proceed on faith and consider how the average treatment effect can be estimated when selection into treatment is on observable variables.

    5.1 Simple Matching

    We noted above that if the observed characteristics x capture everything that is relevant

to the allocation of treatment, then if we focus on a group of individuals with exactly the

same characteristics, who gets treatment and who does not is effectively random: it is as

    if, within that group, there is a randomized experiment. But in that case we know how

    to proceed. We can simply compare the average outcome of those who receive treatment

with the average outcome of those that don’t.

Thus, e.g., if $x = \{\text{gender}, \text{age}\}$ we would pick out all women aged 35 (say) and,
within this group, we would compute the mean outcome among those who receive treatment,
$\bar{y}_1$, and the mean outcome among those that don’t, $\bar{y}_0$. In order to emphasize that we
are looking only at individuals with the specific characteristics $x = \{\text{female}, \text{age} = 35\}$
we can write these (conditional) sample averages as $\bar{y}_1(x)$ and $\bar{y}_0(x)$. We could then
take the difference $\bar{y}_1(x) - \bar{y}_0(x)$ as a natural estimate of the average treatment effect
among individuals with these specific characteristics. Recall that we denoted this by
$ATE(x)$ above. Thus we again use the simple difference-in-means estimator, only this
time on individuals with the same characteristics:

$$\widehat{ATE}(x) = \bar{y}_1(x) - \bar{y}_0(x) \qquad (27)$$

What if we wanted to estimate the average treatment effect in the whole population?
We would need to determine what fraction of the population is of each type and take the
weighted average of the group-specific average treatment effects. Hence, if we want to

    estimate ATE, then what we need to do is to:

• Obtain the estimate $\widehat{ATE}(x)$ for all possible values of $x$ using the difference-in-means
estimator within each group $x \in X$.

• For each possible value $x \in X$, estimate the fraction of the population with those
particular characteristics; the obvious estimator is the corresponding fraction in the
sample, which we can denote $f(x)$.

• Take the weighted average across all groups to obtain the estimated ATE, as in the
code sketch after this list:

$$\widehat{ATE} = \sum_{x \in X} f(x)\, \widehat{ATE}(x) \qquad (28)$$
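
As a minimal sketch of this procedure — assuming discrete covariates and a pandas DataFrame df with an outcome column y, a 0/1 treatment column w, and a list of covariate columns x_cols, all illustrative names — the two steps can be coded as:

```python
# Cell-by-cell matching estimator of the ATE, as in equations (27)-(28).
import pandas as pd

def matching_ate(df: pd.DataFrame, x_cols: list) -> float:
    n = len(df)
    ate_hat = 0.0
    for _, cell in df.groupby(x_cols):
        treated = cell.loc[cell["w"] == 1, "y"]
        control = cell.loc[cell["w"] == 0, "y"]
        if treated.empty or control.empty:
            # Overlap fails here: ATE(x) is not identified in this cell.
            raise ValueError("cell with no treated or no untreated units")
        ate_x = treated.mean() - control.mean()   # ATE-hat(x), eq. (27)
        ate_hat += (len(cell) / n) * ate_x        # weight by f(x), eq. (28)
    return ate_hat
```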

    Theory

Let’s look a bit more in detail at the theory behind this. Recall that we defined $ATE(x)$

as the average treatment effect among the part of the population that has characteristics

    x,

$$ATE(x) \equiv E(y_1 - y_0 \mid x) \qquad (29)$$

The average treatment effect in the population, ATE, is simply the weighted average of
the average treatment effects in the subpopulations,

$$ATE = \sum_{x \in X} \phi(x)\, ATE(x) \qquad (30)$$

    The key to the analysis is the assumption that, conditional on x, treatment status

w is independent of $(y_0, y_1)$. This means that the expected outcome for those actually

    treated is representative of the treated outcome among everyone with the characteristics

    x; formally, Assumption 2 implies that

$$E(y_1 \mid x, w = 1) = E(y_1 \mid x) \qquad (31)$$

Similarly, the expected outcome for those who were not treated is representative of the un-

    treated outcome among everyone with the characteristics x,

$$E(y_0 \mid x, w = 0) = E(y_0 \mid x) \qquad (32)$$

Turning from the potential outcomes $(y_0, y_1)$ to the actual outcomes y, we then have

    that

$$E(y \mid x, w = 1) = E[y_0 + w(y_1 - y_0) \mid x, w = 1] = E[y_0 + 1 \cdot (y_1 - y_0) \mid x, w = 1] \qquad (33)$$
$$= E(y_1 \mid x, w = 1) = E(y_1 \mid x).$$


The first equality follows from replacing the actual outcome y using the switching equation

    (7), the second equality follows from plugging in that w = 1, the third equality follows

by direct simplification, while the last equality reiterates equation (31).

    By an analogous argument we have that

$$E[y \mid x, w = 0] = E[y_0 + w(y_1 - y_0) \mid x, w = 0] = E[y_0 + 0 \cdot (y_1 - y_0) \mid x, w = 0] \qquad (34)$$
$$= E[y_0 \mid x, w = 0] = E[y_0 \mid x].$$

    Hence

$$E[y \mid x, w = 1] - E[y \mid x, w = 0] = E[y_1 \mid x] - E[y_0 \mid x] = E[y_1 - y_0 \mid x] = ATE(x) \qquad (35)$$

    Proceeding exactly as in the case of randomized experiments we would use that, due

    to random sampling, the corresponding sample means are consistent estimators of the

population counterparts.

Let $\bar{y}_1(x)$ denote the sample mean outcome among the treated individuals with
characteristics x and let $\bar{y}_0(x)$ denote the sample mean outcome among the untreated
individuals with characteristics x. Again, using the weak law of large numbers we have
that the probability limit of $\bar{y}_1(x)$ is $E[y \mid x, w = 1]$ and the probability limit of $\bar{y}_0(x)$ is
$E[y \mid x, w = 0]$. Hence $\bar{y}_1(x) - \bar{y}_0(x)$ is a consistent estimator of $ATE(x)$.

    Moreover, the observed fraction of individuals in the sample who have the character-

    istics x,

$$f(x) \equiv \frac{1}{N} \sum_{i=1}^{N} I(x_i = x) \qquad (36)$$

(where $I(\cdot)$ is the indicator function that is one if the statement in the brackets is true and
zero otherwise) is a consistent estimator of the corresponding fraction in the population,
i.e. the probability limit of $f(x)$ is $\phi(x)$.7 The matching estimator

$$\widehat{ATE} = \sum_{x \in X} f(x) \left[ \bar{y}_1(x) - \bar{y}_0(x) \right] \qquad (37)$$

    is therefore (using the Slutsky Theorem) a consistent estimator of ATE.

    7Recall that we are assuming that x is discrete.


Potential Problems

    Although this sounds straightforward, three complications have to be tackled.

1. The number of possible combinations of characteristics tends to grow very quickly.

2. Some characteristics may be continuous.

3. For some values of x there may be no treated (or no untreated) individuals.

The first problem is known as the curse of dimensionality; to see the problem
suppose that we initially have $x = \{\text{gender}, \text{age}\}$ and that age can take on the values
20, 21, 22, ..., 50. Then there are already $2 \times 31 = 62$ possible x-vectors. Suppose
that we add another dimension; e.g. suppose we add years of schooling, which can take
on 10 different values (say). Then the number of possible x-vectors quickly increases to
620! Hence we see that simple matching quickly runs into problems as we add more
variables – the number of groups simply tends to grow very, very quickly. Note that this
is a problem since, for the analysis to be convincing, we need to make the case that we
are including in x all variables that are relevant to the allocation of treatment. But as
the possible number of x-vectors grows we also need the number of observations to grow
so that we can estimate with precision the group-specific means $\bar{y}_0(x)$ and $\bar{y}_1(x)$ at every
x.

The second problem is similar: any continuous variable can, in principle, take on an
infinite number of values, which makes it impossible to estimate $ATE(x)$ at all possible
values of x: for one, we cannot even list all possible values of x, and moreover, we cannot
expect to find individuals with exactly the same characteristics. One way to “solve” this
problem is to “discretize” the continuous variables, stratifying the sample into bins or
cells.8

    The third problem points to the importance of the overlap assumption: problem three

obtains when that assumption is violated. For some value of x, either $\bar{y}_1(x)$ or $\bar{y}_0(x)$

8Another way to proceed is to accept “inexact” matching; e.g. one can compare each treated individual
with the untreated individual with the “most similar” characteristics (where “most similar” is
defined using some pre-specified distance measure, e.g. the Euclidean metric). If one further assumes that
$ATE(x)$ is continuous in x, a range of flexible non-parametric methods is available.


cannot be computed – there is simply no one to average over! Hence we cannot estimate
$ATE(x)$ – the average treatment effect at that specific x is simply not identified (and

    hence, neither is ATE).

    An Example: Earnings by Veteran Status

Angrist (1998) considers the effect of voluntary military service on the earnings and

employment status of veterans. A simple comparison of veterans to non-veterans can be

    expected to be misleading for two reasons:

• There is likely to be self-selection in applying for the military.

• There is effective screening of applicants by the military.

    To tackle these problems Angrist’s comparison by veteran status is restricted to a

    sample of applicants to the military, only about half of whom actually enlist. Moreover,

    the data contains most of the variables that the military uses to screen applicants (age,

schooling, the Armed Forces Qualification Test (AFQT) score, application year). Angrist

compares three different estimators:

• Difference in means between veterans and nonveterans.

• Matching on the observed variables.

• Regression estimates with controls for the observed variables.

    Angrist merges military data (from the Defense Manpower Data Center) with earnings

    data (from the Social Security Administration) using social security number. Knowledge

    of the military selection process suggests that the recorded characteristics matter for

the entry decision. In order to perform matching, Angrist defines cells using the year

    of application (1979-1982, 4 categories), AFQT score (5 categories), schooling level at

    the time of application (6 categories), year of birth (1954-1965, 12 categories), race (2

categories), generating a total of 2,880 categories.

• NEED TO COMPLETE THIS EXAMPLE>>>


5.2 Linear Regression

Early on we asked what’s wrong with the first-year econometrics solution of simply using

    the treatment indicator w as a dummy variable in an OLS regression. We can now have

    a look at the answer to this question. Hence consider the simple formulation,

$$y = \alpha + \beta w + \delta x + \varepsilon \qquad (38)$$

We want to know if, under some circumstances, the coefficient on the treatment
dummy $\beta$ measures something meaningful like the ATE.

Note first that if we do not include the observable covariates x then we are almost
surely in trouble. Running the regression, using OLS, with the single dummy variable
as regressor would then give an estimated value of $\beta$ that is equal to the difference
in sample means between the treated and the untreated individuals, $\hat{\beta} = \bar{y}_1 - \bar{y}_0$ – i.e.
it would be the simple difference-in-means estimator. From our earlier analysis we know
that the difference-in-means estimator is a consistent estimator of

$$E(y_1 \mid w = 1) - E(y_0 \mid w = 0) = ATE + E(\nu_1 \mid w = 1) - E(\nu_0 \mid w = 0) \qquad (39)$$

    which is neither the ATE nor the ATT . Only in the case of pure randomization will this

    approach in general consistently estimate ATE (and ATT ).

    However, maybe by including the observed covariates x we can rid the regression of

    the problem that the allocation of treatment is correlated with the error term. Hence,

    let’s investigate under which conditions estimating this equation by OLS will give us

    the answer we are looking for, namely the ATE. Based on the above intuition, we can

now show that applying OLS to (38) will provide a consistent estimate of ATE under three

    assumptions.

    1. We maintain the assumption that the characteristics in x capture everything that

    is relevant to the allocation of treatment, i.e. we maintain the conditional indepen-

    dence assumption 2.

2. We assume that the average treatment effect does not vary across groups: $ATE(x)$

    is the same for all x.


3. The conditional average outcome in the absence of treatment is linear in x: $E[y_0 \mid x] = \alpha + x\delta$.

    The above assumptions are clearly quite strong; nevertheless, there are plenty of

    examples of OLS estimates in the literature. The thing to take away from this is that

    OLS can make sense; however, the conditions under which it is perfectly valid are quite

    stringent.
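
As a minimal sketch of the regression approach, assuming the statsmodels package and an illustrative DataFrame df holding y, w, and two made-up covariates ("age", "schooling" are assumptions, not from the notes), equation (38) would be run as:

```python
# A sketch of estimating equation (38) by OLS (illustrative names).
import statsmodels.api as sm

X = sm.add_constant(df[["w", "age", "schooling"]])  # alpha + beta*w + delta*x
fit = sm.OLS(df["y"], X).fit()
ate_hat = fit.params["w"]  # under assumptions 1-3, beta-hat estimates the ATE
```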

    Theory*

    Let’s prove that OLS makes sense under the above assumptions. Let’s start by formalizing

    the second assumption:

$$ATE(x) = E[y_1 - y_0 \mid x] = \beta \quad \text{for all } x \in X \qquad (40)$$

where $\beta$ is a constant. Hence $ATE(x) = ATE = \beta$ for all groups x. Recall then the

    decomposition we generated above,

$$y_k = \mu_k + \nu_k \quad \text{with} \quad \mu_k = E[y_k], \qquad k = 0, 1 \qquad (41)$$

We noted that the individual treatment effect can then be written as

$$y_1 - y_0 = (\mu_1 - \mu_0) + (\nu_1 - \nu_0) \qquad (42)$$

where $\mu_1 - \mu_0 = ATE$ and where $\nu_1 - \nu_0$ is the individual-specific component of the
treatment effect. Taking the average within a given group x then yields

$$ATE(x) = E[y_1 - y_0 \mid x] = ATE + E[\nu_1 - \nu_0 \mid x] \qquad (43)$$

    The left hand side is ATE (x). Hence since ATE (x) = ATE for all x it must be that

$$E[\nu_1 - \nu_0 \mid x] = 0 \quad \text{for all } x \in X \qquad (44)$$

This of course simply reflects that, conditional on x, treatment is as if randomized.

Next we turn to the observed outcome y, given by the switching equation (7). Substituting
for $y_1$ and $y_0$ using the decomposition we see that

$$y = w(\mu_1 + \nu_1) + (1 - w)(\mu_0 + \nu_0) \qquad (45)$$
$$= \mu_0 + (\mu_1 - \mu_0)w + \nu_0 + w(\nu_1 - \nu_0)$$


Now use that we observe the characteristics x; taking expectations of the observed out-

    come conditional on x and w (the treatment indicator),

$$E[y \mid x, w] = E[\mu_0 + \nu_0 + (\mu_1 - \mu_0)w + w(\nu_1 - \nu_0) \mid x, w] \qquad (46)$$

In order to simplify this we use that $\mu_0$ and $(\mu_1 - \mu_0) = ATE$ are constants and that
(trivially) $E[w \mid x, w] = w$. Thus

$$E[y \mid x, w] = \mu_0 + ATE \cdot w + E[\nu_0 \mid x, w] + E[w(\nu_1 - \nu_0) \mid x, w] \qquad (47)$$

However, when we consider the last term we see that this simplifies,

$$E[w(\nu_1 - \nu_0) \mid x, w] = w\,E[\nu_1 - \nu_0 \mid x, w] = w\,E[\nu_1 - \nu_0 \mid x] = 0 \qquad (48)$$

where the first equality follows from the fact that we are conditioning on w (so we can
treat it as a constant), the second equality follows from the conditional independence
assumption which states that, once we condition on x, the potential outcomes are
independent of the treatment allocation w; finally, the last equality comes from equation

    (44). Hence we have that

$$E[y \mid x, w] = \mu_0 + ATE \cdot w + E[\nu_0 \mid x] \qquad (49)$$

where we also used that, due to selection on observables, $E[\nu_0 \mid x, w] = E[\nu_0 \mid x]$. Finally,
by the linearity assumption

    E [y0jx] = E [¹0 + À0jx] = ¹0 + E [À0jx] = ®+ x± (50)

Hence we have that

$$E[y \mid x, w] = \alpha + ATE \cdot w + \delta x \qquad (51)$$

Hence, under the specific assumptions, the linear structure holds and the coefficient on
the treatment dummy is the ATE. It can also easily be shown that w and the disturbance
$\varepsilon$ are uncorrelated; this follows naturally from the assumption that the allocation of

    treatment is only related to x, not to the individual potential outcomes. Hence the ATE

    can, in this case, be consistently estimated using OLS.
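
A small simulation can illustrate the result. The data generating process below is purely illustrative: it is built so that all three assumptions hold with a constant treatment effect of 2, so the OLS coefficient on w should land close to 2.

```python
# Simulation sketch: OLS recovers the ATE when assumptions 1-3 hold.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-x))             # treatment depends on x only (CIA)
w = rng.binomial(1, p)
y0 = 1.0 + 0.5 * x + rng.normal(size=n)  # E[y0|x] linear in x (assumption 3)
y = y0 + 2.0 * w                         # constant ATE = 2 (assumption 2)

fit = sm.OLS(y, sm.add_constant(np.column_stack([w, x]))).fit()
print(fit.params[1])                     # beta-hat: approximately 2
```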


5.3 Matching on the Propensity Score

One advantage of simple matching is that it allows us to explore how the average
treatment effect varies with the individuals’ characteristics – specifically, we could estimate
$ATE(x)$ at the various values of x. The main problem with the matching method is that
it tends to run into the problem of the curse of dimensionality: if we have a rich dataset
there will be many variables available for categorising and grouping, so that the data get
grouped into a large number of cells each containing few observations. In short, if we

    exploit the richness of the data to compare like with like we are likely to end up having

    few comparable observations and so correspondingly imprecise estimates.

    It turns out that there is another way to exploit the assumption that the observed

    variables x capture everything that is relevant to the allocation of treatment. The basic

idea is that it may be possible to summarize x in a “lower dimension”; specifically, it

    may be possible to control for the allocation of treatment directly rather than for all

dimensions of x. This is the basic idea underlying propensity score matching.

Above we used the notation $p(x)$ to highlight that the probability of receiving treatment

    depends on (and only on) the observed characteristics x. The probability of receiving

    treatment p (x) is sometimes also known as the propensity score. The propensity score

    function p (x) thus summarizes the process by which treatment is allocated. Note that

    p (x) is always a number in the unit interval, so it is a particular way of summarizing x

    in a single dimension. The question is: is it of any use? It turns out that it is. Let’s

    elaborate a bit on this.

    Recall the key assumption that x captures everything that is relevant to the allocation

    of treatment. This was the reason why simple matching on x was valid. By focusing on

    individuals with the same characteristics x we managed to get a handle on the counter-

    factual: within the group of individuals with the same value of x, the average outcome

of those that were not treated should not be systematically different from the average

    outcome that would have obtained for the treated had they in fact not been treated.

    However, as the Angrist example showed, the number of categories can grow very

    rapidly. Hence it would be useful if the information about the treatment allocation process

    could be summarized in a lower dimension. Rosenbaum and Rubin (1983) showed that

    rather than matching on the characteristics x one can match on the propensity score


$p(x)$.9 Specifically, they showed that the key conditional independence assumption 2
implies that conditional independence also holds once we condition on the propensity
score (rather than the full vector x). Formally, assumption 2 implies that $(y_0, y_1) \perp w \mid p(x)$
(we prove this below).

    To see the usefulness of this, it is worthwhile to go back to the logic underlying simple

    matching on characteristics. The logic there was as follows: Take a group of individuals

    with the same characteristics x. Since x fully determines the probability of being treated,

    those who were in fact treated must be representative, in terms of potential outcomes,

    of the whole group of individuals with characteristics x. Similarly, those who were not

    treated must be representative, in terms of potential outcomes, of the whole group of

    individuals with characteristics x. Hence the untreated group form a valid control group

    for the treated group in the sense that the expected outcome in the untreated group

    corresponds to the expected outcome in the treated group that would have obtained had

    the latter not been treated.

Now do the same thought experiment, only now focus on a group of individuals with the
same propensity score, i.e. the same value of $p(x)$, say $p^0$ (a number between zero and
one). The individuals within this group generally have different observed characteristics;

indeed, the group consists of all individuals with any characteristics x such that $p(x) = p^0$.
However, they do share one crucial feature: they are equally likely to be selected for

    treatment. Hence those within this group who were in fact treated must be representative,

    in terms of potential outcomes, of the whole group of individuals with propensity score

$p(x) = p^0$. Similarly, those who were not treated must be representative, in terms of
potential outcomes, of the whole group of individuals with propensity score $p(x) = p^0$.

    Hence the untreated group form a valid control group for the treated group in the sense

    that the expected outcome in the untreated group corresponds to the expected outcome

    in the treated group that would have obtained had the latter not been treated.

    Hence matching individuals by their propensity scores should in principle be feasible.

    In practice, the propensity score function will not be known but rather must be estimated

    e.g. using a Logit or Probit model; after doing so we can proceed by matching individuals

    9See also Rosenbaum and Rubin (1984) and Heckman, Ichimura and Todd (1998).


using the estimated propensity score $\hat{p}(x)$.10 Suppose then that, within a random sample,
we can determine each individual’s probability of receiving treatment. Then

    we can match the individuals into groups according to their propensity scores. Thus,

suppose that we have collected all the individuals in the sample who had the specific
probability $p^0$ (a number between 0 and 1) of receiving treatment. Proceeding as in the
case with simple matching, we can then compute the sample average outcome $\bar{y}_1$ among
those within this group who were actually treated. To emphasize that we are doing this
for the group of individuals with propensity score $p^0$, we can write $\bar{y}_1(p^0)$. Since the
people we are averaging over are representative, in terms of their potential outcomes, of
all individuals with propensity score $p^0$, $\bar{y}_1(p^0)$ estimates

$$E[y_1 \mid p(x) = p^0]$$

Similarly, we compute the average outcome among those within the group who were not
treated; the resulting average $\bar{y}_0(p^0)$ estimates

$$E[y_0 \mid p(x) = p^0]$$

Hence, the difference $\bar{y}_1(p^0) - \bar{y}_0(p^0)$ estimates the average treatment effect among
the portion of the population that has the specific propensity score $p^0$. In line with
our previous notation, we can denote this conditional average treatment effect $ATE(p^0)$,
defined as

$$ATE(p^0) \equiv E[y_1 - y_0 \mid p(x) = p^0]$$

    The idea is then straightforward: we can try to proceed as we did with simple match-

    ing:

• We estimate $ATE(p)$ at “every value” of p between zero and one.

• We also work out, for each probability p, how large is the fraction of the population
that has this specific probability of receiving treatment.

    10Once we have estimated the propensity score function, we can (in principle) estimate the distribution

    of propensity scores, F (p). (Note the interpretation: F (p) is the fraction of the population that have a

    probability of being treated that is less than or equal to p).


• For the population average treatment effect, ATE, we then take the weighted average
across all values of p.

    To summarize, in the propensity score matching method we thus model the allocation

    of treatment as a function of observable variables and predict the probability of treatment

    for both the treated and the untreated groups. The method then proceeds by comparing

    the outcomes across treated and untreated individuals within groups of individuals that

    have a very similar probability of receiving the treatment.

    There are, however, a couple of practical problems with this approach.

1. We need to know each individual’s probability of receiving treatment – specifically,
we need to know the propensity score function $p(x)$.

2. The probability $p^0$ is a continuous variable, so there is in principle an infinite number
of probabilities $p^0$ between zero and one.

The first problem is, as noted above, usually handled by estimating, in a first step, the
function $p(x)$ in a straightforward way (typically using a Logit or a Probit model) and
using the predicted probability $\hat{p}(x)$ for each individual. The second problem implies that
we are unlikely to find very many treated and untreated individuals with exactly the same
probability of treatment, that is, we are unlikely to find very many exact matches on the
propensity score. Moreover, there is in principle an infinite number of possible values of
p; hence we cannot possibly hope to estimate $ATE(p)$ at every possible probability p.
There are two basic methods used to overcome this latter problem. One involves
matching each treated individual within the sample with an untreated individual who is
the “nearest neighbour” (according to some pre-specified criterion) in terms of propensity
score.11 A second approach involves discretizing the population based on the (predicted)
propensity score.
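
A minimal sketch of the nearest-neighbour variant, assuming numpy and scikit-learn are available: a logistic regression stands in for the Logit first step, and matching each treated unit to the untreated unit with the closest estimated score yields an estimate of the ATT (the effect on the treated) rather than the ATE. All names are illustrative.

```python
# Nearest-neighbour matching on an estimated propensity score (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

def psm_att(X: np.ndarray, w: np.ndarray, y: np.ndarray) -> float:
    # Step 1: estimate p(x) with a logistic model; p-hat for every unit.
    p_hat = LogisticRegression(max_iter=1000).fit(X, w).predict_proba(X)[:, 1]
    treated = np.flatnonzero(w == 1)
    control = np.flatnonzero(w == 0)
    # Step 2: match each treated unit to the untreated unit whose
    # estimated score is closest, and average the outcome differences.
    diffs = [
        y[i] - y[control[np.argmin(np.abs(p_hat[control] - p_hat[i]))]]
        for i in treated
    ]
    return float(np.mean(diffs))
```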

    If we adopt the approach of discretizing the estimated propensity score we should also

check for “balance” within each bin, by which we mean that we should check that, once
we focus on individuals with the same (or similar) propensity scores, who within the group

    11Stata programs that do propensity score matching are available on the web. In Stata type “net

    search propensity score”.


receives treatment should not be related to the observed characteristics x. Hence, if within
a block we find that we do not have balance, this suggests that we are lumping together
individuals who are not close enough in terms of propensity scores. That would mean
that, within the block, those who received treatment seem to be systematically different
from those that didn’t, implying that the latter are not valid as a control group for the
former. What can we do? One simple remedy is to stratify again (within the problematic
block) so as to ensure that we are only comparing individuals with sufficiently similar
propensity scores. To check for balance within a block one can simply compare the means
of each characteristic x between the treated and the untreated individuals, as in the sketch
below.
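
A minimal sketch of such a balance check, assuming pandas and scipy and a DataFrame df that already carries the estimated score in a column p_hat; within each score stratum a two-sample t-test compares covariate means across treatment status. Column names and the number of bins are illustrative assumptions.

```python
# Within-stratum balance check on the estimated propensity score (sketch).
import pandas as pd
from scipy import stats

df["p_bin"] = pd.qcut(df["p_hat"], q=5, labels=False)  # 5 score strata
for b, block in df.groupby("p_bin"):
    for col in ["age", "schooling"]:                   # illustrative covariates
        t = block.loc[block["w"] == 1, col]
        c = block.loc[block["w"] == 0, col]
        _, pval = stats.ttest_ind(t, c, equal_var=False)
        if pval < 0.05:
            # Imbalance: split this block further and re-test.
            print(f"bin {b}: {col} unbalanced (p = {pval:.3f})")
```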

    Theory*

    For now let’s suppose that the propensity score p (x) is a known function. Note also that,

    since the treatment indicator dummy w is binary,

$$p(x) = E(w \mid x) \qquad (52)$$

    We want to show that the conditional independence assumption 2 implies that we

    also have conditional independence when we condition, not on the full vector x, but on

    the summary function which is the propensity score function. In particular we want to

    show the following:

Proposition 1. Suppose that Assumption 2 holds so that $(y_0, y_1) \perp w \mid x$, and suppose
also that Assumption 3 holds so that $p(x) \in (0, 1)$ for all $x \in X$. Then $(y_0, y_1) \perp w \mid p(x)$.

We will show that $\Pr(w = 1 \mid y_0, y_1, p(x)) = \Pr(w = 1 \mid p(x)) = p(x)$, which implies
that w is independent of $(y_0, y_1)$ conditional on $p(x)$. The proof uses the law of iterated

expectations. First, note that since w is binary, $w \in \{0, 1\}$,

$$\Pr(w = 1 \mid (y_0, y_1), p(x)) = E(w \mid (y_0, y_1), p(x)) \qquad (53)$$

    Then expand the right hand side by using the law of iterated expectations,

$$\Pr(w = 1 \mid (y_0, y_1), p(x)) = E[E(w \mid (y_0, y_1), p(x), x) \mid (y_0, y_1), p(x)] \qquad (54)$$


Then simplify the right hand side using Assumption 2 (which implies that the conditioners
other than x in the inner expectation are superfluous),

$$\Pr(w = 1 \mid (y_0, y_1), p(x)) = E[E(w \mid x) \mid (y_0, y_1), p(x)] \qquad (55)$$

Then use that $E(w \mid x)$ is, by definition of the propensity score, equal to $p(x)$. Hence

$$\Pr(w = 1 \mid (y_0, y_1), p(x)) = E[p(x) \mid (y_0, y_1), p(x)] \qquad (56)$$

    But, then trivially,

$$\Pr(w = 1 \mid (y_0, y_1), p(x)) = p(x) \qquad (57)$$

    Moreover, by a similar argument,

$$\Pr(w = 1 \mid p(x)) = E(w \mid p(x)) = E[E(w \mid p(x), x) \mid p(x)] \qquad (58)$$
$$= E[E(w \mid x) \mid p(x)] = E[p(x) \mid p(x)] = p(x) \qquad (59)$$

    Hence we have that

$$\Pr(w = 1 \mid (y_0, y_1), p(x)) = \Pr(w = 1 \mid p(x)) \qquad (60)$$

since both equal $p(x)$. Thus it follows that w is independent of $(y_0, y_1)$ given that we
condition on $p(x)$.

    Another implication of the conditional independence assumption 2 is that, once we

condition on $p(x)$, x will be independent of w, i.e. $x \perp w \mid p(x)$. This is sometimes known
as the “balancing score property”. Intuitively, different combinations of the covariates
x can generate the same value of the propensity score; however, once we condition on
x being in the set $\{x \mid p(x) = p^0\}$ for any given value of $p^0$, x will not be related in any
further way to the allocation of treatment. The implication of this is that, in a regression
of w on x and $p(x)$, the coefficients on x should be zero (or not significantly different
from zero).

    Proposition 1 proves that the conditional independence assumption 2 extends to the

    propensity score: once we restrict our attention to individuals with the same value of the

    propensity score it is as if the allocation of treatment was random within this group. If

    we know the propensity score function then we can proceed as in the case of matching


(ignoring for a second the two complications that $p(x)$ is a continuous variable and that

it is also unknown). For “each value” of the propensity score function $p(x)$ compute
the sample average of those treated, denoted $\bar{y}_1(p(x))$, and those untreated, denoted
$\bar{y}_0(p(x))$. The first sample mean consistently estimates $E[y \mid p(x), w = 1]$ while the latter
sample mean consistently estimates $E[y \mid p(x), w = 0]$. Estimate the density of $p(x)$ to
determine the fraction of individuals belonging to each possible value of $p(x)$, denoted
$f(p(x))$, which consistently estimates the corresponding population fraction $\phi(p(x))$.
Then take the weighted average over the possible values of the propensity score to obtain
a consistent estimate of the ATE,

$$\widehat{ATE} = \sum_{p(x)} f(p(x)) \left( \bar{y}_1(p(x)) - \bar{y}_0(p(x)) \right) \qquad (61)$$

    The two problems here are, as noted above, (i) that the propensity score is initially

    unknown, and (ii) that the propensity score p (x) is generally a continuous variable.

The first problem can be handled if we can find a way of consistently estimating the
propensity score function $p(\cdot)$. In that case consistency carries over to the case where
we use the estimated propensity scores $\hat{p}(x)$. The second problem is typically handled
by partitioning the range of the propensity score function – i.e. the unit interval – into
subintervals or by using different versions of “nearest neighbour” matching; a sketch of
the stratified version follows.
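
A minimal sketch of this stratified estimator, again assuming an illustrative DataFrame df that contains y, w, and the estimated score p_hat; each bin's estimate of ATE(p) is weighted by the bin's sample share, mirroring equation (61).

```python
# Stratified propensity-score estimator of the ATE, as in equation (61).
import pandas as pd

def stratified_psm_ate(df: pd.DataFrame, n_bins: int = 10) -> float:
    bins = pd.cut(df["p_hat"], bins=n_bins, labels=False)
    n = len(df)
    ate_hat = 0.0
    for _, block in df.groupby(bins):
        treated = block.loc[block["w"] == 1, "y"]
        control = block.loc[block["w"] == 0, "y"]
        if treated.empty or control.empty:
            # Overlap fails within this bin; ATE(p) is not identified here.
            raise ValueError("bin with no treated or no untreated units")
        ate_hat += (len(block) / n) * (treated.mean() - control.mean())
    return ate_hat
```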

    Examples

    The Evaluation of a Training Program Above we discussed the paper by LaLonde

    (1986); that paper used experimental data to explore how well standard nonexperimental

estimators succeed in correctly estimating treatment effects. Dehejia and Wahba (1998)

    use the same data to explore how well the propensity score matching approach fares.

    They conclude that the propensity score matching approach works much better than

    the non-experimental approaches considered by LaLonde (1986) and seems to frequently

    come close to the experimental results.12

The study by Dehejia and Wahba is a nice example of the practical implementation

    of propensity score matching. Recall that LaLonde studied experimental data on the

    12See also Heckman and Smith (1995) for a discussion.


National Supported Work (NSW) Demonstration. In addition to using the treatment

and control group from the NSW experimental data, LaLonde also constructed alternative

    “control groups” from other external data sources (PSID, CPS) in order to check how

other standard non-experimental approaches would fare. Dehejia and Wahba take the
data from LaLonde to examine if the propensity score method fares better than the non-
experimental approaches considered by LaLonde. To do this, DW combine the data on
the treated individuals from the NSW experimental data with the artificially constructed
control groups from the PSID and the CPS (which contains the same background infor-

    mation).

    They then proceed in two steps. First, they estimate the propensity score p (x) using

    the pre-treatment variables x (observed for both the individuals in the NSW data and

the artificial control groups); to do this they use a standard Logistic probability model.

    They then group the observations into strata based on the estimated propensity score

and check for balancing of the pre-treatment variables within each stratum; that is, they

    use statistical tests to check whether, within each stratum, the distribution of the pre-

    treatment variables x is the same for the treated and the untreated individuals (as it

should due to the balancing score property $w \perp x \mid p(x)$). If there are no significant
differences in the distribution of x between the