TRANSCRIPT
Evaluation Design: Experiments, Quasi-Experiments, and Non-Experiments
[Title-slide diagram: T1 × T2, C1 C2; "something happens here" (a pre-post reflexive design)]
Core Concepts
1. Causal Analysis and the Counterfactual
   - The Competing Hypotheses Framework
2. Calculation of Program Effect (Effect Size)
3. Randomization Process
4. Control versus Comparison Group
5. Treatment Effect
   - Average Treatment Effect (ATE)
   - Intention to Treat (ITT)
   - Treatment on the Treated (TOT)
THE COUNTERFACTUAL:
Compared to what?
[Diagram: outcome measured at T=1 and T=2 with the program in between; program effect?]
The counterfactual is the exercise of figuring out what the outcome would have been had there been no program intervention. So we need a point of reference to measure that scenario…
Were suicide rates HIGH for a specific high school in suburban California?
[Figure: 95% confidence interval for the average rate per year at this high school, plotted against several candidate null hypotheses: the average among all HS students, all Californians, or all suburban HS students]
These are all valid counterfactuals.
How we define our comparison drives the conclusions.
What is the implied counter-factual?
[Diagram: outcome measured at T=1 and T=2 with the program in between; program effect?]
If you see a p-value, there is often an implied counterfactual. Once you figure out what it is, you often know what went wrong with the research design.
Another Example:
[Diagram: four groups observed at five time points (G1–G4, t=0 through t=4), with treatment (×) introduced to different groups at different times; treatment groups contrasted with control groups in a staggered design]
[Second panel: the same grid with a single treatment group and a single control group]
Specific tests: are there treatment gains for the late-treatment groups?
Selecting a good (valid) counter-factual is hard:
There is no going back… we can't unlearn, undo the effects of a drug, or reverse the effects of a program.
How do we identify a plausible counter-factual?
A valid counterfactual allows us to answer the following two questions:
1) Compared to what? The program outcomes are different from outcomes in the comparison group. The comparison group is defined by the researcher.
In some special cases the comparison group is identical (statistically speaking) to the treatment group. In this case we call it a "control" group.
2) How big is the program effect? Is the difference meaningful (statistically significant and socially salient)?
In the simple case the program effect is just the difference between the average outcomes of the treatment and control groups, but in practice there are many ways to calculate an effect.
Effect Size
Is the Princeton Review Class Effective?
Treatment Group: GRE=580
Control Group: GRE=440
Effect = 580 – 440 = 140
p-value = 0.023 (null: Diff = 0)
X̄T – X̄C = 140
SE( X̄T – X̄C ) = √( varT / nT + varC / nC )
Is it significant?
How big is the effect?
“Effect” translates to a calculation of impact plus statistical significance.
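In code, the slide's effect-size arithmetic looks like the sketch below. Only the two means (580 and 440) come from the slide; the sample sizes and standard deviations are hypothetical values chosen so the resulting p-value lands near the slide's reported 0.023.

```python
import math

# Means are from the slide; n and sd per group are hypothetical assumptions.
mean_t, sd_t, n_t = 580, 240, 30   # treatment: took the Princeton Review class
mean_c, sd_c, n_c = 440, 240, 30   # control: did not

effect = mean_t - mean_c                                  # 580 - 440 = 140
se = math.sqrt(sd_t**2 / n_t + sd_c**2 / n_c)             # SE of the difference in means
z = effect / se                                           # test statistic vs. Diff = 0
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))      # two-sided normal p-value

print(effect, round(se, 1), round(z, 2), round(p, 3))
```

With these assumed samples the difference is statistically significant (z ≈ 2.26, p ≈ 0.024), close to the slide's p = 0.023.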
Effect Size in a Correlation Study:
[Scatterplot: Dosage and Response; Caffeine (mg) on the x-axis (50–70) vs. Heart Rate (per min) on the y-axis (50–150), with a fitted regression line]
For a one-unit change in X, we expect a β1 change in Y.
y = β0 + β1x
[Graph annotation: the effect is the β1 change in Y for one unit of X; null: slope = 0; here β1 = 140]
How big is the effect?
Is it significant?
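The regression version of an effect size can be sketched with the textbook OLS formulas. The dosage-response numbers below are invented for illustration; they are not the data behind the slide's plot.

```python
# Hypothetical dosage-response data: caffeine dose vs. heart rate.
x = [50, 75, 100, 125, 150]    # caffeine (mg)
y = [62, 70, 81, 88, 100]      # heart rate (beats per minute)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# OLS slope: beta1 = cov(x, y) / var(x); this is the "effect" per unit of x.
beta1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
beta0 = mean_y - beta1 * mean_x

print(beta0, beta1)   # each additional mg of caffeine predicts +beta1 beats/min
```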
THE THREE WAYS WE CALCULATE EFFECTS
Two “counterfeit” or “weak” counterfactuals
[Diagrams:]
Pre-Post effect: compare the treated group at time=1 to the same group at time=2.
Post-Only effect: compare the treated group to an untreated group at time=2.
These are only valid when certain conditions are met.
Example: Do Charter Schools Outperform Public Schools?
http://www.rightmichigan.com/story/2011/6/21/23927/4600
“Don't get me wrong. I am not opposed to charter schools on principle. My beef with charter schools is that most skim the most motivated students out of the poorest communities, and many have disproportionately small numbers of children who need special education or who are English-language learners. The typical charter, operating in this way, increases the burden on the regular public schools, while privileging the lucky few. Continuing on this path will further disable public education in the cities and hand over the most successful students to private entrepreneurs.”
http://blogs.edweek.org/edweek/Bridging-Differences/2009/11/obama-and-duncan-are-wrong-abo.html
The difference in difference estimator:
Group / Time   T=1   T=2
Treatment      T1    T2
Control        C1    C2
Difference between treated and untreated: T2 – C2
But what about the trend???
Program Impact= (T2 – T1) – (C2 – C1)
Gains during the program period
Gains as a result of trend
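The arithmetic is simple enough to sketch directly; the four cell means below are hypothetical numbers, not from the slides.

```python
# Hypothetical cell means for the 2x2 (group x time) table.
T1, T2 = 50.0, 65.0    # treatment group at time 1 and time 2
C1, C2 = 48.0, 53.0    # control group at time 1 and time 2

post_only = T2 - C2                      # difference between treated and untreated
trend = C2 - C1                          # gains attributable to the trend
program_impact = (T2 - T1) - (C2 - C1)   # diff-in-diff: gains net of trend

print(post_only, trend, program_impact)
```

Note that the post-only comparison (12.0) differs from the diff-in-diff estimate (10.0) because these hypothetical groups did not start out equal (T1 ≠ C1).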
[Graph: policy/program group rising from T1 to T2 and control/comparison group from C1 to C2 between time=1 and time=2; the total change splits into trend plus effect]
First Difference (Pre-Post): Effect = T2 – T1
Second Difference (Post-Only): Effect = T2 – C2
Difference in Difference: Effect = (T2 – T1) – (C2 – C1)
The difference in difference estimator:
Example of the difference-in-difference estimate in practice
Comparing teacher “effects”
Distribution of diff-in-diff scores
Breaking it Down:
            Pre      Post
Control     a        a + c
Treatment   a + b    a + b + c + d

Single Diff 1 = (a + c) – a = c                       (control group change)
Single Diff 2 = (a + b + c + d) – (a + b) = c + d     (treatment group change)
Effect = (T2 – T1) – (C2 – C1) = (c + d) – c = d
As a regression of dummies:
Y = β0 + β1·Period + β2·TreatGroup + β3·(Period × TreatGroup) + e

Group / Time   T=1   T=2
Treatment      T1    T2
Control        C1    C2

Slope   Effect                     Interpretation
β0      C1                         Baseline
β1      C2 – C1                    Trend
β2      T1 – C1                    Initial Difference
β3      (T2 – T1) – (C2 – C1)      Treatment Effect
Yi,t = a + b·Treati,t + c·Posti,t + d·(Treati,t × Posti,t) + ei,t
Diff-in-Diff = (Single Diff 2 – Single Diff 1) = (c + d) – c = d
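Because the dummy regression is saturated, its coefficients map one-to-one onto the cell means in the table above. The sketch below checks that mapping with hypothetical cell means.

```python
# Hypothetical cell means for the 2x2 (group x time) table.
T1, T2 = 50.0, 65.0
C1, C2 = 48.0, 53.0

b0 = C1                           # baseline
b1 = C2 - C1                      # trend
b2 = T1 - C1                      # initial difference
b3 = (T2 - T1) - (C2 - C1)        # treatment effect (diff-in-diff)

# The fitted saturated model reproduces every cell mean exactly:
assert b0 == C1                   # control, pre
assert b0 + b1 == C2              # control, post
assert b0 + b2 == T1              # treatment, pre
assert b0 + b1 + b2 + b3 == T2    # treatment, post

print(b0, b1, b2, b3)
```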
Putting Graph & Regression Together
            Pre      Post
Control     a        a + c
Treatment   a + b    a + b + c + d

Single Diff 1 = (a + c) – a = c
Single Diff 2 = (a + b + c + d) – (a + b) = c + d
Validity of two “counterfeit” counterfactuals:
Pre-Post: (T2 – T1) – (C2 – C1) = T2 – T1 IFF C2 – C1 = 0 (no trend)
Post-Only: (T2 – T1) – (C2 – C1) = T2 – C2 IFF C1 – T1 = 0 (equivalent at time 1)
[Graph: treatment outcomes T1 → T2 and control outcomes C1 → C2 from time=1 to time=2, with the program in between]
Why We Use Randomization or Matching
Textbook notation for the effect calculation:
Program Effect = E( Y | P=1 ) – E( Y | P=0 )
In plain English, we calculate the effect by subtracting the average outcome E(Y) of the control group (P=0) from the average outcome of the treatment group (P=1).
Why might this be misleading?
Validity of two "counterfeit" counterfactuals:
Pre-Post: (T2 – T1) – (C2 – C1) = T2 – T1 IFF C2 – C1 = 0 (no trend)
Post-Only: (T2 – T1) – (C2 – C1) = T2 – C2 IFF C1 – T1 = 0 (equivalent at time 1)
Randomization gives us C1 = T1, therefore we can use post-test-only estimators.
This is the exception, not the rule.
DIFFERENT TYPES OF COUNTERFACTUALS IN PRACTICE
Varieties of the Counterfactual:
[Diagrams of the main designs:]
• Pre-Post: effect = change in the treated group between T=1 and T=2
• Post-Only: effect = treated vs. untreated outcome at T=2
• Pre-Post With Control (Diff-in-Diff): Effect = A – B
• Interrupted Time Series: effect = deviation of the post-program series from the projected trend
• Regression Discontinuity: effect = jump in the outcome at the program-qualification cutoff
[Figures: an interrupted time series and a regression discontinuity design (source: Martinez, 2006, course notes)]
Core Concepts
1. Causal Analysis and the Counterfactual
   - The Competing Hypotheses Framework
2. Calculation of Program Effect (Effect Size)
3. Randomization Process
4. Control versus Comparison Group
5. Treatment Effect
   - Average Treatment Effect (ATE)
   - Intention to Treat (ITT)
   - Treatment on the Treated (TOT)
SUCCESSFUL AND UNSUCCESSFUL RANDOMIZATION
Is this problematic?
Bonferroni Correction:
When we want to be 95% confident that two groups are the same, and we measure those groups using a set of contrasts, our decision rule is no longer to reject the null (that the groups are the same) whenever a p-value < 0.05. A "contrast" is a comparison of the means of any measured characteristic between the two groups.
If each contrast has a 5% chance of yielding a p-value below 0.05 under the null, then the probability of observing at least one contrast with a p-value that small is greater than 5%. It is approximately n × 0.05, where n is the number of contrasts.
So if we want to be 95% confident about the groups as a whole (not just each contrast), we adjust our decision rule to α/n.
For example, with 10 contrasts the decision rule becomes 0.05/10 = 0.005: at least one contrast must have a p-value below 0.005 before we conclude that the groups are different.
x1 <- rbinom( 10000, 6, 0.05 )
table( x1 ) / 10000
y1 <- rbinom( 10000, 6, 0.05/6 )
table( y1 ) / 10000
Test for “Happy” Randomization
0.05 / 6 = 0.0083
x1 <- rbinom( 10000, 6, 0.05 )
table( x1 ) / 10000
y1 <- rbinom( 10000, 6, 0.05/6 )
table( y1 ) / 10000
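The R simulation above can be mirrored in Python as a sketch: under the null, each contrast's p-value is uniform on (0,1), so a contrast "rejects" with probability equal to the threshold.

```python
import random

random.seed(42)

def p_any_rejection(threshold, n_contrasts=6, n_sims=10_000):
    """Estimate the chance that at least one of n_contrasts null contrasts
    produces a p-value below the threshold."""
    hits = sum(
        any(random.random() < threshold for _ in range(n_contrasts))
        for _ in range(n_sims)
    )
    return hits / n_sims

print(p_any_rejection(0.05))        # near 1 - 0.95**6 = 0.265: far above 5%
print(p_any_rejection(0.05 / 6))    # near 0.049: the Bonferroni rule restores ~5%
```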
RCT versus Natural Experiments:
1. RCTs assume complete control over the assignment process
2. Natural Experiments often utilize randomization:
Charter Schools Example
Vietnam Veterans Example
"Control" Versus "Comparison" Groups
[Graph: policy/program group (T1 → T2) and control/comparison group (C1 → C2) between time=1 and time=2; C2 – C1 traces the trend]
First Difference (Pre-Post): Effect = T2 – T1
Second Difference (Post-Only): Effect = T2 – C2
Difference in Difference: Effect = (T2 – T1) – (C2 – C1)
Two Considerations:
Does C1 = T1? Not usually for comparison groups.
Is C2 – C1 an accurate reflection of the trend? In some cases, the comparison group can adequately capture the trend.
DIFFERENT INTERPRETATIONS OF PROGRAM EFFECTS
Estimation of the counter-factual:
Program Effect = E( Y | P=1 ) – E( Y | P=0 )
Operationalized as:
Program Effect = ( T2 – T1 ) – (C2 – C1 )
[Diagram: three groups: a control group, those given bed nets, and those given bed nets who actually use them]
Calculation of Treatment Effects:
Terminology:
• "Average" Treatment Effect (ATE)
• Intention to Treat (ITT): the effect of being offered the program (everyone given bed nets, users and non-users alike)
• Treatment on the Treated (TOT): the effect on those who take up the program (those given bed nets AND use them)
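One way to see the ITT/TOT distinction is the standard Bloom (1984) adjustment: if non-users get no benefit, the effect of actual use equals the ITT effect scaled up by the usage rate. All numbers below are hypothetical.

```python
# Hypothetical bed-net trial (all numbers invented for illustration).
malaria_rate_control = 0.30    # control group
malaria_rate_assigned = 0.24   # everyone *given* nets, users and non-users alike
share_using_nets = 0.60        # fraction of the assigned group who actually use them

# Intention to Treat: effect of being offered nets, diluted by non-use.
itt = malaria_rate_assigned - malaria_rate_control

# Treatment on the Treated: if non-users get no effect, the whole ITT
# effect is concentrated among the users (Bloom adjustment).
tot = itt / share_using_nets

print(round(itt, 2), round(tot, 2))   # -0.06 if offered; -0.1 if actually used
```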
Exam Question:
What is the difference between non-compliance and attrition?
CAMPBELL SCORES: ELIMINATING COMPETING HYPOTHESES
http://www.youtube.com/watch?v=7DDF8WZFnoU
Can Ants Count?
Competing Hypotheses
The Program Hypothesis:
The change that we saw in our study group above and beyond the comparison group (the effect size) was a result of the program.
The Competing Hypothesis:
The change that we saw in our study group above and beyond the comparison group was a result of _______.
(insert any item of the Campbell Score)
The Campbell Score:
Omitted Variable Bias:
• Selection / Omitted Variables
• Non-Random Attrition
Trends in the Data:
• Maturation
• Secular Trends
• Testing
• Seasonality
• Regression to the Mean
Study Calibration:
• Measurement Error
• Time-Frame of Study
Contamination Factors:
• Intervening Events
Competing Hypothesis #1
Selection Into a Program
If people have a choice to enroll in a program, those who enroll will be different from those who do not.
This is a source of omitted variable bias.
The Fix:
Randomization into treatment and control groups.
Randomization must be “happy”!
Test for “Happy” Randomization
x1 <- rbinom( 10000, 6, 0.04 )
table( x1 ) / 10000
Competing Hypothesis #2
Non-Random Attrition
If the people who leave a program or study are different from those who stay, the calculation of effects will be biased.
The Fix:
Examine characteristics of those that stay versus those that leave.
[Figure: five participants with incomes $3.00, $2.50, $2.00, $1.50, and $1.00; mean = $2.00. After the poorest participants attrit, the mean of those remaining = $2.50]
Microfinance Example: an artificial effect size created by attrition
Test for Attrition
It can also be tested in another way: do the background characteristics of the sample at T1 equal those at T2?
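The microfinance slide's two means can be reproduced directly. The assumption that it is the two poorest participants who drop out is one reading consistent with the slide's numbers.

```python
# Five participants' incomes from the slide.
incomes = [3.00, 2.50, 2.00, 1.50, 1.00]
mean_before = sum(incomes) / len(incomes)     # $2.00

# Suppose the two poorest participants drop out (non-random attrition).
stayers = incomes[:3]                         # [3.00, 2.50, 2.00]
mean_after = sum(stayers) / len(stayers)      # $2.50

# No one's income changed, yet the group mean "improved" by $0.50.
artificial_effect = mean_after - mean_before
print(mean_before, mean_after, artificial_effect)
```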
Separating Trend from Effects
[Graph: when C1 = T1, the total gains during the study split into trend plus treatment effect, and T2 – C2 removes the trend]
Separating Trend from Effects
[Graph: when C1 ≠ T1, T2 – C2 does NOT fully remove the actual trend]
NOTE: diff-in-diff separates trend from effect even when the groups are not equivalent.
Separating Trend from Effects
[Graph: when C1 ≠ T1 in the opposite direction, T2 – C2 removes too much trend]
NOTE: diff-in-diff separates trend from effect even when the groups are not equivalent.
Competing Hypothesis #3
Maturation
Occurs when growth is expected naturally, such as increases in children's cognitive ability due to natural development, independent of program effects.
The Fix:
Use a comparison group to remove the trend.
Pre-Post With Control
[Diagram: Effect = A – B]
Competing Hypothesis #4
Secular Trends
Very similar to maturation, except the trend in the data is caused by a global process outside of individuals, such as economic or cultural trends.
The Fix:
Use a comparison group to remove the trend.
Pre-Post With Control
[Diagram: Effect = A – B]
Competing Hypothesis #5
Seasonality
Data with seasonal trends or other cycles will have natural highs and lows.
The Fix:
Only compare observations from the same time period, or average observations over an entire year (or cycle period).
[Graph: seasonal highs and lows (April, October) around a general trend]
Competing Hypothesis #6
Testing
When the same group is exposed repeatedly to the same set of questions or tasks, they can improve independently of any training.
The Fix:
This problem only applies to a small set of programs. Change tests, use post-test only designs, or use a control group that receives the test.
Pre-Post With Control
[Diagram: Effect = A – B]
Competing Hypothesis #7
Regression to the Mean
Each time you observe an outcome, the outcome in the next period is naturally more likely to be closer to the mean than to stay the same or become more extreme. As a result, quality-improvement programs for low-performing units often have a built-in improvement bias regardless of program effects.
The Fix:
Take care not to select a study group from the top or bottom of the distribution in a single time period (only high or low performers).
[Figure: Home Runs by Derek Jeter per season, 1996–2010; average = 15.6. A regression-to-the-mean example]
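A small simulation (hypothetical numbers) shows the mechanism: select units by a noisy period-1 score, and their period-2 scores drift back toward the mean even though nothing about them changed.

```python
import random

random.seed(0)

# Each unit has a fixed true ability; each observed score adds fresh noise.
n = 5000
ability = [random.gauss(50, 5) for _ in range(n)]
score_t1 = [a + random.gauss(0, 10) for a in ability]
score_t2 = [a + random.gauss(0, 10) for a in ability]

# Enroll the "low performers" of period 1, as an improvement program might.
low = [i for i in range(n) if score_t1[i] < 40]
mean_t1 = sum(score_t1[i] for i in low) / len(low)
mean_t2 = sum(score_t2[i] for i in low) / len(low)

# The enrolled group "improves" with no program at all.
print(round(mean_t1, 1), round(mean_t2, 1))
```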
Competing Hypothesis #8
Measurement Error
If there is significant measurement error in the dependent variable, it will bias the effects toward zero and make programs look less effective.
The Fix:
Use better measures of dependent variables.
Competing Hypothesis #9
Study Time-Frame
If the study is not long enough, it may look like the program had no impact when in fact it did. If the study is too long, then attrition becomes a problem.
The Fix:
Use prior knowledge or research from the study domain to pick an appropriate study period.
Examples:
• Michigan Affirmative Action Study
• Iowa liquor law change
Competing Hypothesis #10
Intervening Events
Has something happened during the study that affects one of the groups (treatment or control) but not the other?
The Fix:
If there is an intervening event, it may be hard to remove its effects from the study.