TRANSCRIPT
Evaluation Design: Experiments, Quasi-Experiments, and Non-Experiments
[Title-slide diagram: T1 × T2, C1 C2; "something happens here" (a pre-post reflexive design)]
Core Concepts
1. Causal Analysis and the Counterfactual
   - The Competing Hypotheses Framework
2. Calculation of Program Effect (Effect Size)
3. Randomization Process
4. Control versus Comparison Group
5. Treatment Effect
   - Average Treatment Effect (ATE)
   - Intention to Treat (ITT)
   - Treatment on the Treated (TOT)
THE COUNTERFACTUAL:
Compared to what?
[Diagram: outcome measured at T=1 and T=2 with the program in between; program effect?]
The counterfactual is the exercise of figuring out what the outcome would have been had there been no program intervention. So we need a point of reference to measure that scenario…
Were suicide rates HIGH for a specific high school in suburban California?
[Figure: 95% confidence interval for the average rate per year at this high school, plotted against several candidate null hypotheses: the average among all HS students, all Californians, or all suburban HS students]
These are all valid counterfactuals.
How we define our comparison drives the conclusions.
What is the implied counter-factual?
[Diagram: outcome measured at T=1 and T=2 with the program in between; program effect?]
If you see a p-value, there is often an implied counterfactual. Once you figure out what it is, you often know what went wrong with the research design.
Another Example:
[Diagram: four groups observed at five time points (G1–G4, t=0 through t=4), with treatment (×) introduced to different groups at different times; treatment groups contrasted with control groups in a staggered design]
[Second panel: the same grid with a single treatment group and a single control group]
Specific tests: are there treatment gains for the late-treatment groups?
Selecting a good (valid) counter-factual is hard:
There is no going back… we can't unlearn, undo the effects of a drug, or reverse the effects of a program.
How do we identify a plausible counter-factual?
A valid counterfactual allows us to answer the following two questions:
1) Compared to what? The program outcomes are different from outcomes in the comparison group. The comparison group is defined by the researcher.
In some special cases the comparison group is identical (statistically speaking) to the treatment group. In this case we call it a "control" group.
2) How big is the program effect? Is the difference meaningful (statistically significant and socially salient)?
In the simple case the program effect is just the difference between the average outcomes of the treatment and control groups, but in practice there are many ways to calculate an effect.
Effect Size
Is the Princeton Review Class Effective?
Treatment Group: GRE=580
Control Group: GRE=440
Effect = 580 – 440 = 140
p-value = 0.023 (null: Diff = 0)
X̄T – X̄C = 140
SE( X̄T – X̄C ) = √( varT / nT + varC / nC )
Is it significant?
How big is the effect?
“Effect” translates to a calculation of impact plus statistical significance.
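In code, the slide's effect-size arithmetic looks like the sketch below. Only the two means (580 and 440) come from the slide; the sample sizes and standard deviations are hypothetical values chosen so the resulting p-value lands near the slide's reported 0.023.

```python
import math

# Means are from the slide; n and sd per group are hypothetical assumptions.
mean_t, sd_t, n_t = 580, 240, 30   # treatment: took the Princeton Review class
mean_c, sd_c, n_c = 440, 240, 30   # control: did not

effect = mean_t - mean_c                                  # 580 - 440 = 140
se = math.sqrt(sd_t**2 / n_t + sd_c**2 / n_c)             # SE of the difference in means
z = effect / se                                           # test statistic vs. Diff = 0
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))      # two-sided normal p-value

print(effect, round(se, 1), round(z, 2), round(p, 3))
```

With these assumed samples the difference is statistically significant (z ≈ 2.26, p ≈ 0.024), close to the slide's p = 0.023.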
Effect Size in a Correlation Study:
[Scatterplot: Dosage and Response; Caffeine (mg) on the x-axis (50–70) vs. Heart Rate (per min) on the y-axis (50–150), with a fitted regression line]
For a one-unit change in X, we expect a β1 change in Y.
y = β0 + β1x
[Graph annotation: the effect is the β1 change in Y for one unit of X; null: slope = 0; here β1 = 140]
How big is the effect?
Is it significant?
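The regression version of an effect size can be sketched with the textbook OLS formulas. The dosage-response numbers below are invented for illustration; they are not the data behind the slide's plot.

```python
# Hypothetical dosage-response data: caffeine dose vs. heart rate.
x = [50, 75, 100, 125, 150]    # caffeine (mg)
y = [62, 70, 81, 88, 100]      # heart rate (beats per minute)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# OLS slope: beta1 = cov(x, y) / var(x); this is the "effect" per unit of x.
beta1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
beta0 = mean_y - beta1 * mean_x

print(beta0, beta1)   # each additional mg of caffeine predicts +beta1 beats/min
```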
THE THREE WAYS WE CALCULATE EFFECTS
Two “counterfeit” or “weak” counterfactuals
[Diagrams:]
Pre-Post effect: compare the treated group at time=1 to the same group at time=2.
Post-Only effect: compare the treated group to an untreated group at time=2.
These are only valid when certain conditions are met.
Example: Do Charter Schools Outperform Public Schools?
http://www.rightmichigan.com/story/2011/6/21/23927/4600
“Don't get me wrong. I am not opposed to charter schools on principle. My beef with charter schools is that most skim the most motivated students out of the poorest communities, and many have disproportionately small numbers of children who need special education or who are English-language learners. The typical charter, operating in this way, increases the burden on the regular public schools, while privileging the lucky few. Continuing on this path will further disable public education in the cities and hand over the most successful students to private entrepreneurs.”
http://blogs.edweek.org/edweek/Bridging-Differences/2009/11/obama-and-duncan-are-wrong-abo.html
The difference in difference estimator:
Group / Time   T=1   T=2
Treatment      T1    T2
Control        C1    C2
Difference between treated and untreated: T2 – C2
But what about the trend???
Program Impact= (T2 – T1) – (C2 – C1)
Gains during the program period
Gains as a result of trend
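The arithmetic is simple enough to sketch directly; the four cell means below are hypothetical numbers, not from the slides.

```python
# Hypothetical cell means for the 2x2 (group x time) table.
T1, T2 = 50.0, 65.0    # treatment group at time 1 and time 2
C1, C2 = 48.0, 53.0    # control group at time 1 and time 2

post_only = T2 - C2                      # difference between treated and untreated
trend = C2 - C1                          # gains attributable to the trend
program_impact = (T2 - T1) - (C2 - C1)   # diff-in-diff: gains net of trend

print(post_only, trend, program_impact)
```

Note that the post-only comparison (12.0) differs from the diff-in-diff estimate (10.0) because these hypothetical groups did not start out equal (T1 ≠ C1).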
[Graph: policy/program group rising from T1 to T2 and control/comparison group from C1 to C2 between time=1 and time=2; the total change splits into trend plus effect]
First Difference (Pre-Post): Effect = T2 – T1
Second Difference (Post-Only): Effect = T2 – C2
Difference in Difference: Effect = (T2 – T1) – (C2 – C1)
The difference in difference estimator:
Example of the difference-in-difference estimate in practice
Comparing teacher “effects”
Distribution of diff-in-diff scores
Breaking it Down:
            Pre      Post
Control     a        a + c
Treatment   a + b    a + b + c + d

Single Diff 1 = (a + c) – a = c                       (control group change)
Single Diff 2 = (a + b + c + d) – (a + b) = c + d     (treatment group change)
Effect = (T2 – T1) – (C2 – C1) = (c + d) – c = d
As a regression of dummies:
Y = β0 + β1·Period + β2·TreatGroup + β3·(Period × TreatGroup) + e

Group / Time   T=1   T=2
Treatment      T1    T2
Control        C1    C2

Slope   Effect                     Interpretation
β0      C1                         Baseline
β1      C2 – C1                    Trend
β2      T1 – C1                    Initial Difference
β3      (T2 – T1) – (C2 – C1)      Treatment Effect
Yi,t = a + b·Treati,t + c·Posti,t + d·(Treati,t × Posti,t) + ei,t
Diff-in-Diff = (Single Diff 2 – Single Diff 1) = (c + d) – c = d
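Because the dummy regression is saturated, its coefficients map one-to-one onto the cell means in the table above. The sketch below checks that mapping with hypothetical cell means.

```python
# Hypothetical cell means for the 2x2 (group x time) table.
T1, T2 = 50.0, 65.0
C1, C2 = 48.0, 53.0

b0 = C1                           # baseline
b1 = C2 - C1                      # trend
b2 = T1 - C1                      # initial difference
b3 = (T2 - T1) - (C2 - C1)        # treatment effect (diff-in-diff)

# The fitted saturated model reproduces every cell mean exactly:
assert b0 == C1                   # control, pre
assert b0 + b1 == C2              # control, post
assert b0 + b2 == T1              # treatment, pre
assert b0 + b1 + b2 + b3 == T2    # treatment, post

print(b0, b1, b2, b3)
```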
Putting Graph & Regression Together
            Pre      Post
Control     a        a + c
Treatment   a + b    a + b + c + d

Single Diff 1 = (a + c) – a = c
Single Diff 2 = (a + b + c + d) – (a + b) = c + d
Validity of two “counterfeit” counterfactuals:
Pre-Post: (T2 – T1) – (C2 – C1) = T2 – T1 IFF C2 – C1 = 0 (no trend)
Post-Only: (T2 – T1) – (C2 – C1) = T2 – C2 IFF C1 – T1 = 0 (equivalent at time 1)
[Graph: treatment outcomes T1 → T2 and control outcomes C1 → C2 from time=1 to time=2, with the program in between]
Why We Use Randomization or Matching
Textbook notation for the effect calculation:
Program Effect = E( Y | P=1 ) – E( Y | P=0 )
In plain English, we calculate the effect by subtracting the average outcome E(Y) of the control group (P=0) from the average outcome of the treatment group (P=1).
Why might this be misleading?
Validity of two "counterfeit" counterfactuals:
Pre-Post: (T2 – T1) – (C2 – C1) = T2 – T1 IFF C2 – C1 = 0 (no trend)
Post-Only: (T2 – T1) – (C2 – C1) = T2 – C2 IFF C1 – T1 = 0 (equivalent at time 1)
Randomization gives us C1 = T1, therefore we can use post-test-only estimators.
This is the exception, not the rule.
DIFFERENT TYPES OF COUNTERFACTUALS IN PRACTICE
Varieties of the Counterfactual:
[Diagrams of the main designs:]
• Pre-Post: effect = change in the treated group between T=1 and T=2
• Post-Only: effect = treated vs. untreated outcome at T=2
• Pre-Post With Control (Diff-in-Diff): Effect = A – B
• Interrupted Time Series: effect = deviation of the post-program series from the projected trend
• Regression Discontinuity: effect = jump in the outcome at the program-qualification cutoff
[Figures: an interrupted time series and a regression discontinuity design (source: Martinez, 2006, course notes)]
Core Concepts
1. Causal Analysis and the Counterfactual
   - The Competing Hypotheses Framework
2. Calculation of Program Effect (Effect Size)
3. Randomization Process
4. Control versus Comparison Group
5. Treatment Effect
   - Average Treatment Effect (ATE)
   - Intention to Treat (ITT)
   - Treatment on the Treated (TOT)
SUCCESSFUL AND UNSUCCESSFUL RANDOMIZATION
Is this problematic?
Bonferroni Correction:
When we want to be 95% confident that two groups are the same, and we measure those groups using a set of contrasts, our decision rule is no longer to reject the null (that the groups are the same) whenever a p-value < 0.05. A "contrast" is a comparison of the means of any measured characteristic between the two groups.
If each contrast has a 5% chance of yielding a p-value below 0.05 under the null, then the probability of observing at least one contrast with a p-value that small is greater than 5%. It is approximately n × 0.05, where n is the number of contrasts.
So if we want to be 95% confident about the groups as a whole (not just each contrast), we adjust our decision rule to α/n.
For example, with 10 contrasts the decision rule becomes 0.05/10 = 0.005: at least one contrast must have a p-value below 0.005 before we conclude that the groups are different.
x1 <- rbinom( 10000, 6, 0.05 )
table( x1 ) / 10000
y1 <- rbinom( 10000, 6, 0.05/6 )
table( y1 ) / 10000
Test for “Happy” Randomization
0.05 / 6 = 0.0083
x1 <- rbinom( 10000, 6, 0.05 )
table( x1 ) / 10000
y1 <- rbinom( 10000, 6, 0.05/6 )
table( y1 ) / 10000
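The R simulation above can be mirrored in Python as a sketch: under the null, each contrast's p-value is uniform on (0,1), so a contrast "rejects" with probability equal to the threshold.

```python
import random

random.seed(42)

def p_any_rejection(threshold, n_contrasts=6, n_sims=10_000):
    """Estimate the chance that at least one of n_contrasts null contrasts
    produces a p-value below the threshold."""
    hits = sum(
        any(random.random() < threshold for _ in range(n_contrasts))
        for _ in range(n_sims)
    )
    return hits / n_sims

print(p_any_rejection(0.05))        # near 1 - 0.95**6 = 0.265: far above 5%
print(p_any_rejection(0.05 / 6))    # near 0.049: the Bonferroni rule restores ~5%
```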
RCT versus Natural Experiments:
1. RCTs assume complete control over the assignment process
2. Natural Experiments often utilize randomization:
Charter Schools Example
Vietnam Veterans Example
"Control" Versus "Comparison" Groups
[Graph: policy/program group (T1 → T2) and control/comparison group (C1 → C2) between time=1 and time=2; C2 – C1 traces the trend]
First Difference (Pre-Post): Effect = T2 – T1
Second Difference (Post-Only): Effect = T2 – C2
Difference in Difference: Effect = (T2 – T1) – (C2 – C1)
Two Considerations:
Does C1 = T1? Not usually for comparison groups.
Is C2 – C1 an accurate reflection of the trend? In some cases, the comparison group can adequately capture the trend.
DIFFERENT INTERPRETATIONS OF PROGRAM EFFECTS
Estimation of the counter-factual:
Program Effect = E( Y | P=1 ) – E( Y | P=0 )
Operationalized as:
Program Effect = ( T2 – T1 ) – (C2 – C1 )
[Diagram: three groups: a control group, those given bed nets, and those given bed nets who actually use them]
Calculation of Treatment Effects:
Terminology:
• "Average" Treatment Effect (ATE)
• Intention to Treat (ITT): the effect of being offered the program (everyone given bed nets, users and non-users alike)
• Treatment on the Treated (TOT): the effect on those who take up the program (those given bed nets AND use them)
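One way to see the ITT/TOT distinction is the standard Bloom (1984) adjustment: if non-users get no benefit, the effect of actual use equals the ITT effect scaled up by the usage rate. All numbers below are hypothetical.

```python
# Hypothetical bed-net trial (all numbers invented for illustration).
malaria_rate_control = 0.30    # control group
malaria_rate_assigned = 0.24   # everyone *given* nets, users and non-users alike
share_using_nets = 0.60        # fraction of the assigned group who actually use them

# Intention to Treat: effect of being offered nets, diluted by non-use.
itt = malaria_rate_assigned - malaria_rate_control

# Treatment on the Treated: if non-users get no effect, the whole ITT
# effect is concentrated among the users (Bloom adjustment).
tot = itt / share_using_nets

print(round(itt, 2), round(tot, 2))   # -0.06 if offered; -0.1 if actually used
```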
Exam Question:
What is the difference between non-compliance and attrition?
CAMPBELL SCORES: ELIMINATING COMPETING HYPOTHESES
http://www.youtube.com/watch?v=7DDF8WZFnoU
Can Ants Count?
Competing Hypotheses
The Program Hypothesis:
The change that we saw in our study group above and beyond the comparison group (the effect size) was a result of the program.
The Competing Hypothesis:
The change that we saw in our study group above and beyond the comparison group was a result of _______.
(insert any item of the Campbell Score)
The Campbell Score:
Omitted Variable Bias:
• Selection / Omitted Variables
• Non-Random Attrition
Trends in the Data:
• Maturation
• Secular Trends
• Testing
• Seasonality
• Regression to the Mean
Study Calibration:
• Measurement Error
• Time-Frame of Study
Contamination Factors:
• Intervening Events
Competing Hypothesis #1
Selection Into a Program
If people have a choice to enroll in a program, those who enroll will be different from those who do not.
This is a source of omitted variable bias.
The Fix:
Randomization into treatment and control groups.
Randomization must be “happy”!
Test for “Happy” Randomization
x1 <- rbinom( 10000, 6, 0.04 )
table( x1 ) / 10000
Competing Hypothesis #2
Non-Random Attrition
If the people who leave a program or study are different from those who stay, the calculation of effects will be biased.
The Fix:
Examine characteristics of those that stay versus those that leave.
[Figure: five participants with incomes $3.00, $2.50, $2.00, $1.50, and $1.00; mean = $2.00. After the poorest participants attrit, the mean of those remaining = $2.50]
Microfinance Example: an artificial effect size created by attrition
Test for Attrition
It can also be tested in another way: do the background characteristics of the sample at T1 equal those at T2?
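The microfinance slide's two means can be reproduced directly. The assumption that it is the two poorest participants who drop out is one reading consistent with the slide's numbers.

```python
# Five participants' incomes from the slide.
incomes = [3.00, 2.50, 2.00, 1.50, 1.00]
mean_before = sum(incomes) / len(incomes)     # $2.00

# Suppose the two poorest participants drop out (non-random attrition).
stayers = incomes[:3]                         # [3.00, 2.50, 2.00]
mean_after = sum(stayers) / len(stayers)      # $2.50

# No one's income changed, yet the group mean "improved" by $0.50.
artificial_effect = mean_after - mean_before
print(mean_before, mean_after, artificial_effect)
```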
Separating Trend from Effects
[Graph: when C1 = T1, the total gains during the study split into trend plus treatment effect, and T2 – C2 removes the trend]
Separating Trend from Effects
[Graph: when C1 ≠ T1, T2 – C2 does NOT fully remove the actual trend]
NOTE: diff-in-diff separates trend from effect even when the groups are not equivalent.
Separating Trend from Effects
[Graph: when C1 ≠ T1 in the opposite direction, T2 – C2 removes too much trend]
NOTE: diff-in-diff separates trend from effect even when the groups are not equivalent.
Competing Hypothesis #3
Maturation
Occurs when growth is expected naturally, such as increases in children's cognitive ability due to natural development, independent of program effects.
The Fix:
Use a comparison group to remove the trend.
Pre-Post With Control
[Diagram: Effect = A – B]
Competing Hypothesis #4
Secular Trends
Very similar to maturation, except the trend in the data is caused by a global process outside of individuals, such as economic or cultural trends.
The Fix:
Use a comparison group to remove the trend.
Pre-Post With Control
[Diagram: Effect = A – B]
Competing Hypothesis #5
Seasonality
Data with seasonal trends or other cycles will have natural highs and lows.
The Fix:
Only compare observations from the same time period, or average observations over an entire year (or cycle period).
[Graph: seasonal highs and lows (April, October) around a general trend]
Competing Hypothesis #6
Testing
When the same group is exposed repeatedly to the same set of questions or tasks, they can improve independently of any training.
The Fix:
This problem only applies to a small set of programs. Change tests, use post-test only designs, or use a control group that receives the test.
Pre-Post With Control
[Diagram: Effect = A – B]
Competing Hypothesis #7
Regression to the Mean
Each time you observe an outcome, the outcome in the next period is naturally more likely to be closer to the mean than to stay the same or become more extreme. As a result, quality-improvement programs for low-performing units often have a built-in improvement bias regardless of program effects.
The Fix:
Take care not to select a study group from the top or bottom of the distribution in a single time period (only high or low performers).
[Figure: Home Runs by Derek Jeter per season, 1996–2010; average = 15.6. A regression-to-the-mean example]
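A small simulation (hypothetical numbers) shows the mechanism: select units by a noisy period-1 score, and their period-2 scores drift back toward the mean even though nothing about them changed.

```python
import random

random.seed(0)

# Each unit has a fixed true ability; each observed score adds fresh noise.
n = 5000
ability = [random.gauss(50, 5) for _ in range(n)]
score_t1 = [a + random.gauss(0, 10) for a in ability]
score_t2 = [a + random.gauss(0, 10) for a in ability]

# Enroll the "low performers" of period 1, as an improvement program might.
low = [i for i in range(n) if score_t1[i] < 40]
mean_t1 = sum(score_t1[i] for i in low) / len(low)
mean_t2 = sum(score_t2[i] for i in low) / len(low)

# The enrolled group "improves" with no program at all.
print(round(mean_t1, 1), round(mean_t2, 1))
```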
Competing Hypothesis #8
Measurement Error
If there is significant measurement error in the dependent variable, it will bias the effects toward zero and make programs look less effective.
The Fix:
Use better measures of dependent variables.
Competing Hypothesis #9
Study Time-Frame
If the study is not long enough, it may look like the program had no impact when in fact it did. If the study is too long, then attrition becomes a problem.
The Fix:
Use prior knowledge or research from the study domain to pick an appropriate study period.
Examples:
• Michigan Affirmative Action Study
• Iowa liquor law change
Competing Hypothesis #10
Intervening Events
Has something happened during the study that affects one of the groups (treatment or control) but not the other?
The Fix:
If there is an intervening event, it may be hard to remove its effects from the study.