Understanding and misunderstanding randomized controlled trials

Angus Deaton and Nancy Cartwright

Princeton University

Durham University and UC San Diego

This version, August 2016

We acknowledge helpful discussions with many people over the many years this paper has been in preparation. We would particularly like to note comments from seminar participants at Princeton, Columbia and Chicago, the CHESS research group at Durham, as well as discussions with Orley Ashenfelter, Anne Case, Nick Cowen, Hank Farber, Bo Honoré, and Julian Reiss. Ulrich Mueller had a major influence on shaping Section 1 of the paper. We have benefited from generous comments on an earlier version by Tim Besley, Chris Blattman, Sylvain Chassang, Steven Durlauf, Jean Drèze, William Easterly, Jonathan Fuller, Lars Hansen, Jim Heckman, Jeff Hammer, Macartan Humphreys, Helen Milner, Suresh Naidu, Lant Pritchett, Dani Rodrik, Burt Singer, Richard Zeckhauser, and Steve Ziliak. Cartwright’s research for this paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 667526 K4U). Deaton acknowledges financial support through the National Bureau of Economic Research, Grants 5R01AG040629-02 and P01AG05842-14 and through Princeton University’s Roybal Center, Grant P30AG024928.

ABSTRACT

RCTs are valuable tools whose use is spreading in economics and in other social sciences. They are seen as desirable aids in scientific discovery and for generating evidence for policy. Yet some of the enthusiasm for RCTs appears to be based on misunderstandings: that randomization provides a fair test by equalizing everything but the treatment and so allows a precise estimate of the treatment alone; that randomization is required to solve selection problems; that lack of blinding does little to compromise inference; and that statistical inference in RCTs is straightforward, because it requires only the comparison of two means. None of these statements is true. RCTs do indeed require minimal assumptions and can operate with little prior knowledge, an advantage when persuading distrustful audiences, but a crucial disadvantage for cumulative scientific progress, where randomization adds noise and undermines precision. The lack of connection between RCTs and other scientific knowledge makes it hard to use them outside of the exact context in which they are conducted. Yet, once they are seen as part of a cumulative program, they can play a role in building general knowledge and useful predictions, provided they are combined with other methods, including conceptual and theoretical development, to discover not “what works,” but why things work. Unless we are prepared to make assumptions, and to stand on what we know, making statements that will be incredible to some, all the credibility of RCTs is for naught.

Introduction

Randomized trials are currently much used in economics and are widely considered to be a desirable method of empirical analysis and discovery. There is a long history of such trials in the subject. There were four large federally sponsored negative income tax trials in the 1960s and 1970s. In the mid-1970s, there was a famous, and still frequently cited, trial on health insurance, the Rand health experiment. There was then a period during which randomized controlled trials (RCTs) received less attention in academic economics; even so, randomized trials on welfare, social policy, labor markets, and education have continued since the mid-1970s, some with substantial involvement and discussion by academic economists, see Greenberg and Shroder (2004).

Recent randomized trials in economic development have attracted attention, and the idea that such trials can discover “what works” has been widely adopted in economics, as well as in political science, education, and social policy. Among both researchers and the general public, RCTs are perceived to yield causal inferences and parameter estimates that are more credible than those from other empirical methods that do not involve the comparison of randomly selected treatment and control groups. RCTs are seen as largely exempt from many of the econometric problems that characterize observational studies. When RCTs are not feasible, researchers often mimic randomized designs by using observational data to construct two groups that, as far as possible, are identical and differ only in their exposure to treatment.

The preference for randomized trials has spread beyond trialists to the general public and the media, which typically reports favorably on them. They are seen as accurate, objective, and largely independent of “expert” knowledge that is often regarded as manipulable, politically biased, or otherwise suspect. There are now “What Works” centers using and recommending RCTs in a huge range of areas of social concern across Europe and the Anglophone world, such as the US Department of Education’s What Works Clearing House, The Campbell Collaboration (parallel to the Cochrane Collaboration in health), the Scottish Intercollegiate Guidelines Network (SIGN), the US Department of Health and Human Services Child Welfare Information Gateway, the US Social and Behavioral Sciences Team, and others. The British government has established eight new (well-financed) What Works Centers similar to the National Institute for Health and Care Excellence (NICE), with more planned. They extend NICE’s evaluation of health treatment into aging, early intervention, education, crime, local economic growth, Scottish service delivery, poverty, and wellbeing. These centers see randomized controlled trials as their preferred tool. There is a widespread desire for careful evaluation (to support what is sometimes called the “audit society”) and everyone assents to the idea that policy should be based on evidence of effectiveness, for which randomized trials appear to be ideally suited. Trials are easily, if not very precisely, explained along the lines that random selection generates two otherwise identical groups, one treated and one not; results are easy to compute, since all we need is the comparison of two averages; and unlike other methods, the trial seems to require no specialized understanding of the subject matter. It seems a truly general tool that (nominally) works in the same way in agriculture, medicine, sociology, economics, politics, and education. It is supposed to require no prior knowledge, whether suspect or not, which is seen as a great advantage.

In this paper, we present two sets of arguments, one on conducting RCTs and on how to interpret the results, and one on how to use the results once we have them. Although we do not care for the terms (for reasons that will become apparent), the two sections correspond roughly to internal and external validity.

Randomized controlled trials are often useful, and have been important sources of empirical evidence for causal claims and evaluation of effectiveness in many fields. Yet many of the popular interpretations, not only among the general public but also among trialists, are incomplete and sometimes misleading, and these misunderstandings can lead to unwarranted trust in the impregnability of results from RCTs, to a lack of understanding of their limitations, and to mistaken claims about how widely their results can be used. All these, in turn, can lead to flawed policy recommendations.

Among the misunderstandings are the following: (a) randomization ensures a fair trial by ensuring that, at least with high probability, treatment and control groups differ only in the treatment; (b) RCTs provide not only unbiased estimates of average treatment effects, but also precise estimates; (c) randomization is necessary to solve the selection problem; (d) lack of blinding, which is common in social science experiments, does not seriously compromise inference; (e) statistical inference in RCTs, which requires only the simple comparison of means, is straightforward, so that standard significance tests are reliable.

While many of the problems of RCTs are shared with observational studies, some are unique, for example the fact that randomizing itself can change outcomes independently of treatment. More generally, it is almost never the case that an RCT can be judged superior to a well-conducted observational study simply by virtue of being an RCT. The idea that all methods have their flaws, but RCTs always have fewest, is one of the deepest and most pernicious misunderstandings.

In the second part of the paper, we discuss the uses and limitations of results from RCTs for making policy. The non-parametric and theory-free nature of RCTs, which is arguably an advantage in estimation, is a serious disadvantage when we try to use the results outside of the context in which they were obtained. Much of the literature, in economic development and elsewhere, perhaps inspired by Campbell and Stanley’s (1963) famous “primacy of internal validity,” assumes that internal validity is enough to guarantee the usefulness of the estimates in different contexts. Without understanding RCTs within the context of the knowledge that we already possess about the world, much of it obtained by other methods, we do not know how to use trial results. But once the commitment has been made to seeing RCTs within this broader structure of knowledge and inference, and when they are designed to fit within it, they can play a useful role in building general knowledge and policy predictions; for example, an RCT can be a good way of estimating a key policy magnitude. The broader context within which RCTs need to be set includes not only models of economic structure, but also the previous experience that policymakers have accumulated about local settings and implementation. Most importantly for economic development, the use of RCT results should be sensitive to what people want, both individually and collectively. RCTs should not become yet another technical fix that is imposed on people by bureaucrats or foreigners; RCT results need to be incorporated into a democratic process of public reasoning, Sen (2011). Greenberg, Shroder, and Onstott (1999) document that, even before the recent wave of RCTs in development, most RCTs in economics have been carried out by rich people on poor people, and that fact should make us especially careful to avoid charges of paternalism.

Section 1: Interpreting the results of RCTs

1.1 Prolog

RCTs were first popularized by Fisher’s agricultural trials in the 1930s and are today often described by the Rubin counterfactual causal model, which itself traces back to Neyman in 1923; see Freedman (2006) for a description of the history. Each unit i (a person, a pupil, a school, an agricultural plot) is assumed to have two possible outcomes, $Y_{i0}$ and $Y_{i1}$, the former occurring if there is no treatment at the time in question, the latter if the unit is treated. The difference between the two outcomes, $Y_{i1} - Y_{i0}$, is the individual treatment effect, which we shall denote $\beta_i$. Treatment effects are typically different for different units. No unit can be both treated and untreated at the same time, so only one or other of the outcomes occurs; the other is counterfactual so that individual treatment effects are in principle unobservable.

We note parenthetically that while we use the counterfactual framework here, we do not endorse it, nor argue against other approaches that do not use it, such as the Cowles commission econometric framework where the causal relations are coded as structural equations, see also Pearl (2009). Imbens and Wooldridge (2009, Introduction) provide an eloquent defense of the Rubin formulation, emphasizing the credibility that comes from a theory-free specification with unlimited heterogeneity in treatment effects. Heckman and Vytlacil (2007, Introduction) make an equally eloquent case against, noting that the treatments in RCTs are often unclearly specified and that the treatment effects are hard to link to invariant parameters that would be useful elsewhere.

The basic theorem governing RCTs is a remarkable one. It states that the average treatment effect is the average outcome in the treatment group minus the average outcome in the control group. While we cannot observe the individual treatment effects, we can observe their mean. The estimate of the average treatment effect (ATE) is simply the difference between the means in the two groups, and it has a standard error that can be estimated and used to make significance statements according to the statistical theory that applies to the difference of two means, on which more below in Section 1.3. The difference in means is an unbiased estimator of the mean treatment effect.
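The unbiasedness claim can be checked numerically. The sketch below is only an illustration under stated assumptions: the potential outcomes, the sample size of 40, and the function names `difference_in_means` and `simulate_rct` are all hypothetical and not from the paper. It fixes one study population, re-randomizes it many times, and compares the average of the difference-in-means estimates with the true ATE of that population.

```python
import random
import statistics


def difference_in_means(y0, y1, treated):
    """Difference in observed mean outcomes for one random assignment."""
    treated_outcomes = [y1[i] for i in range(len(y0)) if i in treated]
    control_outcomes = [y0[i] for i in range(len(y0)) if i not in treated]
    return statistics.fmean(treated_outcomes) - statistics.fmean(control_outcomes)


def simulate_rct(n_units=40, n_trials=5000, seed=0):
    rng = random.Random(seed)
    # Hypothetical potential outcomes; the treatment effects differ across units.
    y0 = [rng.gauss(10.0, 3.0) for _ in range(n_units)]
    effects = [rng.expovariate(1.0) for _ in range(n_units)]
    y1 = [base + effect for base, effect in zip(y0, effects)]
    true_ate = statistics.fmean(effects)

    estimates = []
    for _ in range(n_trials):
        # Each trial treats a fresh random half of the same population.
        treated = set(rng.sample(range(n_units), n_units // 2))
        estimates.append(difference_in_means(y0, y1, treated))
    return true_ate, statistics.fmean(estimates), statistics.pstdev(estimates)


true_ate, mean_estimate, spread = simulate_rct()
print(f"true ATE {true_ate:.3f}, mean estimate {mean_estimate:.3f}, spread {spread:.3f}")
```

The average over re-randomizations should sit close to the true ATE, while the spread shows how far any single trial can land from it.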

The theorem is remarkable because it requires so few assumptions; no model is required, no assumptions about covariates are needed, the treatment effects can be heterogeneous, and nothing is assumed about the shapes of statistical distributions other than the statistical question of the existence of the mean of the counterfactual outcome values. In terms of one of our running themes, it requires no expert knowledge, nor any acceptance of priors, expert or otherwise. The theorem also has its limitations; the proof uses the fact that the difference in two means is the mean of the individual differences, i.e. the treatment effects. This is not true for the median (the difference in two medians is not the median of the differences, which is the median treatment effect). It also does not allow us to estimate any percentile of the distribution of treatment effects, or its variance. (Quantile estimates of treatment effects are not the quantiles of the distribution of treatment effects, but the differences in the quantiles of the two marginal distributions of treatments and controls; the two measures coincide if the experiment has no effect on ranks, an assumption that would be convenient but is hard to justify, at least in general.) All of these statistics can be of interest for policy but RCTs are not informative about them, or at least not without further assumptions, for example on the distribution of treatment effects, see Heckman, Smith, and Clements (1997), and much of the attraction of RCTs is the absence of such assumptions.
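A three-unit example makes the contrast between means and medians concrete. The numbers are hypothetical and chosen only for illustration; both potential outcomes are written down for every unit, which no real trial could observe.

```python
import statistics

# Hypothetical potential outcomes for three units.
y0 = [1.0, 2.0, 3.0]   # outcome without treatment
y1 = [4.0, 2.0, 3.0]   # outcome with treatment; only the first unit responds
effects = [treated - untreated for untreated, treated in zip(y0, y1)]

mean_of_effects = statistics.fmean(effects)
difference_of_means = statistics.fmean(y1) - statistics.fmean(y0)
median_of_effects = statistics.median(effects)
difference_of_medians = statistics.median(y1) - statistics.median(y0)

print(mean_of_effects, difference_of_means)      # 1.0 1.0
print(median_of_effects, difference_of_medians)  # 0.0 1.0
```

The means agree, but the median treatment effect is 0 while the difference in medians is 1. The treatment here moves the first unit from the bottom rank to the top, which is exactly the kind of rank change that breaks the link between the two quantities.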

The basic theorem tells us that the difference in means is an unbiased estimator of the average treatment effect but says nothing about the variance of this estimator. In general, a biased estimator that is typically closer to the truth will often be better than an unbiased estimator that is typically wide of the truth. There is nothing to say that a non-RCT estimator, in spite of bias, might not have a lower mean squared error (MSE), one measure of the distance of the estimate from the truth, or a lower value of a “loss function” that defines the loss to the experimenter of missing the target.

It is useful to think of the average treatment effect from an RCT in terms of sampling from a finite population, as when the Bureau of the Census estimates average income of the US population in 2013. For the RCT, the population is the population of units whose average treatment effect is of interest; note the importance of defining the population of interest because, given the heterogeneity of treatment effects, the average treatment effect will vary across different populations, just as average incomes differ across different subpopulations of the US. Finite population sampling theory tells us how to get accurate estimates of means from samples; in the RCT case, the sample is the study sample, both treatments and controls. In principle, the study sample could be a random sample of the parent population of interest, in which case it is representative of it, but that is seldom the case. Because the estimate is population specific, it is not (or need not be) thought of as the parameter of a super-population, or otherwise generalizable in any way. Average income in the US in 2013 may be of interest in its own right; but it will not be the same as average income in 2014, nor will it be the same as average income of whites, or of the populations of Wyoming or New York. Exactly the same is true of the estimate of an average treatment effect; it applies to the study sample in which the trial was done, at the time when it was done, and its use outside of those confines, though often possible, requires argument and justification. Without such an argument, we cannot claim that an ATE is “the” mean treatment effect any more than that average income in the US in 2013 is “the” average income of the US in any other year. Of course, knowing average income in 2013 can be useful for making other calculations, such as an estimate of income in 2014, or of a subpopulation that we know is richer or poorer; the fact that an estimate does not universally generalize does not make it useless. We shall return to these issues in Section 2.

1.2. Precision, balance, and randomization

1.2.1 Precision and bias

We should like our estimate of the average treatment effect to be as close to the truth as possible. One way to assess closeness is the mean square error (MSE), defined as

$$\mathrm{MSE} = E(\hat{\theta} - \theta)^2 \qquad (1)$$

where $\theta$ is the true average treatment effect, and $\hat{\theta}$ is its estimate from a particular trial. The expectation is taken over repeated randomizations of treatments and controls using the same study population. It is also standard to rewrite (1) as

$$\mathrm{MSE} = E\big(\hat{\theta} - E(\hat{\theta})\big)^2 + \big(E(\hat{\theta}) - \theta\big)^2 = \mathrm{var}(\hat{\theta}) + \mathrm{bias}(\hat{\theta}, \theta)^2 \qquad (2)$$

so that mean square error is the sum of the variance of the estimator, which we typically know something about from the estimated standard error, and the square of the bias, which in the case of a(n ideal) randomized controlled trial is zero. The elementary, but crucial point is that, while it is certainly good that the bias is zero, that fact does nothing to make the distance from the truth as small as it might be, which is what we really care about. An unbiased estimator that is nearly always wide of the target is not as useful as one that is always near to it, even if, on average, it is off center. More generally, it will often be desirable to trade in some unbiasedness for greater precision. Experiments are often expensive, so we cannot always rely on large samples to bring the estimate close to the truth and resolve these issues for us. Much of this Section is concerned with how to design experiments to maximize precision.

Unbiasedness alone cannot therefore justify the often-expressed preference for RCTs over other estimators. The minimalist assumptions required for an RCT to be unbiased are an attraction although, as we shall see in this Section, this advantage usually comes at the cost of lowered precision and, as we shall see in Section 2, of difficulties in knowing how to use the result. Yet there is an often expressed belief that RCTs are somehow guaranteed to be precise, simply because they are RCTs. Occasionally bias and precision are explicitly confused; the JPAL website, in its explanation of why it is good to randomize, says that RCTs “are generally considered the most rigorous and, all else equal, produce the most accurate (i.e. unbiased) results.” Shadish, Cook, and Campbell (2002, p. 276), in what is (rightly) considered one of the bibles of causal inference in social science, state without qualification that “randomized experiments provide a precise answer about whether a treatment worked” (p. 276) and “The randomized experiment is often the preferred method for obtaining a precise and statistically unbiased estimate of the effects of an intervention,” (p. 277) our italics.

Contrast this with Cronbach et al (1980), who quote Kendall’s (1957) pastiche of Longfellow, “Hiawatha designs an experiment,” where Hiawatha’s insistence on unbiasedness leads to his never hitting the target and to his eventual banishment.
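The decomposition of mean square error into variance plus squared bias, and the Hiawatha point that an unbiased estimator can still be the worse one, can be illustrated with a small simulation. This is a sketch under assumed numbers: the two estimators, their spreads, and the helper name `mse_parts` are hypothetical and stand in for no particular study design.

```python
import random
import statistics


def mse_parts(estimates, truth):
    """Return (mse, variance, bias) of a list of estimates around the truth."""
    mean_estimate = statistics.fmean(estimates)
    variance = statistics.pvariance(estimates, mu=mean_estimate)
    bias = mean_estimate - truth
    mse = statistics.fmean([(e - truth) ** 2 for e in estimates])
    return mse, variance, bias


rng = random.Random(1)
truth = 1.0
# Hypothetical estimators: one unbiased but noisy, one off center but tight.
unbiased_noisy = [rng.gauss(truth, 2.0) for _ in range(10000)]
biased_precise = [rng.gauss(truth + 0.5, 0.5) for _ in range(10000)]

for name, draws in [("unbiased, noisy", unbiased_noisy),
                    ("biased, precise", biased_precise)]:
    mse, variance, bias = mse_parts(draws, truth)
    print(f"{name}: MSE {mse:.3f} = var {variance:.3f} + bias^2 {bias ** 2:.3f}")
```

With these assumed spreads the biased estimator has the smaller mean square error, so it is typically closer to the truth even though it is off center on average.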

1.2.2 Balance and precision in a linear all-cause model

A useful way to think about precision and what an RCT does and does not do is to use a schematic linear causal model of the form:

$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (3)$$

where, as before, $Y_i$ is the outcome for unit i, $T_i$ is a dichotomous (1,0) treatment dummy indicating whether or not i is treated, and $\beta_i$ is the individual treatment effect of the treatment on i. The x’s are the observed or unobserved other causes of the outcome, and we suppose that (3) captures all the causes of $Y_i$. J may be very large. Because the heterogeneity of the individual treatment effects $\beta_i$ is unrestricted, we allow the possibility that the treatment interacts with the x’s or other variables, so that the effects of T can depend on any other variables, and we shall have occasion to make this explicit below. An obvious and important example is when the treatment is effective only in the presence of a particular value of one of the x’s.

We do not need i subscripts on the $\gamma$’s that control the effects of the other causes; if their effects differ across individuals, we include the interactions of individual characteristics with the original x’s as new x’s. Given that the x’s can be unobservable, this is not restrictive. Because the $\beta$’s can depend on the x’s, the effects of the x’s on the outcome can depend on $T_i$, or, equivalently, the effects of treatment can depend on covariates.

In an experiment, with or without randomization, we can represent the treatment group as having $T_i = 1$ and the control group as having $T_i = 0$. So when we subtract the average outcomes among the controls from the average outcomes among the treatments, we will get

$$\bar{Y}_1 - \bar{Y}_0 = \bar{\beta}_1 + \sum_{j=1}^{J} \gamma_j \left(\bar{x}^{1}_{ij} - \bar{x}^{0}_{ij}\right) = \bar{\beta}_1 + (\bar{S}_1 - \bar{S}_0) \qquad (4)$$

The first term on the far right hand side, which is the average treatment effect, is what we want, but the second term or error term, which is the sum of the net average balances of other causes across the two groups, will generally be non-zero (because of selection or many other reasons) and needs to be dealt with somehow. We get what we want when the means of all the other causes are identical in the two groups, or more precisely when the sum of their net differences $\bar{S}_1 - \bar{S}_0$ is zero; this is the case of perfect balance. With perfect balance, the difference between the two means is exactly equal to the average of the treatment effect among the treated, so that we have the ultimate precision and we know the answer exactly, at least in this linear case.
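The accounting identity behind the difference in means in the linear all-cause model can be verified directly on simulated data. The sketch below is illustrative only: the number of causes, the distributions of the coefficients, and the function name `one_trial` are assumptions made for the example, not part of the paper.

```python
import random
import statistics


def one_trial(n_units=200, n_causes=5, seed=0):
    rng = random.Random(seed)
    gammas = [rng.uniform(-2.0, 2.0) for _ in range(n_causes)]
    x = [[rng.gauss(0.0, 1.0) for _ in range(n_causes)] for _ in range(n_units)]
    beta = [rng.uniform(0.0, 2.0) for _ in range(n_units)]  # heterogeneous effects
    treated = set(rng.sample(range(n_units), n_units // 2))

    # Net contribution of the other causes for each unit: sum_j gamma_j * x_ij.
    s = [sum(g * v for g, v in zip(gammas, row)) for row in x]
    # Observed outcome from the linear all-cause model: beta_i * T_i + s_i.
    y = [(beta[i] if i in treated else 0.0) + s[i] for i in range(n_units)]

    def mean_over(values, want_treated):
        return statistics.fmean(
            [v for i, v in enumerate(values) if (i in treated) == want_treated]
        )

    diff_in_means = mean_over(y, True) - mean_over(y, False)
    avg_effect_treated = mean_over(beta, True)
    imbalance = mean_over(s, True) - mean_over(s, False)
    return diff_in_means, avg_effect_treated, imbalance


diff_in_means, avg_effect_treated, imbalance = one_trial()
print(diff_in_means, avg_effect_treated + imbalance)
```

The two printed numbers agree up to floating-point rounding: the difference in means is the average effect among the treated plus whatever net imbalance in the other causes this particular assignment happened to produce.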

1.2.3 Balancing acts: real and magical

How do we get balance, or something close to it? What, exactly, is the role of randomization? In a laboratory experiment, where there is good background knowledge of the other causes, the experimenter has a good chance of controlling all of the other causes, aiming to ensure that the last term in (4) is close to zero. Failing such knowledge and control, an alternative is matching, frequently used in statistical, medical, and econometric work. For each treatment, a match is found that is as close as possible on all suspected causes, so that, once again, the last term in (4) can be kept small. Again, when we have a good idea of the causes, matching may also deliver a precise estimate. Of course, when there are important unknown or unobservable causes, neither laboratory control nor matching offers protection.

What does randomization do? Because the treatments and controls come from the same underlying distribution, randomization guarantees, by construction, that the last term on the right in (4) is zero in expectation at baseline (much can happen to disturb this beyond baseline). This is true whether or not the causes are observed. If the RCT is repeated many times on the same trial population, then the last term will be zero when averaged over an infinite number of (entirely hypothetical) trials. Of course, this does nothing to make it zero in any one trial where the difference in means will be equal to the average treatment effect among those treated plus a term that reflects the imbalance in the net effects of the other causes. We do not know the size of this error term, and there is nothing in the randomization that limits its size; by chance, there can be one (or more) important excluded cause(s) that is very unequally distributed between treatment and controls. This imbalance will vary over replications of the trial, and its average size will ideally be captured by the standard error of the estimated ATE, which gives us some idea of how likely we are to be away from the truth. Getting the standard error and associated significance statements right is therefore of great importance.
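The difference between balance in expectation and balance in a single trial shows up clearly when one population is re-randomized many times. In this sketch the net effect of the other causes is a hypothetical fixed number per unit, and the population size of 50 and the name `imbalance_over_randomizations` are assumptions for illustration.

```python
import random
import statistics


def imbalance_over_randomizations(n_units=50, n_trials=5000, seed=2):
    rng = random.Random(seed)
    # Hypothetical net effect of all other causes, fixed for the trial population.
    s = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
    total = sum(s)
    half = n_units // 2
    imbalances = []
    for _ in range(n_trials):
        treated = rng.sample(range(n_units), half)
        treated_sum = sum(s[i] for i in treated)
        # Mean among treated minus mean among controls for this assignment.
        imbalances.append(treated_sum / half - (total - treated_sum) / (n_units - half))
    return imbalances


imbalances = imbalance_over_randomizations()
mean_imbalance = statistics.fmean(imbalances)
typical_size = statistics.pstdev(imbalances)
largest = max(abs(v) for v in imbalances)
print(f"mean {mean_imbalance:.4f}, typical size {typical_size:.3f}, largest {largest:.3f}")
```

The mean over assignments is close to zero, which is what randomization delivers, yet the typical single-trial imbalance is far from zero and the worst case is larger still.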

Exactly what randomization does is frequently lost in the practical literature, and there is often a confusion between perfect control, on the one hand (as in a laboratory experiment or perfect matching with no unobservable causes), and control in expectation, which is what RCTs do. We suspect that at least some of the popular and professional enthusiasm for RCTs, as well as the belief that they are precise by construction, comes from misunderstandings about balance. These misunderstandings are not so much among the trialists who, when pressed, will give a correct account, but come from imprecise statements by trialists that are taken as gospel by the lay audience that the trialists are keen to reach.

Such a misunderstanding is well captured by the following quote from the World Bank’s online manual on impact evaluation:

“We can be very confident that our estimated average impact, given as the difference between the outcome under treatment (the mean outcome of the randomly assigned treatment group) and our estimate of the counterfactual (the mean outcome of the randomly assigned comparison group) constitute the true impact of the program, since by construction we have eliminated all observed and unobserved factors that might otherwise plausibly explain the difference in outcomes.” Gertler et al (2011) (our italics.)

This statement confuses actual balance in any single trial with balance in expectation over many entirely hypothetical trials. If the statement above were true, and if all factors were indeed controlled (and no imbalances were introduced post randomization), the difference would be an exact measure of the average treatment effect, at least in the absence of measurement error. We should not only be confident of our estimate; we would know the truth, as the quote says.

A similar quote comes from John List, one of the most imaginative and successful scholars who use RCTs:

“complications that are difficult to understand and control represent key reasons to conduct experiments, not a point of skepticism. This is because randomization acts as an instrumental variable, balancing unobservables across control and treatment groups.” Al-Ubaydli and List (2013) (italics in the original.)

And from Dean Karlan, founder and President of Yale’s Innovations for Poverty Action, which runs development RCTs around the world:

“As in medical trials, we isolate the impact of an intervention by randomly assigning subjects to treatments and control groups. This makes it so that all those other factors which could influence the outcome are present in treatment and control, and thus any difference in outcome can be confidently attributed to the intervention.” Karlan, Goldberg and Copestake (2009)

And from the medical literature, from a distinguished psychiatrist who is deeply skeptical of RCTs,

“The beauty of a randomized trial is that the researcher does not need to understand all the factors that influence outcomes. Say that an undiscovered genetic variation makes certain people unresponsive to medication. The randomizing process will ensure, or make it highly probable, that the arms of the trial contain equal numbers of subjects with that variation. The result will be a fair test.” (Kramer, 2016, p. 18)

Claims are even made that RCTs reveal knowledge without possibility of error. Judy Gueron, the long-time president of MDRC, which has been running RCTs on US government policy for 45 years, asks why federal and state officials were prepared to support randomization in spite of frequent difficulties and in spite of the availability of other methods, and concludes that it was because “they wanted to learn the truth,” Gueron and Rolston (2013, 429). There are many statements of the form “We know that [project X] worked because it was evaluated with a randomized trial,” Dynarski (2015).

Many writers are more cautious, and modify statements about treatment and control groups being identical with terms such as “statistically identical,” “reasonably similar” or do not differ “systematically.” And we have no doubt that all of the authors quoted above understand the need for these qualifications. But to the uninformed reader, the qualified statements are unlikely to be differentiated from the unqualified statements quoted above. Nor is it always clear what some of these terms mean. For example, if two people are selected at random from a population, and it so happens that one is female and one male, in what sense are they statistically identical? While it is true that they were randomly selected from the same parent distribution, which provides the basis for inference, the calculation of standard errors, and significance statements, it does nothing to help with balance or precision in any given trial.

1.2.4 Sample size and statistical inference in unbalanced trials

Is a single trial more likely to be balanced, and thus more precise, when the sample size is large? Indeed, as the sample size tends to infinity, the means of the x’s in the treatment and control groups will become arbitrarily close. Yet this is of little help in finite samples as Fisher (1926) noted: “Most experimenters on carrying out a random assignment will be shocked to find how far from equally the plots distribute themselves,” quoted in Morgan and Rubin (2012). Even with very large sample sizes, if there are a large number of causes, balance on each cause may be infeasible. Vandenbroucke (2004) notes that there are three billion base pairs in the human genome, many or all of which could be relevant prognostic factors for the biological outcome that we are seeking to influence.

However, as (4) makes clear, we do not need balance on all causes, only on their net effect, the term $\bar{S}_1 - \bar{S}_0$, which does not require balance on each cause individually. Yet there is no guarantee that even the net effect will be small. For example, there may only be one omitted unobserved cause whose effect is large, one single base pair say, so that if that one cause is unbalanced across treatments and controls, the fact that there is individual or even net balance on other less important causes is not going to help.

Statements about large samples guaranteeing balance are not useful without guidelines about how large is large enough, and such statements cannot be made without knowledge of other causes and how they affect outcomes.

A simple case illustrates. Suppose that there is one hidden cause in (3), a binary variable x that is unity with probability p and 0 otherwise. With n controls and n treatments, the difference in fractions with x=1 in the two groups has mean 0 and variance $2p(1-p)/n$. With n=100 and p=0.5, the standard error around 0 is about 0.07 so that, if this unobserved confounder has a large effect on the outcome, the imbalance could easily mask the effect of treatment, or be mistaken as evidence for the effectiveness of a truly ineffective treatment.
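The size of the chance imbalance in a hidden binary cause can be worked out and checked by simulation. The sketch assumes that each unit's x is an independent Bernoulli(p) draw, so that each group fraction has variance p(1-p)/n and the difference of two independent fractions has variance 2p(1-p)/n; the function names `analytic_se` and `simulated_se` are hypothetical.

```python
import math
import random
import statistics


def analytic_se(n, p):
    """Standard error of the difference in two independent sample fractions."""
    return math.sqrt(2.0 * p * (1.0 - p) / n)


def simulated_se(n, p, n_trials=5000, seed=3):
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        # Fraction of units with x = 1 in each arm of one hypothetical trial.
        treated_share = sum(rng.random() < p for _ in range(n)) / n
        control_share = sum(rng.random() < p for _ in range(n)) / n
        gaps.append(treated_share - control_share)
    return statistics.pstdev(gaps)


print(f"analytic {analytic_se(100, 0.5):.4f}, simulated {simulated_se(100, 0.5):.4f}")
```

With 100 units per arm and p = 0.5, a gap of seven percentage points in the hidden cause is an ordinary outcome, and gaps twice that size are not rare. Whether that matters depends on how strongly the hidden cause moves the outcome, which is exactly the knowledge the trial itself does not supply.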

Lack of balance in the above example or in the net effect of either observables or non-observables in (4) does not compromise the inference in an RCT in the sense of obtaining a standard error for the unbiased ATE, see Senn (2013) for a particularly clear statement. The randomization does not guarantee balance but it provides the basis for making probability statements about the various possible outcomes, which is also clear in the example in the previous paragraph. This was also Fisher’s argument for randomization. Senn writes “the probability calculation applied to a clinical trial automatically makes an allowance for the fact that the groups will almost certainly be unbalanced.” (italics in the original.) If the design is such that, even with perfect randomization, successive replications tend to generate large imbalances, the resulting imprecision of the ATE will show up in its standard error. Of course, the usefulness of this requires that the calculated standard errors permit correct significance statements, which, as we shall see in the next subsection, is often far from straightforward. In the example above, an extreme, but entirely possible, case occurs when, by chance, the unobserved confounder is perfectly correlated with the treatment; unless there are actual replications, the false certainty that such an experiment provides will be reinforced by false significance tests.

1.2.4 Testing for balance

In practice, trialists in economics (and in some other disciplines) usually carry out a statistical test for balance after randomization but before analysis, presumably with the aim of taking some appropriate action if balance fails. The first table of the paper typically presents the sample means of observable covariates (the observable x’s in (3), which are either causes in their own right or interact with the $\beta$’s) for the control and treatment groups, together with their differences, and tests for whether or not they are significantly different from zero, either variable by variable, or jointly. These tests are appropriate if we are concerned that the random number generator might have failed (because we are drawing playing cards, rolling dice, or spinning bottle tops, though presumably not if the randomization is done by a random number generator, always supposing that there is such a thing as randomness, Singer and Pincus (1998)), or if we are worried that the randomization is compromised by non-blinded subjects or trialists systematically undermining the allocation. Otherwise, as the next paragraph shows, the test makes no sense and is not informative, which does not seem to stop it being routinely used.

If we write $\mu_0$ and $\mu_1$ for the (vectors of) population means (i.e. the means over all possible randomizations) of the observed x’s in the control and treatment groups at the point of assignment, the null hypothesis is (presumably, as judged by the typical balance test) that the two vectors are identical, with the alternative being that they are not. But if the randomization has been correctly done, the null hypothesis is true by construction, see e.g. Altman (1985) and Senn (1994), which may help explain why it so rarely fails in practice. Indeed, although we cannot “test” it, we know that the null hypothesis is also true for the unobservable components of x. Note the contrast with the statements quoted above claiming that RCTs guarantee balance on causes across treatment and control groups. Those statements refer to balance of causes at the point of assignment in any single trial, which is not guaranteed by randomization, whereas the balance tests are about the balance of causes at the point of assignment in expectation over many trials, which is guaranteed by randomization. The confusion is perhaps understandable, but it is confusion nevertheless. Of course, it makes sense to look for balance between observed covariates using some more appropriate distance measure, for example the normalized difference in means, Imbens and Wooldridge (2009, equation 3).
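A normalized difference in means is simple to compute. The sketch below assumes the form in which the gap in group means is divided by the square root of the sum of the two sample variances; the exact scaling used in the cited equation should be checked against the source, and the covariate values and the function name `normalized_difference` are hypothetical.

```python
import math
import statistics


def normalized_difference(treated_values, control_values):
    """Difference in means scaled by the root of the summed sample variances."""
    gap = statistics.fmean(treated_values) - statistics.fmean(control_values)
    scale = math.sqrt(
        statistics.variance(treated_values) + statistics.variance(control_values)
    )
    return gap / scale


# Hypothetical covariate values (say, age in years) for a tiny trial.
treated_age = [31.0, 35.0, 42.0, 28.0, 39.0]
control_age = [30.0, 33.0, 41.0, 27.0, 36.0]
print(round(normalized_difference(treated_age, control_age), 3))
```

Because the scale is built from the spread of the covariate and not from a standard error, the measure describes how far apart the two groups actually are in this trial and does not grow mechanically with the sample size the way a t-statistic does.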

14

1.2.5Methodsforbalancing

Oneproceduretoimprovebalanceistoadaptthedesignbeforerandomization,forexampleby

stratification.Fisher,whoasthequoteaboveillustrates,waswellawareofthelossofprecision

fromrandomizationarguedfor“blocking”(stratification)inagriculturaltrialsorforusingLatin

Squares,bothofwhichrestricttheamountofimbalance.Stratification,tobeuseful,requires

somepriorunderstandingofthefactorsthatarelikelytobeimportant,andsoittakesusaway

fromthe“noknowledgerequired,”or“nopriorsaccepted”appealofRCTs.ButasScriven(1974,

103)notes:“causehunting,likelionhunting,isonlylikelytobesuccessfulifwehaveaconsider-

ableamountofrelevantbackgroundknowledge,”orevenmorestrongly,“nocausesin,no

causesout,”Cartwright(1994,Chapter2).StratificationinRCTs,asinotherformsofsampling,is

astandardmethodforusingbackgroundknowledgetoincreasetheprecisionofanestimator.It

hasthefurtheradvantagethatitallowsfortheexplorationofdifferentaveragetreatmentef-

fectsindifferentstratawhichcanbeusefulinadaptingortransportingtheresultstootherloca-

tions,seeSection2.

Stratification is not possible when there are too many covariates, or if each has many values, so that there are more cells than can be filled given the sample size. An alternative is to re-randomize, repeating the randomization until the distance between the observed covariates is less than some predetermined criterion. Morgan and Rubin (2012) suggest the Mahalanobis D-statistic, and use Fisher’s randomization inference (to be discussed further below) to calculate standard errors that take the re-randomization into account. Another alternative, widely adopted in practice, is to adjust for covariates by running a regression (or covariance) analysis, with the outcome on the left-hand side and the treatment dummy and the covariates as explanatory variables, including possible interactions between covariates and treatment dummies.
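The re-randomization idea can be sketched as follows. This is only an illustration under stated assumptions: in place of the Mahalanobis statistic it uses a simpler stand-in, the largest absolute normalized difference in means across covariates, and the threshold, the function names and the simulated covariates are all hypothetical.

```python
import math
import random
import statistics


def max_normalized_difference(covariates, treated):
    """Largest absolute normalized difference in means across covariates.

    A stand-in balance measure, NOT the Mahalanobis statistic.
    """
    worst = 0.0
    for column in zip(*covariates):
        x1 = [v for v, t in zip(column, treated) if t]
        x0 = [v for v, t in zip(column, treated) if not t]
        gap = abs(statistics.fmean(x1) - statistics.fmean(x0))
        scale = math.sqrt(statistics.variance(x1) + statistics.variance(x0))
        ratio = gap / scale if scale > 0 else (math.inf if gap > 0 else 0.0)
        worst = max(worst, ratio)
    return worst


def rerandomize(covariates, threshold, rng, max_draws=10_000):
    """Redraw an equal split until the balance measure is below threshold."""
    n_units = len(covariates)
    labels = [True] * (n_units // 2) + [False] * (n_units - n_units // 2)
    for draw in range(1, max_draws + 1):
        rng.shuffle(labels)
        if max_normalized_difference(covariates, labels) <= threshold:
            return list(labels), draw
    raise RuntimeError("no acceptable assignment found; loosen the threshold")


# Hypothetical trial sample: 40 units with two observed covariates.
rng = random.Random(0)
covariates = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
assignment, draws = rerandomize(covariates, threshold=0.1, rng=rng)
print(draws, max_normalized_difference(covariates, assignment))
```

Because acceptable assignments are a selected subset of all assignments, conventional standard errors no longer apply, which is why the inference has to take the re-randomization into account.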

Freedman (2008) has analyzed this method and argues “if adjustment made a substantial difference, we would suggest much caution when interpreting the results.” But a substantial difference is exactly what we would like to see, at least some of the time, if the adjustment moves the estimate closer to the truth. Freedman shows that the adjusted estimate of the ATE is biased in finite samples, with the bias depending on the correlation between the squared treatment effect and the covariates. There is also no general guarantee that the regression adjustment will generate a more precise estimate, although it will do so if there are equal numbers of treatments and controls or if the treatment effects are constant over units (in which case there will also be no bias). Even with bias, the regression adjustment is attractive if it does indeed trade off bias for precision, though presumably not to RCT purists for whom unbiasedness is the sine qua non. Note again that the increased precision, when it exists, comes from using prior knowledge about the variables that are likely to be important for the outcome. That the background knowledge or theory is widely shared and understood will also provide some protection against data mining by searching through covariates in the search for (perhaps falsely) estimated precision.

1.2.6 Should we randomize?

The tension between randomization and precision goes back to the early debate between Fisher and Student (Gosset), who never accepted Fisher’s arguments for randomization, see also Ziliak (2014). In his debate with Fisher about agricultural trials, Student argued that randomization ignored relevant prior information, for example about how likely confounders would be distributed across the test plots, so that randomization wasted resources and led to unnecessarily poor estimates. This general question of whether randomization is desirable has been reopened in recent papers by Kasy (2016), Banerjee, Chassang, and Snowberg (2016) and Banerjee, Chassang, Montero, and Snowberg (2016).

Refer back to the MSE introduced above, and consider designing an experiment that will make this as small as possible. Unfortunately, this is not generally possible; for example, the “estimator” of 3, say, for the ATE has the lowest possible mean-squared error if the true ATE is actually 3. Instead, we need to average the MSE over a distribution of possible ATEs. This leads to a decision theory approach to estimation whereby a Bayesian econometrician will estimate the ATE by choosing the allocation of treatment and controls so as to minimize the expected value of a loss function, the MSE being one example. Such an approach requires us to specify a prior on the ATE, or more generally, on the expectation of outcomes conditional on the covariates. These priors are formal versions of the issue that has already come up repeatedly, that to get good estimators, we need to know something about how the covariates affect the outcome. Kasy (2016) solves this problem for the case of expected MSE and shows that randomization is undesirable; it simply adds noise and makes the MSE larger. He uses a non-parametric prior that has proved useful in a number of other applications (we could presumably do even better if we were prepared to commit further), and he provides code to implement his method, which shows a 20 percent reduction in MSE compared with randomization (14 percent for stratified randomization) for the well-known Tennessee STAR class-size experiment.

Banerjee et al propose a more general loss function and prove the comparable theorem, that randomization leads to larger losses than the optimal non-random purposive assignment. These authors recommend randomization on other grounds, which we will discuss below, but agree that, for standard statistical efficiency or maximization of expected utility, randomization should not be used in experimental design. Student was right.

Several points should be noted. First, the anti-randomization theorem is not a justification of any non-experimental design, for example one that compares outcomes of those who do or do not self-select into treatment. Selection effects are real enough, and if selection is based on unobservable causes, comparison of treated and controls will be biased. One acceptable non-random scheme is to use the observable covariates to divide the study sample into cells within which all observations have the same value and then divide each cell into treatments and controls. Within each cell, or for those units on which we have no information, we can choose any way we like, including randomly, though randomization has no advantage or disadvantage. Such allocations rule out self-selection (or doctor or program administrator selection) where the individual (doctor, or administrator) has information not visible to the person assigning treatments and controls. The key is that the person who makes the assignment (the analyst) uses all of the information that he or she possesses, and that once this has been taken into account, all units are interchangeable conditional on that information, so that assignment beyond that does not matter. Of course, the program administrators must enforce the analyst’s assignment, so that private information that they or the units possess is not allowed to affect the assignment, conditional on the information used by the analyst. Given this, selection on unobservables is ruled out, and does not affect the results. Randomization is not required to eliminate selection bias.

Whether it is really possible for the analyst to assign arbitrarily is an open question, as is whether “randomization” from a random-number generator will do so. Even machine-generated sequences have causes, and even if the analyst has only a set of uninformative labels for the units, those too must come from somewhere, so that it is possible that those causes are linked to the unobserved causes in the experiment. We do not attempt to deal here with these deep issues on the meaning of randomization, but see Singer and Pincus (1998).

According to Chalmers (2001) and Bothwell and Podolsky (2016), the development of randomization in medicine originated with Bradford-Hill, who used randomization in the first RCT in medicine, the streptomycin trial, because it prevented doctors selecting patients on the basis of perceived need (or against perceived need, leaning over backward as it were), an argument more recently echoed by Worrall (2007). Randomization serves this purpose, but so do other non-discretionary schemes; what is required is that the hidden information not affect the allocation. While it is true that doctors cannot be allowed to make the assignment, it is not true that randomization is the only scheme that can be enforced.

Second, the ideal rules by which units are allocated to treatment or control depend on the covariates, and on the investigators’ priors about how the covariates affect the outcomes. This opens up all sorts of methods of inference that are excluded by pure randomization. For example, the hypothetico-deductive method works by using theory to make a prediction that can be taken to the data; here the predictions would be of the form that a unit with characteristics x will respond in a particular way to treatment, falsification of which can be tested by an appropriate allocation of units to treatment. Banerjee, Chassang and Snowberg (2016) provide such examples.

Third, randomization, by running roughshod over prior information from theory and from the covariates, is wasteful and even unethical when it unnecessarily exposes people, or unnecessarily many people, to possible harm in a risky experiment, see Worrall (2002) for an egregious case of how an unthinking demand for randomization and the refusal to accept prior information put children’s lives directly at risk.

Fourth, the non-random methods use prior information, which is why they do better than randomization. This is both an advantage and a disadvantage, depending on one’s perspective. If prior information is not widely accepted, or is seen as non-credible by those we are seeking to persuade, we will generate more credible estimates if we do not use those priors. Indeed, this is why Banerjee, Chassang and Snowberg (2016) recommend randomized designs, including in medicine and in development economics. They develop a theory of an investigator who is facing an adversarial audience that will challenge any prior information and can even potentially veto results that are based on it (think administrative agencies or journal referees). The experimenter trades off his or her own desire for precision (and preventing possible harm to subjects), which uses prior information, against the wishes of the audience, who want nothing of the priors. Even then, the approval of this audience is only ex ante; once the fully randomized experiment has been done, nothing stops critics arguing that, in fact, the randomization did not offer a fair test. Among doctors who use RCTs, and especially meta-analysis, such arguments are (appropriately) common; see again Kramer (2016).

As we noted in the Introduction, much of the public has come to question expert prior knowledge, and Banerjee, Chassang, Montero and Snowberg (2016) have provided an elegant (positive) account of why RCTs will flourish in such an environment. In cases where there is good reason to doubt the good faith of experimenters, as in some pharmaceutical trials, randomization will indeed be the appropriate response. But we believe such arguments are deeply destructive for scientific endeavor and should be resisted as a general prescription for scientific research. Economists and other social scientists know a great deal, and there are many areas of theory and prior knowledge that are jointly endorsed by large numbers of knowledgeable researchers. Such information needs to be built on and incorporated into new knowledge, not discarded in the face of aggressive know-nothing ignorance. The systematic refusal to use prior knowledge and the associated preference for RCTs are recipes for preventing cumulative scientific progress. In the end, it is also self-defeating; to quote Rodrik (2016), “the promise of RCTs as theory-free learning machines is a false one.”

1.3 Statistical inference in RCTs

If we are to interpret the results of an RCT as demonstrating the causal effect of the treatment in the trial population, we must be able to tell whether the difference between the control and treatment means could have come about by chance. Any conclusion about causality is hostage to our ability to calculate standard errors and accurate p-values. But this is not generally possible without assumptions that go beyond those needed to support the basic theorem of RCTs. In particular, it has long been known that the mean (and a fortiori the difference between two means) is a statistic that is sensitive to outliers. Indeed Bahadur and Savage (1956) demonstrate that, without restrictions on the parent distributions, standard t-tests are inherently unreliable.

The key problem here is skewness; standard t-tests break down in distributions with large skewness, see Lehmann and Romano (2005, p. 466–8). In consequence, RCTs will not work well when the distribution of the individual treatment effects is strongly asymmetric, at least if the standard two-sample t-statistics (or equivalently White’s (1980) heteroskedastic robust regression t-values) are used. While we may be willing to assume that treatment effects are symmetric in some cases, the need for such an assumption, which requires prior knowledge about the specific process being studied, undermines the argument that RCTs are largely assumption free and do not depend on such knowledge. There is a deep irony here. In the search for robustness and the desire to do away with unnecessary assumptions, the RCT can deliver the mean of the ATE, yet the mean (as opposed to the median, which cannot be estimated by an RCT) does not permit robust probability statements about the estimates of the ATE.

How difficult is it to maintain symmetry? And how badly is inference affected when the distribution of treatment effects is not symmetric? In economics, many trials have outcomes valued in money. Does an anti-poverty innovation, for example microfinance, increase the incomes of the participants? Income itself is not symmetrically distributed, and this might be true of the treatment effects too, if there are a few people who are talented but credit-constrained entrepreneurs and who have treatment effects that are large and positive, while the vast majority of borrowers fritter away their loans, or at best make positive but modest profits. Another important example is expenditures on healthcare. Most people have zero expenditure in any given period, but among those who do incur expenditures, a few individuals spend huge amounts that account for a large share of the total. Indeed, in the famous Rand health experiment, Manning, Newhouse et al. (1987, 1988), there is a single very large outlier. The authors realize that the comparison of means across treatment arms is fragile, and, although they do not see their problem exactly as described here, they obtain their preferred estimates using a structural approach that is designed to explicitly model the skewness of expenditures.

In some cases, it will be appropriate to deal with outliers by trimming, eliminating observations that have large effects on the estimates. But if the experiment is a project evaluation designed to estimate the net benefits of a policy, the elimination of genuine outliers, as in the Rand Health Experiment, will vitiate the analysis. It is precisely the outliers that make or break the program.

1.3.1 Spurious statistical significance: an illustrative example

We consider an example that illustrates what can happen in a realistic but simplified case. There is a parent population, or population of interest, defined as the collection of units for which we would like to estimate an average treatment effect. It might be all villages in India, or all recipients of food subsidies, or all users of healthcare in the US. From this population we have a sample that is available for randomization, the trial or experimental sample; in a randomized controlled trial, this will subsequently be randomly divided into treatments and controls. Ideally, the trial sample would be randomly selected from the parent population, so that the sample average treatment effect would be an unbiased estimator of the population average treatment effect; indeed in some cases the complete population of interest is available for the trial. Clearly, in these ideal cases, it is straightforward to use standard sampling theory to generalize the trial results from the sample to the population. However, for a number of practical and conceptual reasons, the trial sample is rarely either the whole population or a randomly selected subset, see Shadish et al (2002, pp. 341–8) for a good discussion of both practical and theoretical obstacles.

In our illustrative example, there is a parent population each member of which has his or her own treatment effect; these are continuously distributed with a shifted lognormal distribution with zero mean, so that the population average treatment effect is zero. The individual treatment effects β are distributed so that β + e^{0.5} ∼ Λ(0,1), for standardized lognormal distribution Λ, whose mean is e^{0.5}. We have something like a microfinance trial in mind, where there is a long positive tail of rare individuals who can do amazing things with credit, while most people cannot use it effectively. A trial (experimental) sample of 2n individuals is randomly drawn from the parent and is randomly split between n treatments and n controls. In the absence of treatment, everyone in the sample records zero, so the sample average treatment effect in any one trial is simply the mean outcome among the n treatments. For values of n equal to 25, 50, 100, 200, and 500 we draw 100 trial/experimental samples each of size 2n; with five values of n, this gives us 500 trial/experimental samples in all. For each of these 500 samples, we randomize into n controls and n treatments, estimate the ATE and its estimated t-value (using the standard two-sample t-value, or equivalently, by running a regression with robust t-values), and then repeat 1,000 times, so we have 1,000 ATE estimates and t-values for each of the 500 trial samples; these allow us to assess the distribution of ATE estimates and their nominal t-values for each trial.

Table 1: RCTs with skewed treatment effects

Sample size    Mean of ATE    Mean of nominal    Fraction null
(n per arm)    estimates      t-values           rejected (percent)
 25             0.0268        -0.4274            13.54
 50             0.0266        -0.2952            11.20
100            -0.0018        -0.2600             8.71
200             0.0184        -0.1748             7.09
500            -0.0024        -0.1362             6.06

Note: 1,000 randomizations on each of 100 draws of the trial sample randomly drawn from a lognormal distribution of treatment effects shifted to have a zero mean.

The results are shown in Table 1. Each row corresponds to a sample size. In each row, we show the results of 100,000 individual trials, composed of 1,000 replications on each of the 100 trial (experimental) samples. The columns are averaged over all 100,000 trials.

The last column shows the fraction of times the true null is rejected and is the key result. When there are only 50 treatments and 50 controls (row 2), the (true) null is rejected 11.2 percent of the time, instead of the 5 percent that we would like and expect if we were unaware of the problem. When there are 500 units in each arm, the rejection rate is 6.06 percent, much closer to the nominal 5 percent.
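The design of this experiment can be sketched in a few lines. The following is a scaled-down illustration, not the code used for Table 1: the number of trial samples and of randomizations per sample are much smaller, the 1.96 critical value is our assumption (a normal approximation to the nominal 5 percent test), and the function names are hypothetical. Because untreated outcomes are all zero, the control mean and control variance are zero and the t-value depends only on the treated units.

```python
import math
import random
import statistics


def draw_treatment_effects(size, rng):
    """Shifted standard lognormal effects with population mean zero."""
    return [rng.lognormvariate(0.0, 1.0) - math.exp(0.5) for _ in range(size)]


def rejection_rate(n, n_samples, n_randomizations, rng, critical=1.96):
    """Share of trials in which the true null (ATE = 0) is rejected."""
    rejections = 0
    for _ in range(n_samples):
        effects = draw_treatment_effects(2 * n, rng)  # one trial sample
        for _ in range(n_randomizations):
            treated = rng.sample(effects, n)  # random split; the rest are controls
            ate_hat = statistics.fmean(treated)  # controls all record zero
            std_err = math.sqrt(statistics.variance(treated) / n)
            if abs(ate_hat / std_err) > critical:
                rejections += 1
    return rejections / (n_samples * n_randomizations)


# Scaled-down run for n = 50: 200 trial samples, 25 randomizations each.
print(rejection_rate(50, n_samples=200, n_randomizations=25, rng=random.Random(0)))
```

With these reduced replication counts the estimated rate is noisy, so it should be read as indicating the direction of the distortion rather than as a reproduction of the table.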

Why does the standard application of the t-distribution give such strange results when all we are doing is estimating a mean? The problem cases are when the trial sample happens to contain one or more outliers, something that is always a risk given the long positive tail of the parent distribution. When this happens, everything depends on whether the outlier is among the treatments or the controls; in effect the outliers become the sample, reducing the effective number of degrees of freedom.

Figure 1: Estimates of an ATE with an outlier in the trial sample
[Figure: histogram; horizontal axis: 1,000 estimates of average treatment effect; vertical axis: density]

Figure 1 illustrates the estimated average treatment effects from an extreme case from the simulations with 100 observations in total, the second row of Table 1; the histogram shows the 1,000 estimates of the ATE. The trial sample has a single large outlying treatment effect of 48.3; the mean (s.d.) of the other 99 observations is -0.51 (2.1); when the outlier is in the treatment group, we get the right-hand side of the figure, when it is not, we get the left-hand side. On the right-hand side, when the outlier is among the treatment group, the dispersion across outcomes is large, as is the estimated standard error, and so those outcomes rarely reject the null using the standard table of t-values. The over-rejections come from the left-hand side of the figure when the outlier is in the control group, the outcomes are not so dispersed, and the t-values can be large, negative, and significant. While these cases of bimodal distributions may not be common, and depend on large outliers, they illustrate the process that generates the over-rejections and spurious significance.

We could escape these problems if we could calculate the median treatment effect, but RCTs cannot (without further assumption) identify the median, only the mean, and it is the mean that is at risk because of the Bahadur-Savage theorem. Note too that there is only moderate comfort to be taken in large sample sizes. While the last row is certainly better than the others, there are still many trial samples that are going to give sample average effects that are significant, even when the number we want is zero. The proof of the Bahadur-Savage theorem works by noting that for any sample size, it is always possible to find an outlier that will give a misleading t-value. Nor is there an escape here by using the Fisher exact method for inference; the Fisher method tests the null hypothesis that all of the treatment effects are zero whereas what we are interested in here, at least if we want to do project evaluation or cost-benefit analysis, is that the average treatment effect is zero.

The problems illustrated above, which stem from the Bahadur-Savage theorem, are certainly not confined to RCTs, and occur more generally in econometric and statistical work. However, the analysis here illustrates that the simplicity of ideal RCTs, subtracting one mean from another, brings no exemption from troublesome problems of inference. Escape from these issues, as in the Rand Health Experiment, requires explicit modeling, or might be best handled by estimating quantiles of the treatment distribution, which again requires additional assumptions.

Our reading of the literature on RCTs in development suggests that they are not exempt from these concerns. Many development trials are run on (sometimes very) small samples, they have treatment effects where asymmetry is hard to rule out, especially when the outcomes are in money, and they often give results that are puzzling, or at least not easily interpreted in terms of economic theory. Neither Banerjee and Duflo (2012) nor Karlan and Appel (2011), who cite many RCTs, raise concerns about misleading inference, treating all results as solid. No doubt there are behaviors in the world that are inconsistent with standard economics, and some can be explained by standard biases in behavioral economics, but it would also be good to be suspicious of the significance tests before accepting that an unexpected finding is well supported and theory should be revised. Replication of results in different settings may be helpful, if they are the right kind of places (see our discussion in Section 2), but it hardly solves the problem given that the asymmetry may be in the same direction in different settings (and seems likely to be so in just those settings that are sufficiently like the original trial setting to be of use for inference about the trial population), and that the “significant” t-values will show departures from the null in the same direction, thus replicating spurious findings.

1.3.2 Significance tests: Fisher-Behrens, robust inference, and multiple hypotheses

Skewness of treatment effects is not the only threat to accurate significance tests. The two-sample t-statistic is computed by dividing the ATE by the estimated standard error whose square is given by

$$\hat{\sigma}^2 = \frac{(n_1-1)^{-1}\sum_{i \in 1}(Y_i-\hat{\mu}_1)^2}{n_1} + \frac{(n_0-1)^{-1}\sum_{i \in 0}(Y_i-\hat{\mu}_0)^2}{n_0} \tag{5}$$

where 0 refers to controls and 1 to treatments, so that there are $n_1$ treatments and $n_0$ controls, and $\hat{\mu}_1$ and $\hat{\mu}_0$ are the two means. As has long been known, this t-statistic is not distributed as Student’s t if the two variances (treatment and control) are not identical; this is known as the Behrens-Fisher problem. In extreme cases, when one of the variances is zero, the t-statistic has effective degrees of freedom half of that of the nominal degrees of freedom, so that the test-statistic has thicker tails than allowed for, and there will be too many rejections when the null is true.
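The halving of the degrees of freedom can be checked numerically. The sketch below computes the t-value from the squared standard error in (5) and, as an assumption brought in from outside the text, uses the standard Welch-Satterthwaite approximation for the effective degrees of freedom; the function name and the data are illustrative only.

```python
import math
import statistics


def two_sample_t(y_treat, y_control):
    """ATE estimate, t-value and approximate effective degrees of freedom.

    The degrees of freedom use the Welch-Satterthwaite approximation
    (an assumption of this sketch). Undefined if both variances are zero.
    """
    n1, n0 = len(y_treat), len(y_control)
    v1 = statistics.variance(y_treat) / n1      # treatment term of the squared s.e.
    v0 = statistics.variance(y_control) / n0    # control term of the squared s.e.
    ate = statistics.fmean(y_treat) - statistics.fmean(y_control)
    t_value = ate / math.sqrt(v1 + v0)
    eff_df = (v1 + v0) ** 2 / (v1 ** 2 / (n1 - 1) + v0 ** 2 / (n0 - 1))
    return ate, t_value, eff_df


treated = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.5, 3.5, 4.5, 1.5]
controls_no_variance = [0.0] * 10
controls_same_variance = [y + 1.0 for y in treated]

# Nominal degrees of freedom are n1 + n0 - 2 = 18 in both comparisons.
print(two_sample_t(treated, controls_no_variance)[2])    # about 9: half of nominal
print(two_sample_t(treated, controls_same_variance)[2])  # about 18: equal variances
```

The zero-variance control group is exactly the situation in the illustrative simulation above, where every untreated unit records zero.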

In a remarkable recent paper, Young (2016) argues that this problem gets much worse when the trial results are analyzed by regressing outcomes not only on the treatment dummy, but also on additional controls, some of which might interact with the treatment dummy. Again the problem concerns outliers in combination with the use of clustered or robust standard errors. When the design matrix is such that the maximal influence is large, so that for some observations outcomes have large influence on their own predicted values, there is a reduction in the effective degrees of freedom for the t-value(s) of the average treatment effect(s), leading to spurious findings of significance.

Young looks at 2,003 regressions reported in 53 RCT papers in the American Economic Association journals and recalculates the significance of the estimates using Fisher’s randomization inference applied to the authors’ original data; see again Imbens and Wooldridge (2009) for a good modern account of Fisher’s method. In 30 to 40 percent of the estimated treatment effects in individual equations with coefficients that are reported as significant, he cannot reject the null of no effect; the fraction of spuriously significant results increases further when he simultaneously tests for all results in each paper. These spurious findings come in part from the well-known problem of multiple-hypothesis testing, both within regressions with several treatments and across regressions. Within regressions, treatments are largely orthogonal, but authors tend to emphasize significant t-values even when the corresponding F-tests are insignificant. Across equations, results are often strongly correlated, so that, at worst, different regressions are reporting variants of the same result, thus spuriously adding to the “kill count” of significant effects. At the same time, the pervasiveness of observations with high influence generates spurious significance on its own.

Our sense is that these issues are being taken more seriously in recent work, especially as concerns multiple hypothesis testing. Young himself is a strong proponent of RCTs in general and believes that randomization inference will yield correct inferences. Yet randomization inference can only test the null that all treatment effects are zero, that the experiment does nothing to anyone, whereas many investigators are interested in the weaker hypothesis that the average treatment effect is zero. This simply makes matters worse since the stronger hypothesis implies the weaker hypothesis and there are presumably undiscovered cases where the ATE is spuriously significant, even when the Fisher test rejects that all treatment effects are zero. Note that testing does not always match logic; it is possible to reject the null that the ATE is zero even when we can simultaneously accept the (joint) hypothesis that all treatment effects are zero; this is familiar from OLS regression, where an F-test can show joint insignificance, even when a t-test of some linear combination is significant.
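To make concrete what randomization inference tests, here is a minimal Monte Carlo sketch of a Fisher-style test of the sharp null that every unit's treatment effect is zero. The choice of test statistic (the absolute difference in means), the number of re-draws, the function names and the data are our assumptions for illustration, not the procedure of any of the papers cited.

```python
import random
import statistics


def difference_in_means(outcomes, labels):
    treated = [y for y, t in zip(outcomes, labels) if t]
    controls = [y for y, t in zip(outcomes, labels) if not t]
    return statistics.fmean(treated) - statistics.fmean(controls)


def randomization_p_value(outcomes, labels, n_draws, rng):
    """Monte Carlo p-value under the sharp null of no effect for anyone.

    Under that null the outcomes are fixed whatever the assignment, so
    re-shuffling the labels traces out the null distribution exactly.
    """
    observed = abs(difference_in_means(outcomes, labels))
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_draws):
        rng.shuffle(shuffled)
        if abs(difference_in_means(outcomes, shuffled)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_draws + 1)


# Hypothetical trial: six treated units followed by six controls.
outcomes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
labels = [True] * 6 + [False] * 6
print(randomization_p_value(outcomes, labels, n_draws=2000, rng=random.Random(0)))
```

The shuffling step is valid only because the sharp null pins down every unit's outcome under either assignment; a null about the average effect alone does not, which is the limitation discussed in the text.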

It is clear that, as of now, all reported significance levels from RCT results in economics should be treated with considerable caution. Greater care about skewness and outliers would help, as would greater use of the Fisher method and of procedures that deal correctly with multiple hypothesis testing. Yet if the null hypothesis is that the average treatment effect is zero, as in most project evaluation, the Fisher test is not available, so that we currently do not have a reliable set of procedures. Robust or clustered standard errors are necessary to allow for the possibility that treatment changes variances, and the inclusion of covariates is necessary to control for imbalance in finite samples.

1.4 Blinding

Blinding is rarely possible in economics or social science trials, and this is one of the major differences from most (although not all) RCTs in medicine, where blinding is standard, both for those receiving the treatment and those administering it. Indeed, the ability to blind has been one of the key arguments in favor of randomization, from Bradford-Hill in the 1950s, see Chalmers (2003), to welfare trials today, Gueron and Rolston (2013). Consider first the blinding of subjects. Subjects in social RCTs usually know whether they are receiving the treatment or not and so can react to their assignment in ways that can affect the outcome other than through the operation of the treatment; in econometric language, this is akin to a violation of exclusion restrictions, or a failure of exogeneity. In terms of (1), there is a pathway from the treatment assignment to another unobserved cause, which will result in a biased ATE. This is not to argue in favor of instrumental variables over RCTs, or vice versa, but simply to note that, without blinding, RCTs do not automatically solve the selection problem any more than IV estimation automatically solves the selection problem. In both cases, the exogeneity (exclusion restriction) argument needs to be explicitly made and justified. Yet the literature in economics gives great attention to the validity of exclusion restrictions in IV estimation, while tending to shrug off the essentially identical problems with lack of blinding in RCTs.

Note also that knowledge of their assignment may cause people to want to cross over from treatment to control, or vice versa, to drop out of the program, or to change their behavior in the trial depending on their assignment. In extreme cases, only those members of the trial sample who expect to benefit from the treatment will accept treatment. Consider, for example, a trial in which children are randomly allocated to two schools that teach in different languages, Russian or English, as happened during the breakup of the former Yugoslavia. The children (and their parents) know their allocation, and the more educated, wealthier, and less-ideologically committed parents whose children are assigned to the Russian-medium schools can (and did) remove their children to private English-medium schools. In a comparison of those who accepted their assignments, the effects of the language of instruction will be distorted in favor of the English schools by differences in family characteristics. This is a case where, even if the random number generator is fully functional, a later balance test will show systematic differences in observable background characteristics between the treatment and control groups; even if the balance test is passed, there may still be selection on unobservables for which we cannot test.

More generally, when people know their allocation, when they have a stake in the outcome, and when the treatment effect is different for different people, there are incentives and opportunities for selection in response to the randomization, and that selection can contaminate the estimated average treatment effect, see Heckman (1997) who makes the same point in the context of instrumental variables. Those who were randomized by a lottery into going to Vietnam will have different treatment effects depending on their labor market prospects, and those with better prospects are more likely to resist the draft. As we shall see in the next subsection, various statistical corrections are available for a few of the selection problems non-blinding presents, but all rely on the kind of assumptions that, while common in observational studies, RCTs are designed to avoid. Our own view is that assumptions and the use of prior knowledge are what we need to make progress in any kind of analysis, including RCTs whose promise of assumption-free learning is always likely to be illusory.

There may be a tendency in economics to focus on the selection bias effects of non-blinding because some solutions are available, but selection bias is not the only serious source of bias in social and medical trials. Concerns about the placebo, Pygmalion, Hawthorne, John Henry, and 'teacher/therapist' effects are widespread across studies of medical and social interventions. This literature argues that double blinding should be replaced by quadruple blinding; blinding should extend beyond participants and investigators and include those who measure outcomes and those who analyze the data, all of whom may be affected by both conscious and unconscious bias. The need for blinding in those who assess outcomes is particularly important in any cases where outcomes are not determined by strictly prescribed procedures whose application is transparent and checkable but require elements of judgment; a good example is therapists who are asked to assess the extent of depression in clinical trials of anti-depressants, see Kramer (2016).

The lesson here is that blinding matters and is very often missing. There is no reason to suppose that a poorly blinded trial with random assignment trumps better blinded studies with alternative allocation mechanisms, or matched studies.

1.5 What do RCTs do in practice?

The execution of an RCT will often deviate from its design. People may not accept their assignment, controls may manage to get treatment, and vice versa, and people may accept their assignment, but drop out before the completion of the study. In some designs, the trial works by giving people incentives to participate, for example by mailing them a voucher that gives them subsidized access to a school or to a savings product. If the aim is to evaluate the voucher scheme itself, no new issue arises. However, if the aim is to find out what the education or savings program does, and the voucher is simply a device to induce variation, much depends on whether or not people decide to use the voucher which, like attrition and crossover, is subject to purposive decisions by the subjects, inducing differences between treatments and controls.

Everything depends on the purpose of the trial. In the example above, we may want to evaluate the voucher program, or we may want to find out what the savings product does for people. We are sometimes interested in establishing causality, and sometimes in estimating an average treatment effect; in the economics literature, some writers define internal validity as getting the ATE right, while others, following the original definition of the term, define internal validity as getting causality right. Sometimes the trial limits itself to establishing causality (or to estimating an ATE) in only the trial sample, but some trials are more ambitious, and try to establish causality (or estimate an ATE) for a broader population of interest. When, as is common in economics trials, no limits are placed on the heterogeneity of treatment responses, different trial samples and different populations will generally have different ATEs and may have different causal outcomes, e.g. if the treatment has an effect in one population but none or the opposite effect in another. Our view is that the target of the trial, including the population of interest, needs to be defined in advance. Otherwise, almost any estimated number can be interpreted as a valid ATE for some population, we allow deviations from the design to define our target, and we have no way of knowing whether apparently contradictory results are really contradictory or are correct for the population on which they were derived. Differences in results, between different RCTs and between RCTs and observational studies, may owe less to the selection effects that RCTs are designed to remove, than to the fact that we are comparing non-comparable people, Heckman, Lalonde, and Smith (1999, p. 2082). Without a clear idea of how to characterize the population of individuals in the trial, whether we are looking for an ATE or to identify causality, and for which groups enrolled in the trial the results are supposed to hold, we have no basis for thinking about how to use the trial results in other contexts.

To illustrate some of the issues, consider a simple RCT in which a treatment T is administered to a trial sample that is split between a treatment group of size n and a control group of size n, but that only a fraction p of the treatment group accepts their assignment, with fraction (1 − p) receiving no treatment. Suppose that the parameter of interest is the ATE in the original population, from which the trial sample was drawn randomly. Denote by β the hypothetical ideal ATE estimate that would have been calculated if everyone had accepted assignment; as we have seen, this is an unbiased estimator of the parameter of interest for both the trial sample and the parent population. β cannot be calculated, but there are various options.

Option one is to ignore the original assignment and calculate the difference in means between those who received the treatment and those who did not, including among the latter those who were intended to receive it but did not. Denote this (“as treated”) estimate β1. Alternatively, option two is to compare the average outcome among those who were intended to be treated and those who were intended to be controls. Denote this estimate, the “intent to treat” (ITT) estimator, β2. It is easy to show that one set of conditions for β1 = β is that those who were treated have the same ATE as those who were intended to be treated, and that those who broke their assignment have the same untreated mean as those who were assigned to be controls, conditions that may hold in some applications, for example where the treatment effects are identical.

The ITT estimator, β2, will typically be closer to zero than is β, and it will certainly be so if the average treatment effect among those who break their assignment is the same as the overall ATE, in which case β2 = pβ. For these reasons, the ITT is often described as yielding a conservative estimate and is routinely advocated in medical trials even though it is an attenuated estimator of the ATE. A third estimator, β3, the local average treatment estimator (LATE), is computed by running a regression of outcomes on an (actual) treatment dummy using the treatment assignment as an instrumental variable. In this case, the LATE is simply the ITT, scaled up by the reciprocal of p, so that β3 = β2 / p. From the above, the LATE is β if the average treatment effect of those who break their assignment is the same as the average treatment effect in general, so that the ITT estimator is biased down by counting those who should have been treated as if they were controls. More generally, and with additional assumptions, Imbens and Angrist (1994) show that the LATE is the average treatment effect among those who were induced to accept the treatment by their assignment to treatment status, which can be a very different object from the original target of investigation. These various estimators, the ATE, the ITT, and the LATE, are all averages over different groups; more formally, Heckman and Vytlacil (2005) define a marginal treatment effect (MTE) as the ATE for those on the margin of treatment (whatever the assignment mechanism) and show that the other estimators can be thought of as averages of the MTEs over different populations.
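The three estimators can be compared on a toy data set. The sketch below is illustrative only: the function names, the eight-person sample, and the outcome values are assumptions made for this example and do not come from the paper. It assumes one-sided non-compliance (no control receives the treatment), so that the instrumental-variable estimate reduces to β3 = β2 / p. In each arm, half the people would refuse treatment, and refusers have a lower untreated outcome than accepters.

```python
from statistics import mean


def as_treated(y, d):
    """beta_1: difference in mean outcomes by treatment actually received."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    untreated = [yi for yi, di in zip(y, d) if di == 0]
    return mean(treated) - mean(untreated)


def intent_to_treat(y, z):
    """beta_2: difference in mean outcomes by original assignment."""
    assigned = [yi for yi, zi in zip(y, z) if zi == 1]
    controls = [yi for yi, zi in zip(y, z) if zi == 0]
    return mean(assigned) - mean(controls)


def late(y, z, d):
    """beta_3: ITT scaled by the difference in take-up between the arms."""
    takeup_assigned = mean([di for di, zi in zip(d, z) if zi == 1])
    takeup_controls = mean([di for di, zi in zip(d, z) if zi == 0])
    return intent_to_treat(y, z) / (takeup_assigned - takeup_controls)


# Hypothetical sample: z = assignment, d = treatment received, y = outcome.
# Accepters have untreated outcome 12 and treatment effect 4;
# refusers have untreated outcome 8 and never take the treatment.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 0, 0, 0, 0, 0, 0]
y = [16, 16, 8, 8, 12, 12, 8, 8]

beta_1 = as_treated(y, d)
beta_2 = intent_to_treat(y, z)
beta_3 = late(y, z, d)
print(beta_1, beta_2, beta_3)
```

In this constructed case the ITT equals 2, which is the accepters' effect of 4 scaled by p = 0.5; the LATE recovers 4; and the as-treated comparison gives about 6.7, overstating the effect because refusers differ from accepters in their untreated outcomes.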

In general, and unless we are prepared to say more about the heterogeneity in the treatment effects, the three estimators will give different results because they are averages over different populations. Economists tend to believe that people act in their own interest, at least in part, so it is not attractive to believe that those who break their assignments have the same distribution of treatment effects as do those who accept them. In Heckman’s (1992) analogy, people are not like agricultural plots, which are in no position to evade the treatment when they see it coming. Such purposive behavior will generally also affect the composition of the trial sample compared with the parent population, with those who agree to participate different from those who do not. For example, people may dislike randomization because of the risks it entails, or people may seek to enter trials in the hope that they will receive a beneficial treatment that is otherwise unavailable. A famous example in economics is the Ashenfelter (1978) pre-program “dip,” where those who enter trials of training programs tend to be those whose earnings have fallen immediately prior to enrolment, see also Heckman and Smith (1999). People who participate in drug trials are more likely to be sick than those who do not, or are likely to be those who have failed on standard medication. Another example is Chyn’s (2016) evidence that those who applied for vouchers in the Moving to Opportunity experiment and were thus eligible for randomization (and only a quarter of those who were eligible actually did so) were those who were already making unusual efforts on their children’s behalf. These parents had effectively substituted for part of the better environment, so that the ATE from the trial understates the benefits to the average child of moving. Similar phenomena occur in medicine. In the 1954 trials of the Salk polio vaccine in the US, the rates of infection, while lowest among the treated children, were higher in the control children than in the general population at risk, so that the parents of those who selected into the trial presumably had some idea that they might have been exposed, Hausman and Wise (1985, pp. 193-4). In this case, the average treatment effect in the trial sample exaggerates the ATE in the general population, which is what we want to know for public policy.

Given the non-parametric spirit of RCTs, and the unwillingness of many trialists to make assumptions or to incorporate prior information, the only way forward is to be very clear about the purpose of the trial and, in particular, which average we are trying to estimate. For those who focus on internal validity in terms of establishing causality by finding an ATE significantly different from zero, the definition of the population seems to be a secondary concern. The idea seems to be that if causality is established in some population, that finding is important in itself, with the task of exploring its applicability to other populations left as a secondary matter. For the many economic or cost-benefit analyses where the ATE is the parameter of interest, the population of interest is definitional, and the inference needs to focus on a path from the results of the trial to the parameter of interest. This is often difficult or even impossible without additional assumptions and/or modeling of behavior, including the decision to participate in the trial, and among participants, the decision not to drop out. Manski (1990, 1995, 2003) has shown that, without additional evidence, the population ATE is not (point) identified from the trial results, and has developed non-parametric bounds (an interval estimate) for the ATE. As with the ITT, these bounds are sometimes tight enough to be informative, though the interval defined by the bounds will often contain zero, see Manski (2013) for a discussion aimed at a broad audience. Faced with this, many scholars are prepared to make assumptions or to build models that give more precise results.
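A minimal sketch of worst-case bounds, in the spirit of the non-parametric bounds mentioned above but under assumptions that are ours rather than the paper's: outcomes are known to lie in [y_min, y_max], non-compliance is one-sided, and a fraction p of those assigned to treatment accept it. The mean treated outcome of refusers is never observed, so it is set to each logical extreme in turn. The function name and all numbers are hypothetical.

```python
def worst_case_ate_bounds(mean_treated_accepters, mean_controls, p,
                          y_min=0.0, y_max=1.0):
    """Bounds on the population ATE when refusers' treated outcomes are unknown.

    E[Y(1)] = p * (mean among accepters) + (1 - p) * (unknown mean among
    refusers), and the unknown mean must lie between y_min and y_max.
    E[Y(0)] is estimated by the control-group mean.
    """
    lower = p * mean_treated_accepters + (1 - p) * y_min - mean_controls
    upper = p * mean_treated_accepters + (1 - p) * y_max - mean_controls
    return lower, upper


# Binary outcome, so y_min = 0 and y_max = 1 (hypothetical numbers).
high_takeup = worst_case_ate_bounds(0.6, 0.4, p=0.8)
low_takeup = worst_case_ate_bounds(0.6, 0.4, p=0.5)
print(high_takeup, low_takeup)
```

With 80 percent take-up the interval is roughly (0.08, 0.28) and excludes zero; with 50 percent take-up it is roughly (-0.1, 0.4) and contains zero. The width is always (1 − p)(y_max − y_min), so the bounds are informative only when take-up is high or the outcome range is narrow.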

RCTs may tell us about causality, even when they do not deliver a good estimate of the ATE. For example, if the ITT estimate is significantly different from zero, the treatment has a causal effect for at least some individuals in the population. The same is true if the LATE is significantly different from zero; again the treatment is causal for some sub-population, even if we may have difficulty characterizing it or accepting it as the population of interest. From this, we also learn that, provided we had a population with the right distribution of β_i’s and governed by the same potential outcome equation, the treatment would produce the effect in at least some individuals there.

Section 2: Using the results of randomized controlled trials

2.1 Introduction

Suppose we have the results of a well-conducted RCT. We have estimated an average treatment effect, and our standard error gives us reason to believe that the effect did not come about by chance. We thus have good warrant that the treatment causes the effect in our sample population, up to the limits of statistical inference. What are such findings good for? How should we use them?

The literature in economics, as indeed in medicine and in social policy, has paid more attention to obtaining results than to whether and how they should be adapted for use, often assuming that findings can be used “as is.” Much effort is devoted to demonstrating causality and estimating effect sizes in study populations, both in empirical work (more and better RCTs, or substitutes for RCTs, such as instrumental variables or regression discontinuity models) and in theoretical statistical work, for example on the conditions under which we can estimate an average treatment effect, or a local average treatment effect, and what these estimates mean. There is less theoretical or empirical work to guide us on how and for what purposes to use the findings of RCTs, such as the conditions under which the same results hold outside of the original settings, how they might be adapted for use elsewhere, or how they might be used for formulating, testing, understanding, or probing hypotheses beyond the immediate relation between the treatment and the outcome investigated in the study.

Yet it cannot be that knowing how to use results is less important than knowing how to demonstrate them. Any chain of evidence is only as strong as its weakest link, so that a rigorously established effect whose applicability is justified by a loose declaration of simile warrants little more than an estimate that was plucked out of thin air. If trials are to be useful, we need paths to their use that are as carefully constructed as are the trials themselves.

It is sometimes assumed that a parameter, once well established, is invariant across settings. The parameter may be difficult to estimate, because of selection or other issues, and it may be that only a well-conducted RCT can provide a credible estimate of it. If so, internal validity is all that is required, and debate about using the results becomes a debate about the conduct of the study. The argument for the “primacy of internal validity,” Shadish, Cook, and Campbell (2002), is reasonable as a warning that bad RCTs are unlikely to generalize, but it is sometimes incorrectly taken to imply that results of an internally valid trial will automatically or often apply ‘as is’ elsewhere, or that this is the default assumption failing arguments to the contrary. An invariance argument is often made in medicine, where it is sometimes plausible that a particular procedure or drug works the same way everywhere, though see Horton (2000) for a strong dissent and Rothwell (2005) for examples on both sides of the question. We should also note the recent movement to ensure that testing of drugs includes women and minorities because members of those groups suppose that the results of trials on mostly healthy young white males do not apply to them.

2.2 Using results, transportability, and external validity

Suppose a trial has established a result in a specific setting, and we are interested in using the result outside the original context. If “the same” result holds elsewhere, we say we have external validity, otherwise not. External validity may refer just to the transportability of the causal connection, or go further and require replication of the magnitude of the average treatment effect. Either way, the result holds (everywhere, or widely, or in some specific elsewhere) or it does not.

This binary concept of external validity is often unhelpful; it both overstates and understates the value of the results from an RCT. It directs us toward simple extrapolation (whether the same result will hold elsewhere) or simple generalization (whether it holds universally or at least widely) and away from possibly more complex but more useful applications of the evidence. Just as internal validity says nothing about whether or not a trial result will hold elsewhere, the failure of external validity interpreted as simple generalization or extrapolation says little about the value of the trial.

First, there are several uses of RCTs that do not require transportability beyond the original context; we discuss these in the next subsection. Second, there are often good reasons to expect that the results from a well-conducted, informative, and potentially useful RCT will not apply elsewhere in any simple way. Even successful replication by itself tells us little either for or against simple generalization or extrapolation. Without further understanding and analysis, even multiple replications cannot provide much support for, let alone guarantee, the conclusion that the next will work in the same way. Nor do failures of replication make the original result useless. We can often learn much from coming to understand why replication failed and use that knowledge to make appropriate use of the original findings, not by expecting replication, but by looking for how the factors that caused the original result might be expected to operate differently in different settings. Third, and particularly important for scientific progress, the RCT result can be incorporated into a network of evidence and hypotheses that test or explore claims that look very different from the results reported from the RCT. We shall give examples below of extremely useful RCTs that are not externally valid in the (usual) sense that their results do not hold elsewhere, whether in a specific target setting or in the more sweeping sense of holding everywhere.

Bertrand Russell’s chicken provides an excellent example of the limitations to straightforward extrapolation from repeated successful replication. The bird infers, based on multiply repeated evidence, that when the farmer comes in the morning, he feeds her. The inference serves her well until Christmas morning, when he wrings her neck and serves her for Christmas dinner. Of course, our chicken did not base her inference on an RCT. But had we constructed one for her, we would have obtained exactly the same result that she did. Her problem was not her methodology, but rather that she was studying surface relations, and that she did not understand the social and economic structure that gave rise to the causal relations that she observed. So she did not know how widely or how long they would obtain. Russell notes, “more refined views as to the uniformity of nature would have been useful to the chicken” (1912, p. 44). We often act as if the methods of investigation that served the chicken so badly will do perfectly well for us.

Establishing causality does nothing in and of itself to guarantee generalizability. Nor does the ability of an ideal RCT to eliminate bias from selection or from omitted variables mean that the resulting ATE will apply anywhere else. The issue is worth mentioning only because of the enormous weight that is currently attached in economics to the discovery and labeling of causal relations, a weight that is hard to justify for effects that may have only local applicability, what might (perhaps provocatively) be labeled ‘anecdotal causality’. The operation of a cause generally requires the presence of support or helping factors, without which a cause that produces the targeted effect in one place, even though it may be present and have the capacity to operate elsewhere, will remain latent and inoperative. What Mackie (1974) called INUS causality (Insufficient but Non-redundant parts of a condition that is itself Unnecessary but Sufficient for a contribution to the outcome) is often the kind of causality we see; a standard example is a house burning down because the television was left on, although televisions do not operate in this way without helping factors, such as wiring faults, the presence of tinder, and so on. This is standard fare in epidemiology, which uses the term “causal pie” to refer to the case where a set of causes are jointly but not separately sufficient for an effect. We can rewrite (3) in the form

$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} = \left( \sum_{k=1}^{K} \theta_k w_{ik} \right) T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (6)$$

where θ_k controls how w_ik affects individual i’s treatment effect β_i. The “helping” or “support” factors for the treatment are represented by the interactive variables w_ik, among which may be included some x’s. Since the ATE is the average of the β_i’s, two populations will have the same ATE only if, except by accident, they have the same average for the support factors necessary for the treatment to work. These are however just the kind of factors that are likely to be differently distributed in different populations, and indeed we do generally find different ATEs in different development (and other social policy) RCTs in different places even in the cases where (unusually) they all point in the same direction.
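Because (6) makes each individual effect a weighted sum of support factors, the ATE in any population is the same weighted sum of that population's support-factor means. The sketch below illustrates this with invented numbers; the function name, the θ values, and the two sites are hypothetical, and the θ's are assumed to be common to both sites.

```python
def ate_from_support_factors(theta, mean_w):
    """ATE = sum over k of theta_k * (population mean of w_k)."""
    return sum(t * w for t, w in zip(theta, mean_w))


theta = [2.0, -1.0]            # hypothetical theta_k, common to both places
mean_w_site_a = [0.5, 0.2]     # hypothetical means of w_1, w_2 at site A
mean_w_site_b = [0.1, 0.6]     # same factors, distributed differently at site B

ate_a = ate_from_support_factors(theta, mean_w_site_a)
ate_b = ate_from_support_factors(theta, mean_w_site_b)
print(ate_a, ate_b)
```

With these numbers the ATE is about 0.8 at site A and about -0.4 at site B: the same causal structure yields effects of opposite sign once the support factors are distributed differently.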

Causal processes often require highly specialized economic, cultural, or social structures to enable them to work. Consider the Rube Goldberg machine that is rigged up so that flying a kite sharpens a pencil, Cartwright and Hardie (2012, p. 77), or another where a long chain of ropes and pulleys causes the insertion of food into the mouth to activate a face-wiping napkin. These are causal machines, but they are specially constructed to give a kind of causality that operates extremely locally and has no general applicability. The underlying structure affords a very specific form of (6) that will not describe causal processes elsewhere. Neither the same ATE nor the same qualitative causal relations can be expected to hold where the specific form for (6) is different.

Indeed, we continually attempt to design systems that will generate causal relations that we like and that will rule out causal relations that we do not like. Healthcare systems are designed to prevent nurses and doctors making errors; cars are designed so that drivers cannot start them in reverse; work schedules for pilots are designed so they do not fly too many consecutive hours without rest because alertness and performance are compromised.

As in the Rube Goldberg machines and in the design of cars and work schedules, the economic structure and equilibrium may differ in ways that support different kinds of causal relations and thus render a trial in one setting useless in another. For example, a trial that relies on providing incentives for personal promotion is of no use in a state in which a political system locks people into their social and economic positions. Conditional cash transfers cannot improve child health in the absence of functioning clinics. Policies targeted at men may not work for women. We use a lever to toast our bread, but levers only operate to toast bread in a toaster; we cannot brown toast by pressing an accelerator, even if the principle of the lever is the same in both a toaster and a car. If we misunderstand the setting, if we do not understand why the treatment in our RCT works, we run the same risks as Russell’s chicken.

2.3 When RCTs speak for themselves: no transportability required

For some things we want to learn, an RCT is enough by itself. An RCT may disprove a general theoretical proposition to which it provides a counterexample. The test might be of the general proposition itself (a simple refutation test), or of some consequence of it that is susceptible to testing using an RCT (a complex refutation test). Of course, counterexamples are often challenged (for example, it is not the general proposition that caused the rejection, but a special feature of the trial) but here we are on familiar inferential turf. An RCT may also confirm a prediction of a theory, and although this does not confirm the theory, it is evidence in its favor, especially if the prediction seems inherently unlikely in advance. Once again, this is familiar territory, and there is nothing unique about an RCT; it is simply one among many possible testing procedures. Even when there is no theory, or very weak theory, an RCT, by demonstrating causality in some population, can be thought of as proof of concept, that the treatment is capable of working somewhere. This is one of the arguments for the importance of internal validity.

Another case where no transportation is called for is when an RCT is used for evaluation, for example to satisfy donors that the project they funded actually achieved its aims in the population in which it was conducted. Even so, for such evaluations, say by the World Bank, to be global public goods requires the development of arguments and guidelines that justify using the results in some way elsewhere; the global public good is not an automatic by-product of the Bank fulfilling its fiduciary responsibility. When the components of treatments change across studies, evaluations need not lead to cumulative knowledge. Or as Heckman et al (1999, p. 1934) note, “the data produced from them [social experiments] are far from ideal for estimating the structural parameters of behavioral models. This makes it difficult to generalize findings across experiments or to use experiments to identify the policy-invariant structural parameters that are required for econometric policy evaluation.” Of course, when we ask exactly what those invariant structural parameters are, whether they exist, and how they should be modeled, we open up major fault lines in modern applied economics. For example, we do not intend to endorse intertemporal dynamic models of behavior as the only way of recovering the parameters that we need. We also recognize that the usefulness of simple price theory is not as universally accepted as it once was. But the point remains that we need something, some regularity, and that the something needed can rarely be recovered by simply generalizing across trials.

A third non-problematic and important use of an RCT is when the parameter of interest is the average treatment effect in a well-defined population from which the sample trial population (from which treatments and controls are randomly assigned) is itself a random sample. In this case the sample average treatment effect (SATE) is an unbiased estimator of the population average treatment effect (PATE) that, by assumption, is our target, see Imbens (2004) for these terms. We refer to this as the “public health” case; like many public health interventions, the target is the average, “population health,” not the health of individuals. One major (and widely recognized) danger of the public-health-style uses of RCTs is that the scaling up from (even a random) sample to the population will not go through in any simple way if the outcomes of individuals or groups of individuals change the behavior of others, which will be common in economic examples but perhaps less common in health. There is also an issue of timing if the results are to be implemented some time after the trial.

In economics, a ‘public-health-style’ example is the imposition of a commodity tax, where the total tax revenue is of interest and we do not care who pays the tax. Indeed, theory can often identify a specific, well-defined magnitude whose measurement is key for the policy; see Deaton and Ng (1998) for an example of what Chetty (2009) calls a “sufficient” statistic. In this case, the behavior of a random sample of individuals might well provide a good guide to the tax revenue that can be expected. Another case comes from work on poverty programs where the interest of the sponsors is in the consequences for the budget of the state responsible for the program; we discuss these cases at the end of this Section. Even here, it is easy to imagine behavioral effects coming into play that drive a wedge between the trial and its full-scale implementation, for example if compliance is higher when the scheme is widely publicized, or if government agencies implement the scheme differently from trialists.

2.4 Transporting results laterally and globally

The program of RCTs in development economics, as in other areas of social science, has the broader goal of finding out “what works.” At its most ambitious, this aims for universal reach, and the development literature frequently argues that “credible impact evaluations are global public goods in the sense that they can offer reliable guidance to international organizations, governments, donors, and nongovernmental organizations (NGOs) beyond national borders,” Kremer and Duflo (2008, p. 93). Sometimes the results of a single RCT are advocated as having wide applicability, with especially strong endorsement when there is at least one replication. For example, Kremer and Holla (2009) use a Kenyan trial as the basis for a blanket statement without context restriction, “Provision of free school uniforms, for example, leads to 10%-15% reductions in teen pregnancy and dropout rates.” Kremer and Duflo (2008), writing about another trial, are more cautious, citing two evaluations, and restricting themselves to India: “One can be relatively confident about recommending the scaling-up of this program, at least in India, on the basis of these estimates, since the program was continued for a period of time, was evaluated in two different contexts, and has shown its ability to be rolled out on a large scale.”

Of course, the problem of generalization extends beyond RCTs, to both “fully controlled” laboratory experiments and to most non-experimental findings. For example, ever since Alfred Marshall thought of it while sunbathing, economists have used the concept of an elasticity (as in the income elasticity of the demand for food, or the price elasticity of the supply of cotton) and have transported elasticities, which are conveniently dimensionless, from one context to another, as numerical estimates, or in ranges, such as high, medium, or low. Articles that collect such estimates are widely cited even though, as has long been known, the invariance of elasticities is not guaranteed in practice and is sometimes inconsistent with choice theory. Our argument here is that evidence from RCTs, like evidence on elasticities, is not automatically simply generalizable, and that its internal validity, when it exists, does not provide it with any unique invariance across context. We shall also argue that specific features of RCTs, such as their freedom from parametric assumptions, although advantageous in estimation, can be a serious handicap in use.

Most advocates of RCTs understand that “what works” needs to be qualified to “what works under which circumstances,” and try to say something about what those circumstances might be, for example, by replicating RCTs in different places, and thinking intelligently about the differences in outcomes when they find them. Sometimes this is done in a systematic way, for example by having multiple treatments within the same trial so that it is possible to estimate a “response surface,” that links outcomes to various combinations of treatments, see Greenberg and Schroder (2004) or Shadish et al (2002). For example, the RAND health experiment had multiple treatments, allowing investigation, not only of whether health insurance increased expenditures, but how much it did so under different circumstances. Some of the negative income tax experiments (NITs) in the 1960s and 1970s were designed to estimate response surfaces, with the number of treatments and controls in each arm optimized to maximize precision of estimated response functions subject to an overall cost limit, Conlisk (1973). Experiments on time-of-day pricing for electricity had a similar structure, see Aigner (1985).

The MDRC experiments have also been analyzed across cities in an effort to link city features to the results of the RCTs within them, Bloom, Hill, and Riccio (2005). Unlike the RAND and NIT examples, these are ex post analyses of completed trials; the same is true of Vivalt (2015) who assembles evidence on a large number of trials, and finds, for the collection of trials she studied, that development-related RCTs run by government agencies typically find smaller (standardized) effect sizes than RCTs run by academics or by NGOs. Bold et al (2013), who ran parallel RCTs on an intervention implemented either by an NGO or by the government of Kenya, found similar results there. Note that these analyses have a different purpose from those meta-analyses that assume that different trials estimate the same parameter up to noise and average in order to increase precision.

Although there are issues with all of these methods of investigating differences across trials, without some discipline it is too easy to come up with “just-so” or fairy stories that account for almost any differences. We risk a procedure that, if a result is replicated in full or in part in at least two places, puts that treatment into the “it works” box and, if the result does not replicate, causally interprets the difference in a way that allows at least some of the findings to survive.

How can we think about this more seriously? How can we do better than simple generalization and simple extrapolation? Many writers have emphasized the role of theory in transporting and using the results of trials, and we shall discuss this further in the next subsection. But statistical approaches are also widely used; these are designed to deal with the possibility that treatment effects vary systematically with other variables. Referring back to (6), suppose that the β_i’s, the individual treatment effects, are functions of a set of K observable or unobservable support variables, w_ik, and that the non-vacuous w’s may even represent different features in different places. It is then clear that, provided the distribution of the w values is the same in the new circumstances as the old, the ATE in the original trial will hold in the new circumstances. In general, of course, this condition will not hold, nor do we have any obvious way of checking it unless we know what the support factors are in both places.

One procedure to deal with interactions is post-experimental stratification, which parallels post-survey stratification in sample surveys. The trial is broken up into subgroups that have the same combination of known, observable w’s, the ATEs within each of the subgroups are calculated, and the subgroup ATEs are then reassembled according to the configuration of w’s in the new context. For example, if the treatment effects vary with age, the age-specific ATEs can be estimated, and the age distribution in the new context used to reweight the age-specific ATEs to give a new, overall, ATE. This can be used to estimate the ATE in a new context, or to correct estimates to the parent population when the trial sample is not a random sample of the parent. Of course, this method will only work in special cases; for example, if we only know some of the w’s, there is no reason to suppose that reweighting for those alone will give a useful correction.
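The age example can be written out in a few lines. This is a sketch under stated assumptions: the stratum labels, the age-specific ATEs, and the population shares are invented for illustration, and the function name is hypothetical. It also assumes that age is the only relevant support factor, which is exactly the condition the text warns may fail.

```python
def reweighted_ate(stratum_ates, target_shares):
    """Weight stratum-specific ATEs by the target population's stratum shares."""
    if abs(sum(target_shares.values()) - 1.0) > 1e-9:
        raise ValueError("target shares must sum to one")
    missing = set(target_shares) - set(stratum_ates)
    if missing:
        # Strata absent from the trial cannot be reweighted into existence.
        raise ValueError(f"no trial estimate for strata: {sorted(missing)}")
    return sum(target_shares[s] * stratum_ates[s] for s in target_shares)


age_specific_ates = {"young": 4.0, "old": 1.0}   # hypothetical trial estimates
trial_shares = {"young": 0.75, "old": 0.25}      # age mix in the trial
target_shares = {"young": 0.25, "old": 0.75}     # age mix in the new context

trial_ate = reweighted_ate(age_specific_ates, trial_shares)
target_ate = reweighted_ate(age_specific_ates, target_shares)
print(trial_ate, target_ate)
```

Here the trial ATE is 3.25 while the reweighted ATE for the older target population is 1.75. The explicit error for a missing stratum reflects the point that a trial with no members of some group offers nothing to reweight for that group.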

Other methods also work when there are too many w’s for stratification, for example by estimating the probability of each observation in the population being included in the trial sample as a function of the w’s, then weighting each observation by the inverse of these propensity scores. A good reference for these methods is Stuart et al (2011), or in economics, Angrist (2004) and Hotz, Imbens, and Mortimer (2005).
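A minimal sketch of inverse-probability weighting follows. In practice the inclusion probabilities would themselves be estimated from the w’s, for instance with a logistic regression; here they are simply assumed known, and the data, names, and probabilities are all hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def ipw_ate(y, t, inclusion_prob):
    """Difference in weighted means, weighting each unit by 1 / P(in trial)."""
    weights = [1.0 / p for p in inclusion_prob]
    treated = [(yi, wi) for yi, ti, wi in zip(y, t, weights) if ti == 1]
    controls = [(yi, wi) for yi, ti, wi in zip(y, t, weights) if ti == 0]
    mean_treated = weighted_mean(*zip(*treated))
    mean_controls = weighted_mean(*zip(*controls))
    return mean_treated - mean_controls


# Hypothetical trial: three young and one old person in each arm.
# Young people (effect 4) enter the trial with probability 0.3,
# old people (effect 1) with probability 0.1, so the young are over-represented.
y = [14, 14, 14, 11, 10, 10, 10, 10]
t = [1, 1, 1, 1, 0, 0, 0, 0]
inclusion_prob = [0.3, 0.3, 0.3, 0.1, 0.3, 0.3, 0.3, 0.1]

unweighted = ipw_ate(y, t, [0.5] * len(y))   # equal weights: plain difference in means
weighted = ipw_ate(y, t, inclusion_prob)
print(unweighted, weighted)
```

The unweighted difference in means is 3.25, while the weighted estimate is about 2.5, the ATE for a parent population that is half young and half old. The correction is only as good as the inclusion model: if some factor that drives both inclusion and the treatment effect is left out of the probabilities, the weighted estimate remains biased.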

There are yet further reasons why these methods do not always work. As with any form of reweighting, the variables used to construct the weights must be present in both the original and new context. If treatment effects vary by sex, we cannot predict the outcomes for men using a trial sample that is entirely female. If we are to carry a result forward in time, we may not be able to extrapolate from a period of low inflation to a period of high inflation; as Hotz et al (2005) note, it will typically be necessary to rule out such “macro” effects, whether over time, or over locations. Reweighting also depends on assuming that the same governing equation (6) covers the trial and the target population. If they differ not only by what causal factors are present in what proportions but also in how (if at all) the causes contribute to the effects, reweighting the effect sizes that occur in trial sub-populations will not produce good predictions about target population outcomes.

It should be clear from this that reweighting works only when the observable factors used for reweighting include all and only genuine interactive causes; we need data on all the relevant interactive factors. But as Muller (2015) notes, this takes us back to the situation that RCTs are designed to avoid, where we need to start from a complete and correct specification of the causal structure. RCTs can avoid this in estimation (which is one of their strengths, supporting their credibility) but the benefit vanishes as soon as we try to carry their results to a new context.

Pearl and Bareinboim (2014) use Pearl’s do-calculus to provide a fuller formal analysis for transportability of causal empirical findings across populations. They define transportability as “a license to transfer causal effects learned in RCTs to a new population, in which only observational studies can be conducted,” Pearl and Bareinboim (2015, p. 1). They consider both qualitative causal relations, which they represent in directed acyclic graphs, and probabilistic facts, such as the conditional probability of the outcome on a treatment conditional on some third factor. They then provide theorems about what the relationship between the causal and probabilistic facts in two populations must be if it is to be possible to infer a particular causal fact, such as the ATE, about population 2 from causal and probabilistic information about population 1 coupled with purely probabilistic information about population 2. Not surprisingly, for many things we should like to know about population 2, knowledge of even the full structure on population 1 will not suffice. Inferences to facts about a new population require not only that the facts we suppose about population 1, like an ATE, are well grounded (that the RCT was well conducted, that the statistical inference is sound) but that we have equally good grounding for other assumptions we need about the relation between the two populations. For example, using the result described above for directly transporting the ATE from a trial population to some other (simple extrapolation), we need good grounds to suppose both that the average of the net effect of the interactive factors is the same in both populations and also that the same governing equation describes both populations.

This discussion leads to a number of points. First, we cannot get to general claims by simple generalization; there is no warrant for the convenient assumption that the ATE estimated in a specific RCT is an invariant parameter. We need to think through the causal chain that has generated the RCT result, and the underlying structures that support this causal chain, whether that causal chain might operate in a new setting and how it would do so with different joint distributions of the causal variables; we need to know why and whether that why will apply elsewhere. While it is true that there exist general causal claims (the force of gravity, or that people respond to incentives) they use relatively abstract concepts and operate at a much higher level than the claims that can be reasonably inferred from a typical RCT, and cannot, by themselves, guarantee the outcomes that we are considering here. That transportation is far from automatic also tells us why (even ideal) RCTs of similar interventions can be expected to give different answers in different settings. Such differences do not necessarily reflect methodological failings and will hold across perfectly executed RCTs just as they do across observational studies.

Second, thoughtful pre-experimental stratification in RCTs is likely to be valuable, or failing that, subgroup analysis, because it can provide information that may be useful for generalization or transportation. For example, Kremer and Holla (2009) note that, in their trials, school attendance is surprisingly sensitive to small subsidies, which they suggest is because there are a large number of students and parents who are on the (financial) margin between attending and not attending school; if this is indeed the mechanism for their results, a good variable for stratification would be the fraction of people near the relevant cutoff. We also need to know that the same mechanism works in any new setting where we consider using small subsidies to increase school attendance.

Third, we need to be explicit about causal structure, even if that means more model building and more (or different) assumptions than advocates of RCTs are often comfortable with. To be clear, modeling causal structure does not necessarily commit us to the elaborate and often incredible assumptions that characterize some structural modeling in economics, but there is no escape from thinking about the way things work, the why as well as the what.

Fourth, we will typically need to know more than the results of the RCT itself, for example about differences in social, economic, and cultural structures and about the joint distributions of causal variables, knowledge that will often only be available through a range of empirical strategies including observational studies. We will also need to be able to characterize the population to which the original RCT and its ATE applied because how the population is described is commonly taken to be some indication of which other populations the results are likely to be exportable to and which not. Many medical and psychological journals are explicit about this. For instance, the rules for submission recommended by the International Committee of Medical Journal Editors, ICMJE (2015, p. 14) insist that article abstracts “Clearly describe the selection of observational or experimental participants (healthy individuals or patients, including controls), including eligibility and exclusion criteria and a description of the source population.” The problems of characterizing the population here go beyond those we faced in considering a LATE. An RCT is conducted on a population of specific individuals. The results obtained, whether we think in terms of an ATE or in terms of establishing causality, are features of that population, of those very individuals at that very time, not any other population with any different individuals that might, for example, satisfy one of the infinite set of descriptions that the trial population satisfies. How is the description of the population that is used in reporting the results to be chosen? For choose we must: the alternative to describing is naming, identifying each individual in the study by name, which is cumbersome and unhelpful and often unethical.

This same issue is confronted already in study design. Apart from special cases, like post hoc evaluation for payment-for-results, we are not especially concerned to learn about the very population enrolled in the trial. Most experiments are, and should be, conducted with an eye to what the results can help us learn about other populations. This cannot be done without substantial assumptions about what might be and what might not be relevant to the production of the outcome studied. (For example, the ICMJE guidelines go on to say: “Because the relevance of such variables as age, sex, or ethnicity is not always known at the time of study design, researchers should aim for inclusion of representative populations into all study types and at a minimum provide descriptive data for these and other relevant demographic variables,” p14.) So both intelligent study design and responsible reporting of study results involve substantial background assumptions. Of course this is true for all studies, not just RCTs. But RCTs require special conditions if they are to be conducted at all and especially if they are to be conducted successfully (local agreements, compliant subjects, affordable administrators, people competent to measure and record outcomes reliably, a setting where random allocation is morally and politically acceptable, etc.), whereas observational data are often more readily and widely available. In the case of RCTs, there is danger that these kinds of considerations have too much effect. This is especially worrisome where the features the study population should have are not justified, made explicit, or subjected to serious critical review. Such careful description of the study population is uncommon in economics, whether in RCTs or many observational studies.

The need for observational knowledge is one of many reasons why it is counterproductive to insist that RCTs are the unique gold standard, or that some categories of evidence should be prioritized over others; these strategies leave us helpless in using RCTs beyond their original context. The results of RCTs must be integrated with other knowledge, including the practical wisdom of policy makers, if they are to be useable outside the context in which they were constructed. Contrary to much practice in medicine as well as in economics, conflicts between RCTs and observational results need to be explained, for example by reference to the different populations in each, a process that will sometimes yield important evidence, including on the range of applicability of the RCT itself. While the validity of the RCT will sometimes provide an understanding of why the observational study found a different answer, there is no basis (or excuse) for the common practice of dismissing the observational study simply because it was not an RCT and therefore must be invalid. It is a basic tenet of scientific advance that new findings must be able to explain previous results, even results that are now thought to be invalid; methodological prejudice is not an explanation.

These considerations can be seen in practice in the range of randomized controlled trials in economics, which we shall explore in the final subsection below.

2.5 Using theory for generalization

Economists have been combining theory and randomized controlled trials since the early experiments. Orcutt and Orcutt (1968) laid out the inspiration for the income tax trials using a simple, static theory of labor supply. According to this, people choose how to divide their time between work and leisure in an environment in which they receive a minimum $G$ if they do not work, and where they receive an additional amount $(1 - t)w$ for each hour they work, where $w$ is the wage rate, and $t$ is a tax rate. The trials assigned different combinations of $G$ and $t$ to different trial groups, so that the results traced out the labor supply function, allowing estimation of the parameters of preferences, which could then be used in a wide range of policy calculations, for example to raise revenue at minimum utility loss to workers.
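
As a minimal sketch of the budget constraint in this simple static model, the following computes income from the guarantee and the taxed wage. The function name, the wage, and the particular combinations of the guarantee and the tax rate are illustrative assumptions, not values from the actual income tax trials.

```python
def net_income(hours, wage, guarantee, tax_rate):
    """Income in the simple static model: G plus (1 - t) * w for each hour worked."""
    return guarantee + (1.0 - tax_rate) * wage * hours


# Hypothetical (G, t) combinations standing in for different trial groups.
trial_arms = [(1000.0, 0.5), (2000.0, 0.5), (2000.0, 0.75)]
wage = 10.0  # assumed hourly wage, for illustration only

for guarantee, tax_rate in trial_arms:
    # Income at three illustrative levels of hours worked.
    schedule = [net_income(h, wage, guarantee, tax_rate) for h in (0, 100, 200)]
    print(f"G={guarantee}, t={tax_rate}: income at 0/100/200 hours = {schedule}")
```

Each arm shifts the intercept (through the guarantee) or the slope (through the tax rate) of the budget line, which is what lets a set of arms trace out how hours respond to both.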

Following these early trials, there has been a long and continuing tradition of using trial results, together with the baseline data collected for the trial, to fit structural models that are to be used more generally. Early examples include Moffitt (1979) on labor supply and Wise (1985) on housing; a more recent example is Heckman, Pinto and Savelyev (2013) for the Perry preschool program. Development economics examples include Attanasio, Meghir and Santiago (2012), Attanasio et al (2015), Todd and Wolpin (2006) and Duflo, Hanna and Ryan (2012). These structural models sometimes require formidable auxiliary assumptions on functional forms or the distributions of unobservables, which makes many economists reluctant to embrace them, but they have compensating advantages, including the ability to integrate theory and evidence, to make out-of-sample predictions, and to analyze welfare (which always requires some understanding of why things happen), and the use of RCT evidence allows the relaxation of at least some of the assumptions that are needed for identification. In this way, the structural models borrow credibility from the RCTs and in return help set the RCT results within a coherent framework. Without some such interpretation, the welfare implications of RCT results can be problematic; knowing how people in general (let alone just people in the trial population, which is what, as we keep repeating, the trial results tell us about) respond to some policy is rarely enough to tell whether or not they are made better off. What works is not equivalent to what should be.

In many papers, Heckman has developed ways to model how the beliefs and interests of participants affect their participation in, behavior during, and their outcomes in trials, for example using a Roy model of choice; see e.g. Heckman and Smith (1995), and more recently Chassang, Padró i Miguel, and Snowberg (2012) and Chassang et al (2015). The modeling of beliefs and behavior allows predictions about the results of trials that differ from the base trial, or where the risk and reward structures are different. Beyond that, and in line with a running theme of this Section, thinking about how to handle new situations can be incorporated into the design of the original trial so as to provide the information needed for transportation.

Light touch theory can do much to extend and to use RCT results. In both the RAND Health Experiment and negative income tax experiments, an immediate issue concerned the difference between short and long-run responses; indeed, differences between immediate and ultimate effects occur in a wide range of RCTs. Both health and tax RCTs aimed to discover what would happen if consumers/workers were permanently faced with higher or lower prices/wages, but the trials could only run for a limited period. A temporarily high tax rate on earnings was effectively a “fire sale” on leisure, so that the experiment provided an opportunity to take a vacation and make up the earnings later, an incentive that would be absent in a permanent scheme. How do we get from the short-run responses that come from the trial to the long-run responses that we want to know? Metcalf (1973) and Ashenfelter (1978) provided answers for the income tax experiments, as did Arrow (1975) for the RAND Health Experiment.

Arrow’s analysis illustrates how to use both structure and observational data to transport and adapt results from one setting to another. He models the health experiment as a two-period model, in which the price of medical care is lowered in the first period only, and shows how to derive what we want, which is the response in the first period if prices were lowered by the same proportion in both periods. The magnitude that we want is $S$, the compensated price derivative of medical care in period 1 in the face of identical increases in $p_1$ and $p_2$ in both periods 1 and 2, and this is equal to $s_{11} + s_{12}$, the sum of the derivatives of period 1’s demand with respect to the two prices. The trial gives only $s_{11}$. But if we have post-trial data on medical services for both treatments and controls, we can infer $s_{21}$, the effect of the experimental price manipulation on post-experimental care. Choice theory, in the form of Slutsky symmetry, allows Arrow to use this to infer $s_{12}$ and thus $S$. He contrasts this with Metcalf’s alternative solution, which makes different assumptions: that two-period preferences are intertemporally additive, in which case the long-run elasticity can be obtained from knowledge of the income elasticity of post-experimental medical care, which would have to come from an observational analysis. These two alternative approaches show how we can choose, based on our willingness to make assumptions and on the data we have, a suitable combination of (elementary and transparent) theoretical assumptions and observational data in order to adapt and use the trial results. Such analysis can also help design the original trial by clarifying what we need to know in order to be able to use the results of a temporary treatment to estimate the permanent effects that we need. Ashenfelter provides a third solution, noting that the two-period model is formally identical to a two-person model, so that we can use information on two-person labor supply to tell us about the dynamics.
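
The chain of reasoning in Arrow's solution can be laid out in three lines. This is only a restatement of the argument described in the text, using the same symbols; the annotations on the right are ours.

```latex
\begin{align*}
S &= s_{11} + s_{12}
  && \text{(period-1 response to the same price change in both periods)} \\
s_{12} &= s_{21}
  && \text{(Slutsky symmetry of compensated price derivatives)} \\
\Rightarrow\quad S &= s_{11} + s_{21}
  && \text{(trial estimate plus estimate from post-trial data)}
\end{align*}
```

The first term comes from the experiment itself and the second from follow-up data on treatments and controls, so the theory does no more than license the substitution in the middle line.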

Theory can often allow us to reclassify new or unknown situations as analogous to situations where we already have background knowledge. One frequently useful way of doing this is when the new policy can be recast as equivalent to a change in the budget constraint that respondents face. The consequences of a new policy may be easier to predict if we can reduce it to equivalent changes in income and prices, whose effects are often well understood and well studied. Todd and Wolpin (2008) make this point and provide examples. In the labor supply case, an increase in the tax rate $t$ has the same effect as a decrease in the wage rate $w$, so that we can rely on previous literature to predict what will happen when tax rates are changed. In the case of Mexico’s PROGRESA conditional cash transfer program, Todd and Wolpin note that the subsidies paid to parents if their children go to school can be thought of as a combination of a reduction in children’s wage rates and an increase in parents’ income, which allows them to predict the results of the conditional cash experiment with limited additional assumptions. If this works, as it partially does in their analysis, the trial helps consolidate previous knowledge and contributes to an evolving body of theory and empirical, including trial, evidence.
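
The labor supply equivalence can be made concrete with a small sketch. It assumes, as in the simple static model above, that workers respond only to the net wage, the wage rate multiplied by one minus the tax rate; the function names and numbers are hypothetical.

```python
def net_wage(wage, tax_rate):
    """Take-home pay per hour: (1 - t) * w."""
    return (1.0 - tax_rate) * wage


def equivalent_wage(wage, old_tax, new_tax):
    """Gross wage that, at the old tax rate, yields the same net wage as the new tax rate."""
    return wage * (1.0 - new_tax) / (1.0 - old_tax)


# Assumed example: a wage of 20 and a tax rise from 50 to 75 percent.
w_equiv = equivalent_wage(20.0, 0.5, 0.75)
print(w_equiv, net_wage(w_equiv, 0.5), net_wage(20.0, 0.75))
```

Under that assumption, existing estimates of how hours respond to a wage cut from 20 to the equivalent wage can stand in for the response to the tax increase.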

The program of thinking about policy changes as equivalent to price and income changes has a long history in economics; much of rational choice theory can be so interpreted, see Deaton and Muellbauer (1980) for many examples. When this conversion is credible, and when a trial on some apparently unrelated topic can be modeled as equivalent to a change in prices and incomes, and when we can assume that people in different settings respond relevantly similarly to changes in prices and incomes, we have a ready-made framework for incorporating the trial results into previous knowledge, as well as for extending the trial results and using them elsewhere. Of course, all depends on the validity and credibility of the theory; people may not in fact think of a tax increase as a decrease in the price of leisure, and behavioral economics is full of examples where apparently equivalent stimuli generate non-equivalent outcomes. The embrace of behavioral economics by many of the current generation of trialists may account for their limited willingness to use conventional choice theory in this way; unfortunately, behavioral economics does not yet offer a replacement for the general framework of choice theory that is so useful in this regard.

Theory can also help with the problem we raised of delineating the population to which the trial results immediately apply and for thinking about moving from this population to the population of interest. Ashenfelter’s (1978) analysis is again a good illustration and predates much similar work in later literature. The income tax experiments offered participation in the trial to a random sample of the population of interest. Because there was no blinding and no compulsion, people who were randomized into the treatment group were free to choose to refuse treatment. As in many subsequent analyses, Ashenfelter supposes that people choose to participate if it is in their interest to do so, depending on what has become known in the RCT and Instrumental Variables literature as their own idiosyncratic “gain.” The simple labor supply model gives an approximate condition: if the treatment increases the tax rate from $t_0$ to $t_1$ with an offsetting increase in $G$, then an individual assigned to the experimental group will decline to participate if

$$(t_1 - t_0)\,w_0 h_0 + \tfrac{1}{2}\,s_{00}\,(t_1 - t_0) > G_1 - G_0 \qquad (7)$$

where subscript 1 refers to the treatment situation, 0 to the control, $h_0$ is hours worked, and $s_{00}$ is the (negative) utility-constant response of hours worked to the tax rate. If there is no substitution, the second term on the left-hand side is zero, and people will accept treatment if the increase in $G$ more than makes up for the increases in taxes payable, the “breakeven” condition. In consequence, those with higher earnings are less likely to accept treatment. Some better-off people with high substitution effects will also accept treatment if the opportunity to buy more cheap leisure is sufficient enticement.
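
The “breakeven” special case can be sketched as follows. The sketch covers only the case of no substitution, where the second term of the participation condition is zero; the function name and all numerical values are hypothetical.

```python
def accepts_treatment(wage, hours, old_tax, new_tax, old_guarantee, new_guarantee):
    """Breakeven rule with no substitution: accept if the rise in G exceeds the extra tax."""
    extra_tax = (new_tax - old_tax) * wage * hours
    return (new_guarantee - old_guarantee) > extra_tax


# Assumed treatment: tax rate rises from 0 to 50 percent, guarantee rises by 3000.
# Two hypothetical workers with the same wage but different pre-trial hours.
low_earner = accepts_treatment(10.0, 500, 0.0, 0.5, 0.0, 3000.0)    # extra tax 2500
high_earner = accepts_treatment(10.0, 2000, 0.0, 0.5, 0.0, 3000.0)  # extra tax 10000
print(low_earner, high_earner)
```

The low earner accepts and the high earner declines, which is the selection on pre-trial earnings described in the text.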

The selective acceptance of treatment limits the analyst’s ability to learn about the better-off or low-substitution people who decline treatment but who would have to accept it if the policy were actually implemented. Both the ITT estimator and the “as treated” estimator that compares the treated and the untreated are affected, not just by the labor supply effects that the trial is designed to induce, but by the kind of selection effects that randomization is designed to eliminate. Of course, the analysis that leads to (7) can perhaps help us say something about this and help us adjust the trial estimates back to what we would like to know. Yet this is no easy matter because selection depends, not only on observables, such as pre-experimental earnings and hours worked, but on (much harder to observe) labor supply responses that likely vary across individuals. Paraphrasing Ashenfelter, we cannot estimate the effects of a permanent compulsory negative income tax program from a transitory voluntary trial without strong assumptions or additional evidence.

Much of the modern literature, for example on training programs, wrestles with the issue of exactly who is represented by the RCT results, see again Heckman, Lalonde and Smith (1999). When people are allowed to reject their randomly assigned treatment according to their own (real or perceived) individual advantage, we have come a long way away from the random allocation in the standard conception of a randomized controlled trial. Moreover, the absence of blinding is common in social and economic RCTs, and while there are trials, such as welfare trials, that effectively compel people to accept their assignments, and some where the treatment is generous enough to do so, there are trials where subjects have much freedom and, in those cases, it is less than obvious to us what role, if any, randomization plays in warranting the results.

2.6 Scaling up: using the average for populations

A typical RCT, especially in the development context, is small-scale and local, for example in a few schools, clinics, or farms in a particular geographic, cultural, socio-economic setting. If successful according to a cost-effectiveness criterion, for example, it is a candidate for scaling-up, applying the same intervention for a much larger area, often a whole country, or sometimes even beyond, as when some treatment is considered for all relevant World Bank projects. The fact that the intervention might work differently at scale has long been noted in the economics literature, e.g. Garfinkel and Manski (1992), Heckman (1992), and Moffitt (1992), and is recognized in the recent review by Banerjee and Duflo (2009). We want here to emphasize the pervasiveness of such effects (that failure of the trial results to replicate at a larger scale is likely to be the rule rather than the exception) as well as to note once again that, as in failures of transportability, this should not be taken as an argument against using RCTs, but only against the idea that effects at scale are likely to be the same as in the trial. Using RCT results is not the same as assuming the same results hold in all circumstances.

An example of what are often called general equilibrium effects comes from agriculture. Suppose an RCT demonstrates that in the study population a new way of using fertilizer or insecticide had a substantial positive effect on, say, cocoa yields, so that farmers who used the new methods saw increases in production and in incomes compared to those in the control group. If the procedure is scaled up to the whole country, or to all cocoa farmers worldwide, the price will drop, and if the demand for cocoa is price inelastic (as is usually thought to be the case, at least in the short run) cocoa farmers’ incomes will fall. Indeed, the conventional wisdom for many crops is that farmers do best when the harvest is small, not large. Of course, these considerations might not be decisive in deciding whether or not to promote the innovation, and there may still be long term gains if, for example, some farmers find something better to do than growing cocoa. But the basic point is that the scaled-up effect in this case is opposite in sign to the trial effect. The problem here is not with the trial results, which can be usefully incorporated into a more comprehensive market model that incorporates the responses estimated by the trial. The problem arises only if we assume that the aggregate looks like the individual. That other ingredients of the aggregate model must come from observational studies should not be a criticism, even for those who favor RCTs; it is simply the price of doing serious analysis.
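
The sign reversal can be illustrated with a deliberately simple market model. The constant-elasticity demand curve, the elasticity of 0.5, and the 20 percent yield gain are all assumptions chosen for illustration, not estimates for cocoa.

```python
def market_price(total_output, elasticity, scale=100.0):
    """Inverse of a constant-elasticity demand curve Q = (P / scale) ** (-elasticity)."""
    return scale * total_output ** (-1.0 / elasticity)


elasticity = 0.5   # assumed: demand is price inelastic (below 1 in absolute value)
yield_gain = 1.2   # assumed: the new method raises each farmer's output by 20 percent

# Output is normalized so that baseline output per farmer and in total is 1.
baseline_price = market_price(1.0, elasticity)
baseline_income = baseline_price * 1.0

# In the trial only a few farmers are treated, so the market price does not move.
trial_income = baseline_price * yield_gain

# At scale every farmer produces more, so the price falls along the demand curve.
scaled_price = market_price(yield_gain, elasticity)
scaled_income = scaled_price * yield_gain

print(baseline_income, trial_income, round(scaled_income, 2))
```

With these assumed numbers, treated farmers in the trial gain income while the same yield gain applied to everyone lowers it. The trial estimate of the yield response is still what feeds the market model; only the demand side has to come from elsewhere.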

There are many possible interventions that alter supply or demand whose effect, in aggregate, will change a price or a wage that is held constant in the original RCT. Education will change the supplies of skilled versus unskilled labor, with implications for relative wage rates. Conditional cash transfers increase the demand for (and perhaps supply of) schools and clinics, which will change prices or waiting lines, or both. There are interactions between people that will operate only at scale. Giving one child a voucher to go to private school might improve her future, but doing so for everyone can decrease the quality of education for those children who are left in the public schools, see the contrasting studies of Angrist et al (1999) and Hsieh and Urquiola (2002). Educational or training programs may benefit those who are treated, but harm those left behind; if the control group is selected from the latter, the RCT may generate a positive result in spite of hurting some and helping none; Crépon et al (2014) recognize the issue and show how to adapt an RCT to deal with it.
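
A small arithmetic sketch shows how a treatment-control difference can be positive even when no one is helped. The outcome scores are invented for illustration; the only point is the decomposition of the measured difference.

```python
# Hypothetical average outcome scores (for example, on a 0-100 scale).
baseline_score = 50   # what everyone would score with no program at all
treated_score = 50    # assumed: the program does nothing for the treated
control_score = 40    # assumed: those left behind are harmed by the program

measured_effect = treated_score - control_score          # what the RCT reports
true_effect_on_treated = treated_score - baseline_score  # actual gain to the treated
spillover_on_controls = control_score - baseline_score   # harm to the controls

print(measured_effect, true_effect_on_treated, spillover_on_controls)
```

The reported effect is the true effect on the treated minus the spillover on the controls, so a negative spillover is indistinguishable from a benefit unless the design includes a comparison group that the program cannot reach.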

Scaling up can also disturb the political equilibrium. An exploitative government may not allow the mass transfer of money from abroad to a powerless segment of the population, though it may permit a small-scale RCT of cash transfers. Provision of healthcare by foreign NGOs may be successful in trials, but have unintended negative consequences at scale because of general equilibrium effects on the supply of healthcare personnel, or because it disturbs the nature of the contract between the people and a government that is using tax revenue to provide services. In India, the government spends large sums on food subsidies through a system (the PDS) that is both corrupt and inefficient, with much of the grain that is procured failing to find its way to the intended beneficiaries. Localized RCTs on whether or not families are better off with cash transfers are not informative about how politicians would change the amount of the transfer if faced with unanticipated inflation, and at least as important, whether the government could cut procurement from relatively wealthy and politically powerful farmers. Without a political and general equilibrium analysis, it is impossible to think about the effects of replacing food subsidies with cash transfers, see e.g. Basu (2010).

Even in medicine, where biological interactions between people are less common than are social interactions in social science, interactions can be important; infectious diseases are an example, and immunization programs affect the dynamics of disease transmission through herd immunity, so that the effects on an individual depend on how many others are vaccinated, Fine and Clarkson (1986), Manski (2013, p52). The usual, if seldom correct, conception of an RCT in medicine is of a biological process (for example, the administration of aspirin after a heart attack) where the effect is thought to be similar across individuals, and where there are no interactions. Yet even here, the social and economic setting affects how drugs are actually used and the same issues can arise; the distinction between efficacy and effectiveness in clinical trials is in part recognition of the fact.

2.7 Drilling down: using the average for individuals

Just as there are issues with scaling-up, it is not obvious how to use the results from RCTs at the level of individual units, even individual units that were actually (or potentially) included in the trial. A well-conducted RCT delivers an average treatment effect for a well-defined population but, in general, that average does not apply to everyone. It is not true, for example, as argued in JAMA’s “Users’ guide to the medical literature” that “if the patient would have been enrolled in the study had she been there, that is she meets all of the inclusion criteria and doesn’t violate any of the exclusion criteria, there is little question that the results are applicable,” Guyatt et al (1994). Even more misleading are the often-heard statements that an RCT with an average treatment effect insignificantly different from zero has shown that the treatment works for no one, though such a conclusion would be better supported by a Fisher randomization test.

These issues are familiar to physicians practicing evidence-based medicine whose guidelines require “integrating individual clinical expertise with the best available external clinical evidence from systematic research,” Sackett et al (1996). Exactly what this means is unclear; physicians know much more about their patients than is allowed for in the ATE from the RCT (though, once again, stratification in the trial is likely to be helpful) and they often have intuitive expertise from long practice that they rely on to help them identify features in a particular patient that are likely to affect the effectiveness of a given treatment for that patient. But there is an odd balance being struck here. These judgments are deemed admissible in dealing with the individual patient, at least for discussion with the patient as possible considerations, but they don’t add up to evidence to be made publicly available, with the usual cautions about credibility, by the standards adopted by most EBM sites. It is also true that physicians can have prejudices and “knowledge” that might be anything but. Clearly, there are situations where forcing practitioners to follow the average will do better, even for individual patients, and others where the opposite is true, see Kahneman and Klein (2009).

Whether or not averages are useful to individuals raises the same issue in social science research. Imagine two schools, St Joseph’s and St Mary’s, both of which were included in an RCT of a classroom innovation, or at least were eligible to be so. The innovation is successful on average, but should the schools adopt it? Should St Mary’s be influenced by a previous attempt in St Joseph’s that was judged a failure? Many would dismiss this experience as anecdotal and ask how St Joseph’s could have known that it was a failure without benefit of “rigorous” evidence. Yet if St Mary’s is like St Joseph’s, with a similar mix of pupils, a similar curriculum, and similar academic standing, might not St Joseph’s experience be more relevant to what might happen at St Mary’s than is the positive average from the RCT? And might it not be a good idea for the teachers and governors of St Mary’s to go to St Joseph’s and find out what happened and why? They may be able to observe the mechanism of the failure, if such it was, and figure out whether the same problems would apply for them, or whether they might be able to adapt the innovation to make it work for them, perhaps even more successfully than the positive average in the trial.

Once again, these questions are unlikely to be simply answered in practice; but, as with transportability, there is no serious alternative to trying. Assuming that the average works for you will often be wrong, and it will at least sometimes be possible to do better. As in the medical case, the advice to individual schools often lacks specificity. For example, the US Institute of Education Sciences has provided a “user-friendly” guide to practices supported by rigorous evidence, US Department of Education (2003). The advice, which is very similar to recommendations in development economics, is that the intervention be demonstrated effective through well-designed RCTs in more than one site of implementation, and that “the trials should demonstrate the intervention’s effectiveness in school settings similar to yours” (2003, p.17). No operational definition of “similar” is provided.

We note finally that these caveats, which apply to individuals (or schools) even if they were in the trial, provide another reason why the concept of “external” validity is unhelpful. The real issue is how to use the findings of a trial in new settings, including settings included in the trial; external validity in the sense of invariance of the ATE emphasizes simple replication, which guarantees nothing, while ignoring the possibility that lack of replication can be a key to understanding.

2.8 Examples and illustrations from economics

Our arguments in this Section should not be controversial, yet we believe that they represent an approach that is different from most current practice. To document this and to fill out the arguments, we provide some examples. While these are occasionally critical, our purpose is constructive; indeed, we believe that misunderstandings about how to use RCTs have artificially limited their usefulness, as well as alienated some who would otherwise use them.

Conditional cash transfers (CCTs) are interventions that have been tested using RCTs (and other RCT-like methods) and are often cited as a leading example of how an evaluation with strong internal validity leads to a rapid spread of the policy, e.g. Angrist and Pischke (2010) among many others. Think through the causal chain that is required for CCTs to be successful: people must like money, they must like (or at least not object too much to) their children being educated and vaccinated, there must exist schools and clinics that are close enough and well enough staffed to do their job, and the government or agency that is running the scheme must care about the wellbeing of families and their children. That such conditions hold in a wide range of (although certainly not all) countries makes it unsurprising that CCTs “work” in many replications, though they certainly will not work in places where the schools and clinics do not exist, Levy (2001), nor in places where people strongly oppose education or vaccination.

Similarly, given that the helping factors will operate with different strengths and effectiveness in different places, it is also not surprising that the size of the ATE differs from place to place; for example, Vivalt’s AidGrade website lists 29 estimates from a range of countries of the standardized (divided by local standard deviation of the outcome) effects of conditional cash transfers on school attendance; all but four show the expected positive effect, and the range runs from -8 to +38 percentage points. Even in this leading case, where we might reasonably conclude that CCTs “work” in getting children into school, it would be hard to calculate credible cost-effectiveness numbers, or to come to a general conclusion about whether CCTs are more or less cost effective than other possible policies. Both costs and effect sizes can be expected to differ in new settings, just as they have in observed ones, making these predictions difficult.
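
One reason standardized effects are hard to compare across sites is that the standardization itself is local. The sketch below uses invented numbers: the same raw effect, divided by two different local standard deviations, gives two different standardized effects.

```python
def standardized_effect(raw_effect, outcome_sd):
    """Effect size expressed in units of the local standard deviation of the outcome."""
    return raw_effect / outcome_sd


# Hypothetical sites with an identical raw effect but different outcome dispersion.
site_a = standardized_effect(5.0, 10.0)
site_b = standardized_effect(5.0, 25.0)
print(site_a, site_b)
```

So variation in standardized estimates can reflect differences in the dispersion of outcomes across places as well as differences in what the program does.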

The range of estimates illustrates that the simple view of external validity (that the ATE should transport from one place to another) is not well defined. AidGrade uses standardized measures of effect size divided by standard deviation of outcome at baseline, as does the major multi-country study by Banerjee et al (2015). But we might prefer measures that have an economic interpretation, such as additional months of schooling per $100 spent (for example if a donor is trying to decide where to spend, see below). Nutrition might be measured by height, or by the log of height. Even if the ATE by one measure carries across, it will only do so using another measure if the relationship between the two measures is the same in both situations. This is exactly the sort of thing that a formal analysis of transportability forces us to think about. (Note also that ATE in the original RCT can differ depending on whether the outcome is measured in levels or in logs; the two ATEs could even have different signs.)
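
The possibility of opposite signs in levels and in logs is easy to verify with a toy example. The four outcome values are invented; the difference in group means stands in for the estimated ATE.

```python
import math


def mean(values):
    return sum(values) / len(values)


# Hypothetical outcomes: treatment raises the mean but spreads outcomes out.
control = [4.0, 4.0]
treated = [1.0, 9.0]

ate_levels = mean(treated) - mean(control)
ate_logs = mean([math.log(y) for y in treated]) - mean([math.log(y) for y in control])

print(ate_levels, round(ate_logs, 3))
```

Because the logarithm is concave, a treatment that raises the average while increasing dispersion can lower the average of the logs, so the choice of scale is itself a substantive decision.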

Deworming is surely more complicated than conditional cash transfers though not because anyone disputes the desirability of removing parasitical worms or the biological efficacy of the medicines, at least if they are repeatedly and effectively administered; that is the part of the causal process that is transportable from one place to another. Yet nutritional or school attendance outcomes depend on reinfection from one person to another, which depends on local customs about defecation (which vary from place to place and are subject to religious and cultural factors), particularly on the extent of open defecation and the density of population, on whether or not children wear shoes, and on the availability and use of public and private sanitation; this last was crucial in the elimination of hookworm in the southern states of the U.S. according to Stiles (1939). Temperature may also be important; indeed, such “macro” variables are likely to be important in a wide range of medical, employment, and production trials, Rosenzweig and Udry (2016). There are two prominent positive studies in the economics literature, one in Kenya, Kremer and Miguel (2000) and one in India, Bobonis, Miguel and Puri-Sharma (2006); these are often cited as examples of the power of RCTs to come up with the “right” answer, for example by Karlan and Appel (2008). Yet the Cochrane Collaboration review of deworming and schooling, Taylor-Robinson et al (2015), which reviews one trial (from India) covering more than a million participants, and 44 others covering 67,672 participants, including Kremer and Miguel (2004), concludes that there is “substantial evidence” that deworming shows no benefit in nutritional status, hemoglobin, cognition, school performance or death. The validity of this meta-analysis is disputed by Croke et al (2016). A replication, Aiken et al (2015), and reanalysis (using different methods) of Miguel and Kremer’s original data by Davey et al (2015) concluded that the study “provided some evidence, but with high risk of bias,” provoking a lengthy exchange, Hicks et al (2015) and Hargreaves et al (2015). Most of the differences in results come from different methodological choices, themselves largely based on disciplinary traditions, rather than from the effects of mistakes or errors. In an impressive and clear reanalysis, Humphreys (2015) argues that one puzzling feature of Miguel and Kremer’s results is the absence of any clear effect of deworming on health, as was the case in the large Indian RCT. Yet the effects of deworming on education, which are the main target of the paper, presumably work through health, so that the absence of health effects (a failure of expected mediators) is a puzzle, see also Miguel, Kremer and Hicks (2015), and Ahuja et al (2015). Recall too our earlier discussion of the difficulty of interpreting the standard errors of the original study in the absence of randomization.

It is not our purpose here to try to adjudicate these competing claims but rather to relate this work to our general argument. First, it is not clear that there is a right answer to be discovered; given the causal chains involved, deworming might be helpful in one place but unhelpful in another. Yet the focus of the debate is almost entirely on internal validity, on whether the original studies were correctly done. The Cochrane review, in line with this, and in line with much meta-analysis of trials, seems to suppose that there is a single effect to be uncovered that, once established, will be invariant to local and environmental differences. External validity, it seems, is implied by internal validity. Indeed, Chalmers, one of the founders of the Cochrane Collaboration, has explicitly argued (in response to one of us) that, in the absence of strong reasons to the contrary, results should be taken as applicable everywhere, Pettigrew and Chalmers (2011).

Second, the debate makes it clear that the practice of RCTs in economic development has done little to fulfill the original promise that their simplicity (how hard is it to subtract one mean from another?) would dispose of the methodological and econometric disputes that characterize so many observational studies and were thought to be one of their main flaws. While RCTs tend to take some contentious issues of identification off the table, they leave much to be disputed, including the handling of factors that interact with treatment effects, the appropriate level of randomization, the calculation of standard errors, the choice of outcome measure, the inclusion criteria for the sample, placebo and Hawthorne effects, and much more. The claim that RCTs cut through the usual econometric disputes to deliver to policymakers a simple, convincing, and easily understood answer is simply false. The deworming debates are perhaps the leading illustration.

Much of the development literature, like the medical literature, works with the view of external validity that, unless there is evidence to the contrary, the direction and size of treatment effects can be transported from one place to another. The J-PAL website reports its findings under a general head of policy relevance, subdivided by a selection of topics. Under each topic, there is a list of relevant RCTs from a range of different settings around the world. These are conveniently converted into a common cost-effectiveness measure so that, for example, under ‘education’, subhead ‘student participation’, there are four studies from Africa: one on informing parents about the returns to education in Madagascar, and three from Kenya, on deworming, on school uniforms, and on merit scholarships. The units of measurement are additional years of student education per $100, and among these four studies, the average effect sizes of spending $100 are 20.7 years, 13.9 years, 0.71 years and 0.27 years respectively. (Note that this is a different, and superior, standardization from the effect size standardization discussed above.)

What can we conclude from such comparisons? For a philanthropic donor interested in education, and if marginal and average effects are the same, they might indicate that the best place to devote a marginal dollar is in Madagascar, where it would be used to inform parents about the value of education. This is certainly useful, but it is not as useful as statements that information or deworming programs are everywhere more cost-effective than programs involving school uniforms or scholarships, or if not everywhere, at least over some domain, and it is these second kinds of comparison that would genuinely fulfill the promise of “finding out what works.” But such comparisons only make sense if we can transport the results from one place to another, if the Kenyan results also hold in Madagascar, Mali, or Namibia, or some other list of African or non-African places. J-PAL’s manual for cost-effectiveness, Dhaliwal et al (2012), explains in (entirely appropriate) detail how to handle variation in costs across sites, noting variable factors such as population density, prices, exchange rates, discount rates, inflation, and bulk discounts. But it gives short shrift to cross-site variation in the size of average treatment effects, which play an equal part in the calculations of cost effectiveness. The manual briefly notes that diminishing returns (or the “last-mile” problem) might be important in theory, but argues that the baseline levels of outcomes are likely to be similar in the pilot and replication areas, so that the average treatment effect can be safely transported as is. All of this lacks a justification for transportability, some understanding of when results transport, when they do not, or better still, how they should be modified to make them transportable.
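
The arithmetic behind the point about costs and treatment effects can be made concrete with a minimal sketch. All numbers below are hypothetical and are not taken from J-PAL or from any trial; the function name and its arguments are our own illustrative choices. The sketch only shows that the measure (additional years of schooling per $100) is the ratio of a site-specific average treatment effect to a site-specific cost, so that a halving of the effect moves the measure exactly as much as a doubling of the cost.

```python
def years_per_100_dollars(ate_years_per_child: float, cost_per_child: float) -> float:
    """Additional years of schooling bought by $100 of spending.

    ate_years_per_child: average treatment effect at this site (years per child).
    cost_per_child: program cost at this site (dollars per child).
    """
    if cost_per_child <= 0:
        raise ValueError("cost_per_child must be positive")
    return 100.0 * ate_years_per_child / cost_per_child


# Hypothetical pilot site (illustrative numbers only).
pilot = years_per_100_dollars(ate_years_per_child=0.10, cost_per_child=2.00)

# Hypothetical replication site A: same effect, but the cost per child doubles.
site_a = years_per_100_dollars(ate_years_per_child=0.10, cost_per_child=4.00)

# Hypothetical replication site B: same cost, but the effect halves.
site_b = years_per_100_dollars(ate_years_per_child=0.05, cost_per_child=2.00)

print(pilot, site_a, site_b)
```

Adjusting the denominator carefully for local prices while carrying the numerator over unchanged therefore corrects only half of the calculation.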

One of the largest and most technically impressive of the development RCTs is by Banerjee et al (2015), which tests a “graduation” program designed to permanently lift extremely poor people from poverty by providing them with a gift of a productive asset (from guinea-pigs, (regular-) pigs, sheep, goats, or chickens depending on locale), training and support, life skills coaching, as well as support for consumption, saving, and health services; the idea is that this package of aid can help people break out of poverty traps in a way that would not be possible with one intervention at a time. Comparable versions of the program were tested in Ethiopia, Ghana, Honduras, India, Pakistan, and Peru and, excepting Honduras (where the chickens died), the trials find largely positive and persistent effects, with similar (standardized) effect sizes, for a range of outcomes (economic, mental and physical health, and female empowerment). One site apart, essentially everyone accepted their assignment, so that many of the familiar caveats do not apply. Replication of positive ATEs over such a wide range of places certainly provides proof of concept for such a scheme. Yet Bauchet, Morduch, and Ravi (2015) fail to replicate the result in South India, where the control group got access to much the same benefits, what Heckman, Hohman, and Smith (2000) call ‘substitution bias’. Even so, the results are important because, although there is a longstanding interest in poverty traps, many economists have long been skeptical of their existence or that they could be sprung by such aid-based policies. In this sense, the study is an important contribution to the theory of economic development; it tests a theoretical proposition and will (or should) change minds about it.

A number of difficulties remain. As the authors note, such trials cannot tell us which component of the treatment accounted for the results, or which might be dispensable (a much more expensive multifactorial trial would be required), though in practice the costliest component, the repeated visits for training and support, is likely to be the first to be cut by cash-strapped politicians or administrators. And as noted, it is unclear what should count as (simple) replication in international comparisons; it is hard to think of the uses of standardized effect sizes, except to document that effects exist everywhere and that they are similarly large relative to local variation in such things.

The effect size, the average treatment effect expressed in numbers of standard deviations of the original outcome, though conveniently dimensionless, has little to recommend it. As with much of RCT practice, it strips out any economic content (no rates of return, or benefits minus costs) and it removes any discipline on what is being compared. Apples and oranges become immediately comparable, as do treatments whose inclusion in a meta-analysis is limited only by the imagination of the analysts in claiming similarity. In psychology, where the concept originated, there are endless disputes about what should and should not be pooled in a meta-analysis. Beyond that, as argued by Simpson (2016), restrictions on the trial sample (often good practice to reduce background noise and to help detect an effect) will reduce the baseline standard deviation and inflate the effect size. More generally, effect sizes are open to manipulation by exclusion rules. It makes no sense to claim replicability on the basis of effect sizes, let alone to use them to rank projects.
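
The mechanics of this inflation can be illustrated with a small simulation. It is a sketch under stated assumptions, not a reconstruction of any actual trial: baseline outcomes are drawn from a hypothetical normal distribution, the treatment adds a hypothetical constant `TRUE_ATE` to every treated unit, and the exclusion rule (keep only units with baseline between 8 and 12) is invented for illustration. The average treatment effect is the same in both samples; only the baseline standard deviation, and hence the effect size, changes.

```python
import random
import statistics


def effect_size(treated, control):
    """ATE estimate divided by the control-group standard deviation."""
    ate = statistics.mean(treated) - statistics.mean(control)
    return ate / statistics.stdev(control)


rng = random.Random(0)
TRUE_ATE = 1.0  # hypothetical constant additive treatment effect

# Hypothetical baseline outcomes, then random assignment to two arms.
baseline = [rng.gauss(10.0, 4.0) for _ in range(20000)]
treat_flags = [rng.random() < 0.5 for _ in baseline]


def run_trial(keep):
    """Effect size in the subsample whose baseline value satisfies keep()."""
    treated = [y + TRUE_ATE for y, t in zip(baseline, treat_flags) if t and keep(y)]
    control = [y for y, t in zip(baseline, treat_flags) if not t and keep(y)]
    return effect_size(treated, control)


# No exclusion rule: everyone is eligible.
full = run_trial(lambda y: True)

# Hypothetical exclusion rule: only units near the middle of the distribution.
restricted = run_trial(lambda y: 8.0 <= y <= 12.0)

print(f"effect size, full sample:       {full:.2f}")
print(f"effect size, restricted sample: {restricted:.2f}")
```

Two trials of the identical treatment would thus report very different effect sizes purely because of their inclusion criteria.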

The graduation study can be taken as the closest to fulfilling the “finding out what works” aim of the RCT movement in development. Yet it is silent on perhaps the crucial aspect for policy, which is that the trial was run entirely in partnership with NGOs, whereas what we would like to know is whether it could be replicated by governments, including those governments that are incapable of getting doctors, nurses, and teachers to show up to clinics or schools, Chaudhury et al (2005), Banerjee, Deaton and Duflo (2004), or of regulating the quality of medical care in either the public or private sectors, Filmer, Hammer and Pritchett (2000) or Das and Hammer (2005). In fact, we already know a great deal about “what works.” Vaccinations work, maternal and child healthcare services work, and classroom teaching works. Yet knowing this does not get those things done. Adding another program that works under ideal conditions is useful only where such conditions exist, and that would likely be unnecessary when they exist. Finding out what works is not the magic key to economic development. Technical knowledge, though always worth having, requires suitable institutions if it is to do any good.

A similar point is documented in the contrast between a successful trial that used cameras and threats of wage reductions to incentivize attendance of teachers in schools run by an NGO in Rajasthan in India, Duflo, Hanna, and Ryan (2012), and the subsequent failure of a follow-up program in the same state to tackle mass absenteeism of health workers, Banerjee, Duflo, and Glennerster (2008). In the schools, the cameras and timekeeping worked as intended, and teacher attendance increased. In the clinics, there was a short-run effect on nurse attendance, but it was quickly eliminated. (The ability of agents eventually to undermine policies that are initially effective is common enough and not easily handled within an RCT.) In both trials, there were incentives to improve attendance, and there were incentives to find a way to sabotage the monitoring and restore workers to their accustomed positions; the force of these incentives is a “high-level” cause, like gravity, or the principle of the lever, that works in much the same way everywhere. For the clinics, some sabotage was direct (the smashing of cameras) and some was subtler, when government supervisors provided official, though essentially specious, reasons for missing work. We can only conjecture why the causality was switched in the move from NGO to government; we suspect that working for a highly-respected local NGO is a different contract from working for the government, where not showing up for work is widely (if informally) understood to be part of the deal. The incentive lever works when it is wired up right, as with the NGOs, but not when the wiring cuts it out, as with the government. Knowing “what works” in the sense of the treatment effect on the trial population is of limited value without understanding the political and institutional environment in which it is set. This underlines the need to understand the underlying social, economic, and cultural structures, including the incentives and agency problems that inhibit service delivery, that are required to support the causal pathways that we should like to see at work.

Trials in economic development are susceptible to the critique that they take place in artificial environments. Drèze (2016) notes, based on extensive experience in India, “when a foreign agency comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’ whether through a local NGO or government or whatever, there is a lot going on other than the treatment.” There is also the suspicion that a treatment that works does so because of the presence of the “treators,” often from abroad, rather than because of the people who will be called to work it in reality.

There is also much to be learned from many years of economic trials in the United States, particularly from the work of the Manpower Demonstration Research Corporation (now known by its initials MDRC), from the early income tax trials, as well as from the RAND Health Experiment. Following the income tax trials, MDRC has run many randomized trials since the 1970s, mostly for the Federal government but also for individual states and for Canada; see the thorough and informative account by Gueron and Rolston (2013) for the factual information underlying the following discussion. MDRC’s program, like that of J-PAL in development, is intended to find out “what works” in the state and federal welfare programs. These programs are conditional cash transfers in which poor recipients are given cash provided they satisfy certain conditions, which are often the subject of the trial. Should there be work requirements? Should there be remedial education before work requirements? What are the benefits and costs of various alternatives, both to the recipients and to the local and federal taxpayers? All of these programs are deeply politicized, with sharply different views over both facts and desirability. Many engaged in these disputes feel certain of what should be done and what its consequences will be so that, by their lights, control groups are unethical because they deprive some people of what the advocates “know” will be certain benefits. Given this, it is perhaps surprising that RCTs have become the accepted norm for this kind of policy evaluation in the US.

The reason owes much to political institutions, as well as to the common faith that RCTs can reveal the truth. At the Federal level, prospective policies are vetted by the non-partisan Congressional Budget Office, which makes its own estimates of the budgetary implications of the program. Ideologues whose programs score poorly by the CBO have an incentive to support an RCT, not to convince themselves, but to convince their opponents; once again, RCTs are especially valuable when your opponents do not share your prior. And control groups are easier to put in place when there are insufficient funds to cover the whole population. There was also a widespread and largely uncritical belief that RCTs always give the right answer, at least for the budgetary implications, which, rather than the wellbeing of the recipients, were often the primary (and indeed sometimes the only) concern; note that all of these trials are on poor people by rich people who are typically more concerned with cost than with the wellbeing of the poor, Greenberg, Shroder and Onstott (1999). MDRC’s trials could therefore be effective dispute reconciliation mechanisms both for those who saw the need for evidence and for those who did not (except instrumentally). Note that the outcome here fits with our “public health” case; what the politicians need to know is not the outcomes for individuals, or even how the outcomes in one state might transport to another, but the average budgetary cost in a specific place for each poor person treated, something that a good RCT conducted on a representative sample of the target population is equipped to deliver, at least in the absence of general equilibrium effects, timing effects, etc.

These RCTs by MDRC and other contractors deserve much credit. They have demonstrated both the feasibility of large-scale social trials, including the possibility of randomization in these settings (where many participants were hostile to the idea), and their usefulness to policymakers. They also seem to have changed beliefs, for example in favor of the desirability of work requirements as a condition of welfare, even among many of those who were originally opposed. There are also limitations; the trials appear to have had at best a limited influence on scientific thinking about behavior in labor markets. The results of similar programs have often been different across different sites, and there has to date been no firm understanding of why; indeed, the trials are not designed to reveal this, Moffitt (2004). Finally, and perhaps crucially for the potential contribution to economic science, there has been little success in understanding either the underlying structures or chains of causation, in spite of a determined effort from the very beginning to peer into the black boxes. Without such mechanisms, transportability is always in doubt, it is impossible for policymakers or academics to purposively improve the policies, and the contributions to cumulative science are severely limited.

The RAND health experiment, Manning et al (1988a, b), provides a different but equally instructive story, if only because its results have permeated the academic and policy discussions about healthcare ever since. It was originally designed to test the question of whether more generous insurance would cause people to use more medical care and, if so, by how much. The incentive effects are hardly in doubt today; the immortality of the study comes rather from the fact that its multi-arm (response surface) design allowed the calculation of an elasticity for the study population of –0.1 to –0.2, that is, medical expenditures fell by 0.1 to 0.2 percent for every one percent increase in the copayment. According to Aron-Dine, Einav, and Finkelstein (2013), it is this dimensionless and thus apparently transportable number that has been used ever since to discuss the design of healthcare policy; the elasticity has come to be treated as a universal constant. Ironically, they argue that the estimate cannot be replicated in recent studies, and it is even unclear that it is firmly based on the original evidence. This account points, once again, to the central importance of transportability for the usefulness, and especially the long-term usefulness, of a trial. Here, the simple direct transportability of the result seems to have been largely illusory though, as we have argued, this does not mean that more complex constructions based on the results of the trial would not have done better.
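
What it means to treat the elasticity as a universal constant can be shown in a few lines. The sketch below is ours, not the RAND analysts’: it assumes a constant-elasticity functional form (spending proportional to the copayment rate raised to the power of the elasticity), and the baseline spending of 1000 and the copayment rates of 25 and 50 percent are hypothetical. It simply carries the two ends of the reported range, –0.1 and –0.2, into a new setting without any adjustment.

```python
def predicted_spending(base_spending, base_copay, new_copay, elasticity):
    """Constant-elasticity prediction: spending scales as copay ** elasticity.

    The functional form is an assumption made for illustration only.
    """
    if base_copay <= 0 or new_copay <= 0:
        raise ValueError("copayment rates must be positive")
    return base_spending * (new_copay / base_copay) ** elasticity


# Hypothetical policy change: the copayment rate doubles from 25% to 50%.
for eps in (-0.1, -0.2):
    after = predicted_spending(1000.0, 0.25, 0.50, eps)
    print(f"elasticity {eps}: predicted spending {after:.0f}")
```

Every such prediction inherits the unexamined assumption that the same number applies in the new population, at the new copayment levels, and under the new insurance design.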

Conclusions

RCTs are the ultimate in credible estimation of average treatment effects in the population being studied because they make so few assumptions about heterogeneity, causal structure, choice of variables, and functional form. They are truly nonparametric. And indeed, this is sometimes just what we want, particularly where we have little credible prior information. RCTs are often convenient ways to introduce experimenter-controlled variance (if you want to see what happens, then kick it and see, twist the lion’s tail) but note that many experiments, including many of the most important (and Nobel Prize winning) experiments in economics, do not and did not use randomization, Harrison (2013), Svorencik (2015). But the credibility of the results, even internally, can be undermined by excessive heterogeneity in responses, and especially when the distribution of effects is asymmetric, where inference on means can be hazardous. Ironically, the price of the credibility in RCTs is that all we get are means. Yet, in the presence of outliers, means themselves do not provide the basis for reliable inference. And randomization in and of itself does nothing unless the details are right; purposive selection into the experimental population, like purposive selection into and out of assignment, undermines inference in just the same way as does selection in observational studies. Lack of blinding, whether of participants, trialists, data collectors, or analysts, undermines inference by permitting factors other than the treatment to affect the outcome, akin to a failure of exclusion restrictions in instrumental variable analysis.

The lack of structure can become seriously disabling when we try to use RCT results outside of a few contexts, such as program evaluation, hypothesis testing, or establishing proof of concept. Beyond that, we are in trouble. We cannot use the results to help make predictions elsewhere without more structure, without more prior information, and without having some idea of what makes treatment effects vary from place to place, or time to time. There is no option but to commit to some causal structure if we are to know how to use RCT evidence elsewhere, or to use the estimates out of the original context. Simple generalization and simple extrapolation just do not cut the mustard. This is true of any study, experimental or observational. But observational studies are familiar with, and routinely work with, the sort of assumptions that RCTs claim to avoid, so that if the aim is to use empirical evidence, any credibility advantage that RCTs have in estimation is no longer operative.

Yet once that commitment has been made, RCT evidence can be extremely useful, pinning down part of a structure, helping to build stronger understanding and knowledge, and helping to assess welfare consequences. As our examples show, this can often be done without committing to the full complexity of what are often thought of as structural models. Yet without the structure that allows us to place RCT results in context, or to understand the mechanisms behind those results, not only can we not transport whether “it works” elsewhere, but we cannot do the standard stuff of economics, which is to say whether or not the intervention is actually welfare improving; see Harrison (2014) for a vivid account that sharply identifies this and other issues. Without knowing why things happen and why people do things, we run the risk of worthless casual (“fairy story”) causal theorizing and have essentially given up on one of the central tasks of economics.

We must back away from the refusal to theorize, from the exultation in our ability to handle unlimited heterogeneity, and actually SAY something. Perhaps paradoxically, unless we are prepared to make assumptions, and to say what we know, making statements that will be incredible to some, all the credibility of the RCT is for naught.

In the specific context of development that has concerned us here, RCTs have proven their worth in providing proofs of concept and in testing predictions that some policies must always work or can never work. But, as elsewhere in economics, we cannot find out why something works by simply demonstrating that it does work, no matter how often, which leaves us uninformed as to whether the policy should be implemented. Beyond that, small scale, demonstration RCTs are not capable of telling us what would happen if these policies were implemented to scale, of capturing unintended consequences that typically cannot be included in the protocols, or of modeling what will happen if schemes are implemented by governments, whose motives and operating principles are different from the NGOs who typically run trials. While it is true that abstract knowledge is always likely to be beneficial to economic development, successful development depends on institutions and on politics, matters on which RCTs have little to say. In the end, RCTs are one of the many external technical fixes that have meandered off and on the development stage since the Second World War, including building infrastructure, getting prices right, and service delivery, none of which have faced up to the essential domestic political foundations for development.

Citations

Ahuja, Amrita, Sarah Baird, Joan Hamory Hicks, Michael Kremer, Edward Miguel, and Shawn Powers, 2015, “When should governments subsidize health? The case of mass deworming,” World Bank Economic Review, 29, S9–S24.

Aigner, Dennis J., 1985, “The residential electricity time-of-use pricing experiments. What have we learned?” in David A. Wise and Jerry A. Hausman, Social experimentation, Chicago, IL. Chicago University Press for National Bureau of Economic Research, 11–54.

Aiken, Alexander M., Calum Davey, James R. Hargreaves and Richard J. Hayes, 2015, “Re-analysis of health and educational impacts of a school-based deworming programme in western Kenya: a pure replication,” International Journal of Epidemiology, 0(0), 1–9.

Al-Ubaydli, Omar, and John A. List, 2013, “On the generalizability of experimental results in economics,” in G. Frechette and A. Schotter, Methods of modern experimental economics, Oxford University Press.

Altman, Douglas G., 1985, “Comparability of randomized groups,” Journal of the Royal Statistical Society, Series D (The Statistician), 34(1), Statistics in health, 125–36.

Angrist, Joshua D., 2004, “Treatment effect heterogeneity in theory and practice,” Economic Journal, 114, C52–C83.

Angrist, Joshua D., Eric Bettinger, Erik Bloom, Elizabeth King and Michael Kremer, 2002, “Vouchers for private schooling in Colombia: evidence from a randomized natural experiment,” American Economic Review, 92(5), 1535–58.

Angrist, Joshua D., and Jörn-Steffen Pischke, 2010, “The credibility revolution in empirical economics: how better research design is taking the con out of econometrics,” Journal of Economic Perspectives, 24(2), 3–30.

Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein, 2013, “The RAND health insurance experiment, three decades later,” Journal of Economic Perspectives, 27(1), 197–222.

Arrow, Kenneth J., 1975, “Two notes on inferring long run behavior from social experiments,” Document No. P-5546, Santa Monica, CA. Rand Corporation.

Ashenfelter, Orley, 1978, “Estimating the effect of training programs on earnings,” Review of Economics and Statistics, 60(1), 47–57.

Ashenfelter, Orley, 1978, “The labor supply response of wage earners,” in John L. Palmer and Joseph A. Pechman, eds., Welfare in rural areas: the North Carolina–Iowa Income Maintenance Experiment, Washington, DC. The Brookings Institution. 109–38.

Attanasio, Orazio, Costas Meghir, and Ana Santiago, 2012, “Education choices in Mexico: using a structural model and a randomized experiment to evaluate PROGRESA,” Review of Economic Studies, 79(1), 37–66.

Attanasio, Orazio, Sarah Cattan, Emla Fitzsimons, Costas Meghir, and Marta Rubio Codina, 2015, “Estimating the production function for human capital: results from a randomized controlled trial in Colombia,” London. Institute for Fiscal Studies, Working Paper no W15/06.

Bahadur, R. R., and Leonard J. Savage, 1956, “The non-existence of certain statistical procedures in nonparametric problems,” Annals of Mathematical Statistics, 25: 1115–22.

Banerjee, Abhijit, Sylvain Chassang, Sergio Montero, and Erik Snowberg, 2016, “A theory of experimenters,” processed, July 2016.

Banerjee, Abhijit, Sylvain Chassang, and Erik Snowberg, 2016, “Decision theoretic approaches to experiment design and external validity,” Cambridge, MA. NBER Working Paper no 22167, April.

Banerjee, Abhijit, Angus Deaton, and Esther Duflo, 2004, “Health care delivery in rural Rajasthan,” Economic and Political Weekly, 39(9), 944–9.

Banerjee, Abhijit, and Esther Duflo, 2012, Poor economics: a radical rethinking of the way to fight global poverty, PublicAffairs.

Banerjee, Abhijit, Esther Duflo, Nathanael Goldberg, Dean Karlan, Robert Osei, William Parienté, Jeremy Shapiro, Bram Thuysbaert, and Christopher Udry, 2015, “A multifaceted program causes lasting progress for the very poor: evidence from six countries,” Science, 348(6236), 1260799.

Banerjee, Abhijit, Esther Duflo, and Rachel Glennerster, 2008, “Putting a band-aid on a corpse: incentives for nurses in the Indian public health care system,” Journal of the European Economic Association, 6(2–3), 487–500.

Banerjee, Abhijit V., and Ruimin He, 2003, “The World Bank of the future,” American Economic Review, 93(2), 39–44.

Bauchet, Jonathan, Jonathan Morduch and Shamika Ravi, 2015, “Failure vs displacement: why an innovative anti-poverty program showed no net impact in South India,” Journal of Development Economics, 116, 1–16.

Basu, Kaushik, 2010, “The economics of foodgrain management in India,” Ministry of Finance, Delhi. http://finmin.nic.in/workingpaper/Foodgrain.pdf

Bloom, Howard S., Carolyn J. Hill, and James A. Riccio, 2005, “Modeling cross-site experimental differences to find out why program effectiveness varies,” in Howard S. Bloom, ed., Learning more from social experiments: evolving analytical approaches, New York, NY. Russell Sage.

Bobonis, Gustavo, Edward Miguel, and Charu Puri-Sharma, 2006, “Anemia and school participation,” Journal of Human Resources, 41(4), 692–721.

Bold, Tessa, Mwangi Kimenyi, Germano Mwabu, Alice Ng’ang’a and Justin Sandefur, 2013, “Scaling up what works: experimental evidence on external validity in Kenyan education,” Washington, DC. Center for Global Development, Working Paper 321.

Bothwell, Laura E., and Scott H. Podolsky, 2016, “The emergence of the randomized, controlled trial,” New England Journal of Medicine, 375(6), 501–4. doi:10.1056/NEJMp1604635

Campbell, D. T., and J. C. Stanley, 1963, Experimental and quasi-experimental designs for research. Chicago. Rand McNally.

Cartwright, Nancy, 1994, Nature’s capacities and their measurement. Oxford. Clarendon Press.

Cartwright, Nancy, and Jeremy Hardie, 2012, Evidence based policy: a practical guide to doing it better, Oxford. Oxford University Press.

Chalmers, Iain, 2001, “Comparing like with like: some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments,” International Journal of Epidemiology, 30, 1156–64.

Chalmers, Iain, 2003, “Fisher and Bradford Hill: theory and pragmatism?” International Journal of Epidemiology, 32, 922–24.

Chassang, Sylvain, Gerard Padró i Miguel, and Erik Snowberg, 2012, “Selective trials: a principal–agent approach to randomized controlled experiments,” American Economic Review, 102(4), 1279–1309.

Chassang, Sylvain, Erik Snowberg, Ben Seymour, and Cayley Bowles, 2015, “Accounting for behavior in treatment effects: new applications for blind trials,” PLoS One, 10(6), e0127227. doi:10.1371/journal.pone.0127227.

Chaudhury, Nazmul, Jeffrey Hammer, Michael Kremer, Karthik Muralidharan and F. Halsey Rogers, 2005, “Missing in action: teacher and health worker absence in developing countries,” Journal of Economic Perspectives, 19(4), 91–116.

Chyn, Eric, 2016, “Moved to opportunity: the long-run effect of public housing demolition on labor market outcomes of children,” University of Michigan. http://www-personal.umich.edu/~ericchyn/Chyn_Moved_to_Opportunity.pdf

Conlisk, John, 1973, “Choice of response functional form in designing subsidy experiments,” Econometrica, 41(4), 643–56.

Crépon, Bruno, Esther Duflo, Marc Gurgand, Roland Rathelot, and Philippe Zamora, 2014, “Do labor market policies have displacement effects? Evidence from a clustered randomized experiment,” Quarterly Journal of Economics, 128(2), 531–80.

Croke, Kevin, Joan Hamory Hicks, Eric Hsu, Michael Kremer, and Edward Miguel, 2016, “Does mass deworming affect children’s nutrition? Meta analysis, cost effectiveness, and statistical power,” Cambridge, MA. NBER Working Paper No. 22382 (July).

Cronbach, Lee J., S. R. Ambron, S. M. Dornbusch, R. D. Hess, R. C. Hornick, D. C. Phillips, D. F. Walker, and S. S. Weiner, 1980, Towards reform of program evaluation, San Francisco, Jossey-Bass.

Das, Jishnu and Jeffrey Hammer, 2005, “Which doctor? Combining vignettes and item response to measure clinical competence,” Journal of Development Economics, 78, 348–83.

Davey, Calum, Alexander M. Aiken, Richard J. Hayes, and James R. Hargreaves, 2015, “Re-analysis of health and educational impacts of a school-based deworming programme in western Kenya: a statistical replication of a cluster quasi-randomized stepped wedge trial,” International Journal of Epidemiology, 0(0), 1–12.

Deaton, Angus, and John Muellbauer, 1980, Economics and consumer behavior, New York. Cambridge University Press.

Dhaliwal, Iqbal, Esther Duflo, Rachel Glennerster, and Caitlin Tulloch, 2012, “Comparative cost-effectiveness analysis to inform policy in developing countries: a general framework with applications for education,” J-PAL, MIT, December 3rd. http://www.povertyactionlab.org/publication/cost-effectiveness

Drèze, Jean, 2016, Personal email communication.

Duflo, Esther, Rema Hanna, and Stephen P. Ryan, 2012, “Incentives work: getting teachers to come to school,” American Economic Review, 102(4), 1241–78.

Duflo, Esther, and Michael Kremer, 2008, “Use of randomization in the evaluation of development effectiveness,” in William Easterly, ed., Reinventing foreign aid. Washington, DC. Brookings, 93–120.

Dynarski, Susan, 2015, “Helping the poor in education: the power of a simple nudge,” New York Times, Jan 17, 2015.

Fine, Paul E. M., and Jacqueline A. Clarkson, 1986, “Individual versus public priorities in the determination of optimal vaccination policies,” American Journal of Epidemiology, 124(6), 1012–20.

Fisher, Ronald A., 1926, “The arrangement of field experiments,” Journal of the Ministry of Agriculture of Great Britain, 33, 503–13.

Filmer, Deon, Jeffrey Hammer, and Lant Pritchett, 2000, “Weak links in the chain: a diagnosis of health policy in poor countries,” World Bank Research Observer, 15(2), 199–204.

Freedman, David A., 2006, “Statistical models for causation: what inferential leverage do they provide?” Evaluation Review, 30: 691–713.

Freedman, David A., 2008, “On regression adjustments to experimental data,” Advances in Applied Mathematics, 40, 180–93.

Garfinkel, Irwin, and Charles F. Manski, 1992, “Introduction,” in Irwin Garfinkel and Charles F. Manski, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press. 1–22.

Gertler, Paul J., Sebastian Martinez, Patrick Premand, Laura B. Rawlings, and Christel M. J. Vermeersch, Impact evaluation in practice, Washington, DC. The World Bank.

Glewwe, Paul, Michael Kremer, Sylvie Moulin, and Eric Zitzewitz, 2004, “Retrospective vs. prospective analyses of school inputs: the case of flip-charts in Kenya,” Journal of Development Economics, 74, 251–68.

Greenberg, David and Mark Shroder, 2004, The digest of social experiments (3rd ed.), Washington, DC. Urban Institute Press.

Greenberg, David, Mark Shroder, and Matthew Onstott, 1999, “The social experiment market,” Journal of Economic Perspectives, 13(3), 157–72.

Gueron, Judith M., and Howard Rolston, 2013, Fighting for reliable evidence, New York, Russell Sage.

Guyatt, Gordon, David L. Sackett and Deborah J. Cook for the Evidence-Based Medicine Working Group, 1994, “Users’ guides to the medical literature II: how to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients?” Journal of the American Medical Association, 271(1), 59–63.

Hargreaves, James R., Alexander M. Aiken, Calum Davey, and Richard J. Hayes, 2015, “Authors’ response to: deworming externalities and school impacts in Kenya,” International Journal of Epidemiology, 0(0), 1–3.

Harrison, Glenn W., 2013, “Field experiments and methodological intolerance,” Journal of Economic Methodology, 20(2), 103–17.

Harrison, Glenn W., 2014, “Impact evaluation and welfare evaluation,” European Journal of Development Research, 26, 39–45.

Hausman, Jerry A., and David A. Wise, 1985, “Technical problems in social experimentation: cost versus ease of analysis,” in Jerry A. Hausman and David A. Wise, eds., Social Experimentation, Chicago, IL. Chicago University Press. 187–220.

Heckman, James J., 1992, “Randomization and social policy evaluation,” in Charles F. Manski and Irwin Garfinkel, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press. 547–70.

Heckman, James J., 1997, “Instrumental variables: a study of implicit behavioral assumptions used in making program evaluations,” Journal of Human Resources, 32(3), 441–62.

Heckman, James J., Neil Hohman, and Jeffrey Smith, with the assistance of Michael Khoo, 2000, “Substitution and dropout bias in social experiments: a study of an influential social experiment,” Quarterly Journal of Economics, 115(2), 651–94.

Heckman, James J., Robert J. Lalonde, and Jeffrey A. Smith, 1999, “The economics and econometrics of active labor markets,” Chapter 31 in Ashenfelter, Orley and David Card, eds., Handbook of labor economics, Amsterdam. North-Holland, 3(A), 1866–2097.

Heckman, James J., Rodrigo Pinto, and Peter Savelyev, 2013, “Understanding the mechanisms through which an influential early childhood program boosted adult outcomes,” American Economic Review, 103(6), 2052–86.

Heckman, James J., Jeffrey Smith, and Nancy Clements, 1997, “Making the most out of programme evaluations and social experiments: accounting for heterogeneity in programme impacts,” Review of Economic Studies, 64(4), 487–535.

Heckman, James J., and Edward Vytlacil, 2005, “Structural equations, treatment effects, and econometric policy evaluation,” Econometrica, 73(3), 669–738.

Heckman, James J. and Edward J. Vytlacil, 2007, “Econometric evaluation of social programs, Part 1: causal models, structural models, and econometric policy evaluation,” Chapter 70 in James J. Heckman and Edward E. Leamer, eds., Handbook of Econometrics, 6B, 4779–874.

Hicks, Joan Hamory, Michael Kremer, and Edward Miguel, 2015, “Commentary: deworming externalities and schooling impacts in Kenya: a comment on Aiken et al (2015) and Davey et al. (2015),” International Journal of Epidemiology, 0(0), 1–4.

Horton, Richard, 2000, “Common sense and figures: the rhetoric of validity in medicine: Bradford Hill memorial lecture 1999,” Statistics in Medicine, 19, 3149–64.

Hotz, V. Joseph, Guido W. Imbens and Julie H. Mortimer, 2005, “Predicting the efficacy of future training programs using past experience at other locations,” Journal of Econometrics, 125, 241–70.

Hsieh, Chang-tai and Miguel Urquiola, 2006, “The effects of generalized school choice on achievement and stratification: evidence from Chile’s voucher program,” Journal of Public Economics, 90, 1477–1503.

Humphreys, Macartan, 2015, “What has been learned from the deworming replications: a non-partisan view,” Columbia University, Aug. http://www.columbia.edu/~mh2245/w/worms.html

Imbens, Guido W., 2004, “Nonparametric estimation of average treatment effects under exogeneity: a review,” Review of Economics and Statistics, 86(1), 4–29.

Imbens, Guido W., 2010, “Better LATE than nothing: some comments on Deaton (2009) and Heckman and Urzua,” Journal of Economic Literature, 48(2), 399–423.

Imbens, Guido W. and Joshua D. Angrist, 1994, “Identification and estimation of local average treatment effects,” Econometrica, 62(2), 467–75.

Imbens, Guido W., and Jeffrey M. Wooldridge, 2009, “Recent developments in the econometrics of program evaluation,” Journal of Economic Literature, 47(1), 5–86.

International Committee of Medical Journal Editors, 2015, Recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals, http://www.icmje.org/icmje-recommendations.pdf (accessed August 20, 2016).

Kahneman, Daniel and Gary Klein, 2009, “Conditions for intuitive expertise: a failure to disagree,” American Psychologist, 64(6), 515–26.

Karlan, Dean and Jacob Appel, 2011, More than good intentions: how a new economics is helping to solve global poverty, Dutton.

Karlan, Dean, Nathanael Goldberg and James Copestake, 2009, “Randomized controlled trials are the best way to measure impact of microfinance programs and improve microfinance product designs,” Enterprise Development and Microfinance, 20(3), 167–76.

Kasy, Maximilian, 2016, “Why experimenters might not want to randomize, and what they could do instead,” Political Analysis, 1–15, doi:10.1093/pan/mpw012

Kendall, Maurice G., 1959, “Hiawatha designs an experiment,” American Statistician, 13(5), 23–4.

Kramer, Peter, 2016, Ordinarily well: the case for antidepressants, Farrar, Straus, and Giroux.

Kremer, Michael, and Alaka Holla, 2009, “Improving education in the developing world: what have we learned from randomized evaluations?” Annual Review of Economics, 1, 513–42.

Lehmann, Erich L., and Joseph P. Romano, 2005, Testing statistical hypotheses (third edition), New York. Springer.

Levy, Santiago, 2006, Progress against poverty: sustaining Mexico’s Progresa-Oportunidades program, Washington, DC. Brookings.

Mackie, John L., 1974, The cement of the universe: a study of causation, Oxford. Oxford University Press.

Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett Keeler and Arleen Leibowitz, 1988a, “Health insurance and the demand for medical care: evidence from a randomized experiment,” American Economic Review, 77(3), 251–77.

Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett Keeler, Bernadette Benjamin, Arleen Leibowitz, M. Susan Marquis, and Jack Zwanziger, 1988b, Health insurance and the demand for medical care: evidence from a randomized experiment, Santa Monica, CA. RAND.

Manski, Charles F., 1990, “Nonparametric bounds on treatment effects,” American Economic Review, 80(2), 319–23.

Manski, Charles F., 1995, Identification problems in the social sciences, Cambridge, MA. Harvard University Press.

Manski, Charles F., 2003, Partial identification of probability distributions, New York. Springer.

Manski, Charles F., 2013, Public policy in an uncertain world: analysis and decisions, Cambridge, MA. Harvard University Press.

Metcalfe, Charles E., 1973, “Making inferences from controlled income maintenance experiments,” American Economic Review, 63(3), 478–83.

Miguel, Edward, and Michael Kremer, 2004, “Worms: identifying impacts on education and health in the presence of treatment externalities,” Econometrica, 72(1), 159–217.

Miguel, Edward, Michael Kremer, and Joan Hamory Hicks, 2015, “Comment on Macartan Humphreys’ and other recent discussions of the Miguel and Kremer (2004) study,” Berkeley, Dec. http://emiguel.econ.berkeley.edu/assets/miguel_research/63/Worms-Comment_2015-12-21.pdf

Moffitt, Robert, 1979, “The labor supply response in the Gary experiment,” Journal of Human Resources, 14(4), 477–87.

Moffitt, Robert, 1992, “Evaluation methods for program entry effects,” Chapter 6 in Charles Manski and Irwin Garfinkel, Evaluating welfare and training programs, Cambridge, MA. Harvard University Press, 231–52.

Moffitt, Robert, 2004, “The role of randomized field trials in social science research: a perspective from evaluations of reforms of social welfare programs,” American Behavioral Scientist, 47(5), 506–40.

Morgan, Kari Lock, and Donald B. Rubin, 2012, “Rerandomization to improve covariate balance in experiments,” Annals of Statistics, 40(2), 1263–82.

Muller, Seán M., 2015, “Causal interaction and external validity: obstacles to the policy relevance of randomized evaluations,” World Bank Economic Review, 29, S217–S225.

Orcutt, Guy H., and Alice G. Orcutt, 1968, “Incentive and disincentive experimentation for income maintenance policy purposes,” American Economic Review, 58(4), 754–72.

Pearl, Judea, 2009, Causality: models, reasoning, and inference, 2nd edition, Cambridge. Cambridge University Press.

Pettigrew, Mark, and Iain Chalmers, 2011, “Use of research evidence in practice,” Lancet, 378(9804), 1696.

Rodrik, Dani, 2006, personal email communication.

Rosenzweig, Mark and Christopher Udry, 2016, “External validity in a stochastic world,” Cambridge, MA. NBER Working Paper 22449 (July).

Rothwell, Peter M., 2005, “External validity of randomized controlled trials: ‘to whom do the results of the trial apply’,” Lancet, 365, 82–93.

Russell, Bertrand, 2008 [1912], The problems of philosophy, Rockville, MD. Arc Manor.

Sackett, David L., William M. C. Rosenberg, J. A. Muir Gray, R. Brian Haynes and W. Scott Richardson, 1996, “Evidence based medicine: what it is and what it isn’t,” British Medical Journal, 312 (January 13), 71–2.

Scriven, Michael, 1974, “Evaluation perspectives and procedures,” in W. James Popham, ed., Evaluation in education: current applications, Berkeley, CA. McCutchan Publishing Corporation.

Sen, Amartya K., 2011, The idea of justice, Cambridge, MA. Harvard University Press.

Senn, Stephen, 1994, “Testing for baseline balance in clinical trials,” Statistics in Medicine, 13, 1715–26.

Senn, Stephen, 2013, “Seven myths of randomization in clinical trials,” Statistics in Medicine, 32, 1439–50.

Shadish, William R., Thomas D. Cook, and Donald T. Campbell, 2002, Experimental and quasi-experimental designs for generalized causal inference, Boston, MA. Houghton Mifflin.

Simpson, Adrian, 2016, “Comparing and combining standardized effect sizes: the misdirection of public policy,” Working Paper, University of Durham (July).

Singer, Burton H., and Steve Pincus, 1998, “Irregular arrays and randomization,” Proceedings of the National Academy of Sciences of the USA, 95, 1363–8.

Stiles, Charles Wardell, 1939, “Early history, in part esoteric, of the hookworm (uncinariasis) campaign in our southern United States,” Journal of Parasitology, 25(4), 283–308.

Stuart, Elizabeth A., Stephen R. Cole, Catharine P. Bradshaw, and Philip J. Leaf, 2011, “The use of propensity scores to assess the generalizability of results from randomized trials,” Journal of the Royal Statistical Society A, 174(2), 369–86.

Svorencik, Andrej, 2015, The experimental turn in economics: a history of experimental economics, Utrecht School of Economics, Dissertation Series #29, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2560026

Taylor-Robinson, David C., Nicola Maayan, Karla Soares-Weiser, Sarah Donegan, and Paul Garner, 2015, “Deworming drugs for soil-transmitted intestinal worms in children: effects on nutritional indicators, haemoglobin, and school performance (review),” The Cochrane Collaboration. Wiley. http://onlinelibrary.wiley.com/doi/10.1002/14651858.CD000371.pub6/abstract

Todd, Petra E., and Kenneth J. Wolpin, 2006, “Assessing the impact of a school subsidy program in Mexico: using a social experiment to validate a dynamic behavioral model of child schooling and fertility,” American Economic Review, 96(5), 1384–1417.

Todd, Petra E., and Kenneth J. Wolpin, 2008, “Ex ante evaluation of social programs,” Annales d’Economie et de la Statistique, 91/92, 263–91.

U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, 2003, Identifying and implementing educational practices supported by rigorous evidence: a user friendly guide, Washington, DC. Institute of Education Sciences.

Vandenbroucke, Jan P., 2004, “When are observational studies as credible as randomized controlled trials?” The Lancet, 363, 1728–31.

Vivalt, Eva, 2015, “How much can we generalize from impact evaluations?” NYU, unpublished. http://evavivalt.com/wp-content/uploads/2014/10/Vivalt-JMP-10.27.14.pdf

White, Halbert, 1980, “A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity,” Econometrica, 48(4), 817–38.

Wise, David A., 1985, “A behavioral model versus experimentation: the effects of housing subsidies on rent,” in P. Brucker and R. Pauly, eds., Methods of Operations Research, 50, Verlag Anton Hain, 441–89.

Worrall, John, 2002, “What evidence in evidence-based medicine?” Philosophy of Science, 69, S316–S330.

Worrall, John, 2007, “Evidence in medicine and evidence-based medicine,” Philosophy Compass, 2/6, 981–1022.

Young, Alwyn, 2016, “Channeling Fisher: randomization tests and the statistical insignificance of seemingly significant experimental results,” London School of Economics, Working Paper, Feb.

Ziliak, Stephen T., 2014, “Balanced versus randomized field experiments in economics: why W. S. Gosset aka ‘Student’ matters,” Review of Behavioral Economics, 1, 167–208.
