Geobiology. 2019;1–11. wileyonlinelibrary.com/journal/gbi. © 2019 John Wiley & Sons Ltd
Received: 4 June 2018 | Revised: 30 October 2018 | Accepted: 30 December 2018 | DOI: 10.1111/gbi.12333
PERSPECTIVE
Statistical inference and reproducibility in geobiology
1 | INTRODUCTION
The late, great Karl Turekian would often joke about the number of new data points required for a geochemical paper. The answer was one: When combined with a previously published data point, a "best-fit" line could be drawn between the two points, and the slope of the line calculated, thereby giving rate. The joke had its roots in a seminar Dr. Turekian gave far earlier in his career, where the entire talk focused on the first data point of a novel measurement (a hard-won data point, but only one nonetheless). When an apparently exasperated audience member pointed this out, Turekian reportedly replied "well it's one more than anyone else has!" (Thiemens, Davis, Grossman, & Colman, 2013). This anecdote raises two important points about our field: First, establishing new analytical techniques is laborious, expensive, and requires considerable vision and skill, and second, at the nascent stages of any new record, the number of observations will be small.
Geobiology is now at the point where some mature data records (for instance δ13C or δ34S) have had thousands or tens of thousands of data points generated. There also exist millions of individual gene sequences from environmental genomics surveys. In contrast, emerging proxies (for instance, selenium isotopes, I/Ca ratios, carbonate "clumped" isotopes), or biomarker studies and technical geomicrobiological experiments, still face the issue Turekian described and have only a handful of measurements or are transitioning toward larger datasets. The question remains how best to interpret data-rich records on the one hand, and scattered, hard-won data points on the other hand. Here, using examples from within geobiology as well as the development of other fields such as ecology, psychology, and medicine, we argue that increased clarity regarding significance testing, multiple comparisons, and effect sizes will help researchers avoid false-positive inferences, better estimate the magnitude of changes in proxy records or experiments, and ultimately yield a richer understanding of geobiological phenomena.
2 | STATISTICS AND REPRODUCIBILITY IN OTHER FIELDS AND IN GEOBIOLOGY
We start by examining statistical practice and reproducibility outside of our field, before relating these broader themes back to geobiology. Science as a whole is currently described as facing a "crisis of reproducibility," with diverse studies in multiple fields failing to replicate published findings (see Baker, 2016, for a Nature survey of reproducibility across fields). The issue here is not technical reproducibility at the sample level (e.g., "If I put this same sample in a mass spectrometer, will I get the same result twice?") but rather at the level of the effects observed in the manuscript (e.g., a given treatment causes a specific outcome, or a predictor variable is correlated with a response variable). For example, independent efforts to reproduce "landmark papers" in cancer biomedical research by pharmaceutical companies Amgen (Begley & Ellis, 2012) and Bayer (Prinz, Schlange, & Asadullah, 2011) have only been able to replicate 11% and 20%–25% of published results, respectively. Similarly, a critical study of gene-by-environment (GxE) interactions in psychiatry found that only 27% of published replication attempts were positive (Duncan & Keller, 2011). Further, this small proportion of apparently positive replications was likely inflated; there were clear signatures of publication bias (the tendency to publish significant results more readily than non-significant results) among replication attempts in the GxE literature. In psychology, a large-scale replication effort found that only 39% of studies could be replicated (Open Science Collaboration, 2015), and similar problems plague much of neuroscience research (Button et al., 2013). The studies mentioned here are likely just the tip of the iceberg (Begley & Ioannidis, 2015; Ioannidis, 2005).
The causes of these low rates of replication are varied. There are likely different underlying causes for poor reproducibility across fields, and the severity of the problem undoubtedly varies as well. Our personal experience trying to implement published protocols without the highly detailed, laboratory-specific knowledge that is often omitted (generally for space reasons) from Materials and Methods sections suggests to us that at least some percentage of failed replications are caused by inadvertent methodological differences. Indeed, the Open Science Collaboration replication of psychology studies was attacked for poor adherence to original protocols (Gilbert, King, Pettigrew, & Wilson, 2016). In order to achieve precise replication of protocols, in extreme cases researchers have had to travel to other laboratories and work side-by-side to identify seemingly trivial methodological differences—vigorous stirring versus prolonged gentle shaking to isolate cells, for instance—that had an outsized effect on reproducibility (Hines, Su, Kuhn, Polyak, & Bissell, 2014; Lithgow, Driscoll, & Phillips, 2017). Methodological differences, though, can likely only explain a portion of the failed replication attempts. True scientific fraud is also something that makes headlines and erodes public trust, but again likely only accounts for a small proportion of replication failures (Bouter, Tijdink, Axelsen, Martinson, & ter Riet, 2016; Fanelli, 2009).
So, what causes such poor reproducibility? While recognizing again that the causes can vary across fields, clearly some of the most important factors are under-powered studies, a reliance on Null Hypothesis Significance Testing (NHST, e.g., p < 0.05) to determine "truth" or whether a paper should be published, and a lack of correction for implicit or explicit multiple comparisons (more broadly, "researcher degrees of freedom"). These issues were known within some fields, but were brought to more widespread attention through the publication of a provocative 2005 essay, "Why Most Published Research Findings Are False" (Ioannidis, 2005). Ioannidis identified the following causal factors as leading to the overall very low percentages of positive replication findings: (a) small sample sizes, (b) small effect sizes, (c) high numbers of tested relationships, (d) flexibility in designs, definitions, and what represents an unequivocal outcome, (e) conflicts of interest or prejudice within a field, thus increasing bias, and (f) "hot" fields of science, where more teams are simultaneously testing many different relationships. This tendency is exacerbated by the incentive structure for journals and authors to publish positive results and publish often.
Our field of geobiology is likely less beset by some of these issues related to, for instance, publication bias. The reason why is that data analysis is often accomplished through visual rather than statistical comparison; explicit p-values that would be used as the arbiter of accept/reject decisions in other fields are not generated. Figure 1 documents the use of statistical testing in the journal Geobiology compared to a sister publication by Wiley, Marine Ecology. This comparison is not meant to single out Geobiology as a journal; we are certain the results would be similar in our other disciplinary journals. In this comparison, we considered the 100 most recent papers at the time of writing that reported new data (so review papers, commentaries, or papers that were descriptive in nature, such as describing stromatolite morphologies—basically any paper where no new numerical data were generated—were excluded). Papers were considered to have a "statistical analysis" simply if there was some effort to understand the possible influence of error and sampling on the precision of the inference, recognizing this can take many forms (e.g., formal NHST, bootstrapping, and Bayesian posterior probabilities). In the marine ecology journal, essentially every paper (97%) reporting new data used a statistical analysis (similar results were reported by Fidler, Burgman, Cumming, Buttrose, & Thomason, 2006, for other ecology journals). In Geobiology, the percentage with any sort of statistical analysis is far lower, at 38% (chi-squared test; p = 2.0 × 10⁻¹⁸). We recognize that some of this difference may be due to different data types and approaches, but our personal observation is that many studies in Geobiology are comparing groups of data visually rather than statistically. Further, the statistical analyses that are published in Geobiology are concentrated in molecular phylogenetic studies, where bootstraps or Bayesian posterior probabilities are commonly used to assess precision.
Statistical testing is far less common in non-phylogenetic studies (e.g., many geomicrobiology, geochemistry, biomarker, and biomineralization studies). Comparing the most recent papers in Geobiology against the first 100 data-driven papers in the journal (first published in 2003) reveals that the percentage with a statistical test has increased slightly (from 26% to 38%), but the difference between the two time intervals examined is not statistically significant (using p < 0.05 to declare statistical significance; chi-sq = 2.8, df = 1, p = 0.095).
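The journal comparison above is a standard chi-squared test of independence on a 2 × 2 contingency table. A minimal sketch using the counts given in the text (26/100 vs. 38/100); scipy's default Yates continuity correction for 2 × 2 tables is assumed, which matches the reported chi-sq = 2.8:

```python
# Chi-squared test of independence: did the proportion of data-driven
# Geobiology papers with any statistical analysis change over time?
from scipy.stats import chi2_contingency

#         with test, without test
table = [[26, 74],   # Geobiology (early):   26 of 100 papers
         [38, 62]]   # Geobiology (present): 38 of 100 papers

chi2, p, df, expected = chi2_contingency(table)  # Yates correction by default
print(f"chi-sq = {chi2:.1f}, df = {df}, p = {p:.3f}")
# prints "chi-sq = 2.8, df = 1, p = 0.095", matching the values in the text
```

The same call with the Marine Ecology counts (97/100 vs. 38/100) yields the vanishingly small p-value quoted earlier.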
We propose that increased emphasis on statistical testing as a critical step in data analysis is needed in the field of geobiology. Put simply, why build a scientific worldview based on studies where it has not been demonstrated that the observed differences are more than what would be expected from sampling variability? On the other hand, blind reliance on statistical testing will not be helpful either. As cogently argued by Fidler, Cumming, Burgman, and Thomason (2004), ecology as a field is mired in statistical fallacies: specifically, the erroneous beliefs that p-values are a direct index of effect size, and that the p-value represents the probability that the null hypothesis is true (or false). Consequently, simply running more statistical analyses is not a sufficient avenue to accurate, reproducible, and correctly interpreted findings. Here, we hope to use the opportunity provided by broader discussions about statistics and reproducibility in science to review three important concepts—(a) significance testing, (b) multiple comparisons, and (c) effect size—and translate them to our field of geobiology. This manuscript is intended to be educational and to start a discussion regarding proper statistical analyses in geobiology. Our focus is on what we believe are the more familiar terms of formal NHST (i.e., a frequentist approach), but we discuss the goal of a more flexible approach to statistical thinking at the conclusion of the paper. In this spirit, we recognize that these topics will be familiar to many readers, and indeed, aside from a geobiological spin, there is little intellectual territory here that has not been extensively covered in medical, psychological, and ecological journals. However, for readers less familiar with these topics, we hope this essay may provide useful guidance for avoiding some of the problems of reproducibility that have plagued other fields. Geobiology ultimately has the opportunity to be among the select group of fields that do not have major reproducibility problems.

FIGURE 1 Use of statistical testing in the 100 most recent data-driven papers in Geobiology (present) versus Marine Ecology. The first 100 papers in Geobiology (early) were also compared. "Statistical tests" were broadly defined (for instance, bootstrapping, Bayesian approaches, etc., and not only formal Null Hypothesis Significance Testing). Papers without a "Materials and Methods" section or that did not present new numerical data (e.g., review/synthesis papers, modeling papers) were not included in the comparison. [Bar chart: papers with versus without a statistical test in Geobiology (early), Geobiology (present), and Marine Ecology.]
Take home message: There are acknowledged issues with Null Hypothesis Significance Testing (including publication bias and replicability in science at large), but studies should not rely on visual analysis alone. A formal examination of the size of the effect relative to sampling variation is a key tool in evaluating new scientific claims.
3 | NULL HYPOTHESIS SIGNIFICANCE TESTING (NHST)
Despite its widespread use in science, the development of NHST has its roots in industrial applications. For instance, Ronald Fisher developed the analysis of variance working with experimental crop data, and William Gosset (nominated here as the patron saint of geobiological data analysis) developed the Student's t test working to increase yields at the Guinness Brewery (Student, 1908). When faced with two (or more) groups of samples, each with scatter, the pertinent question is whether the sets of samples may actually represent the same underlying data distribution, with observed variation between groups arising from sampling of this distribution (the null hypothesis, H0, is that there is no difference between groups; in other words, there is only one underlying distribution). In NHST, this question is addressed by comparing the means or medians of the groups (and data variability) and calculating the probability that a result at least that extreme would be found if the groups were drawn from the same distribution or population. This probability is expressed as the p-value. Incorrectly rejecting the null hypothesis (a false positive) is referred to as Type I error, and incorrectly failing to reject the null hypothesis (a false negative) is Type II error. A variety of parametric (those that depend on a specified probability distribution from which the data are drawn) and nonparametric (those that do not) statistical analyses exist to make these comparisons between two or more groups. A full review of particular analyses is beyond the scope of this article and is best found in statistical textbooks and web resources.
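The two-group comparison described above can be sketched in a few lines. The isotope-style values below are invented purely for illustration (they come from no published dataset); scipy's `ttest_ind` and `mannwhitneyu` are, respectively, a parametric and a nonparametric test of the same basic question:

```python
# Compare two small groups of (hypothetical) measurements.
# H0: both groups are drawn from the same underlying distribution.
from scipy.stats import ttest_ind, mannwhitneyu

group_a = [-27.1, -26.8, -27.5, -26.9, -27.3, -27.0]  # invented values
group_b = [-25.9, -26.2, -25.7, -26.4, -26.0, -26.1]  # invented values

t_stat, p_t = ttest_ind(group_a, group_b)     # parametric: assumes normality
u_stat, p_u = mannwhitneyu(group_a, group_b)  # nonparametric: rank-based

print(f"t test:       t = {t_stat:.2f}, p = {p_t:.4g}")
print(f"Mann-Whitney: U = {u_stat:.1f}, p = {p_u:.4g}")
```

With these well-separated groups both tests return small p-values; with heavily overlapping groups both would not, and the nonparametric version is the safer default when the underlying distribution is unknown.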
Choosing the correct statistical analysis for a given set of data poses some issues, but the more important issue—the focus of this piece—is understanding the fundamental logic of statistical tests: What they do and do not tell you. It is errors in logic, rather than someone using a t test when they should have used a Wilcoxon, which are most problematic. One common fallacy regarding NHST is that the level of significance rejecting the null hypothesis (the p-value) is the probability that the null hypothesis is correct. There is a wide literature discussing this fallacy, with one of the best being Cohen's 1994 essay "The Earth is round (p < 0.05)." He notes that what we want to know, as researchers, is "given these data, what is the probability that H0 is true?" But what NHST tells us is "given that H0 is true, what is the probability of these (or more extreme) data?" In other words, rejecting a specific null hypothesis does not actually confirm any underlying truth or theory. Addressing this requires knowing the likelihood that a real effect exists in the first place, which may be difficult to calculate. Even if these odds can be calculated, the combined uncertainty results in more substantial murkiness about the results than the p-value alone indicates (Nuzzo, 2014, provides more information on what a p-value really tells you—in a form more palatable to the average reader than Cohen—as well as a historical discussion of how the p = 0.05 threshold came about).
Another common error in interpreting formal statistical tests regards "mechanical dichotomous decisions around a sacred 0.05 criterion" (Cohen, 1994). Obvious to most readers is that a 5.5% probability of generating results at least that extreme (e.g., p = 0.055) is basically no different, in an interpretive sense, than a 4.5% probability (p = 0.045). In other words, p < 0.05, p < 0.01, and p < 0.005 (Benjamin et al., 2018, recently advocated for the more stringent p < 0.005 statistical criterion) are all arbitrary criteria, although still useful conventions. A memorable quote by Rosnow and Rosenthal (1989) states:
We want to underscore that, surely, God loves the 0.06 nearly as much as the 0.05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?
Yet while researchers inherently "know" p = 0.045 and 0.055 are really no different, one result is deemed "true" and publishable (positive publication bias) and one is deemed inconsequential and ignored. Or in Rosnow and Rosenthal's more dramatic prose:
It may not be an exaggeration to say that for many PhD students, for whom the 0.05 alpha has acquired an almost ontological mystique, it can mean joy, a doctoral degree, and a tenure-track position at a major university if their dissertation p is <0.05. However, if the p is >0.05, it can mean ruin, despair, and their advisor's suddenly thinking of a new control condition that should be run.
Take home message: Null Hypothesis Significance Testing investigates the probability (expressed as p-values) of finding results as extreme as the observed data, given the null hypothesis. Of note, p-values cannot address the likelihood that a hypothesis is correct. p-value thresholds represent arbitrary cutoffs rather than true/false criteria for accept/reject decisions.
4 | MULTIPLE TESTING AND RESEARCHER DEGREES OF FREEDOM
A colleague once described a conference—in the pre-PowerPoint days—in which geologists were instructed to bring an overhead transparency summarizing the record of their subfield through Earth history (tectonic events, fossil trends, geochemical proxy records, etc.) along the same x-axis temporal scale. The participants then took turns swapping different transparencies on and off the projector, looking for correlations. This strategy was innovative from a data exploration perspective—and indeed sounds intellectually stimulating—but ultimately is a nightmare regarding multiple comparisons. The multiple comparison problem essentially results from "more shots on goal." It can be illustrated with the following example—the probability of flipping any given coin as "heads" 10 times out of 10 is very low; however, if this is done over and over, the probability that one coin will be "heads" every time obviously increases. It would be incorrect, though, to conclude that that coin is different from the rest. Specifically for NHST, the probability of a false positive in a battery of tests will be 1 − (1 − α)^k, where alpha (α) is the significance level required (e.g., p < 0.05) and k is the number of tests performed (discussed in detail in Streiner, 2015). For 10 separate tests at the standard level of significance, the probability of a false positive is 1 − (1 − 0.05)^10, or about 40%. This issue is perhaps even better illustrated by Figure 2 below (analyzing the effect of jelly bean colors).
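The family-wise error rate formula above is easy to verify numerically, using only the standard library:

```python
# Probability of at least one false positive among k independent tests,
# each run at significance level alpha: 1 - (1 - alpha)^k
def familywise_error_rate(alpha, k):
    return 1 - (1 - alpha) ** k

for k in (1, 10, 20):
    fwer = familywise_error_rate(0.05, k)
    print(f"k = {k:2d} tests -> P(at least one false positive) = {fwer:.2f}")
# k = 10 gives ~0.40 (the 10-test example in the text);
# k = 20 gives ~0.64 (the twenty jelly bean colors of Figure 2)
```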
Uncorrected multiple comparisons are one of the primary causes of replicability issues across many scientific fields. In some cases, these comparisons may have been done explicitly, such as with the green jelly beans. Many reports in the 1990s and 2000s about what genes "cause" specific effects in humans were the spurious result of trawling limited genetic data across comprehensive epidemiological datasets and looking for "significant" correlations. An analog in our field may be instances where biological data of interest (e.g., microbial abundance/ecological traits/gene expression) are collected from a modern environment, such as a hot spring, alongside environmental data. Which environmental parameters are correlated with the biological metric of interest? (leaving aside the broader question of correlation and causality). It would be inappropriate to simply conduct a pairwise comparison of all the environmental parameters with the biological metric without correction for multiple tests. By chance alone, some parameters might be significantly correlated: In null datasets, p < 0.05 occurs, by definition, 5% of the time.
Perhaps more insidiously, multiple comparisons can be done implicitly or unconsciously. For instance, a researcher may complete a compilation of fossil data from shelly invertebrates and then half-asleep in the shower mentally wander through all the different geological data records (sea level, temperature, redox proxies, strontium isotopes, etc.) before snapping awake after noting that the identified fossil trend looks very similar to a previously published record of calcium isotopes. A single statistical comparison is made, and voilà: p < 0.05. Perhaps something in the calcium cycle is affecting the livelihood of these shell-forming organisms? Maybe. In this case, the researcher did not explicitly test each comparison with a p-value, as in the jelly bean or hot springs example, but the researcher still mentally conducted the equivalent of swapping out overhead transparencies: multiple comparisons until a match was found. A related problem is exploratory data analysis as data are being generated (the role of exploratory data analysis is discussed further below). A range of analyses might be conducted, with one predictor variable out of many yielding a significant correlation. When the full dataset is generated, "only one" explicit test is conducted on the final dataset and included in the publication, with the researcher honestly forgetting just how many analyses had been conducted. Or, a spuriously significant p-value is found early on, say for all available marine samples, which then disappears as more data are added. A second analysis, looking at individual ocean basins, and a third, looking by depth class, are conducted, perhaps excluding some extreme outliers, until another significant p-value reappears [so-called p-hacking, or "researcher degrees of freedom"; Simmons, Nelson, and Simonsohn (2011)]. It is then this sub-group analysis that is emphasized in the manuscript. More often than not, it is an unwitting error by a scientist excitedly analyzing their data. Remember Feynman's quote that "the first principle is that you must not fool yourself—and you are the easiest person to fool." This is not to discourage data exploration, but correctly accounting for these comparisons—or at least remaining cognizant of the issue—will ultimately be key to producing lasting insights.
4.1 | Avoiding multiple comparison pitfalls
How then should one account for multiple comparisons? The best approach is early, explicit planning. The gold standard, as practiced in clinical trials, is pre-registration (for instance, on ClinicalTrials.gov, run by the U.S. National Library of Medicine). In this strategy, the researcher publishes a white paper prior to starting the experiments, explicitly detailing how the data will be collected (including the stopping point), and the number and types of statistical analyses to be conducted. They are then held to this plan, or the trial is not considered valid. Such a strategy is often considered an unrealistic ideal outside of clinical trials, but notably an effort to publicly post methodologies a priori has recently been initiated for molecular phylogenetics (Phylotocol; DeBiasse & Ryan, 2018), a field particularly susceptible to "researcher degrees of freedom." Whether or not such registries are appropriate for geobiology in the long term is a subject for debate, and certainly the approach ignores the fact that much of our science (especially field science) is truly "discovery driven" rather than "hypothesis driven." Nonetheless, increased pre-experiment effort put into planning statistical analyses will be effective in reducing multiple comparison "creep." This may be particularly useful to bring up during project planning with early-career researchers, as essentially all aspects of experimental design are being taught at that point. Discussions of study-level reproducibility should also start making it into undergraduate and graduate geobiology curricula, as for instance is occurring in some psychology programs (Button, 2018).
Moving from pre-experiment awareness and planning to post-experiment data analysis, there are several techniques to account for multiple comparisons ("multiple testing correction" or "alpha inflation correction"). Certain Bayesian approaches may not require explicit correction (see Gelman, Hill, & Yajima, 2012), but in a frequentist context a common approach is to apply a direct correction that accounts for the increased family-wise error rate. The Bonferroni correction, for instance, divides the level of significance required (e.g., α = 0.05) by the number of independent analyses conducted. So, a researcher investigating how brachiopod body size changed between four different stratigraphic formations would require p < 0.008 (an overall α = 0.05 divided across 6 independent pairwise tests = 0.008) in order to achieve significance. Another common practice for such an analysis (multiple pairwise comparisons) is to conduct an omnibus test such as ANOVA, followed by a post hoc test such as the Tukey HSD test that directly accounts for increased family-wise error.
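The brachiopod example can be sketched as follows. The formation labels and body-size values below are invented for illustration; the omnibus test uses scipy's one-way ANOVA, and the pairwise threshold is the Bonferroni-adjusted α = 0.05/6 ≈ 0.008 described above:

```python
# Bonferroni-adjusted threshold for the 6 pairwise comparisons among
# 4 groups, plus an omnibus one-way ANOVA (hypothetical data throughout).
from itertools import combinations
from scipy.stats import f_oneway, ttest_ind

formations = {  # invented body sizes (mm) for four formations
    "A": [12.1, 13.4, 11.8, 12.9, 13.1],
    "B": [12.5, 13.0, 12.2, 13.3, 12.8],
    "C": [14.0, 14.6, 13.8, 14.9, 14.2],
    "D": [14.5, 15.1, 14.8, 15.4, 14.9],
}

n_tests = len(list(combinations(formations, 2)))  # 6 pairwise tests
alpha_bonferroni = 0.05 / n_tests                 # ~0.0083 -> "p < 0.008"
print(f"Bonferroni threshold: {alpha_bonferroni:.4f}")

f_stat, p_omnibus = f_oneway(*formations.values())  # omnibus ANOVA first
print(f"ANOVA: F = {f_stat:.1f}, p = {p_omnibus:.2g}")

for a, b in combinations(formations, 2):            # then pairwise t tests
    p = ttest_ind(formations[a], formations[b]).pvalue
    verdict = "significant" if p < alpha_bonferroni else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict})")
```

In practice a dedicated post hoc procedure such as Tukey's HSD (e.g., `pairwise_tukeyhsd` in statsmodels) is usually preferable to Bonferroni-corrected pairwise t tests, since it accounts for family-wise error directly.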
This general practice of alpha inflation correction has receivedsomecriticism, as the thresholds for significancecanbeoverly con-servative, leading tomore false negatives and potentially restrictingthepathoffuturescientificcuriosity(Moran,2003;Rothman,1990).Suchcriticismscommonlynotethatstudiesintheirfieldsoftenhave“asmallnumberofreplicates,highvariability,and(subsequently) lowstatisticalpower” (Moran,2003).Dr.Turekianmightwell relate!Theargumentputforthbysuchpapersisthattheincreasedincidenceinfalsepositivesisnotreallyanissue,becauseotherresearcherswillre-peattheexperiments,beunabletoreplicatethem,andthefalseclaimswilleventuallydisappearfromtheliterature.However,carefulstudyofreplicationattempts inpsychiatryandpsychologyhasdemonstratedthat(a)uncorrectedmultiplecomparisontestingdoesempiricallyleadtoamorassofpublishedfalsepositives(Duncan&Keller,2011),and(b)theoriginalhypothesesdonotsimplydisappearfromtheliteraturebuthaveincrediblepersistence(Ioannidis,2012).Thecausesofsuchper-sistencearevariedbutbasicallyboildowntolowincentivesforjournalsorauthorstopublishnegativereplications(Ioannidis,2012;Simmonsetal.,2011).Whiletheseargumentsonthestringencyofmultipletest-ingcorrectionsarenotmeritless (seenextparagraph), inouropinionwidespreadTypeIerrors(falsepositives)areafargreaterhindrancetotheadvancementofsciencethanTypeIIerrors(falsenegatives).
That said, determining the correct balance between the likelihood of false negatives and false positives, as well as encouraging cutting-edge methods development that must start with small datasets (Turekian's point), is obviously a complicated endeavor. Several other methods, such as the Holm and Hochberg methods (reviewed by Streiner, 2015), offer protection against alpha inflation but are not as conservative as a strict Bonferroni correction. Such corrections should also follow common sense, as they can degenerate into the absurd (García, 2004; Streiner, 2015). For instance, how many truly independent comparisons are really being conducted? Oceanographic factors such as temperature, oxygen, pH, light, and pressure can all be broadly correlated with depth. Comparing all of these against microbial abundances might result in multiple significant correlations that become (inappropriately) non-significant after correction for multiple comparisons. Another issue to consider is whether there may be pre-existing hypotheses that motivated the study. For the brachiopodologist studying body size across four formations, they may be testing previous hypotheses of body size evolution during a specific time period (e.g., Heim, Knope, Schaal, Wang, & Payne, 2015). The real test may be between the stratigraphically lowest and highest formations, and it may be unduly stringent to require a very low p-value resulting from the multiple pairwise comparisons.

FIGURE 2 Figure modified from the webcomic XKCD.com. Multiple statistical comparisons increase the probability of a false-positive result ("more shots on goal"). Specifically, with twenty independent comparisons as shown in the comic, the probability of a false positive is ~65% (Streiner, 2015).
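The Holm step-down procedure mentioned above can be written in a few lines. This is a sketch of the textbook algorithm (not any particular library's implementation): sort the p-values, then compare the i-th smallest against α/(m − i + 1), stopping at the first failure.

```python
# Holm's step-down multiple-testing correction: less conservative than
# Bonferroni, but still controls the family-wise error rate at alpha.
def holm_reject(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank 0 = smallest p-value
        if pvalues[i] <= alpha / (m - rank):  # Holm threshold: alpha/(m-rank)
            reject[i] = True
        else:
            break                             # first failure stops everything
    return reject

pvals = [0.001, 0.012, 0.021, 0.040, 0.30]
print(holm_reject(pvals))  # [True, True, False, False, False]
# Plain Bonferroni (alpha/m = 0.01) would reject only the first p-value;
# Holm also rejects 0.012, illustrating its reduced conservatism.
```

The same method is available as `multipletests(pvals, method='holm')` in statsmodels for real analyses.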
This difference has been discussed by Streiner (2015) as the difference between hypothesis testing (which may not require correction) and hypothesis generating (which should be reported as tentative and/or exploratory results). Or in plainer language, exploratory data analysis can be good, and explicit hypothesis testing can be good, but the approach being used should be clear. Indeed, there are many powerful techniques (including new machine learning techniques) to understand which of multiple predictor variables might best explain variance in the response variable of interest (a situation we often face in geobiology). The results of such analyses, though, cannot be turned around and presented as an a priori hypothesis complete with a significant p-value. Dr. Brian McGill's blog post provides a cogent "defense" of exploratory data analysis (https://dynamicecology.wordpress.com/2013/10/16/in-praise-of-exploratory-statistics). He closes, "I use exploratory statistics and I'm proud of it! And if I claim something was a hypothesis it really was an a priori hypothesis. You can trust me because I am out and proud about using exploratory statistics."
The key thread running through these statistical commentaries regarding multiple comparisons is conscientiousness—personally with respect to how many analyses have been conducted, but also with respect to how the full scope of statistical procedures is described in the paper. Simmons et al. (2011) provide excellent guidelines in this regard for both authors and reviewers. Notably, while promoting strict reporting guidelines, these authors also advocate for increased tolerance of statistical "imperfection" by reviewers and editors in well-documented papers: "one reason researchers exploit research degrees of freedom is the unreasonable expectation we often impose as reviewers for every data pattern to be (significantly) as predicted. Under-powered studies with perfect results are the ones that should invite extra scrutiny."
Finally, at the end of this discussion, it is worthwhile to ask yourself the basic question of whether multiple testing is an issue in your particular research area. If you are working solely with a single proxy record, or an isolated genetic system, the answer might be no. The issue arises if you want to understand what correlates with your data—if there is a large constellation of possible correlates, there is an equally large likelihood of spurious correlations. Learning from the abysmal record of replication in other fields, caution against alpha inflation in these cases will be a cornerstone to a robust and healthy field of geobiology.
Take home message: Vigilance against alpha inflation starts with the individual researcher, ideally during the pre-experimental design phase. Data exploration is encouraged, but trawling through data for significant correlations that are then presented as an a priori hypothesis must be avoided: Papers should explicitly state if they are to be viewed as exploratory. The full sweep of data collected, statistical tests conducted, and choices of data inclusion/exclusion (or other "degrees of freedom") by a researcher should be made clear in publication in order to avoid cherry-picking (Simmons et al., 2011).
5 | EFFECT SIZE
So let us say you have found a significant difference between two groups of data, and through attentive practice and analysis, you have determined it is not a chance result based on numerous "shots on goal." The question now is—does the result matter? This last question seems silly, but a mistaken focus on significant p-values as the be-all/end-all, instead of on effect size, has hampered progress in fields such as ecology (Fidler et al., 2004). Simply, effect size is a quantitative measure of the magnitude of a phenomenon. Statistical power is the likelihood that an analysis will detect a real effect (as determined by a significant p-value), and is governed by the size of the effect, the variation present in the groups, the number of samples in the analysis, and the threshold for significance (α). Thus, two groups with widely separated means (large effect size) and little within-group variation will require relatively few samples for a well-powered study.
The flip side of this—and why p-values must be regarded as the statistical likelihood a result that extreme would be found by chance, rather than how "important" a result is—is that given enough samples, literally any effect size, no matter how small, can be detected. This has received considerable attention in the statistical literature:
[The null hypothesis] can only be true in the bowels of a computer processor running a Monte Carlo study (and even then a stray electron may make it false). If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal in rejecting it?
(Cohen, 1990)
At this point, you may be asking, "since what we are really interested in is effect size, and given Cohen's statement that the null hypothesis is always false, do we even need to run statistical tests?" Yes! Without a significant result, there is no reason to believe the observed results may not be due to sampling variation. In other words, significance is the jumping-off point, the license to start talking about effect size from a position of confidence. For further reading, the relationship between sample size, p-values, and accuracy in inferring effect size is intelligently dissected by Halsey, Curran-Everett, Vowler, and Drummond (2015).
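Cohen's point, that with enough samples any nonzero effect becomes statistically significant, is easy to demonstrate by simulation. A sketch with invented parameters: two normal populations whose true means differ by a trivial 0.01 standard deviations, sampled a million times each.

```python
# With n large enough, even a trivially small true effect yields p < 0.05,
# which is why effect size must be reported alongside significance.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=42)
n = 1_000_000
a = rng.normal(loc=0.00, scale=1.0, size=n)  # two populations whose means
b = rng.normal(loc=0.01, scale=1.0, size=n)  # differ by only 0.01 sd

t_stat, p = ttest_ind(a, b)
# Cohen's d: mean difference scaled by the pooled standard deviation
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)

print(f"p = {p:.2g} (statistically 'significant')")
print(f"Cohen's d = {cohens_d:.3f} (a negligible effect size)")
```

Rerun with n = 100 per group and the same tiny effect is essentially undetectable, which is the sample-size dependence Halsey et al. (2015) dissect.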
What constitutes an important effect will vary by field. Returning to the coin toss example, given millions or billions or trillions of flips—whatever the required sample size may be—the differing weights of the coin sides will ultimately cause one side to land up significantly more times than the other. Yet, this does not matter when two friends are choosing a restaurant: It is still ~50:50, and the minuscule effect size is irrelevant in this instance. This is the difference between a significant effect and an important effect. Nonetheless, this fallacy is often committed in the literature:
All psychologists know that statistically significant does not mean plain-English significant, but if one reads the literature, one often discovers that a finding reported in the Results sections studded with asterisks implicitly becomes in the Discussion section highly significant or very highly significant, important, big!
(Cohen, 1994)
Sometimes, though, small effect sizes really do matter. As an example, human geneticists have learned that the majority of phenotypes are not controlled by a single gene/locus (such as the case for Huntington's disease). Rather, characteristics like height, and the risk of diseases such as heart disease, Type II diabetes, and depression, are caused by the (largely) additive effects of thousands of genetic loci. In the case of psychiatric disorders, each individual locus explains far less than 1% of variance in risk for a psychiatric disorder (i.e., small individual effects), and yet total genetic contributions explain 40%–80% of the variance in risk for these disorders (large summed effect) (Duncan et al., 2017; Ripke et al., 2014; Wray et al., 2018). Thus, as our questions move toward datasets with hundreds or thousands of measurements, the focus must be on geobiologically important effect sizes (which will vary by question) and confidence intervals rather than simply statistical significance.
6 | INTERPRETATIVE EXAMPLES
The relationship between p-values (significance), effect size, and sample size is illustrated in Figure 3. Note there is no relationship implied between the left and right panels; they are simply illustrative examples of these concepts. In Panel A, a researcher has identified relationships that appear interesting to pursue, with the Group B mean ~50% higher than Group A in the t test example (left side). On the right side, the predictor variable accounts for 27% of the variation (R² value) in the regression example. But the results are not statistically significant. Especially in light of recent claims that p < 0.05 is too lax a standard (Benjamin et al., 2018), there is a strong likelihood that this result—specifically the observed size and direction of the effect—is due to sampling effects (Halsey et al., 2015). If this were your hard-won data, it is important to avoid the temptation of knowingly or unknowingly manipulating the data (p-hacking) to achieve a "significant result." For instance, simply removing the lowest data point in Group B results in p = 0.03, significant! Maybe there is an eminently logical reason to exclude that data point—perhaps it is from a different and inappropriate lithology, or the sampling methodology was actually different. The goal though is to avoid inventing post hoc justifications for data manipulation, as it is so easy to fool yourself, particularly if removing that point provides support for long-held ideas (confirmation bias). If such data exclusions are made, they should be noted clearly in the manuscript, and both sets of analyses, with a reasoned explanation, should be included (Simmons et al., 2011). Moving toward shared transparency within scientific subfields for data collection, reporting, and presentation methods will be instrumental in helping researchers present reasonable results with less threat of conscious/unconscious p-hacking.

Panel B depicts essentially the same analysis as in Panel A, but with more samples. In this case, the result is significant, and the researcher can feel more confident describing the size and direction of the effect. Note that if Panel B were an extension of the study in Panel A, the best practice (sometimes difficult to achieve but nonetheless the best practice) is actually to establish a pre-determined stopping point for data collection (Simmons et al., 2011). Continuing to add bits of data, with the analysis rerun each time, effectively represents multiple tests.
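How strongly a single excluded point can move a p-value is easy to demonstrate with an exact permutation test. The data below are made up by us (they are not the values behind Figure 3); note the single low point in Group B:

```python
from itertools import combinations
from statistics import fmean

def exact_permutation_p(a, b):
    """Two-sided exact permutation test on the difference in group means:
    the p-value is the fraction of all relabelings of the pooled data
    giving a mean difference at least as extreme as the one observed."""
    observed = abs(fmean(b) - fmean(a))
    pooled = a + b
    count = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        grp_a = [pooled[i] for i in chosen]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # small tolerance so the observed labeling itself always counts
        if abs(fmean(grp_b) - fmean(grp_a)) >= observed - 1e-12:
            count += 1
    return count / total

group_a = [3.1, 4.0, 4.4, 5.2, 5.5]
group_b = [1.0, 6.1, 6.8, 7.2, 7.9]

p_full = exact_permutation_p(group_a, group_b)
p_dropped = exact_permutation_p(group_a, group_b[1:])  # lowest point removed

print(f"all data:        p = {p_full:.3f}")    # non-significant
print(f"point excluded:  p = {p_dropped:.3f}") # "significant!"
```

The test itself is legitimate; the danger lies entirely in the undisclosed, post hoc choice of which points to feed it.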
Panel C depicts how with larger sample sizes the power to identify small but statistically significant effects also increases. In the comparison of groups A and B, the means only differ by ~3%, but the result is highly significant (p = 0.007). In the regression analysis, the predictor variable only describes 4% of the variation in the response variable, but nonetheless, the result is significant by traditional measures (p = 0.047). Looking at the scatter in this plot is instructive, as it appears as a cloud of points with no correlation, but statistically, a small correlation does exist. In other words, it visually illustrates the quotes from Cohen: Effect size, not significance alone, is the end goal. As also previously discussed, the ultimate interpretation of importance is question- and field-specific. To reiterate the point, robust increases in crop yields of 4% may feed millions, whereas a change of 4% in modern marine sulfate levels (e.g., ~1 mM) may have little relevance to current research questions regarding sulfur biogeochemistry.
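The sample-size dependence in Panel C can be checked with a quick calculation in the spirit of the figure (assuming, hypothetically, ~100 values per group; we use a normal approximation rather than the exact t-distribution, which is adequate at these sample sizes):

```python
import math

def two_sample_z_p(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sided p-value for a difference in means, normal approximation."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = abs(mean2 - mean1) / se
    return math.erfc(z / math.sqrt(2))

# A ~3% difference in means (3.9 vs. 4.0) with 100 values per group...
p_large_n = two_sample_z_p(3.9, 0.2, 100, 4.0, 0.3, 100)
# ...versus the identical difference with only 10 values per group
p_small_n = two_sample_z_p(3.9, 0.2, 10, 4.0, 0.3, 10)

print(f"n = 100 per group: p = {p_large_n:.4f}")  # significant
print(f"n = 10 per group:  p = {p_small_n:.2f}")  # not significant
```

The effect size is unchanged between the two calls; only the sample size, and hence the significance, differs.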
7 | BEST PRACTICES MOVING FORWARD
We start this "best practices" section from a humble position, as we are by no means trained statisticians but rather enthusiastic advocates for increased statistical rigor in geobiology. We have made (or will make) technical and logical errors in our published papers and almost certainly have made unconscious "researcher degrees of freedom" decisions that impacted results (Simmons et al., 2011). Even Jacob Cohen, whose thoughtful papers we have cited numerous times, was called out for incorrect statistical logic (Oakes, 1986). Learning correct statistical practice and logic is thus a career-long endeavor for most scientists.
As a first step, we suggest that both reviewers and authors insist on some form of statistical analysis as a best practice approach. The critical concept here is that your particular set of measurements is not a static truth about a group. Rather, they are values drawn from a distribution, and repeated draws may yield very different results—especially at low sample sizes (Halsey et al., 2015). We do recognize that many geobiological studies will not lend themselves to formal statistical tests and that there is a healthy tradition of discovery-based geobiological science—especially in field studies—that should not be stifled. Nonetheless, if a paper is reporting observed differences between groups or correlations between variables, it is scientific due diligence to investigate the degree to which these differences would be expected given the null hypothesis (or, more broadly, the uncertainty of the result).

FIGURE 3 The relationship between significance, effect size, and sample size shown with hypothetical data points. This is depicted for measurements sampled from two different groups (left side) and correlation between a predictor variable and response variable (right side); note there is no strict relationship between the data in the two panels. (a) Small sample sizes may reveal a large effect, but the result is not significant and should be considered an intriguing finding rather than a confident conclusion. (b) With increased sampling, a researcher may (or may not) reveal a significant effect. The sign (direction) of this effect is likely correct, while increased sampling will lead to better accuracy of the effect's magnitude. (c) Given enough sampling, even the smallest effect size will become statistically significant. The right side plot in (c) is visually informative in this regard—what appears to be a cloud of points is, statistically speaking, correlated. For illustrative purposes, we have described effects here as "large" and "small," but as discussed in the text, this distinction will be field- and question-specific.

[Figure 3 panel values: (a) Non-significant result, possibly large effect size (10 samples): Group A 4.3 ± 2.3 vs. Group B 6.4 ± 2.8, p = 0.11; regression p = 0.12, R² = 0.27. (b) Significant result, large effect size (25 samples): Group A 4.2 ± 2.1 vs. Group B 6.6 ± 2.0, p = 0.0001; regression p < 0.0001, R² = 0.53. (c) Significant result, small effect size (100 samples): Group A 3.9 ± 0.2 vs. Group B 4.0 ± 0.3, p = 0.007; regression p = 0.047, R² = 0.04. *Effect sizes should not be conclusively interpreted in the absence of significant p-values.]
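The point that repeated draws from the same distribution scatter widely at small sample sizes can be seen in a few lines of simulation (an arbitrary normal distribution stands in here for some hypothetical proxy value):

```python
import random
from statistics import fmean

random.seed(1)

# One underlying "true" distribution for a hypothetical proxy value
def draw_sample(n):
    return [random.gauss(10.0, 2.0) for _ in range(n)]

# Repeat the same study 200 times at two sample sizes and watch how
# much the estimated mean wanders
means_n5 = [fmean(draw_sample(5)) for _ in range(200)]
means_n50 = [fmean(draw_sample(50)) for _ in range(200)]

spread_n5 = max(means_n5) - min(means_n5)
spread_n50 = max(means_n50) - min(means_n50)

print(f"range of estimated means, n = 5:  {spread_n5:.2f}")
print(f"range of estimated means, n = 50: {spread_n50:.2f}")
```

Any single five-sample study sits somewhere in that wide range; the "truth" it reports is one draw among many possible ones.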
In this respect, we note that we have focused primarily here on formal NHST, but there are certainly other strategies such as model-based approaches, information theoretic approaches, and/or bootstrapping that achieve the same general goal of understanding true relationships (as distinguished from spurious results that are simply due to sampling variability). For instance, maximum likelihood and Bayesian approaches to assess the impact of random error are the common practice in molecular phylogenetics. The debate about whether formal NHST should be retained and improved or done away with is decades old, rages still, and is not something we can adequately address here (Benjamin et al., 2018; Cohen, 1994; Falk & Greenbaum, 1995; Fidler et al., 2004; Halsey et al., 2015). And as Cohen (1994) noted, there is no magic alternative to NHST. Certainly, though, all approaches discussed above are more likely to lead to correct inference than visual inspection of data.
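As a flavor of those non-NHST strategies, a percentile bootstrap quantifies the uncertainty of an estimate without any distributional formula. This is a minimal sketch on made-up measurements (the data and the choice of the mean as the statistic are ours, for illustration):

```python
import random
from statistics import fmean

random.seed(7)

# Hypothetical skewed measurements (e.g., trace-element concentrations)
data = [0.4, 0.6, 0.7, 0.9, 1.1, 1.2, 1.6, 2.3, 3.8, 6.0]

# Percentile bootstrap: resample with replacement many times and read
# the uncertainty of the mean off the distribution of resampled means
n_boot = 5000
boot_means = sorted(
    fmean(random.choices(data, k=len(data))) for _ in range(n_boot)
)
lo = boot_means[int(0.025 * n_boot)]
hi = boot_means[int(0.975 * n_boot)]

print(f"sample mean: {fmean(data):.2f}")
print(f"95% bootstrap CI: ({lo:.2f}, {hi:.2f})")
```

The same recipe works for medians, slopes, or any statistic a formula-based test would struggle with.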
As statistical tests are increasingly implemented in geobiology, the next step is to learn from the mistakes of other fields and help each other not fall foul of the fallacies described in this essay. Specifically, the points to avoid are:
1. Not correctly accounting for multiple testing.
2. Considering the p-value to be the probability that the hypothesis is correct.
3. Considering p = 0.045 to be a dramatically stronger rejection of the null hypothesis than p = 0.055 ("mechanical dichotomous decisions").
4. Not explicitly reporting "researcher degrees of freedom" (Simmons et al., 2011).
5. Considering every "significant" result to be an "important" result. Rather, significance should typically be the requirement for positing a scientific effect, which must then be put into appropriate context.
Of these, multiple testing and "researcher degrees of freedom" are likely the most problematic with respect to reproducibility, especially as they can often be done unconsciously. Explicit planning at the start of a study is crucial in this regard. Also important at the planning stage is power analysis. Ioannidis (2005) notes that in terms of achieving long-lasting scientific insights, fewer well-powered studies are vastly preferable to many low-powered studies, and certainly the field benefits from not chasing false leads. Further, as demonstrated by Halsey et al. (2015), larger sample sizes and well-powered studies also more precisely estimate the effect size, which is after all what we are interested in. The challenge here is an obvious conflict between incentive structures for the field as a whole and individual researchers, specifically early-career researchers. Such considerations should ideally play into evolving discussions on how post-docs, faculty positions, and tenure are evaluated. Fortunately, in geobiology we are often addressing first-order questions with large effects, and the required increase in sample size to achieve a well-powered study is often not that large (Sterne & Smith, 2001).
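A basic planning-stage power calculation, in the spirit of Cohen (1992), can be sketched in a few lines. This uses a normal approximation (exact t-based tables give slightly larger answers, e.g., 26 rather than 25 per group for a large effect), and the standardized effect sizes of 0.8 and 0.2 are Cohen's conventional "large" and "small" benchmarks, not values from this paper:

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison.
    d is the standardized effect size (Cohen's d); alpha is two-sided.
    Normal approximation: n = 2 * ((z_alpha + z_power) / d)^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_a + z_b) / d) ** 2)

# A large, first-order effect needs few samples; a subtle one needs many
print(n_per_group(0.8))  # 25 per group
print(n_per_group(0.2))  # 393 per group
```

This is the quantitative basis for the optimistic note above: when expected effects are large, a well-powered study is not much more expensive than an underpowered one.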
As one final note regarding reproducibility, correctly documenting original data and metadata in accessible supplementary documents or data repositories is key to allowing researchers to test results and build on previous studies in meta- and mega-analyses (see for instance Ioannidis et al., 2009). Original code used for analyses must also be adequately curated in a public repository; this step is common in biological fields such as ecology (Cooper et al., 2017; Ram, 2013), but is not yet common across geobiology.
To avoid feeding a publication bias monster, and to encourage new developments in a manner Karl Turekian would be proud of, we do not advocate a strict requirement of significance for publication. In our opinion, the results in Figure 3, Panel A, if from an emerging proxy record, would be quite suggestive and should be considered for publication, but the reader should be told how likely such data would be given the null hypothesis. Power analysis is also helpful in this regard (Cohen, 1992). As an example, an influential genetics paper was published in Nature despite having null results for the primary hypothesis (The International Schizophrenia Consortium, 2009). This paper provided strong evidence that significant results would be detectable with larger sample sizes in the near future, and it was the application of a recently developed statistical technique (polygenic risk scoring) that made it worthy of publication in Nature.
Moving to the longer-term view, the "best practices" described above are hopefully useful, but the goal for the next generation of geobiologists should not be best practices lists taped to the side of cubicles. Put simply, even the most well-intentioned of "best practices" lists can lead to a "cookbook" view of data analysis, where there is a right and a wrong way to do things, and statistics are a computer button to push after data acquisition. Rather, the goal should be a situation where, for many, computational reasoning and data science are a natural, integrated part of our science alongside field and laboratory skills. The main need for this is simply that many of our scientific questions do not readily conform to classic statistical tests. As an example, Keller and Schoene (2012) investigated how igneous geochemistry has changed through Earth history, and recognized that these rocks are not evenly sampled in space and time—some plutons are heavily sampled, while others were sampled rarely or not at all. In other words, samples are not independent. To address this issue of sampling heterogeneity, they utilized a re-weighted bootstrapping approach, with bootstrap weights for a given sample related to the spatial and temporal proximity to other samples. Paleontologists have also addressed the same issue of sampling heterogeneity, but using different methods appropriate to the data archive of that field (e.g., Alroy, 2010). Neither of these solutions came from a statistics "cookbook," and achieving the flexibility to design the most appropriate test (be it frequentist, likelihood, or Bayesian), or to perform numerical experiments testing different scenarios, will require a foundation in statistics but also, importantly, computational thinking (Weintrop et al., 2016). Geobiology has had considerable success in breaking field boundaries and educating students who are as comfortable with a rock hammer as a pipette; integrating a computational and statistical perspective into this training will be the next step to drawing robust insights from ever-larger geobiological datasets.
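The flavor of a re-weighted bootstrap is easy to sketch. The toy below is not the Keller and Schoene (2012) algorithm: their weights derive from spatiotemporal proximity, whereas here a crude cluster-count weight stands in, and the data are invented (one heavily sampled "pluton" and one sampled once):

```python
import random
from statistics import fmean

random.seed(3)

# Hypothetical geochemical values: cluster "A" measured nine times,
# cluster "B" once — the samples are not independent
values = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 5.0]
cluster = ["A"] * 9 + ["B"]

# Weight each sample inversely to how many samples share its cluster
counts = {c: cluster.count(c) for c in set(cluster)}
weights = [1.0 / counts[c] for c in cluster]

naive_mean = fmean(values)
boot_means = [
    fmean(random.choices(values, weights=weights, k=len(values)))
    for _ in range(2000)
]
reweighted_mean = fmean(boot_means)

print(f"naive mean:       {naive_mean:.2f}")  # dominated by cluster A
print(f"re-weighted mean: {reweighted_mean:.2f}")  # ~ (1 + 5) / 2
```

The naive mean is pulled toward the oversampled cluster; the weighted resampling treats each sampling unit, rather than each measurement, as the draw.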
In closing, we do not expect a statistical revolution in geobiology overnight. The common implementation of statistical analyses in fields like ecology took decades. In fact, Gosset ("Student") wrote to Fisher regarding the t test that "I am sending you a copy of Student's Tables as you are the only [person] that's ever likely to use them!" (cited in Box, 1981). We hope this Perspective helps start a dialogue regarding statistical practice in geobiology, while also recognizing it is heavily colored by our perspective—we look forward to seeing commentary from other perspectives (different subfields of geobiology, Bayesianists, etc.). Ultimately, this will help us avoid the problems with reproducibility present in other fields, be more confident in our results, and as a field move more quickly toward deeper geobiological understanding.
ACKNOWLEDGMENTS
We thank David Johnston, Jon Payne, Matt Clapham, and an anonymous reviewer for comments on a previous version of this manuscript, and James Farquhar, Anne Dekas, Joe Ryan, David Evans, and Alex Bradley for helpful discussion. We thank Randall Munroe of XKCD.com for permission to reproduce Figure 2. EAS and SAT were funded by a Sloan Research Fellowship.
CONFLICT OF INTEREST

The authors declare no conflicts of interest.
ORCID
Erik A. Sperling https://orcid.org/0000-0001-9590-371X
Erik A. Sperling1
Sabrina Tecklenburg1
Laramie E. Duncan2
1Department of Geological Sciences, Stanford University, Stanford, California
2Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, California
Correspondence
Erik A. Sperling, Department of Geological Sciences, Stanford University, Stanford, CA.
Email: [email protected]
REFERENCES
Alroy, J. (2010). The shifting balance of diversity among major marine animal groups. Science, 329, 1191–1194. https://doi.org/10.1126/science.1189910

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature News, 533, 452. https://doi.org/10.1038/533452a

Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483, 531–533. https://doi.org/10.1038/483531a

Begley, C. G., & Ioannidis, J. P. A. (2015). Reproducibility in science: Improving the standard for basic and preclinical research. Circulation Research, 116, 116–126. https://doi.org/10.1161/CIRCRESAHA.114.303819

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z

Bouter, L. M., Tijdink, J., Axelsen, N., Martinson, B. C., & ter Riet, G. (2016). Ranking major and minor research misbehaviors: Results from a survey among participants of four World Conferences on Research Integrity. Research Integrity and Peer Review, 1, 17. https://doi.org/10.1186/s41073-016-0024-5

Box, J. F. (1981). Gosset, Fisher, and the t distribution. The American Statistician, 35, 61–66.

Button, K. (2018). Reboot undergraduate courses for reproducibility. Nature, 561, 287. https://doi.org/10.1038/d41586-018-06692-8

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. https://doi.org/10.1038/nrn3475

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312. https://doi.org/10.1037/0003-066X.45.12.1304

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. https://doi.org/10.1037/0033-2909.112.1.155

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. https://doi.org/10.1037/0003-066X.49.12.997

Cooper, N., Hsing, P.-Y., Croucher, M., Graham, L., James, T., Krystalli, A., & Primeau, F. (2017). A guide to reproducible code in ecology and evolution. BES Guides to Better Science. London, UK: British Ecological Society.

DeBiasse, M. B., & Ryan, J. F. (2018). Phylotocol: Promoting transparency and overcoming bias in phylogenetics. Systematic Biology, syy090. https://doi.org/10.1093/sysbio/syy090

Duncan, L. E., & Keller, M. C. (2011). A critical review of the first 10 years of candidate gene-by-environment interaction research in psychiatry. The American Journal of Psychiatry, 168, 1041–1049. https://doi.org/10.1176/appi.ajp.2011.11020191

Duncan, L., Yilmaz, Z., Gaspar, H., Walters, R., Goldstein, J., Anttila, V., … Bulik, C. M. (2017). Significant locus and metabolic genetic correlations revealed in genome-wide association study of anorexia nervosa. American Journal of Psychiatry, 174, 850–858. https://doi.org/10.1176/appi.ajp.2017.16121402

Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5, 75–98. https://doi.org/10.1177/0959354395051004

Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE, 4, e5738. https://doi.org/10.1371/journal.pone.0005738

Fidler, F., Burgman, M. A., Cumming, G., Buttrose, R., & Thomason, N. (2006). Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology, 20, 1539–1544. https://doi.org/10.1111/j.1523-1739.2006.00525.x

Fidler, F., Cumming, G., Burgman, M., & Thomason, N. (2004). Statistical reform in medicine, psychology and ecology. The Journal of Socio-Economics, 33, 615–630. https://doi.org/10.1016/j.socec.2004.09.035

García, L. V. (2004). Escaping the Bonferroni iron claw in ecological studies. Oikos, 105, 657–663. https://doi.org/10.1111/j.0030-1299.2004.13046.x

Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189–211. https://doi.org/10.1080/19345747.2011.618213

Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on "Estimating the reproducibility of psychological science". Science, 351, 1037. https://doi.org/10.1126/science.aad7243
Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12(3), 179–185. https://doi.org/10.1038/nmeth.3288

Heim, N. A., Knope, M. L., Schaal, E. K., Wang, S. C., & Payne, J. L. (2015). Cope's rule in the evolution of marine animals. Science, 347, 867–870. https://doi.org/10.1126/science.1260065

Hines, W. C., Su, Y., Kuhn, I., Polyak, K., & Bissell, M. J. (2014). Sorting out the FACS: A devil in the details. Cell Reports, 6, 779–781. https://doi.org/10.1016/j.celrep.2014.02.021

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2, e124. https://doi.org/10.1371/journal.pmed.0020124

Ioannidis, J. P. A. (2012). Why science is not necessarily self-correcting. Perspectives on Psychological Science, 7, 645–654. https://doi.org/10.1177/1745691612464056

Ioannidis, J. P. A., Allison, D. B., Ball, C. A., Coulibaly, I., Cui, X., Culhane, A. C., … van Noort, V. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics, 41, 149–155. https://doi.org/10.1038/ng.295

Keller, C. B., & Schoene, B. (2012). Statistical geochemistry reveals disruption in secular lithospheric evolution about 2.5 Gyr ago. Nature, 485, 490–493. https://doi.org/10.1038/nature11024

Lithgow, G. J., Driscoll, M., & Phillips, P. (2017). A long journey to reproducible results. Nature News, 548, 387. https://doi.org/10.1038/548387a

Moran, M. D. (2003). Arguments for rejecting the sequential Bonferroni in ecological studies. Oikos, 100, 403–405. https://doi.org/10.1034/j.1600-0706.2003.12010.x

Nuzzo, R. (2014). Scientific method: Statistical errors. Nature News, 506, 150. https://doi.org/10.1038/506150a

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. Chichester, UK: Wiley.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716.

Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712. https://doi.org/10.1038/nrd3439-c1

Ram, K. (2013). Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8, 7. https://doi.org/10.1186/1751-0473-8-7

Ripke, S., Corvin, A., Walters, J. T. R., Farh, K.-H., Holmans, P. A., Lee, P., … O'Donovan, M. C. (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature, 511, 421–427.

Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284. https://doi.org/10.1037/0003-066X.44.10.1276

Rothman, K. J. (1990). No adjustments are needed for multiple comparisons. Epidemiology, 1, 43–46. https://doi.org/10.1097/00001648-199001000-00010

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. https://doi.org/10.1177/0956797611417632

Sterne, J. A. C., & Smith, G. D. (2001). Sifting the evidence—what's wrong with significance tests? British Medical Journal, 322, 226–231. https://doi.org/10.1136/bmj.322.7280.226

Streiner, D. L. (2015). Best (but oft-forgotten) practices: The multiple problems of multiplicity–whether and how to correct for many statistical tests. The American Journal of Clinical Nutrition, 102, 721–728. https://doi.org/10.3945/ajcn.115.113548

Student (1908). The probable error of a mean. Biometrika, 6, 1–25.

The International Schizophrenia Consortium (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460, 748–752.

Thiemens, M. H., Davis, A. M., Grossman, L., & Colman, A. S. (2013). Turekian reflections. Proceedings of the National Academy of Sciences of the United States of America, 110, 16289–16290. https://doi.org/10.1073/pnas.1315804110

Weintrop, D., Beheshti, E., Horn, M., Orton, K., Jona, K., Trouille, L., & Wilensky, U. (2016). Defining computational thinking for mathematics and science classrooms. Journal of Science Education and Technology, 25, 127–147. https://doi.org/10.1007/s10956-015-9581-5

Wray, N. R., Ripke, S., Mattheisen, M., Trzaskowski, M., Byrne, E. M., Abdellaoui, A., … Sullivan, P. F. (2018). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nature Genetics, 50, 668–681. https://doi.org/10.1038/s41588-018-0090-3