Geobiology. 2019;1–11. wileyonlinelibrary.com/journal/gbi. © 2019 John Wiley & Sons Ltd
Received: 4 June 2018 | Revised: 30 October 2018 | Accepted: 30 December 2018 | DOI: 10.1111/gbi.12333
PERSPECTIVE
Statistical inference and reproducibility in geobiology
1 | INTRODUCTION
The late, great Karl Turekian would often joke about the number of new data points required for a geochemical paper. The answer was one: When combined with a previously published data point, a "best-fit" line could be drawn between the two points, and the slope of the line calculated, thereby giving rate. The joke had its roots in a seminar Dr. Turekian gave far earlier in his career, where the entire talk focused on the first data point of a novel measurement (a hard-won data point, but only one nonetheless). When an apparently exasperated audience member pointed this out, Turekian reportedly replied "well it's one more than anyone else has!" (Thiemens, Davis, Grossman, & Colman, 2013). This anecdote raises two important points about our field: First, establishing new analytical techniques is laborious, expensive, and requires considerable vision and skill, and second, at the nascent stages of any new record, the number of observations will be small.
Geobiology is now at the point where some mature data records (for instance δ13C or δ34S) have had thousands or tens of thousands of data points generated. There also exist millions of individual gene sequences from environmental genomics surveys. In contrast, emerging proxies (for instance, selenium isotopes, I/Ca ratios, carbonate "clumped" isotopes), or biomarker studies and technical geomicrobiological experiments, still face the issue Turekian described and have only a handful of measurements or are transitioning toward larger datasets. The question remains how best to interpret data-rich records on the one hand, and scattered, hard-won data points on the other hand. Here, using examples from within geobiology as well as the development of other fields such as ecology, psychology, and medicine, we argue that increased clarity regarding significance testing, multiple comparisons, and effect sizes will help researchers avoid false-positive inferences, better estimate the magnitude of changes in proxy records or experiments, and ultimately yield a richer understanding of geobiological phenomena.
2 | STATISTICS AND REPRODUCIBILITY IN OTHER FIELDS AND IN GEOBIOLOGY
We start by examining statistical practice and reproducibility outside of our field, before relating these broader themes back to geobiology. Science as a whole is currently described as facing a "crisis of reproducibility," with diverse studies in multiple fields failing to replicate published findings (see Baker, 2016, for a Nature survey of reproducibility across fields). The issue here is not technical reproducibility at the sample level (e.g., "If I put this same sample in a mass spectrometer, will I get the same result twice?") but rather at the level of the effects observed in the manuscript (e.g., a given treatment causes a specific outcome, or a predictor variable is correlated with a response variable). For example, independent efforts to reproduce "landmark papers" in cancer biomedical research by pharmaceutical companies Amgen (Begley & Ellis, 2012) and Bayer (Prinz, Schlange, & Asadullah, 2011) have only been able to replicate 11% and 20%–25% of published results, respectively. Similarly, a critical study of gene-by-environment (GxE) interactions in psychiatry found that only 27% of published replication attempts were positive (Duncan & Keller, 2011). Further, this small proportion of apparently positive replications was likely inflated; there were clear signatures of publication bias (the tendency to publish significant results more readily than non-significant results) among replication attempts in the GxE literature. In psychology, a large-scale replication effort found that only 39% of studies could be replicated (Open Science Collaboration, 2015), and similar problems plague much of neuroscience research (Button et al., 2013). The studies mentioned here are likely just the tip of the iceberg (Begley & Ioannidis, 2015; Ioannidis, 2005).
The causes of these low rates of replication are varied. There are likely different underlying causes for poor reproducibility across fields, and the severity of the problem undoubtedly varies as well. Our personal experience trying to implement published protocols without the highly detailed, laboratory-specific knowledge that is often omitted (generally for space reasons) from Materials and Methods sections suggests to us that at least some percentage of failed replications are caused by inadvertent methodological differences. Indeed, the Open Science Collaboration replication of psychology studies was attacked for poor adherence to original protocols (Gilbert, King, Pettigrew, & Wilson, 2016). In order to achieve precise replication of protocols, in extreme cases researchers have had to travel to other laboratories and work side-by-side to identify seemingly trivial methodological differences—vigorous stirring versus prolonged gentle shaking to isolate cells, for instance—that had an outsized effect on reproducibility (Hines, Su, Kuhn, Polyak, & Bissell, 2014; Lithgow, Driscoll, & Phillips, 2017). Methodological differences, though, can likely only explain a portion of the failed replication attempts. True scientific fraud is also something that makes headlines and erodes public trust, but again likely only accounts for a small proportion of replication failures (Bouter, Tijdink, Axelsen, Martinson, & ter Riet, 2016; Fanelli, 2009).
So, what causes such poor reproducibility? While recognizing again that the causes can vary across fields, clearly some of the most important factors are under-powered studies, a reliance on Null Hypothesis Significance Testing (NHST, e.g., p < 0.05) to determine "truth" or whether a paper should be published, and a lack of correction for implicit or explicit multiple comparisons (more broadly, "researcher degrees of freedom"). These issues were known within some fields, but were brought to more widespread attention through the publication of a provocative 2005 essay, "Why Most Published Research Findings Are False" (Ioannidis, 2005). Ioannidis identified the following causal factors as leading to the overall very low percentages of positive replication findings: (a) small sample sizes, (b) small effect sizes, (c) high numbers of tested relationships, (d) flexibility in designs, definitions, and what represents an unequivocal outcome, (e) conflicts of interest or prejudice within a field, thus increasing bias, and (f) "hot" fields of science, where more teams are simultaneously testing many different relationships. This tendency is exacerbated by the incentive structure for journals and authors to publish positive results and publish often.
Our field of geobiology is likely less beset by some of these issues related to, for instance, publication bias. The reason why is that data analysis is often accomplished through visual rather than statistical comparison; explicit p-values that would be used as the arbiter of accept/reject decisions in other fields are not generated. Figure 1 documents the use of statistical testing in the journal Geobiology compared to a sister publication by Wiley, Marine Ecology. This comparison is not meant to single out Geobiology as a journal; we are certain the results would be similar in our other disciplinary journals. In this comparison, we considered the 100 most recent papers at the time of writing that reported new data (so review papers, commentaries, or papers that were descriptive in nature, such as describing stromatolite morphologies—basically any paper where no new numerical data were generated—were excluded). Papers were considered to have a "statistical analysis" simply if there was some effort to understand the possible influence of error and sampling on the precision of the inference, recognizing this can take many forms (e.g., formal NHST, bootstrapping, and Bayesian posterior probabilities). In the marine ecology journal, essentially every paper (97%) reporting new data used a statistical analysis (similar results were reported by Fidler, Burgman, Cumming, Buttrose, & Thomason, 2006, for other ecology journals). In Geobiology, the percentage with any sort of statistical analysis is far lower, at 38% (chi-squared test; p = 2.0 × 10⁻¹⁸). We recognize that some of this difference may be due to different data types and approaches, but our personal observation is that many studies in Geobiology are comparing groups of data visually rather than statistically. Further, the statistical analyses that are published in Geobiology are concentrated in molecular phylogenetic studies, where bootstraps or Bayesian posterior probabilities are commonly used to assess precision.
Statistical testing is far less common in non-phylogenetic studies (e.g., many geomicrobiology, geochemistry, biomarker, and biomineralization studies). Comparing the most recent papers in Geobiology against the first 100 data-driven papers in the journal (first published in 2003) reveals that the percentage with a statistical test has increased slightly (from 26% to 38%), but the difference between the two time intervals examined is not statistically significant (using p < 0.05 to declare statistical significance; chi-sq = 2.8, df = 1, p = 0.095).
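The journal comparison above is a standard chi-squared test of independence on a 2 × 2 contingency table. A minimal sketch using the counts given in the text (26/100 vs. 38/100); scipy's default Yates continuity correction for 2 × 2 tables is assumed, which matches the reported chi-sq = 2.8:

```python
# Chi-squared test of independence: did the proportion of data-driven
# Geobiology papers with any statistical analysis change over time?
from scipy.stats import chi2_contingency

#         with test, without test
table = [[26, 74],   # Geobiology (early):   26 of 100 papers
         [38, 62]]   # Geobiology (present): 38 of 100 papers

chi2, p, df, expected = chi2_contingency(table)  # Yates correction by default
print(f"chi-sq = {chi2:.1f}, df = {df}, p = {p:.3f}")
# prints "chi-sq = 2.8, df = 1, p = 0.095", matching the values in the text
```

The same call with the Marine Ecology counts (97/100 vs. 38/100) yields the vanishingly small p-value quoted earlier.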
We propose that increased emphasis on statistical testing as a critical step in data analysis is needed in the field of geobiology. Put simply, why build a scientific worldview based on studies where it has not been demonstrated that the observed differences are more than what would be expected from sampling variability? On the other hand, blind reliance on statistical testing will not be helpful either. As cogently argued by Fidler, Cumming, Burgman, and Thomason (2004), ecology as a field is mired in statistical fallacies: specifically, the erroneous beliefs that p-values are a direct index of effect size, and that the p-value represents the probability that the null hypothesis is true (or false). Consequently, simply running more statistical analyses is not a sufficient avenue to accurate, reproducible, and correctly interpreted findings. Here, we hope to use the opportunity provided by broader discussions about statistics and reproducibility in science to review three important concepts—(a) significance testing, (b) multiple comparisons, and (c) effect size—and translate them to our field of geobiology. This manuscript is intended to be educational and to start a discussion regarding proper statistical analyses in geobiology. Our focus is on what we believe are the more familiar terms of formal NHST (i.e., a frequentist approach), but we discuss the goal of a more flexible approach to statistical thinking at the conclusion of the paper. In this spirit, we recognize that these topics will be familiar to many readers, and indeed, aside from a geobiological spin, there is little intellectual territory here that has not been extensively covered in medical, psychological, and ecological journals. However, for readers less familiar with these topics, we hope this essay may provide useful guidance for avoiding some of the problems of reproducibility that have plagued other fields. Geobiology ultimately has the opportunity to be among the select group of fields that do not have major reproducibility problems.

FIGURE 1 Use of statistical testing in the 100 most recent data-driven papers in Geobiology (present) versus Marine Ecology. The first 100 papers in Geobiology (early) were also compared. "Statistical tests" were broadly defined (for instance, bootstrapping, Bayesian approaches, etc., and not only formal Null Hypothesis Significance Testing). Papers without a "Materials and Methods" section or that did not present new numerical data (e.g., review/synthesis papers, modeling papers) were not included in the comparison. [Bar chart: papers with versus without a statistical test in Geobiology (early), Geobiology (present), and Marine Ecology.]
Take home message: There are acknowledged issues with Null Hypothesis Significance Testing (including publication bias and replicability in science at large), but studies should not rely on visual analysis alone. A formal examination of the size of the effect relative to sampling variation is a key tool in evaluating new scientific claims.
3 | NULL HYPOTHESIS SIGNIFICANCE TESTING (NHST)
Despite its widespread use in science, the development of NHST has its roots in industrial applications. For instance, Ronald Fisher developed the analysis of variance working with experimental crop data, and William Gosset (nominated here as the patron saint of geobiological data analysis) developed the Student's t test working to increase yields at the Guinness Brewery (Student, 1908). When faced with two (or more) groups of samples, each with scatter, the pertinent question is whether the sets of samples may actually represent the same underlying data distribution, with observed variation between groups arising from sampling of this distribution (the null hypothesis, H0, is that there is no difference between groups; in other words, there is only one underlying distribution). In NHST, this question is addressed by comparing the means or medians of the groups (and data variability) and calculating the probability that a result at least that extreme would be found if the groups were drawn from the same distribution or population. This probability is expressed as the p-value. Incorrectly rejecting the null hypothesis (a false positive) is referred to as Type I error, and incorrectly failing to reject the null hypothesis (a false negative) is Type II error. A variety of parametric (those that depend on a specified probability distribution from which the data are drawn) and nonparametric (those that do not) statistical analyses exist to make these comparisons between two or more groups. A full review of particular analyses is beyond the scope of this article and is best found in statistical textbooks and web resources.
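The two-group comparison described above can be sketched in a few lines. The isotope-style values below are invented purely for illustration (they come from no published dataset); scipy's `ttest_ind` and `mannwhitneyu` are, respectively, a parametric and a nonparametric test of the same basic question:

```python
# Compare two small groups of (hypothetical) measurements.
# H0: both groups are drawn from the same underlying distribution.
from scipy.stats import ttest_ind, mannwhitneyu

group_a = [-27.1, -26.8, -27.5, -26.9, -27.3, -27.0]  # invented values
group_b = [-25.9, -26.2, -25.7, -26.4, -26.0, -26.1]  # invented values

t_stat, p_t = ttest_ind(group_a, group_b)     # parametric: assumes normality
u_stat, p_u = mannwhitneyu(group_a, group_b)  # nonparametric: rank-based

print(f"t test:       t = {t_stat:.2f}, p = {p_t:.4g}")
print(f"Mann-Whitney: U = {u_stat:.1f}, p = {p_u:.4g}")
```

With these well-separated groups both tests return small p-values; with heavily overlapping groups both would not, and the nonparametric version is the safer default when the underlying distribution is unknown.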
Choosing the correct statistical analysis for a given set of data poses some issues, but the more important issue—the focus of this piece—is understanding the fundamental logic of statistical tests: What they do and do not tell you. It is errors in logic, rather than someone using a t test when they should have used a Wilcoxon, which are most problematic. One common fallacy regarding NHST is that the level of significance rejecting the null hypothesis (the p-value) is the probability that the null hypothesis is correct. There is a wide literature discussing this fallacy, with one of the best being Cohen's 1994 essay "The Earth is round (p < 0.05)." He notes that what we want to know, as researchers, is "given these data, what is the probability that H0 is true?" But what NHST tells us is "given that H0 is true, what is the probability of these (or more extreme) data?" In other words, rejecting a specific null hypothesis does not actually confirm any underlying truth or theory. Addressing this requires knowing the likelihood that a real effect exists in the first place, which may be difficult to calculate. Even if these odds can be calculated, the combined uncertainty results in more substantial murkiness about the results than the p-value alone indicates (Nuzzo, 2014, provides more information on what a p-value really tells you—in a form more palatable to the average reader than Cohen—as well as a historical discussion of how the p = 0.05 threshold came about).
Another common error in interpreting formal statistical tests regards "mechanical dichotomous decisions around a sacred 0.05 criterion" (Cohen, 1994). Obvious to most readers is that a 5.5% probability of generating results at least that extreme (e.g., p = 0.055) is basically no different, in an interpretive sense, than a 4.5% probability (p = 0.045). In other words, p < 0.05, p < 0.01, and p < 0.005 (Benjamin et al., 2018, recently advocated for the more stringent p < 0.005 statistical criterion) are all arbitrary criteria, although still useful conventions. A memorable quote by Rosnow and Rosenthal (1989) states:
We want to underscore that, surely, God loves the 0.06 nearly as much as the 0.05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?
Yet while researchers inherently "know" p = 0.045 and 0.055 are really no different, one result is deemed "true" and publishable (positive publication bias) and one is deemed inconsequential and ignored. Or in Rosnow and Rosenthal's more dramatic prose:
It may not be an exaggeration to say that for many PhD students, for whom the 0.05 alpha has acquired an almost ontological mystique, it can mean joy, a doctoral degree, and a tenure-track position at a major university if their dissertation p is <0.05. However, if the p is >0.05, it can mean ruin, despair, and their advisor's suddenly thinking of a new control condition that should be run.
Take home message: Null Hypothesis Significance Testing investigates the probability (expressed as p-values) of finding results as extreme as the observed data, given the null hypothesis. Of note, p-values cannot address the likelihood that a hypothesis is correct. p-value thresholds represent arbitrary cutoffs rather than true/false criteria for accept/reject decisions.
4 | MULTIPLE TESTING AND RESEARCHER DEGREES OF FREEDOM
A colleague once described a conference—in the pre-PowerPoint days—in which geologists were instructed to bring an overhead transparency summarizing the record of their subfield through Earth history (tectonic events, fossil trends, geochemical proxy records, etc.) along the same x-axis temporal scale. The participants then took turns swapping different transparencies on and off the projector, looking for correlations. This strategy was innovative from a data exploration perspective—and indeed sounds intellectually stimulating—but ultimately is a nightmare regarding multiple comparisons. The multiple comparison problem essentially results from "more shots on goal." It can be illustrated with the following example—the probability of flipping any given coin as "heads" 10 times out of 10 is very low; however, if this is done over and over, the probability that one coin will be "heads" every time obviously increases. It would be incorrect, though, to conclude that that coin is different from the rest. Specifically for NHST, the probability of a false positive in a battery of tests will be 1 − (1 − α)^k, where alpha (α) is the significance level required (e.g., p < 0.05) and k is the number of tests performed (discussed in detail in Streiner, 2015). For 10 separate tests at the standard level of significance, the probability of a false positive is 1 − (1 − 0.05)^10, or about 40%. This issue is perhaps even better illustrated by Figure 2 below (analyzing the effect of jelly bean colors).
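The family-wise error rate formula above is easy to verify numerically, using only the standard library:

```python
# Probability of at least one false positive among k independent tests,
# each run at significance level alpha: 1 - (1 - alpha)^k
def familywise_error_rate(alpha, k):
    return 1 - (1 - alpha) ** k

for k in (1, 10, 20):
    fwer = familywise_error_rate(0.05, k)
    print(f"k = {k:2d} tests -> P(at least one false positive) = {fwer:.2f}")
# k = 10 gives ~0.40 (the 10-test example in the text);
# k = 20 gives ~0.64 (the twenty jelly bean colors of Figure 2)
```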
Uncorrected multiple comparisons are one of the primary causes of replicability issues across many scientific fields. In some cases, these comparisons may have been done explicitly, such as with the green jelly beans. Many reports in the 1990s and 2000s about what genes "cause" specific effects in humans were the spurious result of trawling limited genetic data across comprehensive epidemiological datasets and looking for "significant" correlations. An analog in our field may be instances where biological data of interest (e.g., microbial abundance/ecological traits/gene expression) are collected from a modern environment, such as a hot spring, alongside environmental data. Which environmental parameters are correlated with the biological metric of interest? (leaving aside the broader question of correlation and causality). It would be inappropriate to simply conduct a pairwise comparison of all the environmental parameters with the biological metric without correction for multiple tests. By chance alone, some parameters might be significantly correlated: In null datasets, p < 0.05 occurs, by definition, 5% of the time.
Perhaps more insidiously, multiple comparisons can be done implicitly or unconsciously. For instance, a researcher may complete a compilation of fossil data from shelly invertebrates and then half-asleep in the shower mentally wander through all the different geological data records (sea level, temperature, redox proxies, strontium isotopes, etc.) before snapping awake after noting that the identified fossil trend looks very similar to a previously published record of calcium isotopes. A single statistical comparison is made, and voilà: p < 0.05. Perhaps something in the calcium cycle is affecting the livelihood of these shell-forming organisms? Maybe. In this case, the researcher did not explicitly test each comparison with a p-value, as in the jelly bean or hot springs example, but the researcher still mentally conducted the equivalent of swapping out overhead transparencies: multiple comparisons until a match was found. A related problem is exploratory data analysis as data are being generated (the role of exploratory data analysis is discussed further below). A range of analyses might be conducted, with one predictor variable out of many yielding a significant correlation. When the full dataset is generated, "only one" explicit test is conducted on the final dataset and included in the publication, with the researcher honestly forgetting just how many analyses had been conducted. Or, a spuriously significant p-value is found early on, say for all available marine samples, which then disappears as more data are added. A second analysis, looking at individual ocean basins, and a third, looking by depth class, are conducted, perhaps excluding some extreme outliers, until another significant p-value reappears [so-called p-hacking, or "researcher degrees of freedom"; Simmons, Nelson, and Simonsohn (2011)]. It is then this sub-group analysis that is emphasized in the manuscript. More often than not, it is an unwitting error by a scientist excitedly analyzing their data. Remember Feynman's quote that "the first principle is that you must not fool yourself—and you are the easiest person to fool." This is not to discourage data exploration, but correctly accounting for these comparisons—or at least remaining cognizant of the issue—will ultimately be key to producing lasting insights.
4.1 | Avoiding multiple comparison pitfalls
How then should one account for multiple comparisons? The best approach is early, explicit planning. The gold standard, as practiced in clinical trials, is pre-registration (for instance, on ClinicalTrials.gov, run by the U.S. National Library of Medicine). In this strategy, the researcher publishes a white paper prior to starting the experiments, explicitly detailing how the data will be collected (including the stopping point), and the number and types of statistical analyses to be conducted. They are then held to this plan, or the trial is not considered valid. Such a strategy is often considered an unrealistic ideal outside of clinical trials, but notably an effort to publicly post methodologies a priori has recently been initiated for molecular phylogenetics (Phylotocol; DeBiasse & Ryan, 2018), a field particularly susceptible to "researcher degrees of freedom." Whether or not such registries are appropriate for geobiology in the long term is a subject for debate, and certainly the approach ignores the fact that much of our science (especially field science) is truly "discovery driven" rather than "hypothesis driven." Nonetheless, increased pre-experiment effort put into planning statistical analyses will be effective in reducing multiple comparison "creep." This may be particularly useful to bring up during project planning with early-career researchers, as essentially all aspects of experimental design are being taught at that point. Discussions of study-level reproducibility should also start making it into undergraduate and graduate geobiology curricula, as for instance is occurring in some psychology programs (Button, 2018).
Moving from pre-experiment awareness and planning to post-experiment data analysis, there are several techniques to account for multiple comparisons ("multiple testing correction" or "alpha inflation correction"). Certain Bayesian approaches may not require explicit correction (see Gelman, Hill, & Yajima, 2012), but in a frequentist context a common approach is to apply a direct correction that accounts for the increased family-wise error rate. The Bonferroni correction, for instance, divides the level of significance required (e.g., α = 0.05) by the number of independent analyses conducted. So, a researcher investigating how brachiopod body size changed between four different stratigraphic formations would require p < 0.008 (an overall α = 0.05 divided across 6 independent pairwise tests = 0.008) in order to achieve significance. Another common practice for such an analysis (multiple pairwise comparisons) is to conduct an omnibus test such as ANOVA, followed by a post hoc test such as the Tukey HSD test that directly accounts for increased family-wise error.
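The brachiopod example can be sketched as follows. The formation labels and body-size values below are invented for illustration; the omnibus test uses scipy's one-way ANOVA, and the pairwise threshold is the Bonferroni-adjusted α = 0.05/6 ≈ 0.008 described above:

```python
# Bonferroni-adjusted threshold for the 6 pairwise comparisons among
# 4 groups, plus an omnibus one-way ANOVA (hypothetical data throughout).
from itertools import combinations
from scipy.stats import f_oneway, ttest_ind

formations = {  # invented body sizes (mm) for four formations
    "A": [12.1, 13.4, 11.8, 12.9, 13.1],
    "B": [12.5, 13.0, 12.2, 13.3, 12.8],
    "C": [14.0, 14.6, 13.8, 14.9, 14.2],
    "D": [14.5, 15.1, 14.8, 15.4, 14.9],
}

n_tests = len(list(combinations(formations, 2)))  # 6 pairwise tests
alpha_bonferroni = 0.05 / n_tests                 # ~0.0083 -> "p < 0.008"
print(f"Bonferroni threshold: {alpha_bonferroni:.4f}")

f_stat, p_omnibus = f_oneway(*formations.values())  # omnibus ANOVA first
print(f"ANOVA: F = {f_stat:.1f}, p = {p_omnibus:.2g}")

for a, b in combinations(formations, 2):            # then pairwise t tests
    p = ttest_ind(formations[a], formations[b]).pvalue
    verdict = "significant" if p < alpha_bonferroni else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict})")
```

In practice a dedicated post hoc procedure such as Tukey's HSD (e.g., `pairwise_tukeyhsd` in statsmodels) is usually preferable to Bonferroni-corrected pairwise t tests, since it accounts for family-wise error directly.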
This general practice of alpha inflation correction has receivedsomecriticism, as the thresholds for significancecanbeoverly con-servative, leading tomore false negatives and potentially restrictingthepathoffuturescientificcuriosity(Moran,2003;Rothman,1990).Suchcriticismscommonlynotethatstudiesintheirfieldsoftenhave“asmallnumberofreplicates,highvariability,and(subsequently) lowstatisticalpower” (Moran,2003).Dr.Turekianmightwell relate!Theargumentputforthbysuchpapersisthattheincreasedincidenceinfalsepositivesisnotreallyanissue,becauseotherresearcherswillre-peattheexperiments,beunabletoreplicatethem,andthefalseclaimswilleventuallydisappearfromtheliterature.However,carefulstudyofreplicationattempts inpsychiatryandpsychologyhasdemonstratedthat(a)uncorrectedmultiplecomparisontestingdoesempiricallyleadtoamorassofpublishedfalsepositives(Duncan&Keller,2011),and(b)theoriginalhypothesesdonotsimplydisappearfromtheliteraturebuthaveincrediblepersistence(Ioannidis,2012).Thecausesofsuchper-sistencearevariedbutbasicallyboildowntolowincentivesforjournalsorauthorstopublishnegativereplications(Ioannidis,2012;Simmonsetal.,2011).Whiletheseargumentsonthestringencyofmultipletest-ingcorrectionsarenotmeritless (seenextparagraph), inouropinionwidespreadTypeIerrors(falsepositives)areafargreaterhindrancetotheadvancementofsciencethanTypeIIerrors(falsenegatives).
That said, determining the correct balance between the likelihood of false negatives and false positives, as well as encouraging cutting-edge methods development that must start with small datasets (Turekian's point), is obviously a complicated endeavor. Several other methods, such as the Holm and Hochberg methods (reviewed by Streiner, 2015), offer protection against alpha inflation but are not as conservative as a strict Bonferroni correction. Such corrections should also follow common sense, as they can degenerate into the absurd (García, 2004; Streiner, 2015). For instance, how many truly independent comparisons are really being conducted? Oceanographic factors such as temperature, oxygen, pH, light, and pressure can all be broadly correlated with depth. Comparing all of these against microbial abundances might result in multiple significant correlations that become (inappropriately) non-significant after correction for multiple comparisons. Another issue to consider is whether there may be pre-existing hypotheses that motivated the study. For the brachiopodologist studying body size across four formations, they may be testing previous hypotheses of body size evolution during a specific time period (e.g., Heim, Knope, Schaal, Wang, & Payne, 2015). The real test may be between the stratigraphically lowest and highest formations, and it may be unduly stringent to require a very low p-value resulting from the multiple pairwise comparisons.

FIGURE 2 Figure modified from the webcomic XKCD.com. Multiple statistical comparisons increase the probability of a false-positive result ("more shots on goal"). Specifically, with twenty independent comparisons as shown in the comic, the probability of a false positive is ~65% (Streiner, 2015).
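The Holm step-down procedure mentioned above can be written in a few lines. This is a sketch of the textbook algorithm (not any particular library's implementation): sort the p-values, then compare the i-th smallest against α/(m − i + 1), stopping at the first failure.

```python
# Holm's step-down multiple-testing correction: less conservative than
# Bonferroni, but still controls the family-wise error rate at alpha.
def holm_reject(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank 0 = smallest p-value
        if pvalues[i] <= alpha / (m - rank):  # Holm threshold: alpha/(m-rank)
            reject[i] = True
        else:
            break                             # first failure stops everything
    return reject

pvals = [0.001, 0.012, 0.021, 0.040, 0.30]
print(holm_reject(pvals))  # [True, True, False, False, False]
# Plain Bonferroni (alpha/m = 0.01) would reject only the first p-value;
# Holm also rejects 0.012, illustrating its reduced conservatism.
```

The same method is available as `multipletests(pvals, method='holm')` in statsmodels for real analyses.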
This difference has been discussed by Streiner (2015) as the difference between hypothesis testing (which may not require correction) and hypothesis generating (which should be reported as tentative and/or exploratory results). Or in plainer language, exploratory data analysis can be good, and explicit hypothesis testing can be good, but the approach being used should be clear. Indeed, there are many powerful techniques (including new machine learning techniques) to understand which of multiple predictor variables might best explain variance in the response variable of interest (a situation we often face in geobiology). The results of such analyses, though, cannot be turned around and presented as an a priori hypothesis complete with a significant p-value. Dr. Brian McGill's blog post provides a cogent "defense" of exploratory data analysis (https://dynamicecology.wordpress.com/2013/10/16/in-praise-of-exploratory-statistics). He closes, "I use exploratory statistics and I'm proud of it! And if I claim something was a hypothesis it really was an a priori hypothesis. You can trust me because I am out and proud about using exploratory statistics."
The key thread running through these statistical commentaries regarding multiple comparisons is conscientiousness—personally with respect to how many analyses have been conducted, but also with respect to how the full scope of statistical procedures is described in the paper. Simmons et al. (2011) provide excellent guidelines in this regard for both authors and reviewers. Notably, while promoting strict reporting guidelines, these authors also advocate for increased tolerance of statistical "imperfection" by reviewers and editors in well-documented papers: "one reason researchers exploit research degrees of freedom is the unreasonable expectation we often impose as reviewers for every data pattern to be (significantly) as predicted. Under-powered studies with perfect results are the ones that should invite extra scrutiny."
Finally, at the end of this discussion, it is worthwhile to ask yourself the basic question of whether multiple testing is an issue in your particular research area. If you are working solely with a single proxy record, or an isolated genetic system, the answer might be no. The issue arises if you want to understand what correlates with your data—if there is a large constellation of possible correlates, there is an equally large likelihood of spurious correlations. Learning from the abysmal record of replication in other fields, caution against alpha inflation in these cases will be a cornerstone to a robust and healthy field of geobiology.
Take home message: Vigilance against alpha inflation starts with the individual researcher, ideally during the pre-experimental design phase. Data exploration is encouraged, but trawling through data for significant correlations that are then presented as an a priori hypothesis must be avoided: Papers should explicitly state if they are to be viewed as exploratory. The full sweep of data collected, statistical tests conducted, and choices of data inclusion/exclusion (or other "degrees of freedom") by a researcher should be made clear in publication in order to avoid cherry-picking (Simmons et al., 2011).
5 | EFFECT SIZE
So let us say you have found a significant difference between two groups of data, and through attentive practice and analysis, you have determined it is not a chance result based on numerous "shots on goal." The question now is—does the result matter? This last question seems silly, but a mistaken focus on significant p-values as the be-all/end-all, instead of on effect size, has hampered progress in fields such as ecology (Fidler et al., 2004). Simply, effect size is a quantitative measure of the magnitude of a phenomenon. Statistical power is the likelihood that an analysis will detect a real effect (as determined by a significant p-value), and is governed by the size of the effect, the variation present in the groups, the number of samples in the analysis, and the threshold for significance (α). Thus, two groups with widely separated means (large effect size) and little within-group variation will require relatively few samples for a well-powered study.
The flip side of this—and why p-values must be regarded as the statistical likelihood a result that extreme would be found by chance, rather than how "important" a result is—is that given enough samples, literally any effect size, no matter how small, can be detected. This has received considerable attention in the statistical literature:
[The null hypothesis] can only be true in the bowels of a computer processor running a Monte Carlo study (and even then a stray electron may make it false). If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal in rejecting it?
(Cohen, 1990)
At this point, you may be asking, "since what we are really interested in is effect size, and given Cohen's statement that the null hypothesis is always false, do we even need to run statistical tests?" Yes! Without a significant result, there is no reason to believe the observed results may not be due to sampling variation. In other words, significance is the jumping-off point, the license to start talking about effect size from a position of confidence. For further reading, the relationship between sample size, p-values, and accuracy in inferring effect size is intelligently dissected by Halsey, Curran-Everett, Vowler, and Drummond (2015).
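Cohen's point, that with enough samples any nonzero effect becomes statistically significant, is easy to demonstrate by simulation. A sketch with invented parameters: two normal populations whose true means differ by a trivial 0.01 standard deviations, sampled a million times each.

```python
# With n large enough, even a trivially small true effect yields p < 0.05,
# which is why effect size must be reported alongside significance.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=42)
n = 1_000_000
a = rng.normal(loc=0.00, scale=1.0, size=n)  # two populations whose means
b = rng.normal(loc=0.01, scale=1.0, size=n)  # differ by only 0.01 sd

t_stat, p = ttest_ind(a, b)
# Cohen's d: mean difference scaled by the pooled standard deviation
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)

print(f"p = {p:.2g} (statistically 'significant')")
print(f"Cohen's d = {cohens_d:.3f} (a negligible effect size)")
```

Rerun with n = 100 per group and the same tiny effect is essentially undetectable, which is the sample-size dependence Halsey et al. (2015) dissect.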
What constitutes an important effect will vary by field. Returning to the coin toss example, given millions or billions or trillions of flips—whatever the required sample size may be—the differing weights of the coin sides will ultimately cause one side to land up significantly more times than the other. Yet, this does not matter when two friends are choosing a restaurant: It is still ~50:50, and the minuscule effect size is irrelevant in this instance. This is the difference between a significant effect and an important effect. Nonetheless, this fallacy is often committed in the literature:
All psychologists know that statistically significant does not mean plain-English significant, but if one reads the literature, one often discovers that a finding reported in the Results sections studded with asterisks implicitly becomes in the Discussion section highly significant or very highly significant, important, big!
(Cohen, 1994)
Sometimes, though, small effect sizes really do matter. As an example, human geneticists have learned that the majority of phenotypes are not controlled by a single gene/locus (such as the case for Huntington's disease). Rather, characteristics like height, and the risk of diseases such as heart disease, Type II diabetes, and depression, are caused by the (largely) additive effects of thousands of genetic loci. In the case of psychiatric disorders, each individual locus explains far less than 1% of variance in risk for a psychiatric disorder (i.e., small individual effects), and yet total genetic contributions explain 40%–80% of the variance in risk for these disorders (large summed effect) (Duncan et al., 2017; Ripke et al., 2014; Wray et al., 2018). Thus, as our questions move toward datasets with hundreds or thousands of measurements, the focus must be on geobiologically important effect sizes (which will vary by question) and confidence intervals rather than simply statistical significance.
6 | INTERPRETATIVE EXAMPLES
The relationship between p-values (significance), effect size, and sample size is illustrated in Figure 3. Note there is no relationship implied between the left and right panels; they are simply illustrative examples of these concepts. In Panel A, a researcher has identified relationships that appear interesting to pursue, with the Group B mean ~50% higher than Group A in the t test example (left side). On the right side, the predictor variable accounts for 27% of the variation (R² value) in the regression example. But the results are not statistically significant. Especially in light of recent claims that p < 0.05 is too lax a standard (Benjamin et al., 2018), there is a strong likelihood that this result—specifically the observed size and direction of the effect—is due to sampling effects (Halsey et al., 2015). If this were your hard-won data, it is important to avoid the temptation of knowingly or unknowingly manipulating the data (p-hacking) to achieve a "significant result." For instance, simply removing the lowest data point in Group B results in p = 0.03, significant! Maybe there is an eminently logical reason to exclude that data point—perhaps it is from a different and inappropriate lithology, or the sampling methodology was actually different. The goal though is to avoid inventing post hoc justifications for data manipulation, as it is so easy to fool yourself, particularly if removing that point provides support for long-held ideas (confirmation bias). If such data exclusions are made, they should be noted clearly in the manuscript, and both sets of analyses, with a reasoned explanation, should be included (Simmons et al., 2011). Moving toward shared transparency within scientific subfields for data collection, reporting, and presentation methods will be instrumental in helping researchers present reasonable results with less threat of conscious/unconscious p-hacking.

Panel B depicts essentially the same analysis as in Panel A, but with more samples. In this case, the result is significant, and the researcher can feel more confident describing the size and direction of the effect. Note that if Panel B were an extension of the study in Panel A, the best practice (sometimes difficult to achieve but nonetheless the best practice) is actually to establish a pre-determined stopping point for data collection (Simmons et al., 2011). Continuing to add bits of data, with the analysis rerun each time, effectively represents multiple tests.
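How strongly a single excluded point can move a p-value is easy to demonstrate with an exact permutation test. The data below are made up by us (they are not the values behind Figure 3); note the single low point in Group B:

```python
from itertools import combinations
from statistics import fmean

def exact_permutation_p(a, b):
    """Two-sided exact permutation test on the difference in group means:
    the p-value is the fraction of all relabelings of the pooled data
    giving a mean difference at least as extreme as the one observed."""
    observed = abs(fmean(b) - fmean(a))
    pooled = a + b
    count = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        grp_a = [pooled[i] for i in chosen]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # small tolerance so the observed labeling itself always counts
        if abs(fmean(grp_b) - fmean(grp_a)) >= observed - 1e-12:
            count += 1
    return count / total

group_a = [3.1, 4.0, 4.4, 5.2, 5.5]
group_b = [1.0, 6.1, 6.8, 7.2, 7.9]

p_full = exact_permutation_p(group_a, group_b)
p_dropped = exact_permutation_p(group_a, group_b[1:])  # lowest point removed

print(f"all data:        p = {p_full:.3f}")    # non-significant
print(f"point excluded:  p = {p_dropped:.3f}") # "significant!"
```

The test itself is legitimate; the danger lies entirely in the undisclosed, post hoc choice of which points to feed it.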
Panel C depicts how with larger sample sizes the power to identify small but statistically significant effects also increases. In the comparison of groups A and B, the means only differ by ~3%, but the result is highly significant (p = 0.007). In the regression analysis, the predictor variable only describes 4% of the variation in the response variable, but nonetheless, the result is significant by traditional measures (p = 0.047). Looking at the scatter in this plot is instructive, as it appears as a cloud of points with no correlation, but statistically, a small correlation does exist. In other words, it visually illustrates the quotes from Cohen: Effect size, not significance alone, is the end goal. As also previously discussed, the ultimate interpretation of importance is question- and field-specific. To reiterate the point, robust increases in crop yields of 4% may feed millions, whereas a change of 4% in modern marine sulfate levels (e.g., ~1 mM) may have little relevance to current research questions regarding sulfur biogeochemistry.
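The sample-size dependence in Panel C can be checked with a quick calculation in the spirit of the figure (assuming, hypothetically, ~100 values per group; we use a normal approximation rather than the exact t-distribution, which is adequate at these sample sizes):

```python
import math

def two_sample_z_p(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sided p-value for a difference in means, normal approximation."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = abs(mean2 - mean1) / se
    return math.erfc(z / math.sqrt(2))

# A ~3% difference in means (3.9 vs. 4.0) with 100 values per group...
p_large_n = two_sample_z_p(3.9, 0.2, 100, 4.0, 0.3, 100)
# ...versus the identical difference with only 10 values per group
p_small_n = two_sample_z_p(3.9, 0.2, 10, 4.0, 0.3, 10)

print(f"n = 100 per group: p = {p_large_n:.4f}")  # significant
print(f"n = 10 per group:  p = {p_small_n:.2f}")  # not significant
```

The effect size is unchanged between the two calls; only the sample size, and hence the significance, differs.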
7 | BEST PRACTICES MOVING FORWARD
We start this "best practices" section from a humble position, as we are by no means trained statisticians but rather enthusiastic advocates for increased statistical rigor in geobiology. We have made (or will make) technical and logical errors in our published papers and almost certainly have made unconscious "researcher degrees of freedom" decisions that impacted results (Simmons et al., 2011). Even Jacob Cohen, whose thoughtful papers we have cited numerous times, was called out for incorrect statistical logic (Oakes, 1986). Learning correct statistical practice and logic is thus a career-long endeavor for most scientists.
As a first step, we suggest that both reviewers and authors insist on some form of statistical analysis as a best practice approach. The critical concept here is that your particular set of measurements is not a static truth about a group. Rather, they are values drawn from a distribution, and repeated draws may yield very different results—especially at low sample sizes (Halsey et al., 2015). We do recognize that many geobiological studies will not lend themselves to formal statistical tests and that there is a healthy tradition of discovery-based geobiological science—especially in field studies—that should not be stifled. Nonetheless, if a paper is reporting observed differences between groups or correlations between variables, it is scientific due diligence to investigate the degree to which these differences would be expected given the null hypothesis (or, more broadly, the uncertainty of the result).

FIGURE 3 The relationship between significance, effect size, and sample size shown with hypothetical data points. This is depicted for measurements sampled from two different groups (left side) and correlation between a predictor variable and response variable (right side); note there is no strict relationship between the data in the two panels. (a) Small sample sizes may reveal a large effect, but the result is not significant and should be considered an intriguing finding rather than a confident conclusion. (b) With increased sampling, a researcher may (or may not) reveal a significant effect. The sign (direction) of this effect is likely correct, while increased sampling will lead to better accuracy of the effect's magnitude. (c) Given enough sampling, even the smallest effect size will become statistically significant. The right side plot in (c) is visually informative in this regard—what appears to be a cloud of points is, statistically speaking, correlated. For illustrative purposes, we have described effects here as "large" and "small," but as discussed in the text, this distinction will be field- and question-specific.

[Figure 3 panel values: (a) Non-significant result, possibly large effect size (10 samples): Group A 4.3 ± 2.3 vs. Group B 6.4 ± 2.8, p = 0.11; regression p = 0.12, R² = 0.27. (b) Significant result, large effect size (25 samples): Group A 4.2 ± 2.1 vs. Group B 6.6 ± 2.0, p = 0.0001; regression p < 0.0001, R² = 0.53. (c) Significant result, small effect size (100 samples): Group A 3.9 ± 0.2 vs. Group B 4.0 ± 0.3, p = 0.007; regression p = 0.047, R² = 0.04. *Effect sizes should not be conclusively interpreted in the absence of significant p-values.]
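The point that repeated draws from the same distribution scatter widely at small sample sizes can be seen in a few lines of simulation (an arbitrary normal distribution stands in here for some hypothetical proxy value):

```python
import random
from statistics import fmean

random.seed(1)

# One underlying "true" distribution for a hypothetical proxy value
def draw_sample(n):
    return [random.gauss(10.0, 2.0) for _ in range(n)]

# Repeat the same study 200 times at two sample sizes and watch how
# much the estimated mean wanders
means_n5 = [fmean(draw_sample(5)) for _ in range(200)]
means_n50 = [fmean(draw_sample(50)) for _ in range(200)]

spread_n5 = max(means_n5) - min(means_n5)
spread_n50 = max(means_n50) - min(means_n50)

print(f"range of estimated means, n = 5:  {spread_n5:.2f}")
print(f"range of estimated means, n = 50: {spread_n50:.2f}")
```

Any single five-sample study sits somewhere in that wide range; the "truth" it reports is one draw among many possible ones.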
In this respect, we note that we have focused primarily here on formal NHST, but there are certainly other strategies such as model-based approaches, information theoretic approaches, and/or bootstrapping that achieve the same general goal of understanding true relationships (as distinguished from spurious results that are simply due to sampling variability). For instance, maximum likelihood and Bayesian approaches to assess the impact of random error are the common practice in molecular phylogenetics. The debate about whether formal NHST should be retained and improved or done away with is decades old, rages still, and is not something we can adequately address here (Benjamin et al., 2018; Cohen, 1994; Falk & Greenbaum, 1995; Fidler et al., 2004; Halsey et al., 2015). And as Cohen (1994) noted, there is no magic alternative to NHST. Certainly, though, all approaches discussed above are more likely to lead to correct inference than visual inspection of data.
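As a flavor of those non-NHST strategies, a percentile bootstrap quantifies the uncertainty of an estimate without any distributional formula. This is a minimal sketch on made-up measurements (the data and the choice of the mean as the statistic are ours, for illustration):

```python
import random
from statistics import fmean

random.seed(7)

# Hypothetical skewed measurements (e.g., trace-element concentrations)
data = [0.4, 0.6, 0.7, 0.9, 1.1, 1.2, 1.6, 2.3, 3.8, 6.0]

# Percentile bootstrap: resample with replacement many times and read
# the uncertainty of the mean off the distribution of resampled means
n_boot = 5000
boot_means = sorted(
    fmean(random.choices(data, k=len(data))) for _ in range(n_boot)
)
lo = boot_means[int(0.025 * n_boot)]
hi = boot_means[int(0.975 * n_boot)]

print(f"sample mean: {fmean(data):.2f}")
print(f"95% bootstrap CI: ({lo:.2f}, {hi:.2f})")
```

The same recipe works for medians, slopes, or any statistic a formula-based test would struggle with.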
As statistical tests are increasingly implemented in geobiology, the next step is to learn from the mistakes of other fields and help each other not fall foul of the fallacies described in this essay. Specifically, the points to avoid are:
1. Not correctly accounting for multiple testing.
2. Considering the p-value to be the probability that the hypothesis is correct.
3. Considering p = 0.045 to be a dramatically stronger rejection of the null hypothesis than p = 0.055 ("mechanical dichotomous decisions").
4. Not explicitly reporting "researcher degrees of freedom" (Simmons et al., 2011).
5. Considering every "significant" result to be an "important" result. Rather, significance should typically be the requirement for positing a scientific effect, which must then be put into appropriate context.
Of these, multiple testing and "researcher degrees of freedom" are likely the most problematic with respect to reproducibility, especially as they can often be done unconsciously. Explicit planning at the start of a study is crucial in this regard. Also important at the planning stage is power analysis. Ioannidis (2005) notes that in terms of achieving long-lasting scientific insights, fewer well-powered studies are vastly preferable to many low-powered studies, and certainly the field benefits from not chasing false leads. Further, as demonstrated by Halsey et al. (2015), larger sample sizes and well-powered studies also more precisely estimate the effect size, which is after all what we are interested in. The challenge here is an obvious conflict between incentive structures for the field as a whole and individual researchers, specifically early-career researchers. Such considerations should ideally play into evolving discussions on how post-docs, faculty positions, and tenure are evaluated. Fortunately, in geobiology we are often addressing first-order questions with large effects, and the required increase in sample size to achieve a well-powered study is often not that large (Sterne & Smith, 2001).
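A basic planning-stage power calculation, in the spirit of Cohen (1992), can be sketched in a few lines. This uses a normal approximation (exact t-based tables give slightly larger answers, e.g., 26 rather than 25 per group for a large effect), and the standardized effect sizes of 0.8 and 0.2 are Cohen's conventional "large" and "small" benchmarks, not values from this paper:

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison.
    d is the standardized effect size (Cohen's d); alpha is two-sided.
    Normal approximation: n = 2 * ((z_alpha + z_power) / d)^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_a + z_b) / d) ** 2)

# A large, first-order effect needs few samples; a subtle one needs many
print(n_per_group(0.8))  # 25 per group
print(n_per_group(0.2))  # 393 per group
```

This is the quantitative basis for the optimistic note above: when expected effects are large, a well-powered study is not much more expensive than an underpowered one.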
As one final note regarding reproducibility, correctly documenting original data and metadata in accessible supplementary documents or data repositories is key to allowing researchers to test results and build on previous studies in meta- and mega-analyses (see for instance Ioannidis et al., 2009). Original code used for analyses must also be adequately curated in a public repository; this step is common in biological fields such as ecology (Cooper et al., 2017; Ram, 2013), but is not yet common across geobiology.
To avoid feeding a publication bias monster, and to encourage new developments in a manner Karl Turekian would be proud of, we do not advocate a strict requirement of significance for publication. In our opinion, the results in Figure 3, Panel A, if from an emerging proxy record, would be quite suggestive and should be considered for publication, but the reader should be told how likely such data would be given the null hypothesis. Power analysis is also helpful in this regard (Cohen, 1992). As an example, an influential genetics paper was published in Nature despite having null results for the primary hypothesis (The International Schizophrenia Consortium, 2009). This paper provided strong evidence that significant results would be detectable with larger sample sizes in the near future, and it was the application of a recently developed statistical technique (polygenic risk scoring) that made it worthy of publication in Nature.
Moving to the longer-term view, the "best practices" described above are hopefully useful, but the goal for the next generation of geobiologists should not be best practices lists taped to the side of cubicles. Put simply, even the most well-intentioned of "best practices" lists can lead to a "cookbook" view of data analysis, where there is a right and a wrong way to do things, and statistics are a computer button to push after data acquisition. Rather, the goal should be a situation where, for many, computational reasoning and data science are a natural, integrated part of our science alongside field and laboratory skills. The main need for this is simply that many of our scientific questions do not readily conform to classic statistical tests. As an example, Keller and Schoene (2012) investigated how igneous geochemistry has changed through Earth history, and recognized that these rocks are not evenly sampled in space and time—some plutons are heavily sampled, while others were sampled rarely or not at all. In other words, samples are not independent. To address this issue of sampling heterogeneity, they utilized a re-weighted bootstrapping approach, with bootstrap weights for a given sample related to the spatial and temporal proximity to other samples. Paleontologists have also addressed the same issue of sampling heterogeneity, but using different methods appropriate to the data archive of that field (e.g., Alroy, 2010). Neither of these solutions came from a statistics "cookbook," and achieving the flexibility to design the most appropriate test (be it frequentist, likelihood, or Bayesian), or to perform numerical experiments testing different scenarios, will require a foundation in statistics but also, importantly, computational thinking (Weintrop et al., 2016). Geobiology has had considerable success in breaking field boundaries and educating students who are as comfortable with a rock hammer as a pipette; integrating a computational and statistical perspective into this training will be the next step to drawing robust insights from ever-larger geobiological datasets.
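The flavor of a re-weighted bootstrap is easy to sketch. The toy below is not the Keller and Schoene (2012) algorithm: their weights derive from spatiotemporal proximity, whereas here a crude cluster-count weight stands in, and the data are invented (one heavily sampled "pluton" and one sampled once):

```python
import random
from statistics import fmean

random.seed(3)

# Hypothetical geochemical values: cluster "A" measured nine times,
# cluster "B" once — the samples are not independent
values = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 5.0]
cluster = ["A"] * 9 + ["B"]

# Weight each sample inversely to how many samples share its cluster
counts = {c: cluster.count(c) for c in set(cluster)}
weights = [1.0 / counts[c] for c in cluster]

naive_mean = fmean(values)
boot_means = [
    fmean(random.choices(values, weights=weights, k=len(values)))
    for _ in range(2000)
]
reweighted_mean = fmean(boot_means)

print(f"naive mean:       {naive_mean:.2f}")  # dominated by cluster A
print(f"re-weighted mean: {reweighted_mean:.2f}")  # ~ (1 + 5) / 2
```

The naive mean is pulled toward the oversampled cluster; the weighted resampling treats each sampling unit, rather than each measurement, as the draw.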
In closing, we do not expect a statistical revolution in geobiology overnight. The common implementation of statistical analyses in fields like ecology took decades. In fact, Gosset ("Student") wrote to Fisher regarding the t test that "I am sending you a copy of Student's Tables as you are the only [person] that's ever likely to use them!" (cited in Box, 1981). We hope this Perspective helps start a dialogue regarding statistical practice in geobiology, while also recognizing it is heavily colored by our perspective—we look forward to seeing commentary from other perspectives (different subfields of geobiology, Bayesianists, etc.). Ultimately, this will help us avoid the problems with reproducibility present in other fields, be more confident in our results, and as a field move more quickly toward deeper geobiological understanding.
ACKNOWLEDGMENTS
We thank David Johnston, Jon Payne, Matt Clapham, and an anonymous reviewer for comments on a previous version of this manuscript, and James Farquhar, Anne Dekas, Joe Ryan, David Evans, and Alex Bradley for helpful discussion. We thank Randall Munroe of XKCD.com for permission to reproduce Figure 2. EAS and SAT were funded by a Sloan Research Fellowship.
CONFLICT OF INTEREST

The authors declare no conflicts of interest.
ORCID
Erik A. Sperling https://orcid.org/0000-0001-9590-371X
Erik A. Sperling1
Sabrina Tecklenburg1
Laramie E. Duncan2
1Department of Geological Sciences, Stanford University, Stanford, California
2Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, California
Correspondence
Erik A. Sperling, Department of Geological Sciences, Stanford University, Stanford, CA.
Email: [email protected]
REFERENCES
Alroy, J. (2010). The shifting balance of diversity among major marine animal groups. Science, 329, 1191–1194. https://doi.org/10.1126/science.1189910

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature News, 533, 452. https://doi.org/10.1038/533452a

Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483, 531–533. https://doi.org/10.1038/483531a

Begley, C. G., & Ioannidis, J. P. A. (2015). Reproducibility in science: Improving the standard for basic and preclinical research. Circulation Research, 116, 116–126. https://doi.org/10.1161/CIRCRESAHA.114.303819

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z

Bouter, L. M., Tijdink, J., Axelsen, N., Martinson, B. C., & ter Riet, G. (2016). Ranking major and minor research misbehaviors: Results from a survey among participants of four World Conferences on Research Integrity. Research Integrity and Peer Review, 1, 17. https://doi.org/10.1186/s41073-016-0024-5

Box, J. F. (1981). Gosset, Fisher, and the t distribution. The American Statistician, 35, 61–66.

Button, K. (2018). Reboot undergraduate courses for reproducibility. Nature, 561, 287. https://doi.org/10.1038/d41586-018-06692-8

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. https://doi.org/10.1038/nrn3475

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312. https://doi.org/10.1037/0003-066X.45.12.1304

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. https://doi.org/10.1037/0033-2909.112.1.155

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. https://doi.org/10.1037/0003-066X.49.12.997

Cooper, N., Hsing, P.-Y., Croucher, M., Graham, L., James, T., Krystalli, A., & Primeau, F. (2017). A guide to reproducible code in ecology and evolution. BES Guides to Better Science. London, UK: British Ecological Society.

DeBiasse, M. B., & Ryan, J. F. (2018). Phylotocol: Promoting transparency and overcoming bias in phylogenetics. Systematic Biology, syy090. https://doi.org/10.1093/sysbio/syy090

Duncan, L. E., & Keller, M. C. (2011). A critical review of the first 10 years of candidate gene-by-environment interaction research in psychiatry. The American Journal of Psychiatry, 168, 1041–1049. https://doi.org/10.1176/appi.ajp.2011.11020191

Duncan, L., Yilmaz, Z., Gaspar, H., Walters, R., Goldstein, J., Anttila, V., … Bulik, C. M. (2017). Significant locus and metabolic genetic correlations revealed in genome-wide association study of anorexia nervosa. American Journal of Psychiatry, 174, 850–858. https://doi.org/10.1176/appi.ajp.2017.16121402

Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5, 75–98. https://doi.org/10.1177/0959354395051004

Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE, 4, e5738. https://doi.org/10.1371/journal.pone.0005738

Fidler, F., Burgman, M. A., Cumming, G., Buttrose, R., & Thomason, N. (2006). Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology, 20, 1539–1544. https://doi.org/10.1111/j.1523-1739.2006.00525.x

Fidler, F., Cumming, G., Burgman, M., & Thomason, N. (2004). Statistical reform in medicine, psychology and ecology. The Journal of Socio-Economics, 33, 615–630. https://doi.org/10.1016/j.socec.2004.09.035

García, L. V. (2004). Escaping the Bonferroni iron claw in ecological studies. Oikos, 105, 657–663. https://doi.org/10.1111/j.0030-1299.2004.13046.x

Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189–211. https://doi.org/10.1080/19345747.2011.618213

Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on "Estimating the reproducibility of psychological science". Science, 351, 1037. https://doi.org/10.1126/science.aad7243
Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12(3), 179–185. https://doi.org/10.1038/nmeth.3288

Heim, N. A., Knope, M. L., Schaal, E. K., Wang, S. C., & Payne, J. L. (2015). Cope's rule in the evolution of marine animals. Science, 347, 867–870. https://doi.org/10.1126/science.1260065

Hines, W. C., Su, Y., Kuhn, I., Polyak, K., & Bissell, M. J. (2014). Sorting out the FACS: A devil in the details. Cell Reports, 6, 779–781. https://doi.org/10.1016/j.celrep.2014.02.021

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2, e124. https://doi.org/10.1371/journal.pmed.0020124

Ioannidis, J. P. A. (2012). Why science is not necessarily self-correcting. Perspectives on Psychological Science, 7, 645–654. https://doi.org/10.1177/1745691612464056

Ioannidis, J. P. A., Allison, D. B., Ball, C. A., Coulibaly, I., Cui, X., Culhane, A. C., … van Noort, V. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics, 41, 149–155. https://doi.org/10.1038/ng.295

Keller, C. B., & Schoene, B. (2012). Statistical geochemistry reveals disruption in secular lithospheric evolution about 2.5 Gyr ago. Nature, 485, 490–493. https://doi.org/10.1038/nature11024

Lithgow, G. J., Driscoll, M., & Phillips, P. (2017). A long journey to reproducible results. Nature News, 548, 387. https://doi.org/10.1038/548387a

Moran, M. D. (2003). Arguments for rejecting the sequential Bonferroni in ecological studies. Oikos, 100, 403–405. https://doi.org/10.1034/j.1600-0706.2003.12010.x

Nuzzo, R. (2014). Scientific method: Statistical errors. Nature News, 506, 150. https://doi.org/10.1038/506150a

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. Chichester, UK: Wiley.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716.

Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712. https://doi.org/10.1038/nrd3439-c1

Ram, K. (2013). Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8, 7. https://doi.org/10.1186/1751-0473-8-7

Ripke, S., Corvin, A., Walters, J. T. R., Farh, K.-H., Holmans, P. A., Lee, P., … O'Donovan, M. C. (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature, 511, 421–427.

Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284. https://doi.org/10.1037/0003-066X.44.10.1276

Rothman, K. J. (1990). No adjustments are needed for multiple comparisons. Epidemiology, 1, 43–46. https://doi.org/10.1097/00001648-199001000-00010

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. https://doi.org/10.1177/0956797611417632

Sterne, J. A. C., & Smith, G. D. (2001). Sifting the evidence—what's wrong with significance tests? British Medical Journal, 322, 226–231. https://doi.org/10.1136/bmj.322.7280.226

Streiner, D. L. (2015). Best (but oft-forgotten) practices: The multiple problems of multiplicity–whether and how to correct for many statistical tests. The American Journal of Clinical Nutrition, 102, 721–728. https://doi.org/10.3945/ajcn.115.113548

Student (1908). The probable error of a mean. Biometrika, 6, 1–25.

The International Schizophrenia Consortium (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460, 748–752.

Thiemens, M. H., Davis, A. M., Grossman, L., & Colman, A. S. (2013). Turekian reflections. Proceedings of the National Academy of Sciences of the United States of America, 110, 16289–16290. https://doi.org/10.1073/pnas.1315804110

Weintrop, D., Beheshti, E., Horn, M., Orton, K., Jona, K., Trouille, L., & Wilensky, U. (2016). Defining computational thinking for mathematics and science classrooms. Journal of Science Education and Technology, 25, 127–147. https://doi.org/10.1007/s10956-015-9581-5

Wray, N. R., Ripke, S., Mattheisen, M., Trzaskowski, M., Byrne, E. M., Abdellaoui, A., … Sullivan, P. F. (2018). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nature Genetics, 50, 668–681. https://doi.org/10.1038/s41588-018-0090-3