UNIT 8.42
A Guide to Robust Statistical Methods in Neuroscience

Rand R. Wilcox (1) and Guillaume A. Rousselet (2)

(1) Department of Psychology, University of Southern California, Los Angeles, California
(2) Institute of Neuroscience and Psychology, College of Medical, Veterinary, and Life Sciences, University of Glasgow, Glasgow, United Kingdom

There is a vast array of new and improved methods for comparing groups and studying associations that offer the potential for substantially increasing power, providing improved control over the probability of a Type I error, and yielding a deeper and more nuanced understanding of data. These new techniques effectively deal with four insights into when and why conventional methods can be unsatisfactory. But for the non-statistician, the vast array of new and improved techniques for comparing groups and studying associations can seem daunting, simply because there are so many new methods that are now available. This unit briefly reviews when and why conventional methods can have relatively low power and yield misleading results. The main goal is to suggest some general guidelines regarding when, how, and why certain modern techniques might be used. © 2018 by John Wiley & Sons, Inc.

Keywords: non-normality, heteroscedasticity, skewed distributions, outliers, curvature

How to cite this article: Wilcox, R. R., & Rousselet, G. A. (2018). A guide to robust statistical methods in neuroscience. Current Protocols in Neuroscience, 82, 8.42.1–8.42.30. doi: 10.1002/cpns.41

1. INTRODUCTION

The typical introductory statistics course covers classic methods for comparing groups (e.g., Student's t-test, the ANOVA F test, and the Wilcoxon-Mann-Whitney test) and studying associations (e.g., Pearson's correlation and least squares regression). The two-sample Student's t-test and the ANOVA F test assume that sampling is from normal distributions and that the population variances are identical, which is generally known as the homoscedasticity assumption. When testing hypotheses based on the least squares regression estimator or Pearson's correlation, similar assumptions are made. (Section 2.2 elaborates on the details.) An issue of fundamental importance is whether violating these assumptions can have a serious detrimental impact on two key properties of a statistical test: the probability of a false positive, also known as a Type I error, and power, the probability of detecting true differences among groups and a true association among two or more variables. There is the related issue of whether conventional methods provide enough detail regarding how groups differ, as well as the nature of the true association.

There are a variety of relatively well-known techniques for dealing with non-normality and unequal variances (e.g., a rank-based method). However, by modern standards, these methods are relatively ineffective for reasons reviewed in Section 3. More effective techniques are indicated in Section 4.

The good news is that when comparing groups that have non-normal but identical distributions, control over the Type I error probability is, in general, reasonably good when using conventional techniques. But if the groups differ, there is now a vast literature indicating that under general conditions power can be relatively poor. In practical terms, important differences among groups might be missed (e.g., Wilcox, 2017a, 2017b, 2017c).


Even when the normality assumption is true but the population variances differ (called heteroscedasticity), power can be adversely impacted when using the ANOVA F test.

Similar concerns arise when dealing with regression. Conventional methods, including rank-based techniques, perform well in terms of controlling the probability of a Type I error when there is no association. But when there is an association, conventional methods, including rank-based techniques (e.g., Spearman's rho and Kendall's tau), can have a relatively low probability of detecting an association relative to modern methods developed during the last 30 years.

Practical concerns regarding conventional methods stem from four major insights (e.g., Wilcox, 2017a, 2017c). These insights can be briefly summarized as follows:

- The central limit theorem and skewed distributions: much larger sample sizes might be needed to assume normality than is generally recognized.
- There is now a deeper understanding of the role of outliers and how to deal with them. Some seemingly obvious strategies for dealing with outliers, based on standard training, are known to be highly unsatisfactory for reasons outlined later in the paper.
- There is a substantial literature indicating that methods that assume homoscedasticity (equal variances) can yield inaccurate results when in fact there is heteroscedasticity, even when the sample sizes are quite large.
- When dealing with regression, curvature refers to situations where the regression line is not straight. There is now considerable evidence that curvature is a much more serious concern than is generally recognized.

Robust methods are typically thought of as methods that provide good control over the probability of a Type I error. But today they deal with much broader issues. In particular, they are designed to deal with the problems associated with skewed distributions, outliers, heteroscedasticity, and curvature that were outlined above.

One of the more fundamental goals among robust methods is to develop techniques that are not overly sensitive to very small changes in a distribution. For instance, a slight departure from normality should not destroy power. This rules out any method based on the mean and variance (e.g., Staudte & Sheather, 1990; Wilcox, 2017a, 2017b). Section 2.3 illustrates this point.

Many modern robust methods are designed to have nearly the same amount of power as conventional methods under normality, but they continue to have relatively high power under slight departures from normality, where conventional techniques based on means perform poorly. There are other fundamental goals, some of which are relevant regardless of how large the sample sizes might be. But an effective description of these goals goes beyond the scope of this paper. For present purposes, the focus is on achieving relatively high power.

Another point that should be stressed has to do with standard power analyses. A common goal is to justify some choice for the sample sizes prior to obtaining any data. Note that, in effect, the goal is to address a statistical issue without any data. Typically this is done by assuming normality and homoscedasticity, which in turn can suggest that relatively small sample sizes provide adequate power when using means. A practical concern is that violating either of these two assumptions can have a tremendous impact on power when attention is focused exclusively on comparing means. Section 2.1 illustrates this concern when dealing with measures of central tendency. Similar concerns arise when dealing with least squares regression and Pearson's correlation. These concerns can be mitigated by using recently developed robust methods summarized here, as well as in Wilcox (2017a, 2017c).

There is now a vast array of new and improved methods that effectively deal with known concerns associated with classic techniques (e.g., Heritier, Cantoni, Copt, & Victoria-Feser, 2009; Maronna, Martin, & Yohai, 2006; Wilcox, 2017a, 2017b, 2017c). They include substantially improved methods for dealing with all four of the major insights previously listed. Perhaps more importantly, they can provide a deeper, more accurate, and more nuanced understanding of data, as will be illustrated in Section 5.

For books focused on the mathematical foundation of modern robust methods, see Hampel, Ronchetti, Rousseeuw, and Stahel (1986), Huber and Ronchetti (2009), Maronna et al. (2006), and Staudte and Sheather (1990). For books focused on applying robust methods, see Heritier et al. (2009) and Wilcox (2017a, 2017c). From an applied point of view, the difficulty is not finding a method that effectively deals with violations of standard assumptions. Rather, for the non-statistician, there is the difficulty in navigating through the many alternative techniques that might be used. This paper is an attempt to deal with this issue by providing a general guide regarding when and how modern robust methods might be used when comparing two or more groups. When dealing with regression, all of the concerns associated with conventional methods for comparing groups remain, and new concerns are introduced. A few issues related to regression and correlations are covered here, but it is stressed that there are many other modern advances that have practical value. Readers interested in regression are referred to Wilcox (2017a, 2017c).

A few general points should be stressed. First, if robust methods, such as modern methods based on the median described later in this paper, give very similar results to conventional methods based on means, this is reassuring that conventional methods based on the mean are performing relatively well in terms of Type I errors and power. But when they differ, there is doubt about the validity of conventional techniques. In a given situation, conventional methods might perform well in terms of controlling the Type I error probability and providing reasonably high power. But the best that can be said is that there are general conditions where conventional methods do indeed yield inaccurate inferences. A particular concern is that they can suffer from relatively low power in situations where more modern methods have relatively high power. More details are provided in Sections 3 and 4.

Second, the choice of method can make a substantial difference in our understanding of data. One reason is that modern methods provide alternative and interesting perspectives that more conventional methods do not address. A complication is that there is no single method that dominates in terms of power or providing a deep understanding of how groups compare. The same is true when dealing with regression and measures of association. The reality is that several methods might be needed to address even what appears as a simple problem, for instance comparing two groups.

There is, of course, the issue of controlling the probability of one or more Type I errors when multiple tests are performed. There are many modern improvements for dealing with this issue (e.g., Wilcox, 2017a, 2017c). Another strategy is to put more emphasis on exploratory studies. One could then deal with the risk of false positive results by conducting a confirmatory study aimed at determining whether significant results in an exploratory study can be replicated (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). Otherwise, there is the danger of missing important details regarding how groups compare. One of the main messages here is that, despite the lack of a single method that dominates, certain guidelines can be offered regarding how to analyze data.

Modern methods for plotting data can be invaluable as well (e.g., Rousselet, Foxe, & Bolam, 2016; Rousselet, Pernet, & Wilcox, 2017; Weissgerber, Milic, Winham, & Garovic, 2015). In particular, they can provide important perspectives beyond the common strategy of using error bars. Complete details go beyond the scope of this paper, but Section 5 illustrates some of the more effective plots that might be used.

This unit is organized as follows. Section 2 briefly reviews when and why conventional methods can be highly unsatisfactory. This is necessary in order to appreciate modern technology and because standard training typically ignores these issues. Efforts to modernize basic training have been made (e.g., Field, Miles, & Field, 2012; Wilcox, 2017b, 2017c), and a 2016 special issue of The American Statistician (volume 69, number 4) was aimed at encouraging instructors to modernize their courses. (This special issue touches on a broader range of topics than those discussed here.) Some neuroscientists are trained in a manner that takes into account modern insights relevant to basic principles, but it is evident that most are not. Section 3 reviews the seemingly more obvious strategies aimed at salvaging standard techniques, the point being that by modern standards they are relatively ineffective and cannot be recommended. Moreover, certain strategies are not technically sound. Section 3 also provides an indication of how concerns regarding conventional methods are addressed using more modern techniques. Section 4 describes strategies for comparing two independent or dependent groups that take modern advances into account. Included are some methods aimed at comparing correlations, as well as methods designed to determine which independent variables are most important. Section 5 illustrates modern methods using data from several studies.


2. INSIGHTS REGARDING CONVENTIONAL METHODS

This section elaborates on the concerns with conventional methods for comparing groups and studying associations stemming from the four insights previously indicated.

2.1 Skewed Distributions

A skewed distribution simply refers to a distribution that is not symmetric about some central value. An example is shown in Figure 8.42.1A. Such distributions occur naturally. An example relevant to the neurosciences is given in Section 5.2. Skewed distributions are a much more serious problem for statistical inferences than once thought, due to insights regarding the central limit theorem.

Consider the one-sample case. Conventional wisdom is that, with a relatively small sample size, normality can be assumed under random sampling. An implicit assumption was that if the sample mean has approximately a normal distribution, then Student's t-test will perform reasonably well. It is now known that this is not necessarily the case, as illustrated, for example, in Wilcox (2017a, 2017b).

This point is illustrated here using a simulation that is performed in the following manner.

Imagine that data are randomly sampled from the distribution shown in Figure 8.42.1A (a lognormal distribution) and that the mean is computed based on a sample size of n = 30. Repeating this process 5,000 times, the thick black line in Figure 8.42.1B shows a plot of the resulting sample means; the thin gray line is the plot of the means when sampling from a normal distribution instead. The distribution of T values for samples of n = 30 is indicated by the thick black line in Figure 8.42.1C; the thin gray line is the distribution of T values when sampling from a normal distribution. As can be seen, the actual T distribution extends out much further to the left compared to the distribution of T under normality. That is, in the current example, sampling from a skewed distribution leads to much more extreme values than expected by chance under normality, which in turn results in more false positive results than expected when the null hypothesis is true.
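This simulation is straightforward to reproduce in R. The following is a minimal sketch, assuming a standard lognormal distribution and the 5,000 iterations used for Figure 8.42.1:

```r
# Simulate the sampling distributions of the mean and of Student's T
# for samples of n = 30 drawn from a lognormal distribution.
set.seed(1)
nsim <- 5000
n <- 30
mu <- exp(0.5)  # population mean of a standard lognormal distribution
sims <- replicate(nsim, {
  x <- rlnorm(n)
  c(mean(x), (mean(x) - mu) / (sd(x) / sqrt(n)))  # sample mean and T value
})
means <- sims[1, ]
tvals <- sims[2, ]
# The left tail of the simulated T values extends far beyond the
# theoretical quantile under normality, qt(0.025, df = n - 1).
quantile(tvals, c(0.025, 0.975))
```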

Suppose the goal is to test some hypothesis at the 0.05 level. Bradley (1978) suggests that, as a general guide, control over the probability of a Type I error is minimally satisfactory if the actual level is between 0.025 and 0.075. When we test at the 0.05 level, we expect 5% of the t-tests to be significant.

Figure 8.42.1 (A) Example of a skewed (lognormal) distribution. (B) Distribution of the sample mean under normality (thin line, n = 30) and the actual distribution based on a simulation. Each sample mean was computed based on 30 observations randomly sampled from the distribution shown in A. (C) and (D) Comparison of the theoretical T distribution with 29 degrees of freedom to distributions of 5,000 T values (panel C: n = 30; panel D: n = 100). Again, the T values were computed from observations sampled from the distribution in A.


Figure 8.42.2 Type I error probability as a function of sample size. The Type I error probability was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and a lognormal distribution. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was zero. The black horizontal line marks the expected 0.05 Type I error probability. The gray area marks Bradley's satisfactory range. When sampling from a normal distribution, all methods are close to the nominal 0.05 level, except the trimmed mean for very small sample sizes. When sampling is from a lognormal distribution, the mean and the trimmed mean give rise to too many false alarms for small sample sizes. The mean continues to give higher false positive rates than the other techniques even with n = 500.

However, when sampling from the skewed distribution considered here, this is not the case: the actual Type I error probability is 0.111.

Figure 8.42.1D shows the distribution of T when n = 100. Now the Type I error probability is 0.082, again when testing at the 0.05 level. Based on Bradley's criterion, a sample size of about 130 or larger is required. Bradley (1978) goes on to suggest that, ideally, the actual Type I error probability should be between 0.045 and 0.055. Now n = 600 is unsatisfactory; the actual level is 0.057. With n = 700, the level is 0.055.
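These actual Type I error probabilities are easy to estimate by simulation. A minimal sketch, assuming a standard lognormal population and 10,000 iterations:

```r
# Estimate the actual Type I error probability of the one-sample t-test
# when sampling from a lognormal distribution and testing at the 0.05 level.
typeI <- function(n, nsim = 10000, mu = exp(0.5)) {
  mean(replicate(nsim, t.test(rlnorm(n), mu = mu)$p.value < 0.05))
}
set.seed(1)
typeI(30)   # roughly 0.11, as reported above
typeI(100)  # roughly 0.08, still outside Bradley's ideal range
```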

Before continuing, it is noted that the median belongs to the class of trimmed means, which refers to the strategy of trimming a specified proportion of the smallest and largest values and averaging the values that remain. For example, if n = 10, 10% trimming means that the lowest and highest values are removed and that the remaining data are averaged. Similarly, 20% trimming would remove the two smallest and two largest values. Based on conventional training, trimming might seem counterintuitive. In some situations, however, it can substantially increase our ability to control the Type I error probability, as illustrated next, and trimming can substantially increase power as well, for reasons to be explained.
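In R, trimmed means are available through the base function mean; the data below are hypothetical and serve only to illustrate the definition:

```r
# 20% trimming with n = 10: drop the two smallest and the two largest
# values, then average the remaining six observations.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # one extreme value
mean(x)              # 14.5, pulled up by the extreme value
mean(x, trim = 0.2)  # 5.5, the 20% trimmed mean
median(x)            # 5.5, the maximum amount of trimming
```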

First, focus on controlling the probability of a Type I error. (Section 2.3 illustrates one of the reasons that methods based on means can have relatively low power.) Figure 8.42.2 illustrates the Type I error probability as a function of the sample size when using the mean, when using the median, and when sampling from the asymmetric (lognormal) distribution in Figure 8.42.1A. Inferences based on the 20% trimmed mean were made via the method derived by Tukey and McLaughlin (1963). Inferences based on the median were made via the method derived by Hettmansperger and Sheather (1986). (The software used to apply these latter two methods is contained in the R package described at the beginning of Section 4.) Also shown is the Type I error probability when sampling from a normal distribution. The gray area indicates Bradley's criterion.

Figure 8.42.3 illustrates the association between power and the sample size for the distributions used in Figure 8.42.2. As can be seen, under normality the sample mean is best, followed closely by the 20% trimmed mean. The median is least satisfactory when dealing with a normal distribution, as expected.


Figure 8.42.3 Power as a function of sample size. The probability of a true positive was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and the (lognormal) distribution shown in Figure 8.42.1A. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was 0.5. The black horizontal line marks the conventional 80% power threshold. When sampling from a normal distribution, all methods require <50 observations to achieve 80% power, and the mean appears to have higher power at lower sample sizes than the other methods. When sampling from a lognormal distribution, power drops dramatically for the mean but not for the median and the trimmed mean. Power for the median actually improves. The exact pattern of results depends on the effect size and the asymmetry of the distribution we sample from, so we strongly encourage readers to perform their own detailed power analyses.

However, for the asymmetric (lognormal) distribution, the median performs best and the mean performs very poorly.

A feature of random samples taken from the distribution in Figure 8.42.1A is that the expected proportion of points declared an outlier is relatively small. For skewed distributions, as we move toward situations where outliers are more common, a sample size >300 can be required to achieve reasonably good control over the Type I error probability. That is, control over the Type I error probability is a function of both the degree to which a distribution is skewed and the likelihood of encountering outliers. However, there are methods that perform reasonably well with small sample sizes, as indicated in Section 4.1.

For symmetric distributions where outliers tend to occur, the reverse can happen: the actual Type I error probability can be substantially less than the nominal level. This happens because outliers inflate the standard deviation, which in turn lowers the value of T, which in turn can negatively impact power. Section 2.3 elaborates on this issue.

In an important sense, outliers have a larger impact on the sample variance than the sample mean, which impacts the t-test. To illustrate this point, imagine the goal is to test H0: μ = 1 based on the following values: 1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7. Then T = 4.69, the p-value is p = 0.002, and the 0.95 confidence interval is [1.45, 2.35]. Now, including the value 8, the mean increases from 1.9 to 2.58, suggesting at some level that there is stronger evidence for rejecting the null hypothesis. However, this outlier increases the standard deviation from 0.54 to 2.1, and now T = 2.26 and p = 0.054. The 0.95 confidence interval is [−0.033, 3.19].
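These numbers can be verified directly with the built-in R function t.test:

```r
# Including an outlier raises the mean, yet weakens the evidence against
# H0: mu = 1, because the standard deviation is inflated even more.
x <- c(1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7)
t.test(x, mu = 1)        # T = 4.69, p = 0.002, 0.95 CI [1.45, 2.35]
t.test(c(x, 8), mu = 1)  # T = 2.26, p = 0.054, 0.95 CI [-0.033, 3.19]
```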

Now consider the goal of comparing two independent or dependent groups. If the groups have identical distributions, then difference scores have a symmetric distribution, and the probability of a Type I error is, in general, less than the nominal level when using conventional methods based on means. Now, in addition to outliers, differences in skewness create practical concerns when using Student's t-test. Indeed, under general conditions, the two-sample Student's t-test for independent groups is not even asymptotically correct, roughly because the standard error of the difference between the sample means is not estimated correctly (e.g., Cressie & Whitford, 1986).


Moreover, Student's t-test can be biased. This means that the probability of rejecting the null hypothesis of equal means can be higher when the population means are equal compared to situations where the population means differ. Roughly, this concern arises because the distribution of T can be skewed, and in fact the mean of T can differ from zero even though the null hypothesis is true. For a more detailed explanation, see Wilcox (2017c, section 5.5). Problems persist when Student's t-test is replaced by Welch's (1938) method, which is designed to compare means in a manner that allows unequal variances. Put another way, if the goal is to test the hypothesis that two groups have identical distributions, conventional methods based on means perform well in terms of controlling the Type I error probability. But if the goal is to compare the population means, and if distributions differ, conventional methods can perform poorly.

There are many techniques that perform well when dealing with skewed distributions in terms of controlling the Type I error probability, some of which are based on the usual sample median (Wilcox, 2017a, 2017c). Both theory and simulations indicate that, as the amount of trimming increases, the ability to control the probability of a Type I error increases as well. Moreover, for reasons to be explained, trimming can substantially increase power, a result that is not obvious based on conventional training. The optimal amount of trimming depends on the characteristics of the population distributions, which are unknown. Currently, the best that can be said is that the choice can make a substantial difference. The 20% trimmed mean has been studied extensively and often provides a good compromise between the two extremes: no trimming (the mean) and the maximum amount of trimming (the median).

In various situations, particularly important are inferential methods based on what are called bootstrap techniques. Two basic versions are the bootstrap-t and the percentile bootstrap. Roughly, rather than assume normality, bootstrap-t methods perform a simulation using the observed data that yields an estimate of an appropriate critical value and a p-value. So values of T are generated as done in Figure 8.42.1, only data are sampled with replacement from the observed data. In essence, bootstrap-t methods generate data-driven T distributions expected by chance if there were no effect. The percentile bootstrap proceeds in a similar manner, only, when dealing with a trimmed mean, for example, the goal is to determine the distribution of the sample trimmed mean, which can then be used to compute a p-value and a confidence interval. When comparing two independent groups based on the usual sample median, if there are tied (duplicated) values, currently the only method that performs well in simulations, in terms of controlling the Type I error probability, is based on a percentile bootstrap method (compare to Table 5.3 in Wilcox, 2017c). Section 4.1 elaborates on how this method is performed.

If the amount of trimming is close to zero, the bootstrap-t method is preferable to the percentile bootstrap method. But as the amount of trimming increases, at some point a percentile bootstrap method is preferable. This is the case with 20% trimming (e.g., Wilcox, 2017a). It seems to be the case with 10% trimming as well, but a definitive study has not been made.

Also, if the goal is to reflect the typical response, it is evident that the median, or even a 20% trimmed mean, might be more satisfactory. Using quantiles (percentiles) other than the median can be important as well, for reasons summarized in Section 4.2.

When comparing independent groups, modern improvements on the Wilcoxon-Mann-Whitney (WMW) test are another possibility, which are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second. (More details are provided in Section 3.) Additional possibilities are described in Wilcox (2017a), some of which are illustrated in Section 4 below.

In some situations, robust methods can have substantially higher power than any method based on means. It is not being suggested that robust methods always have more power. This is not the case. Rather, the point is that power can depend crucially on the conjunction of which estimator is used (for instance, the mean versus the median) and how a confidence interval is built (for instance, a parametric method or the percentile bootstrap). These choices are not trivial and must be taken into account when analyzing data.

2.2 Heteroscedasticity

It has been clear for some time that, when using classic methods for comparing means, heteroscedasticity (unequal population variances) is a serious concern (e.g., Brown & Forsythe, 1974). Heteroscedasticity can impact both power and the Type I error probability.


Figure 8.42.4 Homoscedasticity (A) and heteroscedasticity (B). (A) The variance of the dependent variable is the same at any age. (B) The variance of the dependent variable can depend on the age of the participants.

The basic reason is that, under general conditions, methods that assume homoscedasticity are using an incorrect estimate of the standard error when in fact there is heteroscedasticity. Indeed, there are concerns regardless of how large the sample size might be. As we consider more and more complicated designs, heteroscedasticity becomes an increasing concern.

When dealing with regression, homoscedasticity means that the variance of the dependent variable does not depend on the value of the independent variable. When dealing with age and depressive symptoms, for example, homoscedasticity means that the variation in measures of depressive symptoms at age 23 is the same at age 80, or any age in between, as illustrated in Figure 8.42.4.

Independence implies homoscedasticity. So, for this particular situation, classic methods associated with least squares regression, Pearson's correlation, Kendall's tau, and Spearman's rho are using a correct estimate of the standard error, which helps explain why they perform well in terms of Type I errors when there is no association. That is, when a homoscedastic method rejects, it is reasonable to conclude that there is an association, but in terms of inferring the nature of the association, these methods can perform poorly. Again, a practical concern is that, when there is heteroscedasticity, homoscedastic methods use an incorrect estimate of the standard error, which can result in poor power and erroneous conclusions.

A seemingly natural way of salvaging homoscedastic methods is to test the assumption that there is homoscedasticity. But six studies summarized in Wilcox (2017c) found that this strategy is unsatisfactory. Presumably situations are encountered where this is not the case, but it is unclear how to determine whether such situations are encountered based on the available data.

Methods that are designed to deal with heteroscedasticity have been developed and are easily applied using extant software. These techniques use a correct estimate of the standard error regardless of whether the homoscedasticity assumption is true. A general recommendation is to always use a heteroscedastic method, given the goal of comparing measures of central tendency, or making inferences about regression parameters, as well as measures of association.


Figure 8.42.5 (A) Density functions for the standard normal distribution (solid line) and the mixed normal distribution (dashed line). These distributions have an obvious similarity, yet the variances are 1 and 10.9. (B) Box plots of means, medians, and 20% trimmed means when sampling from a normal distribution. (C) Box plots of means, medians, and 20% trimmed means when sampling from a mixed normal distribution. In panels B and C, each distribution has 10,000 values, and each of these values was obtained by computing the mean, median, or trimmed mean of 30 randomly generated observations.

2.3 Outliers

Even small departures from normality can devastate power. The modern illustration of this fact stems from Tukey (1960) and is based on what is generally known as a mixed normal distribution. The mixed normal considered by Tukey means that, with probability 0.9, an observation is sampled from a standard normal distribution; otherwise, an observation is sampled from a normal distribution having mean zero and standard deviation 10. Figure 8.42.5A shows a standard normal distribution and the mixed normal discussed by Tukey. Note that in the center of the distributions, the mixed normal is below the normal distribution. But for the two ends of the mixed normal distribution (the tails), the mixed normal lies above the normal distribution. For this reason, the mixed normal is often described as having heavy tails. In general, heavy-tailed distributions roughly refer to distributions where outliers are likely to occur.

Here is an important point. The standard normal has variance 1, but the mixed normal has variance 10.9. That is, the population variance can be overly sensitive to slight changes in the tails of a distribution. Consequently, even a slight departure from normality can result in relatively poor power when using any method based on the mean.
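A minimal R sketch makes the variance claim concrete; the mixture weights and standard deviations below are those given in the text:

```r
# Tukey's mixed normal: with probability 0.9 sample from N(0, 1),
# otherwise from N(0, 10). The theoretical variance is
# 0.9 * 1 + 0.1 * 100 = 10.9, despite the visual similarity to N(0, 1).
set.seed(1)
n <- 1e6
x <- ifelse(runif(n) < 0.9, rnorm(n, 0, 1), rnorm(n, 0, 10))
var(x)  # close to 10.9
```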

Put another way, samples from the mixed normal are more likely to result in outliers compared to samples from a standard normal distribution. As previously indicated, outliers inflate the sample variance, which can negatively impact power when using means. Another concern is that they can give a distorted and misleading summary regarding the bulk of the participants.

The first indication that heavy-tailed distributions are a concern stems from a result derived by Laplace about two centuries ago. In modern terminology, he established that, as we move from a normal distribution to a distribution more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hand, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that, under normality, the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent to which this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A. Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller, contrary to what might be expected based on standard training.
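The simulation behind Figure 8.42.5B and C can be sketched in a few lines of base R; the helper name sim_est is ours, and the settings match the caption (10,000 iterations, n = 30):

```r
# Compare the sampling variation of the mean, median, and 20% trimmed
# mean (n = 30) under a standard normal and under Tukey's mixed normal.
sim_est <- function(rgen, nrep = 10000, n = 30) {
  replicate(nrep, {
    x <- rgen(n)
    c(mean = mean(x), median = median(x), tmean = mean(x, trim = 0.2))
  })
}
set.seed(1)
res_norm  <- sim_est(rnorm)
res_mixed <- sim_est(function(n) ifelse(runif(n) < 0.9, rnorm(n), rnorm(n, 0, 10)))
apply(res_norm, 1, sd)   # under normality, the mean varies least
apply(res_mixed, 1, sd)  # under the mixed normal, the mean varies most
```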

Of course, a more important issue is whether the median or 20% trimmed mean ever have substantially smaller standard errors based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality. But as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and median. Because the choice between a 20% trimmed mean and median is not straightforward in terms of maximizing power, it is suggested that, in exploratory studies, both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense of the association among the bulk of the participants under study. In particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense of the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes when testing hypotheses and outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).
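Assuming the WRS functions described at the beginning of Section 4 have been loaded, usage takes one line per estimate; the simulated data here are only for illustration:

```r
# Skipped correlation: detect outliers with a projection method, remove
# them, then compute Pearson's correlation on the remaining points.
set.seed(1)
x <- rnorm(40)
y <- x + rnorm(40)
scor(x, y)    # skipped correlation estimate
scorci(x, y)  # heteroscedasticity-consistent CI and p-value for H0: rho = 0
```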

2.4 Curvature

Typically, a regression line is assumed to be straight. In some situations, this approach seems to suffice. However, it cannot be stressed too strongly that there is a substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here, it is merely remarked that, among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother (Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
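As a usage sketch, again assuming the WRS functions have been loaded, with simulated data standing in for real measurements:

```r
# Plot smoothers to check for curvature before assuming a straight line.
set.seed(1)
x <- rnorm(100)
y <- x^2 + rnorm(100)      # a clearly curved association
lplot(x, y)                # Cleveland's smoother
rplot(x, y)                # running-interval smoother (20% trimmed mean)
rplot(x, y, est = median)  # same smoother, using the median instead
```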

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model, for some appropriate choice of the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that, indeed, the regression line is straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3. DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, generally these strategies are relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions

A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly in terms of controlling the Type I error when there is heteroscedasticity. Then various methods for testing homoscedasticity are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression, as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes

There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

$\dfrac{|X - \bar{X}|}{s} > 2$   (Equation 8.42.1)

where $\bar{X}$ and $s$ are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The last two observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

$\dfrac{|X - M|}{\mathrm{MAD}/0.6745} > 2.24$   (Equation 8.42.2)

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation and, of course, M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only, rather than using a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers, 71.53 and 71.53. But the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
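The contrast between the rules is easy to check in R; note that the base function mad already rescales by 1/0.6745 (its default constant is 1.4826), so it matches the denominator in Equation 8.42.2:

```r
# Two-standard-deviation rule (Equation 8.42.1) versus the MAD-median
# rule (Equation 8.42.2) on the example data above.
x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
abs(x - mean(x)) / sd(x) > 2        # all FALSE: masking hides the outliers
abs(x - median(x)) / mad(x) > 2.24  # flags both values of 71.53
```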

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that the remaining data are dependent (they are correlated), which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue, when using a 20% trimmed mean or median, is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that, when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil-Sen estimator, as well as the MM-estimator, are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that, under general conditions, these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions, the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So, again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution-free technique via the R function sintv2. Distribution-free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases. The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software/.
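
For readers new to R, a minimal installation sketch is given below; the devtools call and the subdir argument reflect the layout of the nicebread/WRS repository at the time of writing and should be treated as assumptions.

    # Minimal installation sketch (assumes the devtools package and the
    # current layout of the nicebread/WRS repository)
    # install.packages("devtools")
    devtools::install_github("nicebread/WRS", subdir = "pkg")

    # Alternatively, source the complete collection of functions:
    # source("Rallfun-v33.txt")   # after downloading the file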

All recommended methods deal with heteroscedasticity. When comparing groups whose distributions differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory in terms of Type I errors when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample, with replacement, n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample, with replacement, m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y. Let D* = M*x − M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
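
To make the strategy concrete, here is a minimal sketch of the procedure just described. It illustrates the logic only; it is not the WRS implementation, and the function name is ad hoc.

    # Percentile bootstrap for the difference between two medians
    medpb2.sketch <- function(x, y, nboot = 2000) {
      Dstar <- replicate(nboot, {
        median(sample(x, length(x), replace = TRUE)) -
          median(sample(y, length(y), replace = TRUE))
      })
      # p-value based on the proportion of D* values below (and at) zero
      p <- mean(Dstar < 0) + 0.5 * mean(Dstar == 0)
      list(p.value = 2 * min(p, 1 - p),
           conf.int = quantile(Dstar, c(0.025, 0.975)))  # 95% percentile CI
    }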

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with a relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
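
Assuming the WRS functions have been loaded as described above, a typical call looks like this, where x and y are hypothetical vectors holding the two groups:

    # Yuen's method for two independent 20% trimmed means (WRS)
    yuen(x, y, tr = 0.2)   # tr = 0 reduces to Welch's test for means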

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0), and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
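
This computation is easy to reproduce in R using the toy values from the text:

    # Estimate P(X < Y) as the proportion of pairwise differences below zero
    x <- c(1, 2, 3)
    y <- c(1, 2.5, 4)
    D <- outer(x, y, "-")   # all pairwise differences Xi - Yk
    mean(D < 0)             # 5/9, approximately 0.56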

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.
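
A quick numerical check, with made-up values, shows that the two quantities can indeed differ:

    # The difference of the medians need not equal the median of all
    # pairwise differences
    x <- c(1, 4, 9, 16, 25)
    y <- c(1, 2, 3, 4, 100)
    median(x) - median(y)                 # 6
    median(as.vector(outer(x, y, "-")))   # 3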

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5. (Equation 8.42.3)

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0. (Equation 8.42.4)

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullens, 2012). (We are not aware of any commercial software package that contains this method.)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.
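
Assuming the WRS functions are loaded, a typical decile comparison might look as follows, where x and y are placeholders for the two groups:

    # Compare the deciles of two independent groups using the
    # Harrell-Davis estimator and a percentile bootstrap (WRS)
    qcomhd(x, y, q = seq(0.1, 0.9, by = 0.1))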

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then, for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0 (Equation 8.42.5)

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥ 10. For q = 0.1, sample sizes ≥ 20 should be used (Wilcox, 2012).
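
As a rough illustration of the quantity being tested, the sample version of the quantile sum can be computed directly; the sketch below uses ordinary sample quantiles of simulated placeholder data and does not reproduce the inferential machinery of cbmhd.

    # Estimate the quantile sum x_q + x_(1-q) of D = X - Y for q = 0.25
    x <- rnorm(30)                     # placeholder data
    y <- rnorm(30)
    D <- as.vector(outer(x, y, "-"))   # all pairwise differences
    sum(quantile(D, c(0.25, 0.75)))    # near 0 when the distributions match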

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions; no single estimator dominates.
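
The following sketch outlines the estimator, using the MAD-median rule with the conventional 2.24 cutoff; it is an illustration of the idea, not the WRS implementation.

    # MOM in outline: remove points flagged by the MAD-median rule,
    # then average what remains
    mom.sketch <- function(x, crit = 2.24) {
      M <- median(x)
      madn <- median(abs(x - M)) / 0.6745   # MAD rescaled for normal data
      mean(x[abs(x - M) / madn <= crit])    # mean of the non-outliers
    }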

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM, as well as related estimators. The R function pb2gen applies this method.

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD; the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And in general it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.
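
In outline, the bootstrap step is the same as before, now applied to the difference scores; the sketch below is not the WRS code, and mean with trim = 0.2 is the base R 20% trimmed mean.

    # Percentile bootstrap CI for the 20% trimmed mean of difference scores
    trimpb.sketch <- function(x, y, tr = 0.2, nboot = 2000) {
      d <- x - y
      tstar <- replicate(nboot,
                         mean(sample(d, replace = TRUE), trim = tr))
      quantile(tstar, c(0.025, 0.975))   # 95% percentile CI
    }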

In some situations there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.
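
The core of the default approach is a binomial confidence interval for the proportion of negative difference scores. A minimal sketch, assuming ties are dropped as in the usual sign test, is given below.

    # Sign-test logic: estimate P(X < Y) from the difference scores,
    # with an Agresti-Coull confidence interval
    signt.sketch <- function(x, y, alpha = 0.05) {
      d <- (x - y)[x != y]        # drop tied pairs
      n <- length(d)
      s <- sum(d < 0)             # number of times X < Y
      z <- qnorm(1 - alpha / 2)
      nt <- n + z^2
      pt <- (s + z^2 / 2) / nt    # adjusted proportion
      list(estimate = s / n,
           conf.int = pt + c(-1, 1) * z * sqrt(pt * (1 - pt) / nt))
    }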

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; the actual probability is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest, using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2. (Equation 8.42.6)

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular when there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association (independence implies homoscedasticity). But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small; the amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.
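
In outline, the bootstrap resamples pairs of observations within each group, which is what allows heteroscedasticity; the sketch below uses Spearman's rho and omits refinements such as the adjusted confidence interval used for Pearson's correlation.

    # Basic percentile bootstrap CI for rho1 - rho2 (Spearman's rho)
    twocor.sketch <- function(x1, y1, x2, y2, nboot = 599) {
      dif <- replicate(nboot, {
        i <- sample(length(x1), replace = TRUE)   # resample pairs, group 1
        j <- sample(length(x2), replace = TRUE)   # resample pairs, group 2
        cor(x1[i], y1[i], method = "spearman") -
          cor(x2[j], y2[j], method = "spearman")
      })
      quantile(dif, c(0.025, 0.975))   # reject H0 if the CI excludes zero
    }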

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in their section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that these techniques do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique, and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
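
For readers who want to apply this correction outside the WRS functions, base R provides it directly; pvals below stands for a hypothetical vector of the 15 raw p-values.

    # Hochberg's step-up adjustment of a vector of p-values (base R)
    # pvals <- c(...)                   # the 15 raw p-values
    pvals.adj <- p.adjust(pvals, method = "hochberg")
    sum(pvals.adj < 0.05)               # number of rejections at the 0.05 level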

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex, using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations, shown in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles, excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between the deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between the deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures taken 1 and 7 days after surgery were compared to a sham-treated control group that received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So, if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, the reverse decision is made.

Here, the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala. Conditions are Sham, Day 1, and Day 7; the y-axis shows the number of GAD-67-positive cells in the basolateral amygdala.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator, mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here, we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when age is used to predict left corticospinal measures (CSTL).


Figure 8.42.9 Nonparametric estimates of the regression line for predicting GORTFL with CSTL. (Panel axes: A, age vs. CSTL; B, age vs. GORTFL; C, CSTL vs. GORTFL; D, GORTFL vs. CSTL.)

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, the regression line now appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, there now appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci), p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important via the R function regIVcom indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001); the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests, ANOVA F tests on means, and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ, which has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals:

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But the more tests that are performed, the more such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression, where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.

Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistical Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics-Simulation and Computations, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederholt, J. L., & Bryant, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: ProEd Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics-Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock Johnson-III tests of achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


normality assumption is true but the population variances differ (called heteroscedasticity), power can be adversely impacted when using the ANOVA F test.

Similar concerns arise when dealing with regression. Conventional methods, including rank-based techniques, perform well in terms of controlling the probability of a Type I error when there is no association. But when there is an association, conventional methods, including rank-based techniques (e.g., Spearman's rho and Kendall's tau), can have a relatively low probability of detecting an association relative to modern methods developed during the last 30 years.

Practical concerns regarding conventional methods stem from four major insights (e.g., Wilcox, 2017a, 2017c). These insights can be briefly summarized as follows:

The central limit theorem and skewed distributions: much larger sample sizes might be needed to assume normality than is generally recognized.

There is now a deeper understanding of the role of outliers and how to deal with them. Some seemingly obvious strategies for dealing with outliers, based on standard training, are known to be highly unsatisfactory for reasons outlined later in the paper.

There is a substantial literature indicating that methods that assume homoscedasticity (equal variances) can yield inaccurate results when in fact there is heteroscedasticity, even when the sample sizes are quite large.

When dealing with regression, curvature refers to situations where the regression line is not straight. There is now considerable evidence that curvature is a much more serious concern than is generally recognized.

Robust methods are typically thought of as methods that provide good control over the probability of a Type I error. But today they deal with much broader issues. In particular, they are designed to deal with the problems associated with skewed distributions, outliers, heteroscedasticity, and curvature that were outlined above.

One of the more fundamental goals among robust methods is to develop techniques that are not overly sensitive to very small changes in a distribution. For instance, a slight departure from normality should not destroy power. This rules out any method based on the mean and variance (e.g., Staudte & Sheather, 1990; Wilcox, 2017a, 2017b). Section 2.3 illustrates this point.

Many modern robust methods are designed to have nearly the same amount of power as conventional methods under normality, but they continue to have relatively high power under slight departures from normality, where conventional techniques based on means perform poorly. There are other fundamental goals, some of which are relevant regardless of how large the sample sizes might be. But an effective description of these goals goes beyond the scope of this paper. For present purposes, the focus is on achieving relatively high power.

Another point that should be stressed has to do with standard power analyses. A common goal is to justify some choice for the sample sizes prior to obtaining any data. Note that, in effect, the goal is to address a statistical issue without any data. Typically, this is done by assuming normality and homoscedasticity, which in turn can suggest that relatively small sample sizes provide adequate power when using means. A practical concern is that violating either of these two assumptions can have a tremendous impact on power when attention is focused exclusively on comparing means. Section 2.1 illustrates this concern when dealing with measures of central tendency. Similar concerns arise when dealing with least squares regression and Pearson's correlation. These concerns can be mitigated by using recently developed robust methods summarized here, as well as in Wilcox (2017a, 2017c).

There is now a vast array of new and improved methods that effectively deal with known concerns associated with classic techniques (e.g., Heritier, Cantoni, Copt, & Victoria-Feser, 2009; Maronna, Martin, & Yohai, 2006; Wilcox, 2017a, 2017b, 2017c). They include substantially improved methods for dealing with all four of the major insights previously listed. Perhaps more importantly, they can provide a deeper, more accurate, and more nuanced understanding of data, as will be illustrated in Section 5.

For books focused on the mathematical foundation of modern robust methods, see Hampel, Ronchetti, Rousseeuw, and Stahel (1986), Huber and Ronchetti (2009), Maronna et al. (2006), and Staudte and Sheather (1990). For books focused on applying robust methods, see Heritier et al. (2009) and Wilcox (2017a, 2017c). From an applied point of view, the difficulty is not finding a method that effectively deals with violations of standard assumptions. Rather, for the non-statistician, the difficulty is navigating through the many alternative techniques that might be used. This paper is an attempt to deal with this issue by providing a general guide regarding when and how modern robust methods might be used when comparing two or more groups. When dealing with regression, all of the concerns associated with conventional methods for comparing groups remain, and new concerns are introduced. A few issues related to regression and correlations are covered here, but it is stressed that there are many other modern advances that have practical value. Readers interested in regression are referred to Wilcox (2017a, 2017c).

A few general points should be stressed. First, if robust methods, such as modern methods based on the median described later in this paper, give very similar results to conventional methods based on means, this is reassuring that conventional methods based on the mean are performing relatively well in terms of Type I errors and power. But when they differ, there is doubt about the validity of conventional techniques. In a given situation, conventional methods might perform well in terms of controlling the Type I error probability and providing reasonably high power. But the best that can be said is that there are general conditions where conventional methods do indeed yield inaccurate inferences. A particular concern is that they can suffer from relatively low power in situations where more modern methods have relatively high power. More details are provided in Sections 3 and 4.

Second, the choice of method can make a substantial difference in our understanding of data. One reason is that modern methods provide alternative and interesting perspectives that more conventional methods do not address. A complication is that there is no single method that dominates in terms of power or providing a deep understanding of how groups compare. The same is true when dealing with regression and measures of association. The reality is that several methods might be needed to address even what appears to be a simple problem, for instance, comparing two groups.

There is, of course, the issue of controlling the probability of one or more Type I errors when multiple tests are performed. There are many modern improvements for dealing with this issue (e.g., Wilcox, 2017a, 2017c).

Another strategy is to put more emphasis on exploratory studies. One could then deal with the risk of false positive results by conducting a confirmatory study aimed at determining whether significant results in an exploratory study can be replicated (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). Otherwise, there is the danger of missing important details regarding how groups compare. One of the main messages here is that, despite the lack of a single method that dominates, certain guidelines can be offered regarding how to analyze data.

Modern methods for plotting data can be invaluable as well (e.g., Rousselet, Foxe, & Bolam, 2016; Rousselet, Pernet, & Wilcox, 2017; Weissgerber, Milic, Winham, & Garovic, 2015). In particular, they can provide important perspectives beyond the common strategy of using error bars. Complete details go beyond the scope of this paper, but Section 5 illustrates some of the more effective plots that might be used.

This unit is organized as follows. Section 2 briefly reviews when and why conventional methods can be highly unsatisfactory. This is necessary in order to appreciate modern technology, and because standard training typically ignores these issues. Efforts to modernize basic training have been made (e.g., Field, Miles, & Field, 2012; Wilcox, 2017b, 2017c), and a 2016 special issue of The American Statistician (volume 69, number 4) was aimed at encouraging instructors to modernize their courses. (This special issue touches on a broader range of topics than those discussed here.) Some neuroscientists are trained in a manner that takes into account modern insights relevant to basic principles, but it is evident that most are not. Section 3 reviews the seemingly more obvious strategies aimed at salvaging standard techniques, the point being that, by modern standards, they are relatively ineffective and cannot be recommended. Moreover, certain strategies are not technically sound. Section 3 also provides an indication of how concerns regarding conventional methods are addressed using more modern techniques. Section 4 describes strategies for comparing two independent or dependent groups that take modern advances into account. Included are some methods aimed at comparing correlations, as well as methods designed to determine which independent variables are most important. Section 5 illustrates modern methods using data from several studies.

2. INSIGHTS REGARDING CONVENTIONAL METHODS

This section elaborates on the concerns with conventional methods for comparing groups and studying associations stemming from the four insights previously indicated.

2.1 Skewed Distributions

A skewed distribution simply refers to a distribution that is not symmetric about some central value. An example is shown in Figure 8.42.1A. Such distributions occur naturally. An example relevant to the neurosciences is given in Section 5.2. Skewed distributions are a much more serious problem for statistical inferences than once thought, due to insights regarding the central limit theorem.

Consider the one-sample case. Conventional wisdom is that, with a relatively small sample size, normality can be assumed under random sampling. An implicit assumption was that if the sample mean has approximately a normal distribution, then Student's t-test will perform reasonably well. It is now known that this is not necessarily the case, as illustrated, for example, in Wilcox (2017a, 2017b).

This point is illustrated here using a simulation that is performed in the following manner. Imagine that data are randomly sampled from the distribution shown in Figure 8.42.1A (a lognormal distribution) and that the mean is computed based on a sample size of n = 30. Repeating this process 5,000 times, the thick black line in Figure 8.42.1B shows a plot of the resulting sample means; the thin gray line is the plot of the means when sampling from a normal distribution instead. The distribution of T values for samples of n = 30 is indicated by the thick black line in Figure 8.42.1C; the thin gray line is the distribution of T values when sampling from a normal distribution. As can be seen, the actual T distribution extends out much further to the left compared to the distribution of T under normality. That is, in the current example, sampling from a skewed distribution leads to much more extreme values than expected by chance under normality, which in turn results in more false positive results than expected when the null hypothesis is true.

Suppose the goal is to test some hypothesis at the 0.05 level. Bradley (1978) suggests that, as a general guide, control over the probability of a Type I error is minimally satisfactory if the actual level is between 0.025 and 0.075.

Figure 8.42.1 (A) Example of a skewed (lognormal) distribution. (B) Distribution of the sample mean under normality (thin line, n = 30) and the actual distribution based on a simulation. Each sample mean was computed based on 30 observations randomly sampled from the distribution shown in A. (C) and (D) Comparison of the theoretical T distribution with 29 degrees of freedom to distributions of 5,000 T values. Again, the T values were computed from observations sampled from the distribution in A.


Figure 8.42.2 Type I error probability as a function of sample size. The Type I error probability was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and a lognormal distribution. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was zero. The black horizontal line marks the expected 0.05 Type I error probability. The gray area marks Bradley's satisfactory range. When sampling from a normal distribution, all methods are close to the nominal 0.05 level, except the trimmed mean for very small sample sizes. When sampling is from a lognormal distribution, the mean and the trimmed mean give rise to too many false alarms for small sample sizes. The mean continues to give higher false positive rates than the other techniques, even with n = 500.

When we test at the 0.05 level, we expect 5% of the t-tests to be significant. However, when sampling from the skewed distribution considered here, this is not the case: the actual Type I error probability is 0.111.

Figure 8.42.1D shows the distribution of T when n = 100. Now the Type I error probability is 0.082, again when testing at the 0.05 level. Based on Bradley's criterion, a sample size of about 130 or larger is required. Bradley (1978) goes on to suggest that, ideally, the actual Type I error probability should be between 0.045 and 0.055. Now n = 600 is unsatisfactory; the actual level is 0.057. With n = 700, the level is 0.055.
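A simulation along these lines is easy to run in base R. The sketch below estimates the actual Type I error probability of Student's t-test with n = 30 when sampling from a lognormal distribution; because rlnorm(n) has population mean exp(0.5), the null hypothesis is true.

# Estimate the actual Type I error probability of the one-sample t-test
set.seed(1)
reject <- replicate(5000, {
  x <- rlnorm(30)                          # skewed sample, H0 true
  t.test(x, mu = exp(0.5))$p.value < 0.05  # nominal 0.05 level
})
mean(reject)  # well above 0.05, consistent with the 0.111 reported above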

Before continuing, it is noted that the median belongs to the class of trimmed means, which refers to the strategy of trimming a specified proportion of the smallest and largest values and averaging the values that remain. For example, if n = 10, 10% trimming means that the lowest and highest values are removed and that the remaining data are averaged. Similarly, 20% trimming would remove the two smallest and two largest values. Based on conventional training, trimming might seem counterintuitive. In some situations, however, it can substantially increase our ability to control the Type I error probability, as illustrated next, and trimming can substantially increase power as well, for reasons to be explained.
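In R, no special software is needed to trim: for a hypothetical vector x, mean(x, trim = 0.2) removes the lowest and highest 20% of the observations before averaging.

x <- c(1, 3, 4, 5, 6, 7, 8, 9, 10, 40)  # hypothetical data with one extreme value
mean(x)              # 9.3, pulled upward by the value 40
mean(x, trim = 0.2)  # 6.5, the average of the middle six values
median(x)            # 6.5, the maximum amount of trimming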

First focus on controlling the probability of a Type I error (Section 2.3 illustrates one of the reasons that methods based on means can have relatively low power). Figure 8.42.2 illustrates the Type I error probability as a function of the sample size when using the mean and when using the median, when sampling from the asymmetric (lognormal) distribution in Figure 8.42.1A. Inferences based on the 20% trimmed mean were made via the method derived by Tukey and McLaughlin (1963). Inferences based on the median were made via the method derived by Hettmansperger and Sheather (1986). (The software used to apply these latter two methods is contained in the R package described at the beginning of Section 4.) Also shown is the Type I error probability when sampling from a normal distribution. The gray area indicates Bradley's criterion.

Figure 8.42.3 illustrates the association between power and the sample size for the distributions used in Figure 8.42.2. As can be seen, under normality the sample mean is best, followed closely by the 20% trimmed mean. The median is least satisfactory when dealing with a normal distribution, as expected.


Figure 8.42.3 Power as a function of sample size. The probability of a true positive was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and the (lognormal) distribution shown in Figure 8.42.1A. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was 0.5. The black horizontal line marks the conventional 80% power threshold. When sampling from a normal distribution, all methods require <50 observations to achieve 80% power, and the mean appears to have higher power at lower sample sizes than the other methods. When sampling from a lognormal distribution, power drops dramatically for the mean but not for the median and the trimmed mean. Power for the median actually improves. The exact pattern of results depends on the effect size and the asymmetry of the distribution we sample from, so we strongly encourage readers to perform their own detailed power analyses.

However, for the asymmetric (lognormal) distribution, the median performs best and the mean performs very poorly.

A feature of random samples taken from the distribution in Figure 8.42.1A is that the expected proportion of points declared an outlier is relatively small. For skewed distributions, as we move toward situations where outliers are more common, a sample size >300 can be required to achieve reasonably good control over the Type I error probability. That is, control over the Type I error probability is a function of both the degree a distribution is skewed and the likelihood of encountering outliers. However, there are methods that perform reasonably well with small sample sizes, as indicated in Section 4.1.

For symmetric distributions where outliers tend to occur, the reverse can happen: the actual Type I error probability can be substantially less than the nominal level. This happens because outliers inflate the standard deviation, which in turn lowers the value of T, which in turn can negatively impact power. Section 2.3 elaborates on this issue.

In an important sense, outliers have a larger impact on the sample variance than the sample mean, which impacts the t-test. To illustrate this point, imagine the goal is to test H0: μ = 1 based on the following values: 1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7. Then T = 4.69, the p-value is p = 0.002, and the 0.95 confidence interval is [1.45, 2.35]. Now, including the value 8, the mean increases from 1.9 to 2.58, suggesting at some level there is stronger evidence for rejecting the null hypothesis. However, this outlier increases the standard deviation from 0.54 to 2.1, and now T = 2.26 and p = 0.054. The 0.95 confidence interval is [−0.033, 3.19].
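These numbers are easily reproduced with base R:

x <- c(1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7)
t.test(x, mu = 1)        # T = 4.69, p = 0.002, 0.95 CI = [1.45, 2.35]
t.test(c(x, 8), mu = 1)  # T = 2.26, p = 0.054, 0.95 CI = [-0.033, 3.19]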

Now consider the goal of comparing two independent or dependent groups. If the groups have identical distributions, then difference scores have a symmetric distribution and the probability of a Type I error is, in general, less than the nominal level when using conventional methods based on means. Now, in addition to outliers, differences in skewness create practical concerns when using Student's t-test. Indeed, under general conditions, the two-sample Student's t-test for independent groups is not even asymptotically correct, roughly because the standard error of the difference between the sample means is not estimated correctly (e.g., Cressie & Whitford, 1986). Moreover, Student's t-test can be biased. This means that the probability of rejecting the null hypothesis of equal means can be higher when the population means are equal compared to situations where the population means differ. Roughly, this concern arises because the distribution of T can be skewed, and, in fact, the mean of T can differ from zero even though the null hypothesis is true. For a more detailed explanation, see Wilcox (2017c, section 5.5). Problems persist when Student's t-test is replaced by Welch's (1938) method, which is designed to compare means in a manner that allows unequal variances. Put another way, if the goal is to test the hypothesis that two groups have identical distributions, conventional methods based on means perform well in terms of controlling the Type I error probability. But if the goal is to compare the population means, and if distributions differ, conventional methods can perform poorly.

There are many techniques that perform well when dealing with skewed distributions in terms of controlling the Type I error probability, some of which are based on the usual sample median (Wilcox, 2017a, 2017c). Both theory and simulations indicate that, as the amount of trimming increases, the ability to control the probability of a Type I error increases as well. Moreover, for reasons to be explained, trimming can substantially increase power, a result that is not obvious based on conventional training. The optimal amount of trimming depends on the characteristics of the population distributions, which are unknown. Currently, the best that can be said is that the choice can make a substantial difference. The 20% trimmed mean has been studied extensively and often provides a good compromise between the two extremes: no trimming (the mean) and the maximum amount of trimming (the median).

In various situations, particularly important are inferential methods based on what are called bootstrap techniques. Two basic versions are the bootstrap-t and the percentile bootstrap. Roughly, rather than assume normality, bootstrap-t methods perform a simulation using the observed data that yields an estimate of an appropriate critical value and a p-value. So values of T are generated as done in Figure 8.42.1, only data are sampled with replacement from the observed data. In essence, bootstrap-t methods generate data-driven T distributions expected by chance if there were no effect. The percentile bootstrap proceeds in a similar manner, only, when dealing with a trimmed mean, for example, the goal is to determine the distribution of the sample trimmed mean, which can then be used to compute a p-value and a confidence interval. When comparing two independent groups based on the usual sample median, if there are tied (duplicated) values, currently the only method that performs well in simulations, in terms of controlling the Type I error probability, is based on a percentile bootstrap method (compare to Table 5.3 in Wilcox, 2017c). Section 4.1 elaborates on how this method is performed.
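As an illustration of the basic idea, here is a minimal sketch, in base R, of a one-sample percentile bootstrap confidence interval for the 20% trimmed mean (Wilcox's functions automate this, as well as the two-group versions discussed in this unit):

# Percentile bootstrap 0.95 confidence interval for the 20% trimmed mean
set.seed(1)
x <- rlnorm(30)  # hypothetical skewed sample
boot.tm <- replicate(2000, mean(sample(x, replace = TRUE), trim = 0.2))
quantile(boot.tm, c(0.025, 0.975))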

If the amount of trimming is close to zero, the bootstrap-t method is preferable to the percentile bootstrap method. But as the amount of trimming increases, at some point a percentile bootstrap method is preferable. This is the case with 20% trimming (e.g., Wilcox, 2017a). It seems to be the case with 10% trimming as well, but a definitive study has not been made.

Also, if the goal is to reflect the typical response, it is evident that the median, or even a 20% trimmed mean, might be more satisfactory. Using quantiles (percentiles) other than the median can be important as well, for reasons summarized in Section 4.2.

When comparing independent groups, modern improvements on the Wilcoxon-Mann-Whitney (WMW) test are another possibility; these are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second. (More details are provided in Section 3.) Additional possibilities are described in Wilcox (2017a), some of which are illustrated in Section 4 below.

In some situations, robust methods can have substantially higher power than any method based on means. It is not being suggested that robust methods always have more power. This is not the case. Rather, the point is that power can depend crucially on the conjunction of which estimator is used (for instance, the mean versus the median) and how a confidence interval is built (for instance, a parametric method or the percentile bootstrap). These choices are not trivial and must be taken into account when analyzing data.

2.2 Heteroscedasticity

It has been clear for some time that, when using classic methods for comparing means, heteroscedasticity (unequal population variances) is a serious concern (e.g., Brown & Forsythe, 1974). Heteroscedasticity can impact both power and the Type I error probability.

Figure 8.42.4 Homoscedasticity (A) and heteroscedasticity (B). (A) The variance of the dependent variable is the same at any age. (B) The variance of the dependent variable can depend on the age of the participants.

The basic reason is that, under general conditions, methods that assume homoscedasticity are using an incorrect estimate of the standard error when in fact there is heteroscedasticity. Indeed, there are concerns regardless of how large the sample size might be. As we consider more and more complicated designs, heteroscedasticity becomes an increasing concern.

When dealing with regression, homoscedasticity means that the variance of the dependent variable does not depend on the value of the independent variable. When dealing with age and depressive symptoms, for example, homoscedasticity means that the variation in measures of depressive symptoms at age 23 is the same at age 80, or any age in between, as illustrated in Figure 8.42.4.

Independence implies homoscedasticity. So, for this particular situation, classic methods associated with least squares regression, Pearson's correlation, Kendall's tau, and Spearman's rho are using a correct estimate of the standard error, which helps explain why they perform well in terms of Type I errors when there is no association. That is, when a homoscedastic method rejects, it is reasonable to conclude that there is an association, but

in terms of inferring the nature of the association, these methods can perform poorly. Again, a practical concern is that, when there is heteroscedasticity, homoscedastic methods use an incorrect estimate of the standard error, which can result in poor power and erroneous conclusions.

A seemingly natural way of salvaging homoscedastic methods is to test the assumption that there is homoscedasticity. But six studies summarized in Wilcox (2017c) found that this strategy is unsatisfactory. Presumably situations are encountered where this is not the case, but it is difficult and unclear how to determine when such situations are encountered.

Methods that are designed to deal with heteroscedasticity have been developed and are easily applied using extant software. These techniques use a correct estimate of the standard error regardless of whether the homoscedasticity assumption is true. A general recommendation is to always use a heteroscedastic method, given the goal of comparing measures of central tendency, or making inferences about regression parameters, as well as measures of association.


Figure 8.42.5 (A) Density functions for the standard normal distribution (solid line) and the mixed normal distribution (dashed line). These distributions have an obvious similarity, yet the variances are 1 and 10.9. (B) Box plots of means, medians, and 20% trimmed means when sampling from a normal distribution. (C) Box plots of means, medians, and 20% trimmed means when sampling from a mixed normal distribution. In panels B and C, each distribution has 10,000 values, and each of these values was obtained by computing the mean, median, or trimmed mean of 30 randomly generated observations.

2.3 Outliers

Even small departures from normality can devastate power. The modern illustration of this fact stems from Tukey (1960) and is based on what is generally known as a mixed normal distribution. The mixed normal considered by Tukey means that, with probability 0.9, an observation is sampled from a standard normal distribution; otherwise, an observation is sampled from a normal distribution having mean zero and standard deviation 10. Figure 8.42.5A shows a standard normal distribution and the mixed normal discussed by Tukey. Note that, in the center of the distributions, the mixed normal is below the normal distribution. But for the two ends of the mixed normal distribution (the tails), the mixed normal lies above the normal distribution. For this reason, the mixed normal is often described as having heavy tails. In general, heavy-tailed distributions roughly refer to distributions where outliers are likely to occur.

Here is an important point: the standard normal has variance 1, but the mixed normal has variance 10.9. That is, the population variance can be overly sensitive to slight changes in the tails of a distribution. Consequently, even a slight departure from normality can result in relatively poor power when using any method based on the mean.

Put another way, samples from the mixed normal are more likely to result in outliers compared to samples from a standard normal distribution. As previously indicated, outliers inflate the sample variance, which can negatively impact power when using means. Another concern is that they can give a distorted and misleading summary regarding the bulk of the participants.

The first indication that heavy-tailed distributions are a concern stems from a result derived by Laplace about two centuries ago. In modern terminology, he established that, as we move from a normal distribution to a distribution more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hald, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that, under normality, the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A. Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller, contrary to what might be expected based on standard training.
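This phenomenon is easy to verify by simulation. The sketch below, in base R, approximates the sampling variation of the three estimators under Tukey's mixed normal:

# With probability 0.9 sample from N(0, 1), otherwise from N(0, 10^2)
set.seed(1)
rmixnorm <- function(n) rnorm(n, sd = ifelse(runif(n) < 0.9, 1, 10))
est <- replicate(10000, {
  x <- rmixnorm(30)
  c(mean = mean(x), median = median(x), tmean = mean(x, trim = 0.2))
})
apply(est, 1, sd)  # the mean has by far the largest standard error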

Of course, a more important issue is whether the median or 20% trimmed mean ever have substantially smaller standard errors based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality. But as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and median. Because the choice between a 20% trimmed mean and median is not straightforward in terms of maximizing power, it is suggested that, in exploratory studies, both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense about the association among the bulk of the participants under study. In particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense about the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes when testing hypotheses and outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).
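Assuming Wilcox's functions have been sourced, usage is straightforward; the data vectors x and y below are hypothetical.

set.seed(1)
x <- rnorm(40)
y <- x + rnorm(40)
scor(x, y)    # skipped correlation: remove projection-based outliers,
              # then compute Pearson's correlation on the remaining data
scorci(x, y)  # heteroscedasticity-consistent confidence interval and p-value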

2.4 Curvature

Typically, a regression line is assumed to be straight. In some situations, this approach seems to suffice. However, it cannot be stressed too strongly that there is substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here it is merely remarked that, among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother (Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
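Again assuming Wilcox's functions have been sourced, both smoothers can be applied in a line or two; the data below are hypothetical, with built-in curvature.

set.seed(1)
x <- runif(100, 0, 3)
y <- sin(2 * x) + rnorm(100, sd = 0.3)  # a curved association
lplot(x, y)                # Cleveland's smoother
rplot(x, y)                # running-interval smoother (20% trimmed mean)
rplot(x, y, est = median)  # same smoother, estimating the median of y given x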

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model, for some appropriate choice for the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that, indeed, the regression line is straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3. DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, generally these strategies are relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions

A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly in terms of controlling the Type I error when there is heteroscedasticity. Then various methods for testing the homoscedasticity assumption are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes

There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

|X − X̄|/s > 2

(Equation 8.42.1)

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The last two observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M|/(MAD/0.6745) > 2.24

(Equation 8.42.2)

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation, and, of course, M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only, rather than using a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers: 71.53. But the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
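Both rules are easy to apply in base R; note that R's mad() multiplies the raw MAD by 1.4826 ≈ 1/0.6745 by default, so it can be used directly in Equation 8.42.2. (Different quartile estimators give slightly different box plot fences, so results near a fence can vary.)

x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)

# MAD-median rule: flags both values of 71.53
abs(x - median(x)) / mad(x) > 2.24

# Box plot rule: boxplot.stats() lists points beyond the fences;
# here it flags nothing, illustrating masking with 25% outliers
boxplot.stats(x)$out

# Two-standard-deviation rule (Equation 8.42.1): also flags nothing
abs(x - mean(x)) / sd(x) > 2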

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that if the remaining data are dependent, then they are correlated, which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue when using a 20% trimmed mean or median is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that, when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil and Sen estimator, as well as the MM-estimator, are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989).


The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that, under general conditions, these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So, again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution free technique via the R function sintv2. Distribution free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases.


The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.
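The following is a minimal sketch of both installation routes; the devtools command reflects the instructions posted on the GitHub page, so treat it as an assumption and check that page for current details:

    # Install WRS from GitHub (assumes the devtools package and internet access)
    install.packages("devtools")
    devtools::install_github("nicebread/WRS", subdir = "pkg")

    # Or source the collection of functions directly, after downloading
    # Rallfun-v33.txt from the Wilcox software page
    source("Rallfun-v33.txt")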

All recommended methods deal with heteroscedasticity. When comparing groups whose distributions differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory, in terms of Type I errors, when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based, in part, on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample, with replacement, n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample, with replacement, m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y. Let D* = M*x − M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
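To make the steps concrete, here is the procedure written out in base R with hypothetical data; medpb2 automates this (and handles details such as the number of bootstrap samples), so the sketch is for illustration only:

    # Percentile bootstrap for the difference between two medians
    set.seed(1)
    x <- rnorm(8)          # hypothetical first group, n = 8
    y <- rnorm(9, 1)       # hypothetical second group, m = 9
    Dstar <- replicate(2000,
      median(sample(x, replace = TRUE)) - median(sample(y, replace = TRUE)))
    quantile(Dstar, c(0.025, 0.975))           # 0.95 confidence interval
    pstar <- mean(Dstar < 0) + 0.5 * mean(Dstar == 0)
    2 * min(pstar, 1 - pstar)                  # p-value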

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with a relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
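For instance, assuming the WRS functions have been sourced as described above, Yuen's method can be applied as follows; the data here are hypothetical:

    set.seed(1)
    x <- rnorm(12)
    y <- rnorm(12, 0.6)
    yuen(x, y)           # compares 20% trimmed means by default
    yuen(x, y, tr = 0)   # no trimming: reduces to Welch's method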

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based, in part, on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0) and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
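This estimate is a one-liner in base R; the numbers match the example above:

    x <- c(1, 2, 3)
    y <- c(1, 2.5, 4)
    D <- as.vector(outer(x, y, "-"))   # all pairwise differences Xi - Yk
    mean(D < 0)                        # 5/9, approximately 0.556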

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5.    (Equation 8.42.3)

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0.    (Equation 8.42.4)

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullens, 2012). (We are not aware of any commercial software package that contains this method.)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low scoring individuals.


First consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).
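A minimal sketch of this quantile-sum perspective, using base R's quantile estimator on hypothetical data (the Harrell-Davis estimator could be used instead):

    set.seed(1)
    x <- rexp(30)                       # hypothetical skewed group
    y <- rnorm(30)                      # hypothetical symmetric group
    D <- as.vector(outer(x, y, "-"))    # all pairwise differences
    sum(quantile(D, c(0.25, 0.75)))     # zero in the population when the
                                        # two distributions are identical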

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then, for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0    (Equation 8.42.5)

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are at least 10. For q = 0.1, sample sizes of at least 20 should be used (Wilcox, 2012).

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM, as well as related estimators. The R function pb2gen applies this method.
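The MOM estimator itself is simple enough to sketch in base R; this is a plain re-implementation for illustration, not the WRS code:

    # Modified one-step M-estimator (MOM): discard values flagged by the
    # MAD-median rule, then average the rest
    mom <- function(x, crit = 2.24) {
      M <- median(x)
      keep <- abs(x - M) / mad(x) <= crit   # mad() rescales by 1/0.6745
      mean(x[keep])
    }
    mom(c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53))   # 1.11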

4.4 Dependent Variables

Next consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD; the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error.


And, in general, it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method, via the R function trimci, but it is unclear just how large the sample size must be.

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that the participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions, it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures, such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
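As an illustration, assuming the WRS functions have been sourced, all pairwise comparisons for a hypothetical three-group design might look as follows (most WRS functions for multiple groups accept data in list mode, an assumption worth verifying in the documentation):

    set.seed(1)
    groups <- list(rnorm(15), rnorm(15, 0.5), rnorm(15, 1))
    linconpb(groups)   # percentile bootstrap; all pairwise contrasts by default
    lincon(groups)     # non-bootstrap analog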

4.6 Comparing Independent Correlations and Regression Slopes

Next consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2.    (Equation 8.42.6)

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, now a basic percentile bootstrap method can be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.
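For example, a hypothetical comparison of the Spearman correlations of two independent groups might look as follows; the corfun = spear argument follows the WRS conventions and is an assumption to verify against the documentation:

    set.seed(1)
    x1 <- rnorm(40); y1 <- x1 + rnorm(40)
    x2 <- rnorm(40); y2 <- 0.2 * x2 + rnorm(40)
    twocor(x1, y1, x2, y2, corfun = spear)   # percentile bootstrap comparison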

4.7 Comparing Correlations: the Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011).


For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association.

Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods, described in chapter 12 of Wilcox (2017a), that deal with violations of all of these assumptions. These newer techniques can substantially increase power compared to the classic ANCOVA technique and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ, compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
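Sketching the workflow with hypothetical covariate/outcome pairs for two groups (the function names are those cited in the text; argument details should be checked against the documentation):

    set.seed(1)
    x1 <- rnorm(40); y1 <- x1 + rnorm(40)        # group 1
    x2 <- rnorm(40); y2 <- 0.5 * x2 + rnorm(40)  # group 2
    rplot2g(x1, y1, x2, y2, xout = TRUE)  # smoothers, outliers in x removed
    ancova(x1, y1, x2, y2)                # robust ANCOVA; lines need not be parallel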

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh.


Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F test rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results are obtained. Comparing means, via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians, via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.

It is noted that, in various situations, when using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex, using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles, excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group, which received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test, followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Here, the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala, under the sham, day 1, and day 7 conditions; the y-axis gives the number of GAD-67-positive cells in the basolateral amygdala.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values less than or equal to 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother, as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).


Figure 8.42.9 Nonparametric estimates of the regression lines. (A) Age versus CSTL. (B) Age versus GORTFL. (C) CSTL as a predictor of GORTFL. (D) GORTFL as a predictor of CSTL.

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields


p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data, where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say less than or equal to 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical


difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ, or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data, where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward via the R functions mentioned here: set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).
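For instance, a minimal sketch along these lines, assuming the functions described in Section 4 have been sourced and using hypothetical data (tsreg fits the Theil-Sen estimator):

set.seed(3)
x <- rnorm(50)
y <- x + rnorm(50)
tsreg(x, y, xout = TRUE)   # Theil-Sen fit, outliers among x removed
lplot(x, y, xout = TRUE)   # Cleveland's smoother
rplot(x, y, xout = TRUE)   # running-interval smoother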

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals (a brief usage sketch follows the list):

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.
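A minimal usage sketch of these options, assuming the functions described in Section 4 have been sourced and using hypothetical paired data:

set.seed(4)
x <- rnorm(8)
y <- x + rnorm(8, mean = 0.5)   # hypothetical paired measurements
signt(x, y)       # sign test
sintv2(x - y)     # median of the difference scores
trimcipb(x - y)   # 20% trimmed mean of the difference scores
lband(x, y)       # compare quantiles of the marginal distributions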

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes > 10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors (a brief usage sketch follows the list):

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
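A matching sketch for two independent groups, again assuming the functions described in Section 4 have been sourced and using hypothetical data:

set.seed(5)
g1 <- rnorm(10)
g2 <- rnorm(10, mean = 1)   # hypothetical groups
medpb2(g1, g2)    # compare medians, percentile bootstrap
sband(g1, g2)     # compare quantiles
cidv2(g1, g2)     # P(obs from group 1 < obs from group 2)
trimpb2(g1, g2)   # compare 20% trimmed means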

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be greater than 0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
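Both improvements are available in base R via p.adjust; the p-values below are hypothetical:

pvals <- c(0.001, 0.010, 0.020, 0.040, 0.200)   # hypothetical p-values
p.adjust(pvals, method = "hochberg")   # Hochberg (1988)
p.adjust(pvals, method = "hommel")     # Hommel (1988)
# Expected number of Type I errors if all five nulls are true:
5 * 0.05   # = 0.25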

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hand, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistical Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 9, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock Johnson-III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165



the difficulty is not finding a method that effectively deals with violations of standard assumptions. Rather, for the non-statistician, there is the difficulty in navigating through the many alternative techniques that might be used. This paper is an attempt to deal with this issue by providing a general guide regarding when and how modern robust methods might be used when comparing two or more groups. When dealing with regression, all of the concerns associated with conventional methods for comparing groups remain, and new concerns are introduced. A few issues related to regression and correlations are covered here, but it is stressed that there are many other modern advances that have practical value. Readers interested in regression are referred to Wilcox (2017a, 2017c).

A few general points should be stressed. First, if robust methods, such as modern methods based on the median described later in this paper, give very similar results to conventional methods based on means, this is reassuring that conventional methods based on the mean are performing relatively well in terms of Type I errors and power. But when they differ, there is doubt about the validity of conventional techniques. In a given situation, conventional methods might perform well in terms of controlling the Type I error probability and providing reasonably high power. But the best that can be said is that there are general conditions where conventional methods do indeed yield inaccurate inferences. A particular concern is that they can suffer from relatively low power in situations where more modern methods have relatively high power. More details are provided in Sections 3 and 4.

Second, the choice of method can make a substantial difference in our understanding of data. One reason is that modern methods provide alternative and interesting perspectives that more conventional methods do not address. A complication is that there is no single method that dominates in terms of power or providing a deep understanding of how groups compare. The same is true when dealing with regression and measures of association. The reality is that several methods might be needed to address even what appears as a simple problem, for instance comparing two groups.

There is, of course, the issue of controlling the probability of one or more Type I errors when multiple tests are performed. There are many modern improvements for dealing with this issue (e.g., Wilcox, 2017a, 2017c).

Another strategy is to put more emphasis on exploratory studies. One could then deal with the risk of false positive results by conducting a confirmatory study aimed at determining whether significant results in an exploratory study can be replicated (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). Otherwise, there is the danger of missing important details regarding how groups compare. One of the main messages here is that, despite the lack of a single method that dominates, certain guidelines can be offered regarding how to analyze data.

Modern methods for plotting data can be invaluable as well (e.g., Rousselet, Foxe, & Bolam, 2016; Rousselet, Pernet, & Wilcox, 2017; Weissgerber, Milic, Winham, & Garovic, 2015). In particular, they can provide important perspectives beyond the common strategy of using error bars. Complete details go beyond the scope of this paper, but Section 5 illustrates some of the more effective plots that might be used.

This unit is organized as follows. Section 2 briefly reviews when and why conventional methods can be highly unsatisfactory. This is necessary in order to appreciate modern technology, and because standard training typically ignores these issues. Efforts to modernize basic training have been made (e.g., Field, Miles, & Field, 2012; Wilcox, 2017b, 2017c), and a 2016 special issue of The American Statistician (volume 69, number 4) was aimed at encouraging instructors to modernize their courses. (This special issue touches on a broader range of topics than those discussed here.) Some neuroscientists are trained in a manner that takes into account modern insights relevant to basic principles, but it is evident that most are not. Section 3 reviews the seemingly more obvious strategies aimed at salvaging standard techniques, the point being that by modern standards they are relatively ineffective and cannot be recommended. Moreover, certain strategies are not technically sound. Section 3 also provides an indication of how concerns regarding conventional methods are addressed using more modern techniques. Section 4 describes strategies for comparing two independent or dependent groups that take modern advances into account. Included are some methods aimed at comparing correlations, as well as methods designed to determine which independent variables are most important. Section 5 illustrates modern methods using data from several studies.


2. INSIGHTS REGARDING CONVENTIONAL METHODS

This section elaborates on the concerns with conventional methods for comparing groups and studying associations stemming from the four insights previously indicated.

2.1 Skewed Distributions

A skewed distribution simply refers to a distribution that is not symmetric about some central value. An example is shown in Figure 8.42.1A. Such distributions occur naturally. An example relevant to the neurosciences is given in Section 5.2. Skewed distributions are a much more serious problem for statistical inferences than once thought, due to insights regarding the central limit theorem.

Consider the one-sample case. Conventional wisdom is that, with a relatively small sample size, normality can be assumed under random sampling. An implicit assumption was that if the sample mean has approximately a normal distribution, then Student's t-test will perform reasonably well. It is now known that this is not necessarily the case, as illustrated, for example, in Wilcox (2017a, 2017b).

This point is illustrated here using a simulation that is performed in the following manner. Imagine that data are randomly sampled from the distribution shown in Figure 8.42.1A (a lognormal distribution) and that the mean is computed based on a sample size of n = 30. Repeating this process 5,000 times, the thick black line in Figure 8.42.1B shows a plot of the resulting sample means; the thin gray line is the plot of the means when sampling from a normal distribution instead. The distribution of T values for samples of n = 30 is indicated by the thick black line in Figure 8.42.1C; the thin gray line is the distribution of T values when sampling from a normal distribution. As can be seen, the actual T distribution extends out much further to the left compared to the distribution of T under normality. That is, in the current example, sampling from a skewed distribution leads to much more extreme values than expected by chance under normality, which in turn results in more false positive results than expected when the null hypothesis is true.
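A minimal base R sketch of this simulation, using a standard lognormal distribution (an assumption about the exact parameters behind the figure):

set.seed(7)
n <- 30
tvals <- replicate(5000, {
  x <- rlnorm(n)   # sample from a lognormal distribution
  # T statistic; exp(0.5) is the mean of a standard lognormal
  (mean(x) - exp(0.5)) / (sd(x) / sqrt(n))
})
# Actual two-sided rejection rate at the nominal 0.05 level:
mean(abs(tvals) > qt(0.975, df = n - 1))   # noticeably larger than 0.05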

Suppose the goal is to test some hypothesis at the 0.05 level. Bradley (1978) suggests that, as a general guide, control over the probability of a Type I error is minimally satisfactory if the actual level is between 0.025 and 0.075. When we test at the 0.05 level, we expect 5% of the t-tests to be significant.


Figure 8.42.1 (A) Example of a skewed distribution. (B) Distribution of the sample mean under normality (thin line, n = 30) and the actual distribution based on a simulation. Each sample mean was computed based on 30 observations randomly sampled from the distribution shown in A. (C) and (D) Comparison of the theoretical T distribution with 29 degrees of freedom to distributions of 5,000 T values. Again, the T values were computed from observations sampled from the distribution in A.



Figure 8.42.2 Type I error probability as a function of sample size. The Type I error probability was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and a lognormal distribution. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was zero. The black horizontal line marks the expected 0.05 Type I error probability. The gray area marks Bradley's satisfactory range. When sampling from a normal distribution, all methods are close to the nominal 0.05 level, except the trimmed mean for very small sample sizes. When sampling is from a lognormal distribution, the mean and the trimmed mean give rise to too many false alarms for small sample sizes. The mean continues to give higher false positive rates than the other techniques even with n = 500.

However, when sampling from the skewed distribution considered here, this is not the case: the actual Type I error probability is 0.111.

Figure 8.42.1D shows the distribution of T when n = 100. Now the Type I error probability is 0.082, again when testing at the 0.05 level. Based on Bradley's criterion, a sample size of about 130 or larger is required. Bradley (1978) goes on to suggest that ideally the actual Type I error probability should be between 0.045 and 0.055. Now n = 600 is unsatisfactory; the actual level is 0.057. With n = 700, the level is 0.055.

Before continuing, it is noted that the median belongs to the class of trimmed means, which refers to the strategy of trimming a specified proportion of the smallest and largest values and averaging the values that remain. For example, if n = 10, 10% trimming means that the lowest and highest values are removed and that the remaining data are averaged. Similarly, 20% trimming would remove the two smallest and two largest values. Based on conventional training, trimming might seem counterintuitive. In some situations, however, it can substantially increase our ability to control the Type I error probability, as illustrated next, and trimming can substantially increase power as well, for reasons to be explained.

First focus on controlling the probability of a Type I error. (Section 2.3 illustrates one of the reasons that methods based on means can have relatively low power.) Figure 8.42.2 illustrates the Type I error probability as a function of the sample size when using the mean, when using the median, and when sampling from the asymmetric (lognormal) distribution in Figure 8.42.1A. Inferences based on the 20% trimmed mean were made via the method derived by Tukey and McLaughlin (1963); inferences based on the median were made via the method derived by Hettmansperger and Sheather (1986). (The software used to apply these latter two methods is contained in the R package described at the beginning of Section 4.) Also shown is the Type I error probability when sampling from a normal distribution. The gray area indicates Bradley's criterion.

Figure 8.42.3 illustrates the association between power and the sample size for the distributions used in Figure 8.42.2. As can be seen, under normality the sample mean is best, followed closely by the 20% trimmed mean. The median is least satisfactory when dealing with a normal distribution, as expected.



Figure 8.42.3 Power as a function of sample size. The probability of a true positive was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and the (lognormal) distribution shown in Figure 8.42.1A. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was 0.5. The black horizontal line marks the conventional 80% power threshold. When sampling from a normal distribution, all methods require fewer than 50 observations to achieve 80% power, and the mean appears to have higher power at lower sample sizes than the other methods. When sampling from a lognormal distribution, power drops dramatically for the mean but not for the median and the trimmed mean. Power for the median actually improves. The exact pattern of results depends on the effect size and the asymmetry of the distribution we sample from, so we strongly encourage readers to perform their own detailed power analyses.

However, for the asymmetric (lognormal) distribution, the median performs best and the mean performs very poorly.

A feature of random samples taken from the distribution in Figure 8.42.1A is that the expected proportion of points declared an outlier is relatively small. For skewed distributions, as we move toward situations where outliers are more common, a sample size greater than 300 can be required to achieve reasonably good control over the Type I error probability. That is, control over the Type I error probability is a function of both the degree a distribution is skewed and the likelihood of encountering outliers. However, there are methods that perform reasonably well with small sample sizes, as indicated in Section 4.1.

For symmetric distributions where outliers tend to occur, the reverse can happen: the actual Type I error probability can be substantially less than the nominal level. This happens because outliers inflate the standard deviation, which in turn lowers the value of T, which in turn can negatively impact power. Section 2.3 elaborates on this issue.

In an important sense, outliers have a larger impact on the sample variance than the sample mean, which impacts the t-test. To illustrate this point, imagine the goal is to test H0: μ = 1 based on the following values: 1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7. Then T = 4.69, the p-value is p = 0.002, and the 0.95 confidence interval is [1.45, 2.35]. Now, including the value 8, the mean increases from 1.9 to 2.58, suggesting at some level there is stronger evidence for rejecting the null hypothesis. However, this outlier increases the standard deviation from 0.54 to 2.1, and now T = 2.26 and p = 0.054. The 0.95 confidence interval is [−0.033, 3.19].
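This illustration is easily reproduced in base R:

x <- c(1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7)
t.test(x, mu = 1)          # T = 4.69, p = 0.002
t.test(c(x, 8), mu = 1)    # adding one outlier: T = 2.26, p = 0.054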

Now consider the goal of comparing two independent or dependent groups. If the groups have identical distributions, then difference scores have a symmetric distribution, and the probability of a Type I error is, in general, less than the nominal level when using conventional methods based on means. Now, in addition to outliers, differences in skewness create practical concerns when using Student's t-test. Indeed, under general conditions, the two-sample Student's t-test for independent groups is not even asymptotically correct, roughly because the standard error of the difference between the sample means is not estimated correctly (e.g., Cressie & Whitford, 1986).


Moreover, Student's t-test can be biased. This means that the probability of rejecting the null hypothesis of equal means can be higher when the population means are equal compared to situations where the population means differ. Roughly, this concern arises because the distribution of T can be skewed, and in fact the mean of T can differ from zero even though the null hypothesis is true. For a more detailed explanation, see Wilcox (2017c, section 5.5). Problems persist when Student's t-test is replaced by Welch's (1938) method, which is designed to compare means in a manner that allows unequal variances. Put another way, if the goal is to test the hypothesis that two groups have identical distributions, conventional methods based on means perform well in terms of controlling the Type I error probability. But if the goal is to compare the population means, and if distributions differ, conventional methods can perform poorly.

There are many techniques that perform well when dealing with skewed distributions in terms of controlling the Type I error probability, some of which are based on the usual sample median (Wilcox, 2017a, 2017c). Both theory and simulations indicate that as the amount of trimming increases, the ability to control the probability of a Type I error increases as well. Moreover, for reasons to be explained, trimming can substantially increase power, a result that is not obvious based on conventional training. The optimal amount of trimming depends on the characteristics of the population distributions, which are unknown. Currently, the best that can be said is that the choice can make a substantial difference. The 20% trimmed mean has been studied extensively and often provides a good compromise between the two extremes: no trimming (the mean) and the maximum amount of trimming (the median).

In various situations, particularly important are inferential methods based on what are called bootstrap techniques. Two basic versions are the bootstrap-t and percentile bootstrap. Roughly, rather than assume normality, bootstrap-t methods perform a simulation using the observed data that yields an estimate of an appropriate critical value and a p-value. So values of T are generated as done in Figure 8.42.1, only data are sampled with replacement from the observed data. In essence, bootstrap-t methods generate data-driven T distributions expected by chance if there were no effect. The percentile bootstrap proceeds in a similar manner, only when dealing with a trimmed mean, for example, the goal is to determine the distribution of the sample trimmed mean, which can then be used to compute a p-value and a confidence interval. When comparing two independent groups based on the usual sample median, if there are tied (duplicated) values, currently the only method that performs well in simulations, in terms of controlling the Type I error probability, is based on a percentile bootstrap method (compare to Table 5.3 in Wilcox, 2017c). Section 4.1 elaborates on how this method is performed.
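Here is a minimal base R sketch of the percentile bootstrap for a 20% trimmed mean; the functions described in Section 4 automate this with refinements:

set.seed(9)
x <- rlnorm(30)   # hypothetical skewed sample
boot_tm <- replicate(2000, mean(sample(x, replace = TRUE), trim = 0.2))
quantile(boot_tm, c(0.025, 0.975))   # percentile bootstrap 0.95 CI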

If the amount of trimming is close to zero, the bootstrap-t method is preferable to the percentile bootstrap method. But as the amount of trimming increases, at some point a percentile bootstrap method is preferable. This is the case with 20% trimming (e.g., Wilcox, 2017a). It seems to be the case with 10% trimming as well, but a definitive study has not been made.

Also, if the goal is to reflect the typical response, it is evident that the median or even a 20% trimmed mean might be more satisfactory. Using quantiles (percentiles) other than the median can be important as well, for reasons summarized in Section 4.2.

When comparing independent groups, modern improvements on the Wilcoxon-Mann-Whitney (WMW) test are another possibility, which are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second. (More details are provided in Section 3.) Additional possibilities are described in Wilcox (2017a), some of which are illustrated in Section 4 below.

In some situations, robust methods can have substantially higher power than any method based on means. It is not being suggested that robust methods always have more power. This is not the case. Rather, the point is that power can depend crucially on the conjunction of which estimator is used (for instance, the mean versus the median) and how a confidence interval is built (for instance, a parametric method or the percentile bootstrap). These choices are not trivial and must be taken into account when analyzing data.

2.2 Heteroscedasticity

It has been clear for some time that when using classic methods for comparing means, heteroscedasticity (unequal population variances) is a serious concern (e.g., Brown & Forsythe, 1974). Heteroscedasticity can impact both power and the Type I error probability.



Figure 8.42.4 Homoscedasticity (A) and heteroscedasticity (B). (A) The variance of the dependent variable is the same at any age. (B) The variance of the dependent variable can depend on the age of the participants.

The basic reason is that, under general conditions, methods that assume homoscedasticity are using an incorrect estimate of the standard error when in fact there is heteroscedasticity. Indeed, there are concerns regardless of how large the sample size might be. As we consider more and more complicated designs, heteroscedasticity becomes an increasing concern.

When dealing with regression, homoscedasticity means that the variance of the dependent variable does not depend on the value of the independent variable. When dealing with age and depressive symptoms, for example, homoscedasticity means that the variation in measures of depressive symptoms at age 23 is the same at age 80 or any age in between, as illustrated in Figure 8.42.4.

Independence implies homoscedasticity. So for this particular situation, classic methods associated with least squares regression, Pearson's correlation, Kendall's tau, and Spearman's rho are using a correct estimate of the standard error, which helps explain why they perform well in terms of Type I errors when there is no association. That is, when a homoscedastic method rejects, it is reasonable to conclude that there is an association, but in terms of inferring the nature of the association, these methods can perform poorly. Again, a practical concern is that when there is heteroscedasticity, homoscedastic methods use an incorrect estimate of the standard error, which can result in poor power and erroneous conclusions.

A seemingly natural way of salvaging homoscedastic methods is to test the assumption that there is homoscedasticity. But six studies summarized in Wilcox (2017c) found that this strategy is unsatisfactory. Presumably there are situations where this is not the case, but it is difficult to determine, based on the available data, when such situations are encountered.

Methods that are designed to deal with heteroscedasticity have been developed and are easily applied using extant software. These techniques use a correct estimate of the standard error regardless of whether the homoscedasticity assumption is true. A general recommendation is to always use a heteroscedastic method, given the goal of comparing measures of central tendency, or making inferences about regression parameters, as well as measures of association.



Figure 8.42.5 (A) Density functions for the standard normal distribution (solid line) and the mixed normal distribution (dashed line). These distributions have an obvious similarity, yet the variances are 1 and 10.9. (B) Box plots of means, medians, and 20% trimmed means when sampling from a normal distribution. (C) Box plots of means, medians, and 20% trimmed means when sampling from a mixed normal distribution. In panels B and C, each distribution has 10,000 values, and each of these values was obtained by computing the mean, median, or trimmed mean of 30 randomly generated observations.

2.3 Outliers

Even small departures from normality can devastate power. The modern illustration of this fact stems from Tukey (1960) and is based on what is generally known as a mixed normal distribution. The mixed normal considered by Tukey means that, with probability 0.9, an observation is sampled from a standard normal distribution; otherwise an observation is sampled from a normal distribution having mean zero and standard deviation 10. Figure 8.42.5A shows a standard normal distribution and the mixed normal discussed by Tukey. Note that in the center of the distributions, the mixed normal is below the normal distribution. But for the two ends of the mixed normal distribution (the tails), the mixed normal lies above the normal distribution. For this reason, the mixed normal is often described as having heavy tails. In general, heavy-tailed distributions roughly refer to distributions where outliers are likely to occur.

Here is an important point. The standard normal has variance 1, but the mixed normal has variance 10.9. That is, the population variance can be overly sensitive to slight changes in the tails of a distribution. Consequently, even a slight departure from normality can result in relatively poor power when using any method based on the mean.
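A quick numerical check of this claim in base R:

set.seed(10)
n <- 1e6
contaminated <- rbinom(n, 1, 0.1)   # 10% of observations are contaminated
x <- rnorm(n, sd = ifelse(contaminated == 1, 10, 1))
var(x)   # close to the theoretical 0.9 * 1 + 0.1 * 100 = 10.9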

Put another way, samples from the mixed normal are more likely to result in outliers compared to samples from a standard normal distribution. As previously indicated, outliers inflate the sample variance, which can negatively impact power when using means. Another concern is that they can give a distorted and misleading summary regarding the bulk of the participants.

The first indication that heavy-tailed distributions are a concern stems from a result derived by Laplace about two centuries ago. In modern terminology, he established that as we move from a normal distribution to a distribution more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hand, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that under normality the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A. Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller, contrary to what might be expected based on standard training.

Of course, a more important issue is whether the median or 20% trimmed mean ever have substantially smaller standard errors based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality. But as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and median. Because the choice between a 20% trimmed mean and median is not straightforward in terms of maximizing power, it is suggested that in exploratory studies both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense about the association among the bulk of the participants under study. In particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense about the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes when testing hypotheses and outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).
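A minimal usage sketch, assuming the R functions described in Section 4 have been sourced and using hypothetical data:

set.seed(11)
x <- rnorm(40)
y <- x + rnorm(40)
scor(x, y)     # skipped correlation
scorci(x, y)   # heteroscedastic confidence interval and p-value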

2.4 Curvature

Typically, a regression line is assumed to be straight. In some situations, this approach seems to suffice. However, it cannot be stressed too strongly that there is substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here it is merely remarked that among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother (Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
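For example, a sketch assuming the functions described in Section 4 have been sourced, with hypothetical curved data:

set.seed(12)
x <- runif(80, 0, 4)
y <- sin(x) + rnorm(80, sd = 0.3)   # hypothetical curved association
lplot(x, y)                # Cleveland's smoother
rplot(x, y)                # running-interval smoother (20% trimmed mean)
rplot(x, y, est = median)  # same smoother, using the median instead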

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model for some appropriate choice for the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that indeed the regression line is straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3. DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, generally these strategies are relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions

A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly in terms of controlling the Type I error when there is heteroscedasticity. Then various methods for testing homoscedasticity are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes

There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

|X − X̄| / s > 2    (Equation 8.42.1)

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The two last observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M| / (MAD/0.6745) > 2.24    (Equation 8.42.2)

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation, and of course M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only rather than use a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers (the two values of 71.53). But the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
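Both rules are easy to check on these values in base R; note that R's mad() already rescales by the constant 1.4826 = 1/0.6745:

x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
abs(x - mean(x)) / sd(x) > 2         # Equation 8.42.1: misses the outliers
abs(x - median(x)) / mad(x) > 2.24   # Equation 8.42.2: flags both 71.53 values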

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that the remaining data are dependent (they are correlated), which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue when using a 20% trimmed mean or median is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that, when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil-Sen estimator, as well as the MM-estimator, are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that, under general conditions, these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean; consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So, again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done with a distribution-free technique via the R function sintv2. Distribution-free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases. The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.

All recommended methods deal with heteroscedasticity. When comparing groups and distributions that differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory in terms of Type I errors when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample with replacement n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample with replacement m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y. Let D* = M*x − M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
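The following base-R sketch conveys the logic (medpb2_sketch is a hypothetical name, not the WRS function, which adds refinements; the p-value line follows the usual percentile bootstrap estimate based on the proportions of D* below and equal to zero):

```r
medpb2_sketch <- function(x, y, nboot = 2000, alpha = 0.05) {
  # Bootstrap distribution of the difference between sample medians
  dstar <- replicate(nboot,
    median(sample(x, replace = TRUE)) - median(sample(y, replace = TRUE)))
  # Generalized p-value based on the bootstrap distribution of D*
  pstar <- mean(dstar < 0) + 0.5 * mean(dstar == 0)
  list(ci = quantile(dstar, c(alpha / 2, 1 - alpha / 2)),
       p.value = 2 * min(pstar, 1 - pstar))
}
```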

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with a relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
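For readers who want to see the mechanics, here is a compact implementation of Yuen's test, assuming the standard formulas based on trimmed means and Winsorized variances (yuen_sketch is a hypothetical name; the WRS function yuen is the recommended tool):

```r
yuen_sketch <- function(x, y, tr = 0.2) {
  winvar <- function(v) {              # Winsorized sample variance
    n <- length(v)
    g <- floor(tr * n)
    vs <- sort(v)
    if (g > 0) {
      vs[1:g] <- vs[g + 1]             # Winsorize the lower tail
      vs[(n - g + 1):n] <- vs[n - g]   # Winsorize the upper tail
    }
    var(vs)
  }
  n1 <- length(x); n2 <- length(y)
  h1 <- n1 - 2 * floor(tr * n1)        # observations left after trimming
  h2 <- n2 - 2 * floor(tr * n2)
  d1 <- (n1 - 1) * winvar(x) / (h1 * (h1 - 1))
  d2 <- (n2 - 1) * winvar(y) / (h2 * (h2 - 1))
  tstat <- (mean(x, trim = tr) - mean(y, trim = tr)) / sqrt(d1 + d2)
  df <- (d1 + d2)^2 / (d1^2 / (h1 - 1) + d2^2 / (h2 - 1))
  2 * (1 - pt(abs(tstat), df))         # two-sided p-value
}
```

With tr = 0, this reduces to Welch's heteroscedastic t-test for means, as noted above.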

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0), and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
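In R, all pairwise differences are conveniently obtained with outer() (a minimal sketch reproducing the example above):

```r
x <- c(1, 2, 3)
y <- c(1, 2.5, 4)
d <- outer(x, y, "-")   # matrix of all pairwise differences Xi - Yk
mean(d < 0)             # estimate of P(X < Y)
#> 0.5555556            # = 5/9
```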

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

$$H_0\colon P(X < Y) = 0.5 \qquad \text{(Equation 8.42.3)}$$

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

$$H_0\colon \theta_D = 0 \qquad \text{(Equation 8.42.4)}$$

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullens, 2012). (We are not aware of any commercial software package that contains this method.)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).
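This symmetry check is easy to carry out on the pairwise differences (a minimal base-R sketch on artificial data; quantile() is one of several possible quantile estimators, whereas cbmhd, described below, uses the Harrell-Davis estimator):

```r
set.seed(2)
x <- rnorm(40)   # illustrative data only
y <- rexp(40)
d <- sort(outer(x, y, "-"))         # all pairwise differences
q <- quantile(d, c(0.25, 0.75))
q[1] + q[2]  # near zero if the distribution of D is symmetric about zero
```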

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then, for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

$$H_0\colon x_q + x_{1-q} = 0 \qquad \text{(Equation 8.42.5)}$$

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥10. For q = 0.1, sample sizes ≥20 should be used (Wilcox, 2012).

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions; no single estimator dominates.
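Under the definition just given, MOM takes only a few lines of R (a minimal sketch; mom_sketch is a hypothetical name, and the WRS package provides the implementation used with pb2gen below):

```r
mom_sketch <- function(x) {
  # Flag outliers with the MAD-median rule (Equation 8.42.2);
  # R's mad() already includes the 1/0.6745 scaling
  keep <- abs(x - median(x)) / mad(x) <= 2.24
  mean(x[keep])   # average the values that remain
}
```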

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM, as well as related estimators. The R function pb2gen applies this method.

4.4 Dependent Variables

Next consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD; the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.
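A small simulation makes the distinction concrete (a sketch on artificial, skewed, correlated data):

```r
set.seed(3)
x <- rexp(1000)             # skewed measures at time 1
y <- rexp(1000) + 0.3 * x   # correlated, skewed measures at time 2
median(x) - median(y)       # difference of the marginal medians
median(x - y)               # median of the difference scores: generally not equal
```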

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And, in general, it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions, it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; the actual level is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest, using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).

4.6 Comparing Independent Correlations and Regression Slopes

Next consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

$$H_0\colon \rho_1 = \rho_2 \qquad \text{(Equation 8.42.6)}$$

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association, because independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small; the amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.
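For intuition, here is the basic (unmodified) percentile bootstrap that twopcor refines; boot_cor_diff is a hypothetical name, and, as noted above, without the widening adjustment the resulting interval can be too short when samples are small:

```r
boot_cor_diff <- function(x1, y1, x2, y2, nboot = 2000) {
  rdiff <- replicate(nboot, {
    i <- sample(length(x1), replace = TRUE)  # resample pairs, group 1
    j <- sample(length(x2), replace = TRUE)  # resample pairs, group 2
    cor(x1[i], y1[i]) - cor(x2[j], y2[j])
  })
  quantile(rdiff, c(0.025, 0.975))  # basic 0.95 percentile interval
}
```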

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in their section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that these techniques do not reflect the strength of the empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions, described in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique, and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ, compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
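Hochberg's step-up adjustment is available in base R via p.adjust, so this strategy is easy to reproduce with any set of per-comparison p-values (the vector below is purely illustrative):

```r
pvals <- c(0.001, 0.004, 0.012, 0.030, 0.049)  # illustrative p-values
p.adjust(pvals, method = "hochberg") < 0.05    # which comparisons survive
```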

It is noted that, in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next consider durations, in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions, and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles, excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group, which received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test, followed by Bonferroni post hoc tests. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, the reverse decision is made.

Here, the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala. The y-axis shows the number of GAD-67-positive cells.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to instead use the Harrell-Davis estimator, mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, all of the methods used for the day 1 data have p-values ≤0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit; presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother, as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)
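Base R offers a quick version of this kind of check via lowess(), a smoother in the spirit of Cleveland's method (a sketch on artificial data; the WRS functions lplot and rplot provide the versions used below):

```r
set.seed(4)
x <- runif(53, 6, 16)                 # artificial ages
y <- 20 + 3 * x + rnorm(53, sd = 6)   # artificial outcome
plot(x, y)
lines(lowess(x, y), lwd = 2)  # visual check: is a straight line reasonable?
```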

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).


Figure 8.42.9 Nonparametric estimates of the regression line for predicting GORTFL with CSTL.

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, the regression line now appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, there now appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation between the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001); the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data, where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies; the R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ, or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means, and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ, in a way that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data, where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)
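As a brief sketch (again assuming the Rallfun functions have been sourced; the data are made up), the following flags multivariate outliers without suffering from masking:

set.seed(2)
m <- cbind(rnorm(50), rnorm(50))   # bivariate data, one point per row
m[1, ] <- c(6, 6)                  # plant a clear outlier
outpro(m)                          # projection-method outlier detection
outmgv(m)                          # minimum generalized variance method

Both functions report, among other things, which rows are flagged as outliers.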

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward via the R functions mentioned here: set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).
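A hypothetical sketch of this workflow, assuming the Rallfun functions have been sourced (tsreg is the Theil-Sen estimator in that collection):

set.seed(3)
x <- rnorm(50)
y <- x + rnorm(50)
x[1] <- 8                  # a leverage point among the independent variable
tsreg(x, y, xout = TRUE)   # Theil-Sen fit, leverage outliers removed
lplot(x, y, xout = TRUE)   # Cleveland's smoother, outliers in X removed
rplot(x, y, xout = TRUE)   # running-interval smoother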

If the goal is to compare dependent groups based on difference scores, and the sample sizes are very small, say ≤10, the following options perform relatively well over a broad range of situations, in terms of controlling the Type I error probability and providing accurate confidence intervals.

Use the sign test via the R function signt. If the sample sizes are ≤10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability; a usage sketch is given below. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.
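A usage sketch with made-up paired data, assuming the Rallfun functions have been sourced:

set.seed(4)
x <- rnorm(8)               # first measurement
y <- rnorm(8, mean = 0.5)   # second measurement, same participants
signt(x, y)                 # sign test; consider SD = TRUE when n <= 10
sintv2(x - y)               # median of the difference scores
trimcipb(x - y)             # 20% trimmed mean, percentile bootstrap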

If the goal is to compare two independent groups and the sample sizes are ≤10, here is a list of some methods that perform relatively well in terms of Type I errors.

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second, using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here. A brief usage sketch for the functions just listed follows.
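This sketch uses simulated skewed data and assumes the Rallfun functions have been sourced:

set.seed(5)
g1 <- rlnorm(10)
g2 <- rlnorm(10, meanlog = 0.8)
medpb2(g1, g2)    # medians, percentile bootstrap
cidv2(g1, g2)     # P(X < Y), an improved WMW-type analysis
trimpb2(g1, g2)   # 20% trimmed means, percentile bootstrap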

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
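For example, using the built-in R function p.adjust (the p-values below are hypothetical):

pvals <- c(0.001, 0.010, 0.020, 0.040, 0.300)   # raw p-values
p.adjust(pvals, method = "hochberg")            # Hochberg (1988) adjustment
p.adjust(pvals, method = "hommel")              # Hommel (1988) adjustment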

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect; there are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hand, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.

Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence intervals based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 9, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III tests of achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165



2. INSIGHTS REGARDING CONVENTIONAL METHODS

This section elaborates on the concerns with conventional methods for comparing groups and studying associations, stemming from the four insights previously indicated.

2.1 Skewed Distributions

A skewed distribution simply refers to a distribution that is not symmetric about some central value. An example is shown in Figure 8.42.1A. Such distributions occur naturally; an example relevant to the neurosciences is given in Section 5.2. Skewed distributions are a much more serious problem for statistical inferences than once thought, due to insights regarding the central limit theorem.

Consider the one-sample case. Conventional wisdom is that, with a relatively small sample size, normality can be assumed under random sampling. An implicit assumption was that, if the sample mean has approximately a normal distribution, then Student's t-test will perform reasonably well. It is now known that this is not necessarily the case, as illustrated, for example, in Wilcox (2017a, 2017b).

This point is illustrated here using a simulation that is performed in the following manner. Imagine that data are randomly sampled from the distribution shown in Figure 8.42.1A (a lognormal distribution) and that the mean is computed based on a sample size of n = 30. Repeating this process 5,000 times, the thick black line in Figure 8.42.1B shows a plot of the resulting sample means; the thin gray line is the plot of the means when sampling from a normal distribution instead. The distribution of T values for samples of n = 30 is indicated by the thick black line in Figure 8.42.1C; the thin gray line is the distribution of T values when sampling from a normal distribution. As can be seen, the actual T distribution extends out much further to the left compared to the distribution of T under normality. That is, in the current example, sampling from a skewed distribution leads to much more extreme values than expected by chance under normality, which in turn results in more false positive results than expected when the null hypothesis is true.

Suppose the goal is to test some hypothesis at the 0.05 level. Bradley (1978) suggests that, as a general guide, control over the probability of a Type I error is minimally satisfactory if the actual level is between 0.025 and 0.075.

Figure 8.42.1 (A) Example of a skewed (lognormal) distribution. (B) Distribution of the sample mean under normality (thin line, n = 30) and the actual distribution based on a simulation; each sample mean was computed based on 30 observations randomly sampled from the distribution shown in A. (C) and (D) Comparison of the theoretical T distribution with 29 degrees of freedom to distributions of 5,000 T values (n = 30 in panel C, n = 100 in panel D); again, the T values were computed from observations sampled from the distribution in A.


Figure 8.42.2 Type I error probability as a function of sample size. The Type I error probability was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and a lognormal distribution. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was zero. The black horizontal line marks the expected 0.05 Type I error probability. The gray area marks Bradley's satisfactory range. When sampling from a normal distribution, all methods are close to the nominal 0.05 level, except the trimmed mean for very small sample sizes. When sampling is from a lognormal distribution, the mean and the trimmed mean give rise to too many false alarms for small sample sizes. The mean continues to give higher false positive rates than the other techniques, even with n = 500.

When we test at the 0.05 level, we expect 5% of the t-tests to be significant. However, when sampling from the skewed distribution considered here, this is not the case: the actual Type I error probability is 0.111.

Figure 8.42.1D shows the distribution of T when n = 100. Now the Type I error probability is 0.082, again when testing at the 0.05 level. Based on Bradley's criterion, a sample size of about 130 or larger is required. Bradley (1978) goes on to suggest that, ideally, the actual Type I error probability should be between 0.045 and 0.055. Now n = 600 is unsatisfactory; the actual level is 0.057. With n = 700, the level is 0.055.

Before continuing, it is noted that the median belongs to the class of trimmed means, which refers to the strategy of trimming a specified proportion of the smallest and largest values and averaging the values that remain. For example, if n = 10, 10% trimming means that the lowest and highest values are removed and that the remaining data are averaged. Similarly, 20% trimming would remove the two smallest and two largest values. Based on conventional training, trimming might seem counterintuitive. In some situations, however, it can substantially increase our ability to control the Type I error probability, as illustrated next, and trimming can substantially increase power as well, for reasons to be explained.
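In R, trimming is built into the mean function, so for example:

x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 50)
mean(x)               # 10.4, pulled upward by the extreme value
mean(x, trim = 0.2)   # 6.5, the average of the middle six values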

First, focus on controlling the probability of a Type I error (Section 2.3 illustrates one of the reasons that methods based on means can have relatively low power). Figure 8.42.2 illustrates the Type I error probability as a function of the sample size when using the mean, when using the median, and when sampling from the asymmetric (lognormal) distribution in Figure 8.42.1A. Inferences based on the 20% trimmed mean were made via the method derived by Tukey and McLaughlin (1963); inferences based on the median were made via the method derived by Hettmansperger and Sheather (1986). (The software used to apply these latter two methods is contained in the R package described at the beginning of Section 4.) Also shown is the Type I error probability when sampling from a normal distribution. The gray area indicates Bradley's criterion.

Figure 8.42.3 illustrates the association between power and the sample size for the distributions used in Figure 8.42.2. As can be seen, under normality the sample mean is best, followed closely by the 20% trimmed mean. The median is least satisfactory when dealing with a normal distribution, as expected.


Figure 8.42.3 Power as a function of sample size. The probability of a true positive was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and the (lognormal) distribution shown in Figure 8.42.1A. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was 0.5. The black horizontal line marks the conventional 80% power threshold. When sampling from a normal distribution, all methods require <50 observations to achieve 80% power, and the mean appears to have higher power at lower sample sizes than the other methods. When sampling from a lognormal distribution, power drops dramatically for the mean, but not for the median and the trimmed mean; power for the median actually improves. The exact pattern of results depends on the effect size and the asymmetry of the distribution we sample from, so we strongly encourage readers to perform their own detailed power analyses.

However, for the asymmetric (lognormal) distribution, the median performs best and the mean performs very poorly.

A feature of random samples taken from the distribution in Figure 8.42.1A is that the expected proportion of points declared an outlier is relatively small. For skewed distributions, as we move toward situations where outliers are more common, a sample size >300 can be required to achieve reasonably good control over the Type I error probability. That is, control over the Type I error probability is a function of both the degree to which a distribution is skewed and the likelihood of encountering outliers. However, there are methods that perform reasonably well with small sample sizes, as indicated in Section 4.1.

For symmetric distributions where outliers tend to occur, the reverse can happen: the actual Type I error probability can be substantially less than the nominal level. This happens because outliers inflate the standard deviation, which in turn lowers the value of T, which in turn can negatively impact power. Section 2.3 elaborates on this issue.

In an important sense, outliers have a larger impact on the sample variance than the sample mean, which impacts the t-test. To illustrate this point, imagine the goal is to test H0: μ = 1 based on the following values: 1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7. Then T = 4.69, the p-value is p = 0.002, and the 0.95 confidence interval is [1.45, 2.35]. Now, including the value 8, the mean increases from 1.9 to 2.58, suggesting at some level that there is stronger evidence for rejecting the null hypothesis. However, this outlier increases the standard deviation from 0.54 to 2.1, and now T = 2.26 and p = 0.054. The 0.95 confidence interval is [−0.033, 3.19].
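These numbers are easily verified in base R:

x <- c(1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7)
t.test(x, mu = 1)         # T = 4.69, p = 0.002, CI [1.45, 2.35]
t.test(c(x, 8), mu = 1)   # T = 2.26, p = 0.054; the CI now contains 1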

Now consider the goal of comparing two independent or dependent groups. If the groups have identical distributions, then difference scores have a symmetric distribution, and the probability of a Type I error is, in general, less than the nominal level when using conventional methods based on means. Now, in addition to outliers, differences in skewness create practical concerns when using Student's t-test. Indeed, under general conditions, the two-sample Student's t-test for independent groups is not even asymptotically correct, roughly because the standard error of the difference between the sample means is not estimated correctly (e.g., Cressie & Whitford, 1986).


Moreover, Student's t-test can be biased. This means that the probability of rejecting the null hypothesis of equal means can be higher when the population means are equal compared to situations where the population means differ. Roughly, this concern arises because the distribution of T can be skewed, and in fact the mean of T can differ from zero even though the null hypothesis is true. For a more detailed explanation, see Wilcox (2017c, section 5.5). Problems persist when Student's t-test is replaced by Welch's (1938) method, which is designed to compare means in a manner that allows unequal variances. Put another way, if the goal is to test the hypothesis that two groups have identical distributions, conventional methods based on means perform well in terms of controlling the Type I error probability. But if the goal is to compare the population means, and if distributions differ, conventional methods can perform poorly.

There are many techniques that perform well when dealing with skewed distributions, in terms of controlling the Type I error probability, some of which are based on the usual sample median (Wilcox, 2017a, 2017c). Both theory and simulations indicate that, as the amount of trimming increases, the ability to control the probability of a Type I error increases as well. Moreover, for reasons to be explained, trimming can substantially increase power, a result that is not obvious based on conventional training. The optimal amount of trimming depends on the characteristics of the population distributions, which are unknown. Currently, the best that can be said is that the choice can make a substantial difference. The 20% trimmed mean has been studied extensively and often provides a good compromise between the two extremes: no trimming (the mean) and the maximum amount of trimming (the median).

In various situations, particularly important are inferential methods based on what are called bootstrap techniques. Two basic versions are the bootstrap-t and the percentile bootstrap. Roughly, rather than assume normality, bootstrap-t methods perform a simulation using the observed data that yields an estimate of an appropriate critical value and a p-value. So values of T are generated as done in Figure 8.42.1, only data are sampled, with replacement, from the observed data. In essence, bootstrap-t methods generate data-driven T distributions expected by chance if there were no effect. The percentile bootstrap proceeds in a similar manner, only, when dealing with a trimmed mean, for example, the goal is to determine the distribution of the sample trimmed mean, which can then be used to compute a p-value and a confidence interval. When comparing two independent groups based on the usual sample median, if there are tied (duplicated) values, currently the only method that performs well in simulations, in terms of controlling the Type I error probability, is based on a percentile bootstrap method (compare to Table 5.3 in Wilcox, 2017c). Section 4.1 elaborates on how this method is performed.
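To convey the idea, here is a minimal base R sketch of a percentile bootstrap for a 20% trimmed mean (the functions described in Section 4 handle this case, and the two-group case, with additional refinements):

set.seed(7)
x <- rlnorm(30)
boot.tm <- replicate(2000, mean(sample(x, replace = TRUE), trim = 0.2))
quantile(boot.tm, c(0.025, 0.975))   # 0.95 percentile bootstrap CI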

If the amount of trimming is close to zero, the bootstrap-t method is preferable to the percentile bootstrap method. But as the amount of trimming increases, at some point a percentile bootstrap method is preferable. This is the case with 20% trimming (e.g., Wilcox, 2017a); it seems to be the case with 10% trimming as well, but a definitive study has not been made.

Also, if the goal is to reflect the typical response, it is evident that the median, or even a 20% trimmed mean, might be more satisfactory. Using quantiles (percentiles) other than the median can be important as well, for reasons summarized in Section 4.2.

When comparing independent groups, modern improvements on the Wilcoxon-Mann-Whitney (WMW) test are another possibility; these are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second. (More details are provided in Section 3.) Additional possibilities are described in Wilcox (2017a), some of which are illustrated in Section 4 below.

In some situations, robust methods can have substantially higher power than any method based on means. It is not being suggested that robust methods always have more power; this is not the case. Rather, the point is that power can depend crucially on the conjunction of which estimator is used (for instance, the mean versus the median) and how a confidence interval is built (for instance, a parametric method or the percentile bootstrap). These choices are not trivial and must be taken into account when analyzing data.

2.2 Heteroscedasticity

It has been clear for some time that, when using classic methods for comparing means, heteroscedasticity (unequal population variances) is a serious concern (e.g., Brown & Forsythe, 1974). Heteroscedasticity can impact both power and the Type I error probability.


Figure 8.42.4 Homoscedasticity (A) and heteroscedasticity (B). (A) The variance of the dependent variable is the same at any age. (B) The variance of the dependent variable can depend on the age of the participants.

The basic reason is that, under general conditions, methods that assume homoscedasticity are using an incorrect estimate of the standard error when in fact there is heteroscedasticity. Indeed, there are concerns regardless of how large the sample size might be. As we consider more and more complicated designs, heteroscedasticity becomes an increasing concern.

When dealing with regression, homoscedasticity means that the variance of the dependent variable does not depend on the value of the independent variable. When dealing with age and depressive symptoms, for example, homoscedasticity means that the variation in measures of depressive symptoms at age 23 is the same as at age 80, or any age in between, as illustrated in Figure 8.42.4.

Independence implies homoscedasticity. So, for this particular situation, classic methods associated with least squares regression, Pearson's correlation, Kendall's tau, and Spearman's rho are using a correct estimate of the standard error, which helps explain why they perform well, in terms of Type I errors, when there is no association. That is, when a homoscedastic method rejects, it is reasonable to conclude that there is an association; but, in terms of inferring the nature of the association, these methods can perform poorly. Again, a practical concern is that, when there is heteroscedasticity, homoscedastic methods use an incorrect estimate of the standard error, which can result in poor power and erroneous conclusions.

A seemingly natural way of salvaging homoscedastic methods is to test the assumption that there is homoscedasticity. But six studies summarized in Wilcox (2017c) found that this strategy is unsatisfactory. Presumably, situations are encountered where this is not the case, but it is unclear how to determine, based on the available data, when such situations are encountered.

Methods that are designed to deal with heteroscedasticity have been developed and are easily applied using extant software. These techniques use a correct estimate of the standard error regardless of whether the homoscedasticity assumption is true. A general recommendation is to always use a heteroscedastic method, given the goal of comparing measures of central tendency, or making inferences about regression parameters, as well as measures of association.


Figure 8.42.5 (A) Density functions for the standard normal distribution (solid line) and the mixed normal distribution (dashed line). These distributions have an obvious similarity, yet the variances are 1 and 10.9. (B) Box plots of means, medians, and 20% trimmed means when sampling from a normal distribution. (C) Box plots of means, medians, and 20% trimmed means when sampling from a mixed normal distribution. In panels B and C, each distribution has 10,000 values, and each of these values was obtained by computing the mean, median, or trimmed mean of 30 randomly generated observations.

2.3 Outliers

Even small departures from normality can devastate power. The modern illustration of this fact stems from Tukey (1960) and is based on what is generally known as a mixed normal distribution. The mixed normal considered by Tukey means that, with probability 0.9, an observation is sampled from a standard normal distribution; otherwise, an observation is sampled from a normal distribution having mean zero and standard deviation 10. Figure 8.42.5A shows a standard normal distribution and the mixed normal discussed by Tukey. Note that, in the center of the distributions, the mixed normal is below the normal distribution. But for the two ends of the mixed normal distribution (the tails), the mixed normal lies above the normal distribution. For this reason, the mixed normal is often described as having heavy tails. In general, heavy-tailed distributions roughly refer to distributions where outliers are likely to occur.

Here is an important point: the standard normal has variance 1, but the mixed normal has variance 10.9. That is, the population variance can be overly sensitive to slight changes in the tails of a distribution. Consequently, even a slight departure from normality can result in relatively poor power when using any method based on the mean.

Put another way, samples from the mixed normal are more likely to result in outliers compared to samples from a standard normal distribution. As previously indicated, outliers inflate the sample variance, which can negatively impact power when using means. Another concern is that they can give a distorted and misleading summary regarding the bulk of the participants.

The first indication that heavy-tailed distributions are a concern stems from a result derived by Laplace about two centuries ago. In modern terminology, he established that, as we move from a normal distribution to a distribution more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hand, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that, under normality, the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent to which this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A; Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller, contrary to what might be expected based on standard training.

Of course, a more important issue is whether the median or 20% trimmed mean ever have substantially smaller standard errors based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality; but, as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and median. Because the choice between a 20% trimmed mean and median is not straightforward, in terms of maximizing power, it is suggested that, in exploratory studies, both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense of the association among the bulk of the participants under study; in particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense of the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes when testing hypotheses after outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).
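A brief sketch with simulated data, assuming the Rallfun functions have been sourced:

set.seed(8)
x <- rnorm(40)
y <- x + rnorm(40)
scor(x, y)     # skipped correlation: remove projection outliers, then Pearson
scorci(x, y)   # heteroscedasticity-consistent CI and p-value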

2.4 Curvature

Typically, a regression line is assumed to be straight. In some situations, this approach seems to suffice. However, it cannot be stressed too strongly that there is substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here, it is merely remarked that, among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother (Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration; readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
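For instance, with made-up data exhibiting curvature (assuming the Rallfun functions have been sourced):

set.seed(9)
x <- rnorm(60)
y <- x^2 + rnorm(60)        # a curved association
rplot(x, y)                 # 20% trimmed mean of Y given X (the default)
rplot(x, y, est = median)   # the same smoother, using the median instead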

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model, for some appropriate choice of the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that the regression line is indeed straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother; if there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3. DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, these strategies are generally relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions

A seemingly natural strategy is to test assumptions: in particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption, with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly, in terms of controlling the Type I error, when there is heteroscedasticity. Then various methods for testing homoscedasticity are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression, as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes

There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation: for example, declare the value X an outlier if

|X − X̄| / s > 2

Equation 8.42.1

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The two last observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M| / (MAD/0.6745) > 2.24

Equation 8.42.2

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation and, of course, M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only, rather than using a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the two outliers (both values of 71.53), but the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking: if the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
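The MAD-median rule requires no special software; in base R:

x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
abs(x - median(x)) / mad(x) > 2.24   # TRUE only for the two values of 71.53
# mad() rescales by default: it multiplies by 1.4826 = 1/0.6745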

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that the remaining data are dependent (they are correlated), which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue, when using a 20% trimmed mean or median, is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used; they use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that, when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil-Sen estimator, as well as the MM-estimator, are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory, for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989).


The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that, under general conditions, these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367; also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions; so, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable to the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean; consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So, again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution-free technique via the R function sintv2. Distribution free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases.


The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software/.
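
In practice, this amounts to something like the following (a sketch; the exact devtools call follows the repository's README at the time of writing, and the file name will change with new releases):

    # Option 1: install the WRS package from GitHub (requires devtools)
    # install.packages("devtools")
    # devtools::install_github("nicebread/WRS", subdir = "pkg")

    # Option 2: source all functions directly after downloading the file
    source("Rallfun-v33.txt")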

All recommended methods deal with heteroscedasticity. When comparing groups and distributions that differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory in terms of Type I errors when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample with replacement n observations; this yields what is generally called a bootstrap sample. For the second group, randomly sample with replacement m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y, and let D* = M*x − M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
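
For instance, with hypothetical data (a sketch; medpb2's defaults for the number of bootstrap samples and the alpha level follow Wilcox, 2017a, 2017c):

    set.seed(3)
    g1 <- rexp(8)          # hypothetical skewed data, n = 8
    g2 <- rexp(9) + 0.5    # second group, m = 9
    medpb2(g1, g2)         # p-value and CI for the difference between medians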

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with a relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method used in conjunction with the median continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
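
Continuing the sketch above (hypothetical data g1 and g2):

    yuen(g1, g2)            # Yuen's method; 20% trimming by default
    yuen(g1, g2, tr = 0)    # no trimming: reduces to Welch's method for means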

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0) and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
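
This estimate is easily computed in base R; the following reproduces the numerical example just given:

    x <- c(1, 2, 3)
    y <- c(1, 2.5, 4)
    D <- outer(x, y, "-")   # all pairwise differences Xi - Yk
    mean(D < 0)             # estimate of P(X < Y): 5/9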

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5.     (Equation 8.42.3)

If this hypothesis is rejected, it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0.     (Equation 8.42.4)

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullens, 2012). (We are not aware of any commercial software package that contains this method.)
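
With the hypothetical data introduced earlier, the test is a one-liner:

    cidv2(g1, g2)    # test of H0: P(X < Y) = 0.5, allowing heteroscedasticity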

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.
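
For example (a sketch; it is assumed here that the argument q specifies which quantiles are compared, with the deciles requested explicitly):

    qcomhd(g1, g2, q = seq(0.1, 0.9, 0.1))   # compare the deciles of two groups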

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then, for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0     (Equation 8.42.5)

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are at least 10. For q = 0.1, sample sizes of at least 20 should be used (Wilcox, 2012).
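
For instance (a sketch; the q argument is assumed to specify the quantile, as in Wilcox, 2017a):

    cbmhd(g1, g2, q = 0.25)   # test H0: x_0.25 + x_0.75 = 0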

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions; no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM as well as related estimators. The R function pb2gen applies this method.
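
For example (a sketch; the est argument is assumed to accept the WRS function mom, as described in Wilcox, 2017a):

    pb2gen(g1, g2, est = mom)   # percentile bootstrap comparing MOM estimates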

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables; that is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD; the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And in general, it


is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.
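
As a sketch with hypothetical paired data:

    set.seed(4)
    x <- rnorm(20)
    y <- x + rnorm(20, mean = 0.3)   # correlated second measurement
    sintv2(x - y)    # exact-coverage CI for the median of the difference scores
    trimpb(x - y)    # percentile bootstrap, 20% trimmed mean of the differences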

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.
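
Continuing the paired-data sketch above:

    signt(x, y)               # sign test, Agresti-Coull method by default
    signt(x, y, SD = TRUE)    # Schilling-Doi version: guaranteed coverage, slower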

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions, it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; the probability of one or more Type I errors is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
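
For example, with three hypothetical groups stored in a list:

    set.seed(5)
    grps <- list(rnorm(15), rnorm(15, mean = 0.5), rexp(15))
    lincon(grps)      # all pairwise comparisons based on 20% trimmed means
    linconpb(grps)    # percentile bootstrap version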

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2.     (Equation 8.42.6)

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association (independence implies homoscedasticity), but under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval; if it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small, with the amount of widening depending on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.
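
As a sketch (with x1 and y1 holding the data for the first group, x2 and y2 for the second; hypothetical variable names):

    twohc4cor(x1, y1, x2, y2)   # test H0: rho1 = rho2 via the HC4 estimator
    twopcor(x1, y1, x2, y2)     # modified bootstrap; 0.05 level only, no p-value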

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.
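
For instance, with the same hypothetical variables (argument conventions assumed to parallel the correlation functions above):

    ols2ci(x1, y1, x2, y2)   # compare least squares slopes, heteroscedasticity allowed
    reg2ci(x1, y1, x2, y2)   # compare robust slopes (Theil-Sen by default)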

4.7 Comparing Correlations: the Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be


used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992, section 7.2.3) as well as Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.
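
As a sketch (assuming y is the dependent variable and x is a matrix with one column per independent variable; how groups of columns are specified for comparison is left to the function's documentation):

    # x: n-by-2 matrix of independent variables, y: dependent variable
    regIVcom(x, y)   # compare the association strengths of the two columns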

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these

assumptions is a serious practical concern; violating two or more of them makes matters worse. There is now a vast array of more modern methods, described in chapter 12 of Wilcox (2017a), that deal with violations of all of these assumptions. These newer techniques can substantially increase power compared to the classic ANCOVA technique and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
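
A sketch of this workflow, with the covariate and outcome for each group in hypothetical variables x1, y1, x2, y2 (argument conventions assumed to match the functions named above):

    lplot2g(x1, y1, x2, y2, xout = TRUE)   # plot both smooths, outliers in x removed
    ancova(x1, y1, x2, y2)                 # robust ANCOVA, no parallelism assumption
    ancdet(x1, y1, x2, y2)                 # method TAP: more detailed comparison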

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but, because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back,


and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.

It is noted that, in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95; that is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations, shown in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions, and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles, excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group, which received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, the reverse decision is made.

Here, the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but, due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala. Conditions: sham, day 1, and day 7; y-axis: number of GAD-67-positive cells in the basolateral amygdala.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, all of the methods used for the day 1 data now have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit; presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother, as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).


Figure 8.42.9 Nonparametric estimates of the regression line for predicting GORTFL with CSTL. (Axis labels in the panels: A, age vs. CSTL; B, age vs. GORTFL; C, CSTL vs. GORTFL; D, GORTFL vs. CSTL.)

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederholt & Bryant, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, the regression line now appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, there now appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci),


p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies; the R function splot is one possibility. When comparing two groups, consider the R function plotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical


difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ, or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but, relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ, which has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.11.7, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals.

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores, using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores, using the command


trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are less than or equal to 10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be greater than 0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But the more tests that are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
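For illustration, here is a minimal sketch, using made-up p-values, of how these adjustments can be obtained with the base R function p.adjust:

# Hypothetical p-values from five tests
pvals <- c(0.01, 0.02, 0.04, 0.20, 0.30)
p.adjust(pvals, method = "hochberg")  # Hochberg's (1988) sharper Bonferroni procedure
p.adjust(pvals, method = "hommel")    # Hommel's (1988) stagewise procedure

A hypothesis is rejected at the 0.05 level if its adjusted p-value is less than or equal to 0.05.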

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence intervals based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics - Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLoS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics - Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165



Figure 8.42.2 Type I error probability as a function of sample size. The Type I error probability was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and a lognormal distribution. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was zero. The black horizontal line marks the expected 0.05 Type I error probability. The gray area marks Bradley's satisfactory range. When sampling from a normal distribution, all methods are close to the nominal 0.05 level, except the trimmed mean for very small sample sizes. When sampling is from a lognormal distribution, the mean and the trimmed mean give rise to too many false alarms for small sample sizes. The mean continues to give higher false positive rates than the other techniques even with n = 500.

of the t-tests to be significant. However, when sampling from the skewed distribution considered here, this is not the case: the actual Type I error probability is 0.111.

Figure 8.42.1D shows the distribution of T when n = 100. Now the Type I error probability is 0.082, again when testing at the 0.05 level. Based on Bradley's criterion, a sample size of about 130 or larger is required. Bradley (1978) goes on to suggest that, ideally, the actual Type I error probability should be between 0.045 and 0.055. Now n = 600 is unsatisfactory; the actual level is 0.057. With n = 700, the level is 0.055.

Before continuing, it is noted that the median belongs to the class of trimmed means, which refers to the strategy of trimming a specified proportion of the smallest and largest values and averaging the values that remain. For example, if n = 10, 10% trimming means that the lowest and highest values are removed and that the remaining data are averaged. Similarly, 20% trimming would remove the two smallest and two largest values. Based on conventional training, trimming might seem counterintuitive. In some situations, however, it can substantially increase our ability to control the Type I error probability, as illustrated next, and trimming can substantially increase power as well, for reasons to be explained.
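In R, a trimmed mean can be computed directly via the trim argument of the base function mean; the following small example uses made-up data:

# With n = 10, 20% trimming removes the two smallest and two largest values
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 11, 50)
mean(x)              # ordinary mean: 11.2, inflated by the value 50
mean(x, trim = 0.2)  # 20% trimmed mean: average of 5, 6, 7, 8, 9, 10 = 7.5
median(x)            # maximum amount of trimming: 7.5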

First, focus on controlling the probability of a Type I error. (Section 2.3 illustrates one of the reasons that methods based on means can have relatively low power.) Figure 8.42.2 illustrates the Type I error probability as a function of the sample size when using the mean, when using the median, and when sampling from the asymmetric (lognormal) distribution in Figure 8.42.1A. Inferences based on the 20% trimmed mean were made via the method derived by Tukey and McLaughlin (1963). Inferences based on the median were made via the method derived by Hettmansperger and Sheather (1986). (The software used to apply these latter two methods is contained in the R package described at the beginning of Section 4.) Also shown is the Type I error probability when sampling from a normal distribution. The gray area indicates Bradley's criterion.

Figure 8.42.3 illustrates the association between power and the sample size for the distributions used in Figure 8.42.2. As can be seen, under normality the sample mean is best, followed closely by the 20% trimmed mean. The median is least satisfactory when dealing with a normal distribution, as expected.



Figure 8.42.3 Power as a function of sample size. The probability of a true positive was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and the (lognormal) distribution shown in Figure 8.42.1A. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was 0.5. The black horizontal line marks the conventional 80% power threshold. When sampling from a normal distribution, all methods require fewer than 50 observations to achieve 80% power, and the mean appears to have higher power at lower sample sizes than the other methods. When sampling from a lognormal distribution, power drops dramatically for the mean but not for the median and the trimmed mean. Power for the median actually improves. The exact pattern of results depends on the effect size and the asymmetry of the distribution we sample from, so we strongly encourage readers to perform their own detailed power analyses.

However, for the asymmetric (lognormal) distribution, the median performs best and the mean performs very poorly.

A feature of random samples taken from the distribution in Figure 8.42.1A is that the expected proportion of points declared an outlier is relatively small. For skewed distributions, as we move toward situations where outliers are more common, a sample size greater than 300 can be required to achieve reasonably good control over the Type I error probability. That is, control over the Type I error probability is a function of both the degree a distribution is skewed and the likelihood of encountering outliers. However, there are methods that perform reasonably well with small sample sizes, as indicated in Section 4.1.

For symmetric distributions where outliers tend to occur, the reverse can happen: the actual Type I error probability can be substantially less than the nominal level. This happens because outliers inflate the standard deviation, which in turn lowers the value of T, which in turn can negatively impact power. Section 2.3 elaborates on this issue.

In an important sense, outliers have a larger impact on the sample variance than the sample mean, which impacts the t-test. To illustrate this point, imagine the goal is to test H0: μ = 1 based on the following values: 1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7. Then T = 4.69, the p-value is p = 0.002, and the 0.95 confidence interval is [1.45, 2.35]. Now, including the value 8, the mean increases from 1.9 to 2.58, suggesting at some level there is stronger evidence for rejecting the null hypothesis. However, this outlier increases the standard deviation from 0.54 to 2.1, and now T = 2.26 and p = 0.054. The 0.95 confidence interval is [−0.033, 3.19].
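These computations are easily reproduced with the base R function t.test:

x <- c(1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7)
t.test(x, mu = 1)        # T = 4.69, p = 0.002, 0.95 CI [1.45, 2.35]
t.test(c(x, 8), mu = 1)  # with the outlier: T = 2.26, p = 0.054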

Now consider the goal of comparing two independent or dependent groups. If the groups have identical distributions, then difference scores have a symmetric distribution, and the probability of a Type I error is, in general, less than the nominal level when using conventional methods based on means. Now, in addition to outliers, differences in skewness create practical concerns when using Student's t-test. Indeed, under general conditions, the two-sample Student's t-test for independent groups is not even asymptotically correct, roughly because the standard error of the difference between the sample means is not estimated correctly (e.g., Cressie & Whitford, 1986). Moreover,


Student's t-test can be biased. This means that the probability of rejecting the null hypothesis of equal means can be higher when the population means are equal compared to situations where the population means differ. Roughly, this concern arises because the distribution of T can be skewed, and in fact the mean of T can differ from zero even though the null hypothesis is true. For a more detailed explanation, see Wilcox (2017c, section 5.5). Problems persist when Student's t-test is replaced by Welch's (1938) method, which is designed to compare means in a manner that allows unequal variances. Put another way, if the goal is to test the hypothesis that two groups have identical distributions, conventional methods based on means perform well in terms of controlling the Type I error probability. But if the goal is to compare the population means, and if distributions differ, conventional methods can perform poorly.

There are many techniques that perform well when dealing with skewed distributions, in terms of controlling the Type I error probability, some of which are based on the usual sample median (Wilcox, 2017a, 2017c). Both theory and simulations indicate that, as the amount of trimming increases, the ability to control the probability of a Type I error increases as well. Moreover, for reasons to be explained, trimming can substantially increase power, a result that is not obvious based on conventional training. The optimal amount of trimming depends on the characteristics of the population distributions, which are unknown. Currently, the best that can be said is that the choice can make a substantial difference. The 20% trimmed mean has been studied extensively and often provides a good compromise between the two extremes: no trimming (the mean) and the maximum amount of trimming (the median).

In various situations, particularly important are inferential methods based on what are called bootstrap techniques. Two basic versions are the bootstrap-t and the percentile bootstrap. Roughly, rather than assume normality, bootstrap-t methods perform a simulation using the observed data that yields an estimate of an appropriate critical value and a p-value. So values of T are generated as done in Figure 8.42.1, only data are sampled with replacement from the observed data. In essence, bootstrap-t methods generate data-driven T distributions expected by chance if there were no effect. The percentile bootstrap proceeds in a similar manner, only, when dealing with a trimmed mean, for example, the goal is to determine the distribution of the sample trimmed mean, which can then be used to compute a p-value and a confidence interval. When comparing two independent groups based on the usual sample median, if there are tied (duplicated) values, currently the only method that performs well in simulations, in terms of controlling the Type I error probability, is based on a percentile bootstrap method (compare to Table 5.3 in Wilcox, 2017c). Section 4.1 elaborates on how this method is performed.
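To make the idea concrete, here is a minimal one-sample sketch of a percentile bootstrap confidence interval for a 20% trimmed mean; the function name pb.trim is ours, and the R functions described in Section 4 implement refined versions of this strategy:

pb.trim <- function(x, nboot = 2000, tr = 0.2, alpha = 0.05) {
  # compute the trimmed mean of nboot bootstrap samples
  boot <- replicate(nboot,
                    mean(sample(x, length(x), replace = TRUE), trim = tr))
  # the middle 95% of the bootstrap estimates forms the confidence interval
  quantile(boot, c(alpha / 2, 1 - alpha / 2))
}
set.seed(1)
pb.trim(rlnorm(30))  # a lognormal sample, as in Figure 8.42.1A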

If the amount of trimming is close to zero, the bootstrap-t method is preferable to the percentile bootstrap method. But as the amount of trimming increases, at some point a percentile bootstrap method is preferable. This is the case with 20% trimming (e.g., Wilcox, 2017a). It seems to be the case with 10% trimming as well, but a definitive study has not been made.

Also, if the goal is to reflect the typical response, it is evident that the median, or even a 20% trimmed mean, might be more satisfactory. Using quantiles (percentiles) other than the median can be important as well, for reasons summarized in Section 4.2.

When comparing independent groups, modern improvements on the Wilcoxon-Mann-Whitney (WMW) test are another possibility, which are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second. (More details are provided in Section 3.) Additional possibilities are described in Wilcox (2017a), some of which are illustrated in Section 4 below.

In some situations, robust methods can have substantially higher power than any method based on means. It is not being suggested that robust methods always have more power. This is not the case. Rather, the point is that power can depend crucially on the conjunction of which estimator is used (for instance, the mean versus the median) and how a confidence interval is built (for instance, a parametric method or the percentile bootstrap). These choices are not trivial and must be taken into account when analyzing data.

2.2 Heteroscedasticity

It has been clear for some time that, when using classic methods for comparing means, heteroscedasticity (unequal population variances) is a serious concern (e.g., Brown & Forsythe, 1974). Heteroscedasticity can impact both power and the Type I error probability. The basic reason is that, under general



Figure 8.42.4 Homoscedasticity (A) and heteroscedasticity (B). (A) The variance of the dependent variable is the same at any age. (B) The variance of the dependent variable can depend on the age of the participants.

conditions, methods that assume homoscedasticity are using an incorrect estimate of the standard error when in fact there is heteroscedasticity. Indeed, there are concerns regardless of how large the sample size might be. As we consider more and more complicated designs, heteroscedasticity becomes an increasing concern.

When dealing with regression, homoscedasticity means that the variance of the dependent variable does not depend on the value of the independent variable. When dealing with age and depressive symptoms, for example, homoscedasticity means that the variation in measures of depressive symptoms at age 23 is the same at age 80, or any age in between, as illustrated in Figure 8.42.4.

Independence implies homoscedasticity. So, for this particular situation, classic methods associated with least squares regression, Pearson's correlation, Kendall's tau, and Spearman's rho are using a correct estimate of the standard error, which helps explain why they perform well in terms of Type I errors when there is no association. That is, when a homoscedastic method rejects, it is reasonable to conclude that there is an association, but in terms of inferring the nature of the association, these methods can perform poorly. Again, a practical concern is that, when there is heteroscedasticity, homoscedastic methods use an incorrect estimate of the standard error, which can result in poor power and erroneous conclusions.

A seemingly natural way of salvaging homoscedastic methods is to test the assumption that there is homoscedasticity. But six studies summarized in Wilcox (2017c) found that this strategy is unsatisfactory. Presumably, situations are encountered where this is not the case, but it is difficult and unclear how to determine when such situations are encountered.

Methods that are designed to deal with heteroscedasticity have been developed and are easily applied using extant software. These techniques use a correct estimate of the standard error regardless of whether the homoscedasticity assumption is true. A general recommendation is to always use a heteroscedastic method, given the goal of comparing measures of central tendency, or making inferences about regression parameters, as well as measures of association.



Figure 8.42.5 (A) Density functions for the standard normal distribution (solid line) and the mixed normal distribution (dashed line). These distributions have an obvious similarity, yet the variances are 1 and 10.9. (B) Box plots of means, medians, and 20% trimmed means when sampling from a normal distribution. (C) Box plots of means, medians, and 20% trimmed means when sampling from a mixed normal distribution. In panels B and C, each distribution has 10,000 values, and each of these values was obtained by computing the mean, median, or trimmed mean of 30 randomly generated observations.

2.3 Outliers

Even small departures from normality can devastate power. The modern illustration of this fact stems from Tukey (1960) and is based on what is generally known as a mixed normal distribution. The mixed normal considered by Tukey means that, with probability 0.9, an observation is sampled from a standard normal distribution; otherwise, an observation is sampled from a normal distribution having mean zero and standard deviation 10. Figure 8.42.5A shows a standard normal distribution and the mixed normal discussed by Tukey. Note that in the center of the distributions, the mixed normal is below the normal distribution. But for the two ends of the mixed normal distribution (the tails), the mixed normal lies above the normal distribution. For this reason, the mixed normal is often described as having heavy tails. In general, heavy-tailed distributions roughly refer to distributions where outliers are likely to occur.
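Samples from this mixed normal are easily generated; the following sketch (the function name rmixnorm is ours) can be used to verify its properties by simulation:

# Tukey's mixed normal: N(0,1) with probability 0.9, otherwise N(0,10)
rmixnorm <- function(n, p = 0.9, sd.wide = 10) {
  ifelse(runif(n) < p, rnorm(n), rnorm(n, sd = sd.wide))
}
set.seed(1)
var(rmixnorm(1e6))  # close to the population variance 0.9(1) + 0.1(100) = 10.9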

Here is an important point. The standard normal has variance 1, but the mixed normal has variance 10.9. That is, the population variance can be overly sensitive to slight changes in the tails of a distribution. Consequently, even a slight departure from normality can result in relatively poor power when using any method based on the mean.

Put another way, samples from the mixed normal are more likely to result in outliers compared to samples from a standard normal distribution. As previously indicated, outliers inflate the sample variance, which can negatively impact power when using means. Another concern is that they can give a distorted and misleading summary regarding the bulk of the participants.

The first indication that heavy-tailed distributions are a concern stems from a result derived by Laplace about two centuries ago. In modern terminology, he established that, as we move from a normal distribution to a distribution more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hald, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that, under normality, the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A. Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller, contrary to what might be expected based on standard training.

Of course, a more important issue is whether the median or 20% trimmed mean ever have substantially smaller standard errors based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality. But as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and median. Because the choice between a 20% trimmed mean and median is not straightforward in terms of maximizing power, it is suggested that, in exploratory studies, both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense about the association among the bulk of the participants under study. In particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense about the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes when testing hypotheses and outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).

2.4 Curvature

Typically, a regression line is assumed to be straight. In some situations, this approach seems to suffice. However, it cannot be stressed too strongly that there is substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here it is merely remarked that, among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother


(Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
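Assuming the functions described at the beginning of Section 4 have been sourced, a minimal sketch of how these smoothers might be applied is as follows; the simulated data are purely illustrative:

set.seed(1)
x <- rnorm(100)
y <- x^2 + rnorm(100)  # a curved association
lplot(x, y)            # Cleveland's smoother
rplot(x, y)            # running-interval smoother, 20% trimmed mean by default
rplot(x, y, est = median, xout = TRUE)  # median smoother, outliers among x removed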

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model, for some appropriate choice for the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that, indeed, the regression line is straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3. DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, generally these strategies are relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions

A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly, in terms of controlling the Type I error, when there is heteroscedasticity. Then various methods for testing homoscedasticity are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably, there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression, as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes

There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

|X − X̄| / s > 2

Equation 8.42.1

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The two last observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample, and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M| / (MAD/0.6745) > 2.24

Equation 8.42.2

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation and, of course, M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only, rather than use a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers: 71.53. But the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
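In R, both rules in this illustration are easily checked; note that the base function mad already rescales by the constant 1.4826, which is approximately 1/0.6745:

x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
abs(x - median(x)) / mad(x) > 2.24  # MAD-median rule: flags both values of 71.53
abs(x - mean(x)) / sd(x) > 2        # Equation 8.42.1: all FALSE, the outliers are masked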

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that if the remaining data are dependent, then they are correlated, which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue, when using a 20% trimmed mean or median, is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that, when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil and Sen estimator, as well as the MM-estimator, are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory, for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The


number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, and Friedman's method. The first thing to stress is that, under general conditions, these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but, at least in some situations, the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions, the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So, again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution-free technique via the R function sintv2. Distribution free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases. The very presence


of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly, MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.
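For example, assuming the file has been downloaded to the current working directory, the functions used in this unit can be made available with a single command:

source("Rallfun-v33.txt")  # loads the functions into the current R session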

All recommended methods deal with heteroscedasticity. When comparing groups, and distributions that differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory, in terms of Type I errors, when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample, with replacement, n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample, with replacement, m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y. Let D* = M*x − M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
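The following is a minimal sketch of this strategy; the function name medpb is ours, and medpb2 provides a refined implementation:

medpb <- function(x, y, nboot = 2000, alpha = 0.05) {
  # bootstrap distribution of the difference between the two sample medians
  Dstar <- replicate(nboot,
                     median(sample(x, length(x), replace = TRUE)) -
                     median(sample(y, length(y), replace = TRUE)))
  ci <- quantile(Dstar, c(alpha / 2, 1 - alpha / 2))
  # generalized p-value based on the proportion of D* < 0 plus half the ties at zero
  phat <- mean(Dstar < 0) + 0.5 * mean(Dstar == 0)
  list(ci = ci, p.value = 2 * min(phat, 1 - phat))
}
set.seed(1)
medpb(rlnorm(8), rlnorm(8) + 1)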

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with the relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
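Assuming the WRS conventions, with x and y holding the two groups, the call is simply:

yuen(x, y, tr = 0.2)  # Yuen's test with 20% trimming; tr = 0 gives Welch's test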

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0) and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
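This estimate is easily computed with base R; the following reproduces the small example in the text.

x <- c(1, 2, 3)
y <- c(1, 2.5, 4)
D <- as.vector(outer(x, y, "-"))  # all pairwise differences Xi - Yk
mean(D < 0)                       # proportion of negative differences: 5/9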

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5

Equation 8.42.3

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0

Equation 8.42.4

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullens, 2012). (We are not aware of any commercial software package that contains this method.)
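Assuming the WRS conventions, with the two groups stored in numeric vectors x and y:

cidv2(x, y)  # Cliff's method for testing H0: P(X < Y) = 0.5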

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.
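Assuming the WRS signature for qcomhd, in which the argument q selects the quantiles to compare, the nine deciles of two independent groups can be compared along these lines:

qcomhd(x, y, q = seq(0.1, 0.9, by = 0.1))  # Harrell-Davis estimates, percentile bootstrap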

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then, for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0

Equation 8.42.5

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥ 10. For q = 0.1, sample sizes ≥ 20 should be used (Wilcox, 2012).

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.
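As a concrete illustration of the MOM idea, here is a minimal sketch using the MAD-median rule with the conventional criterion 2.24; mom_est is a hypothetical helper, and the WRS implementation includes refinements this sketch omits.

mom_est <- function(x, crit = 2.24) {
  # mad() in R already rescales by 1.4826, i.e., it returns MAD/0.6745
  keep <- abs(x - median(x)) / mad(x) <= crit  # MAD-median outlier rule
  mean(x[keep])                                # average the remaining values
}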

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM, as well as related estimators. The R function pb2gen applies this method.

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD: the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method is recommended that was derived by Hettmansperger and Sheather (1986). When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And in general it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.
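Because these methods operate on the difference scores, the calls stay simple; assuming the WRS conventions used throughout this unit, with x and y the paired measurements:

trimpb(x - y)   # percentile bootstrap CI for the 20% trimmed mean of D
trimci(x - y)   # Tukey-McLaughlin version, for larger samples
sintv2(x - y)   # Hettmansperger-Sheather CI for the median of D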

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.
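The core logic of the sign test can be conveyed with base R alone; signt additionally handles ties and the Agresti-Coull interval, so this sketch is only meant to show the idea.

d <- x - y                           # difference scores for the paired data
binom.test(sum(d < 0), sum(d != 0))  # inference about P(X < Y), ties dropped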

A criticism of the sign test is that its power might be lower than the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case: the actual probability is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
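Assuming the WRS conventions, in which the groups are supplied as a list (or a matrix with one column per group), a one-way analysis might look as follows; g1, g2, and g3 are placeholder vectors.

groups <- list(g1, g2, g3)
linconpb(groups)  # all pairwise comparisons of 20% trimmed means, bootstrap
lincon(groups)    # non-bootstrap version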

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2

Equation 8.42.6

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.

As previously noted, Pearson's correlation is not robust: even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, now a basic percentile bootstrap method can be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.
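Assuming the WRS conventions, where each group contributes a pair of vectors and the argument corfun selects the correlation, a call might take this form:

twocor(x1, y1, x2, y2, corfun = spear)  # compare two Spearman correlations
# corfun = tau would compare Kendall's tau instead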

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations when there is curvature (Wilcox, in press), but it is limited to two independent variables.

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups, using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see file power_onesample.pdf).



Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but, because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
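Hochberg's step-up adjustment is available in base R, so the same control can be applied to any collection of tests; here pvals is a placeholder vector holding the 15 pairwise p-values.

p.adjust(pvals, method = "hochberg")  # adjusted p-values; reject those <= 0.05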

It is noted that, in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.



Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group, which received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala.

Here the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability. The main point here is that the choice between Yuen and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2 in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures based on the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression, it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL). Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9 Nonparametric estimates of the regression line for predicting GORTFL with CSTL. (The panels plot CSTL against age, GORTFL against age, GORTFL against CSTL, and CSTL against GORTFL.)

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend approximately at GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci), p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but, relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE (see the sketch after this list). As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).
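Assuming the WRS conventions, the xout pattern mentioned above looks as follows; x and y are placeholder vectors, and the default outlier detection (outpro) is an assumption based on the package defaults.

rplot(x, y, xout = TRUE)                  # running-interval smoother, outliers in x removed
regci(x, y, regfun = tsreg, xout = TRUE)  # CI for the Theil-Sen slope, same treatment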

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals.

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores, using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores, using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes > 10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second, using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be > 0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness. British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner E Domhof S amp Langer F (2002) Non-parametric analysis of longitudinal data in fac-torial experiments New York Wiley

Chung E amp Romano J P (2013) Exact andasymptotically robust permutation tests An-nals of Statistics 41 484ndash507 doi 10121413-AOS1090

Cleveland W S (1979) Robust locally weightedregression and smoothing scatterplots Journalof the American Statistical Association 74 829ndash836 doi 10108001621459197910481038

Cliff N (1996) Ordinal methods for behavioraldata analysis Mahwah NJ Erlbaum

Cressie N A C amp Whitford H J (1986) How touse the two sample t-test Biometrical Journal28 131ndash148 doi 101002bimj4710280202

Derksen S amp Keselman H J (1992) Back-ward forward and stepwise automated sub-set selection algorithms Frequency of obtain-ing authentic and noise variables British Jour-nal of Mathematical and Statistical Psychology45 265ndash282 doi 101111j2044-83171992tb00992x

Doksum K A amp Sievers G L (1976) Plot-ting with confidence Graphical comparisons oftwo populations Biometrika 63 421ndash434 doi101093biomet633421

Doksum K A amp Wong C-W (1983) Statisticaltests based on transformed data Journal of theAmerican Statistical Association 78 411ndash417doi 10108001621459198310477986

Duncan G T amp Layard M W (1973) AMonte-Carlo study of asymptotically robusttests for correlation Biometrika 60 551ndash558doi 101093biomet603551

Fagerland M W amp Leiv Sandvik L (2009) TheWilcoxon-Mann-Whitney test under scrutinyStatistics in Medicine 28 1487ndash1497 doi101002sim3561

Field A Miles J amp Field Z (2012) Discoveringstatistics using R Thousand Oaks CA SagePublishing

Grayson D (2004) Some myths and leg-ends in quantitative psychology Understand-ing Statistics 3 101ndash134 doi 101207s15328031us0302_3

Hampel F R Ronchetti E M Rousseeuw P Jamp Stahel W A (1986) Robust statistics NewYork Wiley

Hand A (1998) A History of mathematical statis-tics from 1750 to 1930 New York Wiley

A Guide to RobustStatisticalMethods

84228

Supplement 82 Current Protocols in Neuroscience

Harrell F E amp Davis C E (1982) A newdistribution-free quantile estimator Biometrika69 635ndash640 doi 101093biomet693635

Heritier S Cantoni E Copt S amp Victoria-FeserM-P (2009) Robust methods in biostatisticsNew York Wiley

Hettmansperger T P amp McKean J W (2011)Robust nonparametric statistical methods 2nded Boca Raton FL CRC Press

Hettmansperger T P amp Sheather S J (1986)Confidence interval based on interpolated orderstatistics Statistical Probability Letters 4 75ndash79 doi 1010160167-7152(86)90021-0

Hochberg Y (1988) A sharper Bonferroni pro-cedure for multiple tests of significanceBiometrika 75 800ndash802 doi 101093biomet754800

Hommel G (1988) A stagewise rejective multi-ple test procedure based on a modified Bon-ferroni test Biometrika 75 383ndash386 doi101093biomet752383

Houston S M Lebel C Katzir T Manis FR Kan E Rodriguez G R amp Sowell ER (2014) Reading skill and structural braindevelopment Neuroreport 25 347ndash352 doi101097WNR0000000000000121

Huber P J amp Ronchetti E (2009) Robust statis-tics 2nd ed New York Wiley

Keselman H J Othman A amp Wilcox RR (2016) Generalized linear model analy-ses for treatment group equality when dataare non-normal Journal of Modern and Ap-plied Statistical Methods 15 32ndash61 doi1022237jmasm1462075380

Mancini F (2016) ANA_mancini14_datazipfigshare Retreived from httpsfigsharecomarticlesANA_mancini14_data_zip3427766

Mancini F Bauleo A Cole J Lui F PorroC A Haggard P amp Iannetti G D (2014)Whole-body mapping of spatial acuity for painand touch Annals of Neurology 75 917ndash924doi 101002ana24179

Maronna R A Martin D R amp Yohai V J(2006) Robust statistics Theory and methodsNew York Wiley

Montgomery D C amp Peck E A (1992) Intro-duction to linear regression analysis New YorkWiley

Newcombe R G (2006) Confidence intervals foran effect size measure based on the Mann-Whitney statistic Part 1 General issues and tail-area-based methods Statistics in Medicine 25543ndash557 doi 101002sim2323

Nieuwenhuis S Forstmann B U amp Wagenmak-ers E-J (2011) Erroneous analyses of inter-actions in neuroscience a problem of signifi-cance Nature Neuroscience 9 1105ndash1107 doi101038nn2886

Ozdemir A F Wilcox R R amp Yildiztepe E(2013) Comparing measures of location Somesmall-sample results when distributions differin skewness and kurtosis under heterogene-ity of variances Communications in Statistics-

Simulation and Computations 42 407ndash424doi 101080036109182011636163

Pernet C R Wilcox R amp Rousselet G A(2012) Robust correlation analyses False pos-itive and power validation using a new opensource matlab toolbox Frontiers in Psychology3 606 doi 103389fpsyg201200606

Pernet C R Latinus M Nichols T E amp Rous-selet G A (2015) Cluster-based computa-tional methods for mass univariate analyses ofevent-related brain potentialsfields A simula-tion study Journal of Neuroscience Methods250 85ndash93 doi 101016jjneumeth201408003

Rasmussen J L (1989) Data transformation TypeI error rate and power British Journal of Math-ematical and Statistical Psychology 42 203ndash211 doi 101111j2044-83171989tb00910x

Romano J P (1990) On the behavior ofrandomization tests without a group invari-ance assumption Journal of the AmericanStatistical Association 85 686ndash692 doi10108001621459199010474928

Rousseeuw P J amp Leroy A M (1987) Ro-bust regression amp outlier detection New YorkWiley

Rousselet G A Foxe J J amp Bolam J P (2016)A few simple steps to improve the descrip-tion of group results in neuroscience EuropeanJournal of Neuroscience 44 2647ndash2651 doi101111ejn13400

Rousselet G A amp Pernet C R (2012) Improvingstandards in brain-behavior correlation analysesFrontiers in Human Neuroscience 6 119 doi103389fnhum201200119

Rousselet G A Pernet C R amp WilcoxR R (2017) Beyond differences in meansRobust graphical methods to compare twogroups in neuroscience European Jour-nal of Neuroscience 46 1738ndash1748 doi101111ejn1361013610

Ruscio J (2008) A probability-based measure ofeffect size Robustness to base rates and otherfactors Psychological Methods 13 19ndash30 doi1010371082-989X13119

Ruscio J amp Mullen T (2012) Confidenceintervals for the probability of superiority ef-fect size measure and the area under a re-ceiver operating characteristic curve Multivari-ate Behavioral Research 47 201ndash223 doi101080002731712012658329

Schilling M amp Doi J (2014) A cover-age probability approach to finding an op-timal binomial confidence procedure Ameri-can Statistician 68 133ndash145 doi 101080000313052014899274

Staudte R G amp Sheather S J (1990) Robustestimation and testing New York Wiley

Talebi V amp Baker C L Jr (2016) Categoricallydistinct types of receptive fields in early visualcortex Journal of Neurophysiology 115 2556ndash2576 doi 101152jn006592015

Tukey J W (1960) A survey of sampling from con-taminated normal distributions In I Olkin (Ed)

BehavioralNeuroscience

84229

Current Protocols in Neuroscience Supplement 82

Contributions to Probability and Statistics Es-says in honor of Harold Hotelling (pp 448ndash485) Stanford CA Stanford University Press

Tukey J W amp McLaughlin D H (1963) Lessvulnerable confidence and significance proce-dures for location based on a single sampleTrimmingWinsorization 1 Sankhya The In-dian Journal of Statistics 25 331ndash352

Wagenmakers E J Wetzels R Borsboom Dvan der Maas H L amp Kievit R A (2012) Anagenda for purely confirmatory research Per-spectives in Psychological Science 7 632ndash638doi 1011771745691612463078

Weissgerber T L Milic N M Winham SJ amp Garovic V D (2015) Beyond bar andline graphs Time for a new data presentationparadigm PLOS Biology 13 e1002128 doi101371journalpbio1002128

Welch B L (1938) The significance of the differ-ence between two means when the populationvariances are unequal Biometrika 29 350ndash362doi 101093biomet293-4350

Wiederhold J L amp Bryan B R (2001) Grayoral reading test 4th ed Austin TX ProEdPublishing

Wilcox R R (2009) Comparing Pearson cor-relations Dealing with heteroscedasticity andnon-normality Communications in Statistics-Simulation and Computation 38 2220ndash2234doi 10108003610910903289151

Wilcox R R (2012) Comparing two inde-pendent groups via a quantile generaliza-tion of the WilcoxonndashMannndashWhitney testJournal of Modern and Applied StatisticalMethods 11 296ndash302 doi 1022237jmasm1351742460

Wilcox R R (2017a) Introduction to robust es-timation and hypothesis testing (4th ed) SanDiego Academic Press

Wilcox R R (2017b) Understanding and applyingbasic statistical methods using R New YorkWiley

Wilcox R R (2017c) Modern statistics for thesocial and behavioral sciences A practical in-troduction (2nd ed) New York CRC Press

Wilcox R R (2018) Robust regression An in-ferential method for determining which inde-pendent variables are most important Jour-nal of Applied Statistics 45 100ndash111 doi1010800266476320161268105

Wilcox R R (in press) An inferential method fordetermining which of two independent variablesis most important when there is curvature Jour-nal of Modern and Applied Statistical Methods

Wilcox R R amp Rousselet G A (2017)A guide to robust statistical methods Il-lustrations using R figshare Retrieved fromhttpsfigsharecomarticlesCurrent_Protocols_pdf5047591

Winkler A M Ridgwaym G R Webster MA Smith S M amp Nichols T M (2014)Permutation inference for the general lin-ear model NeuroImage 92 381ndash397 doi101016jneuroimage201401060

Woodcock R W McGrew K S amp MatherN (2001) Woodcock Johnson-III tests ofachievement Rolling Meadows IL RiversidePublishing

Yuen K K (1974) The two sample trimmed t forunequal population variances Biometrika 61165ndash170 doi 101093biomet611165

A Guide to RobustStatisticalMethods

84230

Supplement 82 Current Protocols in Neuroscience

Figure 8.42.3 Power as a function of sample size (y axis: power, the true positive probability; x axis: sample size; lines for normal and lognormal distributions crossed with the mean, median, and trimmed mean estimators). The probability of a true positive was computed by running a simulation with 10,000 iterations. In each iteration, sample sizes from 10 to 500, in steps of 10, were drawn from a normal distribution and the (lognormal) distribution shown in Figure 8.42.1A. For each combination of sample size and distribution, we applied a t-test on the mean, a test of the median, and a t-test on the 20% trimmed mean, all with alpha = 0.05. Depending on the test applied, the mean, median, or 20% trimmed mean of the population sampled from was 0.5. The black horizontal line marks the conventional 80% power threshold. When sampling from a normal distribution, all methods require <50 observations to achieve 80% power, and the mean appears to have higher power at lower sample sizes than the other methods. When sampling from a lognormal distribution, power drops dramatically for the mean, but not for the median and the trimmed mean. Power for the median actually improves. The exact pattern of results depends on the effect size and the asymmetry of the distribution we sample from, so we strongly encourage readers to perform their own detailed power analyses.

For the asymmetric (lognormal) distribution, the median performs best and the mean performs very poorly.

A feature of random samples taken from the distribution in Figure 8.42.1A is that the expected proportion of points declared an outlier is relatively small. For skewed distributions, as we move toward situations where outliers are more common, a sample size >300 can be required to achieve reasonably good control over the Type I error probability. That is, control over the Type I error probability is a function of both the degree a distribution is skewed and the likelihood of encountering outliers. However, there are methods that perform reasonably well with small sample sizes, as indicated in Section 4.1.

For symmetric distributions where outliers tend to occur, the reverse can happen: the actual Type I error probability can be substantially less than the nominal level. This happens because outliers inflate the standard deviation, which in turn lowers the value of T, which in turn can negatively impact power. Section 2.3 elaborates on this issue.

In an important sense, outliers have a larger impact on the sample variance than the sample mean, which impacts the t-test. To illustrate this point, imagine the goal is to test H0: μ = 1 based on the following values: 1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7. Then T = 4.69, the p-value is p = 0.002, and the 0.95 confidence interval = [1.45, 2.35]. Now, including the value 8, the mean increases from 1.9 to 2.58, suggesting at some level that there is stronger evidence for rejecting the null hypothesis. However, this outlier increases the standard deviation from 0.54 to 2.1, and now T = 2.26 and p = 0.054. The 0.95 confidence interval = [−0.033, 3.19].
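
These numbers are easy to verify in base R; the short check below simply re-runs the illustration using the values listed above:

x <- c(1, 1.5, 1.6, 1.8, 2, 2.2, 2.4, 2.7)
t.test(x, mu = 1)        # T = 4.69, p = 0.002, 0.95 CI [1.45, 2.35]
# add a single outlier: the mean moves further from 1, yet the evidence weakens
t.test(c(x, 8), mu = 1)  # T = 2.26, p = 0.054, 0.95 CI [-0.033, 3.19]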

Now consider the goal of comparing two independent or dependent groups. If the groups have identical distributions, then difference scores have a symmetric distribution, and the probability of a Type I error is, in general, less than the nominal level when using conventional methods based on means. Now, in addition to outliers, differences in skewness create practical concerns when using Student's t-test. Indeed, under general conditions, the two-sample Student's t-test for independent groups is not even asymptotically correct, roughly because the standard error of the difference between the sample means is not estimated correctly (e.g., Cressie & Whitford, 1986). Moreover, Student's t-test can be biased. This means that the probability of rejecting the null hypothesis of equal means can be higher when the population means are equal compared to situations where the population means differ. Roughly, this concern arises because the distribution of T can be skewed, and in fact the mean of T can differ from zero even though the null hypothesis is true. For a more detailed explanation, see Wilcox (2017c, section 5.5). Problems persist when Student's t-test is replaced by Welch's (1938) method, which is designed to compare means in a manner that allows unequal variances. Put another way, if the goal is to test the hypothesis that two groups have identical distributions, conventional methods based on means perform well in terms of controlling the Type I error probability. But if the goal is to compare the population means, and if distributions differ, conventional methods can perform poorly.

There are many techniques that perform well when dealing with skewed distributions in terms of controlling the Type I error probability, some of which are based on the usual sample median (Wilcox, 2017a, 2017c). Both theory and simulations indicate that, as the amount of trimming increases, the ability to control the probability of a Type I error increases as well. Moreover, for reasons to be explained, trimming can substantially increase power, a result that is not obvious based on conventional training. The optimal amount of trimming depends on the characteristics of the population distributions, which are unknown. Currently, the best that can be said is that the choice can make a substantial difference. The 20% trimmed mean has been studied extensively and often provides a good compromise between the two extremes: no trimming (the mean) and the maximum amount of trimming (the median).

In various situations, particularly important are inferential methods based on what are called bootstrap techniques. Two basic versions are the bootstrap-t and the percentile bootstrap. Roughly, rather than assume normality, bootstrap-t methods perform a simulation using the observed data that yields an estimate of an appropriate critical value and a p-value. So values of T are generated as done in Figure 8.42.1, only data are sampled with replacement from the observed data. In essence, bootstrap-t methods generate data-driven T distributions expected by chance if there were no effect. The percentile bootstrap proceeds in a similar manner, only, when dealing with a trimmed mean for example, the goal is to determine the distribution of the sample trimmed mean, which can then be used to compute a p-value and a confidence interval. When comparing two independent groups based on the usual sample median, if there are tied (duplicated) values, currently the only method that performs well in simulations in terms of controlling the Type I error probability is based on a percentile bootstrap method (compare to Table 5.3 in Wilcox, 2017c). Section 4.1 elaborates on how this method is performed.
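
To make the bootstrap-t logic concrete, here is a minimal sketch for a one-sample mean, using simulated placeholder data; the functions described in Section 4 automate this, including the extension to trimmed means:

set.seed(1)
x <- rlnorm(25)                        # placeholder skewed sample
se <- sd(x) / sqrt(length(x))
Tstar <- replicate(2000, {
  xb <- sample(x, replace = TRUE)      # resample the observed data
  (mean(xb) - mean(x)) / (sd(xb) / sqrt(length(xb)))
})
crit <- quantile(Tstar, c(0.025, 0.975))           # data-driven critical values
c(mean(x) - crit[2] * se, mean(x) - crit[1] * se)  # bootstrap-t 0.95 CI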

If the amount of trimming is close to zero, the bootstrap-t method is preferable to the percentile bootstrap method. But as the amount of trimming increases, at some point a percentile bootstrap method is preferable. This is the case with 20% trimming (e.g., Wilcox, 2017a). It seems to be the case with 10% trimming as well, but a definitive study has not been made.

Also, if the goal is to reflect the typical response, it is evident that the median, or even a 20% trimmed mean, might be more satisfactory. Using quantiles (percentiles) other than the median can be important as well, for reasons summarized in Section 4.2.

When comparing independent groups, modern improvements on the Wilcoxon-Mann-Whitney (WMW) test are another possibility, which are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second. (More details are provided in Section 3.) Additional possibilities are described in Wilcox (2017a), some of which are illustrated in Section 4 below.

In some situations, robust methods can have substantially higher power than any method based on means. It is not being suggested that robust methods always have more power. This is not the case. Rather, the point is that power can depend crucially on the conjunction of which estimator is used (for instance, the mean versus the median) and how a confidence interval is built (for instance, a parametric method or the percentile bootstrap). These choices are not trivial and must be taken into account when analyzing data.

2.2 Heteroscedasticity
It has been clear for some time that, when using classic methods for comparing means, heteroscedasticity (unequal population variances) is a serious concern (e.g., Brown & Forsythe, 1974). Heteroscedasticity can impact both power and the Type I error probability. The basic reason is that, under general conditions, methods that assume homoscedasticity are using an incorrect estimate of the standard error when in fact there is heteroscedasticity. Indeed, there are concerns regardless of how large the sample size might be. As we consider more and more complicated designs, heteroscedasticity becomes an increasing concern.

Figure 8.42.4 Homoscedasticity (A) and heteroscedasticity (B) (x axis: age; y axis: dependent variable). (A) The variance of the dependent variable is the same at any age. (B) The variance of the dependent variable can depend on the age of the participants.

When dealing with regression, homoscedasticity means that the variance of the dependent variable does not depend on the value of the independent variable. When dealing with age and depressive symptoms, for example, homoscedasticity means that the variation in measures of depressive symptoms at age 23 is the same at age 80, or any age in between, as illustrated in Figure 8.42.4.

Independence implies homoscedasticity. So, for this particular situation, classic methods associated with least squares regression, Pearson's correlation, Kendall's tau, and Spearman's rho are using a correct estimate of the standard error, which helps explain why they perform well in terms of Type I errors when there is no association. That is, when a homoscedastic method rejects, it is reasonable to conclude that there is an association, but, in terms of inferring the nature of the association, these methods can perform poorly. Again, a practical concern is that, when there is heteroscedasticity, homoscedastic methods use an incorrect estimate of the standard error, which can result in poor power and erroneous conclusions.

A seemingly natural way of salvaging homoscedastic methods is to test the assumption that there is homoscedasticity. But six studies summarized in Wilcox (2017c) found that this strategy is unsatisfactory. Presumably, situations are encountered where this is not the case, but it is difficult and unclear how to determine when such situations are encountered.

Methods that are designed to deal with heteroscedasticity have been developed and are easily applied using extant software. These techniques use a correct estimate of the standard error regardless of whether the homoscedasticity assumption is true. A general recommendation is to always use a heteroscedastic method, given the goal of comparing measures of central tendency, or making inferences about regression parameters, as well as measures of association.

Figure 8.42.5 (A) Density functions for the standard normal distribution (solid line) and the mixed normal distribution (dashed line). These distributions have an obvious similarity, yet the variances are 1 and 10.9. (B) Box plots of means, medians, and 20% trimmed means when sampling from a normal distribution. (C) Box plots of means, medians, and 20% trimmed means when sampling from a mixed normal distribution. In panels B and C, each distribution has 10,000 values, and each of these values was obtained by computing the mean, median, or trimmed mean of 30 randomly generated observations.

2.3 Outliers
Even small departures from normality can devastate power. The modern illustration of this fact stems from Tukey (1960) and is based on what is generally known as a mixed normal distribution. The mixed normal considered by Tukey means that, with probability 0.9, an observation is sampled from a standard normal distribution; otherwise, an observation is sampled from a normal distribution having mean zero and standard deviation 10. Figure 8.42.5A shows a standard normal distribution and the mixed normal discussed by Tukey. Note that, in the center of the distributions, the mixed normal is below the normal distribution. But for the two ends of the mixed normal distribution (the tails), the mixed normal lies above the normal distribution. For this reason, the mixed normal is often described as having heavy tails. In general, heavy-tailed distributions roughly refer to distributions where outliers are likely to occur.

Here is an important point: the standard normal has variance 1, but the mixed normal has variance 10.9 (with probability 0.9 the variance is 1 and with probability 0.1 it is 100, so the variance of the mixture is 0.9 × 1 + 0.1 × 100 = 10.9). That is, the population variance can be overly sensitive to slight changes in the tails of a distribution. Consequently, even a slight departure from normality can result in relatively poor power when using any method based on the mean.

Put another way, samples from the mixed normal are more likely to result in outliers compared to samples from a standard normal distribution. As previously indicated, outliers inflate the sample variance, which can negatively impact power when using means. Another concern is that they can give a distorted and misleading summary regarding the bulk of the participants.

The first indication that heavy-tailed distributions are a concern stems from a result derived by Laplace about two centuries ago. In modern terminology, he established that, as we move from a normal distribution to a distribution more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hand, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that, under normality, the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent to which this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A. Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller, contrary to what might be expected based on standard training.
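
For readers who want to reproduce the gist of panels B and C, here is a minimal sketch, assuming the contamination scheme described above (each observation comes from a normal distribution with standard deviation 10 with probability 0.1, and from a standard normal otherwise):

rmixnorm <- function(n, p = 0.1, sd2 = 10) {
  heavy <- runif(n) < p                    # which observations are contaminated
  ifelse(heavy, rnorm(n, 0, sd2), rnorm(n, 0, 1))
}
est <- replicate(10000, {
  x <- rmixnorm(30)
  c(mean(x), median(x), mean(x, trim = 0.2))
})
apply(est, 1, sd)  # sampling variation: largest for the mean under contamination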

Of course, a more important issue is whether the median or 20% trimmed mean ever have substantially smaller standard errors based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality. But as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and the median. Because the choice between a 20% trimmed mean and the median is not straightforward in terms of maximizing power, it is suggested that, in exploratory studies, both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense about the association among the bulk of the participants under study. In particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense about the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes when testing hypotheses and outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).
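
As a sketch of how these functions might be called (scor and scorci are from the collection of R functions described in Section 4; the data below are simulated for illustration):

# after sourcing Wilcox's functions, e.g., source("Rallfun-v33.txt")
set.seed(1)
x <- rnorm(40)
y <- x + rnorm(40)
x[1:2] <- 8; y[1:2] <- -8   # two badly placed outliers
cor(x, y)                   # Pearson's r is strongly distorted
scor(x, y)                  # skipped correlation: outliers removed first
scorci(x, y)                # heteroscedasticity-consistent CI and p-value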

2.4 Curvature
Typically, a regression line is assumed to be straight. In some situations, this approach seems to suffice. However, it cannot be stressed too strongly that there is substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here it is merely remarked that, among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother (Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
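
A minimal sketch of how the two smoothers might be called, assuming Wilcox's functions have been sourced as described in Section 4 (the data are simulated, and the est argument is used as described above):

set.seed(1)
x <- runif(100, -2, 2)
y <- x^2 + rnorm(100)        # an obviously curved association
lplot(x, y)                  # Cleveland's smoother
rplot(x, y)                  # running-interval smoother, 20% trimmed mean by default
rplot(x, y, est = median)    # same smoother, now tracking the median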

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model, for some appropriate choice for the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that indeed the regression line is straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3. DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But, by modern standards, generally these strategies are relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions
A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly in terms of controlling the Type I error when there is heteroscedasticity. Then various methods for testing homoscedasticity are performed, and, if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes
There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

|X − X̄| / s > 2

Equation 8.42.1

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The two last observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample, and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M| / (MAD/0.6745) > 2.24

Equation 8.42.2

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation, and of course M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only, rather than using a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers (the two values of 71.53). But the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
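
This example is easy to check in base R; note that R's mad() already applies the rescaling by 1/0.6745 (constant = 1.4826) by default:

x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
# two-standard-deviation rule (Equation 8.42.1): flags nothing
abs(x - mean(x)) / sd(x) > 2
# MAD-median rule (Equation 8.42.2): flags both values of 71.53
abs(x - median(x)) / mad(x) > 2.24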

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that the remaining data are dependent: they are correlated, which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue, when using a 20% trimmed mean or median, is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that, when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil-Sen estimator, as well as the MM-estimator, are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data
A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory, for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method
Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, and Friedman's method. The first thing to stress is that, under general conditions, these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But, again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods
For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But, based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but, at least in some situations, the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median
In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So, again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution-free technique via the R function sintv2. Distribution-free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases. The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly, MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.

All recommended methods deal with heteroscedasticity. When comparing groups whose distributions differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes
This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory, in terms of Type I errors, when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And, as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based, in part, on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample, with replacement, n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample, with replacement, m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y. Let D* = M*x − M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
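
As a sketch of the mechanics just described (in practice, medpb2 handles the details; the data and the number of bootstrap samples below are placeholders):

set.seed(1)
x <- rexp(15)            # placeholder data for group 1
y <- rexp(20) + 0.5      # placeholder data for group 2
B <- 2000
Dstar <- replicate(B, median(sample(x, replace = TRUE)) -
                      median(sample(y, replace = TRUE)))
pstar <- mean(Dstar < 0) + 0.5 * mean(Dstar == 0)
2 * min(pstar, 1 - pstar)              # p-value
quantile(Dstar, c(0.025, 0.975))       # 0.95 confidence interval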

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well, and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with the relatively high risk of poor control over the Type I error probability and relatively poor power.

Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well, even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
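
A minimal sketch of how yuen might be called, assuming Wilcox's functions have been sourced as described above (the data are simulated; tr = 0.2 is the default amount of trimming):

set.seed(2)
g1 <- rlnorm(12)                   # simulated skewed samples
g2 <- rlnorm(12, meanlog = 0.8)
yuen(g1, g2)                       # Yuen's test on 20% trimmed means
yuen(g1, g2, tr = 0)               # no trimming: reduces to Welch's test
medpb2(g1, g2)                     # percentile bootstrap comparison of medians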

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based, in part, on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0), and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
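
This small example can be checked in base R:

x <- c(1, 2, 3)
y <- c(1, 2.5, 4)
D <- as.vector(outer(x, y, "-"))   # all pairwise differences Xi - Yk
mean(D < 0)                        # 5/9, the estimate of P(X < Y)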

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5

Equation 8.42.3

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0

Equation 8.42.4

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullen, 2012). (We are not aware of any commercial software package that contains this method.)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles
Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.

First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.

Imagine for example that an experimentalgroup is compared to a control group basedon some measure of depressive symptoms Ifx025 = ndash4 and x075 = 6 then for a single ran-domly sampled observation from each groupthere is a sense in which the experimental treat-ment outweighs no treatment because positivedifferences (beneficial effect) tend to be largerthan negative differences (detrimental effect)The hypothesis

H0 xq + xqminus1 = 0

Equation 8425

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are at least 10. For q = 0.1, sample sizes of at least 20 should be used (Wilcox, 2012).

4.3 Eliminate Outliers and Average the Remaining Values
Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM, as well as related estimators. The R function pb2gen applies this method.
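
A sketch of how pb2gen might be called, assuming Wilcox's functions have been sourced as described above; passing est = mom follows the naming used in those functions (mom is the modified one-step M-estimator), and the data are simulated:

set.seed(4)
g1 <- rlnorm(20)
g2 <- rlnorm(20, meanlog = 0.7)
# percentile bootstrap comparison of the two groups based on MOM
pb2gen(g1, g2, est = mom)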

4.4 Dependent Variables
Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD: the difference between the population means is equal to the mean of the difference scores. But, under general conditions, this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method is recommended, which was derived by Hettmansperger and Sheather (1986). When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And, in general, it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

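A minimal sketch for paired data, under the same sourcing assumption:

    set.seed(3)
    x <- rnorm(15)
    y <- x + rnorm(15, mean = 0.3)  # paired measurements
    sintv2(x - y)   # Hettmansperger-Sheather CI for the median of D = X - Y
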
As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.

In some situations there might be interest in comparing measures of central tendency associated with the marginal distributions rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

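To make the distinction concrete, a sketch under the same assumptions (the matrix input format for wmcppb is an assumption about the WRS calling convention):

    set.seed(4)
    x <- rnorm(25)
    y <- x + rnorm(25, mean = 0.2)
    trimpb(x - y)                     # 20% trimmed mean of the difference scores
    wmcppb(cbind(x, y), dif = FALSE)  # compare the marginal trimmed means instead
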
Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high even with a moderately large sample size.

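A sketch (WRS assumed sourced; the paired-vector calling convention is assumed):

    set.seed(5)
    x <- rnorm(12)
    y <- x + rnorm(12, mean = 0.4)
    signt(x, y)             # Agresti-Coull method: estimate of P(X < Y), CI, p-value
    signt(x, y, SD = TRUE)  # Schilling-Doi method: coverage at least 1 - alpha
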
A criticism of the sign test is that its power might be lower than the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).

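For instance, a hedged sketch of all pairwise comparisons in a one-way design (WRS assumed; supplying the groups as a list is an assumption about the input format):

    set.seed(6)
    g <- list(rnorm(20), rnorm(20, mean = 0.5), rnorm(20, mean = 1))
    lincon(g)    # all pairwise comparisons of 20% trimmed means, non-bootstrap
    linconpb(g)  # percentile bootstrap version
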
4.6 Comparing Independent Correlations and Regression Slopes

Next consider two independent groups, where for each group there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2

Equation 8.42.6

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.

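A sketch contrasting the two approaches (WRS assumed; both functions are assumed to take the paired observations from each group as four vectors):

    set.seed(7)
    x1 <- rnorm(40); y1 <- x1 + rnorm(40)        # group 1
    x2 <- rnorm(40); y2 <- 0.2 * x2 + rnorm(40)  # group 2
    twopcor(x1, y1, x2, y2)    # modified percentile bootstrap; 0.05 level only
    twohc4cor(x1, y1, x2, y2)  # HC4-based test of H0: rho1 = rho2
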
As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.

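A sketch under the same assumptions:

    set.seed(8)
    x1 <- rnorm(40); y1 <- x1 + rnorm(40)
    x2 <- rnorm(40); y2 <- 0.5 * x2 + rnorm(40)
    twocor(x1, y1, x2, y2)  # percentile bootstrap comparison of robust correlations
    reg2ci(x1, y1, x2, y2)  # compare slopes based on a robust regression estimator
    ols2ci(x1, y1, x2, y2)  # least squares version, heteroscedasticity allowed
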
4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3 as well as Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

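A sketch (WRS assumed; supplying the two predictors as a two-column matrix is an assumption about the input format):

    set.seed(9)
    x <- matrix(rnorm(100), ncol = 2)       # columns are X1 and X2
    y <- x[, 1] + 0.3 * x[, 2] + rnorm(50)
    TWOpov(x, y)    # compare the two Pearson correlations with Y
    twoDcorR(x, y)  # robust correlation version
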
However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.

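A sketch; regIVcom uses the Theil-Sen estimator by default, and, as noted in Section 5.4, the argument regfun = ols switches to least squares:

    set.seed(10)
    x <- matrix(rnorm(120), ncol = 2)
    y <- x[, 1] + 0.3 * x[, 2] + rnorm(60)
    regIVcom(x, y)                # Theil-Sen estimator (default)
    regIVcom(x, y, regfun = ols)  # least squares version
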
4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique, and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).

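A sketch of this workflow (WRS assumed; each group is supplied as its own covariate and outcome vectors):

    set.seed(11)
    x1 <- rnorm(40); y1 <- x1 + rnorm(40)        # group 1: covariate, outcome
    x2 <- rnorm(40); y2 <- 0.4 * x2 + rnorm(40)  # group 2
    rplot2g(x1, y1, x2, y2, xout = TRUE)  # plot both lines, outliers in x removed
    ancova(x1, y1, x2, y2)                # compare groups at selected covariate values
    ancdet(x1, y1, x2, y2)                # method TAP: more detail on where lines differ
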
5 SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see file power_onesample.pdf).


Figure 8.42.6  Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.

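Hochberg's adjustment is also available in base R. A minimal sketch applied to a vector of p-values (the values below are arbitrary placeholders, not the Mancini et al. results):

    # Control P(one or more Type I errors) over 15 tests with Hochberg's method
    pvals <- c(0.001, 0.004, 0.012, 0.020, 0.031, 0.042, 0.048, 0.050,
               0.060, 0.075, 0.110, 0.200, 0.350, 0.600, 0.810)
    p.adjust(pvals, method = "hochberg")  # adjusted p-values, compare to 0.05
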
It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

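A sketch of a decile comparison (WRS assumed; the q argument used to request the deciles is an assumption):

    set.seed(12)
    g1 <- rnorm(100)
    g2 <- rnorm(100, mean = 0.3, sd = 1.5)
    qcomhd(g1, g2, q = seq(0.1, 0.9, by = 0.1))  # Harrell-Davis deciles, simultaneous CIs
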
The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly except the 0.1 quantiles.

Next consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7  Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group that received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Here, the day 1 group was compared again to the sham group based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8  Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala. Conditions: sham, day 1, and day 7; the y-axis shows the number of GAD-67-positive cells.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures based on the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).


Figure 8.42.9  Nonparametric estimates of the regression line for predicting GORTFL with CSTL. Panels: (A) age versus CSTL, (B) age versus GORTFL, (C) CSTL versus GORTFL, (D) GORTFL versus CSTL.

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.

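A sketch of this kind of check, using simulated stand-in data rather than the Houston et al. measures (WRS assumed; xout = TRUE removes outliers among the independent variable, as in Section 4.8):

    set.seed(13)
    age <- runif(53, 6, 16)
    cstl <- 0.45 + 0.006 * age + rnorm(53, sd = 0.02)  # simulated stand-in data
    rplot(age, cstl, xout = TRUE)  # running interval smoother, outliers in x removed
    regci(age, cstl, xout = TRUE)  # CI for the Theil-Sen slope, heteroscedasticity allowed
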
A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important via the R function regIVcom indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001); the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6 A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≤ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data, where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests, ANOVA F tests on means, and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot (see the sketch after this list). Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).

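As a concrete sketch of that last point (WRS assumed sourced; data simulated):

    set.seed(14)
    x <- c(rnorm(40), 6)      # one extreme predictor value
    y <- x + rnorm(41)
    lplot(x, y)               # Cleveland's smoother, all points retained
    rplot(x, y, xout = TRUE)  # running interval smoother, outliers in x removed
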
If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals.

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes > 10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

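A sketch pulling the preceding dependent-group options together (WRS assumed; simulated paired data):

    set.seed(15)
    x <- rnorm(10)
    y <- x + rnorm(10, mean = 0.5)
    signt(x, y, SD = TRUE)  # sign test, coverage at least the nominal level
    sintv2(x - y)           # median of the difference scores
    trimcipb(x - y)         # 20% trimmed mean of the difference scores
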
If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

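A sketch under the same assumptions, for two small independent groups:

    set.seed(16)
    x <- rnorm(10)
    y <- rnorm(10, mean = 1)
    medpb2(x, y)   # percentile bootstrap comparison of medians
    trimpb2(x, y)  # percentile bootstrap comparison of 20% trimmed means
    cidv2(x, y)    # inference about P(X < Y)
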
Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7 CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be > 0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that those results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., . . . Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hand, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence intervals based on interpolated order statistics. Statistical Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 9, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray oral reading test (4th ed.). Austin, TX: ProEd Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock Johnson-III tests of achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


Page 7: A Guide to Robust Statistical Methods in Neuroscience ... · A Guide to Robust Statistical Methods in UNIT 8.42 Neuroscience Rand R. Wilcox1 and Guillaume A. Rousselet2 1Deptartment

Studentrsquos t-test can be biased This means thatthe probability of rejecting the null hypothesisof equal means can be higher when the popu-lation means are equal compared to situationswhere the population means differ Roughlythis concern arises because the distribution ofT can be skewed and in fact the mean of T candiffer from zero even though the null hypothe-sis is true For a more detailed explanation seeWilcox (2017c section 55) Problems persistwhen Studentrsquos t-test is replaced by Welchrsquos(1938) method which is designed to comparemeans in a manner that allows unequal vari-ances Put another way if the goal is to testthe hypothesis that two groups have identicaldistributions conventional methods based onmeans perform well in terms of controlling theType I error probability But if the goal is tocompare the population means and if distribu-tions differ conventional methods can performpoorly

There are many techniques that perform well when dealing with skewed distributions, in terms of controlling the Type I error probability, some of which are based on the usual sample median (Wilcox, 2017a, 2017c). Both theory and simulations indicate that as the amount of trimming increases, the ability to control the probability of a Type I error increases as well. Moreover, for reasons to be explained, trimming can substantially increase power, a result that is not obvious based on conventional training. The optimal amount of trimming depends on the characteristics of the population distributions, which are unknown. Currently, the best that can be said is that the choice can make a substantial difference. The 20% trimmed mean has been studied extensively and often provides a good compromise between the two extremes: no trimming (the mean) and the maximum amount of trimming (the median).

In various situations, particularly important are inferential methods based on what are called bootstrap techniques. Two basic versions are the bootstrap-t and the percentile bootstrap. Roughly, rather than assume normality, bootstrap-t methods perform a simulation using the observed data that yields an estimate of an appropriate critical value and a p-value. So values of T are generated as done in Figure 8.42.1, only data are sampled with replacement from the observed data. In essence, bootstrap-t methods generate data-driven T distributions expected by chance if there were no effect. The percentile bootstrap proceeds in a similar manner, only when dealing with a trimmed mean, for example, the goal is to determine the distribution of the sample trimmed mean, which can then be used to compute a p-value and a confidence interval. When comparing two independent groups based on the usual sample median, if there are tied (duplicated) values, currently the only method that performs well in simulations, in terms of controlling the Type I error probability, is based on a percentile bootstrap method (compare to Table 5.3 in Wilcox, 2017c). Section 4.1 elaborates on how this method is performed.
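
To make the percentile bootstrap concrete, here is a minimal base-R sketch for a single sample; the data and the number of bootstrap samples are hypothetical, and the R functions described in Section 4 are the recommended implementations.

    set.seed(1)
    x <- rexp(30)                    # hypothetical skewed sample
    nboot <- 2000
    # compute the 20% trimmed mean for each bootstrap sample
    boot <- replicate(nboot, mean(sample(x, replace = TRUE), trim = 0.2))
    quantile(boot, c(0.025, 0.975))  # 0.95 percentile bootstrap confidence interval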

If the amount of trimming is close to zero, the bootstrap-t method is preferable to the percentile bootstrap method. But as the amount of trimming increases, at some point a percentile bootstrap method is preferable. This is the case with 20% trimming (e.g., Wilcox, 2017a). It seems to be the case with 10% trimming as well, but a definitive study has not been made.

Also, if the goal is to reflect the typical response, it is evident that the median, or even a 20% trimmed mean, might be more satisfactory. Using quantiles (percentiles) other than the median can be important as well, for reasons summarized in Section 4.2.

When comparing independent groups, modern improvements on the Wilcoxon-Mann-Whitney (WMW) test are another possibility; they are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second. (More details are provided in Section 3.) Additional possibilities are described in Wilcox (2017a), some of which are illustrated in Section 4 below.

In some situations, robust methods can have substantially higher power than any method based on means. It is not being suggested that robust methods always have more power; this is not the case. Rather, the point is that power can depend crucially on the conjunction of which estimator is used (for instance, the mean versus the median) and how a confidence interval is built (for instance, a parametric method or the percentile bootstrap). These choices are not trivial and must be taken into account when analyzing data.

2.2 Heteroscedasticity

It has been clear for some time that when using classic methods for comparing means, heteroscedasticity (unequal population variances) is a serious concern (e.g., Brown & Forsythe, 1974). Heteroscedasticity can impact both power and the Type I error probability.


Figure 8.42.4 Homoscedasticity (A) and heteroscedasticity (B). (A) The variance of the dependent variable is the same at any age. (B) The variance of the dependent variable can depend on the age of the participants.

The basic reason is that, under general conditions, methods that assume homoscedasticity use an incorrect estimate of the standard error when in fact there is heteroscedasticity. Indeed, there are concerns regardless of how large the sample size might be. As we consider more and more complicated designs, heteroscedasticity becomes an increasing concern.

When dealing with regression, homoscedasticity means that the variance of the dependent variable does not depend on the value of the independent variable. When dealing with age and depressive symptoms, for example, homoscedasticity means that the variation in measures of depressive symptoms at age 23 is the same as at age 80, or any age in between, as illustrated in Figure 8.42.4.

Independence implies homoscedasticity. So for this particular situation, classic methods associated with least squares regression, Pearson's correlation, Kendall's tau, and Spearman's rho are using a correct estimate of the standard error, which helps explain why they perform well, in terms of Type I errors, when there is no association. That is, when a homoscedastic method rejects, it is reasonable to conclude that there is an association; but in terms of inferring the nature of the association, these methods can perform poorly. Again, a practical concern is that when there is heteroscedasticity, homoscedastic methods use an incorrect estimate of the standard error, which can result in poor power and erroneous conclusions.

A seemingly natural way of salvaging homoscedastic methods is to test the assumption that there is homoscedasticity. But six studies summarized in Wilcox (2017c) found that this strategy is unsatisfactory. Presumably there are situations where this is not the case, but it is unclear how to determine, based on the available data, when such situations are encountered.

Methods that are designed to deal with heteroscedasticity have been developed and are easily applied using extant software. These techniques use a correct estimate of the standard error regardless of whether the homoscedasticity assumption is true. A general recommendation is to always use a heteroscedastic method, given the goal of comparing measures of central tendency, or making inferences about regression parameters, as well as measures of association.


Figure 8.42.5 (A) Density functions for the standard normal distribution (solid line) and the mixed normal distribution (dashed line). These distributions have an obvious similarity, yet the variances are 1 and 10.9. (B) Box plots of means, medians, and 20% trimmed means when sampling from a normal distribution. (C) Box plots of means, medians, and 20% trimmed means when sampling from a mixed normal distribution. In panels B and C, each distribution has 10,000 values, and each of these values was obtained by computing the mean, median, or trimmed mean of 30 randomly generated observations.

2.3 Outliers

Even small departures from normality can devastate power. The modern illustration of this fact stems from Tukey (1960) and is based on what is generally known as a mixed normal distribution. The mixed normal considered by Tukey means that, with probability 0.9, an observation is sampled from a standard normal distribution; otherwise, an observation is sampled from a normal distribution having mean zero and standard deviation 10. Figure 8.42.5A shows a standard normal distribution and the mixed normal discussed by Tukey. Note that in the center of the distributions, the mixed normal is below the normal distribution. But in the two ends of the mixed normal distribution (the tails), the mixed normal lies above the normal distribution. For this reason, the mixed normal is often described as having heavy tails. In general, heavy-tailed distributions roughly refer to distributions where outliers are likely to occur.

Here is an important point. The standard normal has variance 1, but the mixed normal has variance 10.9. That is, the population variance can be overly sensitive to slight changes in the tails of a distribution. Consequently, even a slight departure from normality can result in relatively poor power when using any method based on the mean.

Put another way, samples from the mixed normal are more likely to result in outliers compared to samples from a standard normal distribution. As previously indicated, outliers inflate the sample variance, which can negatively impact power when using means. Another concern is that they can give a distorted and misleading summary regarding the bulk of the participants.

The first indication that heavy-tailed distributions are a concern stems from a result derived by Laplace about two centuries ago. In modern terminology, he established that as we move from a normal distribution to a distribution more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hand, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that under normality the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent to which this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A. Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller, contrary to what might be expected based on standard training.
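
This simulation is easy to reproduce in base R. The following sketch generates the mixed normal elementwise (sample size, number of replications, and seed chosen for illustration) and shows the pattern in Figure 8.42.5C:

    set.seed(1)
    rmixnorm <- function(n)              # Tukey's mixed (contaminated) normal
      ifelse(runif(n) < 0.9, rnorm(n), rnorm(n, sd = 10))
    est <- replicate(10000, {
      x <- rmixnorm(30)
      c(mean = mean(x), median = median(x), tmean20 = mean(x, trim = 0.2))
    })
    apply(est, 1, sd)   # the mean has by far the largest standard error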

Of course, a more important issue is whether the median or 20% trimmed mean ever has a substantially smaller standard error based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality. But as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and the median. Because the choice between a 20% trimmed mean and the median is not straightforward in terms of maximizing power, it is suggested that in exploratory studies both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense of the association among the bulk of the participants under study. In particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense of the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes made when testing hypotheses after outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).

2.4 Curvature

Typically, a regression line is assumed to be straight. In some situations, this approach seems to suffice. However, it cannot be stressed too strongly that there is substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here it is merely remarked that among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother (Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
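
For a quick look without installing WRS, base R offers Cleveland's smoother as lowess(); this is only a rough stand-in for lplot and rplot, and the data here are hypothetical:

    set.seed(1)
    x <- runif(50, 0, 10)
    y <- sqrt(x) + rnorm(50, sd = 0.3)   # a curved association
    plot(x, y)
    lines(lowess(x, y, f = 2/3))         # f is the span; smaller values track curvature more closely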

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model, for some appropriate choice of the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that the regression line is indeed straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3. DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, these strategies are generally relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions

A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption, with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly, in terms of controlling the Type I error, when there is heteroscedasticity. Then various methods for testing homoscedasticity are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression, as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes

There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

|X − X̄| / s > 2     (Equation 8.42.1)

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The last two observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large (almost 45) because it is not robust to outliers.
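
This is easy to verify directly in R:

    x <- c(1, 2, 2, 3, 4, 6, 100, 100)
    sd(x)                           # about 44.9, inflated by the two outliers
    abs(x - mean(x)) / sd(x) > 2    # all FALSE: Equation 8.42.1 flags nothing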

This is not to suggest that all outliers will be missed; that is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M| / (MAD/0.6745) > 2.24     (Equation 8.42.2)

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation, and of course M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only rather than use a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers: 71.53. But the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
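
A base-R sketch of the MAD-median rule follows; note that R's mad() already rescales by 1.4826 ≈ 1/0.6745, so no explicit division by 0.6745 is needed:

    mad_median_outliers <- function(x)
      abs(x - median(x)) / mad(x) > 2.24   # Equation 8.42.2
    x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
    x[mad_median_outliers(x)]              # flags both values of 71.53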

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that the remaining data are dependent (they are correlated), which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue, when using a 20% trimmed mean or median, is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil-Sen estimator as well as the MM-estimator are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory, for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that, under general conditions, these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions, the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution-free technique via the R function sintv2. Distribution-free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases. The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly, MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.

All recommended methods deal with heteroscedasticity. When comparing groups whose distributions differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory, in terms of Type I errors, when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that the groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based, in part, on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample with replacement n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample with replacement m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y. Let D* = M*x − M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
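
A minimal base-R sketch of this idea is given below; the generalized p-value follows the description above, and the WRS function medpb2 should be preferred in practice:

    median_boot2 <- function(x, y, nboot = 2000) {
      Dstar <- replicate(nboot,
        median(sample(x, replace = TRUE)) - median(sample(y, replace = TRUE)))
      pstar <- mean(Dstar < 0) + 0.5 * mean(Dstar == 0)
      list(ci.95 = quantile(Dstar, c(0.025, 0.975)),  # percentile bootstrap CI
           p.value = 2 * min(pstar, 1 - pstar))       # generalized p-value
    }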

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with a relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
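
For readers curious about the mechanics, here is a base-R sketch of Yuen's test returning a two-sided p-value; the WRS function yuen is the recommended implementation:

    yuen_sketch <- function(x, y, tr = 0.2) {
      info <- function(v) {
        n <- length(v); g <- floor(tr * n); h <- n - 2 * g
        vs <- sort(v)
        w <- pmin(pmax(v, vs[g + 1]), vs[n - g])    # Winsorized sample
        list(tm = mean(v, trim = tr), h = h,
             d = (n - 1) * var(w) / (h * (h - 1)))  # squared SE of the trimmed mean
      }
      a <- info(x); b <- info(y)
      tstat <- (a$tm - b$tm) / sqrt(a$d + b$d)
      df <- (a$d + b$d)^2 / (a$d^2 / (a$h - 1) + b$d^2 / (b$h - 1))
      2 * pt(-abs(tstat), df)                       # two-sided p-value
    }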

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based, in part, on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0) and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
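
In R the computation is one line with outer():

    x <- c(1, 2, 3); y <- c(1, 2.5, 4)
    D <- outer(x, y, "-")   # all pairwise differences Xi - Yk
    mean(D < 0)             # 5/9, the estimate of P(X < Y)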

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5     (Equation 8.42.3)

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0     (Equation 8.42.4)

Certainly, P(X < Y) has practical importance, for reasons summarized by, among others, Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullens, 2012). (We are not aware of any commercial software package that contains this method.)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.
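
The Harrell-Davis estimator itself is a short computation: the qth quantile is estimated by a weighted sum of the order statistics, with weights given by increments of a Beta distribution function with parameters a = (n + 1)q and b = (n + 1)(1 − q). A base-R sketch follows (qcomhd remains the recommended route for inference):

    hd <- function(x, q = 0.5) {
      n <- length(x)
      a <- (n + 1) * q; b <- (n + 1) * (1 - q)
      i <- seq_len(n)
      w <- pbeta(i / n, a, b) - pbeta((i - 1) / n, a, b)  # Beta-based weights
      sum(w * sort(x))                                    # weighted order statistics
    }
    hd(rnorm(30), q = 0.25)   # estimate of the 0.25 quantile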

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0     (Equation 8.42.5)

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥ 10. For q = 0.1, sample sizes ≥ 20 should be used (Wilcox, 2012).

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM as well as related estimators. The R function pb2gen applies this method.

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD; the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And in general, it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.
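
As a quick check without WRS, the classic sign test is a binomial test on the number of negative difference scores (zeros dropped); this version does not use the Agresti-Coull adjustment implemented by signt, and the data are hypothetical:

    set.seed(1)
    x <- rnorm(20); y <- x + rnorm(20, mean = 0.5)   # hypothetical paired data
    d <- x - y
    binom.test(sum(d < 0), sum(d != 0))              # tests H0: P(X < Y) = 0.5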

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist, without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions, it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue, in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures, such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups, where for each group there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2     (Equation 8.42.6)

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992, section 7.2.3) as well as Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions, described in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ, compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
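A minimal sketch of these calls, assuming Rallfun has been sourced and that dat (a hypothetical name) is a matrix with one column per body part; argument conventions can vary across Rallfun versions:

    wmcp(dat, tr = 0)      # all pairwise comparisons based on means (tr = 0: no trimming)
    wmcppb(dat, tr = 0.2)  # pairwise comparisons of 20% trimmed means, percentile bootstrap
    dmedpb(dat)            # pairwise comparisons of medians via a percentile bootstrap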

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.
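For example, a minimal sketch, assuming Rallfun has been sourced and that lat.nonOri and lat.expOri (hypothetical names) are vectors of latencies for the two groups:

    qcomhd(lat.nonOri, lat.expOri, q = seq(0.1, 0.9, 0.1))  # compare all nine deciles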

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group that received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test, followed by Bonferroni post hoc tests. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Here, the day 1 group was compared again to the sham group based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.
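A minimal sketch of these comparisons, assuming Rallfun has been sourced and that sham and day1 (hypothetical names) are vectors of cell counts:

    trimpb2(sham, day1)  # percentile bootstrap, 20% trimmed means
    medpb2(sham, day1)   # percentile bootstrap, medians
    cidv2(sham, day1)    # Cliff's improvement on the WMW test
    yuen(sham, day1)     # Yuen's method for 20% trimmed means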


Figure 8.42.8 Box plots of the number of GAD-67-positive cells for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala (conditions sham, day 1, and day 7).

The main point here is that the choice between Yuen and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.
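One way to carry out these last two analyses, assuming Rallfun has been sourced; wmwpb is named above, whereas pairing the general percentile bootstrap function pb2gen with the Harrell-Davis estimator hd is our assumption about how the computation can be done:

    pb2gen(sham, day1, est = hd)  # percentile bootstrap, Harrell-Davis estimate of the median
    wmwpb(sham, day1)             # inference on the median of all pairwise differences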

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures based on the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)
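A minimal sketch, assuming Rallfun has been sourced and that age and cstl (hypothetical names) are vectors:

    lplot(age, cstl, xout = TRUE)  # Cleveland's smoother, outliers among x removed
    rplot(age, cstl, xout = TRUE)  # running-interval smoother (20% trimmed mean by default)
    qsm(age, cstl)                 # quantile regression smoother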

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).


Figure 8.42.9 Nonparametric estimates of the regression lines for: (A) age predicting CSTL; (B) age predicting GORTFL; (C) CSTL predicting GORTFL; (D) GORTFL predicting CSTL.

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend approximately at GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields


p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
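A minimal sketch of this split analysis, assuming Rallfun has been sourced; gortfl and cstl are hypothetical names, and the split at 70 follows the text:

    lo <- gortfl < 70
    regci(gortfl[lo], cstl[lo])    # slope for GORTFL < 70, Theil-Sen estimator by default
    regci(gortfl[!lo], cstl[!lo])  # slope for GORTFL > 70
    reg2ci(gortfl[lo], cstl[lo], gortfl[!lo], cstl[!lo])  # test that the two slopes are equal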

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice for a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6 A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017). (A sketch of some of these plotting and outlier functions is given after this list.)

For very small sample sizes, say n ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical


difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests, ANOVA F tests on means, and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).
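A minimal sketch of some of the functions named in this list, assuming Rallfun has been sourced; g1, g2, x, y, and X are hypothetical data objects:

    akerd(g1)                 # kernel density estimate for one group
    g5plot(list(g1, g2))      # overlay density estimates for several groups
    outpro(X)                 # projection-type multivariate outlier detection (avoids masking)
    outmgv(X)                 # MGV outlier detection
    rplot(x, y, xout = TRUE)  # smoother with outliers among the independent variable removed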

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say n ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals:

Use the sign test via the R function signt. If the sample sizes are ≤10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband. (A minimal sketch of these options follows.)
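A minimal sketch, assuming Rallfun has been sourced and that x and y (hypothetical names) are vectors for the two dependent groups:

    signt(x, y, SD = TRUE)  # sign test; SD = TRUE for sample sizes of 10 or less
    sintv2(x - y)           # confidence interval for the median of the difference scores
    trimcipb(x - y)         # percentile bootstrap CI, 20% trimmed mean of the differences
    lband(x, y)             # compare the quantiles of the marginal distributions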

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which, of course, is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb. (A minimal sketch of these options follows.)
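A minimal sketch, assuming Rallfun has been sourced; g1 and g2 are hypothetical vectors, and glist is a hypothetical list with one vector per group (argument conventions can vary across Rallfun versions):

    medpb2(g1, g2)   # percentile bootstrap comparison of the medians
    sband(g1, g2)    # confidence band for quantile comparisons
    cidv2(g1, g2)    # P(X < Y) via Cliff's method
    linconpb(glist)  # multiple groups: pairwise comparisons of 20% trimmed means
    cidmcp(glist)    # multiple groups: pairwise versions of Cliff's method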

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7 CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
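For example, using base R:

    pvals <- c(0.001, 0.01, 0.02, 0.04, 0.20)  # hypothetical p-values from five tests
    p.adjust(pvals, method = 'hochberg')       # Hochberg-adjusted p-values
    p.adjust(pvals, method = 'hommel')         # Hommel's method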

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the


illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression, where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when, in fact, there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness. British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hand, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistical Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 9, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics-Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLoS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: ProEd Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics-Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


Figure 8.42.4 Homoscedasticity (A) and heteroscedasticity (B). (A) The variance of the dependent variable is the same at any age. (B) The variance of the dependent variable can depend on the age of the participants.

conditions, methods that assume homoscedasticity are using an incorrect estimate of the standard error when in fact there is heteroscedasticity. Indeed, there are concerns regardless of how large the sample size might be. As we consider more and more complicated designs, heteroscedasticity becomes an increasing concern.

When dealing with regression, homoscedasticity means that the variance of the dependent variable does not depend on the value of the independent variable. When dealing with age and depressive symptoms, for example, homoscedasticity means that the variation in measures of depressive symptoms at age 23 is the same at age 80 or any age in between, as illustrated in Figure 8.42.4.

Independence implies homoscedasticity. So for this particular situation, classic methods associated with least squares regression, Pearson's correlation, Kendall's tau, and Spearman's rho are using a correct estimate of the standard error, which helps explain why they perform well in terms of Type I errors when there is no association. That is, when a homoscedastic method rejects, it is reasonable to conclude that there is an association, but

in terms of inferring the nature of the association, these methods can perform poorly. Again, a practical concern is that when there is heteroscedasticity, homoscedastic methods use an incorrect estimate of the standard error, which can result in poor power and erroneous conclusions.

A seemingly natural way of salvaging homoscedastic methods is to test the assumption that there is homoscedasticity. But six studies summarized in Wilcox (2017c) found that this strategy is unsatisfactory. Presumably situations are encountered where this is not the case, but it is difficult and unclear how to determine when such situations are encountered.

Methods that are designed to deal with heteroscedasticity have been developed and are easily applied using extant software. These techniques use a correct estimate of the standard error regardless of whether the homoscedasticity assumption is true. A general recommendation is to always use a heteroscedastic method given the goal of comparing measures of central tendency, or making inferences about regression parameters as well as measures of association.


Figure 8.42.5 (A) Density functions for the standard normal distribution (solid line) and the mixed normal distribution (dashed line). These distributions have an obvious similarity, yet the variances are 1 and 10.9. (B) Box plots of means, medians, and 20% trimmed means when sampling from a normal distribution. (C) Box plots of means, medians, and 20% trimmed means when sampling from a mixed normal distribution. In panels B and C, each distribution has 10,000 values, and each of these values was obtained by computing the mean, median, or trimmed mean of 30 randomly generated observations.

2.3 Outliers

Even small departures from normality can devastate power. The modern illustration of this fact stems from Tukey (1960) and is based on what is generally known as a mixed normal distribution. The mixed normal considered by Tukey means that with probability 0.9 an observation is sampled from a standard normal distribution; otherwise an observation is sampled from a normal distribution having mean zero and standard deviation 10. Figure 8.42.5A shows a standard normal distribution and the mixed normal discussed by Tukey. Note that in the center of the distributions, the mixed normal is below the normal distribution. But for the two ends of the mixed normal distribution (the tails), the mixed normal lies above the normal distribution. For this reason, the mixed normal is often described as having heavy tails. In general, heavy-tailed distributions roughly refer to distributions where outliers are likely to occur.

Here is an important point: the standard normal has variance 1, but the mixed normal has variance 10.9. That is, the population variance can be overly sensitive to slight changes in the tails of a distribution. Consequently, even a slight departure from normality can result in relatively poor power when using any method based on the mean.

Put another way, samples from the mixed normal are more likely to result in outliers compared to samples from a standard normal distribution. As previously indicated, outliers inflate the sample variance, which can negatively impact power when using means. Another concern is that they can give a distorted and misleading summary regarding the bulk of the participants.

The first indication that heavy-tailed distributions are a concern stems from a result derived by Laplace about two centuries ago. In modern terminology, he established that as we move from a normal distribution to a


distribution more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hand, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that under normality the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A. Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller, contrary to what might be expected based on standard training.
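This simulation is easy to reproduce in base R; here is a minimal sketch:

    set.seed(1)
    rmixnorm <- function(n) {  # Tukey's mixed normal: sd 1 with probability 0.9, sd 10 otherwise
      rnorm(n, sd = ifelse(runif(n) < 0.9, 1, 10))
    }
    res <- t(replicate(10000, {
      x <- rmixnorm(30)
      c(mean = mean(x), median = median(x), tmean = mean(x, trim = 0.2))
    }))
    boxplot(as.data.frame(res))  # compare the three sampling distributions
    apply(res, 2, sd)            # estimated standard errors of the three estimators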

Of course, a more important issue is whether the median or 20% trimmed mean ever have substantially smaller standard errors based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality. But as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply

ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and median. Because the choice between a 20% trimmed mean and median is not straightforward in terms of maximizing power, it is suggested that in exploratory studies both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense about the association among the bulk of the participants under study. In particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense about the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes when testing hypotheses and outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).
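A minimal sketch, assuming Rallfun has been sourced and that x and y (hypothetical names) are vectors:

    scor(x, y)    # skipped correlation: remove projection-based outliers, then Pearson's correlation
    scorci(x, y)  # confidence interval and p-value allowing heteroscedasticity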

2.4 Curvature

Typically, a regression line is assumed to be straight. In some situations, this approach seems to suffice. However, it cannot be stressed too strongly that there is substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here it is merely remarked that among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother


(Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
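For example, a minimal sketch, assuming Rallfun has been sourced:

    rplot(x, y, est = median)  # running-interval smoother estimating the median of y given x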

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model for some appropriate choice for the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that, indeed, the regression line is straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3 DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, generally these strategies are relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions

A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly in terms of controlling the Type I error when there is heteroscedasticity. Then various methods for testing homoscedasticity are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency and when dealing with regression, as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes

There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

|X − X̄|/s > 2

Equation 8.42.1

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from

BehavioralNeuroscience

84211

Current Protocols in Neuroscience Supplement 82

masking simply meaning that the very pres-ence of outliers causes them to be missed (egRousseeuw amp Leroy 1987 Wilcox 2017a)

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The last two observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.
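
To see the masking effect concretely, the following base R snippet (a minimal sketch using only base functions) applies the rule given by Equation 8.42.1 to these values:

    x <- c(1, 2, 2, 3, 4, 6, 100, 100)
    z <- abs(x - mean(x)) / sd(x)  # criterion in Equation 8.42.1
    round(z, 2)
    # The two values of 100 yield |100 - 27.25| / 44.93, approximately 1.62,
    # so neither exceeds 2 and neither is declared an outlier.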

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD-median rule, where MAD denotes the median absolute deviation to the median.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample, and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

$|X - M|/(\mathrm{MAD}/0.6745) > 2.24$ (Equation 8.42.2)

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation, and of course M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only rather than using a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers: 71.53. But the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
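
In base R, the MAD-median rule is easy to apply, because the function mad uses the consistency constant 1.4826 ≈ 1/0.6745 by default, so mad(x) already estimates MAD/0.6745. A minimal sketch using the values above:

    x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
    M <- median(x)
    abs(x - M) / mad(x) > 2.24  # mad() scales by 1.4826 = 1/0.6745 by default
    # Returns TRUE for the two values of 71.53 and FALSE otherwise.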

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that if the remaining data are dependent, then they are correlated, which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue when using a 20% trimmed mean or median is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil-Sen estimator as well as the MM-estimator are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that under general conditions these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution-free technique via the R function sintv2. Distribution-free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases. The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software/.
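
For example, assuming the file Rallfun-v33.txt has been downloaded to the current working directory, the functions used in this unit can be made available with:

    source("Rallfun-v33.txt")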

All recommended methods deal with heteroscedasticity. When comparing groups whose distributions differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory in terms of Type I errors when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample with replacement n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample with replacement m observations. Next, based on these bootstrap samples, compute the sample medians, say $M^*_x$ and $M^*_y$. Let $D^* = M^*_x - M^*_y$. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times $D^* < 0$, as well as the proportion of times $D^* = 0$. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
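
The following is a minimal sketch of this strategy, using base R only, meant to convey the logic; the function name and arguments here are ours, and in practice medpb2 should be used:

    medpb2.sketch <- function(x, y, nboot = 2000, alpha = 0.05) {
      Dstar <- replicate(nboot,
        median(sample(x, replace = TRUE)) - median(sample(y, replace = TRUE)))
      pstar <- mean(Dstar < 0) + 0.5 * mean(Dstar == 0)  # generalized p-value
      list(ci = quantile(Dstar, c(alpha / 2, 1 - alpha / 2)),  # 0.95 CI by default
           p.value = 2 * min(pstar, 1 - pstar))                # two-sided
    }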

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with a relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
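
For example, assuming the Rallfun functions have been sourced and x and y contain the data for the two groups:

    yuen(x, y)           # Yuen's method, 20% trimming by default (tr = 0.2)
    yuen(x, y, tr = 0)   # no trimming: reduces to Welch's method for means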

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0), and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
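
In R these computations take one line each; a sketch reproducing this small example:

    x <- c(1, 2, 3)
    y <- c(1, 2.5, 4)
    D <- as.vector(outer(x, y, "-"))  # all pairwise differences Xi - Yk
    mean(D < 0)                       # estimate of P(X < Y): 5/9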

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

$H_0$: $P(X < Y) = 0.5$ (Equation 8.42.3)

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

$H_0$: $\theta_D = 0$ (Equation 8.42.4)

Certainly P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullens, 2012). (We are not aware of any commercial software package that contains this method.)
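
A usage sketch, again assuming the Rallfun functions have been sourced:

    cidv2(x, y)  # Cliff's heteroscedastic method for H0: P(X < Y) = 0.5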

However, there is the complication that for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.
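
A usage sketch (the argument q, specifying which quantiles to compare, follows the Rallfun conventions):

    # compare the deciles of two independent groups using the
    # Harrell-Davis estimator and a percentile bootstrap
    qcomhd(x, y, q = seq(0.1, 0.9, by = 0.1))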

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

$H_0$: $x_q + x_{1-q} = 0$ (Equation 8.42.5)

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are at least 10. For q = 0.1, sample sizes of at least 20 should be used (Wilcox, 2012).
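
A usage sketch (we assume the quantile is specified via the argument q, as in Wilcox, 2017a):

    cbmhd(x, y, q = 0.25)  # tests H0: x_q + x_(1-q) = 0 for q = 0.25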

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM as well as related estimators. The R function pb2gen applies this method.
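
A usage sketch, where the estimator is passed via the argument est:

    pb2gen(x, y, est = mom)     # percentile bootstrap comparing MOM estimates
    pb2gen(x, y, est = median)  # the same strategy applied to medians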

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD: the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And in general, it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.
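
For paired data stored in vectors x and y, a sketch of the computations just described:

    D <- x - y   # difference scores
    sintv2(D)    # distribution-free confidence interval for the median of D
    trimpb(D)    # percentile bootstrap for the 20% trimmed mean of D (tr = 0.2)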

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmedpb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue, in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest, using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
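
Usage sketches, with the data stored in a list x containing one vector per group (a matrix with one column per group can also be used):

    linconpb(x, tr = 0.2)  # all pairwise comparisons, percentile bootstrap
    lincon(x, tr = 0.2)    # non-bootstrap analog based on trimmed means
    medpb(x)               # all pairwise comparisons based on medians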

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

$H_0$: $\rho_1 = \rho_2$ (Equation 8.42.6)

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.
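
Usage sketches, where x1 and y1 contain the data for the first group and x2 and y2 the data for the second:

    twopcor(x1, y1, x2, y2)    # modified bootstrap CI for rho1 - rho2
    twohc4cor(x1, y1, x2, y2)  # HC4-based test of H0: rho1 = rho2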

As previously noted, Pearson's correlation is not robust: even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, now a basic percentile bootstrap method can be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern, because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.
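
A usage sketch, where x is a matrix whose columns contain the independent variables and y is the dependent variable; by default the first two columns are compared, and the Theil-Sen estimator is used (argument names follow the Rallfun conventions):

    regIVcom(x, y)                # compare association strengths of X1 and X2
    regIVcom(x, y, regfun = ols)  # least squares version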

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions, described in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique, and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with the two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. The latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ, compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
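
Usage sketches for two groups with a single covariate, with the data in x1, y1 and x2, y2 (we assume the xout argument is available in these functions, as described above):

    lplot2g(x1, y1, x2, y2, xout = TRUE)  # Cleveland smoother, both groups
    rplot2g(x1, y1, x2, y2, xout = TRUE)  # running interval smoother version
    ancova(x1, y1, x2, y2)                # robust ANCOVA, non-parallel lines allowed
    ancdet(x1, y1, x2, y2)                # method TAP: more detailed comparisons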

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F test rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians, via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations, shown in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions, and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians (the largest p-value is p = 0.005). In contrast, qcomhd indicates a significant difference for all of the deciles, excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures taken 1 and 7 days after surgery were compared to a sham-treated control group that received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test, followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Here, the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8 Box plots of the number of GAD-67-positive cells for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala (conditions: sham, day 1, and day 7).

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb yields p = 0.031. So focusing on θD, the median of all pairwise differences, rather than on the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother, as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable.


Figure 8.42.9 Nonparametric estimates of the regression lines for the reading data: (A) predicting CSTL with age, (B) predicting GORTFL with age, (C) predicting GORTFL with CSTL, and (D) predicting CSTL with GORTFL.

Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL). Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend approximately at GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci), p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
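
A sketch of these computations, where gortfl and cstl are hypothetical vector names for the two measures:

    flag <- gortfl < 70                # split at the apparent bend
    regci(gortfl[flag], cstl[flag])    # Theil-Sen slope for GORTFL < 70
    regci(gortfl[!flag], cstl[!flag])  # Theil-Sen slope for GORTFL > 70
    reg2ci(gortfl[flag], cstl[flag],   # test that the two slopes are equal
           gortfl[!flag], cstl[!flag])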

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates the point made earlier that the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.11.7, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward via the R functions mentioned here: set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals:

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes > 10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare; a hedged sketch of these function calls is given below.
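To make these options concrete, here is a hypothetical sketch of the calls, assuming the WRS functions have been sourced as described in Section 4; the data, and the exact argument conventions, are assumptions for illustration only:

    x <- c(12, 15, 9, 14, 11, 13, 10, 16)   # condition 1 (made-up data)
    y <- c(10, 13, 10, 11, 9, 12, 8, 14)    # condition 2, same participants
    signt(x, y, SD = TRUE)   # sign test; SD = TRUE for sample sizes <= 10
    sintv2(x - y)            # median of the difference scores
    trimcipb(x - y)          # 20% trimmed mean of the difference scores
    lband(x, y)              # compare quantiles of the marginal distributions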

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.
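As a hedged illustration of the two-group options listed above, again assuming the WRS functions are available (the data are invented):

    g1 <- c(4.1, 5.3, 2.2, 6.0, 3.7, 4.8, 5.1, 3.9)   # group 1
    g2 <- c(6.2, 7.1, 5.9, 8.3, 6.5, 7.7, 5.6, 7.0)   # group 2
    medpb2(g1, g2)    # medians via the percentile bootstrap
    sband(g1, g2)     # all quantiles, in the manner of Doksum and Sievers
    cidv2(g1, g2)     # P(X < Y) via Cliff's method
    trimpb2(g1, g2)   # 20% trimmed means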

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be > 0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
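For example, both improvements are available in base R through the function p.adjust; the five p-values below are invented purely for illustration:

    pvals <- c(0.001, 0.012, 0.034, 0.041, 0.20)  # hypothetical p-values
    p.adjust(pvals, method = "hochberg")          # Hochberg (1988)
    p.adjust(pvals, method = "hommel")            # Hommel (1988)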

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression, where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Konigsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness. British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hand, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistical Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 9, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederholt, J. L., & Bryant, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: ProEd Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


Figure 8.42.5 (A) Density functions for the standard normal distribution (solid line) and the mixed normal distribution (dashed line). These distributions have an obvious similarity, yet the variances are 1 and 10.9. (B) Box plots of means, medians, and 20% trimmed means when sampling from a normal distribution. (C) Box plots of means, medians, and 20% trimmed means when sampling from a mixed normal distribution. In panels B and C, each distribution has 10,000 values, and each of these values was obtained by computing the mean, median, or trimmed mean of 30 randomly generated observations.

2.3 Outliers

Even small departures from normality can devastate power. The modern illustration of this fact stems from Tukey (1960) and is based on what is generally known as a mixed normal distribution. The mixed normal considered by Tukey means that with probability 0.9 an observation is sampled from a standard normal distribution; otherwise an observation is sampled from a normal distribution having mean zero and standard deviation 10. Figure 8.42.5A shows a standard normal distribution and the mixed normal discussed by Tukey. Note that in the center of the distributions, the mixed normal is below the normal distribution. But for the two ends of the mixed normal distribution (the tails), the mixed normal lies above the normal distribution. For this reason, the mixed normal is often described as having heavy tails. In general, heavy-tailed distributions roughly refer to distributions where outliers are likely to occur.

Here is an important point. The standard normal has variance 1, but the mixed normal has variance 10.9. That is, the population variance can be overly sensitive to slight changes in the tails of a distribution. Consequently, even a slight departure from normality can result in relatively poor power when using any method based on the mean.

Put another way, samples from the mixed normal are more likely to result in outliers compared to samples from a standard normal distribution. As previously indicated, outliers inflate the sample variance, which can negatively impact power when using means. Another concern is that they can give a distorted and misleading summary regarding the bulk of the participants.

The first indication that heavy-tailed distributions are a concern stems from a result derived by Laplace about two centuries ago. In modern terminology, he established that as we move from a normal distribution to a distribution more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hand, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that under normality the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent to which this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A. Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller, contrary to what might be expected based on standard training.
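A minimal base-R sketch of this simulation (our own illustration of the idea, not the code behind Figure 8.42.5):

    set.seed(1)
    rmixnorm <- function(n)             # Tukey's mixed normal: with
      ifelse(runif(n) < 0.1,            # probability 0.1 sample from
             rnorm(n, sd = 10),         # N(0, 100), otherwise from N(0, 1)
             rnorm(n))
    est <- replicate(10000, {
      x <- rmixnorm(30)
      c(mean(x), median(x), mean(x, trim = 0.2))
    })
    apply(est, 1, sd)   # empirical standard errors: largest for the mean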

Of course, a more important issue is whether the median or 20% trimmed mean ever has substantially smaller standard errors based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality. But as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and median. Because the choice between a 20% trimmed mean and median is not straightforward in terms of maximizing power, it is suggested that in exploratory studies both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense about the association among the bulk of the participants under study. In particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense about the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes when testing hypotheses and outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).
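A hypothetical illustration, assuming the WRS functions are available; the data and the two planted outliers are invented:

    set.seed(2)
    x <- rnorm(40); y <- 0.5 * x + rnorm(40)
    x[1:2] <- c(9, 10); y[1:2] <- c(-9, -10)   # two properly placed outliers
    cor(x, y)      # Pearson's correlation, distorted by the two points
    scor(x, y)     # skipped correlation: outliers removed via projection
    scorci(x, y)   # confidence interval allowing heteroscedasticity, p-value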

2.4 Curvature

Typically, a regression line is assumed to be straight. In some situations, this approach seems to suffice. However, it cannot be stressed too strongly that there is substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here it is merely remarked that among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother (Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
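A brief sketch of both smoothers, assuming the WRS functions are available; the curved data are simulated for illustration:

    set.seed(3)
    x <- runif(100, 0, 10)
    y <- sin(x / 2) + rnorm(100, sd = 0.3)   # a curved association
    lplot(x, y)                 # Cleveland's smoother
    rplot(x, y)                 # running-interval smoother; defaults to the
                                # 20% trimmed mean of Y given X
    rplot(x, y, est = median)   # another measure of central tendency via est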

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model, for some appropriate choice for the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that indeed the regression line is straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3. DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, generally these strategies are relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions

A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly in terms of controlling the Type I error when there is heteroscedasticity. Then various methods for testing homoscedasticity are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes

There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

|X − X̄| / s > 2,     (Equation 8.42.1)

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The last two observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.
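This is easy to verify in base R:

    x <- c(1, 2, 2, 3, 4, 6, 100, 100)
    sd(x)                          # about 44.9, inflated by the two outliers
    abs(x - mean(x)) / sd(x) > 2   # all FALSE: no value is flagged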

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M| / (MAD/0.6745) > 2.24.     (Equation 8.42.2)

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation, and of course M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only rather than use a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers, 71.53. But the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
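The same illustration in base R:

    x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
    M <- median(x)
    MAD <- median(abs(x - M))            # equivalently, mad(x, constant = 1)
    abs(x - M) / (MAD / 0.6745) > 2.24   # flags both values of 71.53
    abs(x - mean(x)) / sd(x) > 2         # Equation 8.42.1 flags nothing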

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that the remaining data are dependent (they are correlated), which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue when using a 20% trimmed mean or median is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil-Sen estimator as well as the MM-estimator are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that under general conditions these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution-free technique via the R function sintv2. Distribution-free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases. The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly, MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.
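For instance (the install_github call reflects the instructions at the GitHub page at the time of writing and should be treated as an assumption):

    # Option 1: source the file after downloading it
    source("Rallfun-v33.txt")
    # Option 2: install the WRS package from GitHub
    # install.packages("devtools")
    # devtools::install_github("nicebread/WRS", subdir = "pkg")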

All recommended methods deal with heteroscedasticity. When comparing groups whose distributions differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory in terms of Type I errors when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample with replacement n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample with replacement m observations. Next, based on these bootstrap samples, compute the sample medians, say M*_x and M*_y. Let D* = M*_x − M*_y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
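The following base-R sketch mirrors what medpb2 automates; the two small samples are simulated, and the p-value computation is simplified relative to the complete method described in Wilcox (2017a, 2017c):

    set.seed(4)
    x <- rexp(8); y <- rexp(8) + 0.5   # two small samples, n = m = 8
    Dstar <- replicate(2000,
      median(sample(x, replace = TRUE)) - median(sample(y, replace = TRUE)))
    pstar <- mean(Dstar < 0) + 0.5 * mean(Dstar == 0)
    2 * min(pstar, 1 - pstar)          # two-sided p-value
    quantile(Dstar, c(0.025, 0.975))   # 95% confidence interval for D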

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with the relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method used in conjunction with the median continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
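Assuming the WRS functions are available (the data are made up):

    g1 <- c(2.3, 4.1, 3.3, 5.9, 4.4, 3.0)
    g2 <- c(5.2, 6.8, 4.9, 7.4, 6.1, 5.8)
    yuen(g1, g2)           # 20% trimming by default (argument tr = 0.2)
    yuen(g1, g2, tr = 0)   # with no trimming, reduces to Welch's method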

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0) and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
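In base R the estimate can be computed with outer:

    x <- c(1, 2, 3)
    y <- c(1, 2.5, 4)
    D <- outer(x, y, "-")   # the nine pairwise differences Dik = Xi - Yk
    mean(D < 0)             # 5/9, the estimate of P(X < Y)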

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5.     (Equation 8.42.3)

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0.     (Equation 8.42.4)

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullen, 2012). (We are not aware of any commercial software package that contains this method.)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let x_q be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that x_q + x_{1−q} = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x_{0.25} = −4 and x_{0.75} = 6, then for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: x_q + x_{1−q} = 0     (Equation 8.42.5)

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥ 10. For q = 0.1, sample sizes ≥ 20 should be used (Wilcox, 2012).

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions; no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM as well as related estimators. The R function pb2gen applies this method.
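A minimal sketch, assuming the est argument of pb2gen accepts the MOM estimator (the function mom in Wilcox's code) and that his functions are loaded:

g1 <- rnorm(40); g2 <- rlnorm(40)   # hypothetical groups
pb2gen(g1, g2, est = mom)           # percentile bootstrap comparison based on MOM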

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X_1, Y_1), ..., (X_n, Y_n), let D_i = X_i − Y_i

(i = 1, ..., n). Even when X and Y are correlated, μ_x − μ_y = μ_D; that is, the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θ_x − θ_y ≠ θ_D.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θ_D for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method is recommended that was derived by Hettmansperger and Sheather (1986). When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error.


And in general it is an excellent method for making inferences about θ_D. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.
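For paired data, a hedged sketch with hypothetical vectors of equal length:

x <- rnorm(25); y <- x + rnorm(25, mean = 0.3)  # hypothetical paired measurements
trimpb(x - y)   # percentile bootstrap for the 20% trimmed mean of the differences
trimci(x - y)   # Tukey-McLaughlin confidence interval, for larger samples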

In some situations there might be interest in comparing measures of central tendency associated with the marginal distributions rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θ_x = θ_y, rather than H0: θ_D = 0. The R function dmedpb tests the first of these hypotheses and performs relatively well even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high even with a moderately large sample size.
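A brief sketch of both versions, using hypothetical paired vectors; the SD argument is taken from the description above.

x <- rnorm(25); y <- x + rnorm(25, mean = 0.3)  # hypothetical paired measurements
signt(x, y)              # Agresti-Coull method; returns a p-value
signt(x, y, SD = TRUE)   # Schilling-Doi method; coverage at least 1 - alpha, but slower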

A criticism of the sign test is that its power might be lower than the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
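A hedged sketch for a one-way design; supplying the groups as a list follows a common convention for Wilcox's functions, and the data are hypothetical.

g1 <- rnorm(40); g2 <- rlnorm(40); g3 <- rnorm(40, mean = 0.5)  # hypothetical groups
dat <- list(g1, g2, g3)
lincon(dat)     # all pairwise comparisons based on 20% trimmed means, non-bootstrap
linconpb(dat)   # percentile bootstrap version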

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρ_j (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ_1 = ρ_2

Equation 8.42.6

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ_1 − ρ_2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ_1 − ρ_2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.
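A hedged sketch of both approaches, with hypothetical paired measurements (x1, y1) and (x2, y2) for the two groups; the argument order is an assumption.

x1 <- rnorm(30); y1 <- x1 + rnorm(30)         # hypothetical group 1
x2 <- rnorm(30); y2 <- 0.5 * x2 + rnorm(30)   # hypothetical group 2
twopcor(x1, y1, x2, y2)     # modified percentile bootstrap; 0.05 level only, no p-value
twohc4cor(x1, y1, x2, y2)   # HC4-based test of H0: rho1 = rho2; returns a p-value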

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, now a basic percentile bootstrap method can be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.
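Continuing the sketch with the same hypothetical data (and the same assumed argument order):

ols2ci(x1, y1, x2, y2)   # compare least squares slopes, heteroscedasticity allowed
reg2ci(x1, y1, x2, y2)   # compare robust slopes (Theil-Sen assumed as the default)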

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable, Y, and two independent variables, X_1 and X_2.

A common and fundamental goal is understanding the relative importance of X_1 and X_2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X_1 and Y, another between X_2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011).


For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ_1 = ρ_2, where now ρ_j (j = 1, 2) is the correlation between Y and X_j. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X_1 can depend on whether X_2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X_1 and X_2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived when there is curvature (Wilcox, in press), but it is limited to two independent variables.
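A hedged sketch; supplying the independent variables as the columns of a matrix is an assumed convention, and the data are hypothetical.

x1 <- rnorm(50); x2 <- rnorm(50)   # hypothetical independent variables
y <- x1 + rnorm(50)                # hypothetical dependent variable
regIVcom(cbind(x1, x2), y)         # tests equal association strength for x1 and x2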

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association.

Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
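A hedged sketch of this workflow for two hypothetical groups, each with a covariate and an outcome; the argument order is an assumption.

x1 <- rnorm(40); y1 <- x1 + rnorm(40)         # hypothetical group 1: covariate, outcome
x2 <- rnorm(40); y2 <- 0.5 * x2 + rnorm(40)   # hypothetical group 2
lplot2g(x1, y1, x2, y2, xout = TRUE)  # plot both smoothed regression lines, outliers in x removed
ancova(x1, y1, x2, y2)                # robust ANCOVA; the lines need not be parallel
ancdet(x1, y1, x2, y2)                # method TAP: detail on where and how the lines differ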

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but, because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh.


Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.

It is noted that, in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group that received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by Bonferroni post hoc tests. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Here, the day 1 group was compared again to the sham group based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala (y-axis: number of GAD-67-positive cells; conditions: sham, day 1, and day 7).

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θ_D, the median of all pairwise differences, rather than the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures based on the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable.


Figure 8.42.9 Nonparametric estimates of regression lines for the reading data: (A) CSTL as a function of age, (B) GORTFL as a function of age, (C) GORTFL as a function of CSTL, and (D) CSTL as a function of GORTFL.

Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL). Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27).


Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
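A hedged sketch of this two-segment analysis, with hypothetical stand-ins for the GORTFL and CSTL measures:

gort <- runif(53, 20, 140)    # hypothetical GORTFL scores
cstl <- rnorm(53, 0.5, 0.03)  # hypothetical CSTL measures
lo <- gort < 70                                    # split at the apparent bend
regci(gort[lo], cstl[lo])                          # Theil-Sen slope below 70
regci(gort[!lo], cstl[!lo])                        # Theil-Sen slope above 70
reg2ci(gort[lo], cstl[lo], gort[!lo], cstl[!lo])   # test whether the two slopes are equal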

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important via the R function regIVcom indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001); the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02,

and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data, where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power.


Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but, relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.11.7, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals:

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y).


This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors (a brief sketch follows the list):

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
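A hedged sketch combining these options for two hypothetical small samples, assuming Wilcox's functions are loaded:

a <- rnorm(10); b <- rnorm(10, mean = 1)  # hypothetical small samples
medpb2(a, b)    # medians via a percentile bootstrap
trimpb2(a, b)   # 20% trimmed means via a percentile bootstrap
cidv2(a, b)     # P(observation from first group < observation from second)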

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
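For example, these adjustments are available in base R; the raw p-values below are hypothetical:

pvals <- c(0.010, 0.015, 0.030, 0.040, 0.200)  # hypothetical raw p-values
p.adjust(pvals, method = "hochberg")
p.adjust(pvals, method = "hommel")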

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that those results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed.


Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITEDAgresti A amp Coull B A (1998) Approximate

is better than exact for interval estimation ofbinomial proportions American Statistician 52119ndash126 doi 1023072685469

Almeida-Suhett C P Prager E M PidoplichkoV Figueiredo T H Marini A M Li Z Braga M (2014) Reduced GABAergic inhi-bition in the basolateral amygdala and the de-velopment of anxiety-like behaviors after mildtraumatic brain injury PLoS One 9 e102627doi 101371journalpone0102627

Bessel F W (1818) Fundamenta Astronomiaepro anno MDCCLV deducta ex observationibusviri incomparabilis James Bradley in speculaastronomica Grenovicensi per annos 1750-1762institutis Konigsberg Friedrich Nicolovius

Bernhardson C (1975) Type I error rates whenmultiple comparison procedures follow a sig-nificant F test of ANOVA Biometrics 31 719ndash724 doi 1023072529724

Boik R J (1987) The Fisher-Pitman permuta-tion test A non-robust alternative to the nor-mal theory F test when variances are het-erogeneous British Journal of Mathematical

and Statistical Psychology 40 26ndash42 doi101111j2044-83171987tb00865x

Bradley J V (1978) Robustness British Jour-nal of Mathematical and Statistical Psychol-ogy 31 144ndash152 doi 101111j2044-83171978tb00581x

Brown M B amp Forsythe A (1974) Thesmall sample behavior of some statistics whichtest the equality of several means Techno-metrics 16 129ndash132 doi 10108000401706197410489158

Brunner E Domhof S amp Langer F (2002) Non-parametric analysis of longitudinal data in fac-torial experiments New York Wiley

Chung E amp Romano J P (2013) Exact andasymptotically robust permutation tests An-nals of Statistics 41 484ndash507 doi 10121413-AOS1090

Cleveland W S (1979) Robust locally weightedregression and smoothing scatterplots Journalof the American Statistical Association 74 829ndash836 doi 10108001621459197910481038

Cliff N (1996) Ordinal methods for behavioraldata analysis Mahwah NJ Erlbaum

Cressie N A C amp Whitford H J (1986) How touse the two sample t-test Biometrical Journal28 131ndash148 doi 101002bimj4710280202

Derksen S amp Keselman H J (1992) Back-ward forward and stepwise automated sub-set selection algorithms Frequency of obtain-ing authentic and noise variables British Jour-nal of Mathematical and Statistical Psychology45 265ndash282 doi 101111j2044-83171992tb00992x

Doksum K A amp Sievers G L (1976) Plot-ting with confidence Graphical comparisons oftwo populations Biometrika 63 421ndash434 doi101093biomet633421

Doksum K A amp Wong C-W (1983) Statisticaltests based on transformed data Journal of theAmerican Statistical Association 78 411ndash417doi 10108001621459198310477986

Duncan G T amp Layard M W (1973) AMonte-Carlo study of asymptotically robusttests for correlation Biometrika 60 551ndash558doi 101093biomet603551

Fagerland M W amp Leiv Sandvik L (2009) TheWilcoxon-Mann-Whitney test under scrutinyStatistics in Medicine 28 1487ndash1497 doi101002sim3561

Field A Miles J amp Field Z (2012) Discoveringstatistics using R Thousand Oaks CA SagePublishing

Grayson D (2004) Some myths and leg-ends in quantitative psychology Understand-ing Statistics 3 101ndash134 doi 101207s15328031us0302_3

Hampel F R Ronchetti E M Rousseeuw P Jamp Stahel W A (1986) Robust statistics NewYork Wiley

Hand A (1998) A History of mathematical statis-tics from 1750 to 1930 New York Wiley

A Guide to RobustStatisticalMethods

84228

Supplement 82 Current Protocols in Neuroscience

Harrell F E amp Davis C E (1982) A newdistribution-free quantile estimator Biometrika69 635ndash640 doi 101093biomet693635

Heritier S Cantoni E Copt S amp Victoria-FeserM-P (2009) Robust methods in biostatisticsNew York Wiley

Hettmansperger T P amp McKean J W (2011)Robust nonparametric statistical methods 2nded Boca Raton FL CRC Press

Hettmansperger T P amp Sheather S J (1986)Confidence interval based on interpolated orderstatistics Statistical Probability Letters 4 75ndash79 doi 1010160167-7152(86)90021-0

Hochberg Y (1988) A sharper Bonferroni pro-cedure for multiple tests of significanceBiometrika 75 800ndash802 doi 101093biomet754800

Hommel G (1988) A stagewise rejective multi-ple test procedure based on a modified Bon-ferroni test Biometrika 75 383ndash386 doi101093biomet752383

Houston S M Lebel C Katzir T Manis FR Kan E Rodriguez G R amp Sowell ER (2014) Reading skill and structural braindevelopment Neuroreport 25 347ndash352 doi101097WNR0000000000000121

Huber P J amp Ronchetti E (2009) Robust statis-tics 2nd ed New York Wiley

Keselman H J Othman A amp Wilcox RR (2016) Generalized linear model analy-ses for treatment group equality when dataare non-normal Journal of Modern and Ap-plied Statistical Methods 15 32ndash61 doi1022237jmasm1462075380

Mancini F (2016) ANA_mancini14_datazipfigshare Retreived from httpsfigsharecomarticlesANA_mancini14_data_zip3427766

Mancini F Bauleo A Cole J Lui F PorroC A Haggard P amp Iannetti G D (2014)Whole-body mapping of spatial acuity for painand touch Annals of Neurology 75 917ndash924doi 101002ana24179

Maronna R A Martin D R amp Yohai V J(2006) Robust statistics Theory and methodsNew York Wiley

Montgomery D C amp Peck E A (1992) Intro-duction to linear regression analysis New YorkWiley

Newcombe R G (2006) Confidence intervals foran effect size measure based on the Mann-Whitney statistic Part 1 General issues and tail-area-based methods Statistics in Medicine 25543ndash557 doi 101002sim2323

Nieuwenhuis S Forstmann B U amp Wagenmak-ers E-J (2011) Erroneous analyses of inter-actions in neuroscience a problem of signifi-cance Nature Neuroscience 9 1105ndash1107 doi101038nn2886

Ozdemir A F Wilcox R R amp Yildiztepe E(2013) Comparing measures of location Somesmall-sample results when distributions differin skewness and kurtosis under heterogene-ity of variances Communications in Statistics-

Simulation and Computations 42 407ndash424doi 101080036109182011636163

Pernet C R Wilcox R amp Rousselet G A(2012) Robust correlation analyses False pos-itive and power validation using a new opensource matlab toolbox Frontiers in Psychology3 606 doi 103389fpsyg201200606

Pernet C R Latinus M Nichols T E amp Rous-selet G A (2015) Cluster-based computa-tional methods for mass univariate analyses ofevent-related brain potentialsfields A simula-tion study Journal of Neuroscience Methods250 85ndash93 doi 101016jjneumeth201408003

Rasmussen J L (1989) Data transformation TypeI error rate and power British Journal of Math-ematical and Statistical Psychology 42 203ndash211 doi 101111j2044-83171989tb00910x

Romano J P (1990) On the behavior ofrandomization tests without a group invari-ance assumption Journal of the AmericanStatistical Association 85 686ndash692 doi10108001621459199010474928

Rousseeuw P J amp Leroy A M (1987) Ro-bust regression amp outlier detection New YorkWiley

Rousselet G A Foxe J J amp Bolam J P (2016)A few simple steps to improve the descrip-tion of group results in neuroscience EuropeanJournal of Neuroscience 44 2647ndash2651 doi101111ejn13400

Rousselet G A amp Pernet C R (2012) Improvingstandards in brain-behavior correlation analysesFrontiers in Human Neuroscience 6 119 doi103389fnhum201200119

Rousselet G A Pernet C R amp WilcoxR R (2017) Beyond differences in meansRobust graphical methods to compare twogroups in neuroscience European Jour-nal of Neuroscience 46 1738ndash1748 doi101111ejn1361013610

Ruscio J (2008) A probability-based measure ofeffect size Robustness to base rates and otherfactors Psychological Methods 13 19ndash30 doi1010371082-989X13119

Ruscio J amp Mullen T (2012) Confidenceintervals for the probability of superiority ef-fect size measure and the area under a re-ceiver operating characteristic curve Multivari-ate Behavioral Research 47 201ndash223 doi101080002731712012658329

Schilling M amp Doi J (2014) A cover-age probability approach to finding an op-timal binomial confidence procedure Ameri-can Statistician 68 133ndash145 doi 101080000313052014899274

Staudte R G amp Sheather S J (1990) Robustestimation and testing New York Wiley

Talebi V amp Baker C L Jr (2016) Categoricallydistinct types of receptive fields in early visualcortex Journal of Neurophysiology 115 2556ndash2576 doi 101152jn006592015

Tukey J W (1960) A survey of sampling from con-taminated normal distributions In I Olkin (Ed)

BehavioralNeuroscience

84229

Current Protocols in Neuroscience Supplement 82

Contributions to Probability and Statistics Es-says in honor of Harold Hotelling (pp 448ndash485) Stanford CA Stanford University Press

Tukey J W amp McLaughlin D H (1963) Lessvulnerable confidence and significance proce-dures for location based on a single sampleTrimmingWinsorization 1 Sankhya The In-dian Journal of Statistics 25 331ndash352

Wagenmakers E J Wetzels R Borsboom Dvan der Maas H L amp Kievit R A (2012) Anagenda for purely confirmatory research Per-spectives in Psychological Science 7 632ndash638doi 1011771745691612463078

Weissgerber T L Milic N M Winham SJ amp Garovic V D (2015) Beyond bar andline graphs Time for a new data presentationparadigm PLOS Biology 13 e1002128 doi101371journalpbio1002128

Welch B L (1938) The significance of the differ-ence between two means when the populationvariances are unequal Biometrika 29 350ndash362doi 101093biomet293-4350

Wiederhold J L amp Bryan B R (2001) Grayoral reading test 4th ed Austin TX ProEdPublishing

Wilcox R R (2009) Comparing Pearson cor-relations Dealing with heteroscedasticity andnon-normality Communications in Statistics-Simulation and Computation 38 2220ndash2234doi 10108003610910903289151

Wilcox R R (2012) Comparing two inde-pendent groups via a quantile generaliza-tion of the WilcoxonndashMannndashWhitney testJournal of Modern and Applied StatisticalMethods 11 296ndash302 doi 1022237jmasm1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III tests of achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


distribution is more likely to generate outliers, the standard error of the usual sample median can be smaller than the standard error of the mean (Hand, 1998). The first empirical evidence implying that outliers might be more common than what is expected under normality was reported by Bessel (1818).

To add perspective, we computed the mean, median, and a 20% trimmed mean based on 30 observations generated from a standard normal distribution. (Again, a 20% trimmed mean removes the 20% lowest and highest values and averages the remaining data.) Then we repeated this process 10,000 times. Box plots of the results are shown in Figure 8.42.5B. Theory tells us that under normality the variation of the sample means is smaller than the variation among the 20% trimmed means and medians, and Figure 8.42.5B provides perspective on the extent to which this is the case.

Now we repeat this process, only data are sampled from the mixed normal in Figure 8.42.5A. Figure 8.42.5C reports the results. As is evident, there is substantially less variation among the medians and 20% trimmed means. That is, despite trimming data, the standard errors of the median and 20% trimmed mean are substantially smaller than the standard error of the mean, contrary to what might be expected based on standard training.

Of course, a more important issue is whether the median or 20% trimmed mean ever have substantially smaller standard errors based on the data encountered in research. There are numerous illustrations that this is the case (e.g., Wilcox, 2017a, 2017b, 2017c).

There is the additional complication that the amount of trimming can substantially impact power, and the ideal amount of trimming, in terms of maximizing power, can depend crucially on the nature of the unknown distributions under investigation. The median performs best for the situation in Figure 8.42.5C, but situations are encountered where it trims too much, given the goal of minimizing the standard error. Roughly, a 20% trimmed mean competes reasonably well with the mean under normality. But as we move toward distributions that are more likely to generate outliers, at some point the median will have a smaller standard error than a 20% trimmed mean. Illustrations in Section 5 demonstrate that this is a practical concern.

It is not being suggested that the mere presence of outliers will necessarily result in higher power when using a 20% trimmed mean or median. But it is being argued that simply ignoring the potential impact of outliers can be a serious practical concern.

In terms of controlling the Type I error probability, effective techniques are available for both the 20% trimmed mean and median. Because the choice between a 20% trimmed mean and median is not straightforward in terms of maximizing power, it is suggested that in exploratory studies both of these estimators be considered.

When dealing with least squares regression or Pearson's correlation, again outliers are a serious concern. Indeed, even a single outlier might give a highly distorted sense about the association among the bulk of the participants under study. In particular, important associations might be missed. One of the more obvious ways of dealing with this issue is to switch to Kendall's tau or Spearman's rho. However, these measures of association do not deal with all possible concerns related to outliers. For instance, two outliers, properly placed, can give a distorted sense about the association among the bulk of the data (e.g., Wilcox, 2017b, p. 239).

A measure of association that deals with this issue is the skipped correlation, where outliers are detected using a projection method, these points are removed, and Pearson's correlation is computed using the remaining data. Complete details are summarized in Wilcox (2017a). This particular skipped correlation can be computed with the R function scor, and a confidence interval that allows heteroscedasticity can be computed with scorci. This function also reports a p-value when testing the hypothesis that the correlation is equal to zero. (See Section 3.2 for a description of common mistakes when testing hypotheses and outliers are removed.) MATLAB code is available as well (Pernet, Wilcox, & Rousselet, 2012).
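As a minimal sketch of how these functions might be used (simulated data, with illustrative variable names; it is assumed that Wilcox's functions have been loaded as described in Section 4):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(1)
x <- rnorm(40)
y <- x + rnorm(40)
y[1:2] <- c(8, 9)   # insert two artificial outliers
scor(x, y)          # skipped correlation: projection-based outlier removal,
                    # then Pearson's correlation on the remaining points
scorci(x, y)        # heteroscedasticity-robust confidence interval and p-value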

2.4 Curvature

Typically a regression line is assumed to be straight. In some situations this approach seems to suffice. However, it cannot be stressed too strongly that there is substantial literature indicating that this is not always the case. A vast array of new and improved nonparametric methods for dealing with curvature is now available, but complete details go beyond the scope of this paper. Here it is merely remarked that, among the many nonparametric regression estimators that have been proposed, generally known as smoothers, two that seem to be particularly useful are Cleveland's (1979) estimator, which can be applied via the R function lplot, and the running-interval smoother (Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable, given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
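For example (a sketch with simulated data, again assuming the functions have been loaded as described in Section 4):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(2)
x <- runif(100, -2, 2)
y <- x^2 + rnorm(100, sd = 0.5)   # an association with clear curvature
lplot(x, y)                # Cleveland's smoother
rplot(x, y)                # running-interval smoother (20% trimmed mean)
rplot(x, y, est = median)  # same smoother, estimating conditional medians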

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model, for some appropriate choice of the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that indeed the regression line is straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3 DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, generally these strategies are relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions

A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption, with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly, in terms of controlling the Type I error, when there is heteroscedasticity. Then various methods for testing homoscedasticity are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes

There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

|X − X̄|/s > 2

Equation 8.42.1

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The last two observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.
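This is easy to verify in base R with the values just given:

x <- c(1, 2, 2, 3, 4, 6, 100, 100)
abs(x - mean(x)) / sd(x)
# 0.58 0.56 0.56 0.54 0.52 0.47 1.62 1.62 -- no value exceeds 2,
# so Equation 8.42.1 flags nothing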

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, . . . , Xn denote a random sample, and let M be the usual sample median. MAD is the median of the values |X1 − M|, . . . , |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M|/(MAD/0.6745) > 2.24

Equation 8.42.2

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation, and of course M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only rather than use a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers, 71.53, but the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 0.25, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
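Both rules are easily checked in base R (note that R's mad() function rescales by 1.4826, approximately 1/0.6745, so mad(x) estimates MAD/0.6745 directly):

x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
abs(x - mean(x)) / sd(x) > 2        # Equation 8.42.1: flags nothing
abs(x - median(x)) / mad(x) > 2.24  # Equation 8.42.2: flags the two 71.53 values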

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that the remaining data are dependent (they are correlated), which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue when using a 20% trimmed mean or median is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil and Sen estimator, as well as the MM-estimator, are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that, under general conditions, these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done with a distribution-free technique via the R function sintv2. Distribution-free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases. The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4 COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.
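For example (a sketch; the install call reflects the instructions on the GitHub page linked above and should be treated as an assumption, as should the file path):

# Option 1: install and load the WRS package from GitHub
# install.packages("devtools")
# devtools::install_github("nicebread/WRS", subdir = "pkg")
# library(WRS)

# Option 2: source the collection of functions directly, assuming
# Rallfun-v33.txt has been downloaded to the working directory
source("Rallfun-v33.txt")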

All recommended methods deal with heteroscedasticity. When comparing groups and distributions that differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory in terms of Type I errors when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample, with replacement, n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample, with replacement, m observations. Next, based on these bootstrap samples, compute the sample medians, say Mx* and My*. Let D* = Mx* − My*. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
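For example (simulated skewed data; a sketch assuming the functions have been loaded as described above):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(3)
g1 <- rexp(10)          # n = 10
g2 <- rexp(12) + 0.5    # m = 12, shifted
medpb2(g1, g2)          # percentile bootstrap p-value and CI for the medians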

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with the relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
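For example (a sketch with simulated data):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(4)
g1 <- rnorm(12)
g2 <- rnorm(12, mean = 1)
yuen(g1, g2)            # Yuen's test, 20% trimming by default
yuen(g1, g2, tr = 0)    # no trimming: reduces to Welch's test for means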

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, . . . , Xn and Y1, . . . , Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, . . . , n; k = 1, . . . , m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0) and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
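In base R, this estimate takes one line (using the same toy values):

x <- c(1, 2, 3)
y <- c(1, 2.5, 4)
d <- outer(x, y, "-")   # the n-by-m matrix of differences Xi - Yk
mean(d < 0)             # estimate of P(X < Y): 5/9, about 0.56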

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5

Equation 8.42.3

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0

Equation 8.42.4

Cetainly P(X lt Y) has practical importancefor reasons summarized among others byCliff (1996) Ruscio (2008) and Newcombe(2006) The WMW test is based on an esti-mate P(X lt Y) However it is unsatisfactoryin terms of making inferences about this prob-ability because the estimate of the standarderror assumes that the distributions are iden-tical If the distributions differ an incorrectestimate of the standard error is being usedMore modern methods deal with this issueThe method derived by Cliff (1996) for testingEquation 8423 performs relatively well withsmall sample sizes and can be applied via the Rfunction cidv2 (compare to Ruscio amp Mullens2012) (We are not aware of any commercialsoftware package that contains this method)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.
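For example (a sketch with simulated skewed data; the argument q, used here to request all nine deciles, is an assumption based on Wilcox, 2017a):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(6)
g1 <- rlnorm(50)          # skewed group 1
g2 <- 1.5 * rlnorm(50)    # skewed group 2, larger spread
qcomhd(g1, g2, q = seq(0.1, 0.9, by = 0.1))   # compare the deciles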

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0

Equation 8.42.5

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥ 10. For q = 0.1, sample sizes ≥ 20 should be used (Wilcox, 2012).
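A sketch of its use (simulated data; the default quantile is taken to be q = 0.25, which is an assumption based on the description above):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(7)
g1 <- rnorm(30)
g2 <- rnorm(30, sd = 2)   # same medians, different spread
cbmhd(g1, g2)             # tests H0: x0.25 + x0.75 = 0 for D = X - Y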

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions; no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM, as well as related estimators. The R function pb2gen applies this method.
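For example (a sketch with simulated data; passing the MOM estimator via the argument est is an assumption based on Wilcox, 2017a):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(8)
g1 <- c(rnorm(20), 6, 7)     # group 1 with two clear outliers
g2 <- rnorm(20, mean = 0.5)
pb2gen(g1, g2, est = mom)    # percentile bootstrap comparing MOM estimates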

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), . . . , (Xn, Yn), let Di = Xi − Yi (i = 1, . . . , n). Even when X and Y are correlated, μx − μy = μD: the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method is recommended, which was derived by Hettmansperger and Sheather (1986). When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And in general, it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.
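For example (simulated paired data; a sketch only):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(9)
x <- rnorm(25)
y <- x + rnorm(25) + 0.3   # dependent measures
sintv2(x - y)              # distribution-free inference about the median of D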

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.

In some situations there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
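For example (a sketch; storing the groups in list mode, one element per group, follows the conventions described in Wilcox, 2017a, and should be treated as an assumption):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(10)
dat <- list(rnorm(15), rnorm(15, 0.5), rnorm(15, 1))  # three groups
lincon(dat)     # all pairwise comparisons of 20% trimmed means
linconpb(dat)   # percentile bootstrap version
medpb(dat)      # all pairwise comparisons based on medians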

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2

Equation 8.42.6

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association (independence implies homoscedasticity). But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.
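For example (a sketch with simulated data):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(11)
x1 <- rnorm(40); y1 <- x1 + rnorm(40)          # group 1
x2 <- rnorm(40); y2 <- 0.2 * x2 + rnorm(40)    # group 2
twopcor(x1, y1, x2, y2)     # modified bootstrap CI for rho1 - rho2
twohc4cor(x1, y1, x2, y2)   # HC4-based test of H0: rho1 = rho2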

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, now a basic percentile bootstrap method can be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.
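For example (a sketch; the four-argument layout mirrors the calls above and is an assumption based on Wilcox, 2017a):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(12)
x1 <- rnorm(40); y1 <- x1 + rnorm(40)
x2 <- rnorm(40); y2 <- 0.4 * x2 + rnorm(40)
ols2ci(x1, y1, x2, y2)   # compare least squares slopes
reg2ci(x1, y1, x2, y2)   # compare slopes based on a robust estimator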

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.
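A sketch (simulated data; it is assumed, following Wilcox, 2017a, that the two independent variables are supplied as the columns of a matrix, and the same layout is assumed for twoDcorR):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(13)
x <- matrix(rnorm(120), ncol = 2)              # X1 and X2 as columns
y <- 0.7 * x[, 1] + 0.2 * x[, 2] + rnorm(60)
TWOpov(x, y)      # compare cor(Y, X1) with cor(Y, X2)
twoDcorR(x, y)    # same comparison using a robust correlation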

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique, and perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups, using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
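A sketch of the general form of these calls (simulated data; the four-argument layout x1, y1, x2, y2 is an assumption based on Wilcox, 2017a):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(14)
x1 <- rnorm(50); y1 <- x1^2 + rnorm(50)          # group 1, curved association
x2 <- rnorm(50); y2 <- 0.5 * x2^2 + rnorm(50)    # group 2
rplot2g(x1, y1, x2, y2, xout = TRUE)  # overlay smoothers, outliers among
                                      # the independent variable removed
ancova(x1, y1, x2, y2)                # robust ANCOVA, lines not assumed parallel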

5 SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
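As a sketch of the general form of such an analysis (simulated data standing in for the actual measurements; the n-by-J matrix layout, one column per condition, is an assumption based on Wilcox, 2017a):

# assumes Wilcox's functions are loaded, e.g., source("Rallfun-v33.txt")
set.seed(15)
x <- matrix(rnorm(60), ncol = 6)   # 10 participants by 6 conditions
wmcppb(x)                # all pairwise comparisons of 20% trimmed means,
                         # probability of one or more Type I errors controlled
dmedpb(x[, 1], x[, 2])   # a single pair of dependent conditions, via medians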

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions, and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles, excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury
The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group that received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.4.2.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by Bonferroni post hoc tests. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Here the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.4.2.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala (y-axis: number of GAD-67-positive cells; conditions: sham, day 1, day 7).

The main point here is that the choice between Yuen and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.4.2.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.
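A minimal sketch of these comparisons, assuming sham and day1 are vectors holding the two groups (hypothetical names) and that the WRS functions have been sourced; pb2gen is a general-purpose two-group percentile bootstrap function, and using it with est = hd is one way to apply the Harrell-Davis estimate of the median:

medpb2(sham, day1)            # percentile bootstrap, usual sample medians
pb2gen(sham, day1, est = hd)  # percentile bootstrap, Harrell-Davis medians
wmwpb(sham, day1)             # test H0: the median of all pairwise
                              # differences is zero (Equation 8.4.2.4)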

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable.


Figure 8.4.2.9 Nonparametric estimates of the regression line for predicting GORTFL with CSTL. (Panel axes: A, CSTL as a function of age; B, GORTFL as a function of age; C, GORTFL as a function of CSTL; D, CSTL as a function of GORTFL.)

Figure 8.4.2.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL). Figure 8.4.2.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.4.2.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)
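These smoothers are easy to run. A minimal sketch, assuming age, cstl, and gortfl are vectors holding the data (hypothetical names):

lplot(age, cstl)     # Cleveland's smoother (as in Figure 8.4.2.9A)
lplot(cstl, gortfl)  # as in panel C; the dip might reflect outliers
rplot(cstl, gortfl)  # running-interval smoother, 20% trimmed mean of GORTFL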

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.4.2.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.4.2.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend approximately at GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci),


p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
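A sketch of this two-slope analysis, using the same hypothetical variable names:

flag <- gortfl < 70
regci(gortfl[flag], cstl[flag])     # Theil-Sen slope for GORTFL < 70
regci(gortfl[!flag], cstl[!flag])   # slope for GORTFL > 70
reg2ci(gortfl[flag], cstl[flag],    # test the hypothesis that the
       gortfl[!flag], cstl[!flag])  # two slopes are equal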

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice for a. However, for the situation at hand, the half slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important via the R function regIVcom indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.
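A sketch, again with hypothetical variable names; it is assumed here that regIVcom takes the matrix of independent variables as its first argument:

regIVcom(cbind(age, cstl), gortfl)                # Theil-Sen estimator (default)
regIVcom(cbind(age, cstl), gortfl, regfun = ols)  # least squares instead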

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001); the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and WJ word attack score

drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.
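A sketch, assuming wj, age, and gortrate hold the word attack scores, ages, and GORT rate scores (hypothetical names); the argument order shown for twohc4cor is an assumption based on the description above:

twohc4cor(age, wj, gortrate, wj)    # compare the two Pearson correlations
regIVcom(cbind(age, gortrate), wj)  # relative importance, Theil-Sen estimator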

6. A SUGGESTED GUIDE
Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.4.2.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a). (A brief sketch of these steps in R follows this list.)
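A compact sketch of several of these guidelines, assuming x and y are vectors holding an independent and a dependent variable (hypothetical names):

akerd(x)                  # kernel density estimate rather than a histogram
outpro(cbind(x, y))       # outlier detection that avoids masking
regci(x, y)               # Theil-Sen regression with heteroscedastic inferences
regci(x, y, xout = TRUE)  # same, after removing outliers among x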

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals (a brief sketch in R follows the list):

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.
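A sketch of these options, assuming x and y are vectors of paired measurements (hypothetical names; argument conventions may differ slightly across versions of the functions):

signt(x, y)      # sign test; set SD = TRUE when sample sizes are <= 10
sintv2(x - y)    # median of the difference scores
trimcipb(x - y)  # 20% trimmed mean of the difference scores
lband(x, y)      # compare the quantiles of the marginal distributions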

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors (a brief sketch in R follows the list):

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
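A sketch, assuming x and y hold the two independent samples (hypothetical names):

medpb2(x, y)   # medians via a percentile bootstrap
sband(x, y)    # compare all quantiles
cidv2(x, y)    # inferences about P(X < Y)
trimpb2(x, y)  # 20% trimmed means via a percentile bootstrap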

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS
It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
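For example, Hochberg's adjustment is available in base R via the function p.adjust; the p-values below are the duration comparisons reported in Section 5:

pvals <- c(0.001, 0.005, 0.009)       # p-values from the duration analysis
p.adjust(pvals, method = "hochberg")  # Hochberg-adjusted p-values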

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS
The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED
Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hand, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistical Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 9, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computations, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray oral reading test (4th ed.). Austin, TX: ProEd Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock Johnson-III tests of achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165



(Wilcox, 2017a), which can be applied with the R function rplot. Arguments can be made that other smoothers should be given serious consideration. Readers interested in these details are referred to Wilcox (2017a, 2017c).

Cleveland's smoother was initially designed to estimate the mean of the dependent variable given some value of the independent variable. The R function contains an option for dealing with outliers among the dependent variable, but it currently seems that the running-interval smoother is generally better for dealing with this issue. By default, the running-interval smoother estimates the 20% trimmed mean of the dependent variable, but any other measure of central tendency can be used via the argument est.
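For instance, assuming x and y are vectors holding the independent and dependent variable (hypothetical names):

rplot(x, y)                # default: 20% trimmed mean of y given x
rplot(x, y, est = median)  # estimate the median of y given x instead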

A simple strategy for dealing with curvature is to include a quadratic term. Let X denote the independent variable. An even more general strategy is to include X^a in the model for some appropriate choice for the exponent a. But this strategy can be unsatisfactory, as illustrated in Section 5.4 using data from a study dealing with fractional anisotropy and reading ability. In general, smoothers provide a more satisfactory approach.

There are methods for testing the hypothesis that a regression line is straight (e.g., Wilcox, 2017a). However, failing to reject does not provide compelling evidence that it is safe to assume that indeed the regression line is straight. It is unclear when this approach has enough power to detect situations where curvature is an important practical concern. The best advice is to plot an estimate of the regression line using a smoother. If there is any indication that curvature might be an issue, use modern methods for dealing with curvature, many of which are summarized in Wilcox (2017a, 2017c).

3. DEALING WITH VIOLATION OF ASSUMPTIONS

Based on conventional training, there are some seemingly obvious strategies for dealing with the concerns reviewed in the previous section. But by modern standards, generally these strategies are relatively ineffective. This section summarizes strategies that perform poorly, followed by a brief description of modern methods that give improved results.

3.1 Testing Assumptions
A seemingly natural strategy is to test assumptions. In particular, test the hypothesis that distributions have a normal distribution, and test the hypothesis that there is homoscedasticity. This approach generally fails, roughly because such tests do not have enough power to detect situations where violating these assumptions is a practical concern. That is, these tests can fail to detect situations that have an inordinately detrimental influence on statistical power and parameter estimation.

For example, Wilcox (2017a, 2017c) lists six studies aimed at testing the homoscedasticity assumption with the goal of salvaging a method that assumes homoscedasticity (compare to Keselman, Othman, & Wilcox, 2016). Briefly, these simulation studies generate data from a situation where it is known that homoscedastic methods perform poorly in terms of controlling the Type I error when there is heteroscedasticity. Then various methods for testing the homoscedasticity assumption are performed, and if they reject, a heteroscedastic method is used instead. All six studies came to the same conclusion: this strategy is unsatisfactory. Presumably there are situations where this strategy is satisfactory, but it is unknown how to accurately determine whether this is the case based on the available data. In practical terms, all indications are that it is best to always use a heteroscedastic method when comparing measures of central tendency, and when dealing with regression as well as measures of association such as Pearson's correlation.

As for testing the normality assumption, for instance using the Kolmogorov-Smirnov test, currently a better strategy is to use a more modern method that performs about as well as conventional methods under normality, but which continues to perform relatively well in situations where standard techniques perform poorly. There are many such methods (e.g., Wilcox, 2017a, 2017c), some of which are outlined in Section 4. These methods can be applied using extant software, as will be illustrated.

3.2 Outliers: Two Common Mistakes
There are two common mistakes regarding how to deal with outliers. The first is to search for outliers using the mean and standard deviation. For example, declare the value X an outlier if

|X − X̄| / s > 2      (Equation 8.4.2.1)

where X̄ and s are the usual sample mean and standard deviation, respectively. A problem with this strategy is that it suffers from


masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The last two observations appear to be clear outliers, yet the rule given by Equation 8.4.2.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.
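This is easy to verify in R:

x <- c(1, 2, 2, 3, 4, 6, 100, 100)
abs(x - mean(x)) / sd(x) > 2  # all FALSE: both values of 100 are masked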

This is not to suggest that all outliers will be missed; that is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD (median absolute deviation to the median)-median rule.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M| / (MAD/0.6745) > 2.24      (Equation 8.4.2.2)

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation, and of course M estimates the population mean. So the MAD-median rule is similar to using Equation 8.4.2.1, only rather than using a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers: 71.53. But the rule given by Equation 8.4.2.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
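A minimal base R version of the MAD-median rule (Equation 8.4.2.2) applied to these values:

x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
M <- median(x)
MAD <- median(abs(x - M))           # median absolute deviation
abs(x - M) / (MAD / 0.6745) > 2.24  # flags only the two values of 71.53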

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that the remaining data are dependent: they are correlated, which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise, the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue when using a 20% trimmed mean or median is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil-Sen estimator as well as the MM-estimator are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data
A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The


number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method
Standard training suggests a simple way of dealing with non-normality: use a rank-based method such as the WMW test, the Kruskal-Wallis test, and Friedman's method. The first thing to stress is that under general conditions these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods
For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

Even when there are tied values it is nowpossible to get excellent control over the prob-ability of a Type I error when using the usualsample median For the one-sample case thiscan be done with using a distribution free tech-nique via the R function sintv2 Distributionfree means that the actual Type I error proba-bility can be determined exactly assuming ran-dom sampling only When comparing two ormore groups currently the only known tech-nique that performs well is the percentile boot-strap Methods based on estimates of the stan-dard error can perform poorly even with largesample sizes Also when there are tied valuesthe distribution of the sample median does notnecessarily converge to a normal distributionas the sample size increases The very presence

BehavioralNeuroscience

84213

Current Protocols in Neuroscience Supplement 82

of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION
This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.
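For instance (the install_github call follows the instructions on the GitHub page and is an assumption in the sense that the exact call may change over time):

# option 1: install the WRS package from GitHub (requires devtools)
# devtools::install_github("nicebread/WRS", subdir = "pkg")
# option 2: source the collection of functions directly after downloading it
source("Rallfun-v33.txt")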

All recommended methods deal with heteroscedasticity. When comparing groups and distributions that differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes
This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory in terms of Type I errors when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample with replacement n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample with replacement m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y. Let D* = M*x − M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
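The following base R sketch conveys the idea behind medpb2, assuming x and y hold the two samples (hypothetical names); the function itself handles the details more carefully:

nboot <- 2000
dstar <- replicate(nboot,
    median(sample(x, replace = TRUE)) - median(sample(y, replace = TRUE)))
phat <- mean(dstar < 0) + 0.5 * mean(dstar == 0)
2 * min(phat, 1 - phat)           # p-value
quantile(dstar, c(0.025, 0.975))  # 0.95 confidence interval for the difference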

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with the relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method used in conjunction with the median continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0), and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
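In R:

x <- c(1, 2, 3)
y <- c(1, 2.5, 4)
D <- outer(x, y, "-")  # all pairwise differences Xi - Yk
mean(D < 0)            # 5/9, the estimate of P(X < Y)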

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5      (Equation 8.4.2.3)

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0      (Equation 8.4.2.4)

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.4.2.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullen, 2012). (We are not aware of any commercial software package that contains this method.)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.
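
For instance, a sketch assuming the Rallfun functions are available and reusing the hypothetical g1 and g2 from above:

```r
## Compare the deciles of two independent groups using a percentile
## bootstrap in conjunction with the Harrell-Davis quantile estimator
qcomhd(g1, g2, q = seq(0.1, 0.9, by = 0.1))
```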

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0

Equation 8.42.5

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are at least 10. For q = 0.1, sample sizes of at least 20 should be used (Wilcox, 2012).
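
A sketch of this test, under the same assumptions as the previous examples:

```r
## Test H0: xq + x(1-q) = 0 for the quartiles of D = X - Y;
## q = 0.25 requires sample sizes of at least 10
cbmhd(g1, g2, q = 0.25)
```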

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM as well as related estimators. The R function pb2gen applies this method.
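
A sketch, again assuming Rallfun has been sourced (mom is its modified one-step M-estimator) and reusing the hypothetical g1 and g2:

```r
## Percentile bootstrap comparison of two independent groups,
## with MOM as the measure of location
pb2gen(g1, g2, est = mom)
```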

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD: the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And in general, it


is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method, via the R function trimci, but it is unclear just how large the sample size must be.
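
A sketch for paired data, assuming Rallfun has been sourced; x and y are hypothetical paired vectors:

```r
## Hypothetical paired data
set.seed(2)
x <- rnorm(15)
y <- x + rnorm(15, mean = 0.3)   # correlated with x

trimpb(x - y)   # percentile bootstrap for the 20% trimmed mean of D
trimci(x - y)   # Tukey-McLaughlin method, an option for larger samples
```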

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmedpb tests the first of these hypotheses and performs relatively well even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication of just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.
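
For instance, reusing the hypothetical paired vectors x and y from the sketch above:

```r
## Sign test for paired data
signt(x, y)              # Agresti-Coull method (the default)
signt(x, y, SD = TRUE)   # Schilling-Doi method: coverage at least 1 - alpha
```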

A criticism of the sign test is that its power might be lower than the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures, such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
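
A sketch for a one-way design, assuming Rallfun has been sourced; the third group is hypothetical, and the groups are stored in a list:

```r
## All pairwise comparisons of 20% trimmed means for three groups
groups <- list(g1, g2, rlnorm(12) + 1)
lincon(groups)     # non-bootstrap method
linconpb(groups)   # percentile bootstrap version
```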

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2

Equation 8.42.6

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.4.2.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular when there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.

As previously noted, Pearson's correlation is not robust: even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.
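
A sketch of these three options, assuming Rallfun has been sourced; the data for the two hypothetical groups are generated here:

```r
## Hypothetical data: a strong association in group 1, a weak one in group 2
set.seed(3)
x1 <- rnorm(30); y1 <- x1 + rnorm(30)
x2 <- rnorm(30); y2 <- 0.2 * x2 + rnorm(30)

twopcor(x1, y1, x2, y2)    # modified bootstrap for Pearson's correlation
twohc4cor(x1, y1, x2, y2)  # HC4-based test of H0: rho1 = rho2
twocor(x1, y1, x2, y2)     # bootstrap comparison based on a robust correlation
```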

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be


used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in their section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.
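
A sketch, assuming Rallfun has been sourced; x is a hypothetical n-by-2 matrix of independent variables and y the dependent variable:

```r
## Overlapping case: compare the correlation of Y with X1
## to the correlation of Y with X2
set.seed(4)
x <- matrix(rnorm(60), ncol = 2)
y <- x[, 1] + 0.3 * x[, 2] + rnorm(30)

TWOpov(x, y)     # based on Pearson's correlation
twoDcorR(x, y)   # robust analog
```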

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.
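
Continuing the sketch above, with both independent variables in the model:

```r
## Test whether X1 and X2 are equally important when both are included;
## uses a robust regression estimator by default and allows heteroscedasticity
regIVcom(x, y)
```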

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these

assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods, described in chapter 12 of Wilcox (2017a), that deal with violations of all of these assumptions. These newer techniques can substantially increase power compared to the classic ANCOVA technique, and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
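
A sketch of this workflow, assuming Rallfun has been sourced and reusing the hypothetical x1, y1, x2, y2 from Section 4.6:

```r
## Partial check on curvature for two groups, with outliers among
## the independent variable removed (xout = TRUE)
lplot2g(x1, y1, x2, y2, xout = TRUE)  # Cleveland's smoother
rplot2g(x1, y1, x2, y2, xout = TRUE)  # running-interval smoother

## Robust ANCOVA that does not assume parallel regression lines;
## ancdet would provide more detail about where the lines differ
ancova(x1, y1, x2, y2)
```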

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


Figure 8.4.2.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.4.2.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back,


and thigh. Figure 8.4.2.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F test rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
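
Hochberg's adjustment is also available in base R via p.adjust; the following sketch uses a hypothetical vector of 15 unadjusted p-values, not the values from Mancini et al. (2014):

```r
## Hochberg's improvement on the Bonferroni method
pvals <- c(0.001, 0.003, 0.004, 0.008, 0.012, 0.015, 0.020, 0.024,
           0.031, 0.036, 0.044, 0.048, 0.050, 0.055, 0.062)
p.adjust(pvals, method = "hochberg") <= 0.05   # which hypotheses are rejected
```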

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.4.2.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.4.2.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations, shown in column B of Figure 8.4.2.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.4.2.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles, excluding the 0.1 quantile. As can be seen from the last row in Figure 8.4.2.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.4.2.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group, which received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.4.2.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by Bonferroni post hoc tests. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, the reverse decision is made.

Here, the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.4.2.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala. The y-axis gives the number of GAD-67-positive cells; the conditions are sham, day 1, and day 7.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator, mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures based on the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values less than or equal to 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression, it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here, we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.4.2.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).


Figure 8.4.2.9 Nonparametric estimates of regression lines: (A) CSTL as a function of age; (B) GORTFL as a function of age; (C) GORTFL as a function of CSTL; (D) CSTL as a function of GORTFL.

Figure 8.4.2.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederholt & Bryant, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.4.2.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.4.2.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.4.2.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, there now appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields


p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates the point made earlier that the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data, where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say less than or equal to 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical


difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ, or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means, and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.4.2.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data, where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).

If the goal is to compare dependent groups based on difference scores, and the sample sizes are very small, say less than or equal to 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals (a brief sketch of these options follows the list):

Use the sign test via the R function signt. If the sample sizes are less than or equal to 10, and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command


trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes greater than 10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.
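
Here is the promised sketch of these options, assuming Rallfun has been sourced; x and y are hypothetical paired vectors, and the exact argument conventions for lband may vary across Rallfun versions:

```r
## Hypothetical paired data with a very small sample size
set.seed(5)
x <- rnorm(8)
y <- x + rnorm(8, mean = 0.5)

signt(x, y, SD = TRUE)   # sign test, guaranteed coverage for small n
sintv2(x - y)            # median of the difference scores
trimcipb(x - y)          # 20% trimmed mean of the difference scores
lband(x, y)              # compare the quantiles of the marginal distributions
```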

If the goal is to compare two independent groups and the sample sizes are less than or equal to 10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second, using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be greater than 0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the


illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression, where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when, in fact, there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119-126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719-724. doi: 10.2307/2529724

Bessel, F. W. (1818). Fundamenta astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Konigsberg: Friedrich Nicolovius.

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26-42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144-152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129-132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484-507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829-836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131-148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265-282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421-434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411-417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551-558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487-1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101-134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635-640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence intervals based on interpolated order statistics. Statistics & Probability Letters, 4, 75-79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383-386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347-352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32-61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917-924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543-557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105-1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics - Simulation and Computation, 42, 407-424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85-93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203-211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686-692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647-2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738-1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19-30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201-223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133-145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556-2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray oral reading test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III tests of achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


masking, simply meaning that the very presence of outliers causes them to be missed (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2017a).

Consider, for example, the values 1, 2, 2, 3, 4, 6, 100, and 100. The last two observations appear to be clear outliers, yet the rule given by Equation 8.42.1 fails to flag them as such. The reason is simple: the standard deviation of the sample is very large, at almost 45, because it is not robust to outliers.

This is not to suggest that all outliers will be missed; this is not necessarily the case. The point is that multiple outliers might be missed that adversely affect any conventional method that might be used to compare means. Much more effective are the box plot rule and the so-called MAD-median rule, based on the median absolute deviation to the median.

The box plot rule is applied as follows. Let q1 and q2 be estimates of the lower and upper quartiles, respectively. Then the value X is declared an outlier if X < q1 − 1.5(q2 − q1) or if X > q2 + 1.5(q2 − q1). As for the MAD-median rule, let X1, ..., Xn denote a random sample and let M be the usual sample median. MAD is the median of the values |X1 − M|, ..., |Xn − M|. The MAD-median rule declares the value X an outlier if

|X − M| / (MAD/0.6745) > 2.24

Equation 8.42.2

Under normality, it can be shown that MAD/0.6745 estimates the standard deviation, and of course M estimates the population mean. So the MAD-median rule is similar to using Equation 8.42.1, only rather than using a two-standard-deviation rule, 2.24 is used instead.

As an illustration, consider the values 1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, and 71.53. The MAD-median rule detects the outliers: 71.53. But the rule given by Equation 8.42.1 does not. The MAD-median rule is better than the box plot rule in terms of avoiding masking. If the proportion of values that are outliers is 25%, masking can occur when using the box plot rule. Here, for example, the box plot rule does not flag the value 71.53 as an outlier, in contrast to the MAD-median rule.
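In base R, the two rules are easily checked directly. The following sketch uses the values above; note that Wilcox's software estimates the quartiles for the box plot rule with the ideal fourths, so R's default quantile estimator can give a different box plot verdict on this example.

x <- c(1.85, 1.11, 1.11, 0.37, 0.37, 1.85, 71.53, 71.53)
# Two-standard-deviation rule (Equation 8.42.1): fails to flag 71.53
abs(x - mean(x)) / sd(x) > 2
# MAD-median rule (Equation 8.42.2): flags both occurrences of 71.53
M <- median(x)
MAD <- mad(x, constant = 1)  # raw MAD; mad()'s default constant is 1/0.6745
abs(x - M) / (MAD / 0.6745) > 2.24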

The second mistake is discarding outliers and applying some standard method for comparing means using the remaining data. This results in an incorrect estimate of the standard error, regardless of how large the sample size might be. That is, an invalid test statistic is being used. Roughly, it can be shown that the remaining data are dependent: they are correlated, which invalidates the derivation of the standard error. Of course, if an argument can be made that a value is invalid, discarding it is reasonable and does not lead to technical issues. For instance, a straightforward case can be made if a measurement is outside physiological bounds, or if it follows a biologically non-plausible pattern over time, such as during an electrophysiological recording. But otherwise the estimate of the standard error can be off by a factor of 2 (e.g., Wilcox, 2017b), which is a serious practical issue. A simple way of dealing with this issue, when using a 20% trimmed mean or median, is to use a percentile bootstrap method. (With reasonably large sample sizes, alternatives to the percentile bootstrap method can be used. They use correct estimates of the standard error and are described in Wilcox, 2017a, 2017c.) The main point here is that these methods are readily applied with the free software R, which is playing an increasing role in basic training. Some illustrations are given in Section 5.

It is noted that, when dealing with regression, outliers among the independent variables can be removed when testing hypotheses. But if outliers among the dependent variable are removed, conventional hypothesis testing techniques based on the least squares estimator are no longer valid, even when there is homoscedasticity. Again, the issue is that an incorrect estimate of the standard error is being used. When using robust regression estimators that deal with outliers among the dependent variable, again a percentile bootstrap method can be used to test hypotheses. Complete details are summarized in Wilcox (2017a, 2017c). There are numerous regression estimators that effectively deal with outliers among the dependent variable, but a brief summary of the many details is impossible. The Theil-Sen estimator as well as the MM-estimator are relatively good choices, but arguments can be made that alternative estimators deserve serious consideration.

3.3 Transform the Data

A common strategy for dealing with non-normality or heteroscedasticity is to transform the data. There are exceptions, but generally this approach is unsatisfactory for several reasons. First, the transformed data can again be skewed to the point that classic techniques perform poorly (e.g., Wilcox, 2017b). Second, this simple approach does not deal with outliers in a satisfactory manner (e.g., Doksum & Wong, 1983; Rasmussen, 1989). The number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that, under general conditions, these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ, and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So, again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution-free technique via the R function sintv2. Distribution-free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases. The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4 COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.
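For example, the calls illustrated in the remainder of this unit assume the functions have been made available in one of these two ways; the exact installation command for the GitHub version may change, so treat the first option below as a sketch and consult the instructions posted there.

# Option 1: install the WRS package from GitHub
# install.packages('devtools')
# devtools::install_github('nicebread/WRS', subdir = 'pkg')
# Option 2: source the collection of functions directly
source('Rallfun-v33.txt')  # after downloading the file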

All recommended methods deal with heteroscedasticity. When comparing groups whose distributions differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory, in terms of Type I errors, when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value, as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample with replacement n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample with replacement m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y. Let D* = M*x − M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with a relatively high risk of poor control over the Type I error probability and relatively poor power.

Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
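For instance, with hypothetical data vectors x and y, and argument names following the conventions in Wilcox's software:

yuen(x, y, tr = 0.2)  # Yuen's method, 20% trimming
yuen(x, y, tr = 0)    # no trimming: reduces to Welch's method for means
medpb2(x, y)          # percentile bootstrap for medians, small samples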

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0), and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
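This estimate is easily reproduced in base R; for the example just given, the code below returns 5/9.

x <- c(1, 2, 3)
y <- c(1, 2.5, 4)
D <- as.vector(outer(x, y, '-'))  # all pairwise differences Xi - Yk
mean(D < 0)                       # estimate of P(X < Y)
median(D)                         # estimate of the median of D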

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5

Equation 8.42.3

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0

Equation 8.42.4

Cetainly P(X lt Y) has practical importancefor reasons summarized among others byCliff (1996) Ruscio (2008) and Newcombe(2006) The WMW test is based on an esti-mate P(X lt Y) However it is unsatisfactoryin terms of making inferences about this prob-ability because the estimate of the standarderror assumes that the distributions are iden-tical If the distributions differ an incorrectestimate of the standard error is being usedMore modern methods deal with this issueThe method derived by Cliff (1996) for testingEquation 8423 performs relatively well withsmall sample sizes and can be applied via the Rfunction cidv2 (compare to Ruscio amp Mullens2012) (We are not aware of any commercialsoftware package that contains this method)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.

Note that, if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then, for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0

Equation 8.42.5

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are at least 10. For q = 0.1, sample sizes of at least 20 should be used (Wilcox, 2012).
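With hypothetical vectors x and y, the calls take the following form (argument names follow Wilcox's software and may differ across versions):

qcomhd(x, y, q = seq(0.1, 0.9, by = 0.1))  # compare the deciles
cbmhd(x, y, q = 0.25)                      # test Equation 8.42.5 for the quartiles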

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM, as well as related estimators. The R function pb2gen applies this method.
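A sketch of the call, again with hypothetical x and y (the argument est names the estimator applied to each bootstrap sample):

pb2gen(x, y, est = mom)     # percentile bootstrap using MOM
pb2gen(x, y, est = median)  # same bootstrap logic with the median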

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD: the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And in general it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.
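With paired measurements in hypothetical vectors x and y, these methods amount to one-sample analyses of the difference scores:

D <- x - y           # difference scores
trimpb(D, tr = 0.2)  # percentile bootstrap, 20% trimmed mean of D
trimci(D, tr = 0.2)  # Tukey-McLaughlin method, larger samples
sintv2(D)            # distribution-free inference about the median of D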

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist, without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue, in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case: it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
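For instance, with three groups stored in a list (hypothetical data vectors g1, g2, g3), the calls take the form:

g <- list(g1, g2, g3)  # one component per group
lincon(g, tr = 0.2)    # all pairwise comparisons, 20% trimmed means
linconpb(g, tr = 0.2)  # percentile bootstrap version
medpb(g)               # all pairwise comparisons based on medians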

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2

Equation 8.42.6

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association, because independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.

As previously noted, Pearson's correlation is not robust: even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.
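Assuming predictor and outcome vectors x1, y1 for the first group and x2, y2 for the second (hypothetical names), the calls take the form:

twohc4cor(x1, y1, x2, y2)  # compare Pearson correlations via HC4
twocor(x1, y1, x2, y2)     # compare robust correlations, percentile bootstrap
ols2ci(x1, y1, x2, y2)     # compare least squares slopes
reg2ci(x1, y1, x2, y2)     # compare robust (Theil-Sen, by default) slopes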

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for when there is curvature (Wilcox, in press), but it is limited to two independent variables.
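With the two independent variables stored as the columns of a matrix x and the outcome in a vector y (hypothetical names), the calls take the form:

TWOpov(x, y)                  # compare the two overlapping Pearson correlations
twoDcorR(x, y)                # robust-correlation version
regIVcom(x, y)                # both IVs in the model, Theil-Sen by default
regIVcom(x, y, regfun = ols)  # least squares version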

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions, described in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique, and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ, compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
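With group 1 data in x1, y1 and group 2 data in x2, y2 (hypothetical names), a partial check on curvature followed by a robust ANCOVA might look like:

lplot2g(x1, y1, x2, y2, xout = TRUE)  # Cleveland smoothers, outliers among x removed
rplot2g(x1, y1, x2, y2, xout = TRUE)  # running interval smoothers
ancova(x1, y1, x2, y2)                # robust ANCOVA; non-parallel lines allowed
ancdet(x1, y1, x2, y2)                # method TAP: more detailed comparison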

5 SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
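Hochberg's method itself requires no special software: given the 15 p-values in a vector pvals (hypothetical name), base R adjusts them directly.

p.adjust(pvals, method = 'hochberg') <= 0.05  # TRUE marks rejected hypotheses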

It is noted that, in various situations when using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex, using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations, shown in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions, and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians (the largest p-value is p = 0.005). In contrast, qcomhd indicates a significant difference for all of the deciles, excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures taken 1 and 7 days after surgery were compared to a sham-treated control group that received a craniotomy, but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test, followed by Bonferroni post hoc tests. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, the reverse decision is made.

Here, the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.

Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala. For each side, the conditions are Sham, Day 1, and Day 7; the y-axis gives the number of GAD-67-positive cells in the basolateral amygdala.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator, mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother, as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable.


[Figure 8.42.9 Nonparametric estimates of the regression lines. (A) Age predicting CSTL; (B) age predicting GORTFL; (C) CSTL predicting GORTFL; (D) GORTFL predicting CSTL.]

Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL). Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend approximately at GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27).


Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
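A hedged sketch of this two-part analysis, assuming gortfl and cstl are the (hypothetical) data vectors and that regci and reg2ci accept the predictor and outcome in that order:

    low  <- gortfl < 70
    high <- gortfl > 70
    regci(gortfl[low], cstl[low])    # Theil-Sen slope for GORTFL < 70
    regci(gortfl[high], cstl[high])  # Theil-Sen slope for GORTFL > 70
    reg2ci(gortfl[low], cstl[low],   # test whether the two slopes are equal
           gortfl[high], cstl[high])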

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice for a. However, for the situation at hand, the half slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important via the R function regIVcom indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012.

So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.
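A sketch of the relative-importance analysis, assuming age, gort_rate, and wj_attack are the relevant (hypothetical) vectors and that regIVcom takes a predictor matrix followed by the outcome:

    x <- cbind(age, gort_rate)
    regIVcom(x, wj_attack)                # Theil-Sen estimator by default
    regIVcom(x, wj_attack, regfun = ols)  # least squares version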

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power.


Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals (a brief R sketch appears after the list):

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes > 10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.
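A brief sketch of the options above for paired vectors x and y, assuming each function accepts the data as shown (argument details should be checked against the function documentation):

    signt(x, y)      # sign test; set SD = TRUE for sample sizes <= 10
    sintv2(x - y)    # median of the difference scores
    trimcipb(x - y)  # 20% trimmed mean of the difference scores
    lband(x, y)      # compare quantiles of the marginal distributions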

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors (see the sketch after this list):

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
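The corresponding sketch for two independent groups, with data in the vectors x and y:

    medpb2(x, y)   # medians via a percentile bootstrap
    sband(x, y)    # compare all quantiles
    cidv2(x, y)    # inferences about P(X < Y)
    trimpb2(x, y)  # 20% trimmed means via a percentile bootstrap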

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if for example five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be > 0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
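For example, Hochberg's and Hommel's adjustments are available in base R (the p-values below are illustrative only):

    pvals <- c(0.0375, 0.001, 0.049, 0.031, 0.086)
    p.adjust(pvals, method = "hochberg")
    p.adjust(pvals, method = "hommel")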

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed.


Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Konigsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistical Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics-Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.),


Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics-Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


number of outliers might decline, but it can remain the same and even increase. Currently, a much more satisfactory strategy is to use a modern robust method, such as a bootstrap method in conjunction with a 20% trimmed mean or median. This approach also deals with heteroscedasticity in a very effective manner (e.g., Wilcox, 2017a, 2017c).

Another concern is that a transformation changes the hypothesis being tested. In effect, transformations muddy the interpretation of any comparison, because a transformation of the data also transforms the construct that it measures (Grayson, 2004).

3.4 Use a Rank-Based Method

Standard training suggests a simple way of dealing with non-normality: use a rank-based method, such as the WMW test, the Kruskal-Wallis test, or Friedman's method. The first thing to stress is that under general conditions these methods are not designed to compare medians or other measures of central tendency. (For an illustration based on the Wilcoxon signed rank test, see Wilcox, 2017b, p. 367. Also see Fagerland & Sandvik, 2009.) Moreover, the derivation of these methods is based on the assumption that the groups have identical distributions. So, in particular, homoscedasticity is assumed. In practical terms, if they reject, conclude that the distributions differ.

But to get a more detailed understanding of how groups differ and by how much, alternative inferential techniques should be used in conjunction with plots such as those summarized by Rousselet et al. (2017). For example, use methods based on a trimmed mean or median.

Many improved rank-based methods have been derived (Brunner, Domhof, & Langer, 2002). But again, these methods are aimed at testing the hypothesis that groups have identical distributions. Important exceptions are the improvements on the WMW test (e.g., Cliff, 1996; Wilcox, 2017a, 2017b), which, as previously noted, are aimed at making inferences about the probability that a random observation from the first group is less than a random observation from the second.

3.5 Permutation Methods

For completeness, it is noted that permutation methods have received some attention in the neuroscience literature (e.g., Pernet, Latinus, Nichols, & Rousselet, 2015; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Briefly, this approach is well designed to test the hypothesis that two groups have identical distributions. But based on results reported by Boik (1987), this approach cannot be recommended when comparing means. The same is true when comparing medians, for reasons summarized by Romano (1990). Chung and Romano (2013) summarize general theoretical concerns and limitations. They go on to suggest a modification of the standard permutation method, but at least in some situations the method is unsatisfactory (Wilcox, 2017c, section 7.7). A deep understanding of when this modification performs well is in need of further study.

3.6 More Comments About the Median

In terms of power, the mean is preferable over the median or 20% trimmed mean when dealing with symmetric distributions for which outliers are rare. If the distributions are skewed, the median and 20% trimmed mean can better reflect what is typical, and improved control over the Type I error probability can be achieved. When outliers occur, there is the possibility that the mean will have a much larger standard error than the median or 20% trimmed mean. Consequently, methods based on the mean might have relatively poor power. Note, however, that for skewed distributions the difference between two means might be larger than the difference between the corresponding medians. As a result, even when outliers are common, it is possible that a method based on the means will have more power. In terms of maximizing power, a crude rule is to use a 20% trimmed mean, but the seemingly more important point is that no method dominates. Focusing on a single measure of central tendency might result in missing an important difference. So, again, exploratory studies can be vitally important.

Even when there are tied values, it is now possible to get excellent control over the probability of a Type I error when using the usual sample median. For the one-sample case, this can be done using a distribution free technique via the R function sintv2. Distribution free means that the actual Type I error probability can be determined exactly, assuming random sampling only. When comparing two or more groups, currently the only known technique that performs well is the percentile bootstrap. Methods based on estimates of the standard error can perform poorly, even with large sample sizes. Also, when there are tied values, the distribution of the sample median does not necessarily converge to a normal distribution as the sample size increases.


The very presence of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4. COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software/.
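For instance, after downloading Rallfun-v33.txt into the working directory, a single command makes all of the functions used in this unit available (the devtools route shown in the comment is an alternative; the subdir argument follows the instructions at the repository and should be checked there):

    source("Rallfun-v33.txt")
    # or, alternatively:
    # devtools::install_github("nicebread/WRS", subdir = "pkg")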

All recommended methods deal with heteroscedasticity. When comparing groups and distributions that differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory in terms of Type I errors when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx − My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample with replacement n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample with replacement m observations. Next, based on these bootstrap samples, compute the sample medians, say Mx* and My*. Let D* = Mx* − My*. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
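For readers who want to see the mechanics, here is a minimal, self-contained R version of this bootstrap; medpb2 implements a refined version of the same idea:

    boot.median.diff <- function(x, y, nboot = 2000) {
      # bootstrap distribution of the difference between sample medians
      dstar <- replicate(nboot,
        median(sample(x, length(x), replace = TRUE)) -
        median(sample(y, length(y), replace = TRUE)))
      # generalized p-value: proportion of D* < 0 plus half the ties at zero
      pstar <- mean(dstar < 0) + 0.5 * mean(dstar == 0)
      list(ci = quantile(dstar, c(0.025, 0.975)),  # 95% confidence interval
           p.value = 2 * min(pstar, 1 - pstar))    # two-sided p-value
    }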

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with a relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method used in conjunction with the median continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0) and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
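In R, this estimate takes one line using base functions only:

    x <- c(1, 2, 3)
    y <- c(1, 2.5, 4)
    d <- outer(x, y, "-")  # the pairwise differences Dik = Xi - Yk
    mean(d < 0)            # estimate of P(X < Y): 5/9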

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations,

for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5

Equation 8.42.3

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0

Equation 8.42.4

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullen, 2012). (We are not aware of any commercial software package that contains this method.)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then, for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0

Equation 8.42.5

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥ 10. For q = 0.1, sample sizes ≥ 20 should be used (Wilcox, 2012).

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down weights or eliminates outliers. For example, use the MAD-median to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median, because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM as well as related estimators. The R function pb2gen applies this method.

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD: the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method is recommended, which was derived by Hettmansperger and Sheather (1986). When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error.


And in general it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.

In some situations there might be interest in comparing measures of central tendency associated with the marginal distributions, rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.
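A sketch distinguishing the two hypotheses for paired vectors x and y (the call to dmeppb assumes it accepts the two vectors in this way):

    wmcppb(x, y)               # 20% trimmed mean of the difference scores
    wmcppb(x, y, dif = FALSE)  # compare the marginal trimmed means instead
    dmeppb(x, y)               # marginal medians, allowing tied values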

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt.

In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.

A criticism of the sign test is that its power might be lower than the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest, using for instance the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups, where for each group there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2

Equation 8.42.6

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.

As previously noted, Pearson's correlation is not robust: even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, now a basic percentile bootstrap method can be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.
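A sketch for two independent groups with data (x1, y1) and (x2, y2), assuming the four vectors are passed in this order:

    twohc4cor(x1, y1, x2, y2)  # Pearson correlations via the HC4 estimator
    twopcor(x1, y1, x2, y2)    # Pearson correlations, modified bootstrap
    twocor(x1, y1, x2, y2)     # robust correlations (percentile bootstrap)
    reg2ci(x1, y1, x2, y2)     # compare regression slopes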

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011).

A Guide to RobustStatisticalMethods

84218

Supplement 82 Current Protocols in Neuroscience

used to conclude that the associations them-selves differ A common example would bewhen an association is estimated between eachof two brain measurements and a behavioraloutcome

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.
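A sketch of the overlapping case, with simulated stand-ins for two brain measures (the columns of X) and one behavioral outcome; the matrix-plus-vector argument layout is an assumption based on how these functions are described in Wilcox (2017a):

  set.seed(2)
  X <- matrix(rnorm(200), ncol = 2)               # columns play the role of X1 and X2
  Y <- 0.8 * X[, 1] + 0.2 * X[, 2] + rnorm(100)   # outcome for the same participants

  TWOpov(X, Y)     # compare the two overlapping Pearson correlations
  twoDcorR(X, Y)   # robust analog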

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.
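Continuing the sketch, with both predictors entered into the model:

  regIVcom(X, Y)   # H0: X1 and X2 are equally important; Theil-Sen estimator by default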

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern; violating two or more of them makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique, and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. The latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
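A sketch of these checks on simulated groups, where the second group is deliberately given a curvilinear association:

  set.seed(3)
  x1 <- rnorm(50); y1 <- x1 + rnorm(50)     # group 1: straight line
  x2 <- rnorm(50); y2 <- x2^2 + rnorm(50)   # group 2: curvature

  lplot2g(x1, y1, x2, y2, xout = TRUE)  # Cleveland smoothers, outliers among x removed
  rplot2g(x1, y1, x2, y2, xout = TRUE)  # running-interval smoothers
  ancova(x1, y1, x2, y2)                # compare the lines; parallelism not assumed
  ancdet(x1, y1, x2, y2)                # method TAP: more detail on where the lines differ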

5 SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [-0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [-0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [-0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
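For illustration, a sketch of these three analyses; it assumes, as the descriptions in Wilcox (2017a) suggest, that the data are supplied as an n-by-J matrix (here a simulated stand-in for the 10-by-6 matrix of thresholds):

  set.seed(4)
  m <- matrix(rnorm(60), ncol = 6)   # 10 participants by 6 body sites

  wmcp(m, tr = 0)      # all pairwise comparisons of means
  dmedpb(m)            # medians, percentile bootstrap
  wmcppb(m, tr = 0.2)  # 20% trimmed means, percentile bootstrap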

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex, using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.
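A minimal sketch of such a decile comparison on simulated data; the q argument is an assumption based on the description of qcomhd in Wilcox (2017a):

  set.seed(5)
  g1 <- rnorm(100)    # simulated stand-ins for two independent groups
  g2 <- rlnorm(100)   # skewed second group

  qcomhd(g1, g2, q = seq(0.1, 0.9, by = 0.1))  # compare all nine deciles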

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations, in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions, and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group, which received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, the reverse decision is made.

Here the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8 Box plots of the number of GAD-67-positive cells in the basolateral amygdala for the contralateral (contra) and ipsilateral (ipsi) sides, for the sham, day 1, and day 7 conditions.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.
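A sketch of these two analyses with simulated stand-ins for the cell counts; hd denotes the Harrell-Davis estimator supplied with the same collection of functions:

  set.seed(6)
  sham <- rnorm(13, 6000, 800)   # simulated, not the actual counts
  day1 <- rnorm(9, 5200, 800)

  pb2gen(sham, day1, est = hd)   # medians via the Harrell-Davis estimator
  wmwpb(sham, day1)              # H0: the median of all pairwise differences is zero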

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures based on the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable.


Figure 8.42.9 Nonparametric estimates of regression lines: (A) age predicting CSTL; (B) age predicting GORTFL; (C) CSTL predicting GORTFL; (D) GORTFL predicting CSTL.

Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL). Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederholt & Bryant, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)
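A sketch of this exploratory sequence; cstl and gortfl are hypothetical, simulated stand-ins for the study variables, built here with a bend at GORTFL = 70:

  set.seed(7)
  gortfl <- runif(53, 30, 110)
  cstl <- 0.44 + 0.002 * pmin(gortfl, 70) + rnorm(53, 0, 0.01)

  lplot(cstl, gortfl)               # Cleveland's smoother (conditional means)
  rplot(cstl, gortfl)               # running-interval smoother (20% trimmed means)
  rplot(cstl, gortfl, xout = TRUE)  # same, removing outliers among the predictor first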

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, there now appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci), p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
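A sketch of the segment-wise analysis, reusing the simulated variables from the previous block:

  low <- gortfl < 70
  regci(gortfl[low], cstl[low])     # Theil-Sen slope below the bend
  regci(gortfl[!low], cstl[!low])   # slope above the bend
  reg2ci(gortfl[low], cstl[low],    # H0: the two slopes are equal
         gortfl[!low], cstl[!low])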

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.
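A sketch with hypothetical, simulated stand-ins for age, the GORT rate score, and the WJ word attack score:

  set.seed(8)
  age <- runif(53, 6, 16)
  gort_rate <- 3 * age + rnorm(53, 0, 5)
  wj_word_attack <- 2 * gort_rate + rnorm(53, 0, 8)

  regIVcom(cbind(age, gort_rate), wj_word_attack)                # Theil-Sen (default)
  regIVcom(cbind(age, gort_rate), wj_word_attack, regfun = ols)  # least squares instead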

6 A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data, where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals (a consolidated sketch follows the list).

Use the sign test via the R function signt. If the sample sizes are ≤10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores, using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores, using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.
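A consolidated sketch of the four options on simulated paired data; the exact argument layouts are assumptions based on the descriptions above:

  set.seed(9)
  x <- rnorm(10); y <- x + rnorm(10) + 0.5   # paired measurements

  signt(x, y)       # sign test (set SD = TRUE for sample sizes <= 10, as noted above)
  sintv2(x - y)     # CI for the median of the difference scores
  trimcipb(x - y)   # 20% trimmed mean of the difference scores, percentile bootstrap
  lband(x, y)       # compare quantiles of the marginal distributions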

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤10, here is a list of some methods that perform relatively well in terms of Type I errors (a sketch follows the list).

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second, using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
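A consolidated sketch on simulated data:

  set.seed(10)
  g1 <- rnorm(10); g2 <- rnorm(10, 0.8)   # two independent groups

  medpb2(g1, g2)    # medians, percentile bootstrap
  cidv2(g1, g2)     # Cliff's method for P(X < Y)
  trimpb2(g1, g2)   # 20% trimmed means, percentile bootstrap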

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7 CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
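These adjustments are available in base R. For example, given the p-values from five tests:

  pvals <- c(0.001, 0.010, 0.020, 0.040, 0.300)
  p.adjust(pvals, method = "hochberg")  # Hochberg (1988)
  p.adjust(pvals, method = "hommel")    # Hommel (1988)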

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that those results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference; rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.

Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence intervals based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics-Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederholt, J. L., & Bryant, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics-Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III tests of achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


of tied values is not necessarily disastrous. But it is unclear how many tied values can be accommodated before disaster strikes. The percentile bootstrap method eliminates this concern.

4 COMPARING GROUPS AND MEASURES OF ASSOCIATION

This section elaborates on methods aimed at comparing groups and measures of association. First, attention is focused on two independent groups. Comparing dependent groups is discussed in Section 4.4. Section 4.5 comments briefly on more complex designs. This is followed by a description of how to compare measures of association, as well as an indication of modern advances related to the analysis of covariance. Included are indications of how to apply these methods using the free software R, which at the moment is easily the best software for applying modern methods. R is a vast and powerful software package. Certainly MATLAB could be used, but this would require writing hundreds of functions in order to compete with R. There are numerous books on R, but only a relatively small subset of the basic commands is needed to apply the functions described here (e.g., see Wilcox, 2017b, 2017c).

The R functions noted here are stored in the R package WRS, which can be installed as indicated at https://github.com/nicebread/WRS. Alternatively, and seemingly easier, use the R command source on the file Rallfun-v33.txt, which can be downloaded from https://dornsife.usc.edu/labs/rwilcox/software.
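For example, assuming the file has been downloaded to the current working directory:

  source("Rallfun-v33.txt")   # makes the functions used in this unit available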

All recommended methods deal with heteroscedasticity. When comparing groups and distributions that differ in shape, these methods are generally better than classic methods for comparing means, which can perform poorly.

4.1 Dealing with Small Sample Sizes

This section focuses on the common situation in neuroscience where the sample sizes are relatively small. When the sample sizes are very small, say less than or equal to ten and greater than four, conventional methods based on means are satisfactory in terms of Type I errors when the null hypothesis is that the groups have identical distributions. If the goal is to control the probability of a Type I error when the null hypothesis is that groups have equal means, extant methods can be unsatisfactory. And as previously noted, methods based on means can have poor power relative to alternative techniques.

Many of the more effective methods are based in part on the percentile bootstrap method. Consider, for example, the goal of comparing the medians of two independent groups. Let Mx and My be the sample medians for the two groups being compared, and let D = Mx - My. The basic strategy is to perform a simulation based on the observed data, with the goal of approximating the distribution of D, which can then be used to compute a p-value as well as a confidence interval.

Let n and m denote the sample sizes for the first and second group, respectively. The percentile bootstrap method proceeds as follows. For the first group, randomly sample with replacement n observations. This yields what is generally called a bootstrap sample. For the second group, randomly sample with replacement m observations. Next, based on these bootstrap samples, compute the sample medians, say M*x and M*y. Let D* = M*x - M*y. Repeating this process many times, a p-value (and confidence interval) can be computed based on the proportion of times D* < 0, as well as the proportion of times D* = 0. (More precise details are given in Wilcox, 2017a, 2017c.) This method has been found to provide reasonably good control over the probability of a Type I error when both n and m are greater than or equal to five. The R function medpb2 performs this technique.
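The following bare-bones implementation in base R, a sketch rather than a substitute for medpb2, follows the steps just described:

  # Percentile bootstrap for the difference between two medians
  boot_med_diff <- function(x, y, nboot = 2000, alpha = 0.05) {
    dstar <- replicate(nboot, {
      median(sample(x, replace = TRUE)) - median(sample(y, replace = TRUE))
    })
    # Generalized p-value: proportion of D* below zero, with ties counted half
    pstar <- mean(dstar < 0) + 0.5 * mean(dstar == 0)
    list(conf.int = quantile(dstar, c(alpha / 2, 1 - alpha / 2)),
         p.value  = 2 * min(pstar, 1 - pstar))
  }

  set.seed(11)
  boot_med_diff(rnorm(8), rnorm(9) + 1)   # toy example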

The method just described performs very well compared to alternative techniques. In fact, regardless of how large the sample sizes might be, the percentile bootstrap method is the only known method that continues to perform reasonably well when comparing medians and when there are duplicated values (also see Wilcox, 2017a, section 5.3).

For the special case where the goal is to compare means, there is no method that provides reasonably accurate control over the Type I error probability for a relatively broad range of situations. In fairness, situations can be created where means perform well and indeed have higher power than methods based on a 20% trimmed mean or median. When dealing with perfectly symmetric distributions where outliers are unlikely to occur, methods based on means that allow heteroscedasticity perform relatively well. But with small sample sizes, there is no satisfactory diagnostic tool indicating whether distributions satisfy these two conditions in an adequate manner. Generally, using means comes with a relatively high risk of poor control over the Type I error probability and relatively poor power.


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method, used in conjunction with the median, continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
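For example, with simulated data:

  set.seed(12)
  g1 <- rnorm(12); g2 <- rnorm(12, 0.8)
  yuen(g1, g2, tr = 0.2)   # tr = 0.2 gives 20% trimming; tr = 0 reduces to Welch's method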

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X - Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi - Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, -1.5, -0.5, 0.5, -3.0, -2.0, -1.0), and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
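The computation for this small example, in base R:

  x <- c(1, 2, 3)
  y <- c(1, 2.5, 4)
  d <- as.vector(outer(x, y, "-"))   # all pairwise differences Xi - Yk
  mean(d < 0)                        # 5/9, the estimate of P(X < Y)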

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx - μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx - θy = θD, but otherwise it is generally the case that θx - θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5

Equation 8.42.3

If this hypothesis is rejected, this indicates that it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0

Equation 8.42.4

Certainly P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullen, 2012). (We are not aware of any commercial software package that contains this method.)

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low-scoring individuals.


First consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let x_q be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that x_q + x_{1−q} = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x_{0.25} = −4 and x_{0.75} = 6, then, for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: x_q + x_{1−q} = 0

Equation 8.42.5

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥ 10. For q = 0.1, sample sizes ≥ 20 should be used (Wilcox, 2012).
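A sketch of this test, under the same sourcing assumption as above (the name of the q argument is taken from the text; defaults may differ):

    set.seed(3)
    g1 <- rnorm(30)               # hypothetical data
    g2 <- rnorm(30, mean = 0.5)
    cbmhd(g1, g2, q = 0.25)       # tests H0: x_q + x_(1-q) = 0 and returns
                                  # a confidence interval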

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM as well as related estimators. The R function pb2gen applies this method.
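A minimal sketch (same sourcing assumption as above; the est argument name and the mom estimator follow Wilcox's conventions but are assumptions here):

    set.seed(4)
    g1 <- c(rnorm(20), 8, 9)     # hypothetical data containing two outliers
    g2 <- rnorm(20)
    pb2gen(g1, g2, est = mom)    # percentile bootstrap comparison based on
                                 # the modified one-step M-estimator (MOM)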

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD; the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method is recommended that was derived by Hettmansperger and Sheather (1986). When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And, in general, it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.


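As a sketch, with the same sourcing assumption as earlier and simulated paired data, sintv2 is applied to the difference scores:

    set.seed(5)
    x <- rnorm(15)                    # hypothetical paired measurements
    y <- x + rnorm(15, mean = 0.3)
    sintv2(x - y)     # inference about the median of the difference scores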

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.
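For example (hedged sketch; 20% trimming is the default in Wilcox's functions):

    set.seed(6)
    x <- rnorm(25)                    # hypothetical paired measurements
    y <- x + rnorm(25, mean = 0.3)
    trimpb(x - y)    # percentile bootstrap CI for the 20% trimmed mean
                     # of the difference scores
    trimci(x - y)    # Tukey-McLaughlin alternative for larger samples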

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.
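Sketch (same sourcing assumption; the calling convention shown, two paired vectors, is an assumption):

    set.seed(7)
    x <- rnorm(12)             # hypothetical paired measurements
    y <- x + rnorm(12)
    signt(x, y)                # sign test, Agresti-Coull method by default
    signt(x, y, SD = TRUE)     # Schilling-Doi version: coverage at least
                               # 1 - alpha, but no p-value and slow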

A criticism of the sign test is that its power might be lower than the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; the actual probability is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
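A sketch for a one-way design with three hypothetical groups (same sourcing assumption; the list-of-vectors input format is an assumption):

    set.seed(8)
    g <- list(rnorm(20), rnorm(20, mean = 0.4), rlnorm(20))  # three groups
    lincon(g)      # all pairwise comparisons of 20% trimmed means
    linconpb(g)    # percentile bootstrap version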

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2

Equation 8.42.6

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.
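Sketch (same sourcing assumption; the four-vector calling convention, x1, y1, x2, y2, is an assumption):

    set.seed(9)
    x1 <- rnorm(40); y1 <- x1 + rnorm(40)          # hypothetical group 1
    x2 <- rnorm(40); y2 <- 0.2 * x2 + rnorm(40)    # hypothetical group 2
    twopcor(x1, y1, x2, y2)      # modified bootstrap CI for rho1 - rho2
    twohc4cor(x1, y1, x2, y2)    # HC4-based test of H0: rho1 = rho2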

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.
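Continuing the sketch (the calling convention is again an assumption):

    set.seed(10)
    x1 <- rnorm(40); y1 <- x1 + rnorm(40)          # hypothetical group 1
    x2 <- rnorm(40); y2 <- 0.5 * x2 + rnorm(40)    # hypothetical group 2
    reg2ci(x1, y1, x2, y2)    # compares robust regression slopes
    ols2ci(x1, y1, x2, y2)    # least squares version, heteroscedasticity allowed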

4.7 Comparing Correlations: the Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ.


A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.
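Sketch (same sourcing assumption; the convention of passing a matrix of predictors followed by the outcome, with the first two predictors compared by default, is an assumption):

    set.seed(11)
    x1 <- rnorm(60); x2 <- rnorm(60)    # hypothetical independent variables
    y <- x1 + 0.3 * x2 + rnorm(60)      # hypothetical outcome
    regIVcom(cbind(x1, x2), y)    # tests equal association strength for x1
                                  # and x2; Theil-Sen estimator by default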

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
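Sketch of the graphical check and a robust ANCOVA (same sourcing assumption; the calling convention, x and y for each group, is an assumption):

    set.seed(12)
    x1 <- rnorm(50); y1 <- x1 + rnorm(50)          # hypothetical group 1
    x2 <- rnorm(50); y2 <- 0.5 * x2 + rnorm(50)    # hypothetical group 2
    rplot2g(x1, y1, x2, y2, xout = TRUE)  # plot both smooths, removing
                                          # outliers among the predictor
    ancova(x1, y1, x2, y2)    # robust ANCOVA; regression lines need not
                              # be parallel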

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but, because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And, for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh.


Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
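For reference, Hochberg's adjustment is also implemented in base R via p.adjust; a minimal sketch with hypothetical p-values:

    pvals <- c(0.001, 0.012, 0.030, 0.049, 0.200)  # hypothetical raw p-values
    p.adjust(pvals, method = "hochberg")           # multiplicity-adjusted p-values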

It is noted that, in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

53 Mild Traumatic Brain InjuryThe illustrations in this section stem from a

study dealing with mild traumatic brain injury(Almeida-Suhett et al 2014) Briefly 5- to 6-week-old male Sprague Dawley rats receiveda mild controlled cortical impact (CCI) injuryThe dependent variable used here is the stere-ologically estimated total number of GAD-67-positive cells in the basolateral amygdala(BLA) Measures 1 and 7 days after surgerywere compared to the sham-treated controlgroup that received a craniotomy but no CCIinjury A portion of their analyses focused onthe ipsilateral sides of the BLA Box plots ofthe data are shown in Figure 8428 The sam-ple sizes are 13 9 and 8 respectively

Almeida-Suhett et al. (2014) compared means using an ANOVA F test, followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So, if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Here the day 1 group was compared again to the sham group based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but, due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8 Box plots of the number of GAD-67-positive cells in the basolateral amygdala, for the contralateral (contra) and ipsilateral (ipsi) sides, in the sham, day 1, and day 7 conditions.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression, it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)
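Sketch of this exploratory step (same sourcing assumption as earlier; the data are simulated for illustration only):

    set.seed(13)
    age <- runif(53, min = 6, max = 16)                 # hypothetical ages
    cstl <- 0.5 + 0.005 * age + rnorm(53, sd = 0.02)    # hypothetical measure
    lplot(age, cstl)                # Cleveland's smoother, mean of Y given X
    rplot(age, cstl, xout = TRUE)   # running interval smoother (20% trimmed
                                    # means), outliers among age removed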

Here, we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).


Figure 8.42.9 Nonparametric estimates of regression lines (A: CSTL as a function of age; B: GORTFL as a function of age; C: GORTFL as a function of CSTL; D: CSTL as a function of GORTFL).

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend approximately at GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields p = 0.027, which provides additional evidence that there is curvature.


So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001); the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power.


Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but, relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals:

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y).


This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed.


Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when, in fact, there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence intervals based on interpolated order statistics. Statistical Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 9, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics-Simulation and Computations, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives in Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics-Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


Switching to a 20% trimmed mean, the method derived by Yuen (1974) performs fairly well even when the smallest sample size is six (compare to Ozdemir, Wilcox, & Yildiztepe, 2013). It can be applied with the R function yuen. (Yuen's method reduces to Welch's method for comparing means when there is no trimming.) When the smallest sample size is five, it can be unsatisfactory in situations where the percentile bootstrap method used in conjunction with the median continues to perform reasonably well. A rough rule is that the ability to control the Type I error probability improves as the amount of trimming increases. With small sample sizes, and when the goal is to compare means, it is unknown how to control the Type I error probability reasonably well over a reasonably broad range of situations.
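
A minimal sketch of a call to yuen, assuming Wilcox's WRS functions have been sourced (e.g., from the Rallfun file that accompanies Wilcox, 2017a); the data are hypothetical, and tr follows the WRS convention for specifying the amount of trimming.

set.seed(1)
x <- rnorm(12)        # hypothetical group 1
y <- rlnorm(12)       # hypothetical group 2 (lognormal, so skewed)
yuen(x, y, tr = 0.2)  # Yuen's test based on 20% trimmed means
yuen(x, y, tr = 0)    # with no trimming, reduces to Welch's test for means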

Another approach is to focus on P(X < Y), the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. This strategy is based in part on an estimate of the distribution of D = X − Y, the distribution of all pairwise differences between observations in each group.

To illustrate this point, let X1, ..., Xn and Y1, ..., Ym be random samples of size n and m, respectively. Let Dik = Xi − Yk (i = 1, ..., n; k = 1, ..., m). Then the usual estimate of P(X < Y) is simply the proportion of Dik values less than zero. For instance, if X = (1, 2, 3) and Y = (1, 2.5, 4), then D = (0.0, 1.0, 2.0, −1.5, −0.5, 0.5, −3.0, −2.0, −1.0) and the estimate of P(X < Y) is 5/9, the proportion of D values less than zero.
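
This estimate is easy to compute in base R with outer(); the following sketch reproduces the small example above.

x <- c(1, 2, 3)
y <- c(1, 2.5, 4)
D <- as.vector(outer(x, y, "-"))  # all pairwise differences Xi - Yk
mean(D < 0)                       # estimate of P(X < Y): 5/9, about 0.56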

Let μx, μy, and μD denote the population means associated with X, Y, and D, respectively. From basic principles, μx − μy = μD. That is, the difference between two means is the same as the mean of all pairwise differences. However, let θx, θy, and θD denote the population medians associated with X, Y, and D, respectively. For symmetric distributions, θx − θy = θD, but otherwise it is generally the case that θx − θy ≠ θD. In other words, the difference between medians is typically not the same as the median of all pairwise differences. The same is true when using any amount of trimming greater than zero. Roughly, θx and θy reflect the typical response for each group, while θD reflects the typical difference between two randomly sampled participants, one from each group. Although less known, the second perspective can be instructive in many situations, for instance in a clinical setting in which we want to know what effect to expect when randomly sampling a patient and a control participant.

If two groups do not differ in any manner, P(X < Y) = 0.5. Consequently, a basic goal is testing

H0: P(X < Y) = 0.5

Equation 8.42.3

If this hypothesis is rejected, it is reasonable to make a decision about whether P(X < Y) is less than or greater than 0.5. It is readily verified that this is the same as testing

H0: θD = 0

Equation 8.42.4

Certainly, P(X < Y) has practical importance, for reasons summarized, among others, by Cliff (1996), Ruscio (2008), and Newcombe (2006). The WMW test is based on an estimate of P(X < Y). However, it is unsatisfactory in terms of making inferences about this probability, because the estimate of the standard error assumes that the distributions are identical. If the distributions differ, an incorrect estimate of the standard error is being used. More modern methods deal with this issue. The method derived by Cliff (1996) for testing Equation 8.42.3 performs relatively well with small sample sizes and can be applied via the R function cidv2 (compare to Ruscio & Mullen, 2012). (We are not aware of any commercial software package that contains this method.)
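
A minimal sketch of Cliff's method, again assuming the WRS functions are loaded and reusing the hypothetical x and y from the earlier sketch.

cidv2(x, y)  # tests H0: P(X < Y) = 0.5 without assuming identical distributions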

However, there is the complication that, for skewed distributions, differences among the means, for example, can be substantially smaller, as well as substantially larger, than differences among 20% trimmed means or medians. That is, regardless of how large the sample sizes might be, power can be substantially impacted by which measure of central tendency is used.

4.2 Comparing Quantiles

Rather than compare groups based on a single measure of central tendency, typically the mean, another approach is to compare multiple quantiles. This provides more detail about where and how the two distributions differ (Rousselet et al., 2017). For example, the typical participants might not differ very much based on the medians, but the reverse might be true among low scoring individuals.


First, consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.
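
A minimal sketch, assuming the WRS functions are loaded; the q argument is an assumption based on WRS conventions (by default, qcomhd examines a standard set of quantiles).

set.seed(1)
x <- rnorm(40)
y <- rlnorm(40)
qcomhd(x, y, q = seq(0.1, 0.9, by = 0.1))  # compare all nine deciles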

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let xq be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that xq + x1−q = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x0.25 = −4 and x0.75 = 6, then, for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H0: xq + x1−q = 0

Equation 8.42.5

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥ 10. For q = 0.1, sample sizes ≥ 20 should be used (Wilcox, 2012).
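
A minimal sketch, assuming the WRS functions are loaded; the q argument name is an assumption based on the description above, and x and y are the hypothetical samples from the previous sketch.

cbmhd(x, y, q = 0.25)  # tests H0: x_q + x_(1-q) = 0; a confidence interval is returned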

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that, when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM, as well as related estimators. The R function pb2gen applies this method.
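
A minimal sketch, assuming the WRS functions (including the estimator mom) are loaded; passing est = mom is an assumption based on WRS conventions, and x and y are the hypothetical samples from the earlier sketch.

pb2gen(x, y, est = mom)  # percentile bootstrap comparison of two groups via MOM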

4.4 Dependent Variables

Next, consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X1, Y1), ..., (Xn, Yn), let Di = Xi − Yi (i = 1, ..., n). Even when X and Y are correlated, μx − μy = μD: the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θx − θy ≠ θD.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θD for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error. And in general it is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.
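
A minimal sketch for two dependent variables, assuming the WRS functions are loaded; the exact argument layout is an assumption, but dif = FALSE is described above.

set.seed(1)
x <- rnorm(20)                  # hypothetical paired measurements
y <- x + rnorm(20, mean = 0.3)
wmcppb(x, y)                    # inferences on the 20% trimmed mean of the differences
wmcppb(x, y, dif = FALSE)       # compare the marginal 20% trimmed means instead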

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that, when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case: the actual probability of one or more Type I errors is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
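
A minimal sketch for a one-way design, assuming the WRS functions are loaded and that, per WRS conventions, the groups can be stored in list mode; the data are hypothetical.

set.seed(1)
g <- list(rnorm(15), rnorm(15, 0.5), rlnorm(15))  # three hypothetical groups
lincon(g)    # all pairwise comparisons of 20% trimmed means (non-bootstrap)
linconpb(g)  # percentile bootstrap version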

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2

Equation 8.42.6

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular when there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association, because independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small; the amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.
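
A minimal sketch comparing the correlations of two independent groups, assuming the WRS functions are loaded; the data are hypothetical.

set.seed(1)
x1 <- rnorm(30); y1 <- x1 + rnorm(30)        # group 1
x2 <- rnorm(30); y2 <- 0.5 * x2 + rnorm(30)  # group 2
twopcor(x1, y1, x2, y2)    # modified bootstrap CI for rho1 - rho2 (alpha = 0.05 only)
twohc4cor(x1, y1, x2, y2)  # HC4-based test of H0: rho1 = rho2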

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y, another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.
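
A minimal sketch, assuming the WRS functions are loaded; the predictor layout is an assumption, and regfun = ols (noted in Section 5.4) switches to least squares.

set.seed(1)
x <- matrix(rnorm(100), ncol = 2)       # hypothetical predictors X1, X2
y <- x[, 1] + 0.2 * x[, 2] + rnorm(50)  # hypothetical outcome
regIVcom(x, y)                # compares association strengths; Theil-Sen by default
regIVcom(x, y, regfun = ols)  # least squares version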

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ, compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
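
A minimal sketch of a robust ANCOVA, assuming the WRS functions are loaded; the data are hypothetical, and xout = TRUE removes outliers among the covariate as described above.

set.seed(1)
x1 <- rnorm(40); y1 <- x1 + rnorm(40)        # covariate and outcome, group 1
x2 <- rnorm(40); y2 <- 0.5 * x2 + rnorm(40)  # covariate and outcome, group 2
ancova(x1, y1, x2, y2, xout = TRUE)  # compare groups at several covariate values
ancdet(x1, y1, x2, y2, xout = TRUE)  # method TAP: more detailed comparisons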

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


[Figure 8.42.6: plot panels omitted; panel A shows thresholds by body position, panel B shows all pairwise contrasts.]

Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
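
Hochberg's adjustment is also available in base R via p.adjust(); a minimal sketch with hypothetical p-values:

pvals <- c(0.001, 0.004, 0.012, 0.020, 0.090)  # hypothetical p-values from 5 tests
p.adjust(pvals, method = "hochberg")           # adjusted p-values; reject if <= alpha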

It is noted that, in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians (the largest p-value is p = 0.005). In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


[Figure 8.42.7: plot panels omitted; x-axes show latency (msec) and duration (msec), y-axes show density (top row) and decile differences (middle and bottom rows).]

Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group that received a craniotomy, but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test, followed by Bonferroni post hoc tests. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So, if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, the reverse decision is made.

Here the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.

[Figure 8.42.8: box plots omitted; y-axis: number of GAD-67-positive cells in the basolateral amygdala; conditions: sham, day 1, and day 7, for the contralateral (contra) and ipsilateral (ipsi) sides.]

Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression, it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).

[Figure 8.42.9: scatterplots with smoothers omitted; panels: (A) age vs. CSTL, (B) age vs. GORTFL, (C) CSTL vs. GORTFL, (D) GORTFL vs. CSTL.]

Figure 8.42.9 Nonparametric estimates of the regression line for predicting GORTFL with CSTL.

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, the regression line now appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)
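
A minimal sketch of the two smoothers, assuming the WRS functions are loaded; the data are hypothetical.

set.seed(1)
x <- rnorm(50); y <- x + rnorm(50)
lplot(x, y)               # Cleveland's smoother (estimates the mean of y given x)
rplot(x, y)               # running interval smoother (20% trimmed mean by default)
rplot(x, y, xout = TRUE)  # same, after removing outliers among the predictor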

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, there now appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data, where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies; the R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data, where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.) For a single variable, see the base-R sketch of the MAD-median rule following this list.

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).
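
For a single variable, the MAD-median rule mentioned in Section 4.3 avoids masking and is easy to compute in base R (mad() applies the usual 1.4826 rescaling by default; 2.24 is the customary cutoff). A minimal sketch with hypothetical data:

set.seed(1)
x <- c(rnorm(20), 8)                        # hypothetical data with one extreme value
flag <- abs(x - median(x)) / mad(x) > 2.24  # MAD-median rule
x[flag]                                     # values flagged as outliers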

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals (a short sketch follows the list):

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores, using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores, using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.
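
A minimal sketch of these options, assuming paired data in x and y and the WRS functions loaded; the data are hypothetical.

x <- c(5, 7, 9, 4, 6, 8, 7, 5)   # hypothetical paired data
y <- c(6, 9, 9, 6, 7, 10, 8, 7)
signt(x, y)      # sign test (add SD = TRUE for guaranteed coverage)
sintv2(x - y)    # median of the difference scores
trimcipb(x - y)  # 20% trimmed mean of the difference scores, bootstrap CI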

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes > 10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors (a short sketch follows the list):

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second, using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
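
A minimal sketch of these options for two small independent groups, assuming the WRS functions are loaded; the data are hypothetical.

x <- c(4, 7, 8, 3, 6, 9, 5)  # hypothetical group 1
y <- c(6, 9, 11, 8, 7, 12)   # hypothetical group 2
medpb2(x, y)   # medians, percentile bootstrap
trimpb2(x, y)  # 20% trimmed means, percentile bootstrap
cidv2(x, y)    # Cliff's method for P(X < Y)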

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights for the massive number of published papers using conventional methods. These insights do not necessarily imply that those results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469
Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627
Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.
Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724
Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x
Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158
Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.
Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038
Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.
Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202
Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x
Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421
Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986
Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551
Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.
Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.
Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.
Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635
Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.
Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.
Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800
Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383
Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121
Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.
Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380
Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766
Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179
Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.
Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.
Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323
Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886
Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163
Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606
Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003
Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x
Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.
Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400
Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119
Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610
Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19
Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329
Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274
Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.
Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015
Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.
Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A, 25, 331–352.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078
Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLoS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128
Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350
Wiederholt, J. L., & Bryant, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.
Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151
Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460
Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.
Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.
Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.
Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105
Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.
Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591
Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060
Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.
Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165



First consider the goal of comparing all quantiles in a manner that controls the probability of one or more Type I errors among all the tests that are performed. Assuming random sampling only, Doksum and Sievers (1976) derived such a method, which can be applied via the R function sband. The method is based on a generalization of the Kolmogorov-Smirnov test. A negative feature is that power can be adversely affected when there are tied values. And when the goal is to compare the more extreme quantiles, again power might be relatively low. A way of reducing these concerns is to compare the deciles using a percentile bootstrap method in conjunction with the quantile estimator derived by Harrell and Davis (1982). This is easily done with the R function qcomhd.
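
As a rough sketch (qcomhd is defined in Wilcox's Rallfun file; the file name below is illustrative, and the data are simulated):

    # Compare the deciles of two independent groups using the
    # Harrell-Davis estimator plus a percentile bootstrap
    source("Rallfun-v35.txt")
    set.seed(2)
    g1 <- rlnorm(40)
    g2 <- 1.5 * rlnorm(40)
    qcomhd(g1, g2, q = seq(0.1, 0.9, by = 0.1))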

Note that if the distributions associated with X and Y do not differ, then D = X − Y will have a symmetric distribution about zero. Let x_q be the qth quantile of D, 0 < q < 0.5. In particular, it will be the case that x_q + x_{1−q} = 0 when X and Y have identical distributions. The median (2nd quartile) will be zero, and, for instance, the sum of the 3rd quartile (0.75 quantile) and the 1st quartile (0.25 quantile) will be zero. So this sum provides yet another perspective on how distributions differ (see illustrations in Rousselet et al., 2017).

Imagine, for example, that an experimental group is compared to a control group based on some measure of depressive symptoms. If x_{0.25} = −4 and x_{0.75} = 6, then for a single randomly sampled observation from each group, there is a sense in which the experimental treatment outweighs no treatment, because positive differences (beneficial effect) tend to be larger than negative differences (detrimental effect). The hypothesis

H_0: x_q + x_{1−q} = 0   (Equation 8.42.5)

can be tested with the R function cbmhd. A confidence interval is returned as well. Current results indicate that the method provides reasonably accurate control over the Type I error probability when q = 0.25 and the sample sizes are ≥ 10. For q = 0.1, sample sizes ≥ 20 should be used (Wilcox, 2012).
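
A sketch of the call (cbmhd is also from the Rallfun file; the name of the quantile argument may differ across versions, so treat it as an assumption):

    # Test H0: x_q + x_(1-q) = 0 for q = 0.25 on simulated data
    source("Rallfun-v35.txt")
    set.seed(3)
    exper <- rnorm(30, mean = 0.5)  # experimental group
    ctrl  <- rnorm(30)              # control group
    cbmhd(exper, ctrl, q = 0.25)    # returns a confidence interval as well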

4.3 Eliminate Outliers and Average the Remaining Values

Rather than use means, trimmed means, or the median, another approach is to use an estimator that down-weights or eliminates outliers. For example, use the MAD-median rule to search for outliers, remove any that are found, and average the remaining values. This is generally known as a modified one-step M-estimator (MOM). There is a method for estimating the standard error, but currently a percentile bootstrap method seems preferable when testing hypotheses. This approach might seem preferable to using a trimmed mean or median because trimming can eliminate points that are not outliers. But this issue is far from simple. Indeed, there are indications that when testing hypotheses, the expectation is that using a trimmed mean or median will perform better in terms of Type I errors and power (Wilcox, 2017a). However, there are exceptions: no single estimator dominates.

As previously noted, an invalid strategy is to eliminate extreme values and apply conventional methods for means based on the remaining data, because the wrong standard error is used. Switching to a percentile bootstrap deals with this issue when using MOM as well as related estimators. The R function pb2gen applies this method.
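
A brief sketch (pb2gen and the MOM estimator mom are from the Rallfun file; the file name is assumed):

    # Compare two independent groups using MOM with a percentile bootstrap
    source("Rallfun-v35.txt")
    set.seed(4)
    a <- c(rnorm(25), 8, 9)   # sample contaminated by two outliers
    b <- rnorm(25, mean = 0.7)
    pb2gen(a, b, est = mom)   # MOM passed as the location estimator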

4.4 Dependent Variables

Next consider the goal of comparing two dependent variables. That is, the variables might be correlated. Based on the random sample (X_1, Y_1), ..., (X_n, Y_n), let D_i = X_i − Y_i (i = 1, ..., n). Even when X and Y are correlated, μ_x − μ_y = μ_D; the difference between the population means is equal to the mean of the difference scores. But under general conditions this is not the case when working with trimmed means. When dealing with medians, for example, it is generally the case that θ_x − θ_y ≠ θ_D.

If the distribution of D is symmetric and light-tailed (outliers are relatively rare), the paired t-test performs reasonably well. As we move toward a skewed distribution, at some point this is no longer the case, for reasons summarized in Section 2.1. Moreover, power and control over the probability of a Type I error are also a function of the likelihood of encountering outliers.

There is a method for computing a confidence interval for θ_D for which the probability of a Type I error can be determined exactly, assuming random sampling only (e.g., Hettmansperger & McKean, 2011). In practice, a slight modification of this method, derived by Hettmansperger and Sheather (1986), is recommended. When sample sizes are very small, this method performs very well in terms of controlling the probability of a Type I error, and in general it is an excellent method for making inferences about θ_D. The method can be applied via the R function sintv2.
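
For example, under the stated assumptions about the Rallfun file:

    # CI for the median of difference scores (Hettmansperger-Sheather method)
    source("Rallfun-v35.txt")
    set.seed(5)
    pre  <- rnorm(15)
    post <- pre + rnorm(15, mean = 0.4)  # correlated second measurement
    sintv2(pre - post)                   # inference about the median of D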

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.

In some situations there might be interest in comparing measures of central tendency associated with the marginal distributions rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H_0: θ_x = θ_y, rather than H_0: θ_D = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well, even when there are tied values. If the goal is to compare the marginal trimmed means rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.
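
A sketch contrasting the two hypotheses (wmcppb is from the Rallfun file; the exact call pattern for dependent groups is an assumption based on the description above):

    # Difference scores versus marginal trimmed means for paired data
    source("Rallfun-v35.txt")
    set.seed(6)
    husband <- rnorm(30)
    wife    <- husband + rnorm(30, mean = 0.3)
    m <- cbind(husband, wife)
    wmcppb(m)               # default dif = TRUE: trimmed mean of D
    wmcppb(m, dif = FALSE)  # marginal 20% trimmed means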

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs

It is noted that when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.
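
In base R, for example, nothing obligates a significant omnibus F test before applying Tukey's method:

    # Tukey-Kramer comparisons applied directly, without gating on the F test
    set.seed(8)
    g <- factor(rep(c("a", "b", "c"), each = 15))
    y <- rnorm(45) + (g == "c") * 0.8
    fit <- aov(y ~ g)
    TukeyHSD(fit)  # valid regardless of whether summary(fit) rejects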


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
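
As a sketch (lincon and linconpb are from the Rallfun file; data are supplied in list mode, one element per group):

    # All pairwise comparisons of 20% trimmed means, one-way design
    source("Rallfun-v35.txt")
    set.seed(9)
    groups <- list(rnorm(20), rnorm(20, mean = 0.5), rlnorm(20))
    lincon(groups)    # non-bootstrap method
    linconpb(groups)  # percentile bootstrap version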

4.6 Comparing Independent Correlations and Regression Slopes

Next consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρ_j (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H_0: ρ_1 = ρ_2   (Equation 8.42.6)

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular when there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association, because independence implies homoscedasticity; but under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ_1 − ρ_2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ_1 − ρ_2 when the sample size is small; the amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.
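
A sketch of both approaches on simulated data (the functions are from the Rallfun file; the file name is assumed):

    # Compare Pearson correlations across two independent groups
    source("Rallfun-v35.txt")
    set.seed(10)
    x1 <- rnorm(40); y1 <- x1 + rnorm(40)        # group 1: strong association
    x2 <- rnorm(40); y2 <- 0.2 * x2 + rnorm(40)  # group 2: weak association
    twopcor(x1, y1, x2, y2)    # modified percentile bootstrap (alpha = 0.05 only)
    twohc4cor(x1, y1, x2, y2)  # HC4-based test; yields a p-value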

As previously noted, Pearson's correlation is not robust: even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well, using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X_1 and X_2. A common and fundamental goal is understanding the relative importance of X_1 and X_2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of association, one between X_1 and Y and another between X_2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H_0: ρ_1 = ρ_2, where now ρ_j (j = 1, 2) is the correlation between Y and X_j. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X_1 can depend on whether X_2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X_1 and X_2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.
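
A sketch (regIVcom is from the Rallfun file; the call pattern, with the independent variables supplied as the columns of a matrix, is an assumption):

    # Which of two independent variables is more strongly related to y?
    source("Rallfun-v35.txt")
    set.seed(11)
    x <- cbind(rnorm(60), rnorm(60))
    y <- x[, 1] + 0.3 * x[, 2] + rnorm(60)
    regIVcom(x, y)  # Theil-Sen estimator by default; heteroscedasticity allowed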

4.8 ANCOVA

The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel; (2) for each regression line there is homoscedasticity; (3) the variance of the dependent variable is the same for both groups; (4) normality; and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern, and violating two or more of them makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions (see chapter 12 of Wilcox, 2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique and, perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
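
A sketch of this workflow on simulated data (the functions and the xout argument are as described above; the Rallfun file name is an assumption):

    # Plot robust smooths for two groups, then run a robust ANCOVA
    source("Rallfun-v35.txt")
    set.seed(12)
    x1 <- rnorm(50); y1 <- x1 + rnorm(50)
    x2 <- rnorm(50); y2 <- 0.5 * x2 + rnorm(50)
    rplot2g(x1, y1, x2, y2, xout = TRUE)  # visual check for curvature
    ancova(x1, y1, x2, y2, xout = TRUE)   # compare groups; no parallelism assumed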

5. SOME ILLUSTRATIONS

Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but, as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results are obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing, because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians (the largest p-value is p = 0.005). In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague-Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group, which received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by Bonferroni post hoc tests. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, the reverse decision is made.

Here, the day 1 group was compared again to the sham group, based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.


Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala. The y-axis shows the number of GAD-67-positive cells in the basolateral amygdala for the Sham, Day 1, and Day 7 conditions.

The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb yields p = 0.031. So focusing on θ_D, the median of all pairwise differences, rather than on the individual medians can make a practical difference.

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit; presumably they were least squares regression or Pearson's correlation coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).
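
The corresponding calls look roughly as follows (lplot and rplot are from Wilcox's Rallfun file; the data here are simulated stand-ins, not the Houston et al. data):

    # Two smoothers as checks on linearity and on the impact of outliers
    source("Rallfun-v35.txt")
    set.seed(13)
    age <- runif(53, 6, 16)
    cst <- 0.45 + 0.005 * age + rnorm(53, sd = 0.02)
    lplot(age, cst)               # Cleveland's smoother (estimates means)
    rplot(age, cst, xout = TRUE)  # running-interval smoother, outliers removed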


Figure 8.42.9 Nonparametric estimates of regression lines: CSTL as a function of age (A), GORTFL as a function of age (B), GORTFL as a function of CSTL (C), and CSTL as a function of GORTFL (D).

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederholt & Bryant, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, the regression line now appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, there now appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
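
A sketch of this piecewise analysis on simulated stand-in data (regci and reg2ci are from the Rallfun file; Theil-Sen is the default estimator, per the text):

    # Fit robust slopes below and above an apparent bend, then compare them
    source("Rallfun-v35.txt")
    set.seed(14)
    gort <- runif(53, 20, 120)
    cstl <- ifelse(gort < 70, 0.44 + 0.001 * gort, 0.51) + rnorm(53, sd = 0.01)
    low  <- gort < 70
    regci(gort[low], cstl[low])    # slope below the bend
    regci(gort[!low], cstl[!low])  # slope above the bend
    reg2ci(gort[low], cstl[low], gort[!low], cstl[!low])  # equal slopes?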

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important via the R function regIVcom indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies; the R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests, ANOVA F tests on means, and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ, in a way that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well; a brief sketch follows this list.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).
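
As flagged above, here is a minimal sketch of outlier checking that avoids masking (outpro and outmgv are defined in Wilcox's Rallfun file; the file name is illustrative):

    # Multivariate outlier detection without relying on means and variances
    source("Rallfun-v35.txt")
    set.seed(15)
    m <- cbind(rnorm(50), rnorm(50))
    m[1, ] <- c(6, 6)  # plant one clear outlier
    outpro(m)          # projection-type method
    outmgv(m)          # minimum generalized variance method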

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals.

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.
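To make these options concrete, here is a minimal R sketch. It assumes that Wilcox's functions (e.g., those accompanying Wilcox, 2017a) have been sourced into the session, and it uses hypothetical paired data; the exact argument lists may vary across versions of these functions.

    # Hypothetical paired measurements
    set.seed(1)
    x <- rnorm(10)
    y <- x + rnorm(10)
    signt(x, y, SD = TRUE)  # sign test; SD = TRUE keeps the Type I error below the nominal level
    sintv2(x - y)           # inferences about the median of the difference scores
    trimcipb(x - y)         # percentile bootstrap for the 20% trimmed mean of the differences
    lband(x, y)             # compare the quantiles of the marginal distributions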

If the goal is to compare two independent groups and the sample sizes are ≤10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.
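A corresponding sketch for two independent groups, again assuming Wilcox's functions have been sourced and using hypothetical data:

    set.seed(1)
    g1 <- rnorm(10)
    g2 <- rnorm(10, mean = 0.5)
    medpb2(g1, g2)   # medians via a percentile bootstrap
    sband(g1, g2)    # compare the quantiles
    cidv2(g1, g2)    # P(X < Y): a Cliff-type analysis
    trimpb2(g1, g2)  # 20% trimmed means via a percentile bootstrap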

7. CONCLUDING REMARKS
It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
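As a brief illustration, the adjustment can be applied to a set of hypothetical p-values in base R as follows:

    pvals <- c(0.001, 0.010, 0.020, 0.040, 0.300)  # hypothetical p-values
    p.adjust(pvals, method = "hochberg")  # Hochberg (1988) improvement on Bonferroni
    p.adjust(pvals, method = "hommel")    # Hommel (1988) alternative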

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression, where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS
The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED
Agresti, A., & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469
Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627
Bessel, F. W. (1818). Fundamenta astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Konigsberg: Friedrich Nicolovius.
Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724
Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x
Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158
Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.
Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038
Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.
Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202
Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x
Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421
Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986
Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551
Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.
Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.
Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.
Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635
Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.
Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.
Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800
Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383
Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121
Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.
Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380
Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766
Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179
Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.
Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.
Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323
Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886
Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics - Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163
Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606
Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003
Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x
Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.
Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400
Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119
Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610
Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19
Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329
Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274
Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.
Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015
Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.
Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078
Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128
Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350
Wiederhold, J. L., & Bryan, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: ProEd Publishing.
Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics - Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151
Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460
Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.
Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.
Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.
Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105
Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.
Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591
Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060
Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock Johnson-III tests of achievement. Rolling Meadows, IL: Riverside Publishing.
Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


is an excellent method for making inferences about θD. The method can be applied via the R function sintv2.

As for trimmed means, with the focus still on D, a percentile bootstrap method can be used via the R function trimpb or wmcppb. Again, with 20% trimming, reasonably good control over the Type I error probability can be achieved. With n = 20, the percentile bootstrap method is better than the non-bootstrap method derived by Tukey and McLaughlin (1963). With large enough sample sizes, the Tukey-McLaughlin method can be used in lieu of the percentile bootstrap method via the R function trimci, but it is unclear just how large the sample size must be.
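For example, assuming Wilcox's functions have been sourced, the two approaches can be applied to hypothetical paired data as follows:

    set.seed(2)
    x <- rnorm(20)
    y <- x + rnorm(20)
    trimpb(x - y)  # percentile bootstrap; 20% trimming by default
    trimci(x - y)  # Tukey-McLaughlin method, an option for larger sample sizes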

In some situations, there might be interest in comparing measures of central tendency associated with the marginal distributions rather than the difference scores. Imagine, for example, that participants consist of married couples. One issue might be the typical difference between a husband and his wife, in which case difference scores would be used. Another issue might be how the typical male compares to the typical female. So now the goal would be to test H0: θx = θy, rather than H0: θD = 0. The R function dmeppb tests the first of these hypotheses and performs relatively well even when there are tied values. If the goal is to compare the marginal trimmed means, rather than make inferences about the trimmed mean of the difference scores, use the R function dtrimpb, or use wmcppb and set the argument dif = FALSE. When dealing with a moderately large sample size, the R function yuend can be used instead, but there is no clear indication just how large the sample size must be. A collection of quantiles can be compared with Dqcomhd, and all of the quantiles can be compared via the function lband.
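A sketch of these marginal comparisons, with x and y again denoting hypothetical paired R variables (argument conventions may vary across versions of these functions):

    dmeppb(x, y)   # test H0: theta_x = theta_y based on the marginal medians
    dtrimpb(x, y)  # compare the marginal 20% trimmed means
    yuend(x, y)    # non-bootstrap alternative for moderately large n
    Dqcomhd(x, y)  # compare a collection of quantiles of the marginals
    lband(x, y)    # compare all of the quantiles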

Yet another approach is to use the classic sign test, which is aimed at making inferences about P(X < Y). As is evident, this probability provides a useful perspective on the nature of the difference between the two dependent variables under study, beyond simply comparing measures of central tendency. The R function signt performs the sign test, which by default uses the method derived by Agresti and Coull (1998). If the goal is to ensure that the confidence interval has probability coverage at least 1 − α, rather than approximately equal to 1 − α, the Schilling and Doi (2014) method can be used by setting the argument SD = TRUE when using the R function signt. In contrast to the Schilling and Doi method, p-values can be computed when using the Agresti and Coull technique. Another negative feature of the Schilling and Doi method is that execution time can be extremely high, even with a moderately large sample size.
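In R, the two versions of the sign test just described would be called roughly as follows (hypothetical paired data; assumes Wilcox's functions are sourced):

    signt(x, y)             # Agresti-Coull method; yields a p-value
    signt(x, y, SD = TRUE)  # Schilling-Doi method; coverage at least 1 - alpha, but slower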

A criticism of the sign test is that its power might be lower than that of the Wilcoxon signed rank test. However, this issue is not straightforward. Moreover, the sign test can reject in situations where other conventional methods do not. Again, which method has the highest power depends on the characteristics of the unknown distributions generating the data. Also, in contrast to the sign test, the Wilcoxon signed rank test provides no insight into the nature of any difference that might exist without making rather restrictive assumptions about the underlying distributions. In particular, under general conditions it does not compare medians or some other measure of central tendency, as previously noted.

4.5 More Complex Designs
It is noted that when dealing with a one-way or higher ANOVA design, violations of the normality and homoscedasticity assumptions associated with classic methods for means become an even more serious issue in terms of both Type I error probabilities and power. Robust methods have been derived (Wilcox, 2017a, 2017c), but the many details go beyond the scope of this paper. However, a few points are worth stressing.

Momentarily assume normality and homoscedasticity. Another important insight has to do with the role of the ANOVA F test versus post hoc multiple comparison procedures such as the Tukey-Kramer method. In terms of controlling the probability of one or more Type I errors, is it necessary to first reject with the ANOVA F test? The answer is an unequivocal no. With equal sample sizes, the Tukey-Kramer method provides exact control. But if it is used only after the ANOVA F test rejects, this is no longer the case; it is lower than the nominal level (Bernhardson, 1975). For unequal sample sizes, the probability of one or more Type I errors is less than or equal to the nominal level when using the Tukey-Kramer method. But if it is used only after the ANOVA F test rejects, it is even lower, which can negatively impact power. More generally, if an experiment aims to test specific hypotheses involving subsets of conditions, there is no obligation to first perform an ANOVA; the analyses should focus directly on the comparisons of interest using, for instance, the functions for linear contrasts listed below.


Now consider non-normality and heteroscedasticity. When performing all pairwise comparisons, for example, most modern methods are designed to control the probability of one or more Type I errors without first performing a robust analog of the ANOVA F test. There are, however, situations where a robust analog of the ANOVA F test can help increase power (e.g., Wilcox, 2017a, section 7.4).

For a one-way design where the goal is to compare all pairs of groups, a percentile bootstrap method can be used via the R function linconpb. A non-bootstrap method is performed by lincon. For medians, use medpb. For an extension of Cliff's method, use cidmul. Methods and corresponding R functions for both two-way and three-way designs, including techniques for dependent groups, are available as well (see Wilcox, 2017a, 2017c).
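A minimal sketch for a one-way design, where the groups are stored in a hypothetical list g (these functions typically accept data in list mode or as a matrix; check the documentation for the version in use):

    set.seed(3)
    g <- list(rnorm(15), rnorm(15), rnorm(15, mean = 0.5))
    linconpb(g)  # all pairwise comparisons of 20% trimmed means, percentile bootstrap
    lincon(g)    # non-bootstrap version
    medpb(g)     # pairwise comparisons based on medians
    cidmul(g)    # extension of Cliff's method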

4.6 Comparing Independent Correlations and Regression Slopes

Next, consider two independent groups where, for each group, there is interest in the strength of the association between two variables. A common goal is to test the hypothesis that the strength of association is the same for both groups.

Let ρj (j = 1, 2) be Pearson's correlation for the jth group, and consider the goal of testing

H0: ρ1 = ρ2    (Equation 8.42.6)

Various methods for accomplishing this goal are known to be unsatisfactory (Wilcox, 2009). For example, one might use Fisher's r-to-z transformation, but it follows immediately from results in Duncan and Layard (1973) that this approach performs poorly under general conditions. Methods that assume homoscedasticity, as depicted in Figure 8.42.4, can be unsatisfactory as well. As previously noted, when there is an association (the variables are dependent), and in particular when there is heteroscedasticity, a practical concern is that the wrong standard error is being used when testing hypotheses about the slopes. That is, the derivation of the test statistic is valid when there is no association; independence implies homoscedasticity. But under general conditions it is invalid when there is heteroscedasticity. This concern extends to inferences made about Pearson's correlation.

There are two methods that perform relatively well in terms of controlling the Type I error probability. The first is based on a modification of the basic percentile bootstrap method. Imagine that Equation 8.42.6 is rejected if the confidence interval for ρ1 − ρ2 does not contain zero. So the Type I error probability depends on the width of this confidence interval. If it is too short, the actual Type I error probability will exceed 0.05. With small sample sizes, this is exactly what happens when using the basic percentile bootstrap method. The modification consists of widening the confidence interval for ρ1 − ρ2 when the sample size is small. The amount it is widened depends on the sample sizes. The method can be applied via the R function twopcor. A limitation is that this method can be used only when the Type I error is 0.05, and it does not yield a p-value.

The second approach is to use a method that estimates the standard error in a manner that deals with heteroscedasticity. When dealing with the slope of the least squares regression line, several methods are now available for getting valid estimates of the standard error when there is heteroscedasticity (Wilcox, 2017a). One of these is called the HC4 estimator, which can be used to test Equation 8.42.6 via the R function twohc4cor.
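A sketch of both approaches, using hypothetical data for two independent groups; the four-argument form shown here is an assumption that should be checked against the documentation for the functions in use:

    set.seed(4)
    x1 <- rnorm(40); y1 <- x1 + rnorm(40)
    x2 <- rnorm(40); y2 <- 0.2 * x2 + rnorm(40)
    twopcor(x1, y1, x2, y2)    # modified bootstrap CI for rho1 - rho2 (0.05 level only)
    twohc4cor(x1, y1, x2, y2)  # HC4-based test of H0: rho1 = rho2; yields a p-value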

As previously noted, Pearson's correlation is not robust; even a single outlier might substantially impact its value, giving a distorted sense of the strength of the association among the bulk of the points. Switching to Kendall's tau or Spearman's rho, a basic percentile bootstrap method can now be used to compare two independent groups in a manner that allows heteroscedasticity, via the R function twocor. As noted in Section 2.3, the skipped correlation can be used via the R function scorci.

The slopes of regression lines can be compared as well using methods that allow heteroscedasticity. For least squares regression, use the R function ols2ci. For robust regression estimators, use reg2ci.
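Continuing the sketch above, robust correlations and slopes can be compared along the following lines (again assuming the four-argument form):

    twocor(x1, y1, x2, y2)  # compare two robust correlations via a percentile bootstrap
    ols2ci(x1, y1, x2, y2)  # compare least squares slopes, heteroscedasticity allowed
    reg2ci(x1, y1, x2, y2)  # compare robust slopes (Theil-Sen by default)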

4.7 Comparing Correlations: The Overlapping Case

Consider a single dependent variable Y and two independent variables, X1 and X2. A common and fundamental goal is understanding the relative importance of X1 and X2 in terms of their association with Y. A typical mistake in neuroscience is to perform two separate tests of associations, one between X1 and Y and another between X2 and Y, without explicitly comparing the association strengths between the independent variables (Nieuwenhuis, Forstmann, & Wagenmakers, 2011). For instance, reporting that one test is significant and the other is not cannot be used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in their section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived for situations where there is curvature (Wilcox, in press), but it is limited to two independent variables.
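A minimal sketch, where x is a hypothetical n-by-2 matrix of independent variables and y the dependent variable (assuming Wilcox's functions are sourced; the exact argument conventions should be verified):

    set.seed(5)
    x <- matrix(rnorm(120), ncol = 2)
    y <- x[, 1] + rnorm(60)
    TWOpov(x, y)    # compare the two overlapping Pearson correlations
    regIVcom(x, y)  # compare association strengths with both predictors in the model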

4.8 ANCOVA
The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. There is now a vast array of more modern methods that deal with violations of all of these assumptions in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique, and perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
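For two hypothetical groups with data (x1, y1) and (x2, y2), these checks might proceed as follows (a sketch; argument conventions may vary across versions of these functions):

    lplot2g(x1, y1, x2, y2, xout = TRUE)  # overlay Cleveland smoothers, outliers removed
    rplot2g(x1, y1, x2, y2, xout = TRUE)  # overlay running interval smoothers
    ancova(x1, y1, x2, y2)  # robust ANCOVA; regression lines need not be parallel
    ancdet(x1, y1, x2, y2)  # method TAP: detailed comparison of the regression lines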

5. SOME ILLUSTRATIONS
Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain
The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
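In outline, with the touch measures stored in a hypothetical n-by-6 matrix m (one column per body part), the three analyses take roughly this form; each controls the probability of one or more Type I errors across the 15 comparisons:

    wmcp(m, tr = 0)  # all pairwise comparisons based on means
    dmedpb(m)        # based on medians
    wmcppb(m)        # based on 20% trimmed means, percentile bootstrap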

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex
The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury
The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group that received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, the reverse decision is made.

Here, the day 1 group was compared again to the sham group based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability. The main point here is that the choice between Yuen and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians can make a practical difference.

Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala (y-axis: number of GAD-67-positive cells in the basolateral amygdala).

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures based on the day 1 group. As for the day 7 data, all of the methods used for the day 1 data have p-values ≤0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability
The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)
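For hypothetical vectors age (predictor) and cst (outcome), these checks take the following form (assuming Wilcox's functions are sourced):

    lplot(age, cst)               # Cleveland's smoother
    rplot(age, cst)               # running interval smoother (20% trimmed mean)
    rplot(age, cst, xout = TRUE)  # same, with outliers among the predictor removed
    qsm(age, cst)                 # quantile regression smoother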

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL). Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9 Nonparametric estimates of the regression line for predicting GORTFL with CSTL.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, the regression line now appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, there now appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
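In outline, with hypothetical vectors gort and cst and the split point at 70, the analysis just described takes this form:

    lo <- gort < 70
    regci(gort[lo], cst[lo])    # Theil-Sen slope for GORTFL < 70
    regci(gort[!lo], cst[!lo])  # slope for GORTFL > 70
    reg2ci(gort[lo], cst[lo], gort[!lo], cst[!lo])  # test that the two slopes are equal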

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important via the R function regIVcom indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE
Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests, ANOVA F tests on means, and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methodsbased on means they have a relativelyhigh risk of poor control over the Type Ierror probability as well as poor powerAnother possible concern is that whendealing with skewed distributions themean might be an unsatisfactorysummary of the typical response Resultsbased on means are not necessarilyinaccurate but relative to other methodsthat might be used there are seriouspractical concerns that are difficult toaddress Importantly when using meanseven a significant result can yield arelatively inaccurate and unrevealingsense of how distributions differ and anon-significant result cannot be used toconclude that distributions do not differ

As a useful alternative to comparisons basedon means consider using a shift functionor some other method for comparingmultiple quantiles Sometimes thesemethods can provide a deeperunderstanding of where and how groupsdiffer that has practical value asillustrated in Figure 8427 There is eventhe possibility that they yield significantresults when methods based on meanstrimmed means and medians do not Fordiscrete data where the variables have alimited number of possible valuesconsider the R function binband (seeWilcox 2017c section 12117 orWilcox 2017a section 585)

When checking for outliers use a methodthat avoids masking This eliminates anymethod based on the mean and varianceWilcox (2017a section 64) summarizesmethods designed for multivariate data(The R functions outpro and outmgv usemethods that perform relatively well)

Be aware that the choice of method canmake a substantial difference Forexample non-significant results canbecome significant when switching froma method based on the mean to a 20

trimmed mean or median The reversecan happen where methods based onmeans are significant but robust methodsare not In this latter situation the reasonmight be that confidence intervals basedon means are highly inaccurateResolving whether this is the case isdifficult at best based on currenttechnology Consequently it is prudentto consider the robust methods outlinedin this paper

When dealing with regression or measuresof association use modern methods forchecking on the impact of outliers Whenusing regression estimators dealing withoutliers among the independent variablesis straightforward via the R functionsmentioned here set the argument xout =TRUE As for outliers among thedependent variable use some robustregression estimator The Theil-Senestimator is relatively good butarguments can be made for using certainextensions and alternative techniquesWhen there are one or two independentvariables and the sample size is not toosmall check the plots returned bysmoothers This can be done with the Rfunctions rplot and lplot Again it can becrucial to check the impact of removingoutlier among the independent variablesby setting the argument xout = TRUEOther possibilities and their relativemerits are summarized in Wilcox(2017a)

If the goal is to compare dependent groupsbased on difference scores and the sample sizesare very small say 10 the following optionsperform relatively well over a broad range ofsituations in terms of controlling the Type Ierror probability and providing accurate con-fidence intervals

Use the sign test via the R function signt Ifthe sample sizes are 10 and the goal isto ensure that the Type I error probabilityis less than the nominal level set theargument SD = TRUE Otherwise thedefault method performs reasonablywell For multiple groups use the Rfunction signmcp

Focus on the median of the differencescores using the R command sintv2(x-y)where x and y are R variables containingthe data For multiple groups use the Rfunction sintv2mcp

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.
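Here is the promised sketch. Wilcox's functions are assumed to be loaded, the paired measures are hypothetical, and the calls follow the descriptions above; consult Wilcox (2017a) for the remaining argument defaults:

    x <- c(4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 5.0)  # hypothetical time 1
    y <- c(3.9, 4.8, 4.0, 5.1, 5.0, 4.1, 4.6, 4.3, 5.2, 4.7)  # hypothetical time 2
    signt(x, y, SD = TRUE)  # sign test; SD = TRUE keeps the Type I error below nominal
    sintv2(x - y)           # confidence interval for the median difference score
    trimcipb(x - y)         # percentile bootstrap CI for the 20% trimmed mean
    lband(x, y)             # compare the quantiles of the marginal distributions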

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤10, here is a list of some methods that perform relatively well in terms of Type I errors (a sketch follows the list):

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
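As before, a minimal sketch with hypothetical groups g1 and g2, assuming Wilcox's functions are loaded:

    g1 <- c(2.1, 3.4, 2.8, 4.0, 3.1, 2.5, 3.7, 2.9)  # hypothetical group 1
    g2 <- c(3.8, 4.5, 3.9, 5.2, 4.1, 4.8, 3.6, 4.4)  # hypothetical group 2
    medpb2(g1, g2)   # medians via a percentile bootstrap
    sband(g1, g2)    # shift function comparing the quantiles
    cidv2(g1, g2)    # P(X < Y), with X from group 1 and Y from group 2
    trimpb2(g1, g2)  # 20% trimmed means via a percentile bootstrap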

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
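These adjustments are part of base R. For instance, given five hypothetical unadjusted p-values:

    pvals <- c(0.001, 0.012, 0.030, 0.041, 0.20)  # hypothetical p-values
    p.adjust(pvals, method = "hochberg")  # Hochberg's (1988) procedure
    p.adjust(pvals, method = "hommel")    # Hommel's (1988) procedure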

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness. British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.

Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistical Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 9, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computations, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: ProEd Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165


Grayson D (2004) Some myths and leg-ends in quantitative psychology Understand-ing Statistics 3 101ndash134 doi 101207s15328031us0302_3

Hampel F R Ronchetti E M Rousseeuw P Jamp Stahel W A (1986) Robust statistics NewYork Wiley

Hand A (1998) A History of mathematical statis-tics from 1750 to 1930 New York Wiley

A Guide to RobustStatisticalMethods

84228

Supplement 82 Current Protocols in Neuroscience

Harrell F E amp Davis C E (1982) A newdistribution-free quantile estimator Biometrika69 635ndash640 doi 101093biomet693635

Heritier S Cantoni E Copt S amp Victoria-FeserM-P (2009) Robust methods in biostatisticsNew York Wiley

Hettmansperger T P amp McKean J W (2011)Robust nonparametric statistical methods 2nded Boca Raton FL CRC Press

Hettmansperger T P amp Sheather S J (1986)Confidence interval based on interpolated orderstatistics Statistical Probability Letters 4 75ndash79 doi 1010160167-7152(86)90021-0

Hochberg Y (1988) A sharper Bonferroni pro-cedure for multiple tests of significanceBiometrika 75 800ndash802 doi 101093biomet754800

Hommel G (1988) A stagewise rejective multi-ple test procedure based on a modified Bon-ferroni test Biometrika 75 383ndash386 doi101093biomet752383

Houston S M Lebel C Katzir T Manis FR Kan E Rodriguez G R amp Sowell ER (2014) Reading skill and structural braindevelopment Neuroreport 25 347ndash352 doi101097WNR0000000000000121

Huber P J amp Ronchetti E (2009) Robust statis-tics 2nd ed New York Wiley

Keselman H J Othman A amp Wilcox RR (2016) Generalized linear model analy-ses for treatment group equality when dataare non-normal Journal of Modern and Ap-plied Statistical Methods 15 32ndash61 doi1022237jmasm1462075380

Mancini F (2016) ANA_mancini14_datazipfigshare Retreived from httpsfigsharecomarticlesANA_mancini14_data_zip3427766

Mancini F Bauleo A Cole J Lui F PorroC A Haggard P amp Iannetti G D (2014)Whole-body mapping of spatial acuity for painand touch Annals of Neurology 75 917ndash924doi 101002ana24179

Maronna R A Martin D R amp Yohai V J(2006) Robust statistics Theory and methodsNew York Wiley

Montgomery D C amp Peck E A (1992) Intro-duction to linear regression analysis New YorkWiley

Newcombe R G (2006) Confidence intervals foran effect size measure based on the Mann-Whitney statistic Part 1 General issues and tail-area-based methods Statistics in Medicine 25543ndash557 doi 101002sim2323

Nieuwenhuis S Forstmann B U amp Wagenmak-ers E-J (2011) Erroneous analyses of inter-actions in neuroscience a problem of signifi-cance Nature Neuroscience 9 1105ndash1107 doi101038nn2886

Ozdemir A F Wilcox R R amp Yildiztepe E(2013) Comparing measures of location Somesmall-sample results when distributions differin skewness and kurtosis under heterogene-ity of variances Communications in Statistics-

Simulation and Computations 42 407ndash424doi 101080036109182011636163

Pernet C R Wilcox R amp Rousselet G A(2012) Robust correlation analyses False pos-itive and power validation using a new opensource matlab toolbox Frontiers in Psychology3 606 doi 103389fpsyg201200606

Pernet C R Latinus M Nichols T E amp Rous-selet G A (2015) Cluster-based computa-tional methods for mass univariate analyses ofevent-related brain potentialsfields A simula-tion study Journal of Neuroscience Methods250 85ndash93 doi 101016jjneumeth201408003

Rasmussen J L (1989) Data transformation TypeI error rate and power British Journal of Math-ematical and Statistical Psychology 42 203ndash211 doi 101111j2044-83171989tb00910x

Romano J P (1990) On the behavior ofrandomization tests without a group invari-ance assumption Journal of the AmericanStatistical Association 85 686ndash692 doi10108001621459199010474928

Rousseeuw P J amp Leroy A M (1987) Ro-bust regression amp outlier detection New YorkWiley

Rousselet G A Foxe J J amp Bolam J P (2016)A few simple steps to improve the descrip-tion of group results in neuroscience EuropeanJournal of Neuroscience 44 2647ndash2651 doi101111ejn13400

Rousselet G A amp Pernet C R (2012) Improvingstandards in brain-behavior correlation analysesFrontiers in Human Neuroscience 6 119 doi103389fnhum201200119

Rousselet G A Pernet C R amp WilcoxR R (2017) Beyond differences in meansRobust graphical methods to compare twogroups in neuroscience European Jour-nal of Neuroscience 46 1738ndash1748 doi101111ejn1361013610

Ruscio J (2008) A probability-based measure ofeffect size Robustness to base rates and otherfactors Psychological Methods 13 19ndash30 doi1010371082-989X13119

Ruscio J amp Mullen T (2012) Confidenceintervals for the probability of superiority ef-fect size measure and the area under a re-ceiver operating characteristic curve Multivari-ate Behavioral Research 47 201ndash223 doi101080002731712012658329

Schilling M amp Doi J (2014) A cover-age probability approach to finding an op-timal binomial confidence procedure Ameri-can Statistician 68 133ndash145 doi 101080000313052014899274

Staudte R G amp Sheather S J (1990) Robustestimation and testing New York Wiley

Talebi V amp Baker C L Jr (2016) Categoricallydistinct types of receptive fields in early visualcortex Journal of Neurophysiology 115 2556ndash2576 doi 101152jn006592015

Tukey J W (1960) A survey of sampling from con-taminated normal distributions In I Olkin (Ed)

BehavioralNeuroscience

84229

Current Protocols in Neuroscience Supplement 82

Contributions to Probability and Statistics Es-says in honor of Harold Hotelling (pp 448ndash485) Stanford CA Stanford University Press

Tukey J W amp McLaughlin D H (1963) Lessvulnerable confidence and significance proce-dures for location based on a single sampleTrimmingWinsorization 1 Sankhya The In-dian Journal of Statistics 25 331ndash352

Wagenmakers E J Wetzels R Borsboom Dvan der Maas H L amp Kievit R A (2012) Anagenda for purely confirmatory research Per-spectives in Psychological Science 7 632ndash638doi 1011771745691612463078

Weissgerber T L Milic N M Winham SJ amp Garovic V D (2015) Beyond bar andline graphs Time for a new data presentationparadigm PLOS Biology 13 e1002128 doi101371journalpbio1002128

Welch B L (1938) The significance of the differ-ence between two means when the populationvariances are unequal Biometrika 29 350ndash362doi 101093biomet293-4350

Wiederhold J L amp Bryan B R (2001) Grayoral reading test 4th ed Austin TX ProEdPublishing

Wilcox R R (2009) Comparing Pearson cor-relations Dealing with heteroscedasticity andnon-normality Communications in Statistics-Simulation and Computation 38 2220ndash2234doi 10108003610910903289151

Wilcox R R (2012) Comparing two inde-pendent groups via a quantile generaliza-tion of the WilcoxonndashMannndashWhitney testJournal of Modern and Applied StatisticalMethods 11 296ndash302 doi 1022237jmasm1351742460

Wilcox R R (2017a) Introduction to robust es-timation and hypothesis testing (4th ed) SanDiego Academic Press

Wilcox R R (2017b) Understanding and applyingbasic statistical methods using R New YorkWiley

Wilcox R R (2017c) Modern statistics for thesocial and behavioral sciences A practical in-troduction (2nd ed) New York CRC Press

Wilcox R R (2018) Robust regression An in-ferential method for determining which inde-pendent variables are most important Jour-nal of Applied Statistics 45 100ndash111 doi1010800266476320161268105

Wilcox R R (in press) An inferential method fordetermining which of two independent variablesis most important when there is curvature Jour-nal of Modern and Applied Statistical Methods

Wilcox R R amp Rousselet G A (2017)A guide to robust statistical methods Il-lustrations using R figshare Retrieved fromhttpsfigsharecomarticlesCurrent_Protocols_pdf5047591

Winkler A M Ridgwaym G R Webster MA Smith S M amp Nichols T M (2014)Permutation inference for the general lin-ear model NeuroImage 92 381ndash397 doi101016jneuroimage201401060

Woodcock R W McGrew K S amp MatherN (2001) Woodcock Johnson-III tests ofachievement Rolling Meadows IL RiversidePublishing

Yuen K K (1974) The two sample trimmed t forunequal population variances Biometrika 61165ndash170 doi 101093biomet611165

A Guide to RobustStatisticalMethods

84230

Supplement 82 Current Protocols in Neuroscience

Page 19: A Guide to Robust Statistical Methods in Neuroscience ... · A Guide to Robust Statistical Methods in UNIT 8.42 Neuroscience Rand R. Wilcox1 and Guillaume A. Rousselet2 1Deptartment

used to conclude that the associations themselves differ. A common example would be when an association is estimated between each of two brain measurements and a behavioral outcome.

There are many methods for estimating which independent variable is more important, many of which are known to be unsatisfactory (e.g., Wilcox, 2017c, section 6.13). Stepwise regression is among the unsatisfactory techniques, for reasons summarized by Montgomery and Peck (1992) in section 7.2.3, as well as by Derksen and Keselman (1992). Regardless of their relative merits, a practical limitation is that they do not reflect the strength of the empirical evidence that the most important independent variable has been chosen. One strategy is to test H0: ρ1 = ρ2, where now ρj (j = 1, 2) is the correlation between Y and Xj. This can be done with the R function TWOpov. When dealing with robust correlations, use the function twoDcorR.
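As a minimal sketch of how this might look in R (assuming Wilcox's functions have been loaded, e.g., by sourcing the Rallfun file accompanying Wilcox, 2017a, and assuming a hypothetical data frame dat with brain measures X1 and X2 and behavioral outcome Y; the exact call patterns should be verified against that documentation):

    # Two dependent, overlapping correlations: cor(Y, X1) versus cor(Y, X2)
    x <- cbind(dat$X1, dat$X2)  # n-by-2 matrix of independent variables
    TWOpov(x, dat$Y)            # test H0: rho1 = rho2 for Pearson's correlation
    twoDcorR(x, dat$Y)          # analogous test for a robust correlation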

However, a criticism of this approach is that it does not take into account the nature of the association when both independent variables are included in the model. This is a concern because the strength of the association between Y and X1 can depend on whether X2 is included in the model, as illustrated in Section 5.4. There is now a robust method for testing the hypothesis that there is no difference in the association strength when both X1 and X2 are included in the model (Wilcox, 2017a, 2018). Heteroscedasticity is allowed. If, for example, there are three independent variables, one can test the hypothesis that the strength of the association for the first two independent variables is equal to the strength of the association for the third independent variable. The method can be applied with the R function regIVcom. A modification and extension of the method has been derived when there is curvature (Wilcox, in press), but it is limited to two independent variables.
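Continuing the sketch above (dat remains hypothetical, and the defaults described in Wilcox, 2017a, are assumed):

    # Compare association strengths with both predictors entered simultaneously;
    # the Theil-Sen estimator is used by default and heteroscedasticity is allowed
    x <- cbind(dat$X1, dat$X2)
    regIVcom(x, dat$Y)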

4.8 ANCOVA
The simplest version of the analysis of covariance (ANCOVA) consists of comparing the regression lines associated with two independent groups when there is a single independent variable. The classic method makes several restrictive assumptions: (1) the regression lines are parallel, (2) for each regression line there is homoscedasticity, (3) the variance of the dependent variable is the same for both groups, (4) normality, and (5) a straight regression line provides an adequate approximation of the true association. Violating any of these assumptions is a serious practical concern. Violating two or more of these assumptions makes matters worse. A vast array of more modern methods that deal with violations of all of these assumptions is described in chapter 12 of Wilcox (2017a). These newer techniques can substantially increase power compared to the classic ANCOVA technique, and perhaps more importantly, they can provide a deeper and more accurate understanding of how the groups compare. But the many details go beyond the scope of this unit.

As noted in the introduction, curvature is a more serious concern than is generally recognized. One strategy, as a partial check on the presence of curvature, is to simply plot the regression lines associated with two groups using the R functions lplot2g and rplot2g. When using these functions, as well as related functions, it can be vitally important to check on the impact of removing outliers among the independent variables. This is easily done with the functions mentioned here by setting the argument xout = TRUE. If these plots suggest that curvature might be an issue, consider the R functions ancova and ancdet. This latter function applies method TAP in Wilcox (2017a, section 12.2.4) and can provide more detailed information about where and how two regression lines differ compared to the function ancova. These functions are based on methods that allow heteroscedasticity and non-normality, and they eliminate the classic assumption that the regression lines are parallel. For two independent variables, see Wilcox (2017a, section 12.4). If there is evidence that curvature is not an issue, again there are very effective methods that allow heteroscedasticity as well as non-normality (Wilcox, 2017a, section 12.1).
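A minimal sketch (x1, y1 and x2, y2 are hypothetical covariate and outcome vectors for two groups; whether a given release of these functions accepts the xout argument should be checked against Wilcox, 2017a):

    lplot2g(x1, y1, x2, y2, xout = TRUE)  # smoothed regression lines for both groups,
                                          # outliers among the covariate removed
    ancova(x1, y1, x2, y2, xout = TRUE)   # robust ANCOVA at empirically chosen covariate values
    ancdet(x1, y1, x2, y2, xout = TRUE)   # method TAP: more detailed comparisons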

5. SOME ILLUSTRATIONS
Using data from several studies, this section illustrates modern methods and how they contrast. Extant results suggest that robust methods have a relatively high likelihood of maximizing power, but as previously stressed, no single method dominates in terms of maximizing power. Another goal in this section is to underscore the suggestion that multiple perspectives can be important. More complete descriptions of the results, as well as the R code that was used, are available on figshare (Wilcox & Rousselet, 2017). The figshare reproducibility package also contains a more systematic assessment of Type I error and power in the one-sample case (see the file power_onesample.pdf).


Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain
The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.
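A minimal sketch of these three analyses (x and y are hypothetical vectors of paired thresholds; the functions trimci and sintv2 are assumed to be loaded from Wilcox's Rallfun file):

    t.test(x, y, paired = TRUE)  # Student's t-test on the difference scores
    trimci(x - y, tr = 0.2)      # CI and p-value for the 20% trimmed mean of the differences
    sintv2(x - y)                # CI and p-value for the median of the differences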

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex
The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures, there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.
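A minimal sketch (x and y are hypothetical vectors of latencies for two groups; qcomhd uses the Harrell-Davis estimator with a percentile bootstrap, and its default set of quantiles may differ across releases):

    qcomhd(x, y, q = seq(0.1, 0.9, by = 0.1))  # compare all nine deciles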

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly, except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles, excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.


Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury
The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group that received a craniotomy, but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test, followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Here, the day 1 group was compared again to the sham group based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.

Figure 8.42.8 Box plots of the number of GAD-67-positive cells in the basolateral amygdala for the contralateral (contra) and ipsilateral (ipsi) sides.

The main point here is that the choice between Yuen and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians, can make a practical difference.
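In sketch form (sham and day1 are hypothetical vectors of cell counts; call patterns should be checked against Wilcox, 2017a):

    trimpb2(sham, day1)           # 20% trimmed means, percentile bootstrap
    medpb2(sham, day1)            # medians, percentile bootstrap
    cidv2(sham, day1)             # Cliff's improvement on the WMW test
    yuen(sham, day1)              # Yuen's method for 20% trimmed means
    pb2gen(sham, day1, est = hd)  # medians via the Harrell-Davis estimator
    wmwpb(sham, day1)             # the median of all pairwise differences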

In summary, when comparing the sham group to the day 1 group, all of the methods that perform relatively well when sample sizes are small, described in Section 4.1, reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures based on the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability
The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL). Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

Figure 8.42.9 Nonparametric estimates of the regression line for predicting GORTFL with CSTL.

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)
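A minimal sketch (cstl and gortfl are hypothetical vector names for the variables just described):

    lplot(cstl, gortfl)               # Cleveland's smoother; note the apparent dip
    rplot(cstl, gortfl)               # running-interval smoother based on a 20% trimmed mean
    rplot(cstl, gortfl, xout = TRUE)  # the same, with outliers among the independent variable removed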

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend approximately at GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci), p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
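In sketch form (again with hypothetical vector names):

    lo <- gortfl < 70
    regci(gortfl[lo], cstl[lo])    # Theil-Sen slope and heteroscedastic bootstrap CI, GORTFL < 70
    regci(gortfl[!lo], cstl[!lo])  # the same for GORTFL > 70
    reg2ci(gortfl[lo], cstl[lo], gortfl[!lo], cstl[!lo])  # test that the two slopes are equal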

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice for a. However, for the situation at hand, the half slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE
Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data, where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies; the R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data, where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a). (A brief sketch illustrating some of these guidelines follows.)
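A minimal sketch of some of the plotting and outlier checks just described (x and y are hypothetical vectors; Wilcox's functions assumed loaded):

    akerd(x)                  # kernel density estimate of the distribution of x
    outpro(cbind(x, y))       # projection-type outlier detection that avoids masking
    lplot(x, y, xout = TRUE)  # Cleveland's smoother, outliers among x removed
    rplot(x, y, xout = TRUE)  # running-interval smoother, outliers among x removed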

If the goal is to compare dependent groups based on difference scores, and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals (a brief sketch follows the list):

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.
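The sketch (x and y are hypothetical vectors of paired measurements; call patterns are taken from the descriptions above and should be verified against Wilcox, 2017a):

    signt(x, y, SD = TRUE)  # sign test, with the conservative option for n <= 10
    sintv2(x - y)           # median of the difference scores
    trimcipb(x - y)         # 20% trimmed mean of the difference scores
    lband(x, y)             # compare the quantiles of the marginal distributions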

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors (a brief sketch follows the list):

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second, using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
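The sketch (x and y are hypothetical vectors for the two independent groups):

    medpb2(x, y)   # medians, percentile bootstrap
    sband(x, y)    # confidence band for the shift function
    cidv2(x, y)    # P(X < Y), Cliff's method
    trimpb2(x, y)  # 20% trimmed means, percentile bootstrap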

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS
It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
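For example, in base R:

    pvals <- c(0.001, 0.01, 0.02, 0.04, 0.30)  # p-values from five tests (made-up numbers)
    p.adjust(pvals, method = "hochberg")       # Hochberg (1988) adjustment
    p.adjust(pvals, method = "hommel")         # Hommel (1988) adjustment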

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression, where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS
The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED
Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Bessel F W (1818) Fundamenta Astronomiaepro anno MDCCLV deducta ex observationibusviri incomparabilis James Bradley in speculaastronomica Grenovicensi per annos 1750-1762institutis Konigsberg Friedrich Nicolovius

Bernhardson C (1975) Type I error rates whenmultiple comparison procedures follow a sig-nificant F test of ANOVA Biometrics 31 719ndash724 doi 1023072529724

Boik R J (1987) The Fisher-Pitman permuta-tion test A non-robust alternative to the nor-mal theory F test when variances are het-erogeneous British Journal of Mathematical

and Statistical Psychology 40 26ndash42 doi101111j2044-83171987tb00865x

Bradley J V (1978) Robustness British Jour-nal of Mathematical and Statistical Psychol-ogy 31 144ndash152 doi 101111j2044-83171978tb00581x

Brown M B amp Forsythe A (1974) Thesmall sample behavior of some statistics whichtest the equality of several means Techno-metrics 16 129ndash132 doi 10108000401706197410489158

Brunner E Domhof S amp Langer F (2002) Non-parametric analysis of longitudinal data in fac-torial experiments New York Wiley

Chung E amp Romano J P (2013) Exact andasymptotically robust permutation tests An-nals of Statistics 41 484ndash507 doi 10121413-AOS1090

Cleveland W S (1979) Robust locally weightedregression and smoothing scatterplots Journalof the American Statistical Association 74 829ndash836 doi 10108001621459197910481038

Cliff N (1996) Ordinal methods for behavioraldata analysis Mahwah NJ Erlbaum

Cressie N A C amp Whitford H J (1986) How touse the two sample t-test Biometrical Journal28 131ndash148 doi 101002bimj4710280202

Derksen S amp Keselman H J (1992) Back-ward forward and stepwise automated sub-set selection algorithms Frequency of obtain-ing authentic and noise variables British Jour-nal of Mathematical and Statistical Psychology45 265ndash282 doi 101111j2044-83171992tb00992x

Doksum K A amp Sievers G L (1976) Plot-ting with confidence Graphical comparisons oftwo populations Biometrika 63 421ndash434 doi101093biomet633421

Doksum K A amp Wong C-W (1983) Statisticaltests based on transformed data Journal of theAmerican Statistical Association 78 411ndash417doi 10108001621459198310477986

Duncan G T amp Layard M W (1973) AMonte-Carlo study of asymptotically robusttests for correlation Biometrika 60 551ndash558doi 101093biomet603551

Fagerland M W amp Leiv Sandvik L (2009) TheWilcoxon-Mann-Whitney test under scrutinyStatistics in Medicine 28 1487ndash1497 doi101002sim3561

Field A Miles J amp Field Z (2012) Discoveringstatistics using R Thousand Oaks CA SagePublishing

Grayson D (2004) Some myths and leg-ends in quantitative psychology Understand-ing Statistics 3 101ndash134 doi 101207s15328031us0302_3

Hampel F R Ronchetti E M Rousseeuw P Jamp Stahel W A (1986) Robust statistics NewYork Wiley

Hand A (1998) A History of mathematical statis-tics from 1750 to 1930 New York Wiley


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray oral reading test (4th ed.). Austin, TX: ProEd Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock Johnson-III tests of achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165

A Guide to RobustStatisticalMethods

84230

Supplement 82 Current Protocols in Neuroscience

Page 20: A Guide to Robust Statistical Methods in Neuroscience ... · A Guide to Robust Statistical Methods in UNIT 8.42 Neuroscience Rand R. Wilcox1 and Guillaume A. Rousselet2 1Deptartment

Figure 8.42.6 Data from Mancini et al. (2014). (A) Marginal distributions of thresholds at locations forehead (FH), shoulder (S), forearm (FA), hand (H), back (B), and thigh (T). Individual participants (n = 10) are shown as colored disks and lines. The medians across participants are shown in black. (B) Distributions of all pairwise differences between the conditions shown in A. In each stripchart (1D scatterplot), the black horizontal line marks the median of the differences.

5.1 Spatial Acuity for Pain

The first illustration stems from Mancini et al. (2014), who report results aimed at providing a whole-body mapping of spatial acuity for pain (also see Mancini, 2016). Here the focus is on their second experiment. Briefly, spatial acuity was assessed by measuring 2-point discrimination thresholds for both pain and touch in 11 body territories. One goal was to compare touch measures taken at different body parts: forehead, shoulder, forearm, hand, back, and thigh. Plots of the data are shown in Figure 8.42.6A for the six body parts. The sample size is n = 10.

Their analyses were based on the ANOVA F test, followed by paired t-tests when the ANOVA F test was significant. Their significant results indicate that the distributions differ, but because the ANOVA F test is not a robust method when comparing means, there is some doubt about the nature of the differences. So one goal is to determine in which situations robust methods give similar results. And for the non-significant results, there is the issue of whether an important difference was missed due to using the ANOVA F test and Student's t-test.

First, we describe a situation where robust methods based on a median and a 20% trimmed mean give reasonably similar results. Comparing foot and thigh pain measures based on Student's t-test, the 0.95 confidence interval is [−0.112, 0.894] and the p-value is 0.112. For a 20% trimmed mean, the 0.95 confidence interval is [−0.032, 0.778] and the p-value is 0.096. As for the median, the corresponding results were [−0.085, 0.75] and 0.062.
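To make these computations concrete, here is a minimal sketch in R. It assumes Wilcox's functions have been sourced (e.g., the Rallfun file that accompanies Wilcox, 2017a) and that foot and thigh are hypothetical vectors holding the paired measures:

    d <- foot - thigh     # difference scores for the paired design
    trimci(d, tr = 0)     # no trimming: the usual paired t-test CI for the mean
    trimci(d, tr = 0.2)   # 0.95 CI and p-value based on a 20% trimmed mean
    sintv2(d)             # 0.95 CI and p-value for the median

Setting tr = 0 in trimci reproduces the Student's t results, which makes it easy to check how much the conclusions depend on the amount of trimming.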

Next, all pairwise comparisons based on touch were performed for the following body parts: forehead, shoulder, forearm, hand, back, and thigh. Figure 8.42.6B shows a plot of the difference scores for each pair of body parts. The probability of one or more Type I errors was controlled using an improvement on the Bonferroni method derived by Hochberg (1988). The simple strategy of using paired t-tests if the ANOVA F rejects does not control the probability of one or more Type I errors. If paired t-tests are used without controlling the probability of one or more Type I errors, as done by Mancini et al. (2014), 14 of the 15 hypotheses are rejected. If the probability of one or more Type I errors is controlled using Hochberg's method, the following results were obtained. Comparing means via the R function wmcp (and the argument tr = 0), 10 of the 15 hypotheses were rejected. Using medians via the R function dmedpb, 11 were rejected. As for the 20% trimmed mean, using the R function wmcppb, now 13 are rejected, illustrating the point made earlier that the choice of method can make a practical difference.
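Hochberg's method itself requires no special software: given the unadjusted p-values from the paired comparisons, the base R function p.adjust applies the correction. A brief sketch, where pvals is a hypothetical vector of unadjusted p-values:

    pvals <- c(0.001, 0.004, 0.012, 0.020, 0.031)  # shortened to five values for illustration
    p.adjust(pvals, method = "hochberg")           # Hochberg-adjusted p-values
    p.adjust(pvals, method = "bonferroni")         # for comparison; more conservative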

It is noted that in various situations, using the sign test, the estimate of P(X < Y) was equal to one, which provides a useful perspective beyond using a mean or median.

5.2 Receptive Fields in Early Visual Cortex

The next illustrations are based on data analyzed by Talebi and Baker (2016) and presented in their Figure 9. The goal was to estimate visual receptive field (RF) models of neurons from the cat visual cortex using natural image stimuli. The authors provided a rich quantification of the neurons' responses and demonstrated the existence of three functionally distinct categories of simple cells. The total sample size is 212.

There are three measures: latency, duration, and a direction selectivity index (dsi). For each of these measures there are three independent categories: nonoriented (nonOri) cells (n = 101), expansive oriented (expOri) cells (n = 48), and compressive oriented (compOri) cells (n = 63).

First, focus on latency. Talebi and Baker (2016) used Student's t-test to compare means. Comparing the means for nonOri and expOri, no significant difference is found. But an issue is whether Student's t-test might be missing an important difference. The plot of the distributions shown in the top row of Figure 8.42.7A, which was created with the R function g5plot, provides a partial check on this possibility. As is evident, the distributions for nonOri and expOri are very similar, suggesting that no method will yield a significant result. Using error bars is less convincing because important differences might exist when focusing instead on the median, 20% trimmed mean, or some other aspect of the distributions.

As noted in Section 4.2, comparing the deciles can provide a more detailed understanding of how groups differ. The second row of Figure 8.42.7 shows the estimated difference between the deciles for the nonOri group versus the expOri group. The vertical dashed line indicates the median. Also shown are confidence intervals (computed via the R function qcomhd) having approximately simultaneous probability coverage equal to 0.95. That is, the probability of one or more Type I errors is 0.05. In this particular case, differences between the deciles are more pronounced as we move from the lower to the upper deciles, but again no significant differences are found.
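A minimal sketch of such a decile comparison, assuming Wilcox's functions have been sourced and that lat.nonOri and lat.expOri are hypothetical vectors containing the latencies for the two groups:

    # Compare the nine deciles of two independent groups using the
    # Harrell-Davis estimator and a percentile bootstrap
    qcomhd(lat.nonOri, lat.expOri, q = seq(0.1, 0.9, by = 0.1))

By default qcomhd compares a smaller set of quantiles; the q argument requests all nine deciles.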

The third row shows the difference between the deciles for nonOri versus compOri. Again, the magnitude of the differences becomes more pronounced moving from low to high measures. Now all of the deciles differ significantly except the 0.1 quantiles.

Next, consider durations in column B of Figure 8.42.7. Comparing nonOri to expOri using means, 20% trimmed means, and medians, the corresponding p-values are 0.001, 0.005, and 0.009. Even when controlling the probability of one or more Type I errors using Hochberg's method, all three reject at the 0.01 level. So a method based on means rejects at the 0.01 level, but this merely indicates that the distributions differ in some manner. To provide an indication that the groups differ in terms of some measure of central tendency, using 20% trimmed means and medians is more satisfactory. The plot in row 2 of Figure 8.42.7B confirms an overall shift between the two distributions and suggests a more specific pattern, with increasing differences in the deciles beyond the median.

Comparing nonOri to compOri, significant results were again obtained using means, 20% trimmed means, and medians; the largest p-value is p = 0.005. In contrast, qcomhd indicates a significant difference for all of the deciles excluding the 0.1 quantile. As can be seen from the last row in Figure 8.42.7B, once more the magnitude of the differences between the deciles increases as we move from the lower to the upper deciles. Again, this function provides a more detailed understanding of where and how the distributions differ significantly.

Figure 8.42.7 Response latencies (A) and durations (B) from cells recorded by Talebi and Baker (2016). The top row shows estimated distributions of latency and duration measures for three categories: nonOri, expOri, and compOri. The middle and bottom rows show shift functions based on comparisons between the deciles of two groups. The deciles of one group are on the x-axis; the difference between the deciles of the two groups is on the y-axis. The black dotted line is the difference between deciles, plotted as a function of the deciles in one group. The grey dotted lines mark the 95% bootstrap confidence interval, which is also highlighted by the grey shaded area. Middle row: differences between deciles for nonOri versus expOri, plotted as a function of the deciles for nonOri. Bottom row: differences between deciles for nonOri versus compOri, plotted as a function of the deciles for nonOri.

5.3 Mild Traumatic Brain Injury

The illustrations in this section stem from a study dealing with mild traumatic brain injury (Almeida-Suhett et al., 2014). Briefly, 5- to 6-week-old male Sprague Dawley rats received a mild controlled cortical impact (CCI) injury. The dependent variable used here is the stereologically estimated total number of GAD-67-positive cells in the basolateral amygdala (BLA). Measures 1 and 7 days after surgery were compared to the sham-treated control group that received a craniotomy but no CCI injury. A portion of their analyses focused on the ipsilateral sides of the BLA. Box plots of the data are shown in Figure 8.42.8. The sample sizes are 13, 9, and 8, respectively.

Almeida-Suhett et al. (2014) compared means using an ANOVA F test followed by a Bonferroni post hoc test. Comparing both day 1 and day 7 measures to the sham group based on Student's t-test, the p-values are 0.0375 and <0.001, respectively. So if the Bonferroni method is used, the day 1 group does not differ significantly from the sham group when testing at the 0.05 level. However, using Hochberg's improvement on the Bonferroni method, now the reverse decision is made.

Here, the day 1 group was compared again to the sham group based on a percentile bootstrap method for comparing both 20% trimmed means and medians, as well as Cliff's improvement on the WMW test. The corresponding p-values are 0.024, 0.086, and 0.040. If 20% trimmed means are compared instead with Yuen's method, the p-value is p = 0.079, but due to the relatively small sample sizes, a percentile bootstrap would be expected to provide more accurate control over the Type I error probability.

Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala. For each side, measures for the sham, day 1, and day 7 conditions are shown; the y-axis is the number of GAD-67-positive cells in the basolateral amygdala.

The main point here is that the choice between Yuen and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than the individual medians can make a practical difference.
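A sketch of these comparisons, again assuming Wilcox's functions have been sourced, with sham and day1 as hypothetical vectors for the two independent groups:

    trimpb2(sham, day1)           # percentile bootstrap, 20% trimmed means
    medpb2(sham, day1)            # percentile bootstrap, usual sample medians
    cidv2(sham, day1)             # Cliff's improvement on the WMW test
    pb2gen(sham, day1, est = hd)  # medians via the Harrell-Davis estimator
    wmwpb(sham, day1)             # median of all pairwise differences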

In summary, when comparing the sham group to the day 1 group, all of the methods described in Section 4.1 that perform relatively well when sample sizes are small reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures for the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL).

Figure 8.42.9 Nonparametric estimates of the regression lines for (A) predicting CSTL with age, (B) predicting GORTFL with age, (C) predicting GORTFL with CSTL, and (D) predicting CSTL with GORTFL.

Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederholt & Bryant, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.
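A minimal sketch of these two smoothers, assuming Wilcox's functions have been sourced and that age, CSTL, and GORTFL are hypothetical vectors holding the data:

    lplot(age, CSTL, xlab = "Age", ylab = "CSTL")      # Cleveland's smoother (means)
    lplot(age, GORTFL, xlab = "Age", ylab = "GORTFL")
    rplot(age, CSTL, xlab = "Age", ylab = "CSTL")      # running interval smoother (20% trimmed means)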

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend approximately at GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci), p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
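A hedged sketch of this split analysis (hypothetical variable names; Wilcox's functions are assumed to have been sourced):

    low <- GORTFL < 70                 # split the data at the apparent bend
    regci(GORTFL[low], CSTL[low])      # Theil-Sen slope for GORTFL < 70
    regci(GORTFL[!low], CSTL[!low])    # Theil-Sen slope for GORTFL > 70
    # Test whether the two slopes differ:
    reg2ci(GORTFL[low], CSTL[low], GORTFL[!low], CSTL[!low])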

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strengths of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strengths of the individual associations is 9.85.
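A minimal sketch (hypothetical names; x is a matrix whose columns are the two independent variables):

    x <- cbind(age, CSTL)              # the two independent variables
    regIVcom(x, GORTFL)                # Theil-Sen estimator by default
    regIVcom(x, GORTFL, regfun = ols)  # the same comparison via least squares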

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies; the R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE, as illustrated in the sketch following this list. Other possibilities and their relative merits are summarized in Wilcox (2017a).
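As a brief illustration of the xout argument (a hedged sketch, with x and y as hypothetical vectors and Wilcox's functions sourced):

    outmgv(x)                 # flag outliers while avoiding masking
    regci(x, y)               # Theil-Sen regression, all points retained
    regci(x, y, xout = TRUE)  # same analysis with outliers among x removed
    rplot(x, y, xout = TRUE)  # running interval smoother, leverage points removed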

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals. (A brief sketch of these options follows the list.)

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.
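A minimal sketch of these four options, with x and y as hypothetical vectors of paired measures and Wilcox's functions sourced:

    signt(x, y, SD = TRUE)  # sign test; SD = TRUE for very small samples
    sintv2(x - y)           # median of the difference scores
    trimcipb(x - y)         # 20% trimmed mean of the difference scores
    lband(x, y)             # compare quantiles of the marginal distributions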

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes > 10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors:

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be > 0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect; there are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference; rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., ... Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.

Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence intervals based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics - Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederholt, J. L., & Bryant, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics - Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165

Page 21: A Guide to Robust Statistical Methods in Neuroscience ... · A Guide to Robust Statistical Methods in UNIT 8.42 Neuroscience Rand R. Wilcox1 and Guillaume A. Rousselet2 1Deptartment

and thigh Figure 8426B shows a plot of thedifference scores for each pair of body partsThe probability of one or more Type I er-rors was controlled using an improvement onthe Bonferroni method derived by Hochberg(1988) The simple strategy of using paired t-tests if the ANOVA F rejects does not controlthe probability of one or more Type I errorsIf paired t-tests are used without controllingthe probability of one or more Type I errorsas done by Mancini et al (2014) 14 of the 15hypotheses are rejected If the probability ofone or more Type I errors is controlled usingHochbergrsquos method the following results wereobtained Comparing means via the R functionwmcp (and the argument tr = 0) 10 of the 15hypotheses were rejected Using medians viathe R function dmedpb 11 were rejected Asfor the 20 trimmed mean using the R func-tion wmcppb now 13 are rejected illustrat-ing the point made earlier that the choice ofmethod can make a practical difference

It is noted that in various situations usingthe sign test the estimate of P(X lt Y) wasequal to one which provides a useful perspec-tive beyond using a mean or median

52 Receptive Fields in Early VisualCortex

The next illustrations are based on data an-alyzed by Talebi and Baker (2016) and pre-sented in their Figure 9 The goal was toestimate visual receptive field (RF) models ofneurons from the cat visual cortex using natu-ral image stimuli The authors provided a richquantification of the neuronsrsquo responses anddemonstrated the existence of three function-ally distinct categories of simple cells Thetotal sample size is 212

There are three measures latency durationand a direction selectivity index (dsi) For eachof these measures there are three independentcategories nonoriented (nonOri) cells (n =101) expansive oriented (expOri) cells (n =48) and compressive oriented (compOri) cells(n = 63)

First focus on latency Talebi and Baker(2016) used Studentrsquos t-test to compare meansComparing the means for nonOri and ex-pOri no significant difference is found Butan issue is whether Studentrsquos t-test might bemissing an important difference The plot ofthe distributions shown in the top row ofFigure 8427A which was created with theR function g5plot provides a partial checkon this possibility As is evident the distri-butions for nonOri and expOri are very similarsuggesting that no method will yield a signif-

icant result Using error bars is less convinc-ing because important differences might existwhen focusing instead on the median 20trimmed mean or some other aspect of thedistributions

As noted in Section 42 comparing thedeciles can provide a more detailed under-standing of how groups differ The second rowof Figure 8427 shows the estimated differ-ence between the deciles for the nonOri groupversus the expOri group The vertical dashedline indicates the median Also shown are con-fidence intervals (computed via the R functionqcomhd) having approximately simultaneousprobability coverage equal to 095 That isthe probability of one or more Type I errorsis 005 In this particular case differencesbetween the deciles are more pronouncedas we move from the lower to the upperdeciles but again no significant differences arefound

The third row shows the difference be-tween the deciles for nonOri versus compOriAgain the magnitude of the differences be-comes more pronounced moving from low tohigh measures Now all of the deciles differsignificantly except the 01 quantiles

Next consider durations in column B ofFigure 8427 Comparing nonOri to expOriusing means 20 trimmed means and me-dians the corresponding p-values are 00010005 and 0009 Even when controlling theprobability of one or more Type I errors usingHochbergrsquos method all three reject at the 001level So a method based on means rejects atthe 001 level but this merely indicates that thedistributions differ in some manner To providean indication that the groups differ in termsof some measure of central tendency using20 trimmed means and medians is more sat-isfactory The plot in row 2 of Figure 8427Bconfirms an overall shift between the two dis-tributions and suggests a more specific patternwith increasing differences in the deciles be-yond the median

Comparing nonOri to compOri significantresults were again obtained using means 20trimmed means and medians the largest p-value is p = 0005 In contrast qcomhd in-dicates a significant difference for all of thedeciles excluding the 01 quantile As can beseen from the last row in Figure 8427B oncemore the magnitude of the differences be-tween the deciles increases as we move fromthe lower to the upper deciles Again thisfunction provides a more detailed understand-ing of where and how the distributions differsignificantly

BehavioralNeuroscience

84221

Current Protocols in Neuroscience Supplement 82

000001002003004005006007

0 20 40 60 80 100 120 140Latency (msec)

Den

sity

GroupnonOriexpOricompOri

A

minus40

minus30

minus20

minus10

0

10

20

20 30 40 50 60nonOri latency deciles (msec)

Diff

eren

ce

nonO

ri minus

exp

Ori

minus100

minus80

minus60

minus40

minus20

0

20 30 40 50 60nonOri latency deciles (msec)

Diff

eren

ce

nonO

ri minus

com

pOri

0000

0002

0004

0006

0008

0010

0 50 100 150 200 250 300Duration (msec)

B

minus140minus120minus100minus80minus60minus40minus20

0

25 50 75 100 125nonOri duration deciles (msec)

minus160minus140minus120minus100minus80minus60minus40minus20

0

25 50 75 100 125nonOri duration deciles (msec)

Figure 8427 Response latencies (A) and durations (B) from cells recorded by Talebi andBaker (2016) The top row shows estimated distributions of latency and duration measures forthree categories nonOri expOri and compOri The middle and bottom rows show shift functionsbased on comparisons between the deciles of two groups The deciles of one group are on thex-axis the difference between the deciles of the two groups is on the y-axis The black dotted lineis the difference between deciles plotted as function of the deciles in one group The grey dottedlines mark the 95 bootstrap confidence interval which is also highlighted by the grey shadedarea Middle row Differences between deciles for nonOri versus expOri plotted as a functionof the deciles for nonOri Bottom row Differences between deciles for nonOri versus compOriplotted as a function of the deciles for nonOri

53 Mild Traumatic Brain InjuryThe illustrations in this section stem from a

study dealing with mild traumatic brain injury(Almeida-Suhett et al 2014) Briefly 5- to 6-week-old male Sprague Dawley rats receiveda mild controlled cortical impact (CCI) injuryThe dependent variable used here is the stere-ologically estimated total number of GAD-67-positive cells in the basolateral amygdala(BLA) Measures 1 and 7 days after surgerywere compared to the sham-treated controlgroup that received a craniotomy but no CCIinjury A portion of their analyses focused onthe ipsilateral sides of the BLA Box plots ofthe data are shown in Figure 8428 The sam-ple sizes are 13 9 and 8 respectively

Almeida-Suhett et al (2014) comparedmeans using an ANOVA F test followed byBonferroni post hoc test Comparing both day

1 and day 7 measures to the sham group basedon Studentrsquos t-test the p-values are 00375and lt0001 respectively So if the Bonferronimethod is used the day 1 group does not differsignificantly from the sham group when testingat the 005 level However using Hochbergrsquosimprovement on the Bonferroni method nowthe reverse decision is made

Here the day 1 group was compared againto the sham group based a percentile boot-strap method for comparing both 20 trimmedmeans and medians as well as Cliffrsquos improve-ment on the WMW test The correspondingp-values are 0024 0086 and 0040 If 20trimmed means are compared instead withYuenrsquos method the p-value is p = 0079 butdue to the relatively small sample sizes a per-centile bootstrap would be expected to providemore accurate control over the Type I error

A Guide to RobustStatisticalMethods

84222

Supplement 82 Current Protocols in Neuroscience

Contra Ipsi

Sham Day 1 Day 7 Sham Day 1 Day 7

3000

4000

5000

6000

7000

8000

Conditions

Num

ber

of G

AD

minus67

minuspo

sitiv

e ce

llsin

the

baso

late

ral a

myg

dala

Figure 8428 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateralamygdala

probability The main point here is that thechoice between Yuen and a percentile boot-strap method can make a practical differenceThe box plots suggest that sampling is fromdistributions that are unlikely to generate out-liers in which case a method based on theusual sample median might have relatively lowpower When outliers are rare a way of com-paring medians that might have more poweris to use instead the Harrell-Davis estimatormentioned in Section 42 in conjunction witha percentile bootstrap method Now p = 0049Also testing Equation 8424 with the R func-tion wmwpb p = 0031 So focusing on θD themedian of all pairwise differences rather thanthe individual medians can make a practicaldifference

In summary when comparing the shamgroup to the day 1 group all of the meth-ods that perform relatively well when sam-ple sizes are small described in Section 41reject at the 005 level except the percentilebootstrap method based on the usual samplemedian Taken as a whole the results suggestthat measures for the sham group are typicallyhigher than measures based on day 1 groupAs for the day 7 data now all of the methodsused for the day 1 data have p-values 0002

The same analyses were done using thecontralateral sides of the BLA Now the resultswere consistent with those based on meansnone are significant for day 1 As for the day 7measures both conventional and robust meth-ods indicate significant results

54 Fractional Anisotropy andReading Ability

The next illustrations are based on datadealing with reading skills and structuralbrain development (Houston et al 2014) Thegeneral goal was to investigate maturationalvolume changes in brain reading regions andtheir association with performance on read-ing measures The statistical methods usedwere not made explicit Presumably they wereleast squares regression or Pearsonrsquos correla-tion coupled with the usual Studentrsquos t-testsThe ages of the participants ranged between 6and 16 After eliminating missing values thesample size is n = 53 (It is unclear how Hous-ton and colleagues dealt with missing values)

As previously indicated when dealing withregression it is prudent to begin with asmoother as a partial check on whether as-suming a straight regression line appears tobe reasonable Simultaneously the potentialimpact of outliers needs to be considered Inexploratory studies it is suggested that re-sults based on both Clevelandrsquos smoother andthe running interval smoother be examined(Quantile regression smoothers are another op-tion that can be very useful use the R functionqsm)

Here we begin by using the R function lplot(Clevelandrsquos smoother) to estimate the regres-sion line when the goal is to estimate the meanof the dependent variable for some given valueof an independent variable Figure 8429Ashows the estimated regression line when

BehavioralNeuroscience

84223

Current Protocols in Neuroscience Supplement 82

044

046

048

050

052

054

056

058

6 7 8 9 10 11 12 13 14 15 16Age

CS

TL

A

0

20

40

60

80

100

120

140

6 7 8 9 10 11 12 13 14 15 16Age

GO

RT

FL

B

0

20

40

60

80

100

120

140

044 046 048 050 052 054 056 058CSTL

GO

RT

FL

C

044

046

048

050

052

054

056

058

0 20 40 60 80 100 120 140GORTFL

CS

TL

D

Figure 8429 Nonparametric estimates of the regression line for predicting GORTFL with CSTL

using age to predict left corticospinal measures(CSTL) Figure 8429B shows the estimatewhen a GORT fluency (GORTFL) measure ofreading ability (Wiederhold amp Bryan 2001) istaken to be the dependent variable The shadedareas indicate a 095 confidence region thatcontains the true regression line In these twosituations assuming a straight regression lineseems like a reasonable approximation of thetrue regression line

Figure 8429C shows the estimated regres-sion line for predicting GORTFL with CSTLNote the dip in the regression line One possi-bility is that the dip reflects the true regressionline but another explanation is that it is due tooutliers among the dependent variable (TheR function outmgv indicates that the upperGORTFL values are outliers) Switching tothe running interval smoother (via the R func-tion rplot) which uses a 20 trimmed meanto estimate the typical GORTFL value nowthe regression line appears to be quite straight(For details see the figshare file mentioned atthe beginning of this section)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.4.29D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.4.29C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields


p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
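
A hedged sketch of this segmented analysis, again with hypothetical variable names; the argument order assumed for reg2ci should be checked against Wilcox (2017a):

  below <- gortfl < 70                   # split at the apparent bend
  regci(gortfl[below], cstl[below])      # Theil-Sen slope below the bend
  regci(gortfl[!below], cstl[!below])    # Theil-Sen slope above the bend
  reg2ci(gortfl[below], cstl[below], gortfl[!below], cstl[!below])  # equal slopes?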

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with $X^a$ for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important via the R function regIVcom indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.
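
In code, the comparison just described might look as follows (variable names are hypothetical; regIVcom and the regfun = ols argument are as named in the text):

  pred <- cbind(age, cstl)               # matrix of the two independent variables
  regIVcom(pred, gortfl)                 # Theil-Sen estimator (the default)
  regIVcom(pred, gortfl, regfun = ols)   # least squares version for comparison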

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score

drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies; the R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017). (A brief sketch of several of the calls recommended in these guidelines appears after this list.)

For very small sample sizes, say ≤ 10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical


difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ, or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.4.27. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.11.7, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: for the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).
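
As a rough sketch of several of these guidelines in R (hypothetical data; Wilcox's functions assumed sourced, with argument conventions to be checked against Wilcox, 2017a):

  akerd(g1)                  # kernel density estimate rather than a histogram
  outpro(cbind(x, y))        # projection-based outlier check that avoids masking
  regci(x, y, xout = TRUE)   # robust regression, removing outliers among x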

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤ 10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals.

Use the sign test via the R function signt. If the sample sizes are ≤ 10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command


trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes > 10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.
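
A sketch of these options, where x and y are hypothetical paired measures; the exact calling conventions (e.g., for lband) are assumptions to be checked against Wilcox (2017a):

  signt(x, y)         # sign test; set SD = TRUE when n <= 10
  sintv2(x - y)       # median of the difference scores
  trimcipb(x - y)     # 20% trimmed mean of the difference scores
  lband(x, y)         # compare quantiles of the marginal distributions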

If the goal is to compare two independent groups and the sample sizes are ≤ 10, here is a list of some methods that perform relatively well in terms of Type I errors.

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
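
In code, with g1 and g2 as hypothetical independent groups and Wilcox's functions assumed sourced:

  medpb2(g1, g2)     # medians via a percentile bootstrap
  sband(g1, g2)      # compare quantiles
  cidv2(g1, g2)      # P(observation in group 1 < observation in group 2)
  trimpb2(g1, g2)    # 20% trimmed means via a percentile bootstrap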

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.
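
For instance, the decile comparison might be run as follows; the q argument shown is an assumption about qcomhd's interface, to be checked against Wilcox (2017a):

  qcomhd(g1, g2, q = seq(0.1, 0.9, by = 0.1))   # compare all nine deciles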

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be > 0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
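
For example, with five hypothetical p-values, the Hochberg and Hommel improvements on Bonferroni are available in base R via p.adjust:

  pvals <- c(0.003, 0.012, 0.028, 0.041, 0.049)   # hypothetical p-values
  p.adjust(pvals, method = "hochberg")            # Hochberg (1988) adjustment
  p.adjust(pvals, method = "hommel")              # Hommel (1988) adjustment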

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.
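
To see why Cohen's d is not robust, consider the following toy numbers (hypothetical, for illustration only): a single extreme value inflates the variance of the first group and shrinks d, even though the bulk of the data is unchanged.

  g1 <- c(0.8, 0.9, 1.0, 1.1, 1.2, 7.0)    # one extreme value in group 1
  g2 <- c(-0.2, -0.1, 0.0, 0.1, 0.2, 0.3)
  (mean(g1) - mean(g2)) / sqrt((var(g1) + var(g2)) / 2)   # Cohen's d, dominated by the outlier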

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the


illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederholt, J. L., & Bryant, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. M. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III tests of achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165



Wilcox R R (2017a) Introduction to robust es-timation and hypothesis testing (4th ed) SanDiego Academic Press

Wilcox R R (2017b) Understanding and applyingbasic statistical methods using R New YorkWiley

Wilcox R R (2017c) Modern statistics for thesocial and behavioral sciences A practical in-troduction (2nd ed) New York CRC Press

Wilcox R R (2018) Robust regression An in-ferential method for determining which inde-pendent variables are most important Jour-nal of Applied Statistics 45 100ndash111 doi1010800266476320161268105

Wilcox R R (in press) An inferential method fordetermining which of two independent variablesis most important when there is curvature Jour-nal of Modern and Applied Statistical Methods

Wilcox R R amp Rousselet G A (2017)A guide to robust statistical methods Il-lustrations using R figshare Retrieved fromhttpsfigsharecomarticlesCurrent_Protocols_pdf5047591

Winkler A M Ridgwaym G R Webster MA Smith S M amp Nichols T M (2014)Permutation inference for the general lin-ear model NeuroImage 92 381ndash397 doi101016jneuroimage201401060

Woodcock R W McGrew K S amp MatherN (2001) Woodcock Johnson-III tests ofachievement Rolling Meadows IL RiversidePublishing

Yuen K K (1974) The two sample trimmed t forunequal population variances Biometrika 61165ndash170 doi 101093biomet611165

A Guide to RobustStatisticalMethods

84230

Supplement 82 Current Protocols in Neuroscience

Page 23: A Guide to Robust Statistical Methods in Neuroscience ... · A Guide to Robust Statistical Methods in UNIT 8.42 Neuroscience Rand R. Wilcox1 and Guillaume A. Rousselet2 1Deptartment

[Figure 8.42.8 Box plots for the contralateral (contra) and ipsilateral (ipsi) sides of the basolateral amygdala. x-axis: conditions (Sham, Day 1, Day 7); y-axis: number of GAD-67-positive cells in the basolateral amygdala.]

... probability. The main point here is that the choice between Yuen's method and a percentile bootstrap method can make a practical difference. The box plots suggest that sampling is from distributions that are unlikely to generate outliers, in which case a method based on the usual sample median might have relatively low power. When outliers are rare, a way of comparing medians that might have more power is to use instead the Harrell-Davis estimator mentioned in Section 4.2, in conjunction with a percentile bootstrap method. Now p = 0.049. Also, testing Equation 8.42.4 with the R function wmwpb, p = 0.031. So focusing on θD, the median of all pairwise differences, rather than on the individual medians, can make a practical difference.
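As a concrete sketch of this approach, assuming Wilcox's WRS functions have been sourced (e.g., from the Rallfun file accompanying Wilcox, 2017a), one common way to pair the Harrell-Davis estimator with a percentile bootstrap is the function pb2gen with est = hd; the data vectors below are hypothetical, and the exact call should be checked against Wilcox (2017a):

set.seed(1)
sham <- rnorm(12, mean = 6000, sd = 800)   # hypothetical cell counts
day1 <- rnorm(12, mean = 5000, sd = 800)
pb2gen(sham, day1, est = hd)   # percentile bootstrap with the Harrell-Davis estimate of the median
wmwpb(sham, day1)              # median of all pairwise differences (Equation 8.42.4)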

In summary, when comparing the sham group to the day 1 group, all of the methods that perform relatively well when sample sizes are small, described in Section 4.1, reject at the 0.05 level, except the percentile bootstrap method based on the usual sample median. Taken as a whole, the results suggest that measures for the sham group are typically higher than measures based on the day 1 group. As for the day 7 data, now all of the methods used for the day 1 data have p-values ≤ 0.002.

The same analyses were done using the contralateral sides of the BLA. Now the results were consistent with those based on means: none are significant for day 1. As for the day 7 measures, both conventional and robust methods indicate significant results.

5.4 Fractional Anisotropy and Reading Ability

The next illustrations are based on data dealing with reading skills and structural brain development (Houston et al., 2014). The general goal was to investigate maturational volume changes in brain reading regions and their association with performance on reading measures. The statistical methods used were not made explicit. Presumably they were least squares regression or Pearson's correlation, coupled with the usual Student's t-tests. The ages of the participants ranged between 6 and 16. After eliminating missing values, the sample size is n = 53. (It is unclear how Houston and colleagues dealt with missing values.)

As previously indicated, when dealing with regression it is prudent to begin with a smoother as a partial check on whether assuming a straight regression line appears to be reasonable. Simultaneously, the potential impact of outliers needs to be considered. In exploratory studies, it is suggested that results based on both Cleveland's smoother and the running interval smoother be examined. (Quantile regression smoothers are another option that can be very useful; use the R function qsm.)

Here we begin by using the R function lplot (Cleveland's smoother) to estimate the regression line when the goal is to estimate the mean of the dependent variable for some given value of an independent variable. Figure 8.42.9A shows the estimated regression line when using age to predict left corticospinal measures (CSTL). Figure 8.42.9B shows the estimate when a GORT fluency (GORTFL) measure of reading ability (Wiederhold & Bryan, 2001) is taken to be the dependent variable. The shaded areas indicate a 0.95 confidence region that contains the true regression line. In these two situations, assuming a straight regression line seems like a reasonable approximation of the true regression line.

[Figure 8.42.9 Nonparametric estimates of the regression lines: (A) CSTL as a function of age; (B) GORTFL as a function of age; (C) GORTFL as a function of CSTL; (D) CSTL as a function of GORTFL.]
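To make this concrete, here is a minimal sketch of how these smoothers can be run, assuming the WRS functions have been sourced (e.g., from the Rallfun file accompanying Wilcox, 2017a) and that age, cstl, and gortfl are hypothetical vectors holding the measures discussed here:

lplot(age, cstl)                  # Cleveland's smoother: conditional mean of CSTL given age
lplot(age, gortfl)                # same, with the GORT fluency score as the dependent variable
rplot(cstl, gortfl)               # running interval smoother: 20% trimmed mean of GORTFL
rplot(cstl, gortfl, xout = TRUE)  # again, removing outliers among the independent variable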

Figure 8.42.9C shows the estimated regression line for predicting GORTFL with CSTL. Note the dip in the regression line. One possibility is that the dip reflects the true regression line, but another explanation is that it is due to outliers among the dependent variable. (The R function outmgv indicates that the upper GORTFL values are outliers.) Switching to the running interval smoother (via the R function rplot), which uses a 20% trimmed mean to estimate the typical GORTFL value, now the regression line appears to be quite straight. (For details, see the figshare file mentioned at the beginning of this section.)

The presence of curvature can depend on which variable is taken to be the independent variable. Figure 8.42.9D illustrates this point by using CSTL as the dependent variable and GORTFL as the independent variable, in contrast to Figure 8.42.9C. Note how the line increases and then curves down.

Using instead the running-interval smoother, where the goal is to estimate the 20% trimmed mean of the dependent variable, now there appears to be a distinct bend at approximately GORTFL = 70. For GORTFL < 70, the regression line appears to be quite straight, and the slope was found to be significant (p = 0.007) based on the R function regci, which uses the robust Theil-Sen estimator by default. (It is designed to estimate the median of the dependent variable.) For GORTFL > 70, again the regression line is reasonably straight, but now the slope is negative and does not differ significantly from zero (p = 0.27). Moreover, testing the hypothesis that these two slopes are equal (via the R function reg2ci) yields p = 0.027, which provides additional evidence that there is curvature. So a more robust smoother suggests that there is a positive association up to about GORTFL = 70, after which the association is much weaker and possibly nonexistent.
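The piecewise analysis can be sketched as follows. It again assumes the WRS functions are loaded and uses the hypothetical vectors from above; the split at 70 comes from the smoother, and the argument layout of reg2ci (the two groups' predictors and outcomes in sequence) is an assumption that should be checked against Wilcox (2017a):

below <- gortfl < 70
regci(gortfl[below], cstl[below])    # Theil-Sen slope for GORTFL < 70
regci(gortfl[!below], cstl[!below])  # Theil-Sen slope for GORTFL > 70
# Test the hypothesis that the two slopes are equal (argument layout assumed):
reg2ci(gortfl[below], cstl[below], gortfl[!below], cstl[!below])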

A common strategy for dealing with curvature is to include a quadratic term in the regression model. More generally, one might try to straighten a regression line by replacing the independent variable X with X^a for some suitable choice of a. However, for the situation at hand, the half-slope ratio method (e.g., Wilcox, 2017a, section 11.4) does not support this strategy. It currently seems that smoothers provide a more satisfactory approach to dealing with curvature.
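If one nevertheless wanted to probe such a transformation, a quick informal check is to re-run a smoother with the transformed predictor; the exponent below is purely illustrative:

a <- 0.5               # illustrative choice of exponent
rplot(gortfl^a, cstl)  # does replacing X with X^a straighten the regression line?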

Now consider the goal of determining whether age or CSTL is more important when GORTFL is the dependent variable. A plot of the regression surface when both independent variables are used to predict GORTFL suggests that the regression surface is well approximated by a plane. (Details are in the figshare document.) Testing the hypothesis that the two independent variables are equally important, via the R function regIVcom, indicates that age is more important. (This function uses the Theil-Sen estimator by default.) Moreover, the ratio of the strength of the individual associations is 5.3. Using least squares regression instead (via regIVcom, but with the argument regfun = ols), again age is more important, and now the ratio of the strength of the individual associations is 9.85.

Another measure of reading ability that was used is the Woodcock-Johnson (WJ) basic reading skills composite index (Woodcock, McGrew, & Mather, 2001). Here we consider the extent to which the GORT (raw) rate score is more important than age when predicting the WJ word attack (raw) score. Pearson's correlation for the word attack score and age is 0.68 (p < 0.001). The correlation between the GORT rate score and the word attack score is 0.79 (p < 0.001). Comparing these correlations via the R function twohc4cor, no significant difference is found at the 0.05 level (p = 0.11). But when both of these independent variables are entered into the model, and again the Theil-Sen regression estimator is used, a significant result is obtained (p < 0.001): the GORT rate score is estimated to be more important. In addition, the strength of the association between age and the word attack score is estimated to be close to zero. Using instead least squares regression, p = 0.02, and the correlation between age and the WJ word attack score drops to 0.012. So both methods indicate that the GORT rate score is more important, with the result based on a robust regression estimator providing more compelling evidence that this is the case. This illustrates a point made earlier: the relative importance of the independent variables can depend on which independent variables are included in the model.
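A sketch of these comparisons is given below; it assumes the WRS functions are loaded, uses hypothetical vectors, and assumes that regIVcom takes the candidate independent variables as the columns of a matrix, so the exact call should be verified against Wilcox (2017a):

x <- cbind(age, gort_rate)                 # hypothetical predictors
regIVcom(x, wj_word_attack)                # relative importance, Theil-Sen by default
regIVcom(x, wj_word_attack, regfun = ols)  # same comparison via least squares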

6. A SUGGESTED GUIDE

Keeping in mind that no single method is always best, this section suggests some broad guidelines regarding how to compare groups and study associations. This is followed by more precise details for certain special cases.

Plot the data. Error bars are popular, but they are limited regarding the information they convey, regardless of whether they are based on the standard deviation or an estimate of the standard error. Better are scatterplots, box plots, or violin plots. For small sample sizes, scatterplots should be the default. If the sample sizes are not too small, plots of the distributions can be very useful, but there is no agreed-upon guideline regarding just how large the sample size should be. For the moment, we suggest checking both box plots and plots of the distributions when the sample size is n ≥ 30, with the goal of getting different perspectives on the nature of the distributions. It is suggested that, in general, kernel density estimators should be used rather than histograms, for reasons illustrated in Wilcox (2017b). (The R functions akerd and g5plot use a kernel density estimator; a short sketch of some of the calls named in this list appears after the list.) For discrete data where the number of possible values for the outcome variable is relatively small, also consider a plot of the relative frequencies. The R function splot is one possibility. When comparing two groups, consider the R function splotg2. For more details regarding plots, see Weissgerber et al. (2015) and Rousselet et al. (2017).

For very small sample sizes, say ≤10, consider the methods in Sections 4.2 and 4.3.

Use a method that allows heteroscedasticity. If the homoscedasticity assumption is true, in general little is lost when using a heteroscedastic method. But as the degree of heteroscedasticity increases, at some point methods that allow heteroscedasticity can make a practical difference in terms of both Type I errors and power. Put more broadly, avoid methods that use an incorrect estimate of the standard error when groups differ or when dealing with regression and there is an association. These methods include t-tests and ANOVA F tests on means and the WMW test, as well as conventional methods for making inferences about measures of association and the parameters of the usual linear regression model.

Be aware of the limitations of methods based on means: they have a relatively high risk of poor control over the Type I error probability, as well as poor power. Another possible concern is that, when dealing with skewed distributions, the mean might be an unsatisfactory summary of the typical response. Results based on means are not necessarily inaccurate, but relative to other methods that might be used, there are serious practical concerns that are difficult to address. Importantly, when using means, even a significant result can yield a relatively inaccurate and unrevealing sense of how distributions differ, and a non-significant result cannot be used to conclude that distributions do not differ.

As a useful alternative to comparisons based on means, consider using a shift function or some other method for comparing multiple quantiles. Sometimes these methods can provide a deeper understanding of where and how groups differ that has practical value, as illustrated in Figure 8.42.7. There is even the possibility that they yield significant results when methods based on means, trimmed means, and medians do not. For discrete data where the variables have a limited number of possible values, consider the R function binband (see Wilcox, 2017c, section 12.1.17, or Wilcox, 2017a, section 5.8.5).

When checking for outliers, use a method that avoids masking. This eliminates any method based on the mean and variance. Wilcox (2017a, section 6.4) summarizes methods designed for multivariate data. (The R functions outpro and outmgv use methods that perform relatively well.)

Be aware that the choice of method can make a substantial difference. For example, non-significant results can become significant when switching from a method based on the mean to a 20% trimmed mean or median. The reverse can happen, where methods based on means are significant but robust methods are not. In this latter situation, the reason might be that confidence intervals based on means are highly inaccurate. Resolving whether this is the case is difficult at best based on current technology. Consequently, it is prudent to consider the robust methods outlined in this paper.

When dealing with regression or measures of association, use modern methods for checking on the impact of outliers. When using regression estimators, dealing with outliers among the independent variables is straightforward: via the R functions mentioned here, set the argument xout = TRUE. As for outliers among the dependent variable, use some robust regression estimator. The Theil-Sen estimator is relatively good, but arguments can be made for using certain extensions and alternative techniques. When there are one or two independent variables and the sample size is not too small, check the plots returned by smoothers. This can be done with the R functions rplot and lplot. Again, it can be crucial to check the impact of removing outliers among the independent variables by setting the argument xout = TRUE. Other possibilities and their relative merits are summarized in Wilcox (2017a).
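The sketch below collects some of the calls named in this list, assuming the WRS functions are loaded and that g1, g2, x, and y are hypothetical data vectors:

akerd(g1)                 # kernel density estimate for one group
splotg2(g1, g2)           # relative frequencies for two discrete-valued groups
outpro(cbind(x, y))       # outlier detection that avoids masking
regci(x, y, xout = TRUE)  # robust regression, outliers among x removed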

If the goal is to compare dependent groups based on difference scores and the sample sizes are very small, say ≤10, the following options perform relatively well over a broad range of situations in terms of controlling the Type I error probability and providing accurate confidence intervals (a short sketch of these calls follows the list).

Use the sign test via the R function signt. If the sample sizes are ≤10 and the goal is to ensure that the Type I error probability is less than the nominal level, set the argument SD = TRUE. Otherwise, the default method performs reasonably well. For multiple groups, use the R function signmcp.

Focus on the median of the difference scores using the R command sintv2(x-y), where x and y are R variables containing the data. For multiple groups, use the R function sintv2mcp.

Focus on a 20% trimmed mean of the difference scores using the command trimcipb(x-y). This method works reasonably well with a sample size of six or greater. When dealing with multiple groups, use the R function dtrimpb.

Compare the quantiles of the marginal distributions via the R function lband.
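A minimal sketch of these options, assuming the WRS functions are loaded, that x and y hold the paired measurements, and that the exact calling conventions (e.g., whether signt accepts two vectors) are checked against Wilcox (2017a):

signt(x, y, SD = TRUE)  # sign test; SD = TRUE keeps the Type I error below the nominal level
sintv2(x - y)           # median of the difference scores
trimcipb(x - y)         # 20% trimmed mean of the difference scores
lband(x, y)             # compare the quantiles of the marginal distributions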

Each of the methods listed above provides good control over the Type I error probability. Which one has the most power depends on how the groups differ, which of course is not known. With sample sizes >10, many other options are available (e.g., Wilcox, 2017a, 2017b). Again, different methods provide different perspectives regarding how groups compare.

If the goal is to compare two independent groups and the sample sizes are ≤10, here is a list of some methods that perform relatively well in terms of Type I errors (see the sketch after this list).

Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

Compare the quantiles via the R function sband.

Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.
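The corresponding calls, again assuming the WRS functions are loaded and hypothetical vectors g1 and g2:

medpb2(g1, g2)   # medians via a percentile bootstrap
sband(g1, g2)    # compare the quantiles
cidv2(g1, g2)    # P(observation in group 1 < observation in group 2)
trimpb2(g1, g2)  # 20% trimmed means via a percentile bootstrap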

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be >0.05, but by how much is difficult to determine exactly, due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
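For example, with five p-values the Hochberg and Hommel adjustments are one line of base R each (p.adjust is part of the standard stats package):

pvals <- c(0.01, 0.02, 0.03, 0.04, 0.20)  # hypothetical p-values
p.adjust(pvals, method = "hochberg")
p.adjust(pvals, method = "hommel")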

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.

Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence intervals based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics-Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350

Wiederhold, J. L., & Bryan, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics-Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon-Mann-Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165

A Guide to RobustStatisticalMethods

84230

Supplement 82 Current Protocols in Neuroscience

Page 24: A Guide to Robust Statistical Methods in Neuroscience ... · A Guide to Robust Statistical Methods in UNIT 8.42 Neuroscience Rand R. Wilcox1 and Guillaume A. Rousselet2 1Deptartment

044

046

048

050

052

054

056

058

6 7 8 9 10 11 12 13 14 15 16Age

CS

TL

A

0

20

40

60

80

100

120

140

6 7 8 9 10 11 12 13 14 15 16Age

GO

RT

FL

B

0

20

40

60

80

100

120

140

044 046 048 050 052 054 056 058CSTL

GO

RT

FL

C

044

046

048

050

052

054

056

058

0 20 40 60 80 100 120 140GORTFL

CS

TL

D

Figure 8429 Nonparametric estimates of the regression line for predicting GORTFL with CSTL

using age to predict left corticospinal measures(CSTL) Figure 8429B shows the estimatewhen a GORT fluency (GORTFL) measure ofreading ability (Wiederhold amp Bryan 2001) istaken to be the dependent variable The shadedareas indicate a 095 confidence region thatcontains the true regression line In these twosituations assuming a straight regression lineseems like a reasonable approximation of thetrue regression line

Figure 8429C shows the estimated regres-sion line for predicting GORTFL with CSTLNote the dip in the regression line One possi-bility is that the dip reflects the true regressionline but another explanation is that it is due tooutliers among the dependent variable (TheR function outmgv indicates that the upperGORTFL values are outliers) Switching tothe running interval smoother (via the R func-tion rplot) which uses a 20 trimmed meanto estimate the typical GORTFL value nowthe regression line appears to be quite straight(For details see the figshare file mentioned atthe beginning of this section)

The presence of curvature can depend onwhich variable is taken to be the independentvariable Figure 8429D illustrates this pointby using CSTL as the dependent variable andGORTFL as the independent variable in con-trast to Figure 8429C Note how the line in-creases and then curves down

Using instead the running-intervalsmoother where the goal is to estimatethe 20 trimmed mean of the dependentvariable now there appears to be a distinctbend approximately at GORTFL = 70 ForGORTFL lt 70 the regression line appearsto be quite straight and the slope was foundto be significant (p = 0007) based on theR function regci which uses the robustTheil-Sen estimator by default (It is designedto estimate the median of the dependentvariable) For GORTFL gt 70 again theregression line is reasonably straight butnow the slope is negative and does not differsignificantly from zero (p = 027) More-over testing the hypothesis that these twoslopes are equal (via the R function reg2ci)

A Guide to RobustStatisticalMethods

84224

Supplement 82 Current Protocols in Neuroscience

p = 0027 which provides additional evi-dence that there is curvature So a more robustsmoother suggests that there is a positiveassociation up to about GORTFL = 70 afterwhich the association is much weaker andpossibly nonexistent

A common strategy for dealing with cur-vature is to include a quadratic term in theregression model More generally one mighttry to straighten a regression by replacing theindependent variable X with Xa for some suit-able choice for a However for the situation athand the half slope ratio method (eg Wilcox2017a section 114) does not support this strat-egy It currently seems that smoothers providea more satisfactory approach to dealing withcurvature

Now consider the goal of determiningwhether age or CSTL is more important whenGORTFL is the dependent variable A plot ofthe regression surface when both independentvariables are used to predict GORTFL sug-gests that the regression surface is well approx-imated by a plane (Details are in the figsharedocument) Testing the hypothesis that the twoindependent variables are equally importantvia the R function regIVcom indicates thatage is more important (This function uses theTheil-Sen estimator by default) Moreover theratio of the strength of the individual associ-ations is 53 Using least squares regressioninstead (via regIVcom but with the argumentregfun = ols) again age is more important andnow the ratio of the strength of the individualassociations is 985

Another measure of reading ability thatwas used is the Woodcock-Johnson (WJ) ba-sic reading skills composite index (WoodcockMcGrew amp Mather 2001) Here we considerthe extent the GORT (raw) rate score is moreimportant than age when predicting the WJword attack (raw) score Pearsonrsquos correlationfor the word attack score and age is 068 (plt 0001) The correlation between the GORTrate score and the word attack score is 079(p lt 0001) Comparing these correlations viathe R function twohc4cor no significant dif-ference is found at the 005 level (p = 011)But when both of these independent variablesare entered into the model and again theTheil-Sen regression estimator is used a sig-nificant result is obtained (p lt 0001) theGORT rate score is estimated to be more im-portant In addition the strength of the asso-ciation between age and the word attack scoreis estimated to be close to zero Using insteadleast squares regression p = 002 and the cor-relation between age and WJ word attack score

drops to 0012 So both methods indicate thatthe GORT rate score is more important withthe result based on a robust regression estima-tor providing more compelling evidence thatthis is the case This illustrates a point madeearlier that the relative importance of the inde-pendent variables can depend on which inde-pendent variables are included in the model

6 A SUGGESTED GUIDEKeeping in mind that no single method is

always best this section suggests some broadguidelines regarding how to compare groupsand study associations This is followed bymore precise details for certain special cases

Plot the data Error bars are popular but theyare limited regarding the informationthey convey regardless of whether theyare based on the standard deviation or anestimate of the standard error Better arescatterplots box plots or violin plots Forsmall sample sizes scatterplots shouldbe the default If the sample sizes are nottoo small plots of the distributions canbe very useful but there is no agreedupon guideline regarding just how largethe sample size should be For themoment we suggest checking both boxplots and plots of the distributions whenthe sample size is n 30 with the goal ofgetting different perspectives on thenature of the distributions It is suggestedthat in general kernel density estimatorsshould be used rather than histogramsfor reasons illustrated in Wilcox (2017b)(The R functions akerd and g5plot use akernel density estimator) For discretedata where the number of possiblevalues for the outcome variable isrelatively small also consider a plot ofthe relative frequencies The R functionsplot is one possibility When comparingtwo groups consider the R functionsplotg2 For more details regarding plotssee Weissgerber et al (2015) andRousselet et al (2017)

For very small sample sizes say 10consider the methods in Sections 42 and43

Use a method that allows heteroscedasticityIf the homoscedasticity assumption istrue in general little is lost when using aheteroscedastic method But as thedegree of heteroscedasticity increases atsome point methods that allowheteroscedasticity can make a practical

BehavioralNeuroscience

84225

Current Protocols in Neuroscience Supplement 82

difference in terms of both Type I errorsand power Put more broadly avoidmethods that use an incorrect estimate ofthe standard error when groups differ orwhen dealing with regression and there isan association These methods includet-tests ANOVA F tests on means and theWMW test as well as conventionalmethods for making inferences aboutmeasures of association and theparameters of the usual linear regressionmodel

Be aware of the limitations of methodsbased on means they have a relativelyhigh risk of poor control over the Type Ierror probability as well as poor powerAnother possible concern is that whendealing with skewed distributions themean might be an unsatisfactorysummary of the typical response Resultsbased on means are not necessarilyinaccurate but relative to other methodsthat might be used there are seriouspractical concerns that are difficult toaddress Importantly when using meanseven a significant result can yield arelatively inaccurate and unrevealingsense of how distributions differ and anon-significant result cannot be used toconclude that distributions do not differ

As a useful alternative to comparisons basedon means consider using a shift functionor some other method for comparingmultiple quantiles Sometimes thesemethods can provide a deeperunderstanding of where and how groupsdiffer that has practical value asillustrated in Figure 8427 There is eventhe possibility that they yield significantresults when methods based on meanstrimmed means and medians do not Fordiscrete data where the variables have alimited number of possible valuesconsider the R function binband (seeWilcox 2017c section 12117 orWilcox 2017a section 585)

When checking for outliers use a methodthat avoids masking This eliminates anymethod based on the mean and varianceWilcox (2017a section 64) summarizesmethods designed for multivariate data(The R functions outpro and outmgv usemethods that perform relatively well)

Be aware that the choice of method canmake a substantial difference Forexample non-significant results canbecome significant when switching froma method based on the mean to a 20

trimmed mean or median The reversecan happen where methods based onmeans are significant but robust methodsare not In this latter situation the reasonmight be that confidence intervals basedon means are highly inaccurateResolving whether this is the case isdifficult at best based on currenttechnology Consequently it is prudentto consider the robust methods outlinedin this paper

When dealing with regression or measuresof association use modern methods forchecking on the impact of outliers Whenusing regression estimators dealing withoutliers among the independent variablesis straightforward via the R functionsmentioned here set the argument xout =TRUE As for outliers among thedependent variable use some robustregression estimator The Theil-Senestimator is relatively good butarguments can be made for using certainextensions and alternative techniquesWhen there are one or two independentvariables and the sample size is not toosmall check the plots returned bysmoothers This can be done with the Rfunctions rplot and lplot Again it can becrucial to check the impact of removingoutlier among the independent variablesby setting the argument xout = TRUEOther possibilities and their relativemerits are summarized in Wilcox(2017a)

If the goal is to compare dependent groupsbased on difference scores and the sample sizesare very small say 10 the following optionsperform relatively well over a broad range ofsituations in terms of controlling the Type Ierror probability and providing accurate con-fidence intervals

Use the sign test via the R function signt Ifthe sample sizes are 10 and the goal isto ensure that the Type I error probabilityis less than the nominal level set theargument SD = TRUE Otherwise thedefault method performs reasonablywell For multiple groups use the Rfunction signmcp

Focus on the median of the differencescores using the R command sintv2(x-y)where x and y are R variables containingthe data For multiple groups use the Rfunction sintv2mcp

Focus on a 20 trimmed mean of thedifference scores using the command

A Guide to RobustStatisticalMethods

84226

Supplement 82 Current Protocols in Neuroscience

trimcipb(x-y) This method worksreasonably well with a sample size of sixor greater When dealing with multiplegroups use the R function dtrimpb

Compare the quantiles of the marginaldistributions via the R function lband

Each of the methods listed above providesgood control over the Type I error probabil-ity Which one has the most power depends onhow the groups differ which of course is notknown With sample sizes gt10 many otheroptions are available (eg Wilcox 2017a2017b) Again different methods providedifferent perspectives regarding how groupscompare

If the goal is to compare two independentgroups and the sample sizes are 10 here isa list of some methods that perform relativelywell in terms of Type I errors

Compare the medians using a percentilebootstrap method via the R functionmedpb2 If there are multiple groupsuse the R function medcmp

Compare the quantiles via the R functionsband

Test the hypothesis that a randomlysampled observation from the firstgroup is less than a randomly sampledobservation from the second using theR function cidv2 For multiplegroups use cidmcp

Comparing 20 trimmed means via theR function trimpb2 performsreasonably well but comparingmedians via the R function medpb2 isa bit safer in terms of Type I errorsHowever trimpb2 might have higherpower For multiple groups uselinconpb

Again larger sample sizes provide manymore options particularly when dealing withmore than two groups For example ratherthan comparing quantiles via the R functionsband comparing the deciles using the R func-tion qcomhd might have a power advantageThere are multiple ways of viewing interac-tions in a two-way ANOVA design There areeven alternative methods for comparing de-pendent groups not covered here

7 CONCLUDING REMARKSIt is not being suggested that methods based

on means should be completely abandoned orhave no practical value It is being suggestedthat complete reliance on conventional meth-

ods can result in a superficial misleading andrelatively uninformative understanding of howgroups compare In addition they might pro-vide poor control over the Type I error prob-ability and have relatively low power Similarconcerns plague least squares regression andPearsonrsquos correlation

When a method fails to reject this leavesopen the issue of whether a significant resultwas missed due to the method used From thisperspective multiple tests can be informativeHowever there are two competing issues thatneed to be considered The first is that whentesting multiple hypotheses this can increasethe probability of one or more Type I errorsFrom basic principles if for example five testsare performed at the 005 level and all five hy-potheses are true the expected number of TypeI errors is 025 But the more common focusis on the probability of one or more Type I er-rors rather than the expected number of TypeI errors The probability of one or more TypeI errors will be gt005 but by how much isdifficult to determine exactly due to the de-pendence among the tests that are performedImprovements on the Bonferroni method dealwith this issue (eg Hochberg 1988 Hom-mel 1988) and are readily implemented viathe R function padjust But the more tests thatare performed such adjustments come at thecost of lower power Simultaneously ignor-ing multiple perspectives runs the risk of notachieving a deep understanding of how groupscompare Also as noted in the previous sec-tion if methods based on means are used itis prudent to check the extent robust methodsgive similar results

An important issue not discussed here isrobust measures of effect size When bothconventional and robust methods reject themethod used to characterize how groups dif-fer can be crucial Cohenrsquos d for exampleis not robust simply because it is based onthe means and variances Robust measures ofeffect size are summarized in Wilcox (2017a2017c) Currently efforts are being made toextend and generalize these measures

Another important issue is the implicationof modern insights in terms of the massivenumber of published papers using conven-tional methods These insights do not neces-sarily imply that these results are incorrectThere are conditions where classic methodsperform reasonably well But there is a clearpossibility that misleading results were ob-tained in some situations One of the mainconcerns is whether important differences orassociations have been missed Some of the

BehavioralNeuroscience

84227

Current Protocols in Neuroscience Supplement 82

illustrations in Wilcox (2017a 2017b 2017c)for example reanalyzed data from studiesdealing with regression where the simple actof removing outliers among the independentvariable resulted in a non-significant resultbecoming significant at the 005 level As illus-trated here non-significant results can becomesignificant when using a more modern methodfor comparing measures of central tendencyThere is also the possibility that a few out-liers can result in a large Pearson correlationwhen in fact there is little or no associationamong the bulk of the data (Rousselet amp Per-net 2012) Wilcox (2017c) mentions one un-published study where this occurred So theissue is not whether modern robust methodscan make a practical difference Rather theissue is how often this occurs

Finally there is much more to modernmethods beyond the techniques and issues de-scribed here (Wilcox 2017a 2017b) Includedare additional methods for studying associa-tions as well as substantially better techniquesfor dealing with ANCOVA As mentioned inthe introduction there are introductory text-books that include the basics of modern ad-vances and insights But the difficult task ofmodernizing basic training for the typical neu-roscientist remains

ACKNOWLEDGMENTSThe authors of this article thank the authors

who shared data used in Section 5

LITERATURE CITED
Agresti, A., & Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi: 10.2307/2685469
Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi: 10.1371/journal.pone.0102627
Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi: 10.2307/2529724
Bessel, F. W. (1818). Fundamenta Astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750–1762 institutis. Königsberg: Friedrich Nicolovius.
Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi: 10.1111/j.2044-8317.1987.tb00865.x
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x
Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi: 10.1080/00401706.1974.10489158
Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.
Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi: 10.1214/13-AOS1090
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi: 10.1080/01621459.1979.10481038
Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.
Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi: 10.1002/bimj.4710280202
Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi: 10.1111/j.2044-8317.1992.tb00992.x
Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi: 10.1093/biomet/63.3.421
Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi: 10.1080/01621459.1983.10477986
Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi: 10.1093/biomet/60.3.551
Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi: 10.1002/sim.3561
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.
Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi: 10.1207/s15328031us0302_3
Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.
Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi: 10.1093/biomet/69.3.635
Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.
Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.
Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence interval based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi: 10.1016/0167-7152(86)90021-0
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi: 10.1093/biomet/75.4.800
Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.1093/biomet/75.2.383
Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi: 10.1097/WNR.0000000000000121
Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.
Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi: 10.22237/jmasm/1462075380
Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766
Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi: 10.1002/ana.24179
Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.
Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.
Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi: 10.1002/sim.2323
Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi: 10.1038/nn.2886
Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics–Simulation and Computation, 42, 407–424. doi: 10.1080/03610918.2011.636163
Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi: 10.1016/j.jneumeth.2014.08.003
Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi: 10.3389/fpsyg.2012.00606
Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi: 10.1111/j.2044-8317.1989.tb00910.x
Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi: 10.1080/01621459.1990.10474928
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.
Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi: 10.1111/ejn.13400
Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi: 10.3389/fnhum.2012.00119
Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi: 10.1111/ejn.13610
Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi: 10.1037/1082-989X.13.1.19
Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi: 10.1080/00273171.2012.658329
Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi: 10.1080/00031305.2014.899274
Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.
Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi: 10.1152/jn.00659.2015
Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.
Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, 25, 331–352.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078
Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi: 10.1371/journal.pbio.1002128
Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi: 10.1093/biomet/29.3-4.350
Wiederholt, J. L., & Bryant, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.
Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics–Simulation and Computation, 38, 2220–2234. doi: 10.1080/03610910903289151
Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi: 10.22237/jmasm/1351742460
Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego: Academic Press.
Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.
Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.
Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi: 10.1080/02664763.2016.1268105
Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.
Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591
Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi: 10.1016/j.neuroimage.2014.01.060
Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.
Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi: 10.1093/biomet/61.1.165

Page 25: A Guide to Robust Statistical Methods in Neuroscience ... · A Guide to Robust Statistical Methods in UNIT 8.42 Neuroscience Rand R. Wilcox1 and Guillaume A. Rousselet2 1Deptartment

p = 0027 which provides additional evi-dence that there is curvature So a more robustsmoother suggests that there is a positiveassociation up to about GORTFL = 70 afterwhich the association is much weaker andpossibly nonexistent

A common strategy for dealing with cur-vature is to include a quadratic term in theregression model More generally one mighttry to straighten a regression by replacing theindependent variable X with Xa for some suit-able choice for a However for the situation athand the half slope ratio method (eg Wilcox2017a section 114) does not support this strat-egy It currently seems that smoothers providea more satisfactory approach to dealing withcurvature

Now consider the goal of determiningwhether age or CSTL is more important whenGORTFL is the dependent variable A plot ofthe regression surface when both independentvariables are used to predict GORTFL sug-gests that the regression surface is well approx-imated by a plane (Details are in the figsharedocument) Testing the hypothesis that the twoindependent variables are equally importantvia the R function regIVcom indicates thatage is more important (This function uses theTheil-Sen estimator by default) Moreover theratio of the strength of the individual associ-ations is 53 Using least squares regressioninstead (via regIVcom but with the argumentregfun = ols) again age is more important andnow the ratio of the strength of the individualassociations is 985

Another measure of reading ability thatwas used is the Woodcock-Johnson (WJ) ba-sic reading skills composite index (WoodcockMcGrew amp Mather 2001) Here we considerthe extent the GORT (raw) rate score is moreimportant than age when predicting the WJword attack (raw) score Pearsonrsquos correlationfor the word attack score and age is 068 (plt 0001) The correlation between the GORTrate score and the word attack score is 079(p lt 0001) Comparing these correlations viathe R function twohc4cor no significant dif-ference is found at the 005 level (p = 011)But when both of these independent variablesare entered into the model and again theTheil-Sen regression estimator is used a sig-nificant result is obtained (p lt 0001) theGORT rate score is estimated to be more im-portant In addition the strength of the asso-ciation between age and the word attack scoreis estimated to be close to zero Using insteadleast squares regression p = 002 and the cor-relation between age and WJ word attack score

drops to 0012 So both methods indicate thatthe GORT rate score is more important withthe result based on a robust regression estima-tor providing more compelling evidence thatthis is the case This illustrates a point madeearlier that the relative importance of the inde-pendent variables can depend on which inde-pendent variables are included in the model

6 A SUGGESTED GUIDEKeeping in mind that no single method is

always best this section suggests some broadguidelines regarding how to compare groupsand study associations This is followed bymore precise details for certain special cases

Plot the data Error bars are popular but theyare limited regarding the informationthey convey regardless of whether theyare based on the standard deviation or anestimate of the standard error Better arescatterplots box plots or violin plots Forsmall sample sizes scatterplots shouldbe the default If the sample sizes are nottoo small plots of the distributions canbe very useful but there is no agreedupon guideline regarding just how largethe sample size should be For themoment we suggest checking both boxplots and plots of the distributions whenthe sample size is n 30 with the goal ofgetting different perspectives on thenature of the distributions It is suggestedthat in general kernel density estimatorsshould be used rather than histogramsfor reasons illustrated in Wilcox (2017b)(The R functions akerd and g5plot use akernel density estimator) For discretedata where the number of possiblevalues for the outcome variable isrelatively small also consider a plot ofthe relative frequencies The R functionsplot is one possibility When comparingtwo groups consider the R functionsplotg2 For more details regarding plotssee Weissgerber et al (2015) andRousselet et al (2017)

For very small sample sizes say 10consider the methods in Sections 42 and43

Use a method that allows heteroscedasticityIf the homoscedasticity assumption istrue in general little is lost when using aheteroscedastic method But as thedegree of heteroscedasticity increases atsome point methods that allowheteroscedasticity can make a practical

BehavioralNeuroscience

84225

Current Protocols in Neuroscience Supplement 82

difference in terms of both Type I errorsand power Put more broadly avoidmethods that use an incorrect estimate ofthe standard error when groups differ orwhen dealing with regression and there isan association These methods includet-tests ANOVA F tests on means and theWMW test as well as conventionalmethods for making inferences aboutmeasures of association and theparameters of the usual linear regressionmodel

Be aware of the limitations of methodsbased on means they have a relativelyhigh risk of poor control over the Type Ierror probability as well as poor powerAnother possible concern is that whendealing with skewed distributions themean might be an unsatisfactorysummary of the typical response Resultsbased on means are not necessarilyinaccurate but relative to other methodsthat might be used there are seriouspractical concerns that are difficult toaddress Importantly when using meanseven a significant result can yield arelatively inaccurate and unrevealingsense of how distributions differ and anon-significant result cannot be used toconclude that distributions do not differ

As a useful alternative to comparisons basedon means consider using a shift functionor some other method for comparingmultiple quantiles Sometimes thesemethods can provide a deeperunderstanding of where and how groupsdiffer that has practical value asillustrated in Figure 8427 There is eventhe possibility that they yield significantresults when methods based on meanstrimmed means and medians do not Fordiscrete data where the variables have alimited number of possible valuesconsider the R function binband (seeWilcox 2017c section 12117 orWilcox 2017a section 585)

When checking for outliers use a methodthat avoids masking This eliminates anymethod based on the mean and varianceWilcox (2017a section 64) summarizesmethods designed for multivariate data(The R functions outpro and outmgv usemethods that perform relatively well)

Be aware that the choice of method canmake a substantial difference Forexample non-significant results canbecome significant when switching froma method based on the mean to a 20

trimmed mean or median The reversecan happen where methods based onmeans are significant but robust methodsare not In this latter situation the reasonmight be that confidence intervals basedon means are highly inaccurateResolving whether this is the case isdifficult at best based on currenttechnology Consequently it is prudentto consider the robust methods outlinedin this paper

When dealing with regression or measuresof association use modern methods forchecking on the impact of outliers Whenusing regression estimators dealing withoutliers among the independent variablesis straightforward via the R functionsmentioned here set the argument xout =TRUE As for outliers among thedependent variable use some robustregression estimator The Theil-Senestimator is relatively good butarguments can be made for using certainextensions and alternative techniquesWhen there are one or two independentvariables and the sample size is not toosmall check the plots returned bysmoothers This can be done with the Rfunctions rplot and lplot Again it can becrucial to check the impact of removingoutlier among the independent variablesby setting the argument xout = TRUEOther possibilities and their relativemerits are summarized in Wilcox(2017a)

If the goal is to compare dependent groupsbased on difference scores and the sample sizesare very small say 10 the following optionsperform relatively well over a broad range ofsituations in terms of controlling the Type Ierror probability and providing accurate con-fidence intervals

Use the sign test via the R function signt Ifthe sample sizes are 10 and the goal isto ensure that the Type I error probabilityis less than the nominal level set theargument SD = TRUE Otherwise thedefault method performs reasonablywell For multiple groups use the Rfunction signmcp

Focus on the median of the differencescores using the R command sintv2(x-y)where x and y are R variables containingthe data For multiple groups use the Rfunction sintv2mcp

Focus on a 20 trimmed mean of thedifference scores using the command

A Guide to RobustStatisticalMethods

84226

Supplement 82 Current Protocols in Neuroscience

trimcipb(x-y) This method worksreasonably well with a sample size of sixor greater When dealing with multiplegroups use the R function dtrimpb

Compare the quantiles of the marginaldistributions via the R function lband

Each of the methods listed above providesgood control over the Type I error probabil-ity Which one has the most power depends onhow the groups differ which of course is notknown With sample sizes gt10 many otheroptions are available (eg Wilcox 2017a2017b) Again different methods providedifferent perspectives regarding how groupscompare

If the goal is to compare two independentgroups and the sample sizes are 10 here isa list of some methods that perform relativelywell in terms of Type I errors

Compare the medians using a percentilebootstrap method via the R functionmedpb2 If there are multiple groupsuse the R function medcmp

Compare the quantiles via the R functionsband

Test the hypothesis that a randomlysampled observation from the firstgroup is less than a randomly sampledobservation from the second using theR function cidv2 For multiplegroups use cidmcp

Comparing 20 trimmed means via theR function trimpb2 performsreasonably well but comparingmedians via the R function medpb2 isa bit safer in terms of Type I errorsHowever trimpb2 might have higherpower For multiple groups uselinconpb

Again larger sample sizes provide manymore options particularly when dealing withmore than two groups For example ratherthan comparing quantiles via the R functionsband comparing the deciles using the R func-tion qcomhd might have a power advantageThere are multiple ways of viewing interac-tions in a two-way ANOVA design There areeven alternative methods for comparing de-pendent groups not covered here

7 CONCLUDING REMARKSIt is not being suggested that methods based

on means should be completely abandoned orhave no practical value It is being suggestedthat complete reliance on conventional meth-

ods can result in a superficial misleading andrelatively uninformative understanding of howgroups compare In addition they might pro-vide poor control over the Type I error prob-ability and have relatively low power Similarconcerns plague least squares regression andPearsonrsquos correlation

When a method fails to reject this leavesopen the issue of whether a significant resultwas missed due to the method used From thisperspective multiple tests can be informativeHowever there are two competing issues thatneed to be considered The first is that whentesting multiple hypotheses this can increasethe probability of one or more Type I errorsFrom basic principles if for example five testsare performed at the 005 level and all five hy-potheses are true the expected number of TypeI errors is 025 But the more common focusis on the probability of one or more Type I er-rors rather than the expected number of TypeI errors The probability of one or more TypeI errors will be gt005 but by how much isdifficult to determine exactly due to the de-pendence among the tests that are performedImprovements on the Bonferroni method dealwith this issue (eg Hochberg 1988 Hom-mel 1988) and are readily implemented viathe R function padjust But the more tests thatare performed such adjustments come at thecost of lower power Simultaneously ignor-ing multiple perspectives runs the risk of notachieving a deep understanding of how groupscompare Also as noted in the previous sec-tion if methods based on means are used itis prudent to check the extent robust methodsgive similar results

An important issue not discussed here isrobust measures of effect size When bothconventional and robust methods reject themethod used to characterize how groups dif-fer can be crucial Cohenrsquos d for exampleis not robust simply because it is based onthe means and variances Robust measures ofeffect size are summarized in Wilcox (2017a2017c) Currently efforts are being made toextend and generalize these measures

Another important issue is the implicationof modern insights in terms of the massivenumber of published papers using conven-tional methods These insights do not neces-sarily imply that these results are incorrectThere are conditions where classic methodsperform reasonably well But there is a clearpossibility that misleading results were ob-tained in some situations One of the mainconcerns is whether important differences orassociations have been missed Some of the

BehavioralNeuroscience

84227

Current Protocols in Neuroscience Supplement 82

illustrations in Wilcox (2017a 2017b 2017c)for example reanalyzed data from studiesdealing with regression where the simple actof removing outliers among the independentvariable resulted in a non-significant resultbecoming significant at the 005 level As illus-trated here non-significant results can becomesignificant when using a more modern methodfor comparing measures of central tendencyThere is also the possibility that a few out-liers can result in a large Pearson correlationwhen in fact there is little or no associationamong the bulk of the data (Rousselet amp Per-net 2012) Wilcox (2017c) mentions one un-published study where this occurred So theissue is not whether modern robust methodscan make a practical difference Rather theissue is how often this occurs

Finally there is much more to modernmethods beyond the techniques and issues de-scribed here (Wilcox 2017a 2017b) Includedare additional methods for studying associa-tions as well as substantially better techniquesfor dealing with ANCOVA As mentioned inthe introduction there are introductory text-books that include the basics of modern ad-vances and insights But the difficult task ofmodernizing basic training for the typical neu-roscientist remains

ACKNOWLEDGMENTSThe authors of this article thank the authors

who shared data used in Section 5

LITERATURE CITEDAgresti A amp Coull B A (1998) Approximate

is better than exact for interval estimation ofbinomial proportions American Statistician 52119ndash126 doi 1023072685469

Almeida-Suhett C P Prager E M PidoplichkoV Figueiredo T H Marini A M Li Z Braga M (2014) Reduced GABAergic inhi-bition in the basolateral amygdala and the de-velopment of anxiety-like behaviors after mildtraumatic brain injury PLoS One 9 e102627doi 101371journalpone0102627

Bessel F W (1818) Fundamenta Astronomiaepro anno MDCCLV deducta ex observationibusviri incomparabilis James Bradley in speculaastronomica Grenovicensi per annos 1750-1762institutis Konigsberg Friedrich Nicolovius

Bernhardson C (1975) Type I error rates whenmultiple comparison procedures follow a sig-nificant F test of ANOVA Biometrics 31 719ndash724 doi 1023072529724

Boik R J (1987) The Fisher-Pitman permuta-tion test A non-robust alternative to the nor-mal theory F test when variances are het-erogeneous British Journal of Mathematical

and Statistical Psychology 40 26ndash42 doi101111j2044-83171987tb00865x

Bradley J V (1978) Robustness British Jour-nal of Mathematical and Statistical Psychol-ogy 31 144ndash152 doi 101111j2044-83171978tb00581x

Brown M B amp Forsythe A (1974) Thesmall sample behavior of some statistics whichtest the equality of several means Techno-metrics 16 129ndash132 doi 10108000401706197410489158

Brunner E Domhof S amp Langer F (2002) Non-parametric analysis of longitudinal data in fac-torial experiments New York Wiley

Chung E amp Romano J P (2013) Exact andasymptotically robust permutation tests An-nals of Statistics 41 484ndash507 doi 10121413-AOS1090

Cleveland W S (1979) Robust locally weightedregression and smoothing scatterplots Journalof the American Statistical Association 74 829ndash836 doi 10108001621459197910481038

Cliff N (1996) Ordinal methods for behavioraldata analysis Mahwah NJ Erlbaum

Cressie N A C amp Whitford H J (1986) How touse the two sample t-test Biometrical Journal28 131ndash148 doi 101002bimj4710280202

Derksen S amp Keselman H J (1992) Back-ward forward and stepwise automated sub-set selection algorithms Frequency of obtain-ing authentic and noise variables British Jour-nal of Mathematical and Statistical Psychology45 265ndash282 doi 101111j2044-83171992tb00992x

Doksum K A amp Sievers G L (1976) Plot-ting with confidence Graphical comparisons oftwo populations Biometrika 63 421ndash434 doi101093biomet633421

Doksum K A amp Wong C-W (1983) Statisticaltests based on transformed data Journal of theAmerican Statistical Association 78 411ndash417doi 10108001621459198310477986

Duncan G T amp Layard M W (1973) AMonte-Carlo study of asymptotically robusttests for correlation Biometrika 60 551ndash558doi 101093biomet603551

Fagerland M W amp Leiv Sandvik L (2009) TheWilcoxon-Mann-Whitney test under scrutinyStatistics in Medicine 28 1487ndash1497 doi101002sim3561

Field A Miles J amp Field Z (2012) Discoveringstatistics using R Thousand Oaks CA SagePublishing

Grayson D (2004) Some myths and leg-ends in quantitative psychology Understand-ing Statistics 3 101ndash134 doi 101207s15328031us0302_3

Hampel F R Ronchetti E M Rousseeuw P Jamp Stahel W A (1986) Robust statistics NewYork Wiley

Hand A (1998) A History of mathematical statis-tics from 1750 to 1930 New York Wiley

A Guide to RobustStatisticalMethods

84228

Supplement 82 Current Protocols in Neuroscience

Harrell F E amp Davis C E (1982) A newdistribution-free quantile estimator Biometrika69 635ndash640 doi 101093biomet693635

Heritier S Cantoni E Copt S amp Victoria-FeserM-P (2009) Robust methods in biostatisticsNew York Wiley

Hettmansperger T P amp McKean J W (2011)Robust nonparametric statistical methods 2nded Boca Raton FL CRC Press

Hettmansperger T P amp Sheather S J (1986)Confidence interval based on interpolated orderstatistics Statistical Probability Letters 4 75ndash79 doi 1010160167-7152(86)90021-0

Hochberg Y (1988) A sharper Bonferroni pro-cedure for multiple tests of significanceBiometrika 75 800ndash802 doi 101093biomet754800

Hommel G (1988) A stagewise rejective multi-ple test procedure based on a modified Bon-ferroni test Biometrika 75 383ndash386 doi101093biomet752383

Houston S M Lebel C Katzir T Manis FR Kan E Rodriguez G R amp Sowell ER (2014) Reading skill and structural braindevelopment Neuroreport 25 347ndash352 doi101097WNR0000000000000121

Huber P J amp Ronchetti E (2009) Robust statis-tics 2nd ed New York Wiley

Keselman H J Othman A amp Wilcox RR (2016) Generalized linear model analy-ses for treatment group equality when dataare non-normal Journal of Modern and Ap-plied Statistical Methods 15 32ndash61 doi1022237jmasm1462075380

Mancini F (2016) ANA_mancini14_datazipfigshare Retreived from httpsfigsharecomarticlesANA_mancini14_data_zip3427766

Mancini F Bauleo A Cole J Lui F PorroC A Haggard P amp Iannetti G D (2014)Whole-body mapping of spatial acuity for painand touch Annals of Neurology 75 917ndash924doi 101002ana24179

Maronna R A Martin D R amp Yohai V J(2006) Robust statistics Theory and methodsNew York Wiley

Montgomery D C amp Peck E A (1992) Intro-duction to linear regression analysis New YorkWiley

Newcombe R G (2006) Confidence intervals foran effect size measure based on the Mann-Whitney statistic Part 1 General issues and tail-area-based methods Statistics in Medicine 25543ndash557 doi 101002sim2323

Nieuwenhuis S Forstmann B U amp Wagenmak-ers E-J (2011) Erroneous analyses of inter-actions in neuroscience a problem of signifi-cance Nature Neuroscience 9 1105ndash1107 doi101038nn2886

Ozdemir A F Wilcox R R amp Yildiztepe E(2013) Comparing measures of location Somesmall-sample results when distributions differin skewness and kurtosis under heterogene-ity of variances Communications in Statistics-

Simulation and Computations 42 407ndash424doi 101080036109182011636163

Pernet C R Wilcox R amp Rousselet G A(2012) Robust correlation analyses False pos-itive and power validation using a new opensource matlab toolbox Frontiers in Psychology3 606 doi 103389fpsyg201200606

Pernet C R Latinus M Nichols T E amp Rous-selet G A (2015) Cluster-based computa-tional methods for mass univariate analyses ofevent-related brain potentialsfields A simula-tion study Journal of Neuroscience Methods250 85ndash93 doi 101016jjneumeth201408003

Rasmussen J L (1989) Data transformation TypeI error rate and power British Journal of Math-ematical and Statistical Psychology 42 203ndash211 doi 101111j2044-83171989tb00910x

Romano J P (1990) On the behavior ofrandomization tests without a group invari-ance assumption Journal of the AmericanStatistical Association 85 686ndash692 doi10108001621459199010474928

Rousseeuw P J amp Leroy A M (1987) Ro-bust regression amp outlier detection New YorkWiley

Rousselet G A Foxe J J amp Bolam J P (2016)A few simple steps to improve the descrip-tion of group results in neuroscience EuropeanJournal of Neuroscience 44 2647ndash2651 doi101111ejn13400

Rousselet G A amp Pernet C R (2012) Improvingstandards in brain-behavior correlation analysesFrontiers in Human Neuroscience 6 119 doi103389fnhum201200119

Rousselet G A Pernet C R amp WilcoxR R (2017) Beyond differences in meansRobust graphical methods to compare twogroups in neuroscience European Jour-nal of Neuroscience 46 1738ndash1748 doi101111ejn1361013610

Ruscio J (2008) A probability-based measure ofeffect size Robustness to base rates and otherfactors Psychological Methods 13 19ndash30 doi1010371082-989X13119

Ruscio J amp Mullen T (2012) Confidenceintervals for the probability of superiority ef-fect size measure and the area under a re-ceiver operating characteristic curve Multivari-ate Behavioral Research 47 201ndash223 doi101080002731712012658329

Schilling M amp Doi J (2014) A cover-age probability approach to finding an op-timal binomial confidence procedure Ameri-can Statistician 68 133ndash145 doi 101080000313052014899274

Staudte R G amp Sheather S J (1990) Robustestimation and testing New York Wiley

Talebi V amp Baker C L Jr (2016) Categoricallydistinct types of receptive fields in early visualcortex Journal of Neurophysiology 115 2556ndash2576 doi 101152jn006592015

Tukey J W (1960) A survey of sampling from con-taminated normal distributions In I Olkin (Ed)

BehavioralNeuroscience

84229

Current Protocols in Neuroscience Supplement 82

Contributions to Probability and Statistics Es-says in honor of Harold Hotelling (pp 448ndash485) Stanford CA Stanford University Press

Tukey J W amp McLaughlin D H (1963) Lessvulnerable confidence and significance proce-dures for location based on a single sampleTrimmingWinsorization 1 Sankhya The In-dian Journal of Statistics 25 331ndash352

Wagenmakers E J Wetzels R Borsboom Dvan der Maas H L amp Kievit R A (2012) Anagenda for purely confirmatory research Per-spectives in Psychological Science 7 632ndash638doi 1011771745691612463078

Weissgerber T L Milic N M Winham SJ amp Garovic V D (2015) Beyond bar andline graphs Time for a new data presentationparadigm PLOS Biology 13 e1002128 doi101371journalpbio1002128

Welch B L (1938) The significance of the differ-ence between two means when the populationvariances are unequal Biometrika 29 350ndash362doi 101093biomet293-4350

Wiederhold J L amp Bryan B R (2001) Grayoral reading test 4th ed Austin TX ProEdPublishing

Wilcox R R (2009) Comparing Pearson cor-relations Dealing with heteroscedasticity andnon-normality Communications in Statistics-Simulation and Computation 38 2220ndash2234doi 10108003610910903289151

Wilcox R R (2012) Comparing two inde-pendent groups via a quantile generaliza-tion of the WilcoxonndashMannndashWhitney testJournal of Modern and Applied StatisticalMethods 11 296ndash302 doi 1022237jmasm1351742460

Wilcox R R (2017a) Introduction to robust es-timation and hypothesis testing (4th ed) SanDiego Academic Press

Wilcox R R (2017b) Understanding and applyingbasic statistical methods using R New YorkWiley

Wilcox R R (2017c) Modern statistics for thesocial and behavioral sciences A practical in-troduction (2nd ed) New York CRC Press

Wilcox R R (2018) Robust regression An in-ferential method for determining which inde-pendent variables are most important Jour-nal of Applied Statistics 45 100ndash111 doi1010800266476320161268105

Wilcox R R (in press) An inferential method fordetermining which of two independent variablesis most important when there is curvature Jour-nal of Modern and Applied Statistical Methods

Wilcox R R amp Rousselet G A (2017)A guide to robust statistical methods Il-lustrations using R figshare Retrieved fromhttpsfigsharecomarticlesCurrent_Protocols_pdf5047591

Winkler A M Ridgwaym G R Webster MA Smith S M amp Nichols T M (2014)Permutation inference for the general lin-ear model NeuroImage 92 381ndash397 doi101016jneuroimage201401060

Woodcock R W McGrew K S amp MatherN (2001) Woodcock Johnson-III tests ofachievement Rolling Meadows IL RiversidePublishing

Yuen K K (1974) The two sample trimmed t forunequal population variances Biometrika 61165ndash170 doi 101093biomet611165

A Guide to RobustStatisticalMethods

84230

Supplement 82 Current Protocols in Neuroscience

Page 26: A Guide to Robust Statistical Methods in Neuroscience ... · A Guide to Robust Statistical Methods in UNIT 8.42 Neuroscience Rand R. Wilcox1 and Guillaume A. Rousselet2 1Deptartment

difference in terms of both Type I errorsand power Put more broadly avoidmethods that use an incorrect estimate ofthe standard error when groups differ orwhen dealing with regression and there isan association These methods includet-tests ANOVA F tests on means and theWMW test as well as conventionalmethods for making inferences aboutmeasures of association and theparameters of the usual linear regressionmodel

Be aware of the limitations of methodsbased on means they have a relativelyhigh risk of poor control over the Type Ierror probability as well as poor powerAnother possible concern is that whendealing with skewed distributions themean might be an unsatisfactorysummary of the typical response Resultsbased on means are not necessarilyinaccurate but relative to other methodsthat might be used there are seriouspractical concerns that are difficult toaddress Importantly when using meanseven a significant result can yield arelatively inaccurate and unrevealingsense of how distributions differ and anon-significant result cannot be used toconclude that distributions do not differ

As a useful alternative to comparisons basedon means consider using a shift functionor some other method for comparingmultiple quantiles Sometimes thesemethods can provide a deeperunderstanding of where and how groupsdiffer that has practical value asillustrated in Figure 8427 There is eventhe possibility that they yield significantresults when methods based on meanstrimmed means and medians do not Fordiscrete data where the variables have alimited number of possible valuesconsider the R function binband (seeWilcox 2017c section 12117 orWilcox 2017a section 585)

When checking for outliers use a methodthat avoids masking This eliminates anymethod based on the mean and varianceWilcox (2017a section 64) summarizesmethods designed for multivariate data(The R functions outpro and outmgv usemethods that perform relatively well)

Be aware that the choice of method canmake a substantial difference Forexample non-significant results canbecome significant when switching froma method based on the mean to a 20

trimmed mean or median The reversecan happen where methods based onmeans are significant but robust methodsare not In this latter situation the reasonmight be that confidence intervals basedon means are highly inaccurateResolving whether this is the case isdifficult at best based on currenttechnology Consequently it is prudentto consider the robust methods outlinedin this paper

When dealing with regression or measuresof association use modern methods forchecking on the impact of outliers Whenusing regression estimators dealing withoutliers among the independent variablesis straightforward via the R functionsmentioned here set the argument xout =TRUE As for outliers among thedependent variable use some robustregression estimator The Theil-Senestimator is relatively good butarguments can be made for using certainextensions and alternative techniquesWhen there are one or two independentvariables and the sample size is not toosmall check the plots returned bysmoothers This can be done with the Rfunctions rplot and lplot Again it can becrucial to check the impact of removingoutlier among the independent variablesby setting the argument xout = TRUEOther possibilities and their relativemerits are summarized in Wilcox(2017a)

If the goal is to compare dependent groupsbased on difference scores and the sample sizesare very small say 10 the following optionsperform relatively well over a broad range ofsituations in terms of controlling the Type Ierror probability and providing accurate con-fidence intervals

Use the sign test via the R function signt Ifthe sample sizes are 10 and the goal isto ensure that the Type I error probabilityis less than the nominal level set theargument SD = TRUE Otherwise thedefault method performs reasonablywell For multiple groups use the Rfunction signmcp

Focus on the median of the differencescores using the R command sintv2(x-y)where x and y are R variables containingthe data For multiple groups use the Rfunction sintv2mcp

Focus on a 20 trimmed mean of thedifference scores using the command

A Guide to RobustStatisticalMethods

84226

Supplement 82 Current Protocols in Neuroscience

trimcipb(x-y) This method worksreasonably well with a sample size of sixor greater When dealing with multiplegroups use the R function dtrimpb

Compare the quantiles of the marginaldistributions via the R function lband

Each of the methods listed above providesgood control over the Type I error probabil-ity Which one has the most power depends onhow the groups differ which of course is notknown With sample sizes gt10 many otheroptions are available (eg Wilcox 2017a2017b) Again different methods providedifferent perspectives regarding how groupscompare

If the goal is to compare two independentgroups and the sample sizes are 10 here isa list of some methods that perform relativelywell in terms of Type I errors

Compare the medians using a percentilebootstrap method via the R functionmedpb2 If there are multiple groupsuse the R function medcmp

Compare the quantiles via the R functionsband

Test the hypothesis that a randomlysampled observation from the firstgroup is less than a randomly sampledobservation from the second using theR function cidv2 For multiplegroups use cidmcp

Comparing 20 trimmed means via theR function trimpb2 performsreasonably well but comparingmedians via the R function medpb2 isa bit safer in terms of Type I errorsHowever trimpb2 might have higherpower For multiple groups uselinconpb

Again larger sample sizes provide manymore options particularly when dealing withmore than two groups For example ratherthan comparing quantiles via the R functionsband comparing the deciles using the R func-tion qcomhd might have a power advantageThere are multiple ways of viewing interac-tions in a two-way ANOVA design There areeven alternative methods for comparing de-pendent groups not covered here

7 CONCLUDING REMARKSIt is not being suggested that methods based

on means should be completely abandoned orhave no practical value It is being suggestedthat complete reliance on conventional meth-

ods can result in a superficial misleading andrelatively uninformative understanding of howgroups compare In addition they might pro-vide poor control over the Type I error prob-ability and have relatively low power Similarconcerns plague least squares regression andPearsonrsquos correlation

When a method fails to reject this leavesopen the issue of whether a significant resultwas missed due to the method used From thisperspective multiple tests can be informativeHowever there are two competing issues thatneed to be considered The first is that whentesting multiple hypotheses this can increasethe probability of one or more Type I errorsFrom basic principles if for example five testsare performed at the 005 level and all five hy-potheses are true the expected number of TypeI errors is 025 But the more common focusis on the probability of one or more Type I er-rors rather than the expected number of TypeI errors The probability of one or more TypeI errors will be gt005 but by how much isdifficult to determine exactly due to the de-pendence among the tests that are performedImprovements on the Bonferroni method dealwith this issue (eg Hochberg 1988 Hom-mel 1988) and are readily implemented viathe R function padjust But the more tests thatare performed such adjustments come at thecost of lower power Simultaneously ignor-ing multiple perspectives runs the risk of notachieving a deep understanding of how groupscompare Also as noted in the previous sec-tion if methods based on means are used itis prudent to check the extent robust methodsgive similar results

An important issue not discussed here isrobust measures of effect size When bothconventional and robust methods reject themethod used to characterize how groups dif-fer can be crucial Cohenrsquos d for exampleis not robust simply because it is based onthe means and variances Robust measures ofeffect size are summarized in Wilcox (2017a2017c) Currently efforts are being made toextend and generalize these measures

Another important issue is the implicationof modern insights in terms of the massivenumber of published papers using conven-tional methods These insights do not neces-sarily imply that these results are incorrectThere are conditions where classic methodsperform reasonably well But there is a clearpossibility that misleading results were ob-tained in some situations One of the mainconcerns is whether important differences orassociations have been missed Some of the

BehavioralNeuroscience

84227

Current Protocols in Neuroscience Supplement 82

illustrations in Wilcox (2017a 2017b 2017c)for example reanalyzed data from studiesdealing with regression where the simple actof removing outliers among the independentvariable resulted in a non-significant resultbecoming significant at the 005 level As illus-trated here non-significant results can becomesignificant when using a more modern methodfor comparing measures of central tendencyThere is also the possibility that a few out-liers can result in a large Pearson correlationwhen in fact there is little or no associationamong the bulk of the data (Rousselet amp Per-net 2012) Wilcox (2017c) mentions one un-published study where this occurred So theissue is not whether modern robust methodscan make a practical difference Rather theissue is how often this occurs

Finally there is much more to modernmethods beyond the techniques and issues de-scribed here (Wilcox 2017a 2017b) Includedare additional methods for studying associa-tions as well as substantially better techniquesfor dealing with ANCOVA As mentioned inthe introduction there are introductory text-books that include the basics of modern ad-vances and insights But the difficult task ofmodernizing basic training for the typical neu-roscientist remains

ACKNOWLEDGMENTSThe authors of this article thank the authors

who shared data used in Section 5

LITERATURE CITEDAgresti A amp Coull B A (1998) Approximate

is better than exact for interval estimation ofbinomial proportions American Statistician 52119ndash126 doi 1023072685469

Almeida-Suhett C P Prager E M PidoplichkoV Figueiredo T H Marini A M Li Z Braga M (2014) Reduced GABAergic inhi-bition in the basolateral amygdala and the de-velopment of anxiety-like behaviors after mildtraumatic brain injury PLoS One 9 e102627doi 101371journalpone0102627

Bessel F W (1818) Fundamenta Astronomiaepro anno MDCCLV deducta ex observationibusviri incomparabilis James Bradley in speculaastronomica Grenovicensi per annos 1750-1762institutis Konigsberg Friedrich Nicolovius

Bernhardson C (1975) Type I error rates whenmultiple comparison procedures follow a sig-nificant F test of ANOVA Biometrics 31 719ndash724 doi 1023072529724

Boik R J (1987) The Fisher-Pitman permuta-tion test A non-robust alternative to the nor-mal theory F test when variances are het-erogeneous British Journal of Mathematical

and Statistical Psychology 40 26ndash42 doi101111j2044-83171987tb00865x

Bradley J V (1978) Robustness British Jour-nal of Mathematical and Statistical Psychol-ogy 31 144ndash152 doi 101111j2044-83171978tb00581x

Brown M B amp Forsythe A (1974) Thesmall sample behavior of some statistics whichtest the equality of several means Techno-metrics 16 129ndash132 doi 10108000401706197410489158

Brunner E Domhof S amp Langer F (2002) Non-parametric analysis of longitudinal data in fac-torial experiments New York Wiley

Chung E amp Romano J P (2013) Exact andasymptotically robust permutation tests An-nals of Statistics 41 484ndash507 doi 10121413-AOS1090

Cleveland W S (1979) Robust locally weightedregression and smoothing scatterplots Journalof the American Statistical Association 74 829ndash836 doi 10108001621459197910481038

Cliff N (1996) Ordinal methods for behavioraldata analysis Mahwah NJ Erlbaum

Cressie N A C amp Whitford H J (1986) How touse the two sample t-test Biometrical Journal28 131ndash148 doi 101002bimj4710280202

Derksen S amp Keselman H J (1992) Back-ward forward and stepwise automated sub-set selection algorithms Frequency of obtain-ing authentic and noise variables British Jour-nal of Mathematical and Statistical Psychology45 265ndash282 doi 101111j2044-83171992tb00992x

Doksum K A amp Sievers G L (1976) Plot-ting with confidence Graphical comparisons oftwo populations Biometrika 63 421ndash434 doi101093biomet633421

Doksum K A amp Wong C-W (1983) Statisticaltests based on transformed data Journal of theAmerican Statistical Association 78 411ndash417doi 10108001621459198310477986

Duncan G T amp Layard M W (1973) AMonte-Carlo study of asymptotically robusttests for correlation Biometrika 60 551ndash558doi 101093biomet603551

Fagerland M W amp Leiv Sandvik L (2009) TheWilcoxon-Mann-Whitney test under scrutinyStatistics in Medicine 28 1487ndash1497 doi101002sim3561

Field A Miles J amp Field Z (2012) Discoveringstatistics using R Thousand Oaks CA SagePublishing

Grayson D (2004) Some myths and leg-ends in quantitative psychology Understand-ing Statistics 3 101ndash134 doi 101207s15328031us0302_3

Hampel F R Ronchetti E M Rousseeuw P Jamp Stahel W A (1986) Robust statistics NewYork Wiley

Hand A (1998) A History of mathematical statis-tics from 1750 to 1930 New York Wiley

A Guide to RobustStatisticalMethods

84228

Supplement 82 Current Protocols in Neuroscience

Harrell F E amp Davis C E (1982) A newdistribution-free quantile estimator Biometrika69 635ndash640 doi 101093biomet693635

Heritier S Cantoni E Copt S amp Victoria-FeserM-P (2009) Robust methods in biostatisticsNew York Wiley

Hettmansperger T P amp McKean J W (2011)Robust nonparametric statistical methods 2nded Boca Raton FL CRC Press

Hettmansperger T P amp Sheather S J (1986)Confidence interval based on interpolated orderstatistics Statistical Probability Letters 4 75ndash79 doi 1010160167-7152(86)90021-0

Hochberg Y (1988) A sharper Bonferroni pro-cedure for multiple tests of significanceBiometrika 75 800ndash802 doi 101093biomet754800

Hommel G (1988) A stagewise rejective multi-ple test procedure based on a modified Bon-ferroni test Biometrika 75 383ndash386 doi101093biomet752383

Houston S M Lebel C Katzir T Manis FR Kan E Rodriguez G R amp Sowell ER (2014) Reading skill and structural braindevelopment Neuroreport 25 347ndash352 doi101097WNR0000000000000121

Huber P J amp Ronchetti E (2009) Robust statis-tics 2nd ed New York Wiley

Keselman H J Othman A amp Wilcox RR (2016) Generalized linear model analy-ses for treatment group equality when dataare non-normal Journal of Modern and Ap-plied Statistical Methods 15 32ndash61 doi1022237jmasm1462075380

Mancini F (2016) ANA_mancini14_datazipfigshare Retreived from httpsfigsharecomarticlesANA_mancini14_data_zip3427766

Mancini F Bauleo A Cole J Lui F PorroC A Haggard P amp Iannetti G D (2014)Whole-body mapping of spatial acuity for painand touch Annals of Neurology 75 917ndash924doi 101002ana24179

Maronna R A Martin D R amp Yohai V J(2006) Robust statistics Theory and methodsNew York Wiley

Montgomery D C amp Peck E A (1992) Intro-duction to linear regression analysis New YorkWiley

Newcombe R G (2006) Confidence intervals foran effect size measure based on the Mann-Whitney statistic Part 1 General issues and tail-area-based methods Statistics in Medicine 25543ndash557 doi 101002sim2323

Nieuwenhuis S Forstmann B U amp Wagenmak-ers E-J (2011) Erroneous analyses of inter-actions in neuroscience a problem of signifi-cance Nature Neuroscience 9 1105ndash1107 doi101038nn2886

Ozdemir A F Wilcox R R amp Yildiztepe E(2013) Comparing measures of location Somesmall-sample results when distributions differin skewness and kurtosis under heterogene-ity of variances Communications in Statistics-

Simulation and Computations 42 407ndash424doi 101080036109182011636163

Pernet C R Wilcox R amp Rousselet G A(2012) Robust correlation analyses False pos-itive and power validation using a new opensource matlab toolbox Frontiers in Psychology3 606 doi 103389fpsyg201200606

Pernet C R Latinus M Nichols T E amp Rous-selet G A (2015) Cluster-based computa-tional methods for mass univariate analyses ofevent-related brain potentialsfields A simula-tion study Journal of Neuroscience Methods250 85ndash93 doi 101016jjneumeth201408003

Rasmussen J L (1989) Data transformation TypeI error rate and power British Journal of Math-ematical and Statistical Psychology 42 203ndash211 doi 101111j2044-83171989tb00910x

Romano J P (1990) On the behavior ofrandomization tests without a group invari-ance assumption Journal of the AmericanStatistical Association 85 686ndash692 doi10108001621459199010474928

Rousseeuw P J amp Leroy A M (1987) Ro-bust regression amp outlier detection New YorkWiley

Rousselet G A Foxe J J amp Bolam J P (2016)A few simple steps to improve the descrip-tion of group results in neuroscience EuropeanJournal of Neuroscience 44 2647ndash2651 doi101111ejn13400

Rousselet G A amp Pernet C R (2012) Improvingstandards in brain-behavior correlation analysesFrontiers in Human Neuroscience 6 119 doi103389fnhum201200119

Rousselet G A Pernet C R amp WilcoxR R (2017) Beyond differences in meansRobust graphical methods to compare twogroups in neuroscience European Jour-nal of Neuroscience 46 1738ndash1748 doi101111ejn1361013610

Ruscio J (2008) A probability-based measure ofeffect size Robustness to base rates and otherfactors Psychological Methods 13 19ndash30 doi1010371082-989X13119

Ruscio J amp Mullen T (2012) Confidenceintervals for the probability of superiority ef-fect size measure and the area under a re-ceiver operating characteristic curve Multivari-ate Behavioral Research 47 201ndash223 doi101080002731712012658329

Schilling M amp Doi J (2014) A cover-age probability approach to finding an op-timal binomial confidence procedure Ameri-can Statistician 68 133ndash145 doi 101080000313052014899274

Staudte R G amp Sheather S J (1990) Robustestimation and testing New York Wiley

Talebi V amp Baker C L Jr (2016) Categoricallydistinct types of receptive fields in early visualcortex Journal of Neurophysiology 115 2556ndash2576 doi 101152jn006592015

Tukey J W (1960) A survey of sampling from con-taminated normal distributions In I Olkin (Ed)

BehavioralNeuroscience

84229

Current Protocols in Neuroscience Supplement 82

Contributions to Probability and Statistics Es-says in honor of Harold Hotelling (pp 448ndash485) Stanford CA Stanford University Press

Tukey J W amp McLaughlin D H (1963) Lessvulnerable confidence and significance proce-dures for location based on a single sampleTrimmingWinsorization 1 Sankhya The In-dian Journal of Statistics 25 331ndash352

Wagenmakers E J Wetzels R Borsboom Dvan der Maas H L amp Kievit R A (2012) Anagenda for purely confirmatory research Per-spectives in Psychological Science 7 632ndash638doi 1011771745691612463078

Weissgerber T L Milic N M Winham SJ amp Garovic V D (2015) Beyond bar andline graphs Time for a new data presentationparadigm PLOS Biology 13 e1002128 doi101371journalpbio1002128

Welch B L (1938) The significance of the differ-ence between two means when the populationvariances are unequal Biometrika 29 350ndash362doi 101093biomet293-4350

Wiederhold J L amp Bryan B R (2001) Grayoral reading test 4th ed Austin TX ProEdPublishing

Wilcox R R (2009) Comparing Pearson cor-relations Dealing with heteroscedasticity andnon-normality Communications in Statistics-Simulation and Computation 38 2220ndash2234doi 10108003610910903289151

Wilcox R R (2012) Comparing two inde-pendent groups via a quantile generaliza-tion of the WilcoxonndashMannndashWhitney testJournal of Modern and Applied StatisticalMethods 11 296ndash302 doi 1022237jmasm1351742460

Wilcox R R (2017a) Introduction to robust es-timation and hypothesis testing (4th ed) SanDiego Academic Press

Wilcox R R (2017b) Understanding and applyingbasic statistical methods using R New YorkWiley

Wilcox R R (2017c) Modern statistics for thesocial and behavioral sciences A practical in-troduction (2nd ed) New York CRC Press

Wilcox R R (2018) Robust regression An in-ferential method for determining which inde-pendent variables are most important Jour-nal of Applied Statistics 45 100ndash111 doi1010800266476320161268105

Wilcox R R (in press) An inferential method fordetermining which of two independent variablesis most important when there is curvature Jour-nal of Modern and Applied Statistical Methods

Wilcox R R amp Rousselet G A (2017)A guide to robust statistical methods Il-lustrations using R figshare Retrieved fromhttpsfigsharecomarticlesCurrent_Protocols_pdf5047591

Winkler A M Ridgwaym G R Webster MA Smith S M amp Nichols T M (2014)Permutation inference for the general lin-ear model NeuroImage 92 381ndash397 doi101016jneuroimage201401060

Woodcock R W McGrew K S amp MatherN (2001) Woodcock Johnson-III tests ofachievement Rolling Meadows IL RiversidePublishing

Yuen K K (1974) The two sample trimmed t forunequal population variances Biometrika 61165ndash170 doi 101093biomet611165

A Guide to RobustStatisticalMethods

84230

Supplement 82 Current Protocols in Neuroscience

Page 27: A Guide to Robust Statistical Methods in Neuroscience ... · A Guide to Robust Statistical Methods in UNIT 8.42 Neuroscience Rand R. Wilcox1 and Guillaume A. Rousselet2 1Deptartment

trimcipb(x-y) This method worksreasonably well with a sample size of sixor greater When dealing with multiplegroups use the R function dtrimpb

Compare the quantiles of the marginaldistributions via the R function lband

Each of the methods listed above providesgood control over the Type I error probabil-ity Which one has the most power depends onhow the groups differ which of course is notknown With sample sizes gt10 many otheroptions are available (eg Wilcox 2017a2017b) Again different methods providedifferent perspectives regarding how groupscompare

If the goal is to compare two independentgroups and the sample sizes are 10 here isa list of some methods that perform relativelywell in terms of Type I errors

• Compare the medians using a percentile bootstrap method via the R function medpb2. If there are multiple groups, use the R function medcmp.

• Compare the quantiles via the R function sband.

• Test the hypothesis that a randomly sampled observation from the first group is less than a randomly sampled observation from the second using the R function cidv2. For multiple groups, use cidmcp.

• Comparing 20% trimmed means via the R function trimpb2 performs reasonably well, but comparing medians via the R function medpb2 is a bit safer in terms of Type I errors. However, trimpb2 might have higher power. For multiple groups, use linconpb.

Again, larger sample sizes provide many more options, particularly when dealing with more than two groups. For example, rather than comparing quantiles via the R function sband, comparing the deciles using the R function qcomhd might have a power advantage. There are multiple ways of viewing interactions in a two-way ANOVA design. There are even alternative methods for comparing dependent groups not covered here.
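As an illustration, the following R sketch applies several of the functions just listed to made-up data. It again assumes Wilcox's functions have been sourced (e.g., from the Rallfun file accompanying Wilcox, 2017a), and the argument lists are assumptions based on that source.

    # Hypothetical independent groups, n = 10 each
    set.seed(2)
    g1 <- rlnorm(10)        # skewed data, group 1
    g2 <- rlnorm(10) + 0.5  # group 2, shifted
    medpb2(g1, g2)   # compare medians via a percentile bootstrap
    cidv2(g1, g2)    # test P(X < Y) = 0.5 (cf. Cliff, 1996)
    trimpb2(g1, g2)  # compare 20% trimmed means via a percentile bootstrap
    qcomhd(g1, g2)   # compare the deciles (better suited to larger samples)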

7. CONCLUDING REMARKS

It is not being suggested that methods based on means should be completely abandoned or have no practical value. It is being suggested that complete reliance on conventional methods can result in a superficial, misleading, and relatively uninformative understanding of how groups compare. In addition, they might provide poor control over the Type I error probability and have relatively low power. Similar concerns plague least squares regression and Pearson's correlation.

When a method fails to reject, this leaves open the issue of whether a significant result was missed due to the method used. From this perspective, multiple tests can be informative. However, there are two competing issues that need to be considered. The first is that testing multiple hypotheses can increase the probability of one or more Type I errors. From basic principles, if, for example, five tests are performed at the 0.05 level and all five hypotheses are true, the expected number of Type I errors is 0.25. But the more common focus is on the probability of one or more Type I errors, rather than the expected number of Type I errors. The probability of one or more Type I errors will be greater than 0.05, but by how much is difficult to determine exactly due to the dependence among the tests that are performed. Improvements on the Bonferroni method deal with this issue (e.g., Hochberg, 1988; Hommel, 1988) and are readily implemented via the R function p.adjust. But as more tests are performed, such adjustments come at the cost of lower power. Simultaneously, ignoring multiple perspectives runs the risk of not achieving a deep understanding of how groups compare. Also, as noted in the previous section, if methods based on means are used, it is prudent to check the extent to which robust methods give similar results.
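To make the arithmetic and the adjustment concrete, here is a short base R sketch; the p-values are made up purely for illustration.

    k <- 5; alpha <- 0.05
    k * alpha          # expected number of Type I errors when all 5 nulls are true: 0.25
    1 - (1 - alpha)^k  # P(one or more Type I errors) if the tests were independent: ~0.23
    p <- c(0.012, 0.020, 0.034, 0.041, 0.300)  # hypothetical p-values from five tests
    p.adjust(p, method = "hochberg")           # Hochberg's (1988) improvement on Bonferroni
    p.adjust(p, method = "hommel")             # Hommel's (1988) method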

An important issue not discussed here is robust measures of effect size. When both conventional and robust methods reject, the method used to characterize how groups differ can be crucial. Cohen's d, for example, is not robust, simply because it is based on the means and variances. Robust measures of effect size are summarized in Wilcox (2017a, 2017c). Currently, efforts are being made to extend and generalize these measures.
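The following base R sketch illustrates why Cohen's d is not robust; the trimmed-mean analog shown is purely illustrative and is not one of the estimators summarized in Wilcox (2017a, 2017c).

    set.seed(3)
    g1 <- rnorm(20); g2 <- rnorm(20, mean = 1)
    cohens_d <- function(x, y) {
      # pooled-SD standardized mean difference
      sp <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                 (length(x) + length(y) - 2))
      (mean(x) - mean(y)) / sp
    }
    illus_robust_d <- function(x, y) {
      # illustrative analog: 20% trimmed means scaled by a pooled MAD
      (mean(x, trim = 0.2) - mean(y, trim = 0.2)) /
        mad(c(x - median(x), y - median(y)))
    }
    cohens_d(g1, g2); illus_robust_d(g1, g2)  # both indicate a clear effect
    g2[1] <- 40                               # introduce a single extreme outlier
    cohens_d(g1, g2); illus_robust_d(g1, g2)  # d shrinks sharply; the analog barely moves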

Another important issue is the implication of modern insights in terms of the massive number of published papers using conventional methods. These insights do not necessarily imply that these results are incorrect. There are conditions where classic methods perform reasonably well. But there is a clear possibility that misleading results were obtained in some situations. One of the main concerns is whether important differences or associations have been missed. Some of the illustrations in Wilcox (2017a, 2017b, 2017c), for example, reanalyzed data from studies dealing with regression where the simple act of removing outliers among the independent variable resulted in a non-significant result becoming significant at the 0.05 level. As illustrated here, non-significant results can become significant when using a more modern method for comparing measures of central tendency. There is also the possibility that a few outliers can result in a large Pearson correlation when in fact there is little or no association among the bulk of the data (Rousselet & Pernet, 2012). Wilcox (2017c) mentions one unpublished study where this occurred. So the issue is not whether modern robust methods can make a practical difference. Rather, the issue is how often this occurs.
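A brief base R demonstration of the correlation point, using made-up data: the bulk of the observations are unrelated, yet two outliers produce a large Pearson correlation, while a rank-based alternative (one of several robust options; see Pernet, Wilcox, & Rousselet, 2012) is far less affected.

    set.seed(4)
    x <- rnorm(48); y <- rnorm(48)      # no association among the bulk of the data
    cor(x, y)                           # Pearson correlation near 0
    x <- c(x, 9, 10); y <- c(y, 9, 10)  # add two extreme points
    cor(x, y)                           # Pearson correlation now large (roughly 0.7)
    cor(x, y, method = "spearman")      # Spearman's rho remains much smaller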

Finally, there is much more to modern methods beyond the techniques and issues described here (Wilcox, 2017a, 2017b). Included are additional methods for studying associations, as well as substantially better techniques for dealing with ANCOVA. As mentioned in the introduction, there are introductory textbooks that include the basics of modern advances and insights. But the difficult task of modernizing basic training for the typical neuroscientist remains.

ACKNOWLEDGMENTS

The authors of this article thank the authors who shared data used in Section 5.

LITERATURE CITED

Agresti, A., & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician, 52, 119–126. doi:10.2307/2685469

Almeida-Suhett, C. P., Prager, E. M., Pidoplichko, V., Figueiredo, T. H., Marini, A. M., Li, Z., … Braga, M. (2014). Reduced GABAergic inhibition in the basolateral amygdala and the development of anxiety-like behaviors after mild traumatic brain injury. PLoS One, 9, e102627. doi:10.1371/journal.pone.0102627

Bessel, F. W. (1818). Fundamenta astronomiae pro anno MDCCLV deducta ex observationibus viri incomparabilis James Bradley in specula astronomica Grenovicensi per annos 1750-1762 institutis. Königsberg: Friedrich Nicolovius.

Bernhardson, C. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 719–724. doi:10.2307/2529724

Boik, R. J. (1987). The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42. doi:10.1111/j.2044-8317.1987.tb00865.x

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi:10.1111/j.2044-8317.1978.tb00581.x

Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132. doi:10.1080/00401706.1974.10489158

Brunner, E., Domhof, S., & Langer, F. (2002). Nonparametric analysis of longitudinal data in factorial experiments. New York: Wiley.

Chung, E., & Romano, J. P. (2013). Exact and asymptotically robust permutation tests. Annals of Statistics, 41, 484–507. doi:10.1214/13-AOS1090

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. doi:10.1080/01621459.1979.10481038

Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148. doi:10.1002/bimj.4710280202

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. doi:10.1111/j.2044-8317.1992.tb00992.x

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421–434. doi:10.1093/biomet/63.3.421

Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417. doi:10.1080/01621459.1983.10477986

Duncan, G. T., & Layard, M. W. (1973). A Monte-Carlo study of asymptotically robust tests for correlation. Biometrika, 60, 551–558. doi:10.1093/biomet/60.3.551

Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497. doi:10.1002/sim.3561

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage Publishing.

Grayson, D. (2004). Some myths and legends in quantitative psychology. Understanding Statistics, 3, 101–134. doi:10.1207/s15328031us0302_3

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.

Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.


Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. doi:10.1093/biomet/69.3.635

Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (2011). Robust nonparametric statistical methods (2nd ed.). Boca Raton, FL: CRC Press.

Hettmansperger, T. P., & Sheather, S. J. (1986). Confidence intervals based on interpolated order statistics. Statistics & Probability Letters, 4, 75–79. doi:10.1016/0167-7152(86)90021-0

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802. doi:10.1093/biomet/75.4.800

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi:10.1093/biomet/75.2.383

Houston, S. M., Lebel, C., Katzir, T., Manis, F. R., Kan, E., Rodriguez, G. R., & Sowell, E. R. (2014). Reading skill and structural brain development. Neuroreport, 25, 347–352. doi:10.1097/WNR.0000000000000121

Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed.). New York: Wiley.

Keselman, H. J., Othman, A., & Wilcox, R. R. (2016). Generalized linear model analyses for treatment group equality when data are non-normal. Journal of Modern and Applied Statistical Methods, 15, 32–61. doi:10.22237/jmasm/1462075380

Mancini, F. (2016). ANA_mancini14_data.zip. figshare. Retrieved from https://figshare.com/articles/ANA_mancini14_data_zip/3427766

Mancini, F., Bauleo, A., Cole, J., Lui, F., Porro, C. A., Haggard, P., & Iannetti, G. D. (2014). Whole-body mapping of spatial acuity for pain and touch. Annals of Neurology, 75, 917–924. doi:10.1002/ana.24179

Maronna, R. A., Martin, D. R., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.

Newcombe, R. G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statistics in Medicine, 25, 543–557. doi:10.1002/sim.2323

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105–1107. doi:10.1038/nn.2886

Ozdemir, A. F., Wilcox, R. R., & Yildiztepe, E. (2013). Comparing measures of location: Some small-sample results when distributions differ in skewness and kurtosis under heterogeneity of variances. Communications in Statistics - Simulation and Computation, 42, 407–424. doi:10.1080/03610918.2011.636163

Pernet, C. R., Wilcox, R., & Rousselet, G. A. (2012). Robust correlation analyses: False positive and power validation using a new open source Matlab toolbox. Frontiers in Psychology, 3, 606. doi:10.3389/fpsyg.2012.00606

Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93. doi:10.1016/j.jneumeth.2014.08.003

Rasmussen, J. L. (1989). Data transformation, Type I error rate and power. British Journal of Mathematical and Statistical Psychology, 42, 203–211. doi:10.1111/j.2044-8317.1989.tb00910.x

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85, 686–692. doi:10.1080/01621459.1990.10474928

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression & outlier detection. New York: Wiley.

Rousselet, G. A., Foxe, J. J., & Bolam, J. P. (2016). A few simple steps to improve the description of group results in neuroscience. European Journal of Neuroscience, 44, 2647–2651. doi:10.1111/ejn.13400

Rousselet, G. A., & Pernet, C. R. (2012). Improving standards in brain-behavior correlation analyses. Frontiers in Human Neuroscience, 6, 119. doi:10.3389/fnhum.2012.00119

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748. doi:10.1111/ejn.13610

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13, 19–30. doi:10.1037/1082-989X.13.1.19

Ruscio, J., & Mullen, T. (2012). Confidence intervals for the probability of superiority effect size measure and the area under a receiver operating characteristic curve. Multivariate Behavioral Research, 47, 201–223. doi:10.1080/00273171.2012.658329

Schilling, M., & Doi, J. (2014). A coverage probability approach to finding an optimal binomial confidence procedure. American Statistician, 68, 133–145. doi:10.1080/00031305.2014.899274

Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.

Talebi, V., & Baker, C. L., Jr. (2016). Categorically distinct types of receptive fields in early visual cortex. Journal of Neurophysiology, 115, 2556–2576. doi:10.1152/jn.00659.2015

Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford University Press.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya: The Indian Journal of Statistics, 25, 331–352.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi:10.1177/1745691612463078

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13, e1002128. doi:10.1371/journal.pbio.1002128

Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. doi:10.1093/biomet/29.3-4.350

Wiederholt, J. L., & Bryant, B. R. (2001). Gray Oral Reading Test (4th ed.). Austin, TX: Pro-Ed Publishing.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics - Simulation and Computation, 38, 2220–2234. doi:10.1080/03610910903289151

Wilcox, R. R. (2012). Comparing two independent groups via a quantile generalization of the Wilcoxon–Mann–Whitney test. Journal of Modern and Applied Statistical Methods, 11, 296–302. doi:10.22237/jmasm/1351742460

Wilcox, R. R. (2017a). Introduction to robust estimation and hypothesis testing (4th ed.). San Diego, CA: Academic Press.

Wilcox, R. R. (2017b). Understanding and applying basic statistical methods using R. New York: Wiley.

Wilcox, R. R. (2017c). Modern statistics for the social and behavioral sciences: A practical introduction (2nd ed.). New York: CRC Press.

Wilcox, R. R. (2018). Robust regression: An inferential method for determining which independent variables are most important. Journal of Applied Statistics, 45, 100–111. doi:10.1080/02664763.2016.1268105

Wilcox, R. R. (in press). An inferential method for determining which of two independent variables is most important when there is curvature. Journal of Modern and Applied Statistical Methods.

Wilcox, R. R., & Rousselet, G. A. (2017). A guide to robust statistical methods: Illustrations using R. figshare. Retrieved from https://figshare.com/articles/Current_Protocols_pdf/5047591

Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397. doi:10.1016/j.neuroimage.2014.01.060

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement. Rolling Meadows, IL: Riverside Publishing.

Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165–170. doi:10.1093/biomet/61.1.165
