
P1: GIU

Experimental Economics KL2082-04 July 31, 2003 17:23

UNCORRECTED PROOF

Experimental Economics, 6:273–297 (2003)
© 2003 Economic Science Association

Nonparametric Tests of Differences in Medians: Comparison of the Wilcoxon–Mann–Whitney and Robust Rank-Order Tests

NICK FELTOVICH
Department of Economics, University of Houston, Houston, TX 77204-5019, USA
email: [email protected]

Received June 28, 2001; Revised September 4, 2002; Accepted December 30, 2002

Abstract

The nonparametric Wilcoxon–Mann–Whitney test is commonly used by experimental economists for detecting differences in central tendency between two samples. This test is only theoretically appropriate under certain assumptions concerning the population distributions from which the samples are drawn, and is often used in cases where it is unclear whether these assumptions hold, and even when they clearly do not hold. Fligner and Pollicello’s (1981, Journal of the American Statistical Association, 76, 162–168) robust rank-order test is a modification of the Wilcoxon–Mann–Whitney test, designed to be appropriate in more situations than Wilcoxon–Mann–Whitney. This paper uses simulations to compare the performance of the two tests under a variety of distributional assumptions. The results are mixed. The robust rank-order test tends to yield too many false positive results for medium-sized samples, but this liberalness is relatively invariant across distributional assumptions, and seems to be due to a deficiency of the normal approximation to its test statistic’s distribution, rather than the test itself. The performance of the Wilcoxon–Mann–Whitney test varies hugely, depending on the distributional assumptions; in some cases, it is conservative, in others, extremely liberal. The tests have roughly similar power. Overall, the robust rank-order test performs better than Wilcoxon–Mann–Whitney, though when critical values for the robust rank-order test are not available, so that the normal approximation must be used, their relative performance depends on the underlying distributions, the sample sizes, and the level of significance used.

Keywords: robust rank-order, Wilcoxon–Mann–Whitney, hypothesis testing, power, critical values

JEL Classification: C14, C12, C90

1. Introduction

Many economics experiments are designed to test for differences in central tendency between two experimental treatments, that is, whether some variable of interest is, on average, higher or lower in one treatment than in another. Often, sample sizes are small enough that standard parametric techniques, such as t-tests, aren’t applicable. If so, or if the data are measurable on an ordinal scale rather than an interval scale, nonparametric statistical tests are useful for such comparisons.¹ When the two samples are drawn independently of each other, the nonparametric test most commonly used by experimental economists is probably the Wilcoxon–Mann–Whitney test (Wilcoxon, 1945; Mann and Whitney, 1948), also known as the Wilcoxon, Mann–Whitney, or U-test.

The Wilcoxon–Mann–Whitney test is quite easy to perform, and possibly for this reason, has been used by many researchers even in situations where it isn’t theoretically appropriate. Part of the motivation for this paper is to highlight the problem with using the Wilcoxon–Mann–Whitney test in these situations. We find, perhaps not surprisingly, that it is often too likely to give a false positive result; that is, it too frequently leads to rejection of a true null hypothesis of equal central tendencies. In other situations, however, it is excessively unlikely to reject the null hypothesis. The implication is that its performance varies tremendously, depending on characteristics of the underlying population distributions that are typically unknown to the researcher. A closely related nonparametric test, the robust rank-order test, is much less sensitive to these assumptions and substantially outperforms the Wilcoxon–Mann–Whitney test when sample sizes are small (so that the exact critical values of its test statistic are known) or very large (so that the normal approximation to its distribution is a good one). The robust rank-order test is too likely to give false positive results for medium-sized samples, but this is more a shortcoming of the normal approximation than of the test itself. The improved robustness of the robust rank-order test does not come at the expense of power (the likelihood of correctly rejecting a false null hypothesis of equal central tendencies), as we also find that the two tests have roughly equal power.

2. Description of the tests

These descriptions are adapted from Siegel and Castellan (1988). Both tests are used to determine whether two samples (Sample X, of size m, and Sample Y, of size n) have been drawn from population distributions with the same central tendency. When (1) the data are measurable on an interval scale, and (2) the samples are large or come from populations that are normally distributed, a parametric test, such as the t-test, can be used. If these assumptions do not hold, then nonparametric tests, such as the Wilcoxon–Mann–Whitney and robust rank-order tests, are more appropriate.²

2.1. The Wilcoxon–Mann–Whitney test

The Wilcoxon–Mann–Whitney test is performed as follows. The placement of element x_i in sample X is defined as the number of lower-valued observations in Y (the other sample) and is denoted U(YX_i). The mean placement U(YX) is the arithmetic mean of the U(YX_i)’s. For given sample sizes, U(YX) will be larger when observations in X have high placements, which will tend to occur when X is drawn from a stochastically larger population than Y. For small sample sizes, critical values of U(YX) have been determined exactly, and can be found in Appendix A. For large sample sizes, the statistic

$$Z = \frac{U(YX) - n/2}{\sqrt{n(m + n + 1)/(12m)}} \tag{1}$$

is approximately N(0,1).³
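Eq. (1) is the usual normal approximation restated in terms of the mean placement. As a short check, under the null hypothesis of identical continuous populations (the symbol W, the total number of pairs in which an X observation exceeds a Y observation, is introduced here for illustration only and is not the paper's notation):

```latex
W = \sum_{i=1}^{m} U(YX_i) = m \cdot U(YX), \qquad
E[W] = \frac{mn}{2}, \qquad
\operatorname{Var}(W) = \frac{mn(m + n + 1)}{12}

\Rightarrow \quad
E[U(YX)] = \frac{n}{2}, \qquad
\operatorname{Var}\bigl(U(YX)\bigr) = \frac{1}{m^{2}} \cdot \frac{mn(m + n + 1)}{12}
                                    = \frac{n(m + n + 1)}{12m}
```

Subtracting this mean and dividing by the square root of this variance gives the statistic Z of Eq. (1).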


2.2. Numerical example

As an example, we use a portion of the experimental ultimatum-game data from Duffy and Feltovich (1997). Sample X is {5.025, 6.7, 6.725, 6.75, 7.05, 7.25, 8.375} and Sample Y is {4.875, 5.125, 5.225, 5.425, 5.55, 5.75, 5.925, 6.125}, so the sample sizes are m = 7 and n = 8. These numbers are average demands from individual subjects under differing information treatments, which we will call Treatments X and Y (see Duffy and Feltovich for details). The research question is whether demands are higher under Treatment X than under Treatment Y. Because the sample sizes are small, we use a nonparametric test rather than a t-test.

To perform the Wilcoxon–Mann–Whitney test, we find the placements of the elements in X. The first element of X is larger than exactly one of the elements of Y, so its placement is 1; each of the other elements of X is larger than all 8 elements of Y, so the placement of each is 8. So, the placements U(YX_i) of the elements in X are {1, 8, 8, 8, 8, 8, 8} and the average placement is U(YX) = 7. According to Appendix A, U(YX) = 7 with m = 7 and n = 8 implies that the population medians are significantly different at the 1% level (one-tailed test).
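The placement calculation in this example is easy to check mechanically. The following is a minimal Python sketch, not code from the paper; the helper name `placements` is hypothetical, and ties are ignored because the example data contain none. It also evaluates Z from Eq. (1) without a continuity correction, purely for illustration, since for samples this small the text relies on the exact critical values in Appendix A.

```python
import math

def placements(sample, other):
    """For each element of `sample`, count the strictly lower values in `other`."""
    return [sum(1 for o in other if o < s) for s in sample]

# Data from the numerical example (Section 2.2).
x = [5.025, 6.7, 6.725, 6.75, 7.05, 7.25, 8.375]
y = [4.875, 5.125, 5.225, 5.425, 5.55, 5.75, 5.925, 6.125]
m, n = len(x), len(y)

u_yx_i = placements(x, y)   # placements U(YX_i) of the elements of X
u_yx = sum(u_yx_i) / m      # mean placement U(YX)

# Eq. (1), without continuity correction (illustrative only at these sample sizes).
z = (u_yx - n / 2) / math.sqrt(n * (m + n + 1) / (12 * m))

print(u_yx_i)        # [1, 8, 8, 8, 8, 8, 8]
print(u_yx)          # 7.0
print(round(z, 2))   # 2.43
```

The printed placements and mean placement match the values derived by hand above.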

2.3. The robust rank-order test

The actual hypotheses considered by the Wilcoxon–Mann–Whitney test are the null, that the two samples come from the same population (and thus have not only the same first moment, but the same higher-order moments also), versus the alternative, that first moments are different, but higher-order moments are the same. This test is therefore not theoretically appropriate when the samples come from populations with different higher-order moments (different dispersion, for example).⁴ The danger is that differences in higher-order moments may be picked up by the test, causing the researcher to falsely conclude that the first moments of the two populations are different. Robust rank-order tests were designed to correct for this shortcoming.

The version of the robust rank-order test used in the current paper is due to Fligner and Pollicello (1981), and is theoretically appropriate when the underlying population distributions are roughly symmetric. (In Section 3.4, we will look at what happens when the distributions are skewed.) The robust rank-order test is performed by first determining the values U(YX_i) and U(YX) as before, as well as the placement of each element y_j of Y (the number U(XY_j) of lower-valued observations than y_j in X), and the mean placement U(XY). Then, an index of variability of placements is computed. The variabilities of X and Y placements are denoted by V_x and V_y, and calculated as follows:

$$V_x = \sum_{i=1}^{m} [U(YX_i) - U(YX)]^2 \quad \text{and} \quad V_y = \sum_{j=1}^{n} [U(XY_j) - U(XY)]^2. \tag{2}$$

The robust rank-order test statistic U is given by

$$U = \frac{m \cdot U(YX) - n \cdot U(XY)}{2\sqrt{V_x + V_y + U(XY) \cdot U(YX)}}. \tag{3}$$


Fligner and Pollicello (1981) present (and Siegel and Castellan (1988) reprint) critical values of U for sample sizes up to 12 and for some levels of significance; they also suggest that for larger sample sizes, U is distributed approximately N(0, 1). We will see, however, that convergence to this distribution can be slow. Feltovich (in preparation) provides critical values for additional sample sizes and levels of significance; some of these can be found in Appendix B.

2.4. Numerical example

We use the data from Section 2.2. We found there that the placements U(YX_i) were {1, 8, 8, 8, 8, 8, 8}, and U(YX) = 7. Similarly, the placements U(XY_j) of the elements in Y are {0, 1, 1, 1, 1, 1, 1, 1}, and the average placement U(XY) is 0.875. The variabilities of the placements are calculated from Eq. (2):

$$V_x = (1 - 7)^2 + 6(8 - 7)^2 = 42 \quad \text{and} \quad V_y = (0 - 0.875)^2 + 7(1 - 0.875)^2 = 0.875.$$

Then, from Eq. (3),

$$U = \frac{7 \cdot 7 - 8 \cdot (0.875)}{2\sqrt{0.875 + 42 + (0.875) \cdot 7}} = +3.000.$$

According to Appendix B, U = +3.000 is significant at the 2.5% level (when m = 7 and n = 8), but just misses being significant at the 1% level.
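The arithmetic of this example can be reproduced with a few lines of code. The following Python sketch implements Eqs. (2) and (3) directly; the function names are hypothetical, ties are not handled, and returning an infinite value when the two samples do not overlap at all (which makes the denominator zero) is a choice made for this sketch rather than something stated in the text.

```python
import math

def placements(sample, other):
    """For each element of `sample`, count the strictly lower values in `other`."""
    return [sum(1 for o in other if o < s) for s in sample]

def robust_rank_order(x, y):
    """Robust rank-order statistic of Eq. (3); ties are not handled."""
    m, n = len(x), len(y)
    px = placements(x, y)                    # U(YX_i)
    py = placements(y, x)                    # U(XY_j)
    u_yx = sum(px) / m                       # mean placement U(YX)
    u_xy = sum(py) / n                       # mean placement U(XY)
    v_x = sum((p - u_yx) ** 2 for p in px)   # Eq. (2)
    v_y = sum((p - u_xy) ** 2 for p in py)
    numerator = m * u_yx - n * u_xy
    denominator = 2 * math.sqrt(v_x + v_y + u_xy * u_yx)
    if denominator == 0:
        # Non-overlapping samples: sketch-level convention, see lead-in.
        return math.copysign(math.inf, numerator)
    return numerator / denominator

x = [5.025, 6.7, 6.725, 6.75, 7.05, 7.25, 8.375]
y = [4.875, 5.125, 5.225, 5.425, 5.55, 5.75, 5.925, 6.125]
u = robust_rank_order(x, y)
print(round(u, 3))   # 3.0
```

Because the two variability terms enter Eq. (3) only through their sum, the statistic depends on V_x + V_y = 42.875 and not on which sample contributes which term.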

3. Simulation results

To illustrate the relative performance of the Wilcoxon–Mann–Whitney and robust rank-order tests, we report the results of simulations, written in the GAUSS programming language, in which the two tests are performed on artificial data sets.⁵ Each data set consists of a pair of samples, with each sample randomly drawn from a specified population. We will look at several sets of distributional assumptions. In some examples, the underlying population distributions are identical, so that both tests are theoretically appropriate. In some, the distributions have different central tendencies but the same dispersion, so again, both tests are appropriate. In a third set of examples, the distributions have the same central tendency but different dispersions, so that the robust rank-order test is appropriate, but the Wilcoxon–Mann–Whitney test is not. Within each of these cases, we will examine the effect of equal versus unequal sample sizes. We will also look at skewed population distributions, for which neither test may be theoretically appropriate.

Two attributes of statistical tests will be of interest here, corresponding to the two kinds of mistake that a hypothesis test can make. The first type of mistake is rejecting the null hypothesis when it is actually true, known as Type I error. The probability of Type I error is called the level or size of the test, and is denoted by α. We draw a distinction between the nominal level of a test (its stated level of significance) and the test’s real level (the likelihood of a true null hypothesis actually being rejected). When a test is theoretically appropriate, the nominal level and real level should be equal (the test “maintains its level”); for example, rejections of a true null hypothesis should occur roughly 5% of the time when the 5% level of significance is used. When a test is not theoretically appropriate, the closeness of the real level to the nominal level must be determined empirically. A test is biased if these are different; it is conservative (resp., liberal) if its real level is less (more) than its nominal level.

The second type of mistake is accepting the null hypothesis when it is actually false, known as Type II error. The probability of Type II error is denoted by β, and the power of the test is 1 − β (the probability of correctly rejecting a false null hypothesis). In our examples, we will know the underlying population distributions from which the samples are drawn; hence, we will know whether the null hypothesis is true or false, and therefore whether we need to be concerned with Type I or Type II error. The researcher typically does not have this luxury, and must consider the tradeoff between the two types of error. Other things equal, α and β are negatively related, though the exact tradeoff depends on the test used, the sample sizes, and the “effect size”. The effect size, discussed by Cohen (1969), Tatsuoka (1993), and others, is the hypothesized difference in the variable of interest between the two experimental treatments. There are several ways of defining the effect size; probably the most common is d, the difference in central tendency, measured in units of dispersion. With interval-scale data, this translates to the number of standard deviations apart the two means are. (This measure of effect size will be used in Section 3.3.)

3.1. Example 1: Identical populations

Our first example is a baseline, meant to show how the two tests behave when the underlying populations are identical. Eight simulations were run. For each simulation, one million data sets were created by drawing two samples uniformly, with replacement, from the interval [0, 10]; we call this Population 1.⁶ (In other words, the observations in both samples will be i.i.d. uniform [0, 10] random variables.) Each pair of samples was compared, using both tests, and the resulting empirical distributions of the corresponding test statistics were found. The simulations differed only in the sample sizes m and n; five of the simulations used pairs of equal-sized samples of sizes 6, 12, 20, 40, and 60, and three used unequal-sized samples with (m, n) = (12, 20), (20, 40), and (40, 60).⁷ For sample sizes m = n = 6, Siegel and Castellan report exact critical values; we use these to evaluate significance of the test statistics. (We are sometimes limited by which critical values have been published.) For the simulations with larger sample sizes, we use the normal approximations from Eqs. (1) and (3). We follow Siegel and Castellan in including a continuity correction in their formula for the normal approximation to the Wilcoxon–Mann–Whitney test statistic, which makes the test slightly more conservative. Leaving out the continuity correction does not substantially change the results.

The two samples are drawn from populations with the same first, second, and higher-order moments, so both tests are theoretically appropriate and the null hypothesis is correct. We are thus interested in how close a test’s real level is to its nominal level of significance α. Figure 1 displays the “excess Type I error” of both tests, for some common values of α. The excess Type I error was computed by subtracting the nominal level α from the empirically-determined real level (the proportion of times the test statistic exceeded the corresponding critical value). (Two-tailed rejection regions are used for these hypothesis tests; later, when the populations have different central tendencies, we will use one-tailed rejection regions.) Also shown, as dotted lines, are the regions corresponding to real levels not significantly different from nominal levels (binomial test, p ≥ 0.001). We see from figure 1 that in the case where exact critical values are available (m = n = 6), both tests maintain their level. For larger sample sizes, where the normal approximation must be used, the robust rank-order test is liberal (its excess Type I error is positive), especially for smaller sample sizes. The Wilcoxon–Mann–Whitney test, on the other hand, maintains its level for all sample sizes and nominal levels, except in a few cases where it is conservative (its excess Type I error is negative). The difference between the two tests decreases as the sample sizes increase. These results suggest that while the two tests themselves perform well here, there is a big difference in the quality of their normal approximations. The normal approximation to the Wilcoxon–Mann–Whitney test statistic works well, while that for the robust rank-order test is a poor approximation unless the sample sizes are quite large.

Figure 1. Excess Type I error (real level minus nominal level): identical symmetric distributions. (Circles outside the dotted lines represent real levels significantly different from nominal levels at the 0.1% level, binomial test.)
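The design of Example 1 can be sketched on a much smaller scale. The Python code below is an illustration under stated assumptions, not the paper's GAUSS program: it uses 2,000 replications rather than one million, a single pair of sample sizes (m = n = 12), a two-tailed 5% nominal level with the normal critical value 1.960, and it omits the continuity correction that the paper applies to the Wilcoxon–Mann–Whitney statistic. All function names are hypothetical.

```python
import math
import random

def placements(sample, other):
    """For each element of `sample`, count the strictly lower values in `other`."""
    return [sum(1 for o in other if o < s) for s in sample]

def wmw_z(x, y):
    """Eq. (1), without the continuity correction used in the paper."""
    m, n = len(x), len(y)
    u_yx = sum(placements(x, y)) / m
    return (u_yx - n / 2) / math.sqrt(n * (m + n + 1) / (12 * m))

def rro_u(x, y):
    """Eq. (3); +/-inf for non-overlapping samples (a sketch-level convention)."""
    m, n = len(x), len(y)
    px, py = placements(x, y), placements(y, x)
    u_yx, u_xy = sum(px) / m, sum(py) / n
    v_x = sum((p - u_yx) ** 2 for p in px)
    v_y = sum((p - u_xy) ** 2 for p in py)
    num = m * u_yx - n * u_xy
    den = 2 * math.sqrt(v_x + v_y + u_xy * u_yx)
    return num / den if den > 0 else math.copysign(math.inf, num)

def real_levels(m, n, reps, z_crit, seed=0):
    """Share of two-tailed rejections of a true null hypothesis, per test."""
    rng = random.Random(seed)
    rejections = {"wmw": 0, "rro": 0}
    for _ in range(reps):
        x = [rng.uniform(0, 10) for _ in range(m)]   # Population 1
        y = [rng.uniform(0, 10) for _ in range(n)]   # Population 1 again
        rejections["wmw"] += abs(wmw_z(x, y)) > z_crit
        rejections["rro"] += abs(rro_u(x, y)) > z_crit
    return {test: count / reps for test, count in rejections.items()}

alpha = 0.05
levels = real_levels(m=12, n=12, reps=2000, z_crit=1.960)
excess = {test: level - alpha for test, level in levels.items()}
print(excess)   # excess Type I error of each test at the nominal 5% level
```

With so few replications the estimates are noisy, so this sketch is suited to showing the mechanics of the comparison rather than reproducing figure 1.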

3.2. Example 2: Same central tendency, different dispersions

To show the effect of different levels of dispersion on the tests’ performance, we examine pairs of samples drawn from high- and low-dispersion populations. The simulation design is similar to that of the previous example, with the samples X again drawn from Population 1, the interval [0, 10]. The samples Y, however, are now drawn uniformly and with replacement from Population 2, the interval [−10, 20]. Notice that while Populations 1 and 2 have the same median, Population 2 is much more dispersed. As before, each of the simulations used one million pairs of samples. Five simulations were run with pairs of equal-sized samples of sizes 6, 12, 20, 40, and 60, and six with pairs of unequal-sized samples: (m, n) = (12, 20), (20, 40), (40, 60), (20, 12), (40, 20), and (60, 40).

As before, the samples come from populations with no difference in central tendency, so our interest is in the closeness of a test’s real level to its nominal level. The results are shown in figure 2. (Note the change in scale on the vertical axis.) As in Example 1, the robust rank-order test typically exhibits excessive Type I error. Comparing the top panel of figure 2 with figure 1, we see that when both sample sizes are the same, the robust rank-order test performs slightly worse when the population dispersions are different than when they are the same. From the bottom panel, we see that the robust rank-order test’s performance improves when the larger sample comes from the high-dispersion population (m < n), while it becomes worse when the larger sample comes from the low-dispersion population (m > n). The robust rank-order test’s bias decreases as m and n increase, though the convergence of U to its asymptotic distribution again appears slow.

Figure 2. Excess Type I error (real level minus nominal level): symmetric distributions with different dispersions. (Circles outside the dotted lines represent real levels significantly different from nominal levels at the 0.1% level, binomial test.)

The performance of the Wilcoxon–Mann–Whitney test also depends on which sample size comes from which population, and its sensitivity is much greater than the robust rank-order test’s. When sample sizes are equal, it is liberal, and its performance relative to the robust rank-order test depends on the sample sizes and the level of significance. The Wilcoxon–Mann–Whitney test performs better when samples are small and α is large, while the robust rank-order test does better for large samples and low values of α. Indeed, the Wilcoxon–Mann–Whitney test shows no tendency to improve as sample sizes increase. (This is not surprising; since the Wilcoxon–Mann–Whitney test is not theoretically appropriate here, Z in Eq. (1) needn’t converge to a standard normal random variable.)

Figure 2 also shows that, like the robust rank-order test, the Wilcoxon–Mann–Whitney test becomes more conservative when the larger sample comes from the high-dispersion population, and more liberal when it comes from the low-dispersion population.⁸ However, this sensitivity is much more stark than it was for the robust rank-order test.

Since the excess Type I error of the robust rank-order test in this example is only slightly larger than that in Example 1, it is possible that its poor performance in the current example might again be due to the poorness of the normal approximation to U, rather than to the way it handles samples from different-dispersion populations. If so, using exact critical values (if they were available) rather than the normal approximation might improve the robust rank-order test’s performance. (In contrast, eliminating the slight conservative bias of the Wilcoxon–Mann–Whitney test seen in Example 1 would not be expected to improve its performance here.) In order to examine this hypothesis, we run a further set of simulations with these distributions, but instead of using the normal approximation to U, we estimate the true critical values.⁹

Table 1. Estimated one-tailed critical values for robust rank-order test statistic.

α (%)   m = 12   m = 12   m = 20   m = 20   m = 40   m = 40   m = 60   Normal
        n = 12   n = 20   n = 20   n = 40   n = 40   n = 60   n = 60   approx.
10      1.283    1.290    1.278    1.289    1.277    1.283    1.277    1.282
5       1.704    1.701    1.675    1.681    1.653    1.659    1.651    1.645
2.5     2.117    2.094    2.039    2.039    1.991    1.993    1.978    1.960
1       2.661    2.598    2.491    2.482    2.399    2.397    2.370    2.326
0.5     3.095    2.982    2.820    2.801    2.684    2.677    2.650    2.576
0.1     4.237    3.942    3.588    3.527    3.302    3.283    3.230    3.090
0.05    4.845    4.391    3.927    3.850    3.557    3.521    3.464    3.291

Table 1 shows our estimates of the critical values for each sample size and level of significance, as well as those for the normal approximation. The results of the new simulations using these critical values, rather than those of the normal approximation, are shown in figure 3.
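One plausible way to obtain estimates of this kind is to simulate the statistic under a true null hypothesis and read off an upper-tail empirical quantile. The Python sketch below does that; it is an assumption-laden illustration, since the paper's actual estimation procedure is described in its Note 9, which is not reproduced here. In particular, drawing both samples from uniform [0, 10], the number of replications, and all function names are choices made for this sketch.

```python
import math
import random

def placements(sample, other):
    """For each element of `sample`, count the strictly lower values in `other`."""
    return [sum(1 for o in other if o < s) for s in sample]

def rro_u(x, y):
    """Eq. (3); +/-inf for non-overlapping samples (a sketch-level convention)."""
    m, n = len(x), len(y)
    px, py = placements(x, y), placements(y, x)
    u_yx, u_xy = sum(px) / m, sum(py) / n
    v_x = sum((p - u_yx) ** 2 for p in px)
    v_y = sum((p - u_xy) ** 2 for p in py)
    num = m * u_yx - n * u_xy
    den = 2 * math.sqrt(v_x + v_y + u_xy * u_yx)
    return num / den if den > 0 else math.copysign(math.inf, num)

def estimate_critical_value(m, n, alpha, reps, seed=0):
    """Empirical one-tailed critical value: the (1 - alpha) quantile of simulated U."""
    rng = random.Random(seed)
    draws = sorted(
        rro_u([rng.uniform(0, 10) for _ in range(m)],
              [rng.uniform(0, 10) for _ in range(n)])
        for _ in range(reps)
    )
    # Index of the order statistic with roughly alpha * reps draws above it.
    index = min(reps - 1, max(0, math.ceil((1 - alpha) * reps) - 1))
    return draws[index]

crit = estimate_critical_value(m=12, n=12, alpha=0.05, reps=4000)
print(round(crit, 3))
```

Table 1 reports 1.704 for m = n = 12 at the 5% level; a short run like this one should land in that neighborhood but cannot be expected to reproduce the figure to three decimals.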

We can see by comparing figures 2 and 3 that using estimated critical values rather than the normal approximation markedly improves the robust rank-order test’s performance. It is no longer liberal across the board. Rather, it is conservative when the larger sample is taken from the high-dispersion population, and liberal otherwise, like the Wilcoxon–Mann–Whitney test, but the magnitude of its bias is smaller than that of the Wilcoxon–Mann–Whitney test, often substantially so.

3.3. Example 3: Different central tendencies, same dispersion

It is not surprising that the robust rank-order test outperforms the Wilcoxon–Mann–Whitney test in cases where the underlying populations have different dispersions; after all, this was the situation for which the robust rank-order test was designed. Based on this result, one might suppose that the robust rank-order test should be used when the Wilcoxon–Mann–Whitney test is inappropriate, but otherwise (when there is reason to believe that levels of dispersion are the same across the two populations), the Wilcoxon–Mann–Whitney test should continue to be used, in order to have the best chance of rejecting the null hypothesis when it should be rejected. This turns out not to be the case. In this section, we will see that there is no loss of power involved in using the robust rank-order test instead of the Wilcoxon–Mann–Whitney test.

Figure 3. Excess Type I error (real level minus nominal level): symmetric distributions with different dispersions. (Circles outside the dotted lines represent real levels significantly different from nominal levels at the 0.1% level, binomial test.)

A comparison of the tests’ powers runs into one difficulty right away. As mentioned earlier, α and β are typically negatively related, so that one way to get higher power (that is, lower β) is to increase the real level of significance. We saw in Section 3.1 that when normal approximations were used, the robust rank-order test was often too liberal, while the Wilcoxon–Mann–Whitney test was typically slightly conservative. Thus, finding that the robust rank-order test is more likely than the Wilcoxon–Mann–Whitney test to reject a false null hypothesis for a given nominal level of significance (which does turn out to happen) doesn’t let us conclude that the robust rank-order test is more powerful; it may just be more liberal. The true question of interest is which test has a higher likelihood of rejecting a false null hypothesis for a given real level, not for a given nominal level. Therefore, instead of using the normal approximation to U as in Example 1, we use the estimated critical values from the previous section (see Note 9 for details). We run new simulations using these estimated critical values, with samples drawn from populations having the same dispersion but different central tendencies. We look at two situations: one where the difference in central tendencies is relatively large, and one where it is relatively small. Our samples X are again drawn from Population 1. For one set of simulations, the samples Y are drawn from Population 3, the interval [2, 12]; for the other, they are drawn from Population 4, the interval [1, 11]. Populations 1, 3, and 4 are thus identical except for a shift in their distributions; Population 3 is simply Population 1 shifted 2 units to the right, and Population 4 is Population 1 shifted 1 unit to the right. The magnitudes of these shifts correspond to effect sizes of approximately d = 0.693 (Cohen (1969) considers this to be a medium-to-large effect size) and d = 0.346 (a small-to-medium effect size).
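The effect sizes quoted for Populations 3 and 4 follow from the standard deviation of a uniform distribution on an interval of length 10. As a worked check, assuming (as the definition of d in Section 3 suggests) that d is the shift divided by the common population standard deviation σ:

```latex
\sigma = \frac{10 - 0}{\sqrt{12}} \approx 2.887, \qquad
d_{\text{Population 3}} = \frac{2}{\sigma} \approx 0.693, \qquad
d_{\text{Population 4}} = \frac{1}{\sigma} \approx 0.346
```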

Other than using the estimated critical values rather than the normal approximation, the simulation design is similar to that of Example 1. The results are shown in figure 4. Since the null hypothesis is false, this figure reports frequencies of correct rejections of the null (power), in contrast to the earlier figures that reported frequencies of incorrect rejections of the null (real level). The figure shows that there is little difference in power between the tests. In a few cases, the robust rank-order test is slightly more powerful, but the difference is rarely substantial.

Figure 4. Power: Wilcoxon–Mann–Whitney and robust rank-order tests (symmetric distributions with different medians).

3.4. Examples 4 and 5: Asymmetric population distributions

Fligner and Pollicello (1981) note that the robust rank-order test is theoretically appropriate when the underlying population distributions are symmetric, and discuss the meaning of the test in situations where the distributions are asymmetric. Zumbo and Coulombe (1997) show that when the population distributions are skewed, the robust rank-order test can be substantially liberal. We now look at examples involving skewed population distributions, so that we can see when the robust rank-order test is likely to show excessive Type I error, and whether the Wilcoxon–Mann–Whitney test is any better.

The particular skewed distributions we use are triangular. Population 5 is characterized by a random variable with probability density function $f(x) = x/50$ for $x \in [0, 10]$, and Population 6 by one with p.d.f. $g(y) = (y + 10\sqrt{2})/450$ for $y \in [-10\sqrt{2},\ 30 - 10\sqrt{2}]$ (see figure 5). As was true for Populations 1 and 2, Populations 5 and 6 have equal medians (approximately 7.071), while Population 6 has a variance 9 times as high as that of Population 5 (50 and $5\tfrac{5}{9}$, respectively). Since the populations are not symmetric, having the same median does not imply that they have the same mean, and indeed they don’t (approximately 6.667 and 5.858, respectively). One set of simulations in this section (Example 4) parallels Example 1, with both samples drawn from Population 5; the other (Example 5) parallels Example 2, with the samples X drawn from Population 5 and the samples Y drawn from Population 6.

Figure 5. Populations 5 and 6: probability density functions.

Figure 6. Excess Type I error (real level minus nominal level): asymmetric distributions with different dispersions. (Circles outside the dotted lines represent real levels significantly different from nominal levels at the 0.1% level; binomial test.)
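The medians, means, and variances quoted for Populations 5 and 6 follow from closed-form expressions for a density that rises linearly from zero at the lower endpoint to its peak at the upper endpoint. The sketch below is only a numerical check of those figures; the function name `right_triangular_summary` is hypothetical.

```python
import math

def right_triangular_summary(lo, hi):
    """Median, mean and variance of a density rising linearly from 0 at lo to its peak at hi."""
    width = hi - lo
    median = lo + width / math.sqrt(2.0)  # solves ((x - lo) / width)**2 = 1/2
    mean = lo + 2.0 * width / 3.0
    variance = width ** 2 / 18.0
    return median, mean, variance

shift = 10.0 * math.sqrt(2.0)
pop5 = right_triangular_summary(0.0, 10.0)             # f(x) = x/50 on [0, 10]
pop6 = right_triangular_summary(-shift, 30.0 - shift)  # g(y) = (y + 10*sqrt(2))/450
print([round(v, 3) for v in pop5])
print([round(v, 3) for v in pop6])
```

The two medians coincide, the means differ, and the variance ratio is 9, in line with the description of the two populations.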

As in Examples 1 and 2, the null hypothesis is correct, so we are again interested in the frequency of incorrect rejections, that is, how close real levels are to nominal levels. When the distributions are skewed but identical (Example 4), the results are nearly identical to those of Example 1, so these results are not shown.

Figure 6 shows the results of simulations with skewed distributions having different dispersions (Example 5). Comparing this figure to figure 2, we can see that the robust rank-order test is somewhat more liberal when the distributions are skewed than when they are symmetric. Zumbo and Coulombe (1997) have previously shown that the robust rank-order test performs poorly when the population distributions are skewed. However, we can see that the Wilcoxon–Mann–Whitney test also performs poorly here. The relative performance of the two tests here depends on sample sizes and the level of significance. When the sample sizes are equal, the robust rank-order test does better than the Wilcoxon–Mann–Whitney test when either sample sizes or the significance level is large; the Wilcoxon–Mann–Whitney test does better when both are small. When the sample sizes are unequal, the robust rank-order test maintains its level better in most cases, though the comparison is less clear cut when the larger sample comes from the high-dispersion population.

We saw before that when the population distributions were symmetric (Examples 1 and 2), the excessive liberalness of the robust rank-order test was due mainly to the use of the normal approximation, rather than to the test itself. In the current case, however, the test itself is liberal. Figure 7 shows the excess Type I error of the robust rank-order test when the estimated critical values are used (as in figure 3). Comparing figures 6 and 7, we can see that there is a decrease in the robust rank-order test’s Type I error when the approximate critical values are used rather than the normal approximation, though this decrease is slight and doesn’t substantially affect its poor absolute performance. However, it is enough to make it better than the Wilcoxon–Mann–Whitney test in nearly all cases.

Figure 7. Excess Type I error (real level minus nominal level): asymmetric distributions with different dispersions. (Circles outside the dotted lines represent real levels significantly different from nominal levels at the 0.1% level; binomial test.)

The problem both tests have in this situation is that they pick up differences in the shapes of the distributions, which are interpreted as differences in central tendency. Fligner and Pollicello (1981) note, and Zumbo and Coulombe (1997) discuss in some detail, that the robust rank-order test is not actually a test of differences in medians, but rather a test of whether the probability that a randomly chosen element from one population is higher than a randomly chosen element from the other population is equal to one-half. (This is true of the Wilcoxon–Mann–Whitney test also.) When the population distributions are symmetric, as in Examples 1–3, this condition is equivalent to the two populations’ having the same median. Identical distributions, even if asymmetric (Example 4), will satisfy that condition, and will also have the same median. However, when the populations have different shapes (Example 5), it is possible for this condition to be violated even if the medians are the same. Indeed, while Populations 5 and 6 have the same median, the probability of a randomly selected element from Population 6 (the higher-dispersion population) being higher than a randomly selected element from Population 5 (the lower-dispersion population) is more than one-half; it is approximately 0.5127. The difference between this number and one-half isn’t large, but any nonzero difference implies that as sample sizes increase, the tests become ever more likely to yield a significant result.
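The probability quoted for Populations 5 and 6 can be checked numerically. The sketch below assumes the densities f(x) = x/50 on [0, 10] and g(y) = (y + 10√2)/450 on [−10√2, 30 − 10√2] defined for Example 5, and approximates P(Y > X) as the integral of g(y)·F(y), where F is the distribution function of Population 5. The midpoint rule and all function names are illustrative choices, not part of the original simulations.

```python
import math

SHIFT = 10.0 * math.sqrt(2.0)

def cdf_pop5(x):
    """Distribution function of Population 5: F(x) = x^2 / 100 on [0, 10]."""
    if x <= 0.0:
        return 0.0
    if x >= 10.0:
        return 1.0
    return x * x / 100.0

def pdf_pop6(y):
    """Density of Population 6: g(y) = (y + 10*sqrt(2)) / 450 on its support."""
    if -SHIFT <= y <= 30.0 - SHIFT:
        return (y + SHIFT) / 450.0
    return 0.0

def prob_y_exceeds_x(steps=200_000):
    """Midpoint-rule approximation of P(Y > X), the integral of g(y) * F(y) dy."""
    lo, hi = -SHIFT, 30.0 - SHIFT
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * h
        total += pdf_pop6(y) * cdf_pop5(y)
    return total * h

p = prob_y_exceeds_x()
print(round(p, 4))
```

The result exceeds one-half by a little over 0.01, which is why both tests reject more and more often as the samples grow even though the medians are equal.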


3.5. Summary of results

We now have some idea of the strengths and weaknesses of the two tests, and in the case of the robust rank-order test, those of both the test itself and the normal approximation to its test statistic U. In Examples 1 and 3, both tests were theoretically appropriate, as the dispersions of the underlying populations were the same and the population distributions were symmetric. In Example 4, the population distributions were not symmetric, but they were identical to each other, so while not theoretically appropriate, both tests should be valid. The tests should asymptotically give the same results in each of these cases, but for finite sample sizes, the results could well be different, and often were. In Examples 1 and 4, where the underlying distributions were identical, the Wilcoxon–Mann–Whitney test either maintained its level or was slightly conservative. The robust rank-order test maintained its level when exact critical values were used, but its normal approximation was liberal. The difference in Type I error between the tests tended to be larger for small sample sizes and for low values of α; at the largest sample sizes (m = n = 60), the differences in Type I error nearly disappeared. In Example 3, where the underlying distributions had different medians, and where we used estimated critical values rather than the normal approximation for the robust rank-order test, both tests were approximately equally likely to reject the (false) null hypothesis. All of these results were equally true for same- and different-sized samples.

In Example 2, the underlying distributions had the same central tendency but different dispersions, so the Wilcoxon–Mann–Whitney test was not theoretically appropriate, but the robust rank-order test was. When normal approximations were used, the robust rank-order test behaved similarly to the way it did in Examples 1 and 4: too much Type I error, but better as the sample sizes increased. For the Wilcoxon–Mann–Whitney test, and for the robust rank-order test when estimated critical values were used, performance depended on the interaction between sample sizes and dispersion. Both tests were conservative when the larger sample was drawn from the high-dispersion population, and both were liberal when either the sample sizes were equal, or the larger sample was drawn from the low-dispersion population. However, the robust rank-order test (with the estimated critical values) maintained its level substantially better: it was typically less liberal when both tests were liberal, and less conservative when both were conservative.

In Example 5, the underlying distributions were asymmetric and had different dispersions, so neither test was theoretically appropriate. In this example, both tests yielded far too much Type I error, and the bias didn’t decrease noticeably as sample sizes increased. Even using the estimated critical values for the robust rank-order test rather than the normal approximation only improved its performance slightly. The relative performance of the two tests was broadly similar to that in Example 2.

4. Discussion and advice

Our results can thus be classified into two distinct comparisons: the Wilcoxon–Mann–Whitney test versus the robust rank-order test itself (using exact critical values), and versus the normal approximation to the robust rank-order test (when exact critical values are unavailable). The former comparison is completely one-sided; the robust rank-order test is much less sensitive to changes in sample sizes and distributional assumptions than the Wilcoxon–Mann–Whitney test, and this improvement comes with no loss in power. Unfortunately, there are many sample sizes and levels of significance for which we must resort to the robust rank-order test’s normal approximation, and here the comparison is less clear cut. The main disadvantage of the robust rank-order test’s normal approximation, relative to the Wilcoxon–Mann–Whitney test, is that it is excessively liberal, particularly for medium-sized samples (up to about 40). Set against this disadvantage is the advantage that its performance is still relatively constant across distributional assumptions, while the Wilcoxon–Mann–Whitney test behaves erratically, with far less Type I error in some cases, and far more in others, relative to both the robust rank-order test and the nominal level of significance.

Our results suggest that future research into the exact distribution of U for larger sample sizes might eventually eliminate the need for the Wilcoxon–Mann–Whitney test entirely. As mentioned previously, Feltovich (in preparation) is gradually compiling critical values for sample sizes larger than 12 (the largest sizes for which critical values have been available); some of these are presented in Appendix B. Until this work is complete, the most we can say is that in deciding which of these two tests to use, the researcher must consider the sizes of the samples and the distributions from which they come. The robust rank-order test should be used instead of the Wilcoxon–Mann–Whitney test if (a) population dispersions might be different, or (b) exact (or even approximate) critical values for the robust rank-order test are available, or (c) sample sizes are large (40 or more). The Wilcoxon–Mann–Whitney test should be used only if none of these hold.
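Conditions (a) to (c) amount to a simple decision rule, sketched below. The function name, its arguments, and the reading of "sample sizes are large" as "both samples contain at least 40 observations" are assumptions made for illustration; the text does not say how unequal sample sizes should be treated under (c).

```python
def recommended_test(m, n, dispersions_may_differ, critical_values_available, large=40):
    """Return the test suggested by conditions (a)-(c) in the text."""
    # Assumption: condition (c) requires both samples to be large.
    samples_large = min(m, n) >= large
    if dispersions_may_differ or critical_values_available or samples_large:
        return "robust rank-order"
    return "Wilcoxon-Mann-Whitney"

# Session-level data with single-digit sample sizes: condition (b) holds.
print(recommended_test(8, 8, dispersions_may_differ=False, critical_values_available=True))
# Medium-sized samples, equal dispersions assumed, no tabulated critical values.
print(recommended_test(25, 30, dispersions_may_differ=False, critical_values_available=False))
```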

In many economics experiments, there is interaction among subjects within each session, so that only session-level data can be regarded as independent observations. Such experiments generally result in very small sample sizes (often single digits), so that (b) holds. Even when sample sizes are larger, there is generally no a priori reason to suppose that second- or higher-order moments are the same across populations, so that (a) may well hold. For these reasons, there are very few situations in which the Wilcoxon–Mann–Whitney test is the appropriate one for experimental data.

We note in closing that the advice here concerns the relative merits of the two tests, and is essentially unaffected by whether the population distributions are symmetric. However, as we saw, the absolute performance of both tests can be poor when the distributions are skewed, in which case it would be better not to use either test. Boos and Brownie (1988) discuss completely distribution-free “bootstrap” versions of the Wilcoxon–Mann–Whitney and robust rank-order tests, which are theoretically appropriate in more situations than the two tests themselves. They look at examples with symmetric and skewed distributions, and find that in both cases, “bootstrapping” reduces the bias of both tests (substantially, for the Wilcoxon–Mann–Whitney test). They also find that bootstrapping lowers the power of these tests, but the loss is small. They don’t use the tests to detect differences in medians, so their results don’t correspond exactly to ours, but they are suggestive.10 Further work is needed in this area, especially concerning the tradeoff between level and power of these bootstrap tests in situations like those considered in this paper, and the behavior of bootstrapped tests for detecting differences in medians between skewed populations.

Appendix A: Critical values of U(YX) for the Wilcoxon–Mann–Whitney test (adapted from Siegel and Castellan (1988))

Entries are: critical value (nominal level of significance). These are the right-tail critical values UR, so that Prob(U(YX) ≥ UR) = α; the corresponding left-tail critical values UL (Prob(U(YX) ≤ UL) = α) can be found by UL = n − UR.

m = 3
n   3               4               5               6               7               8               9               10              11              12
    2.667 (.100)    3.333 (.114)    4.000 (.125)    5.000 (.083)    5.667 (.092)    6.333 (.097)    7.000 (.104)    7.667 (.108)    8.333 (.112)    9.333 (.090)
    3.000 (.050)    3.667 (.057)    4.667 (.036)    5.333 (.048)    6.000 (.058)    7.000 (.042)    7.667 (.050)    8.333 (.056)    9.333 (.044)    10.000 (.050)
    -               4.000 (.029)    5.000 (.018)    5.667 (.024)    6.333 (.033)    7.333 (.024)    8.000 (.032)    9.000 (.024)    9.667 (.030)    10.667 (.024)
    -               -               5.000 (.018)    5.667 (.024)    6.667 (.017)    7.333 (.024)    8.333 (.018)    9.333 (.014)    10.000 (.019)   11.000 (.015)
    -               -               -               6.000 (.012)    7.000 (.008)    7.667 (.012)    8.667 (.009)    9.667 (.007)    10.333 (.011)   11.333 (.009)
    -               -               -               -               -               -               9.000 (.0045)   10.000 (.0035)  10.667 (.0055)  11.667 (.0044)

m = 4
n   4               5               6               7               8               9               10              11              12
    3.250 (.100)    4.000 (.095)    4.750 (.086)    5.250 (.115)    6.000 (.107)    6.750 (.099)    7.500 (.094)    8.250 (.089)    8.750 (.106)
    3.500 (.057)    4.250 (.056)    5.000 (.057)    5.750 (.054)    6.500 (.054)    7.250 (.053)    8.000 (.053)    8.750 (.052)    9.500 (.052)
    3.750 (.028)    4.500 (.032)    5.250 (.033)    6.250 (.021)    7.000 (.024)    7.750 (.025)    8.500 (.027)    9.250 (.028)    10.000 (.029)
    4.000 (.014)    4.750 (.016)    5.500 (.019)    6.250 (.021)    7.250 (.014)    8.000 (.017)    8.750 (.018)    9.500 (.020)    10.250 (.021)
    -               5.000 (.008)    5.750 (.010)    6.500 (.012)    7.500 (.008)    8.250 (.010)    9.000 (.012)    10.000 (.009)   10.750 (.010)
    -               -               6.000 (.0048)   6.750 (.0061)   7.750 (.0040)   8.500 (.0056)   9.500 (.0040)   10.250 (.0051)  11.250 (.0038)
    -               -               -               -               -               9.000 (.0014)   10.000 (.0010)  10.750 (.0015)  11.750 (.0011)
    -               -               -               -               -               -               -               11.000 (.0007)  12.000 (.0005)

m = 5
n   5               6               7               8               9               10
    3.800 (.111)    4.600 (.089)    5.200 (.101)    5.800 (.111)    6.600 (.095)    7.200 (.103)
    4.200 (.048)    5.000 (.041)    5.600 (.053)    6.400 (.047)    7.000 (.056)    7.800 (.050)
    4.400 (.028)    5.200 (.026)    6.000 (.024)    6.800 (.022)    7.600 (.021)    8.200 (.028)
    4.600 (.016)    5.400 (.015)    6.200 (.015)    7.000 (.015)    7.800 (.014)    8.400 (.020)
    4.800 (.008)    5.600 (.009)    6.400 (.009)    7.200 (.009)    8.000 (.010)    8.800 (.010)
    5.000 (.0040)   5.800 (.0043)   6.600 (.0051)   7.400 (.0054)   8.200 (.0060)   9.200 (.0040)
    -               -               7.000 (.0013)   8.000 (.0008)   8.800 (.0010)   9.800 (.0007)
    -               -               -               -               9.000 (.0005)   10.000 (.0003)

m = 6
n   6               7               8               9               10
    4.500 (.090)    5.167 (.090)    5.833 (.091)    6.500 (.090)    7.167 (.090)
    4.833 (.047)    5.500 (.051)    6.167 (.054)    7.000 (.044)    7.667 (.047)
    5.000 (.032)    5.833 (.026)    6.500 (.030)    7.333 (.025)    8.000 (.028)
    5.167 (.021)    6.000 (.018)    6.667 (.021)    7.500 (.018)    8.167 (.021)
    5.500 (.008)    6.167 (.011)    7.000 (.010)    7.833 (.009)    8.500 (.011)
    5.667 (.0043)   6.500 (.0041)   7.333 (.0040)   8.000 (.0060)   8.833 (.0055)
    6.000 (.0011)   6.833 (.0012)   7.833 (.0007)   8.667 (.0008)   9.500 (.0009)
    -               7.000 (.0006)   8.000 (.0003)   8.833 (.0004)   9.667 (.0005)

m = 7
n   7               8               9               10
    5.000 (.104)    5.714 (.095)    6.286 (.105)    7.000 (.097)
    5.429 (.049)    6.143 (.047)    6.857 (.045)    7.429 (.054)
    5.714 (.026)    6.714 (.027)    7.143 (.027)    8.143 (.028)
    5.857 (.019)    6.571 (.020)    7.286 (.021)    8.000 (.022)
    6.143 (.009)    6.857 (.010)    7.571 (.012)    8.429 (.009)
    6.286 (.0055)   7.143 (.0047)   7.857 (.0058)   8.714 (.0048)
    6.714 (.0012)   7.571 (.0011)   8.429 (.0010)   9.286 (.0010)
    6.857 (.0006)   7.714 (.0006)   8.571 (.0006)   9.571 (.0004)

m = 8
n   8               9               10
    5.625 (.097)    6.250 (.100)    6.875 (.102)
    6.000 (.052)    6.750 (.046)    7.375 (.051)
    6.375 (.025)    7.125 (.023)    7.750 (.027)
    6.500 (.019)    7.250 (.018)    7.875 (.022)
    6.750 (.010)    7.500 (.010)    8.250 (.010)
    7.000 (.0052)   7.750 (.0056)   8.625 (.0043)
    7.500 (.0009)   8.375 (.0008)   9.125 (.0010)
    7.625 (.0005)   8.500 (.0005)   9.375 (.0004)

m = 9
n   9               10
    6.222 (.095)    6.778 (.106)
    6.667 (.047)    7.333 (.047)
    7.000 (.025)    7.667 (.027)
    7.111 (.020)    7.778 (.021)
    7.444 (.009)    8.111 (.011)
    7.667 (.0053)   8.444 (.005)
    8.222 (.0009)   9.000 (.0011)
    8.444 (.0004)   9.222 (.0005)

m = 10
n   10
    6.800 (.095)
    7.200 (.053)
    7.600 (.026)
    7.700 (.022)
    8.100 (.009)
    8.400 (.0045)
    8.900 (.0010)
    9.100 (.0005)
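The table is used by comparing a computed value of U(YX) with the tabulated right-tail entry, or with its left-tail counterpart UL = n − UR. The sketch below assumes that U(YX) is the mean placement, that is, the average number of Y observations lying below each X observation; this reading is consistent with the tabulated values but the definition itself is not restated in this appendix. The data and function names are hypothetical, and counting ties as one half is an added assumption.

```python
def mean_placement(x, y):
    """Average number of y observations lying below each x observation (ties count one half)."""
    total = sum((yj < xi) + 0.5 * (yj == xi) for xi in x for yj in y)
    return total / len(x)

def left_tail_critical_value(n, right_tail):
    """Left-tail critical value implied by UL = n - UR."""
    return n - right_tail

x = [7.1, 8.4, 9.0]  # hypothetical sample X, m = 3
y = [1.2, 3.5, 4.8]  # hypothetical sample Y, n = 3
u_yx = mean_placement(x, y)
u_right = 3.000      # tabulated entry for m = 3, n = 3 at the .050 level
u_left = left_tail_critical_value(len(y), u_right)
reject_right = u_yx >= u_right
print(u_yx, u_left, reject_right)
```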

Page 18: Nonparametric Tests of Differences in Medians: Comparison of …users.monash.edu.au/~nfelt/papers/robust.pdf · 2003-08-12 · Department of Economics, University of Houston, Houston,

P1: GIU

Experimental Economics KL2082-04 July 31, 2003 17:23

UNCORRECTEDPROOF

290 FELTOVICH

Appendix B: Critical values for the robust rank-order test (from Feltovich (in preparation))

Shown are right-tail critical values UR, where Prob(U ≥ UR) = α; the corresponding left-tail critical values UL (Prob(U ≤ UL) = α) can be found by UL = −UR. Exact values of α are also given. Critical values for (m, n) are identical to those for (n, m).

For m up to 9 and n up to 20, for m = 10 and n up to 18, for m = 11 and n up to 16, for m = 12 and n up to 14, and for m = n = 13, critical values are exact. For larger sample sizes, critical values are estimated from Monte Carlo simulations.
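To use these critical values, the statistic U is computed from the two samples and compared with the tabulated UR. The sketch below uses the usual placement form of the Fligner–Policello statistic as it is commonly stated; the sign convention (positive when X tends to exceed Y), the half-count for ties, and the function name are assumptions, and the paper's own definition of U takes precedence if it differs. An infinite value arises when the two samples do not overlap, which is the case the ∞ entries in the table correspond to.

```python
import math

def robust_rank_order(x, y):
    """Placement-based robust rank-order statistic; positive when x tends to exceed y."""
    m, n = len(x), len(y)
    # Placements: how many observations of the other sample lie below each point.
    px = [sum((yj < xi) + 0.5 * (yj == xi) for yj in y) for xi in x]
    py = [sum((xi < yj) + 0.5 * (xi == yj) for xi in x) for yj in y]
    mean_px, mean_py = sum(px) / m, sum(py) / n
    vx = sum((p - mean_px) ** 2 for p in px)
    vy = sum((q - mean_py) ** 2 for q in py)
    numerator = sum(px) - sum(py)
    denominator = 2.0 * math.sqrt(vx + vy + mean_px * mean_py)
    if denominator == 0.0:  # the two samples are completely separated
        return math.copysign(math.inf, numerator)
    return numerator / denominator

u_separated = robust_rank_order([4, 5, 6], [1, 2, 3])    # hypothetical data
u_interleaved = robust_rank_order([1, 3, 5], [2, 4, 6])  # hypothetical data
critical_value = 2.347  # tabulated entry for m = n = 3 at the .100 level
print(u_separated >= critical_value, u_interleaved >= critical_value)
```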

m = 3
n   3               4               5               6               7               8
    2.347 (.100)    1.732 (.114)    1.632 (.089)    1.897 (.083)    1.644 (.092)    1.500 (.097)
    ∞ (.050)        3.273 (.057)    2.324 (.071)    2.912 (.048)    2.605 (.042)    2.777 (.042)
    -               ∞ (.029)        4.195 (.036)    5.116 (.024)    3.497 (.033)    4.082 (.024)
    -               -               ∞ (.018)        ∞ (.012)        6.037 (.017)    6.957 (.012)
    -               -               -               -               ∞ (.0083)       ∞ (.0061)

m = 4
n   4               5               6               7               8
    1.586 (.100)    1.500 (.095)    1.434 (.090)    1.428 (.100)    1.371 (.101)
    2.502 (.057)    2.160 (.056)    2.247 (.048)    2.104 (.052)    2.162 (.051)
    4.483 (.029)    3.265 (.032)    3.021 (.024)    3.295 (.021)    2.868 (.024)
    ∞ (.014)        5.692 (.016)    6.899 (.010)    4.786 (.012)    4.252 (.010)
    -               ∞ (.0079)       ∞ (.0048)       8.106 (.0061)   9.313 (.0040)
    -               -               -               -               ∞ (.0020)

m = 5
n   5               6               7               8
    1.447 (.103)    1.362 (.102)    1.308 (.100)    1.378 (.099)
    2.063 (.048)    2.000 (.043)    1.954 (.051)    1.919 (.048)
    2.859 (.028)    2.622 (.026)    2.465 (.025)    2.556 (.025)
    7.187 (.008)    3.913 (.011)    4.246 (.009)    3.730 (.009)
    ∞ (.0040)       8.682 (.0043)   6.072 (.0050)   4.938 (.0054)
    -               ∞ (.0022)       ∞ (.0013)       ∞ (.0008)

m = 6
n   6               7               8
    1.335 (.104)    1.326 (.100)    1.327 (.100)
    1.860 (.051)    1.816 (.050)    1.796 (.050)
    2.550 (.022)    2.500 (.024)    2.443 (.025)
    3.712 (.011)    3.519 (.010)    3.357 (.009)
    4.803 (.0054)   4.438 (.0058)   4.337 (.0050)
    ∞ (.0011)       12.247 (.0012)  14.029 (.0007)
    -               ∞ (.0006)       ∞ (.0003)

m = 7
n   7               8
    1.333 (.099)    1.310 (.100)
    1.804 (.050)    1.807 (.050)
    2.331 (.025)    2.263 (.025)
    3.195 (.010)    3.088 (.010)
    4.220 (.0050)   3.899 (.0050)
    8.643 (.0012)   7.091 (.0011)
    14.317 (.0006)  16.388 (.0031)

m = 8
n   8
    1.295 (.101)
    1.766 (.050)
    2.269 (.024)
    2.954 (.010)
    3.659 (.0050)
    6.300 (.0009)
    8.166 (.0005)

m = 3
n   9               10              11              12              13              14
    1.575 (.100)    1.611 (.101)    1.638 (.099)    1.616 (.101)    1.666 (.100)    1.686 (.100)
    2.353 (.050)    2.553 (.049)    2.369 (.055)    2.449 (.048)    2.468 (.050)    2.474 (.071)
    4.666 (.018)    3.651 (.025)    3.503 (.028)    3.406 (.024)    3.613 (.025)    3.480 (.036)
    7.876 (.009)    8.795 (.007)    5.831 (.011)    5.000 (.011)    5.476 (.009)    5.392 (.012)
    ∞ (.0046)       ∞ (.0035)       8.795 (.0070)   10.634 (.0044)  11.553 (.0036)  7.578 (.0083)
    -               -               ∞ (.0028)       ∞ (.0022)       ∞ (.0018)       ∞ (.0015)

m = 4
n   9               10              11              12              13              14
    1.434 (.094)    1.466 (.099)    1.448 (.100)    1.455 (.100)    1.448 (.100)    1.457 (.100)
    2.057 (.050)    2.000 (.050)    2.067 (.049)    2.096 (.049)    2.098 (.050)    2.086 (.050)
    2.683 (.025)    2.951 (.025)    2.776 (.026)    2.847 (.024)    2.875 (.025)    2.917 (.025)
    4.423 (.010)    4.276 (.010)    4.017 (.011)    3.904 (.010)    4.122 (.010)    4.089 (.010)
    6.302 (.0056)   5.479 (.0050)   5.548 (.0051)   5.282 (.0055)   5.120 (.0050)   5.344 (.0049)
    ∞ (.0014)       ∞ (.0010)       ∞ (.0007)       14.139 (.0011)  15.346 (.0008)  16.552 (.0006)
    -               -               -               ∞ (.0006)       ∞ (.0004)       ∞ (.0003)

m = 5
n   9               10              11              12              13              14
    1.361 (.099)    1.361 (.098)    1.340 (.100)    1.369 (.100)    1.381 (.100)    1.379 (.100)
    1.893 (.050)    1.900 (.049)    1.891 (.051)    1.923 (.049)    1.900 (.050)    1.936 (.050)
    2.536 (.024)    2.496 (.025)    2.497 (.025)    2.479 (.025)    2.525 (.025)    2.525 (.025)
    3.388 (.010)    3.560 (.009)    3.435 (.011)    3.444 (.010)    3.489 (.009)    3.448 (.010)
    4.830 (.0050)   4.588 (.0050)   4.310 (.0055)   4.323 (.0052)   4.328 (.0047)   4.225 (.0050)
    13.166 (.0010)  14.660 (.0007)  9.805 (.0009)   7.698 (.0011)   8.387 (.0008)   7.033 (.0010)
    ∞ (.0005)       ∞ (.0003)       16.154 (.0005)  10.737 (.0006)  11.669 (.0005)  9.909 (.0004)

m = 6
n   9               10              11              12              13              14
    1.338 (.098)    1.339 (.100)    1.320 (.100)    1.333 (.099)    1.345 (.100)    1.347 (.100)
    1.845 (.050)    1.829 (.050)    1.833 (.050)    1.835 (.050)    1.821 (.050)    1.848 (.050)
    2.349 (.024)    2.339 (.025)    2.337 (.025)    2.349 (.026)    2.344 (.025)    2.362 (.025)
    3.224 (.010)    3.164 (.010)    3.161 (.010)    3.151 (.010)    3.116 (.010)    3.150 (.010)
    4.172 (.0052)   3.996 (.0049)   3.847 (.0049)   3.912 (.0046)   3.862 (.0049)   3.852 (.0050)
    7.466 (.0010)   7.652 (.0009)   6.544 (.0010)   6.060 (.0010)   6.358 (.0010)   5.937 (.0010)
    15.812 (.0004)  10.685 (.0005)  9.238 (.0004)   8.056 (.0005)   7.846 (.0004)   7.200 (.0005)

m = 7
n   9               10              11              12              13              14
    1.320 (.100)    1.313 (.100)    1.302 (.100)    1.318 (.099)    1.317 (.100)    1.318 (.100)
    1.790 (.051)    1.776 (.050)    1.769 (.050)    1.787 (.050)    1.773 (.050)    1.788 (.050)
    2.287 (.025)    2.248 (.025)    2.240 (.025)    2.239 (.026)    2.258 (.025)    2.260 (.025)
    2.967 (.010)    3.002 (.010)    2.979 (.010)    2.929 (.010)    2.934 (.010)    2.944 (.010)
    3.776 (.0050)   3.674 (.0051)   3.637 (.0049)   3.573 (.0049)   3.546 (.0050)   3.547 (.0050)
    6.198 (.0010)   5.612 (.0010)   5.665 (.0010)   5.444 (.0010)   5.354 (.0010)   5.178 (.0010)
    8.765 (.0004)   7.777 (.0005)   7.339 (.0005)   6.873 (.0005)   6.444 (.0005)   6.354 (.0005)

m = 8
n   9               10              11              12              13              14
    1.283 (.100)    1.284 (.100)    1.290 (.100)    1.295 (.100)    1.300 (.100)    1.303 (.100)
    1.765 (.051)    1.756 (.050)    1.746 (.050)    1.759 (.050)    1.755 (.050)    1.751 (.050)
    2.236 (.026)    2.209 (.025)    2.205 (.025)    2.198 (.025)    2.198 (.025)    2.193 (.025)
    2.925 (.010)    2.880 (.010)    2.870 (.010)    2.845 (.010)    2.835 (.010)    2.812 (.010)
    3.505 (.0051)   3.451 (.0050)   3.408 (.0050)   3.371 (.0050)   3.369 (.0050)   3.344 (.0050)
    5.773 (.0011)   5.413 (.0009)   5.137 (.0010)   4.923 (.0010)   4.940 (.0010)   4.825 (.0010)
    7.155 (.0005)   6.492 (.0005)   6.178 (.0005)   5.928 (.0005)   5.890 (.0005)   5.684 (.0005)

m = 3
n   15              16              17              18              19              20
    1.677 (.100)    1.726 (.099)    1.679 (.100)    1.708 (.100)    1.739 (.100)    1.732 (.099)
    2.524 (.048)    2.541 (.051)    2.556 (.049)    2.587 (.050)    2.575 (.050)    2.619 (.055)
    3.617 (.025)    3.577 (.026)    3.572 (.025)    3.595 (.025)    3.664 (.024)    3.570 (.028)
    5.826 (.009)    5.477 (.010)    5.152 (.010)    5.365 (.011)    5.403 (.010)    5.421 (.010)
    8.160 (.0049)   6.904 (.0052)   7.380 (.0044)   7.129 (.0053)   7.562 (.0046)   7.050 (.0056)
    ∞ (.0012)       ∞ (.0010)       ∞ (.0009)       ∞ (.0008)       17.066 (.0013)  17.985 (.0011)
    -               -               -               -               ∞ (.0007)       ∞ (.0006)

m = 4
n   15              16              17              18              19              20
    1.470 (.100)    1.485 (.100)    1.493 (.100)    1.499 (.100)    1.508 (.100)    1.517 (.100)
    2.121 (.050)    2.138 (.050)    2.164 (.050)    2.172 (.050)    2.165 (.050)    2.191 (.050)
    2.864 (.025)    2.864 (.026)    2.903 (.025)    2.904 (.025)    2.922 (.025)    2.929 (.025)
    4.000 (.009)    4.056 (.010)    4.121 (.010)    4.108 (.010)    4.125 (.010)    4.084 (.010)
    5.116 (.0046)   5.286 (.0050)   5.182 (.0050)   5.179 (.0048)   5.060 (.0050)   5.189 (.0050)
    10.844 (.0010)  9.149 (.0010)   9.760 (.0008)   9.477 (.0010)   8.794 (.0011)   9.295 (.0011)
    17.758 (.0005)  18.965 (.0004)  20.171 (.0003)  13.113 (.0006)  13.869 (.0004)  11.592 (.0005)

m = 5
n   15              16              17              18              19              20
    1.389 (.100)    1.399 (.100)    1.404 (.100)    1.411 (.100)    1.414 (.100)    1.419 (.100)
    1.937 (.050)    1.942 (.050)    1.960 (.050)    1.963 (.050)    1.970 (.050)    1.984 (.050)
    2.531 (.025)    2.547 (.026)    2.565 (.025)    2.565 (.025)    2.586 (.025)    2.597 (.025)
    3.437 (.010)    3.470 (.010)    3.482 (.010)    3.464 (.010)    3.484 (.010)    3.510 (.010)
    4.187 (.0050)   4.269 (.0049)   4.217 (.0050)   4.292 (.0050)   4.240 (.0050)   4.271 (.0050)
    7.267 (.0010)   6.925 (.0009)   7.057 (.0010)   6.945 (.0010)   6.686 (.0010)   6.694 (.0010)
    9.765 (.0004)   9.114 (.0005)   8.680 (.0005)   8.464 (.0005)   8.379 (.0004)   8.429 (.0005)

m = 6
n   15              16              17              18              19              20
    1.347 (.100)    1.357 (.100)    1.360 (.100)    1.364 (.100)    1.371 (.100)    1.374 (.100)
    1.843 (.050)    1.856 (.050)    1.854 (.050)    1.862 (.050)    1.868 (.050)    1.877 (.050)
    2.359 (.025)    2.374 (.026)    2.382 (.025)    2.385 (.025)    2.392 (.025)    2.401 (.025)
    3.127 (.010)    3.155 (.010)    3.136 (.010)    3.162 (.010)    3.159 (.010)    3.176 (.010)
    3.765 (.0050)   3.808 (.0050)   3.814 (.0050)   3.808 (.0050)   3.822 (.0050)   3.825 (.0050)
    5.933 (.0009)   5.906 (.0010)   5.771 (.0010)   5.697 (.0010)   5.687 (.0010)   5.674 (.0010)
    7.442 (.0005)   7.109 (.0005)   6.890 (.0005)   6.843 (.0005)   6.705 (.0005)   6.681 (.0005)

m = 7
n   15              16              17              18              19              20
    1.325 (.100)    1.332 (.100)    1.333 (.100)    1.337 (.100)    1.342 (.100)    1.345 (.100)
    1.788 (.050)    1.797 (.050)    1.802 (.050)    1.809 (.050)    1.812 (.050)    1.817 (.050)
    2.268 (.025)    2.271 (.026)    2.271 (.025)    2.277 (.025)    2.290 (.025)    2.292 (.025)
    2.945 (.010)    2.946 (.010)    2.961 (.010)    2.957 (.010)    2.962 (.010)    2.972 (.010)
    3.513 (.0050)   3.501 (.0050)   3.509 (.0050)   3.517 (.0050)   3.516 (.0050)   3.527 (.0050)
    5.167 (.0010)   5.184 (.0010)   5.124 (.0010)   5.092 (.0010)   5.096 (.0010)   5.084 (.0010)
    6.075 (.0005)   6.010 (.0005)   5.989 (.0005)   5.970 (.0005)   5.883 (.0005)   5.876 (.0005)

m = 8
n   15              16              17              18              19              20
    1.310 (.100)    1.314 (.100)    1.317 (.100)    1.318 (.100)    1.322 (.100)    1.326 (.100)
    1.752 (.050)    1.761 (.050)    1.766 (.050)    1.769 (.050)    1.773 (.050)    1.777 (.050)
    2.206 (.025)    2.201 (.025)    2.210 (.025)    2.215 (.025)    2.217 (.025)    2.225 (.025)
    2.821 (.010)    2.822 (.010)    2.826 (.010)    2.823 (.010)    2.833 (.010)    2.834 (.010)
    3.334 (.0050)   3.339 (.0050)   3.331 (.0050)   3.328 (.0050)   3.324 (.0050)   3.341 (.0050)
    4.767 (.0010)   4.735 (.0010)   4.687 (.0010)   4.702 (.0010)   4.685 (.0010)   4.673 (.0010)
    5.591 (.0005)   5.485 (.0005)   5.449 (.0005)   5.402 (.0005)   5.377 (.0005)   5.358 (.0005)

m = 9
n   9               10              11              12              13              14
    1.294 (.101)    1.304 (.099)    1.288 (.100)    1.299 (.100)    1.291 (.100)    1.299 (.100)
    1.744 (.050)    1.742 (.050)    1.744 (.050)    1.737 (.050)    1.729 (.050)    1.729 (.050)
    2.206 (.025)    2.181 (.025)    2.172 (.025)    2.172 (.025)    2.166 (.025)    2.167 (.025)
    2.857 (.010)    2.802 (.010)    2.798 (.010)    2.770 (.010)    2.755 (.010)    2.744 (.010)
    3.450 (.0049)   3.373 (.0050)   3.300 (.0050)   3.257 (.0050)   3.240 (.0050)   3.224 (.0050)
    5.204 (.0010)   5.041 (.0010)   4.783 (.0010)   4.685 (.0010)   4.601 (.0010)   4.555 (.0010)
    6.572 (.0005)   6.019 (.0005)   5.655 (.0005)   5.535 (.0005)   5.370 (.0005)   5.246 (.0005)

m = 10
n   10              11              12              13              14
    1.295 (.100)    1.284 (.100)    1.284 (.100)    1.287 (.100)    1.292 (.100)
    1.723 (.050)    1.726 (.050)    1.720 (.050)    1.721 (.050)    1.714 (.050)
    2.161 (.025)    2.152 (.025)    2.144 (.025)    2.138 (.025)    2.129 (.025)
    2.770 (.010)    2.733 (.010)    2.718 (.010)    2.703 (.010)    2.693 (.010)
    3.266 (.0050)   3.222 (.0050)   3.179 (.0050)   3.150 (.0050)   3.148 (.0050)
    4.747 (.0010)   4.604 (.0010)   4.476 (.0010)   4.394 (.0010)   4.347 (.0010)
    5.686 (.0005)   5.383 (.0005)   5.224 (.0005)   5.088 (.0005)   4.965 (.0005)

m = 11
n   11              12              13              14
    1.289 (.100)    1.290 (.100)    1.286 (.100)    1.286 (.100)
    1.716 (.050)    1.708 (.050)    1.706 (.050)    1.704 (.050)
    2.138 (.025)    2.127 (.025)    2.116 (.025)    2.114 (.025)
    2.705 (.010)    2.683 (.010)    2.669 (.010)    2.662 (.010)
    3.170 (.0050)   3.132 (.0050)   3.103 (.0050)   3.088 (.0050)
    4.456 (.0010)   4.357 (.0010)   4.261 (.0010)   4.214 (.0010)
    5.155 (.0005)   4.995 (.0005)   4.876 (.0005)   4.770 (.0005)

m = 12
n   12              13              14
    1.283 (.100)    1.284 (.100)    1.288 (.100)
    1.704 (.050)    1.704 (.050)    1.701 (.050)
    2.117 (.025)    2.105 (.025)    2.098 (.025)
    2.661 (.010)    2.642 (.010)    2.626 (.010)
    3.095 (.0050)   3.065 (.0050)   3.046 (.0050)
    4.237 (.0010)   4.162 (.0010)   4.107 (.0010)
    4.845 (.0005)   4.736 (.0005)   4.631 (.0005)

m = 13
n   13              14
    1.280 (.100)    1.279 (.100)
    1.698 (.050)    1.697 (.050)
    2.097 (.025)    2.094 (.025)
    2.621 (.010)    2.614 (.010)
    3.029 (.0050)   3.016 (.0050)
    4.094 (.0010)   4.053 (.0010)
    4.606 (.0005)   4.538 (.0005)

m = 14
n   14
    1.283 (.100)
    1.690 (.050)
    2.084 (.025)
    2.590 (.010)
    2.985 (.0050)
    3.998 (.0010)
    4.500 (.0005)

Page 22: Nonparametric Tests of Differences in Medians: Comparison of …users.monash.edu.au/~nfelt/papers/robust.pdf · 2003-08-12 · Department of Economics, University of Houston, Houston,

P1: GIU

Experimental Economics KL2082-04 July 31, 2003 17:23

UNCORRECTEDPROOF

294 FELTOVICH

(Continued.)

 m   level    n=15    n=16    n=17    n=18    n=19    n=20
 9   .100    1.297   1.304   1.304   1.308   1.310   1.312
 9   .050    1.733   1.739   1.740   1.744   1.746   1.750
 9   .025    2.162   2.164   2.167   2.168   2.171   2.175
 9   .010    2.743   2.741   2.735   2.740   2.749   2.746
 9   .0050   3.213   3.207   3.200   3.197   3.199   3.204
 9   .0010   4.508   4.476   4.447   4.441   4.418   4.403
 9   .0005   5.145   5.154   5.062   5.056   5.015   5.003

10   .100    1.294   1.296   1.297   1.299   1.300   1.303
10   .050    1.721   1.721   1.723   1.726   1.727   1.732
10   .025    2.134   2.133   2.133   2.136   2.138   2.141
10   .010    2.687   2.686   2.687   2.684   2.682   2.682
10   .0050   3.135   3.118   3.117   3.115   3.110   3.108
10   .0010   4.298   4.275   4.244   4.232   4.214   4.218
10   .0005   4.898   4.868   4.819   4.784   4.759   4.745

11   .100    1.288   1.287   1.292   1.295   1.295   1.295
11   .050    1.706   1.707   1.707   1.711   1.717   1.713
11   .025    2.113   2.110   2.107   2.116   2.116   2.115
11   .010    2.649   2.641   2.642   2.645   2.641   2.635
11   .0050   3.075   3.064   3.050   3.052   3.049   3.043
11   .0010   4.163   4.132   4.121   4.086   4.078   4.060
11   .0005   4.707   4.672   4.650   4.623   4.593   4.543

12   .100    1.285   1.287   1.285   1.288   1.289   1.290
12   .050    1.703   1.705   1.698   1.701   1.701   1.701
12   .025    2.100   2.097   2.091   2.095   2.092   2.094
12   .010    2.626   2.607   2.605   2.605   2.609   2.598
12   .0050   3.035   3.011   2.999   2.994   2.995   2.982
12   .0010   4.074   4.017   3.980   3.989   3.985   3.942
12   .0005   4.585   4.501   4.448   4.453   4.437   4.391

13   .100    1.281   1.284   1.286   1.284   1.288   1.286
13   .050    1.695   1.693   1.696   1.693   1.695   1.693
13   .025    2.086   2.085   2.084   2.078   2.080   2.079
13   .010    2.596   2.592   2.586   2.581   2.574   2.573
13   .0050   2.993   2.986   2.971   2.971   2.951   2.949
13   .0010   3.973   3.972   3.919   3.905   3.891   3.878
13   .0005   4.488   4.420   4.384   4.371   4.298   4.314

14   .100    1.280   1.281   1.281   1.283   1.284   1.287
14   .050    1.693   1.689   1.687   1.687   1.689   1.691
14   .025    2.080   2.076   2.067   2.070   2.069   2.068
14   .010    2.581   2.577   2.556   2.561   2.557   2.553
14   .0050   2.962   2.956   2.930   2.933   2.925   2.924
14   .0010   3.901   3.924   3.850   3.846   3.799   3.802
14   .0005   4.377   4.397   4.261   4.277   4.217   4.204


(Continued.)

 m   level    n=15    n=16    n=17    n=18    n=19    n=20
15   .100    1.277   1.278   1.278   1.281   1.283   1.284
15   .050    1.686   1.685   1.685   1.686   1.684   1.685
15   .025    2.071   2.065   2.065   2.064   2.061   2.063
15   .010    2.562   2.554   2.551   2.545   2.536   2.539
15   .0050   2.945   2.931   2.923   2.907   2.897   2.896
15   .0010   3.876   3.852   3.790   3.809   3.768   3.763
15   .0005   4.334   4.296   4.209   4.242   4.160   4.139

16   .100      -     1.280   1.279   1.281   1.278   1.281
16   .050      -     1.684   1.684   1.681   1.679   1.680
16   .025      -     2.062   2.059   2.055   2.051   2.049
16   .010      -     2.549   2.540   2.529   2.527   2.520
16   .0050     -     2.912   2.897   2.888   2.884   2.871
16   .0010     -     3.800   3.802   3.729   3.721   3.714
16   .0005     -     4.229   4.224   4.141   4.106   4.084

17   .100      -       -     1.282   1.280   1.282   1.281
17   .050      -       -     1.683   1.678   1.678   1.676
17   .025      -       -     2.054   2.049   2.051   2.046
17   .010      -       -     2.531   2.525   2.517   2.511
17   .0050     -       -     2.885   2.880   2.865   2.857
17   .0010     -       -     3.734   3.721   3.695   3.666
17   .0005     -       -     4.124   4.113   4.070   4.038

18   .100      -       -       -     1.278   1.280   1.281
18   .050      -       -       -     1.679   1.677   1.677
18   .025      -       -       -     2.050   2.050   2.050
18   .010      -       -       -     2.518   2.505   2.503
18   .0050     -       -       -     2.857   2.846   2.849
18   .0010     -       -       -     3.696   3.655   3.647
18   .0005     -       -       -     4.089   4.026   4.015

19   .100      -       -       -       -     1.278   1.279
19   .050      -       -       -       -     1.674   1.675
19   .025      -       -       -       -     2.040   2.038
19   .010      -       -       -       -     2.507   2.497
19   .0050     -       -       -       -     2.840   2.835
19   .0010     -       -       -       -     3.646   3.627
19   .0005     -       -       -       -     4.024   3.957

20   .100      -       -       -       -       -     1.278
20   .050      -       -       -       -       -     1.675
20   .025      -       -       -       -       -     2.039
20   .010      -       -       -       -       -     2.491
20   .0050     -       -       -       -       -     2.820
20   .0010     -       -       -       -       -     3.588
20   .0005     -       -       -       -       -     3.927
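Critical values of this kind can be approximated by simulation, in the spirit of the procedure described in note 9: draw two samples from identical populations many times, compute the test statistic each time, and read off upper-tail order statistics. The Python sketch below is illustrative only and is not the author's GAUSS program. It assumes the usual placement-based form of the robust rank-order statistic with no tie correction, uses uniform populations as a stand-in for the Example 1 distributions (an assumption; with continuous, identical populations the statistic depends only on the ordering of the observations), and runs far fewer replications than the one million reported in the paper. The function names are hypothetical.

```python
import numpy as np


def robust_rank_order(x, y):
    """Robust rank-order statistic for two samples, assuming no ties."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Placements: for each x_i, how many y values lie below it (and vice versa).
    px = (y[None, :] < x[:, None]).sum(axis=1)
    py = (x[None, :] < y[:, None]).sum(axis=1)
    # Sums of squared deviations of the placements around their means.
    vx = ((px - px.mean()) ** 2).sum()
    vy = ((py - py.mean()) ** 2).sum()
    num = float(px.sum() - py.sum())
    den = 2.0 * np.sqrt(vx + vy + px.mean() * py.mean())
    if den == 0.0:
        # Completely separated samples: the statistic is infinite.
        return float(np.copysign(np.inf, num))
    return num / den


def simulate_critical_values(m, n, levels=(0.10, 0.05), reps=20000, seed=0):
    """Estimate upper-tail critical values under identical populations."""
    rng = np.random.default_rng(seed)
    stats = np.sort([
        robust_rank_order(rng.uniform(size=m), rng.uniform(size=n))
        for _ in range(reps)
    ])
    # For each level, take the k-th highest simulated value, k = level * reps.
    return {a: float(stats[reps - int(round(a * reps))]) for a in levels}


# Estimates for m = n = 20, to compare with the tabulated 1.278 and 1.675.
cv = simulate_critical_values(20, 20)
print(cv)
```

With this few replications the estimates carry noticeable Monte Carlo error, so agreement with the tabulated values should only be expected to roughly two decimal places; the extreme levels (.0010 and .0005) would need many more replications.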


Acknowledgments

I thank John Antel, Steven Craig, John Duffy, Brit Grosskopf, Jack Ochs, Roger Sherman, Nathaniel Wilcox, and two anonymous referees for helpful comments.


Notes

1. For interval-scale data, the mean is the generally-accepted measure of central tendency. For ordinal-scale data, in contrast, the notion of mean is meaningless; rather, the median is the usual measure of central tendency. The tests we will examine were developed for use with ordinal-scale data, but are also appropriate for interval-scale data. In the experimental economics literature, we can find examples of both kinds of data being tested with these tests, so we will usually use the more generic terms “central tendency” and “first moment” rather than median or mean. We will also use the term “dispersion” rather than “variance” for the same reason.

2. If the research question is whether the distributions are different in any way (central tendency, dispersion, or higher-order moments), then the nonparametric Kolmogorov–Smirnov test should be used instead (see Siegel and Castellan, 1988). Naturally, one drawback of this test is that if it gives a significant result, the researcher still doesn’t know which of the moments is different.

3. When the data contain X values that are tied with Y values, a correction (found in Siegel and Castellan) must be employed. Our examples will be drawn from populations that are nearly continuous, so the chance of a tie is minuscule. Many economic experiments, however, either use discrete choice sets or result in a small finite number of actual choices, so that ties become more likely.

4. We draw a distinction between the terms “theoretically appropriate”, meaning that the assumptions implicit in the test are satisfied, and “valid”, meaning that the assumed level of significance of the test is approximately equal to the actual probability of rejecting a true null hypothesis. It is already known that the Wilcoxon–Mann–Whitney test is not theoretically appropriate when higher-order moments are unequal; part of the purpose of this paper is to show when it is not valid.

5. The simulation programs and results, including those mentioned but not shown in the paper, are available from the author.

6. The draws were done with the GAUSS random number generator “rndu”. This generates pseudorandom numbers, not truly random numbers, so there is the possibility that some nonrandomness in the draws could affect the results. In order to examine this possibility, all of the simulations were replicated using a list of 1,200,000 true random numbers obtained from www.random.org, which produces random streams of bits derived from atmospheric noise. These random numbers were enough for 10,000 runs of each simulation. No systematic differences were found between those results and the simulation results reported in the paper.

7. Note that because the two populations are identical in this example, it is irrelevant whether the first or the second sample is the larger-sized one; results with sample sizes (m, n) should be the same as with sample sizes (n, m). In the next example, when the populations have different levels of dispersion, switching sample sizes will affect the results.

8. This sensitivity of the Wilcoxon–Mann–Whitney test to the combination of different sample sizes and different population dispersions was discovered by Zimmerman (1987) and has been discussed further by Zimmerman and Zumbo (1993a, 1993b).

9. For m = n = 12, we use the exact critical values. For larger sample sizes, we estimate critical values as follows. First, we run one million simulations for each sample size using the distributions from Example 1, and record the values of U. We take the 100,000th-highest value of U to be our estimated critical value for the 10% level, the 50,000th-highest for the 5% level, and so on. These critical values are checked by re-running the simulations from Example 1, and verifying that these critical values do indeed correspond to the appropriate levels of significance (that is, there is no systematic difference between real and nominal levels). Critical values for sample sizes up to 20 are reported in Appendix B; Feltovich (in preparation) reports critical values for larger sample sizes.

10. Boos and Brownie’s (1988) null and alternative hypotheses concern whether the probability of a randomly-selected element from one population being higher than a randomly-selected element from the other population is equal to one-half. As mentioned in Section 3.4, when population distributions are skewed, this is not the same as the hypotheses of equal versus unequal medians.
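
The distinction drawn in note 10 can be made concrete with a small numerical example. The Python sketch below is illustrative and is not taken from the paper: the two populations are an assumption chosen for convenience, namely a right-skewed X = E1 - ln 2 and a left-skewed Y = ln 2 - E2, where E1 and E2 are independent unit exponentials. Both have median zero, yet X exceeds Y exactly when E1 + E2 > 2 ln 2, and since E1 + E2 has a Gamma(2, 1) distribution with survival function exp(-t)(1 + t), that probability is about 0.597 rather than one-half.

```python
import math

import numpy as np

LN2 = math.log(2.0)  # median of a unit exponential


def exact_prob_x_greater_y():
    """P(X > Y) for X = E1 - ln 2 and Y = ln 2 - E2 (assumed example)."""
    t = 2.0 * LN2
    # Survival function of Gamma(2, 1) evaluated at t.
    return math.exp(-t) * (1.0 + t)


def simulate(reps=200000, seed=0):
    """Return sample medians of X and Y and the share of pairs with X > Y."""
    rng = np.random.default_rng(seed)
    x = rng.exponential(size=reps) - LN2   # right-skewed, median 0
    y = LN2 - rng.exponential(size=reps)   # left-skewed, median 0
    return float(np.median(x)), float(np.median(y)), float(np.mean(x > y))


p_exact = exact_prob_x_greater_y()
med_x, med_y, p_sim = simulate()
print(p_exact, med_x, med_y, p_sim)
```

In this example the hypothesis of equal medians is true while the hypothesis that P(X > Y) equals one-half is false, so a test aimed at one hypothesis should not be read as a test of the other.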

References

Boos, D.D. and Brownie, C. (1988). “Bootstrap p-Values for Tests of Nonparametric Hypotheses.” Institute of Statistics Mimeo Series No. 1919, North Carolina State University.

Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences. New York and London: Academic Press.

Duffy, J. and Feltovich, N. (1999). “Does Observation of Others Affect Learning in Strategic Environments? An Experimental Study.” International Journal of Game Theory. 28, 131–152.

Feltovich, N. (in preparation). “Critical Values for the Robust Rank-Order Test.” Available at www.uh.edu/~nfelt/papers/rrovals.pdf

Fligner, M.A. and Pollicello, G.E. III. (1981). “Robust Rank Procedures for the Behrens-Fisher Problem.” Journal of the American Statistical Association. 76, 162–168.

Mann, H.B. and Whitney, D.R. (1947). “On a Test of Whether One of Two Random Variables is Stochastically Larger Than the Other.” Annals of Mathematical Statistics. 18, 50–60.

Siegel, S. and Castellan, N.J., Jr. (1988). Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.

Tatsuoka, M. (1993). “Effect Size.” In G. Keren and C. Lewis (eds.), A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues. Hillsdale, NJ: Erlbaum, pp. 461–479.

Wilcoxon, F. (1945). “Individual Comparisons by Ranking Methods.” Biometrics Bulletin. 1, 80–83.

Zimmerman, D.W. (1987). “Comparative Power of Student T Test and Mann-Whitney U Test for Unequal Sample Sizes and Variances.” Journal of Experimental Education. 55, 171–174.

Zimmerman, D.W. and Zumbo, B.D. (1993a). “Rank Transformations and the Power of the Student t Test and Welch t Test for Non-Normal Populations with Unequal Variances.” Canadian Journal of Experimental Psychology. 47, 523–539.

Zimmerman, D.W. and Zumbo, B.D. (1993b). “The Relative Power of Parametric and Nonparametric Statistical Methods.” In G. Keren and C. Lewis (eds.), A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues. Hillsdale, NJ: Erlbaum, pp. 481–517.

Zumbo, B.D. and Coulombe, D. (1997). “Investigation of the Robust Rank-Order Test for Non-Normal Populations with Unequal Variances: The Case of Reaction Time.” Canadian Journal of Experimental Psychology. 51, 139–149.