
Austral. J. Statist., 21 (3), 1979, 301-310

OPTIMAL TESTS OF SIGNIFICANCE¹

J. ROBINSON

Department of Mathematical Statistics, University of Sydney

Summary

To perform a test of significance of a null hypothesis, a test statistic is chosen which is expected to be small if the hypothesis is false. Then the significance level of the test for an observed sample is the probability that the test statistic, under the assumptions of the hypothesis, is as small as, or smaller than, its observed value. A "good" test statistic is taken to be one which is stochastically small when the null hypothesis is false. Optimal test statistics are defined using this criterion and the relationship of these methods to the Neyman-Pearson theory of hypothesis testing is considered.

1. Introduction

One of the attractions of the Neyman-Pearson approach to testing hypotheses is the relatively clearcut theory for selection of a “good” test. There are, however, some objections to the decision theoretic formulation of the problem and there is some advantage in using the significance level of a test as a measure of the strength of evidence against the null hypothesis, rather than as a basis on which to reject or accept it. Thus it is desirable to give a theory for selection of a “good” test where the criteria used are not decision theoretic. Cox and Hinkley (1974) appear to advocate such an approach but they retain the idea of critical regions and so the discussion is not divorced from the decision theoretic approach. Dempster and Schatzoff (1965) and Stone (1969) consider the possibility of using expected significance level as a criterion. There is also a discussion of the use of the stochastic comparison of significance levels for test selection in Kempthorne and Folks (1971). This approach will be extended here.

We consider a test of a null hypothesis which specifies that a random variate X, to be observed, follows one of a given family of distributions, H. If t(X) is a test statistic, we calculate the significance level for t at the observation x,

S_t(x) = sup_{P∈H} P(t(X) ≤ t(x)).

Some key words: tests of significance, optimal tests.

¹ Manuscript received December 12, 1978.


If S_t(x) is small this is taken as evidence against the null hypothesis. In selecting a "good" test statistic, it is necessary to consider the distribution of S_t(X) when X has a distribution in some alternative family, K, and t is chosen so that S_t(X) is stochastically small in this case. That is, if Q ∈ K, then t is chosen so that Q(S_t(X) ≤ u) is as small as possible for all 0 ≤ u ≤ 1. It is noted that K is of interest only in the selection of an appropriate test statistic and not in the interpretation of a particular test.
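As a numerical illustration of this criterion, the following sketch estimates by simulation the distribution of the significance level of two candidate statistics under an alternative. The normal location model, the sample size and the two statistics compared are assumptions made only for the illustration; they are not taken from the paper.

```python
# A Monte Carlo sketch comparing two test statistics by the distribution
# of their significance levels under an alternative.  The normal location
# model, the sample size and the statistics compared are illustrative
# assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 10                              # observations per sample
n_null, n_alt = 20_000, 2_000       # simulation sizes

def statistics(samples):
    # Two candidate statistics; large values point away from H: mu = 0.
    return np.mean(samples, axis=-1), np.median(samples, axis=-1)

# Reference (null) distributions of both statistics under H: mu = 0.
null_mean, null_median = statistics(rng.normal(0.0, 1.0, size=(n_null, n)))

def sig_level(obs, null_sample):
    # Significance level: null probability of a value at least as extreme.
    return (null_sample >= obs[:, None]).mean(axis=1)

# Samples drawn under the alternative mu = 0.5.
alt_mean, alt_median = statistics(rng.normal(0.5, 1.0, size=(n_alt, n)))
s_mean = sig_level(alt_mean, null_mean)
s_median = sig_level(alt_median, null_median)

for u in (0.01, 0.05, 0.10):
    print(f"u={u:4.2f}  P(S_mean<=u)={np.mean(s_mean <= u):.3f}"
          f"  P(S_median<=u)={np.mean(s_median <= u):.3f}")
# The mean-based statistic has the stochastically smaller significance
# level under the alternative, i.e. it is the more sensitive statistic.
```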

This method of considering significance testing does not remove the difficulties inherent in a choice of a proper frame of reference for the probabilities used to compute S_t(x) and to compare tests. There may be occasions when it is appropriate to consider probabilities conditional on some ancillary statistics. However, this controversy may be considered separately and we will suppose that the appropriate probabilities are known.

Upon examination, it is evident that the methods of the Neyman-Pearson theory can be used to choose test statistics which are stochastically small when alternative hypotheses hold. If for a particular alternative, a most powerful test of size u has critical function (Lehmann, 1959, p. 62) φ_u(x) = 1, when S_t(x) ≤ u, and 0, when S_t(x) > u, for some statistic t, and if t is the same for all choices of 0 ≤ u ≤ 1, then the test statistic t has stochastically smaller significance level than any other test statistic. Thus, when the most powerful test is of this form, t is the test statistic we would choose. Most of the well known optimal tests of the Neyman-Pearson theory are of this form, or of a slightly modified form using randomization. Thus the Neyman-Pearson theory could be used as a method of choosing "best" tests even though the implications of accepting or rejecting hypotheses are not used. However, it seems worthwhile to obtain the results directly and to examine any differences which arise between the two approaches.

In the sequel we will discuss optimal tests of significance and consider methods of choosing these tests which parallel the usual Neyman-Pearson results. In Section 2, we will discuss the level of significance and define a randomized level of significance which will enable us to compare tests more easily. In Section 3, we show that the likelihood ratio test is “best” when both H and K are simple, we define uniformly “best” tests and consider “best” tests in the classes of similar and invariant tests. We discuss maximin and most stringent tests in Section 4. Proofs of results stated in these sections are given in the Appendix.

2. Levels of Significance

The level of significance, S_t(X), may be a discrete random variable, either because t(X) is discrete or because of the process of taking the supremum. For example, if X is a N(0, σ²) random variable for 0 < σ² < ∞, and if t(x) = x, then S_t(X) takes the values ½ or 1, each with probability ½. Let

F(u) = sup_{P∈H} P(S_t(X) ≤ u),

for 0 ≤ u ≤ 1; then we can prove the following theorem concerning F(u).

Theorem 1. F(u) is a distribution function and if F(u) − F(u − δ) > 0 for all δ > 0, then F(u) = u.

Thus we see that S_t(x) = sup_{P∈H} P(S_t(X) ≤ S_t(x)), so t(x) may be replaced by S_t(x) as the test statistic considered. Values of u such that F(u) = u will be called achievable significance levels, following Kempthorne and Folks (1971). In order to obtain a level of significance which is continuous, we define a u-significance level by

S_t(x, u) = sup_{P∈H} [P(S_t(X) < S_t(x)) + u P(S_t(X) = S_t(x))],

for 0 ≤ u ≤ 1. Then a randomized significance level is S_t(x, u), where x is an observed value of X and u is an observed value of U, a uniform [0, 1] random variable. Notice that we cannot define this significance level by sup_{P∈H} [P(t(X) < t(x)) + u P(t(X) = t(x))], since, as in the example cited above, where t(X) is N(0, σ²), for 0 < σ² < ∞, t(X) may be continuous for all P ∈ H, but S_t(X) may be discrete, in which case

S_t(X, u) = S_t(X).

Theorem 2. For all u ∈ [0, 1], sup_{P∈H} P(S_t(X, U) ≤ u) = u, where here and in the sequel we will, by an abuse of notation, take P to be the product measure of P, a probability distribution for X, and the probability distribution for U.

One further difficulty may arise with the definition of significance level given above. This is that there may be no P* ∈ H such that P*(S_t(X, U) ≤ u) = u. This difficulty may sometimes be avoided by a slightly generalised definition of the randomized level of significance which retains all the essential features of the earlier definition.


Define S(x, u) to be a randomized level of significance if

(1) S(x, 1) = sup_{P∈H} P(S(X, 1) ≤ S(x, 1)),
(2) for fixed x, S(x, u) is a non-decreasing function of u ∈ [0, 1],
(3) sup_{P∈H} P(S(X, U) ≤ u) = u, for all u ∈ [0, 1].

Clearly, this definition includes the previous one as a special case. The example considered below illustrates the difference in the definitions. In the sequel we will frequently refer to the level of significance as the test statistic and we will only use a subscript when it is required to emphasize the role played by a particular statistic. We will call S(x, u) a test for the sake of brevity, rather than say a test with significance level S(x, u).

It is not suggested here that the randomized significance level be used in practice in an actual test of significance, but only that it be used to compare tests and select an appropriate test statistic. If the jumps in the distribution of S_t(X) or S(X, 1) are small then little is lost if we select a test on the basis of the randomized significance level. If they are not small it may be wise to consider statistics with more achievable significance levels.
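The following sketch checks Theorem 2 numerically for a simple discrete null hypothesis; the binomial model and the value m = 8 are assumptions made only for this illustration.

```python
# A minimal sketch checking Theorem 2 for a simple discrete null: with
# X ~ Binomial(m, 1/2) and t(x) = x, the randomized significance level
# S_t(X, U) is uniform on [0, 1].  The binomial model and m = 8 are
# illustrative assumptions.
import numpy as np
from scipy.stats import binom

m, p = 8, 0.5
x_vals = np.arange(m + 1)
pmf = binom.pmf(x_vals, m, p)
cdf_strict = np.concatenate(([0.0], np.cumsum(pmf)[:-1]))   # P(X < x)

def randomized_level(x, u):
    # S_t(x, u) = P(t(X) < t(x)) + u * P(t(X) = t(x)), here with t(x) = x.
    return cdf_strict[x] + u * pmf[x]

rng = np.random.default_rng(1)
N = 200_000
x = rng.binomial(m, p, size=N)
u = rng.uniform(size=N)
s = randomized_level(x, u)

for level in (0.05, 0.25, 0.5, 0.9):
    print(f"P(S <= {level}) ~= {np.mean(s <= level):.3f}")   # each close to `level`
```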

Example 1. Suppose X and Y are Poisson variables with parameters λ and μ, respectively, and consider the problem of testing the hypothesis, H : λ ≤ μ, against the alternative, K : λ > μ. Consider the statistic

t(x, y) = Σ_{k=x}^{x+y} (x+y choose k) (1/2)^{x+y},

which is the standard conditional test statistic given X + Y = x + y. The values taken by t(x, y) for all x, y are dense in [0, 1] and it is not difficult to see that for any u ∈ [0, 1],

sup_{λ≤μ} P(t(X, Y) ≤ u) = u.

However, for each particular value of λ = μ,

P(t(X, Y) ≤ u) < u.

Now consider

S(x, y, u) = Σ_{k=x+1}^{x+y} (x+y choose k) (1/2)^{x+y} + u (x+y choose x) (1/2)^{x+y}.

Then if λ = μ,

P(S(X, Y, U) ≤ v) = v,

since then

P(S(X, Y, U) ≤ v | X + Y = x + y) = v.

It is noted that S(x, y, 1) = t(x, y). So the only advantage in considering S(x, y, u) is the technical one that comparisons of randomized tests are simplified.
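A computational sketch of Example 1 follows. The explicit binomial formulas for t(x, y) and S(x, y, u) are our reading of "the standard conditional test statistic"; the parameter value λ = μ = 3 is an illustrative assumption.

```python
# A minimal sketch of Example 1.  Given X + Y = x + y and lambda = mu,
# X is Binomial(x + y, 1/2), so
#   t(x, y)    = P(X >= x | X + Y = x + y),
#   S(x, y, u) = P(X > x | X + Y = x + y) + u * P(X = x | X + Y = x + y).
# The parameter value lambda = mu = 3 is an illustrative assumption.
import numpy as np
from scipy.stats import binom

def t_stat(x, y):
    n = x + y
    return binom.sf(x - 1, n, 0.5)           # P(X >= x | total = n)

def randomized_level(x, y, u):
    n = x + y
    return binom.sf(x, n, 0.5) + u * binom.pmf(x, n, 0.5)

# Check the conditional uniformity used in Example 1: when lambda = mu,
# S(X, Y, U) is uniform on [0, 1] given the total, hence also marginally.
rng = np.random.default_rng(2)
lam = mu = 3.0
N = 100_000
x = rng.poisson(lam, N)
y = rng.poisson(mu, N)
u = rng.uniform(size=N)
s = randomized_level(x, y, u)

print("S(x, y, 1) == t(x, y):", np.allclose(randomized_level(x, y, 1.0), t_stat(x, y)))
for v in (0.05, 0.25, 0.5):
    print(f"P(S <= {v}) ~= {np.mean(s <= v):.3f}")   # close to v
```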

3. Most Sensitive Tests

Consider first the case of a simple alternative hypothesis. Let K = {Q}. We will say, in accordance with the definition of Kempthorne and Folks (1971), that a test, S, is more sensitive than a test, S′, if, for all u ∈ [0, 1],

Q(S(X, U) ≤ u) ≥ Q(S′(X, U) ≤ u).

This is equivalent to saying that S has stochastically smaller significance level than S′ under the alternative distribution Q. If S is more sensitive than all other tests we will say it is most sensitive. It is worthwhile noticing that this is equivalent to saying that the test based on S is most powerful at all levels of significance u.

Now suppose H is also simple and let H = {P}. Let r(x) = p(x)/q(x), where p and q are densities of P and Q with respect to some dominating measure μ. Then r(x) is the inverse of the usual likelihood ratio, where the inverse is used to fit our convention that small values of a test statistic indicate departure from the null hypothesis. The following theorem is just a version of the Neyman-Pearson lemma.

Theorem 3. For any u ∈ [0, 1] and any test S,

Q(S_r(X, U) ≤ u) ≥ Q(S(X, U) ≤ u).

That is, r is the most sensitive test statistic for the test of the hypothesis, H : X has probability distribution P, against the alternative, K : X has probability distribution Q.
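The following sketch illustrates Theorem 3 by simulation in a simple-versus-simple problem; the particular choice H : N(0, 1) against K : N(2, 1), and the two-sided comparison statistic, are assumptions made only for the illustration.

```python
# A minimal sketch of Theorem 3 in an assumed setting: H: X ~ N(0, 1)
# against K: X ~ N(2, 1), one observation.  Here r(x) = p(x)/q(x) is
# decreasing in x, so the significance level of the likelihood-ratio
# statistic is S_r(x) = P_H(X >= x).  It is compared with the two-sided
# statistic s(x) = -|x|, whose level is S_s(x) = P_H(|X| >= |x|).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(2.0, 1.0, size=100_000)       # draws from the alternative Q

s_r = norm.sf(x)                             # P_H(X >= x)
s_s = 2.0 * norm.sf(np.abs(x))               # P_H(|X| >= |x|)

for u in (0.01, 0.05, 0.10):
    print(f"u={u:4.2f}  Q(S_r<=u)={np.mean(s_r <= u):.3f}"
          f"  Q(S_s<=u)={np.mean(s_s <= u):.3f}")
# Q(S_r <= u) >= Q(S_s <= u) for every u: r is the more sensitive statistic.
```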

If u is an achievable significance level for r and S, then

P(S_r(X) ≤ u) = P(S(X) ≤ u) = u

and

Q(S_r(X) ≤ u) ≥ Q(S(X) ≤ u),

as is proved by Kempthorne and Folks (1971). In some cases a statistic exists which is uniformly most sensitive.

In particular, in the case of families of distributions, {P_θ}, depending on a real parameter θ, if these have monotone likelihood ratio, as in the case of the one parameter exponential family, then there is a test statistic which is uniformly most sensitive for the null hypothesis H₀ : θ ≤ θ₀ against the alternatives K : θ > θ₀. In other cases locally optimal tests can be found by using the efficient score statistic, U = ∂ log p(x, θ₀)/∂θ₀, where p(x, θ₀) is the density of P_{θ₀} with respect to some dominating measure μ.
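A minimal sketch of the monotone likelihood ratio case follows; the Poisson model and the numbers used are illustrative assumptions, and the reduction of the supremum to the boundary value θ₀ is the standard monotone likelihood ratio argument.

```python
# A minimal sketch (the Poisson model and the numbers are illustrative
# assumptions).  For H_0: theta <= theta_0 against K: theta > theta_0
# with X_1, ..., X_n independent Poisson(theta), the family has monotone
# likelihood ratio in T = X_1 + ... + X_n, and the significance level of
# the uniformly most sensitive statistic is
#   S(x) = sup_{theta <= theta_0} P_theta(T >= t_obs) = P_{theta_0}(T >= t_obs),
# the supremum being attained at the boundary theta_0.
from scipy.stats import poisson

def significance_level(t_obs, n, theta0):
    # Under theta_0 the total T is Poisson(n * theta_0); return P(T >= t_obs).
    return poisson.sf(t_obs - 1, n * theta0)

# Example: n = 5 observations with total 14, testing H_0: theta <= 2.
print(significance_level(14, 5, 2.0))   # about 0.14
```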

Another approach is to restrict the class of tests in some reasonable way. One such restriction is to the class of similar tests, where by this we mean those tests for which P(S(X, U) ≤ u) = u, for all u ∈ [0, 1] and all P ∈ H. Frequently, no non-trivial similar test exists but there may be a subset H′ of H for which there are similar tests. In particular, this is a useful restriction if H′ is, in some sense, the boundary between H and K.

If T is a sufficient and boundedly complete statistic for H′, then

P(S(X, U) ≤ u) = u

for all u ∈ [0, 1] and all P ∈ H′, if and only if

P(S(X, U) ≤ u | T = t) = u, a.s.

for all u ∈ [0, 1] and all P ∈ H′. Such a test could be said to have Neyman structure. Notice that S(X, U) is independent of T for all P ∈ H′. In this case we can consider the problem conditionally on T = t, and find the uniformly most sensitive test for this problem. If this test is S*(x, u), then

Q(S*(X, U) ≤ u | T = t) ≥ Q(S(X, U) ≤ u | T = t) a.s.

for all u ∈ [0, 1], all Q ∈ K, and any other similar test S(x, u), so for any similar test S,

Q(S*(X, U) ≤ u) ≥ Q(S(X, U) ≤ u)

for all u ∈ [0, 1] and all Q ∈ K. Thus S*(x, u) is the uniformly most sensitive similar test.

Example 2. In the situation described in Example 1, X + Y is a complete, sufficient statistic for H′ = {λ, μ : λ = μ}. The uniformly most sensitive test for the conditional problem is S(x, y, u), so this is the uniformly most sensitive similar test.

Another restriction is to the class of unbiased tests, where a test is said to be unbiased if Q(S(X, U) ≤ u) ≥ u for all u ∈ [0, 1] and all Q ∈ K. Under certain conditions of continuity this class is included in the class of tests similar for the boundary of H and K, so if a test is a uniformly most sensitive similar test and is unbiased, then it is a uniformly most sensitive unbiased test. The problem may remain invariant under some group of transformations, in which case it may be appropriate to restrict attention to invariant test statistics and look for best invariant tests.

4. Maximin and Most Stringent Tests

A further approach of the Neyman-Pearson theory has been to consider the situation where, in some sense, H and K are separated and then to find a maximin test. We can similarly define a maximin test, S_m, as one for which

inf_{θ∈K} P_θ(S_m(X, U) ≤ u) ≥ inf_{θ∈K} P_θ(S(X, U) ≤ u)


for any test S and all u ∈ [0, 1]. One method for determining such tests is given by looking for least favourable distributions. Suppose that H and K have suitable σ-fields of subsets and that we can define probabilities on these σ-fields. Also suppose that there is a dominating measure, μ, for H ∪ K, and that p_θ(x), the densities with respect to μ, are measurable on the appropriate product spaces of the variables x, θ.

Theorem 4. Suppose probability measures λ and ν exist such that

r(x) = ∫_H p_θ(x) dλ(θ) / ∫_K p_θ(x) dν(θ),

and suppose

∫_H P_θ(S_r(X, U) ≤ u) dλ(θ) = u

and

∫_K P_θ(S_r(X, U) ≤ u) dν(θ) = inf_{θ∈K} P_θ(S_r(X, U) ≤ u)

for all u ∈ [0, 1]; then

inf_{θ∈K} P_θ(S_r(X, U) ≤ u) ≥ inf_{θ∈K} P_θ(S(X, U) ≤ u)

for all u ∈ [0, 1] and all tests S.

Example 3. Suppose X₁, …, X_n are normally independently distributed with common variance σ² and suppose EX_i = η_i, i = 1, …, r, and EX_i = 0, i = r+1, …, n. Consider a test of H : η₁ = … = η_r = 0 against K : Σ_{i=1}^r η_i²/σ² ≥ ψ².

S² = Σ_{i=1}^n X_i² is a complete, sufficient statistic for H and the conditional probability density function of X given S² = s² is

f(x | s²) = C_n s^{1−n} exp[tψ cos θ/σ],

where Σ_{i=1}^r η_i²/σ² = ψ², C_n depends only on n, t² = Σ_{i=1}^r x_i² and σtψ cos θ = Σ_{i=1}^r η_i x_i. Let λ give all weight to the point σ² = s² and let ν be a uniform distribution on Σ_{i=1}^r η_i²/σ² = ψ² and σ² = s². Then it can be shown that r(x), as defined in Theorem 3, obtained from the conditional density and λ and ν, is a monotone function of t²/s². So the usual F test of the analysis of variance is maximin amongst similar tests.
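The following sketch computes the significance level of the test in Example 3; the reduction of t²/s² to the usual F statistic is the standard one, and the data and the value r = 2 are illustrative assumptions.

```python
# A minimal sketch of Example 3 (the data and r = 2 are illustrative
# assumptions; the reduction to the F distribution is the standard one).
# With EX_i = eta_i for i <= r and EX_i = 0 for i > r, the statistic
# t^2/s^2 = sum_{i<=r} x_i^2 / sum_{i<=n} x_i^2 is a monotone function of
# the usual F statistic, so its significance level can be computed from
# the F_{r, n-r} distribution.
import numpy as np
from scipy.stats import f

def anova_significance_level(x, r):
    x = np.asarray(x, dtype=float)
    n = x.size
    t2 = np.sum(x[:r] ** 2)          # contribution of the first r coordinates
    resid = np.sum(x[r:] ** 2)       # contribution of the remaining n - r
    f_obs = (t2 / r) / (resid / (n - r))
    return f.sf(f_obs, r, n - r)     # P(F_{r, n-r} >= observed value)

x = [2.1, -1.8, 0.4, 0.3, -0.5, 0.9, -0.2]   # hypothetical observations
print(anova_significance_level(x, r=2))
```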

Finally, we may define a test S to be most stringent if it minimizes

sup_{θ∈K} [sup_{S′} P_θ(S′(X, U) ≤ u) − P_θ(S(X, U) ≤ u)]

for all u ∈ [0, 1]. If K can be partitioned into subsets K_δ, if sup_{S′} P_θ(S′(X, U) ≤ u) is constant on K_δ and if S is maximin for K_δ,


then S is most stringent. Thus most stringent tests can frequently be found from maximin tests. For example, the analysis of variance test considered in Example 3 is most stringent among all similar tests.

In analysing the differences between results obtained by this method and those of the Neyman-Pearson theory, we see that results obtained by averaging critical functions in the latter theory cannot be obtained in that way, since we cannot average test statistics. For example, for any critical function φ(x), there is an equivalent critical function, E[φ(X) | S = s], based on a sufficient statistic S, but corresponding to any test statistic there is not necessarily an equivalent test statistic based on S. However, the most sensitive tests are based on likelihood ratios and so automatically depend only on sufficient statistics. Also the methods of averaging critical functions using invariant measures to obtain maximin and most stringent tests are not applicable. Thus it is not clear that the existence of a maximin test, in the sense defined here, can be proved under the conditions used in the Neyman-Pearson theory. However, when a maximin test is found by that theory, and is of the form used here, then it is maximin in our sense. For example, the analysis of variance test of Example 3 can be shown to be maximin amongst all tests.

Appendix

Proof of Theorem 1. Write G(t) = S_t(x) when t(x) = t. Then F(u) = sup_{P∈H} P(G(T) ≤ u). Clearly, G(t) and F(u) are non-decreasing and 0 ≤ G(t) ≤ 1 and 0 ≤ F(u) ≤ 1. Suppose u is a point such that F(u) − F(u − δ) > 0 for all δ > 0. Then, for any δ > 0, ε > 0, there exist P* ∈ H and η(δ) > 0 such that

P*(G(T) ≤ u) + ε ≥ F(u) ≥ F(u − δ) + η(δ) ≥ P*(G(T) ≤ u − δ) + η(δ).

Since ε > 0 is arbitrary, we have for arbitrary δ > 0,

P*(G(T) ≤ u) > P*(G(T) ≤ u − δ).

Now suppose there exists no real number, t_u, such that G(t_u) = u. Let u* = sup{G(t) : G(t) ≤ u}, where u* = 0 if G(t) > u for all t. Then u* < u. So there exists δ > 0 such that there are no points t such that u − δ < G(t) ≤ u. Thus P*(u − δ < G(T) ≤ u) = 0, which contradicts the result above. Thus there is a point t_u such that G(t_u) = u. So

F(u) = sup_{P∈H} P(G(T) ≤ u) = sup_{P∈H} P(T ≤ t_u) = G(t_u) = u.

Now it is clear that F(u) is a distribution function.


Proof of Theorem 2. Let z_u = sup{F(u′) : F(u′) ≤ u}, w_u = inf{F(u′) : F(u′) ≥ u}. Then z_u ≤ u ≤ w_u, since F(u) is non-decreasing, and

P(S_t(X, U) ≤ u) = P(S_t(X) < z_u) + P(z_u ≤ S_t(X) ≤ w_u) P(U ≤ (u − z_u)/(w_u − z_u)).

So, since z_u and w_u are points of increase of F(u) or z_u = 0,

sup_{P∈H} P(S_t(X, U) ≤ u) = z_u + (w_u − z_u)(u − z_u)/(w_u − z_u) = u.

Proof of Theorem 3. Δ = Q(S_r(X, U) ≤ u) − Q(S(X, U) ≤ u)

= ∫ (I_r − I_S) q(x) dμ du,

where I_r and I_S are indicator functions for the sets {(x, u) : S_r(x, u) ≤ u} and {(x, u) : S(x, u) ≤ u}, respectively, and the integral is over the product set X × [0, 1], where X denotes the sample space.

Let R_u = sup_y {r(y) : S_r(y, 1) ≤ u}. Then R_u > 0 for u > 0 and

[I_r(x, u) − I_S(x, u)][R_u − r(x)] ≥ 0.

So

R_u ∫ (I_r − I_S) q dμ du ≥ ∫ (I_r − I_S) r q dμ du = ∫ (I_r − I_S) p dμ du.

From Theorem 2, ∫ (I_r − I_S) p dμ du = 0, so Δ ≥ 0.

Proof of Theorem 4. Define R_u, I_r, I_S as in the proof of Theorem 3. Then (I_r − I_S)(R_u − r) ≥ 0 and

∫_H [P_θ(S_r(X, U) ≤ u) − P_θ(S(X, U) ≤ u)] dλ(θ) ≥ 0,

for all 0 ≤ u ≤ 1 and any test S. Thus

∫_K P_θ(S_r(X, U) ≤ u) dν(θ) ≥ ∫_K P_θ(S(X, U) ≤ u) dν(θ),

for all 0 ≤ u ≤ 1 and any test S; but

inf_{θ∈K} P_θ(S_r(X, U) ≤ u) = ∫_K P_θ(S_r(X, U) ≤ u) dν(θ)

and

inf_{θ∈K} P_θ(S(X, U) ≤ u) ≤ ∫_K P_θ(S(X, U) ≤ u) dν(θ)

for all 0 ≤ u ≤ 1 and any test S, so the result follows immediately.


References

Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.

Dempster, A. P. and Schatzoff, M. (1965). "Expected significance level as a sensitivity index for test statistics." J. Amer. Statist. Assoc., 60, 420-436.

Kempthorne, O. and Folks, L. (1971). Probability, Statistics and Data Analysis. Ames: Iowa State University Press.

Lehmann, E. L. (1959). Testing Statistical Hypotheses. New York: Wiley.

Stone, M. (1969). "The role of significance testing: some data with a message." Biometrika, 56, 485-493.