
Some Statistical Tests for Constrained Occupancy Problems
Author(s): Paul Davies
Source: Biometrics, Vol. 39, No. 3 (Sep., 1983), pp. 719-725
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2531099



BIOMETRICS 39, 719-725 September 1983

SHORTER COMMUNICATIONS

EDITOR: JOHN J. GART

Some Statistical Tests for Constrained Occupancy Problems

Paul Davies

Department of Statistics, University of Birmingham, Birmingham B15 2TT, England

SUMMARY A chi square test is devised for testing the hypothesis that columns have an equal probability of selection in a matrix occupancy problem in which the cells occupied in each row are selected randomly but the matrix is subject to the restriction that each column must be selected at least once. Applications to the 'committee' problem and to a problem of comparing allele frequencies are discussed. Alternative forms of the test, appropriate when the constraint of nonzero totals arises from a particular selection mechanism, are obtained.

1. Introduction

In 1968, Mantel and Pasternack posed the following occupancy problem: 'A group consists of n individuals who may potentially serve on one or more of m committees each of which has r members. If committees are formed independently and each individual has an equal chance of selection for each committee, obtain the distribution of the number of individuals chosen to serve on at least one committee'. This problem has been solved in various ways (Gittelsohn, 1969; White, 1971); it has also been generalized to take account of, for example, committees of unequal size, the possibility of individuals refusing to serve (Sprott, 1969), and the formation of committees from different groups of individuals (Walter, 1976, 1979). Formulation as a matrix occupancy problem becomes clear if the committees are taken as the m rows and the individuals as the n columns of a matrix whose elements are 1 or 0 according to whether an individual does or does not serve on a particular committee.
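The distribution asked for in the Mantel and Pasternack problem can be sketched with a direct inclusion-exclusion count. This is only an illustration under the stated assumptions (equal committee sizes, independent committees, equal selection chances); the function name `served_distribution` is hypothetical and the approach is not necessarily the one used in the cited solutions.

```python
from fractions import Fraction
from math import comb


def served_distribution(n: int, m: int, r: int) -> dict:
    """P(exactly k of n individuals serve on at least one of m committees of size r)."""
    ways = comb(n, r)  # equally likely ways to fill one committee
    dist = {}
    for k in range(r, n + 1):
        # For a fixed set of k individuals: all committees drawn inside the set,
        # minus the cases in which some member of the set is never chosen.
        inner = sum(
            (-1) ** j * comb(k, j) * Fraction(comb(k - j, r), ways) ** m
            for j in range(k - r + 1)
        )
        dist[k] = comb(n, k) * inner
    return dist


# Small illustrative case: n = 3 individuals, m = 2 committees of r = 2 members.
dist = served_distribution(n=3, m=2, r=2)
```

The entry for k = n is the probability that every column of the occupancy matrix is selected at least once, which is the conditioning event studied in the rest of the paper.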

In applications, a main concern is often to test the hypothesis that each column (individual) has an equal probability of selection for given row (committee) sizes. Sampling mechanisms may be such that the row sizes are either fixed or random, with tests then usually conditional on the observed values. Mantel (1974) reformulated Cochran's Q test to provide a chi square type of statistic for testing homogeneity between columns. Similar statistics were obtained by Walter (1976, 1979) for his variants of the problem. Holst (1980) developed a general asymptotic theory for test statistics of a type that includes Mantel's statistic.

The problem considered here is the construction of a chi square statistic for testing homogeneity between columns when it is known that each column occurs or is selected at least once. This arises from a problem in genetics of comparing allele frequencies; for details see §4. Holst pointed out that this situation also arises in a capture-recapture sampling method known as the Schnabel census which is used when all of the observed animals have been caught on at least one occasion. To consider the relevance of the standard chi square tests in this situation and to construct alternatives, the methodology of Mantel's test (1974) is re-stated.

Key words: Cell occupancy; Chi square test; Constrained matrix occupancy; Committee problem.

2. Unconditional Chi Square Test

Suppose that in each of m rows of an $m \times n$ matrix, r cells are randomly selected without replacement from a group of n cells ($r < n$). Let the random variable $Y_{ij}$ ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$) equal 1 or 0 according to whether the (i, j)th cell is selected or not. Define the random column totals to be $c_j = \sum_i Y_{ij}$ and initially consider the simplest case, where the fixed row totals $r = \sum_j Y_{ij}$ are the same for all rows. If the null model, where each column has equal probability of selection for each row, is assumed, then it can straightforwardly be shown that, for all $j, k = 1, 2, \ldots, n$,

$$
\begin{aligned}
E(c_j) &= T/n, \\
\operatorname{var}(c_j) &= \{T(n - r)\}/n^2, \\
\operatorname{cov}(c_j, c_k) &= \{-T(n - r)\}/\{n^2(n - 1)\}, \quad j \neq k,
\end{aligned}
\tag{1}
$$

where $T = mr = \sum_j c_j$.
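One way to see the results in (1), given here as a brief sketch rather than as the derivation used in the cited work: under the null model each $Y_{ij}$ is a Bernoulli variable with probability $r/n$, the rows are independent, and the fixed sum $\sum_j c_j = T$ determines the covariance.

```latex
\begin{aligned}
E(c_j) &= \sum_{i=1}^{m} E(Y_{ij}) = m\,\frac{r}{n} = \frac{T}{n}, \\
\operatorname{var}(c_j) &= m\,\frac{r}{n}\Big(1 - \frac{r}{n}\Big) = \frac{T(n - r)}{n^2}, \\
0 = \operatorname{var}\Big(\sum_j c_j\Big) &= n \operatorname{var}(c_j) + n(n - 1)\operatorname{cov}(c_j, c_k)
\;\Longrightarrow\;
\operatorname{cov}(c_j, c_k) = -\frac{\operatorname{var}(c_j)}{n - 1} = -\frac{T(n - r)}{n^2(n - 1)}.
\end{aligned}
```

The last step uses the symmetry of the columns under the null model, so that all variances are equal and all covariances are equal.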

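From these results, Mantel (1974) obtained the statistic

$$
\sum_{j,k}^{n-1} \{c_j - E(c_j)\}\{c_k - E(c_k)\}\, v^{jk}, \tag{2}
$$

where $v^{jk}$ denotes the elements of the inverse of the variance-covariance matrix of the $c_j$ terms (the sum runs over $n - 1$ columns because the full $n \times n$ matrix is singular when $\sum_j c_j$ is fixed), and showed that (2) simplifies to

$$
X^2 = n(n - 1)\Big\{\sum_j (c_j - T/n)^2\Big\} \Big/ \{T(n - r)\}. \tag{3}
$$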

If the $c_j$ terms are assumed to be asymptotically normal, the null distribution of the statistic is approximately chi square with n - 1 degrees of freedom (df) and the test of no column effect is carried out in the usual way. Holst (1980) showed that a better approximation to the distribution of $X^2$ is given by $(m - 1)/m$ times a chi square variate with $m(n - 1)/(m - 1)$ df. However, when m is large there is little to choose between the two approximations.
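As an illustration of statistic (3) and of the scaled approximation attributed to Holst, the following sketch computes both for a small made-up set of column totals. The function names and the data are hypothetical; only the two formulas come from the text.

```python
def mantel_x2(col_totals, r):
    """Statistic (3) for equal row totals r; n and T are read off the column totals."""
    n = len(col_totals)
    T = sum(col_totals)  # T = m * r
    ss = sum((c - T / n) ** 2 for c in col_totals)
    return n * (n - 1) * ss / (T * (n - r))


def holst_reference(m, n):
    """Scale and df such that X^2 is approximated by scale * chi-square(df)."""
    return (m - 1) / m, m * (n - 1) / (m - 1)


# Hypothetical data: m = 4 rows, r = 2 cells per row, n = 4 columns.
x2 = mantel_x2([3, 1, 2, 2], r=2)
scale, df = holst_reference(m=4, n=4)
```

Note that `scale * df` equals n - 1, so both reference distributions share the same mean and differ only in spread.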

Before modifying this test to be appropriate for the case where all the $c_j$ terms are known to be strictly positive, consider how $X^2$ is affected if each column must be selected at least once. As $\sum_j c_j$ is fixed, larger variance of the $c_j$ about the null mean $T/n$, and hence larger values of the chi square statistic, can be obtained if zero values of $c_j$ are permissible. Use of $X^2$ when allowance should have been made for the exclusion of zero $c_j$ values is therefore conservative and the main concern is probably loss of power of the test to detect column differences. This loss will be least when m is much larger than n and r is large relative to n, so that occurrence of zero column totals will be unlikely.

3. Chi Square Test Conditional on Nonzero Column Totals

The joint probability generating function of the column totals, conditional on all of $c_1, c_2, \ldots, c_n$ being nonzero, can be found by using standard inclusion-exclusion arguments [as in Johnson and Kotz (1977, Chapter 3)].

Let $S = \{i_1, i_2, \ldots, i_r\}$ be the set of all selections without replacement of r integers from the n integers $1, 2, \ldots, n$, and let $S(\bar{i})$ be the set of selections of r integers from all the n integers except the ith, and so on. Further define polynomial functions

$$
Q_{12 \ldots n} = \Big\{\binom{n}{r}^{-1} \sum_{S} z_{i_1} z_{i_2} \cdots z_{i_r}\Big\}^m, \qquad
Q_{12 \ldots \bar{i} \ldots n} = \Big\{\binom{n}{r}^{-1} \sum_{S(\bar{i})} z_{i_1} z_{i_2} \cdots z_{i_r}\Big\}^m, \quad \text{etc.}
$$

Then the joint probability generating function of the $c_j$, conditional on all $c_j > 0$, can be shown to be

$$
A(z_1, z_2, \ldots, z_n) = \frac{1}{p_+}\Big(Q_{12 \ldots n} - \sum_{i} Q_{12 \ldots \bar{i} \ldots n} + \sum_{i < j} Q_{12 \ldots \bar{i} \ldots \bar{j} \ldots n} - \sum_{i < j < k} Q_{12 \ldots \bar{i} \ldots \bar{j} \ldots \bar{k} \ldots n} + \cdots\Big),
$$

where $p_+$ is the probability that in random allocation all of the $c_j$ values exceed zero and

$$
p_+ = \sum_{j=0}^{n-r} (-1)^j \binom{n}{j} \Big\{\binom{n-j}{r} \Big/ \binom{n}{r}\Big\}^m.
$$

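The moments of the conditional distribution of the $c_j$ can then be found from the probability generating function in the usual way. In particular,

$$
\begin{aligned}
E(c_j) &= T/n, \\
\operatorname{var}(c_j) &= \{T(n - r)\}/n^2 - \{rT(m - 1)\delta\}/(n^2 p_+), \\
\operatorname{cov}(c_j, c_k) &= \{-\operatorname{var}(c_j)\}/(n - 1) \quad \text{for } j, k = 1, 2, \ldots, n,\ j \neq k,
\end{aligned}
\tag{4}
$$

where

$$
\delta = \sum_{j=1}^{n-r} (-1)^{j+1} \binom{n}{j} \Big\{\binom{n-j}{r} \Big/ \binom{n}{r}\Big\}^m \{j/(n - j)\}.
$$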
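The conditional variance in (4) can be checked exactly in a small case by listing every equally likely matrix and keeping those with no empty column. The sketch below does this with exact fractions; the function names are hypothetical, the forms of $p_+$ and $\delta$ are those reconstructed from the inclusion-exclusion argument, and the brute-force enumeration is only feasible for very small m and n.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def p_plus(n, m, r):
    """Probability that all n column totals are nonzero under random allocation."""
    return sum(
        (-1) ** j * comb(n, j) * Fraction(comb(n - j, r), comb(n, r)) ** m
        for j in range(n - r + 1)
    )


def delta(n, m, r):
    """Correction term delta in the conditional variance (4)."""
    return sum(
        (-1) ** (j + 1) * comb(n, j) * Fraction(comb(n - j, r), comb(n, r)) ** m
        * Fraction(j, n - j)
        for j in range(1, n - r + 1)
    )


def conditional_variance(n, m, r):
    """var(c_j | all column totals > 0) from (4)."""
    T = m * r
    correction = r * T * (m - 1) * delta(n, m, r) / (n**2 * p_plus(n, m, r))
    return Fraction(T * (n - r), n**2) - correction


def enumerated_variance(n, m, r):
    """Brute-force variance of the first column total, given no empty column."""
    rows = list(combinations(range(n), r))
    kept = []
    for matrix in product(rows, repeat=m):
        totals = [sum(j in row for row in matrix) for j in range(n)]
        if min(totals) > 0:
            kept.append(totals[0])
    mean = Fraction(sum(kept), len(kept))
    return Fraction(sum(c * c for c in kept), len(kept)) - mean**2


n, m, r = 4, 3, 2
formula, brute = conditional_variance(n, m, r), enumerated_variance(n, m, r)
```

For this case both routes give the same exact fraction, which is smaller than the unconditional variance $T(n - r)/n^2$, in line with the argument that excluding zero totals reduces the spread of the $c_j$.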

When $T = \sum_j c_j$ is fixed, the means conditional on nonzero values $c_j$ are the same as the unconditional means. Otherwise the first terms are the unconditional variances and covariances with the terms that involve $\delta$ taking the form of a correction due to the restriction of positive $c_j$ values. Then, as before, a test can be based on the assumption that the $c_j$ are approximately normal, and the chi square statistic in (2) can be reduced to the simple form

$$
X^2 = \Big[\Big\{\sum_j (c_j - T/n)^2\Big\}\{n(n - 1)\}\Big] \Big/ \{T(n - r) - rT(m - 1)\delta/p_+\}. \tag{5}
$$

Hence when all $c_j$ terms are positive the unconditional statistic is corrected by a simple multiplying factor.
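The multiplying factor can be written out explicitly by dividing (5) by (3). The subscripts (3) and (5) below are only labels for the two statistics, introduced here for the comparison.

```latex
\frac{X^2_{(5)}}{X^2_{(3)}}
  = \frac{T(n - r)}{T(n - r) - rT(m - 1)\delta/p_+}
  = \Big\{1 - \frac{r(m - 1)\,\delta}{(n - r)\,p_+}\Big\}^{-1}.
```

The factor depends only on m, n and r, not on the observed column totals. As a rough check, for m = 10, n = 8 and r = 2 the expressions for $p_+$ and $\delta$ evaluate to about .60 and .047, so the braced term is about .76.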

The expectation of $X^2$ on the null hypothesis is n - 1, so the test could still be based on a chi square distribution with n - 1 df. Holst (1980) gave results which could be used to derive a chi square statistic in this situation. He obtained general expressions for the asymptotic mean and variance of a family of statistics which includes Mantel's chi square statistic; the test is then based on the asymptotic normality of these statistics. However, the asymptotic variances are not easy to obtain explicitly and, as Holst pointed out, the distribution of such statistics in samples of moderate size is likely to be right skewed and approximations are better based on the chi square than on the normal distribution. Moreover, the results in (4) are exact and not asymptotic so it is arguable that the test proposed here will perform better than Holst's in samples of moderate size. However, we have not so far been able to compare the performances of these tests.

It would be useful to know how large the values of m and n must be for the assumption of approximate normality to be legitimate and how far the distribution is affected by restricting the $c_j$ to be nonzero. Some simulation has been done and it would appear that the statistic $X^2$ has an approximately chi square distribution for quite moderate values of m. For example, consider the situation with m = 10 rows, n = 8 columns and r = 2 cells per row to be filled. With $p_+ = .60$ and a mean column total of only 2.5, one would not expect the normal approximation for the distribution of column totals to be very good. Random matrices of ones and zeroes with all column totals being nonzero were constructed 10 000 times and the sample distribution of $X^2$ was obtained. For the purpose of significance testing, the upper percentage points of the distribution are of most interest. The upper 5% and 1% points of the simulated distribution are 13.2 and 17.7, respectively. The corresponding points of a chi square distribution (7 df) are 14.1 and 18.5, so even in this case, use of the chi square distribution is unlikely to be seriously misleading. Holst suggested that a better approximation for the distribution of $X^2$ is given by $(m - 1)/m$ times a chi square variate with $m(n - 1)/(m - 1)$ df. If used here this gives even closer agreement, with corresponding percentage points of 13.7 and 17.8. Walter (1976) also found that the chi square approximation worked surprisingly well for small numbers. Note that the uncorrected form of $X^2$ in (3), with no allowance for nonzero column totals, has a value that is only 76% of the corrected $X^2$ in (5) and would give a test with much lower power.
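A simulation of this kind can be sketched by rejection sampling: generate unrestricted random matrices and keep only those with no empty column. The code below is an assumption about how such a study might be set up, not a record of the original one; the seed, the number of trials and the function name are arbitrary. It estimates $p_+$ from the acceptance rate and the conditional mean of the uncorrected statistic (3).

```python
import random


def simulate(m=10, n=8, r=2, trials=20000, seed=1):
    """Rejection sampling of matrices whose column totals are all nonzero."""
    rng = random.Random(seed)
    T = m * r
    kept = []  # uncorrected statistic (3) for each accepted matrix
    for _ in range(trials):
        totals = [0] * n
        for _row in range(m):
            for j in rng.sample(range(n), r):  # r distinct columns in each row
                totals[j] += 1
        if min(totals) > 0:
            ss = sum((c - T / n) ** 2 for c in totals)
            kept.append(n * (n - 1) * ss / (T * (n - r)))
    acceptance = len(kept) / trials           # estimate of p_+
    mean_uncorrected = sum(kept) / len(kept)  # to be compared with n - 1
    return acceptance, mean_uncorrected


acceptance, mean_uncorrected = simulate()
ratio = mean_uncorrected / (8 - 1)
```

Because the corrected statistic has null expectation n - 1, the ratio of the simulated mean of the uncorrected statistic to n - 1 should be close to the 76% figure quoted above, and the acceptance rate should be close to .60.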

Generalizing this approach to the situation where the m row totals are not all equal does not lead to simple results. However, in §5 we consider another approach to devising a test under the constraint of nonempty columns which does allow unequal row totals.

4. Illustration

Species of plants such as the field poppy (Papaver rhoeas) have a self-incompatibility system controlled by multiple alleles at a single gene locus, S. Half of the pollen produced by a plant of genotype $S_1S_2$ is expected to contain $S_1$ and the other half $S_2$. Both $S_1$ and $S_2$ fail on the stigmas of such plants. Thus the pollen of every plant in a population is incompatible with its own stigmas as well as with those of other plants of the same genotype. It follows that no plant in the population can be homozygous and all plants carry two different S-alleles. Also, a plant with a rare allele is liable to leave more progeny than a plant with commonly occurring alleles, so the incompatibility polymorphism must be maintained by frequency-dependent selection. Wright (1939) established that, provided the effect of selection on the locus is limited to that associated with incompatibility, the equilibrium frequency of each of N alleles present in the population is 1/N.

In an experiment described by Campbell and Lawrence (1981), 51 poppy plants, comprising a random sample derived from a natural population, were classified with respect to incompatibility phenotype by means of a 51 × 51 half diallel crossing scheme. When the 102 alleles of the 51 plants were identified, it was found that 31 distinct alleles were represented. To test the equilibrium or equal-allele-frequency hypothesis, the m = 51 plants were identified with the rows of an occupancy matrix and the n = 31 alleles with the columns. The number of observed alleles per row was r = 2. Assuming the equal-allele-frequency hypothesis, Campbell and Lawrence showed that the maximum likelihood estimate of the number of population alleles, N, was 32, which was close to the sample number n. This estimate may be very misleading if the hypothesis is false. As N is unknown, both test and hypothesis of equal allele frequencies must inevitably be conditional on the alleles represented in the sample and hence on nonzero allele totals. The frequency distribution of the allele totals, $c_j$, is as follows:

c_j         1   2   3   4   5   6   7   8   9  10  11   Total
Frequency   8   7   6   3   3   1   1   0   0   0   2      31

The probability, $p_+$, that all of $c_1, c_2, \ldots, c_n$ are nonzero under unrestricted randomization is only .33, so the constraint of nonzero $c_j$ values will have a large effect on their distribution. From (5) the statistic $X^2 = 74.3$. With 30 df both this statistic and Mantel's $X^2$ (64.3) reject the equal-frequency hypothesis at low significance levels (P < .0003), reflecting the disparate observed frequencies. However, a situation in which the population contains a few rare alleles amongst many of roughly equal frequency would be hard to detect by this or any other test unless the sampling is extensive enough to capture the rare alleles.
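The figures quoted for the poppy data can be recomputed from the frequency table with a few lines of arithmetic. The sketch below uses floating point and the expressions for $p_+$, $\delta$, (3) and (5); the variable names are illustrative only.

```python
from math import comb

# Allele total c_j -> number of alleles with that total (from the frequency table).
freq = {1: 8, 2: 7, 3: 6, 4: 3, 5: 3, 6: 1, 7: 1, 11: 2}
totals = [c for c, f in freq.items() for _ in range(f)]
m, r = 51, 2
n, T = len(totals), sum(totals)  # 31 alleles and T = m * r = 102

# q[j]: probability that j specified columns are all empty.
q = [(comb(n - j, r) / comb(n, r)) ** m for j in range(n - r + 1)]
p_plus = sum((-1) ** j * comb(n, j) * q[j] for j in range(n - r + 1))
delta = sum(
    (-1) ** (j + 1) * comb(n, j) * q[j] * j / (n - j) for j in range(1, n - r + 1)
)

ss = sum((c - T / n) ** 2 for c in totals)
x2_mantel = n * (n - 1) * ss / (T * (n - r))  # statistic (3)
x2_corrected = n * (n - 1) * ss / (T * (n - r) - r * T * (m - 1) * delta / p_plus)  # (5)
```

Rounded to one decimal place, `x2_mantel` and `x2_corrected` agree with the values 64.3 and 74.3 given in the text, and `p_plus` rounds to .33.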




5. Allocations with Randomization Constraints

Though tests should be conditional on nonzero column totals in the example above, the allocation of ones and zeroes in each row is random and not subject to any other restriction. In some problems the allocation may be restricted to ensure that a constraint is satisfied and then different tests will be required. To illustrate this, consider the ethological problem of testing whether all the n monkeys in a colony are equally gregarious or whether some are more disposed than others to cluster in groups. On each of a series of $i = 1, 2, \ldots, m$ independent occasions, a monkey is selected and the $r_i - 1$ monkeys located 'close' to this selected 'nucleus' monkey are identified (forming a cluster of $r_i$ 'ones'). If on each occasion the nucleus is selected at random, Mantel's test can be used to test the hypothesis that the groups are formed at random around the nucleus from a set of equally gregarious monkeys. However, in order to detect departures from the hypothesis and also to give the solution general relevance, it is probably desirable to predetermine that each monkey is selected as the nucleus at least once.

Suppose that the jth monkey is, in advance, systematically selected as the nucleus $p_j$ times, corresponding to a subset, $S_j$, of the m occasions. As before, a test conditional on the observed row totals, $r_i$, is constructed. Then on the hypothesis of equal probability of selection, the moments of the column or monkey totals are found to be

$$
\begin{aligned}
E(c_j \mid S_j) &= p_j + \sum_{i \notin S_j} (r_i - 1)/(n - 1), \\
\operatorname{var}(c_j \mid S_j) &= \sum_{i \notin S_j} (r_i - 1)(n - r_i)/(n - 1)^2, \quad j = 1, 2, \ldots, n, \\
\operatorname{cov}(c_j, c_k \mid S_j, S_k) &= -\sum_{i \notin S_j, S_k} (r_i - 1)(n - r_i)/\{(n - 1)^2(n - 2)\}, \quad j \neq k = 1, 2, \ldots, n.
\end{aligned}
\tag{6}
$$

These expressions can be substituted into the general form of the chi square statistic (2) to give $X_f^2$. In the simplest case where all the $r_i$ are equal to r and all the $p_j$ are equal to $m/n$, $X_f^2$ reduces to a multiple, $(nr - r)/(nr - n)$, of Mantel's $X^2$,

$$
X_f^2 = \Big\{\sum_j (c_j - T/n)^2\Big\}\big[(n - 1)^2/\{(rm - m)(n - r)\}\big]. \tag{7}
$$
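The moments in (6) are straightforward to tabulate for a given design. The following sketch, with a hypothetical function name and a made-up design, builds the mean vector and covariance matrix in exact fractions; the nucleus of each occasion is encoded as a column index, which is an assumption about data layout rather than anything specified in the text.

```python
from fractions import Fraction


def restricted_moments(nucleus, row_sizes, n):
    """Means and covariance matrix of the column totals from (6).

    nucleus[i] is the column fixed as nucleus on occasion i; row_sizes[i] is r_i.
    """
    rows = list(zip(nucleus, row_sizes))
    mean = [
        sum(Fraction(1) if s == j else Fraction(r_i - 1, n - 1) for s, r_i in rows)
        for j in range(n)
    ]
    cov = [[Fraction(0)] * n for _ in range(n)]
    for s, r_i in rows:
        a = Fraction((r_i - 1) * (n - r_i), (n - 1) ** 2)
        for j in range(n):
            if j == s:
                continue  # the nucleus is fixed in this row and adds no variability
            cov[j][j] += a
            for k in range(n):
                if k != j and k != s:
                    cov[j][k] -= a / (n - 2)
    return mean, cov


# Hypothetical design: n = 4 monkeys, m = 6 occasions, unequal cluster sizes.
nucleus = [0, 1, 2, 3, 0, 1]
row_sizes = [2, 3, 2, 3, 3, 2]
mean, cov = restricted_moments(nucleus, row_sizes, n=4)
```

Two properties follow from the fixed grand total and give a quick consistency check: the means add up to $T = \sum_i r_i$, and every row of the covariance matrix sums to zero. The matrix is therefore singular, which is why the general statistic (2) uses only n - 1 of the columns.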

This statistical model could also be used to describe a version of the 'committee' problem in which it is pre-arranged that the jth ($j = 1, 2, \ldots, n$) of a set of n individuals must be chairman of $p_j$ of the m committees, where the remaining committee places are allocated at random amongst the other individuals.

A more difficult situation arises when it is known that the randomization is subject to constraints as above but some information on the constraints is unavailable. For example, consider an investigation to compare the frequencies of different categories of petty crime amongst juvenile offenders. Suppose n related types of offence are under consideration and the participation of $p_j$ ($j = 1, 2, \ldots, n$) juveniles convicted for Offence j is obtained where $p_j > 1$. To have any hope of establishing by use of questionnaires whether each juvenile would admit to other offences than that for which he/she was already convicted, complete anonymity of the responses must be guaranteed and made demonstrably obvious. If the questionnaires are regarded as the rows and the offences as the columns of a matrix with 0/1 entries for the absence/presence of an offence, the selection of the juveniles ensures that the column total j is at least $p_j$, but unidentifiability of the responses means that it is unknown which set, $S_j$, of rows ensures that this constraint applies. Such an experiment might well make use of the randomized response method developed by Greenberg et al. (1969). If not, a possible approach to comparing the frequencies in such situations would be to assume that all the possible ways of associating the $S_j$ with the m rows are equally probable and to obtain the unconditional moments of the column totals, $c_j$, over this distribution. Then it can be shown that

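$$
\begin{aligned}
E(c_j) &= T/n + \{(T/n - m)(1 - np_j/m)\}/(n - 1), \\
\operatorname{var}(c_j) &= \big([(n + 1)T - \{m(m - p_j - 1)R + p_jT^2\}/(m^2 - m) - mn](m - p_j)\big)/\{m(n - 1)^2\}, \\
\operatorname{cov}(c_j, c_k) &= [-\{(n + 1)T - R - nm\}(m - p_j - p_k)]/\{m(n - 1)^2(n - 2)\} \\
&\quad + \{(T^2/m - R)p_jp_k\}/\{m(m - 1)(n - 1)^2\},
\end{aligned}
\tag{8}
$$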
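The moments in (8) can be tabulated in the same way as those for a known design. In the sketch below, R is taken to be the sum of the squared row totals and m the sum of the $p_j$; the function name and the example values are hypothetical.

```python
from fractions import Fraction


def unconditional_moments(p, row_sizes):
    """Means and covariance matrix of the column totals from (8).

    p[j] is the number of rows in which column j is known to be fixed; sum(p) = m.
    """
    n, m = len(p), len(row_sizes)
    assert sum(p) == m
    T = sum(row_sizes)
    R = sum(r * r for r in row_sizes)
    F = Fraction
    mean = [F(T, n) + (F(T, n) - m) * (1 - F(n * pj, m)) / (n - 1) for pj in p]
    cov = [[F(0)] * n for _ in range(n)]
    for j in range(n):
        inner = (n + 1) * T - F(m * (m - p[j] - 1) * R + p[j] * T * T, m * m - m) - m * n
        cov[j][j] = inner * (m - p[j]) / (m * (n - 1) ** 2)
        for k in range(n):
            if k != j:
                first = F(((n + 1) * T - R - n * m) * (m - p[j] - p[k]),
                          m * (n - 1) ** 2 * (n - 2))
                second = (F(T * T, m) - R) * p[j] * p[k] / (m * (m - 1) * (n - 1) ** 2)
                cov[j][k] = -first + second
    return mean, cov


# Hypothetical values: n = 4 offences, m = 6 questionnaires with unequal row totals.
mean, cov = unconditional_moments(p=[2, 2, 1, 1], row_sizes=[2, 3, 2, 3, 3, 2])
```

As with any set of totals whose sum is fixed, the means add up to T and each row of the covariance matrix sums to zero, which provides a useful check on a transcription of (8).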

where $R = \sum_i r_i^2$ and $m = \sum_j p_j$. The chi square statistic (2) does not reduce to a particularly simple form except when all the $p_j$ values are equal to $m/n$. In this case,

$$
X_c^2 = \Big[(n - 1)^2\Big\{\sum_j (c_j - T/n)^2\Big\}\Big] \Big/ \big[(n + 1)T - \{T^2 + (mn - m - n)R\}/\{n(m - 1)\} - mn\big]. \tag{9}
$$

In the special case of equal row totals, $r_k$, the statistic does not depend on the choice of $S_j$ values, and $X_c^2$ is equal to $X_f^2$. A small simulation study was done on the distribution of both the statistics $X_f^2$ and $X_c^2$. Although this did not provide useful precise guidelines, it showed that the chi square approximation in the form suggested by Holst appears reasonable in this situation for moderate values of m and n. In the case of $X_c^2$, the simulated distribution is taken over both a random choice of the $S_j$ values and a random allocation, given the $S_j$. For a particular set of $S_j$, $X_c^2$ must differ from a chi square random variable to an extent that depends on the inequality between row totals. Its use is therefore most justifiable when the disparity between the $r_k$ is not great. A similar approach to test construction could be employed for other systems of partially random allocation in matrix occupancy problems.
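The statement that (9) coincides with (7) when the row totals are equal can be illustrated numerically. The sketch below evaluates both expressions in exact fractions for a made-up set of column totals; the function names are hypothetical.

```python
from fractions import Fraction


def x2_eq7(col_totals, m, r):
    """Statistic (7): equal row totals r and p_j = m/n."""
    n, T = len(col_totals), sum(col_totals)
    ss = sum((c - Fraction(T, n)) ** 2 for c in col_totals)
    return ss * Fraction((n - 1) ** 2, (r * m - m) * (n - r))


def x2_eq9(col_totals, row_sizes):
    """Statistic (9): p_j = m/n with possibly unequal row totals."""
    n, m = len(col_totals), len(row_sizes)
    T = sum(row_sizes)
    R = sum(r * r for r in row_sizes)
    ss = sum((c - Fraction(T, n)) ** 2 for c in col_totals)
    denom = (n + 1) * T - Fraction(T * T + (m * n - m - n) * R, n * (m - 1)) - m * n
    return (n - 1) ** 2 * ss / denom


# Hypothetical column totals for n = 4 columns and m = 8 rows of size r = 3.
totals = [8, 5, 6, 5]
same = x2_eq9(totals, [3] * 8) == x2_eq7(totals, m=8, r=3)
```

With equal rows, $R = mr^2$ and the denominator of (9) collapses to $m(r - 1)(n - r)$, which is the denominator of (7); the comparison above confirms this for one case.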

ACKNOWLEDGEMENTS

I am indebted to Dr M. J. Lawrence of the Department of Genetics, University of Birmingham, for bringing this problem to me, and to a referee for helpful comments and information.

RÉSUMÉ On dérive un test du chi-carré pour tester l'hypothèse que les colonnes ont la même probabilité de sélection dans un problème de matrice d'occupation dans lequel les cellules occupées dans chaque ligne sont tirées au hasard, mais la matrice est sujette à la restriction que chaque colonne doit être sélectionnée au moins une fois. On discute des applications au 'problème du comité' et à un problème de comparaison de fréquences alléliques. On dérive des formes de rechange du test appropriées quand la contrainte de totaux non nuls provient d'un mécanisme de sélection particulier.

REFERENCES

Campbell, J. M. and Lawrence, M. J. (1981). The population genetics of the self-incompatibility polymorphisms in Papaver rhoeas. II. The number and frequency of S-alleles in a natural population (R 106). Heredity 46, 88-90.

Gittelsohn, A. M. (1969). An occupancy problem. American Statistician 23, 11-12.

Greenberg, B. G., Abdel-Latif, A. A.-E., Simmons, W. R. and Horvitz, D. G. (1969). The unrelated question randomized response model: Theoretical framework. Journal of the American Statistical Association 64, 520-539.

Holst, L. (1980). On matrix occupancy, committee and capture-recapture problems. Scandinavian Journal of Statistics 7, 139-146.

Johnson, N. L. and Kotz, S. (1977). Urn Models and their Application. New York: Wiley.

Mantel, N. and Pasternack, B. S. (1968). A class of occupancy problems. American Statistician 22, 23-24.

Mantel, N. (1974). Approaches to a health research occupancy problem. Biometrics 30, 355-362.

Sprott, D. A. (1969). A note on a class of occupancy problems. American Statistician 23, 12-13.

Walter, S. D. (1976). A generalisation of a matrix occupancy problem. Biometrics 32, 371-375. (Correction: Biometrics 32, 954).

Walter, S. D. (1979). Some generalizations of the committee problem. Canadian Journal of Statistics 7, 1-10.

White, C. (1971). The committee problem. American Statistician 25, 25-26.

Wright, S. (1939). The distribution of self-sterility alleles in populations. Genetics 24, 538-552.

Received November 1980; revised December 1981 and April 1982


