mixtures and limits of symmetric random integer partitions


METRON - International Journal of Statistics 2012, vol. LXX, n. 2-3, pp. 207-217

MAURO GASPARINI

Mixtures and limits of symmetric random integer partitions

Summary - In problems of species counts, interest lies more in the number of different species and their relative abundance than in counts of representatives of specific species. Problems of this kind originated in Genetics, where species are alleles of a gene, but are also common in other applied sciences. Ewens sampling formula, obtained by a geneticist, is the first and most famous probability distribution devoted to this area of research. From a statistical point of view, Ewens sampling formula is an example of the distribution of a random integer partition, i.e. a random list of multiplicities $m_1, \ldots, m_n$ such that $m_1 + 2m_2 + \ldots + n m_n = n$. In this paper, Ewens and other distributions of symmetric random integer partitions are obtained without reference to biological evolutionary models. Then, their mixtures and limits are studied, obtaining some interesting relationships and some new examples.

Key Words - Ewens sampling formula; Hierarchical Bayesian Models; Dirichlet integer partition.

1. Alien at the zoo and Ewens formula

Imagine an Alien who lands on Earth, ends up by chance in a zoo and starts sampling animals to get an idea of the biological variety of our planet. Imagine the Alien samples 3 elephants, 2 kangaroos and so on, as depicted in Figure 1. Not knowing what the animals are and their relative importance, the first information the Alien processes is the pattern of abundances: one species with 3 elements, 3 species with 2 elements and 2 species with only single elements, for a total of 6 different species among the 11 animals sampled.

More formally, given a sample of $n$ categorical observations, let $m_i$, $i = 1, \ldots, n$, be the number of species (classes, levels) which are represented $i$ times in the sample. These species count statistics are different from the class counts any statistician is used to, but they occur very naturally in Genetics and Biology.

Received November 2011 and revised May 2012.


Figure 1. An Alien at the zoo sampling 11 animals: 2 single ones, 3 pairs and one triple of different species.

For any real number $x$, let $x^{[n]} = x(x+1)\cdots(x+n-1)$ be the ascending factorial. Ewens sampling formula, a cornerstone in this species count theory, is then defined as

$$ES(m;\theta) = \frac{n!}{\prod_{i=1}^{n} i^{m_i}\, m_i!}\;\frac{\theta^{\sum m_i}}{\theta^{[n]}} \tag{1}$$

a probability distribution, indexed by the parameter $\theta$, on the set of vectors of non-negative integers $m = (m_1, \ldots, m_n)$ such that

$$m_1 + 2m_2 + \ldots + n m_n = n. \tag{2}$$
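As a quick numerical companion to formula (1), the following Python sketch evaluates the Ewens probability in exact rational arithmetic and checks that it sums to one over all vectors satisfying (2). The function names (`rising`, `multiplicity_vectors`, `ewens`) and the value $\theta = 3/2$ are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction
from math import factorial


def rising(x, n):
    """Ascending factorial x^[n] = x (x+1) ... (x+n-1)."""
    out = 1
    for j in range(n):
        out *= x + j
    return out


def multiplicity_vectors(n):
    """Yield every m = (m_1, ..., m_n) satisfying constraint (2)."""
    def parts(remaining, largest):
        # Partitions of `remaining` into non-increasing parts <= largest.
        if remaining == 0:
            yield []
            return
        for first in range(min(remaining, largest), 0, -1):
            for rest in parts(remaining - first, first):
                yield [first] + rest

    for p in parts(n, n):
        yield tuple(p.count(i) for i in range(1, n + 1))


def ewens(m, theta):
    """Ewens sampling formula (1), computed with exact fractions."""
    theta = Fraction(theta)
    n = sum(i * mi for i, mi in enumerate(m, start=1))
    denom = 1
    for i, mi in enumerate(m, start=1):
        denom *= i**mi * factorial(mi)
    return Fraction(factorial(n), denom) * theta ** sum(m) / rising(theta, n)


# The Alien's sample of Figure 1: 2 singletons, 3 pairs, 1 triple (n = 11).
alien = (2, 3, 1) + (0,) * 8
print(ewens(alien, 1))  # 1/288

# Normalization check for n = 6 and an arbitrary theta.
print(sum(ewens(m, Fraction(3, 2)) for m in multiplicity_vectors(6)))  # 1
```

Exact fractions are used so that the normalization check is an equality rather than a floating-point approximation.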

In combinatorial terms, $m$ is also called an integer partition of $n$ and $m_i$ its $i$-th multiplicity. The name refers to the way of writing the integer $n$ as an unordered sum of $m_1$ times the number 1 plus $m_2$ times the number 2, ... plus $m_n$ times the number $n$, as expressed by formula (2), which puts the vector $m$ in a one-to-one correspondence with the partition. Distribution (1) was originally developed by a biologist in a Population Genetics context [Ewens, W.J. (1972)]. The geneticist, exactly like the Alien in the example above, is often concerned with counting different types of more microscopic objects, like the possible alleles of a gene. In an evolutionary model called the Infinite Allele Model, the number of possible alleles which can be created by mutations has no upper limit. In the non-Darwinian theory of evolution, the genetic variation observed is due to purely stochastic changes modeled by the Infinite Allele Model, rather than to selective advantages of allele types; Ewens sampling formula then plays a central role in this non-Darwinian theory. See Chapter 41 of [Johnson, Kotz, Balakrishnan (1997)], by Tavaré and Ewens, for a more extensive explanation and for properties of the Ewens distribution, including moments and distributions of related statistics.

There are many examples of non-biological and non-genetic applications of species counts: rather than counting the number and relative abundance of animal species or alleles, linguists count the number of different words used by an author, archaeologists count the different types of ancient artifacts and, in reliability, one may be interested in counting the different types of faults a system can incur. A review of classical problems is, for example, [Chao, A. (2006)].

The purpose of this note is three-fold. First, in Section 2, we would like to rephrase the known derivation of Ewens formula and of some other related distributions in purely statistical terms, as the distributions of certain statistics when sampling a categorical variate. The absence of any reference to evolutionary models will clarify the importance of random integer partitions in statistical terms. Secondly, relevant mixing and limiting relationships between the distributions of several random integer partitions will be derived in Sections 3 and 4. Finally, as a by-product, a simple two-parameter extension of Ewens distribution will be obtained by mixing.

2. Random integer partitions as count statistics

Let $Y_1, \ldots, Y_n$ be a vector of exchangeable random variables, each of which can take one of $K$ possible classes (levels, species, alleles, types). Without loss of generality, we can identify the $K$ possible classes with the first $K$ positive integers. The simplest example is i.i.d.-$Y$ sampling, where $Y$ is a categorical random variable which assumes class values $1, \ldots, K$ with known class frequencies $P_1, \ldots, P_K$. Another example, which can be given a straightforward Bayesian interpretation, is conditional i.i.d.-$Y$ sampling, where $Y_1, \ldots, Y_n$ are i.i.d. $Y$ given unknown class frequencies $P_1, \ldots, P_K$ which, in turn, are given a known prior distribution $\pi(P_1, \ldots, P_K)$.

The usual statistics of interest arising in this context are the class counts

$$N_k = \#\{i : Y_i = k\}, \qquad k = 1, \ldots, K \tag{3}$$

where the symbol # indicates cardinality, so that $N_k$ is the number of sampled elements of class $k$. The distribution of $N_1, \ldots, N_K$

$$P(N_1 = n_1, \ldots, N_K = n_K) \tag{4}$$

is a probability distribution over all vectors of non-negative integers $(n_1, \ldots, n_K)$ such that $n_1 + n_2 + \ldots + n_K = n$. For example, in i.i.d.-$Y$ sampling, the class counts $N_1, \ldots, N_K$ have the multinomial distribution

$$P(N_1 = n_1, \ldots, N_K = n_K) = \frac{n!}{n_1!\, n_2! \cdots n_K!}\; P_1^{n_1} P_2^{n_2} \cdots P_K^{n_K}. \tag{5}$$


There are situations in which one is led to consider other counts, and in particular

$$M_i = \#\{k : N_k = i\}, \qquad i = 1, \ldots, n. \tag{6}$$

$M_i$ is the number of classes which are represented $i$ times in the sample. The vector $M = (M_1, \ldots, M_n)$ is a random integer partition of $n$.
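The passage from the class counts (3) to the multiplicities (6) can be sketched in a few lines of Python. The function name `multiplicity_vector` and the species labels other than elephants and kangaroos are hypothetical, chosen only to reproduce the abundances of Figure 1.

```python
from collections import Counter


def multiplicity_vector(sample):
    """Return (M_1, ..., M_n) for a sample of n categorical observations."""
    n = len(sample)
    class_counts = Counter(sample)              # N_k, as in (3)
    by_size = Counter(class_counts.values())    # M_i, as in (6)
    return tuple(by_size.get(i, 0) for i in range(1, n + 1))


# A sample with the abundances of Figure 1 (labels are illustrative).
alien_sample = (
    ["elephant"] * 3
    + ["kangaroo"] * 2 + ["zebra"] * 2 + ["lion"] * 2
    + ["owl", "seal"]
)
m = multiplicity_vector(alien_sample)
print(m)        # (2, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0)
print(sum(m))   # 6 different species observed
```

Classes that are not observed at all (count zero) never appear in the sample, so they do not contribute to any $M_i$.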

As an example, in addition to the one depicted in Figure 1, a sample with $N_1 = 3$, $N_2 = 1$ and $N_3 = 0$ from a population with $K = 3$ classes corresponds to $M_1 = 1$, $M_2 = 0$, $M_3 = 1$ and $M_4 = 0$. Other possible values for $M$ are $(2, 1, 0, 0)$, $(0, 2, 0, 0)$ and $(0, 0, 0, 1)$, whereas if $K > 3$ we should also add $M = (4, 0, 0, 0)$ as a possibility. The number of parts in the integer partition, $\sum_{i=1}^{n} M_i$, is the number of classes represented in the sample and a very important statistic in this context. In the example above, if $M = (2, 1, 0, 0)$ then $\sum_{i=1}^{n} M_i = 3$ is the observed number of different classes.

For each distribution of $N_1, \ldots, N_K$ there exists a corresponding induced distribution on $M = (M_1, \ldots, M_n)$, a random vector of size $n$. The distribution can, in principle, be computed with the usual inverse image argument, but in practice may be very complicated, except in the following well-known symmetric case (see for example [McCullagh, P. (2011)]).

In addition to the ascending factorial already defined, we also use the descending factorial notation

$$x_{[n]} = x(x-1)\cdots(x-n+1)$$

if $x \ge n$ and $x_{[n]} = 0$ otherwise. Consider the set of all vectors of class counts $(n_1^m, \ldots, n_K^m)$ with a specific integer partition $m = (m_1, \ldots, m_n)$. If each of these vectors has the same probability, say $p(m) = P(N_1 = n_1^m, \ldots, N_K = n_K^m)$, then

$$\begin{aligned}
P(M = m) &= P(M_1 = m_1, \ldots, M_n = m_n) \\
&= \binom{K}{\sum m_i}\binom{\sum m_i}{m_1, \ldots, m_n}\, p(m) \\
&= \frac{K_{[\sum m_i]}}{\prod_{i=1}^{n} m_i!}\; p(m)
\end{aligned} \tag{7}$$

because there are $\binom{K}{\sum m_i}$ ways to choose which levels are represented in the sample and, for each of them, there are $\binom{\sum m_i}{m_1, \ldots, m_n}$ ways to choose which levels are represented $1, \ldots, n$ times. Examples follow in the next section.
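The counting factor in (7) can be checked by brute force for small cases. The sketch below, with hypothetical helper names, enumerates all class-count vectors $(n_1, \ldots, n_K)$ summing to $n$ and counts those sharing a given partition $m$, then compares the result with $K_{[\sum m_i]} / \prod m_i!$. It assumes that $m$ has length $n$ and satisfies constraint (2).

```python
from collections import Counter
from itertools import product
from math import factorial, prod


def falling(x, n):
    """Descending factorial x_[n] = x (x-1) ... (x-n+1); 0 for integer x < n."""
    return prod(x - j for j in range(n))


def count_vectors_brute(K, m):
    """Count class-count vectors (n_1, ..., n_K) whose partition is m."""
    n = len(m)
    total = 0
    for counts in product(range(n + 1), repeat=K):
        if sum(counts) != n:
            continue
        by_size = Counter(c for c in counts if c > 0)
        if tuple(by_size.get(i, 0) for i in range(1, n + 1)) == tuple(m):
            total += 1
    return total


def count_vectors_formula(K, m):
    """Counting factor of (7): K_[sum m_i] / prod(m_i!)."""
    return falling(K, sum(m)) // prod(factorial(mi) for mi in m)


for m in [(1, 0, 1, 0), (2, 1, 0, 0), (0, 2, 0, 0), (0, 0, 0, 1), (4, 0, 0, 0)]:
    print(m, count_vectors_brute(3, m), count_vectors_formula(3, m))
```

With $K = 3$ the partition $(4, 0, 0, 0)$ gets count zero from both methods, in agreement with the example in the text.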


3. Symmetric random integer partitions and their mixtures

The simplest case is i.i.d.-$Y$ sampling with equal frequencies $P_1 = \ldots = P_K = 1/K$ so that, since $N_1, \ldots, N_K$ have the corresponding multinomial distribution (5), we have

$$p(m) = \frac{n!}{n_1!\, n_2! \cdots n_K!} \prod_{k=1}^{K} \left(\frac{1}{K}\right)^{n_k} = \frac{n!}{\prod_{i=1}^{n} (i!)^{m_i}}\; \frac{1}{K^n},$$

giving rise to the symmetric multinomial (SM) integer partition distribution with parameter $K$, for which expression (7) becomes

$$P(M = m) = SM(m; K) = \frac{n!}{\prod_{i=1}^{n} \left((i!)^{m_i}\, m_i!\right)}\; \frac{K_{[\sum m_i]}}{K^n}. \tag{8}$$
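Formula (8) can be verified against direct enumeration for small $n$ and $K$. The following sketch lists all $K^n$ equally likely samples, tallies the induced partitions and compares the tallies with (8); the function names and the choice $n = 4$, $K = 3$ are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import factorial, prod


def falling(x, n):
    """Descending factorial x_[n]."""
    return prod(x - j for j in range(n))


def sm(m, K):
    """Symmetric multinomial integer partition probability, formula (8)."""
    n = len(m)
    denom = prod(factorial(i) ** mi * factorial(mi)
                 for i, mi in enumerate(m, start=1))
    return Fraction(factorial(n) * falling(K, sum(m)), denom * K**n)


def sm_brute(n, K):
    """Distribution of M obtained by enumerating all K^n samples."""
    tally = Counter()
    for seq in product(range(K), repeat=n):
        sizes = Counter(Counter(seq).values())
        tally[tuple(sizes.get(i, 0) for i in range(1, n + 1))] += 1
    return {m: Fraction(c, K**n) for m, c in tally.items()}


brute = sm_brute(4, 3)
for m, p in sorted(brute.items()):
    print(m, p, sm(m, 3))
```

For $n = 4$ and $K = 3$ the partition $(4, 0, 0, 0)$ never appears in the enumeration, and (8) assigns it probability zero through the descending factorial $K_{[4]}$.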

As a well-known exchangeable alternative to i.i.d.-$Y$ sampling, consider sampling from a symmetric finite population, composed of $K$ sub-populations, one for each level of the discrete variable $Y$ and each of the same cardinality $C$, so that the class counts have the symmetric multivariate hypergeometric distribution

$$P(N_1 = n_1, \ldots, N_K = n_K) = \frac{\prod_{k=1}^{K} C_{[n_k]}}{\prod_{k=1}^{K} n_k!}\; \frac{n!}{(KC)_{[n]}}. \tag{9}$$

This produces the symmetric hypergeometric (SH) integer partition distribution with parameters $K$ and $C$, for which expression (7) becomes

$$P(M = m) = SH(m; K, C) = \frac{n!}{\prod_{i=1}^{n} \left((i!)^{m_i}\, m_i!\right)}\; \frac{K_{[\sum m_i]}}{(KC)_{[n]}} \prod_{i=1}^{n} \left(C_{[i]}\right)^{m_i}. \tag{10}$$
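A similar brute-force check is possible for formula (10). The sketch below builds a finite population of $K$ classes with $C$ individuals each, enumerates all subsets of size $n$ (sampling without replacement), and compares the induced partition frequencies with (10). Names and the values $n = 4$, $K = 3$, $C = 2$ are illustrative assumptions, and $n \le KC$ is required.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import factorial, prod


def falling(x, n):
    """Descending factorial x_[n]."""
    return prod(x - j for j in range(n))


def sh(m, K, C):
    """Symmetric hypergeometric integer partition probability, formula (10)."""
    n = len(m)
    denom = prod(factorial(i) ** mi * factorial(mi)
                 for i, mi in enumerate(m, start=1))
    num = factorial(n) * falling(K, sum(m)) * prod(
        falling(C, i) ** mi for i, mi in enumerate(m, start=1))
    return Fraction(num, denom * falling(K * C, n))


def sh_brute(n, K, C):
    """Distribution of M from all equally likely subsets of size n."""
    labels = [k for k in range(K) for _ in range(C)]   # class of each individual
    tally = Counter()
    for subset in combinations(range(K * C), n):
        sizes = Counter(Counter(labels[j] for j in subset).values())
        tally[tuple(sizes.get(i, 0) for i in range(1, n + 1))] += 1
    total = sum(tally.values())
    return {m: Fraction(c, total) for m, c in tally.items()}


brute = sh_brute(4, 3, 2)
for m, p in sorted(brute.items()):
    print(m, p, sh(m, 3, 2))
```

With $C = 2$ no class can be sampled three times, and (10) reflects this because $C_{[3]} = 0$.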

Conditional i.i.d.-$Y$ sampling instead, when the class frequencies are given a symmetric Dirichlet distribution with first parameter $K$ and shape parameters all equal to $\theta/K$, assigns to the class counts a marginal distribution which is a Dirichlet mixture of multinomial distributions, called (symmetric) Dirichlet-Multinomial or (symmetric) Multivariate Polya:

$$P(N_1 = n_1, \ldots, N_K = n_K) = \frac{n!}{n_1!\, n_2! \cdots n_K!}\; \frac{1}{\theta^{[n]}} \prod_{k=1}^{K} \left(\frac{\theta}{K}\right)^{[n_k]}.$$

We then have

$$p(m) = \frac{n!}{\prod_{i=1}^{n} (i!)^{m_i}}\; \frac{1}{\theta^{[n]}} \prod_{i=1}^{n} \left(\left(\frac{\theta}{K}\right)^{[i]}\right)^{m_i}$$

giving rise to the symmetric Dirichlet (SD) integer partition distribution with parameters $K$ and $\theta$, for which expression (7) becomes

$$P(M = m) = SD(m; K, \theta) = \frac{n!}{\prod_{i=1}^{n} \left((i!)^{m_i}\, m_i!\right)}\; \frac{K_{[\sum m_i]}}{\theta^{[n]}} \prod_{i=1}^{n} \left(\left(\frac{\theta}{K}\right)^{[i]}\right)^{m_i}. \tag{11}$$
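Formula (11) can also be checked numerically. The sketch below relies on the standard sequential (Polya urn) form of the Dirichlet-Multinomial, in which the next observation falls in class $k$ with probability $(\theta/K + n_k)/(\theta + j)$ after $j$ draws; this urn representation is a known property of the model rather than something derived in the text. Function names and parameter values are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import factorial, prod


def rising(x, n):
    """Ascending factorial x^[n]."""
    return prod(x + j for j in range(n))


def falling(x, n):
    """Descending factorial x_[n]."""
    return prod(x - j for j in range(n))


def sd(m, K, theta):
    """Symmetric Dirichlet integer partition probability, formula (11)."""
    theta = Fraction(theta)
    n = len(m)
    denom = prod(factorial(i) ** mi * factorial(mi)
                 for i, mi in enumerate(m, start=1))
    num = factorial(n) * falling(K, sum(m)) * prod(
        rising(theta / K, i) ** mi for i, mi in enumerate(m, start=1))
    return Fraction(num) / (denom * rising(theta, n))


def sd_by_urn(n, K, theta):
    """Distribution of M by enumerating sequences under the Polya urn rule."""
    theta = Fraction(theta)
    tally = {}
    for seq in product(range(K), repeat=n):
        counts, prob = [0] * K, Fraction(1)
        for j, k in enumerate(seq):
            prob *= (theta / K + counts[k]) / (theta + j)
            counts[k] += 1
        sizes = Counter(c for c in counts if c > 0)
        m = tuple(sizes.get(i, 0) for i in range(1, n + 1))
        tally[m] = tally.get(m, 0) + prob
    return tally


urn = sd_by_urn(4, 3, Fraction(3, 2))
for m, p in sorted(urn.items()):
    print(m, p, sd(m, 3, Fraction(3, 2)))
```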

This is the case that produces Ewens formula (1) as $K \to \infty$, as recalled in the next section. All these distributions were considered by Hoppe (see [Hoppe, F.M. (2008)]), who studies formal analogies with Faà di Bruno's formula for the $n$-th derivative of a composite function but does not consider limits, as we do in the next section.

Figure 2. DAG for $SD(m; K, \theta)$.

We can represent the SD case pictorially with a directed acyclic graph (DAG), a common tool used in Bayesian Statistics to represent mixtures, shown in Figure 2. Here, boxes (like $\theta$) represent known constants, ovals represent random variables, single arrows represent conditional dependence (or better, the absence of arrows represents conditional independence), double arrows represent deterministic functions and, finally, “sheets” represent loops.

The mixing operation used in conditional i.i.d. sampling may, in turn, be applied to $SD(m; K, T)$, say, if we now consider the parameter $T$ (or $K$, or both) random, with a distribution which in turn depends on known (hyper-)parameters. For example, we could consider $T$ random, as pictured in Figure 3, and assign it a gamma distribution such that $E(T) = \theta$ and $\mathrm{Var}(T) = \alpha$, to obtain for $P(M = m)$ the following mixture of symmetric Dirichlet integer partition distribution (MSD):

$$MSD(m; K, \theta, \alpha) = \frac{n!}{\prod_{i=1}^{n} \left((i!)^{m_i}\, m_i!\right)}\; \frac{b^a\, K_{[\sum m_i]}}{\Gamma(a)} \int_0^\infty \frac{t^{a-1} \exp\{-bt\}}{t^{[n]}} \prod_{i=1}^{n} \left(\left(\frac{t}{K}\right)^{[i]}\right)^{m_i} dt$$

where $a = \theta^2/\alpha$ and $b = \theta/\alpha$.


Figure 3. DAG for $MSD(m; K, \theta, \alpha)$.

4. Limiting relationships

The distributions of random partitions considered so far are connected by limiting operations, in addition to the mixing operations considered above. In addition to Ewens random integer partition described by (1) and to the examples provided in the previous section, we now consider two more classes of random integer partitions: mixtures of Ewens random integer partitions and degenerate random integer partitions. As a primary example, if $T$ has the gamma distribution above, then the mixture of Ewens integer partition distribution (MES) is

$$MES(m; \theta, \alpha) = \frac{n!}{\prod_{i=1}^{n} i^{m_i}\, m_i!}\; \frac{b^a}{\Gamma(a)} \int_0^\infty \frac{t^{a + \sum m_i - 1}}{t^{[n]}}\, e^{-bt}\, dt \tag{12}$$

$$= \frac{n!}{\prod_{i=1}^{n} i^{m_i}\, m_i!}\; \frac{b^a}{\Gamma(a)} \sum_{k=1}^{n-1}\; \prod_{\substack{i=1 \\ i \ne k}}^{n-1} (-k + i)^{-1}\, E_k \tag{13}$$

where $a = \theta^2/\alpha$ and $b = \theta/\alpha$ and

$$E_k = \int_0^\infty \frac{t^{a + \sum m_i - 2}}{t + k}\, e^{-bt}\, dt,$$

as can be shown by expanding $1/t^{[n]}$ into partial fractions. For integrals of this kind, standard numerical methods are available.
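The partial fraction step behind (13) amounts to the identity $1/t^{[n]} = t^{-1} \sum_{k=1}^{n-1} A_k/(t+k)$ with $A_k = \prod_{i \ne k} (i-k)^{-1}$, valid for $n \ge 2$. The following sketch checks it exactly at an arbitrary rational point; the function names and the test point $t = 7/3$ are assumptions for illustration.

```python
from fractions import Fraction


def rising(x, n):
    """Ascending factorial x^[n] = x (x+1) ... (x+n-1)."""
    out = 1
    for j in range(n):
        out *= x + j
    return out


def coefficient(k, n):
    """A_k = product over i = 1..n-1, i != k, of (i - k)^(-1), as in (13)."""
    out = Fraction(1)
    for i in range(1, n):
        if i != k:
            out /= (i - k)
    return out


def reciprocal_rising(t, n):
    """1 / t^[n] rebuilt from its partial fraction expansion (n >= 2)."""
    return sum(coefficient(k, n) / (t + k) for k in range(1, n)) / t


t = Fraction(7, 3)
for n in range(2, 8):
    print(n, reciprocal_rising(t, n) == 1 / rising(t, n))
```

Multiplying both sides by $t^{a + \sum m_i - 1} e^{-bt}$ and integrating term by term gives the integrals $E_k$ of the text.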

The degenerate random integer partition distribution $\delta_m$ is defined as assigning probability one to the single partition $m$.

The configuration with $n$ classes with one observation each

$$m^* = (m_1 = n,\; m_2 = 0,\; \ldots,\; m_n = 0),$$

for example, represents a very sparse situation in which each new observation brings a new species into the sample. It is intuitive that this is the only possible limit for a symmetric multinomial integer partition distribution as $K \to \infty$, since in that case the probability of class repetition in the sample goes to 0.

At the other extreme, another interesting degenerate configuration is one class with all $n$ observations

$$m_* = (m_1 = 0,\; m_2 = 0,\; \ldots,\; m_n = 1),$$

which occurs when all observations are from a single class. In Genetics, this phenomenon is called fixation and it happens when a strongly favored allele eventually dominates over every other allele.

The corresponding degenerate distributions $\delta_{m^*}$ and $\delta_{m_*}$ are therefore of particular interest.

We now collect in the following theorem the precise statements about thelimiting relationships, some of which are known:

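Theorem 1. The following limits hold for every vector $m$ of non-negative integers satisfying constraint (2):

1. $SH(m; K, C) \to SM(m; K)$ as $C \to \infty$

2. $SM(m; K) \to \delta_{m^*}$ as $K \to \infty$

3. $SD(m; K, \theta) \to ES(m; \theta)$ as $K \to \infty$

4. $ES(m; \theta) \to \delta_{m^*}$ as $\theta \to \infty$

5. $ES(m; \theta) \to \delta_{m_*}$ as $\theta \to 0$

6. $MSD(m; K, \theta, \alpha) \to SD(m; K, \theta)$ as $\alpha \to 0$

7. $MES(m; \theta, \alpha) \to ES(m; \theta)$ as $\alpha \to 0$

8. $MSD(m; K, \theta, \alpha) \to MES(m; \theta, \alpha)$ as $K \to \infty$.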

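Proof. To prove case 1, notice that $(KC)_{[n]}$ is asymptotic to $(KC)^n$, which we write $(KC)_{[n]} \approx (KC)^n$, and use formula (2) to obtain, as $C \to \infty$,

$$\frac{\prod_{i=1}^{n} \left(C_{[i]}\right)^{m_i}}{(KC)_{[n]}} \approx \frac{\prod_{i=1}^{n} \left(C_{[i]}\right)^{m_i}}{(KC)^n} = \frac{1}{K^n} \prod_{i=1}^{n} \left(\frac{C_{[i]}}{C^i}\right)^{m_i} \to \frac{1}{K^n}.$$

To prove case 3, write, as $K \to \infty$,

$$\begin{aligned}
&\frac{n!}{\prod_{i=1}^{n} \left((i!)^{m_i}\, m_i!\right)}\; \frac{K_{[\sum m_i]}}{\theta^{[n]}} \prod_{i=1}^{n} \left(\left(\frac{\theta}{K}\right)^{[i]}\right)^{m_i} \\
&\quad = \frac{n!}{\prod_{i=1}^{n} \left(i^{m_i}\, m_i!\right)}\; \frac{K_{[\sum m_i]}}{\theta^{[n]}} \prod_{i=1}^{n} \left(\frac{\theta}{K} \left(\frac{\theta}{1K} + \frac{1}{1}\right) \left(\frac{\theta}{2K} + \frac{2}{2}\right) \cdots \left(\frac{\theta}{(i-1)K} + \frac{i-1}{i-1}\right)\right)^{m_i} \\
&\quad \approx \frac{n!}{\prod_{i=1}^{n} i^{m_i}\, m_i!}\; \frac{K_{[\sum m_i]}}{K^{\sum m_i}}\; \frac{\theta^{\sum m_i}}{\theta^{[n]}}
\end{aligned}$$

and notice that, in the limit, we obtain Ewens formula. The same technique can easily be used to prove case 8, since the limit and the integral signs can be exchanged, by dominated convergence. To prove case 2, notice that if $m = m^*$, then

$$SM(m^*; K) = \frac{n!}{\prod_{i=1}^{n} (i!)^{m_i}\, m_i!}\; \frac{K_{[n]}}{K^n} \to 1$$

as $K \to \infty$. With similar inspections we can prove cases 4 and 5. Finally, cases 6 and 7 illustrate the cases of degenerate mixing distributions with null variance.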
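Cases 1 to 5 of Theorem 1 can be illustrated numerically. The floating-point sketch below computes, for $n = 5$, the largest absolute gap over all partitions between SH and SM as $C$ grows and between SD and ES as $K$ grows, and evaluates the probabilities of the two degenerate configurations at extreme parameter values. All names and parameter values ($K = 4$, $\theta = 1.5$, and so on) are illustrative assumptions; the sketch is a sanity check, not a proof.

```python
from math import factorial, prod


def rising(x, n):
    return prod(x + j for j in range(n))      # ascending factorial x^[n]


def falling(x, n):
    return prod(x - j for j in range(n))      # descending factorial x_[n]


def multiplicity_vectors(n):
    """Yield every m = (m_1, ..., m_n) satisfying constraint (2)."""
    def parts(remaining, largest):
        if remaining == 0:
            yield []
            return
        for first in range(min(remaining, largest), 0, -1):
            for rest in parts(remaining - first, first):
                yield [first] + rest
    for p in parts(n, n):
        yield tuple(p.count(i) for i in range(1, n + 1))


def coef(m, ewens_style):
    """n!/prod(i^m_i m_i!) if ewens_style, else n!/prod((i!)^m_i m_i!)."""
    base = (lambda i: i) if ewens_style else factorial
    return factorial(len(m)) / prod(
        base(i) ** mi * factorial(mi) for i, mi in enumerate(m, start=1))


def es(m, theta):                              # formula (1)
    return coef(m, True) * theta ** sum(m) / rising(theta, len(m))


def sm(m, K):                                  # formula (8)
    return coef(m, False) * falling(K, sum(m)) / K ** len(m)


def sh(m, K, C):                               # formula (10)
    return (coef(m, False) * falling(K, sum(m)) / falling(K * C, len(m))
            * prod(falling(C, i) ** mi for i, mi in enumerate(m, start=1)))


def sd(m, K, theta):                           # formula (11)
    return (coef(m, False) * falling(K, sum(m)) / rising(theta, len(m))
            * prod(rising(theta / K, i) ** mi for i, mi in enumerate(m, start=1)))


n = 5
m_upper = (n,) + (0,) * (n - 1)                # all singletons
m_lower = (0,) * (n - 1) + (1,)                # a single class


def gap_sh(C):                                 # case 1
    return max(abs(sh(m, 4, C) - sm(m, 4)) for m in multiplicity_vectors(n))


def gap_sd(K):                                 # case 3
    return max(abs(sd(m, K, 1.5) - es(m, 1.5)) for m in multiplicity_vectors(n))


print([gap_sh(C) for C in (10, 1000, 10**6)])
print([gap_sd(K) for K in (10, 1000, 10**6)])
print(sm(m_upper, 10**6), es(m_upper, 1e6), es(m_lower, 1e-6))  # cases 2, 4, 5
```

The gaps shrink as $C$ or $K$ increases, and the three probabilities on the last line approach one, as the theorem predicts.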

The limits help one to understand better the different distributions of random partitions and their genesis. Based on their sampling interpretation, it is predictable that the symmetric hypergeometric case converges to the symmetric multinomial, and it does. It is less obvious, although well known now, that the symmetric Dirichlet converges to Ewens' formula, which demonstrates the centrality of this case to the theory. The degenerate limit $\delta_{m^*}$ is reached in case 2 of Theorem 1: when the population frequencies uniformly go to 0, the multinomial sampling scheme is very sparse and every new trial gives rise to a level which has not yet been sampled. It may be more natural to think of $\delta_{m^*}$ as such a limiting case of multinomial sampling, rather than a special case of the symmetric multivariate hypergeometric distribution for $C = 1$, as commented in [Hoppe, F.M. (2008), page 547].

The same situation of fragmented sampling arises when $\theta \to \infty$ (case 4), whereas, at the other extreme (case 5), sampling produces no variability when $\theta \to 0$.

The mixture cases, both for finite and infinite $K$, are straightforward generalizations of the symmetric Dirichlet case. Another more complex two-parameter generalization of Ewens' formula is studied in [Pitman, J. (2003)]. We could not establish whether Pitman's two-parameter distribution can be represented as a limiting mixture of the symmetric Dirichlet $SD(m; K, T)$. Certainly, if that were the case, the mixing distribution would depend on both $T$ and $K$, since mixing distributions independent of $K$ are covered by case 8 of Theorem 1. On the other hand, it should also be stressed that this note is an example of how far one can go with random partitions with simple tools “from below”, i.e. without reference to stochastic process priors like the Dirichlet process, the GEM distribution and their generalizations giving rise to Pitman's case. In this regard, it should also be stressed that random integer partitions are an interesting non-standard example of sampling distributions that can be used for teaching graduate courses.


5. Conclusions

If we consider all the above mixing and limiting operations together, we can represent them pictorially as in Figure 4.

Figure 4. Pictorial representation of mixing and limiting relationships.

Random partitions are increasingly used as models in diverse statistical applications such as clustering, species sampling, Bayesian statistics and machine learning. Ewens formula and related distributions, derived by different sampling schemes, mixing and limits, acquire in this way new meaning outside the biological context where they were born.

Acknowledgments

The author wishes to thank one anonymous referee for suggestions which improved the original submission.

REFERENCES

Chao, A. (2006) Species Estimation and Applications, Encyclopedia of Statistical Sciences, Wiley.

Ewens, W.J. (1972) The sampling theory of selectively neutral alleles, Theoretical Population Biology, 3, 87–112.

Hoppe, F.M. (2008) Faà di Bruno's formula and the distributions of random partitions in population genetics and physics, Theoretical Population Biology, 73, 543–551.

Johnson, N.L., Kotz, S. and Balakrishnan, N. (1997) Discrete Multivariate Distributions, Wiley, 232–246.

McCullagh, P. (2011) Partition models (version 20), In: StatProb: The Encyclopedia Sponsored by Statistics and Probability Societies, freely available at http://statprob.com/encyclopedia/PartitionModels.html

Pitman, J. (2003) The two-parameter generalization of Ewens' random partition structure, Department of Statistics, U.C. Berkeley, 1992, updated 2003, freely available on the Internet.

MAURO GASPARINI
Dipartimento di Scienze Matematiche
Politecnico di Torino
C.so Duca degli Abruzzi, 24
10129 Torino (Italia)
[email protected]
http://calvino.polito.it/~gasparin