

Post on 03-Aug-2020


Page 1: HOW MANY MODES CAN A TWO COMPONENT MIXTURE …math.bu.edu/people/sray/preprints/paper_sinica.pdfSurajit Ray and Dan Ren Boston University Abstract: The main result of this article

HOW MANY MODES CAN A TWO COMPONENT MIXTURE HAVE?

Surajit Ray and Dan Ren

Boston University

Abstract: The main result of this article states that one can get as many as D + 1 modes from a two component normal mixture in D dimensions. Multivariate mixture models are widely used for modeling homogeneous populations and for cluster analysis. Either the components directly or modes arising from these components are often used to extract individual clusters. Though these strategies work well in lower dimensions, our results show that high dimensional mixtures are often very complex and researchers should take extra precautions when using mixtures for cluster analysis. Even in the simplest case of mixing only two normal components in D dimensions, we show that the mixture can have as many as D + 1 modes. When we mix more components, or if the components are non-normal, the number of modes might be even higher, which might lead to wrong inference on the number of clusters. Further analyses show that the number of modes depends on the component means and the eigenvalues of the ratio of the two component covariance matrices, which in turn provides a clear guideline as to when one can use mixture analysis for clustering high dimensional data.

Key words and phrases: Mixture, modal cluster, multivariate mode, clustering, dimension reduction, topography, manifold

1 Introduction

1.1 Number of modes of a normal mixture

Multivariate normal mixtures provide a flexible method of fitting high-dimensional data.

This fit often provides a primary data reduction through the number, location and shape

of its components. However, a more interesting question relates to the exploration of how

components interact to describe an overall pattern of density. Of particular interest is finding

the number of modes the density displays. The relation between the number of modes and

number of components is not one to one. Often modes are used to determine the number of

homogeneous groups in a population (Li et al., 2007; McLachlan and Peel, 2000; Titterington

et al., 1985). Modes of densities are also widely used to summarize posterior distributions in

Bayesian analysis (Berger, 1985; Lehmann and Casella, 1998) and to build a Bayesian inferential framework.


The main result of this paper is summarized in the following theorem:

Theorem 1. A D dimensional normal mixture of two components has at most D + 1 modes

and a mixture with D + 1 modes always exists in D dimensions.

In one dimension a two-component normal mixture can display one or two modes. But the density shapes become complex in higher dimensions. For example, a two-component normal mixture in two dimensions can give rise to one, two or three modes (see Ray and Lindsay, 2005, for a three mode example). Ray and Lindsay (2005) provide more examples in two and three dimensions where the number of modes exceeds the number of mixing components. But besides these pathological examples there is no result on the upper bound of the number of modes that a mixture of normals can display. This paper provides the first set of results on the upper bound for the number of modes of a two-component normal mixture. We also show that this bound is tight, i.e., we can provide numerical values for a mixture which attains this upper bound.
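As a small numerical illustration of the one-dimensional statement above, the following sketch counts the local maxima of a univariate two-component normal mixture on a grid. The function names, the grid limits and the parameter values are illustrative assumptions and are not taken from the paper; a grid search of this kind can miss modes that lie closer together than the grid spacing, so it is a sanity check rather than a proof.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, pi, mean1, sd1, mean2, sd2):
    """Two-component mixture density: pi * phi_1(x) + (1 - pi) * phi_2(x)."""
    return pi * normal_pdf(x, mean1, sd1) + (1.0 - pi) * normal_pdf(x, mean2, sd2)

def count_modes_1d(pi, mean1, sd1, mean2, sd2, n_grid=20001):
    """Count interior local maxima of the mixture density on a uniform grid."""
    lo = min(mean1 - 5.0 * sd1, mean2 - 5.0 * sd2)
    hi = max(mean1 + 5.0 * sd1, mean2 + 5.0 * sd2)
    step = (hi - lo) / (n_grid - 1)
    values = [mixture_pdf(lo + i * step, pi, mean1, sd1, mean2, sd2)
              for i in range(n_grid)]
    modes = 0
    for i in range(1, n_grid - 1):
        # A grid point is a peak if the density rises into it and does not rise after it.
        if values[i - 1] < values[i] >= values[i + 1]:
            modes += 1
    return modes

# Hypothetical parameter choices: well-separated versus overlapping components.
separated = count_modes_1d(0.5, 0.0, 1.0, 3.0, 1.0)
overlapping = count_modes_1d(0.5, 0.0, 1.0, 1.0, 1.0)
print(separated, overlapping)
```

With equal weights and unit variances, the first choice (means three standard deviations apart) is bimodal and the second (means one standard deviation apart) is unimodal, which are the only two possibilities in one dimension.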

It is well known that the topography of a mixture of distributions, in the sense of its key features as a density, is often extremely complex. Among the different features of the topography we are especially interested in the number of modes the density displays, referred to as the modality of the density from here on. Ray and Lindsay (2005) provide a detailed understanding of the topography of mixtures of normal distributions in terms of the means and variances of the component distributions. But how these density shapes respond to rotation or scaling based on the component covariances is not well studied. For example, it is not clear whether rotation and scaling retain all the modes after transformation. In this paper we present a set of results showing the invariance of modality of normal mixtures under the operations of translation, scaling and rotation. These results allow us to show that the modality of a two-component mixture of normals with arbitrary variance-covariance matrices is mathematically equivalent to that of a mixture of normals in which one component has a spherical covariance and the other has an appropriate diagonal covariance matrix of the same dimension. A follow-up analysis shows that the number of modes is closely related to the number of unique eigenvalues of the ratio of the covariance matrices, in a matrix sense (inverse of one matrix multiplied by the other matrix). Finally we use these results to arrive at the main result on the tight upper bound on the number of modes.

1.2 Relevant Literature

Studies of the number of modes of normal mixtures date back to the beginning of the twentieth century, but until recently the results have focused primarily on univariate mixtures. In fact, there is a simple description of modality when one is mixing two univariate normal components. Helguero (1904) determined necessary and sufficient conditions for bimodality in the mixture of two univariate normals with equal variances and mixing proportions. More research on univariate mixture cases followed. For example, Eisenberger (1964) investigated the conditions for bimodality in the mixture of two univariate normals with arbitrary variance and mixing proportions, and Behboodian (1970) derived a sufficient condition for unimodal mixture densities. Kakiuchi (1981) and Kemperman (1991) then extended the problem to mixtures of non-normal distributions, and derived corresponding necessary and sufficient conditions. In the context of multivariate normal mixtures, a recent result by Carreira-Perpinan and Williams (2003) shows that for any D-dimensional normal mixture, the number of modes cannot exceed the number of components if each component has the same covariance matrix up to a scalar scaling factor. The most recent and comprehensive results in this area of research are provided by Ray and Lindsay (2005), who present the most general modality results for arbitrary dimensions, number of components and component variance structure. The key result in Ray and Lindsay (2005) shows that the topography of multivariate mixtures, in the sense of their key features as a density, can be analyzed rigorously in lower dimensions by use of a ridgeline manifold that contains all critical points as well as the ridges of the density. This important topographical result allows them to solve for the number of modes both analytically and numerically. Besides solving for the number of modes, Ray and Lindsay (2005) provide pathological examples of more modes than components in more than one dimension. A comprehensive summary of the above results is available in Fruhwirth-Schnatter (2006) and a recent review paper by Melnykov and Maitra (2010). Much of the modality theory discussed in Ray and Lindsay (2005) has been widely used for developing clustering techniques by Ray and Lindsay (2008); Coretto and Hennig (2010); Hennig (2010b) and Hennig (2010a) and for the advancement of likelihood based inference for normal mixtures by Chen and Tan (2009); Holzmann and Vollmer (2008); Dannemann and Holzmann (2008) and Lindsay et al. (2008). Applications of these results are found in new areas of research such as signal processing (Li, 2007; Scott et al., 2009) and image retrieval (Sfikas et al., 2005).

Using the modality theorem in the special case of a two-component normal mixture, Ray and Lindsay (2005) provide examples of three modes in two dimensions, and four modes in three dimensions. These mixtures have unequal covariance matrices, but they are limited to being diagonal in structure. Providing an upper bound on the number of modes for mixtures in arbitrary dimensions with arbitrary component variance-covariance matrices remained an unresolved problem.

1.3 Our Results

The main contribution of this paper is to provide a tight upper bound for the number of modes of a two-component normal mixture for arbitrary dimension and arbitrary component variance-covariance matrices.

Let us denote the dimension of the multivariate normal density by D and the number of components of the mixture by K. In this paper, we only consider two-component normal mixture cases, i.e., K = 2; the corresponding parameters for each normal density are their means $\mu_i$ and variance-covariance matrices $\Sigma_i$, i = 1, 2. Let $\pi$ and $\bar{\pi} = 1 - \pi$ be the respective proportions of the two densities. It can be shown that for specified means and variances the number of modes depends on the mixing proportions. In fact, Ray and Lindsay (2005) provide examples of mixtures where different ranges of $\pi$ display one, two and three modes for the same means and variance-covariance matrices. But one should notice that the specification of $\pi$ is irrelevant in the context of determining the maximal number of modes displayed by a mixture of two components. In other words we are asking the following question: given a pair of component means and covariance matrices, what is the maximum number of modes the mixture can display if one has complete freedom in choosing the mixing proportion $\pi$? Hence we will ignore the parameter $\pi$ in our analysis, and for notational ease we will denote a D dimensional mixture of two components with means $\mu_1$ and $\mu_2$ and variances $\Sigma_1$ and $\Sigma_2$ by $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$. Our main result shows that the number of modes for the above mixture is bounded above by D + 1 and that the bound is achievable for any D. In fact we provide a recursive algorithm to construct the parameters of the component densities which attain this bound.

Modes are defined as the local maxima of the density height, and understanding the modes requires understanding the topography of the density along with its higher order features. Many of the results we will use in this paper are based on these higher order features of normal mixtures, defined in terms of the Π-function (different from the omitted parameter π) and curvature functions defined in Ray and Lindsay (2005). So, in Section 2 we will first define the terminology and state some of the important results from Ray and Lindsay (2005) which will be used in this paper. In particular we will present the concept of Π-functions and curvature functions of a mixture, which have the advantage of being expressed explicitly in terms of means and variances of components while retaining full information about the topography and hence the number of modes of a mixture. Moreover the Π-function and curvature function attain a very simple form for a two-component normal mixture. This simplification of the curvature function allows us to show that the number of modes of a two-component mixture is explicitly determined by the number of roots of the curvature function within the range [0,1].

But the roots of the curvature function defined in Section 2 are very difficult to study for arbitrary mixtures. Ray and Lindsay (2005) explore the roots of curvature functions only in the case of diagonal covariance matrices up to three dimensions. In this paper we seek to generalize the modality results to arbitrary dimensions and component variance-covariance matrices. To arrive at these results, in Section 3 we first show that the modality of an arbitrary D-dimensional normal mixture $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$ remains unchanged under any translation and a specified scaling and rotation of the random variable. These results will be enormously helpful as they allow us to study the topography of an arbitrary D-dimensional normal mixture by exploring the topography of a simplified class of normal mixtures with the first component being a standard normal and the second component having a diagonal covariance matrix. We denote this class by $NM(0, I, \mu, \Lambda)_D$, where 0 and $\mu$ are both D dimensional means, I is an identity matrix and $\Lambda$ is a diagonal matrix of dimension D. These results are derived analytically and examples are provided to illustrate them.

In Section 4 we explore the modality of normal mixtures of the form $NM(0, I, \mu, \Lambda)_D$. We show that the maximum number of modes is constrained by d, the number of distinct diagonal entries in $\Lambda$. In fact the modality of such a normal mixture with d distinct diagonal entries is less than or equal to d + 1. Since d can be equal to the dimension D, we arrive at the first part of our result, showing that any arbitrary D dimensional two-component normal mixture can have at most D + 1 modes. The tightness of the stated bound is established by providing a recursive method for the construction of two-component normal mixtures which achieve this bound. In this section we also show that many previous modality results can be stated as special cases of our generalized result. For D = 1, this can be used to prove the univariate results in Helguero (1904) and Robertson and Fryer (1969). For D = 2 and D = 3 our results show that the examples in Ray and Lindsay (2005) achieve the upper limit of the number of modes in their respective dimensions.

Section 5 provides some discussion and further research directions regarding the number of modes of multivariate normal mixtures of more than two components. Generalization of the modality of mixtures of multivariate normals to multivariate t densities and then ultimately to multivariate elliptical distributions will also be discussed in this section.

2 Topography of multivariate normals

In this section we state some important results from Ray and Lindsay (2005) that will be

extensively used in this paper. The rest of the paper will use the notations defined in this

section. Readers familiar with the results in Ray and Lindsay (2005) may skip this section.

Ray and Lindsay (2005) present a unified theory for understanding the topography of

high dimensional normal mixtures. Their main result shows that the topography of mixtures,

in the sense of their key features as a density, can be analyzed rigorously in lower dimensions

by use of a ridgeline manifold that contains all critical points as well as the ridges of the

density.

A K-component mixture of D-dimensional normals can be represented by the probability density function

$$g(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \pi_2 \phi(x; \mu_2, \Sigma_2) + \dots + \pi_K \phi(x; \mu_K, \Sigma_K), \quad x \in \mathbb{R}^D,$$

where $\pi_j$ is the mixing proportion of component j, $\pi_j \in [0, 1]$, $\sum_{j=1}^{K} \pi_j = 1$, and $\phi(x; \mu, \Sigma)$ is the density of a multivariate normal distribution with mean $\mu$ and variance $\Sigma$. We will sometimes use $\phi_j(x)$ as shorthand notation for $\phi(x; \mu_j, \Sigma_j)$, and call $\phi_j$ the jth component density.

2.1 The K-1 dimensional ridgeline manifold

Definition 1. The K − 1 dimensional set of points

$$S_K = \left\{ \alpha \in \mathbb{R}^K : \alpha_i \in [0, 1], \ \sum_{i=1}^{K} \alpha_i = 1 \right\}$$

will be called the unit simplex. The function $x^*(\alpha)$ from $S_K$ into $\mathbb{R}^D$ defined by

$$x^*(\alpha) = \left[ \alpha_1 \Sigma_1^{-1} + \alpha_2 \Sigma_2^{-1} + \dots + \alpha_K \Sigma_K^{-1} \right]^{-1} \left[ \alpha_1 \Sigma_1^{-1} \mu_1 + \alpha_2 \Sigma_2^{-1} \mu_2 + \dots + \alpha_K \Sigma_K^{-1} \mu_K \right]$$

will be called the ridgeline function. It will sometimes be written as $x^*_\alpha$. The image of this map will be denoted by M and called the ridgeline surface or manifold. If K = 2, it will be called the ridgeline as it is a one-dimensional curve.

Theorem 2. (Ray and Lindsay, 2005) Let g(x) be the density of a K-component multivariate normal mixture as given above. Then all of g(x)'s critical points, and hence modes, antimodes and saddle points, are points in M.

The previous result states that instead of exploring the whole $\mathbb{R}^D$ space to find modes, we now only need to concentrate on the ridgeline manifold, which is parameterized by the (K − 1)-dimensional unit simplex. In this paper we only deal with two components, and for K = 2 the ridgeline can be represented as

$$x^*(\alpha) = S_\alpha^{-1} \left[ \bar{\alpha} \Sigma_1^{-1} \mu_1 + \alpha \Sigma_2^{-1} \mu_2 \right], \quad \text{where } S_\alpha = \left[ \bar{\alpha} \Sigma_1^{-1} + \alpha \Sigma_2^{-1} \right], \tag{1}$$

$\alpha \in [0, 1]$ and $\bar{\alpha} = 1 - \alpha$. As $\alpha$ varies from 0 to 1, the image of the function $x^*(\alpha)$ defines a curve from $\mu_1$ to $\mu_2$, and the critical points of the D-dimensional mixture can be explored by evaluating the height of the density along the curve $x^*(\alpha)$. Thus we next consider the diagnostic properties of the elevation plot along the curve $x^*(\alpha)$, defined by

$$h(\alpha) = g(x^*(\alpha)).$$

We will call $h(\alpha)$ the ridgeline elevation function. Analytically, the number of peaks of $h(\alpha)$ is exactly the maximum number of modes the mixture can display. In some cases a visual inspection of $h(\alpha)$ or numerical root finding methods might allow us to enumerate the critical points of $h(\alpha)$ and hence the number of modes. But depending on the resolution, numerical methods can always miss some zero crossings. Moreover, numerical solutions will not serve the purpose of this paper, which focuses on determining the upper bound on the number of modes. Hence we focus our attention on finding analytical solutions for the critical points of $h(\alpha)$ in order to find the number of modes of the mixture.
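The ridgeline and elevation function for K = 2 can be sketched in a few lines when both covariance matrices are diagonal, because $S_\alpha$ is then diagonal too and equation (1) can be applied coordinate by coordinate. The sketch below is only an illustration under that assumption: the function names and the parameter values are hypothetical, and the mixing proportion (called `pi` here and attached to the first component) is an arbitrary choice.

```python
import math

def ridgeline_point(alpha, mean1, var1, mean2, var2):
    """x*(alpha) from equation (1) when both covariance matrices are diagonal.

    mean1, mean2 are lists of component means; var1, var2 are lists holding
    the diagonal entries of Sigma_1 and Sigma_2.
    """
    abar = 1.0 - alpha
    point = []
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        s = abar / v1 + alpha / v2                  # diagonal entry of S_alpha
        point.append((abar * m1 / v1 + alpha * m2 / v2) / s)
    return point

def diag_normal_pdf(x, mean, var):
    """Density of a normal distribution with diagonal covariance."""
    log_density = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_density += -0.5 * math.log(2.0 * math.pi * vi) - 0.5 * (xi - mi) ** 2 / vi
    return math.exp(log_density)

def elevation(alpha, pi, mean1, var1, mean2, var2):
    """h(alpha) = g(x*(alpha)) with g = pi * phi_1 + (1 - pi) * phi_2."""
    x = ridgeline_point(alpha, mean1, var1, mean2, var2)
    return (pi * diag_normal_pdf(x, mean1, var1)
            + (1.0 - pi) * diag_normal_pdf(x, mean2, var2))

# Hypothetical two-dimensional parameters, chosen only for illustration.
mean1, var1 = [0.0, 0.0], [1.0, 1.0]
mean2, var2 = [2.0, -1.0], [4.0, 0.25]

start = ridgeline_point(0.0, mean1, var1, mean2, var2)    # the curve starts at mean1
end = ridgeline_point(1.0, mean1, var1, mean2, var2)      # and ends at mean2
middle = ridgeline_point(0.5, mean1, var1, mean2, var2)
profile = [elevation(k / 100.0, 0.5, mean1, var1, mean2, var2) for k in range(101)]
print(start, middle, end)
```

Plotting `profile` against the grid of α values gives an elevation plot of the kind discussed above; as the text cautions, reading off the number of peaks from such a grid depends on its resolution.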

2.2 The curvature function

To find the number of modes, first note that $\alpha$ is a critical point of $h(\alpha)$ if it satisfies

$$h'(\alpha) = \pi \, \phi_1(x^*(\alpha))' + \bar{\pi} \, \phi_2(x^*(\alpha))' = 0,$$

where prime ′ denotes differentiation with respect to $\alpha$. Writing $\phi_j(\alpha)$ for $\phi_j(x^*(\alpha))$, solving the last displayed equation for $\pi$ and turning it into a function of $\alpha$, we get:

$$\Pi(\alpha) = \frac{\phi_2'(\alpha)}{\phi_2'(\alpha) - \phi_1'(\alpha)}.$$

As we are just interested in the number of modes we can examine the number of up and down oscillations of the function $\Pi$. Section 4 of Ray and Lindsay (2005) shows that the number of up-down oscillations of $\Pi$ is given by n, the number of zeroes of

$$\Pi'(\alpha) = - \frac{\phi_2''(\alpha) \phi_1'(\alpha) - \phi_1''(\alpha) \phi_2'(\alpha)}{\left( \phi_2'(\alpha) - \phi_1'(\alpha) \right)^2}.$$

In general, to determine the sign changes of $\Pi'$ we can use any function of $\alpha$ with the same numerator $\phi_2''(\alpha) \phi_1'(\alpha) - \phi_1''(\alpha) \phi_2'(\alpha)$, provided the denominator is a positive function of $\alpha$. Using the denominator $\phi_1(\alpha) \phi_2(\alpha)$ instead of $(\phi_2'(\alpha) - \phi_1'(\alpha))^2$, the curvature function $\kappa(\alpha)$ is defined as:

$$\kappa(\alpha) = \frac{\phi_2''(\alpha)}{\phi_2(\alpha)} \frac{\phi_1'(\alpha)}{\phi_1(\alpha)} - \frac{\phi_1''(\alpha)}{\phi_1(\alpha)} \frac{\phi_2'(\alpha)}{\phi_2(\alpha)}. \tag{2}$$

We use $\kappa(\alpha)$ as it results in a simple expression for any distribution belonging to the exponential family. It is closely related to the mixture curvature measures given by Lindsay (1983).

2.3 Properties of the Curvature function κ(α)

We now study the curvature function κ(α) more closely, as it will be extensively used to

prove the results in Section 3 and Section 4.

The following result provides a simple expression for the curvature of a mixture of normals.

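Theorem 3. (Ray and Lindsay, 2005) Let g(x) be the mixture of two multivariate normal densities. Then the curvature function in (2) is given by

$$\kappa(\alpha) = [p(\alpha)]^2 \, [1 - \alpha \bar{\alpha} \, p(\alpha)],$$

where

$$p(\alpha) = (\mu_2 - \mu_1)' \, \Sigma_1^{-1} S_\alpha^{-1} \Sigma_2^{-1} S_\alpha^{-1} \Sigma_2^{-1} S_\alpha^{-1} \Sigma_1^{-1} \, (\mu_2 - \mu_1). \tag{3}$$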

By the expression above, $p(\alpha)$ is always positive. Thus the zeroes of $\kappa(\alpha)$ are the same as the zeroes of $1 - \alpha \bar{\alpha} \, p(\alpha)$. For notational ease, let us denote

$$q(\alpha) = 1 - \alpha \bar{\alpha} \, p(\alpha). \tag{4}$$

By calculation, q(0) = q(1) = 1 and hence $\kappa$ takes positive values at the two extremes $\alpha = 0$ and $\alpha = 1$. Thus, there is an even number of sign changes of the function $\kappa(\alpha)$ in the range [0,1], as also indicated by the nature of $\Pi$. In particular, at the first zero, $\alpha_1$, of $\kappa$ the function $\Pi$ has a maximum, at the next zero, $\alpha_2$, a minimum, and so forth. Thus we arrive at the following result relating the number of solutions of $q(\alpha) = 0$ to the modality of the mixture.

Result 1. Let n be the number of solutions of $q(\alpha) = 0$ in the range [0,1]. Then the corresponding mixture will display $\frac{n}{2} + 1$ modes.

We note that both $p(\alpha)$ and $q(\alpha)$ uniquely determine the number of modes. We will use $p(\alpha)$ to show the invariance in the proof of Theorem 5, and later use $q(\alpha)$ to find the number of modes while providing the proofs of other theorems.
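As a quick consistency check of Result 1, consider the special case of a common covariance matrix, $\Sigma_1 = \Sigma_2 = \Sigma$. This worked case is added here for illustration and is not part of the original derivation; the symbol $\delta$ is introduced only for this example. With $\bar{\alpha} = 1 - \alpha$, the matrix $S_\alpha$ no longer depends on $\alpha$ and $p(\alpha)$ collapses to a constant:

```latex
\begin{aligned}
S_\alpha &= \bar{\alpha}\,\Sigma^{-1} + \alpha\,\Sigma^{-1} = \Sigma^{-1}, \\
p(\alpha) &= (\mu_2-\mu_1)'\,\Sigma^{-1}\Sigma\,\Sigma^{-1}\Sigma\,\Sigma^{-1}\Sigma\,\Sigma^{-1}(\mu_2-\mu_1)
           = (\mu_2-\mu_1)'\,\Sigma^{-1}(\mu_2-\mu_1) =: \delta^2, \\
q(\alpha) &= 1 - \alpha\bar{\alpha}\,\delta^2 = 0
\iff \alpha = \tfrac{1}{2}\Bigl(1 \pm \sqrt{1 - 4/\delta^{2}}\Bigr).
\end{aligned}
```

Since $\alpha \bar{\alpha} \le 1/4$ on [0,1], $q$ has no root when $\delta < 2$ (n = 0, one mode) and two distinct roots when $\delta > 2$ (n = 2, so $\frac{n}{2} + 1 = 2$ modes for a suitable mixing proportion). The count therefore never exceeds the number of components, which agrees with the equal-covariance result of Carreira-Perpinan and Williams (2003) cited in Section 1.2.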

3 Invariance of modality under scaling and rotation

Studying the modality of arbitrary normal mixtures directly based on the curvature function $\kappa(\alpha)$ is a very complex undertaking. Instead, in this section we will show that the curvature function, which defines the modal features of a two-component normal mixture, remains unchanged under certain transformations. We will use these transformations to show that the topography of an arbitrary D-dimensional normal mixture can be examined by exploring the topography of a simplified class of normal mixtures, given by the mixture of a spherical normal and a normal with a diagonal covariance matrix. We arrive at this result in two steps described in the following two subsections.

3.1 Invariance of modality under scaling

First we state the theorem that provides the simplification that in D dimensions the modal properties of an arbitrary two-component normal mixture can be fully examined by studying the modality of a mixture of two components, one of which is the standard normal in D dimensions.


Theorem 4. For an arbitrary mixture of two multivariate normals, the modality of $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$ is the same as that of $NM(0, I, \mu_2^*, \Sigma_2^*)_D$, where

$$\mu_2^* = (\Sigma_2^*)^{1/2} \, \Sigma_2^{-1/2} (\mu_2 - \mu_1), \qquad \Sigma_2^* = \Sigma_2^{1/2} \, \Sigma_1^{-1} \, \Sigma_2^{1/2}.$$

Proof. See Appendix.

Remark 1. First note that the above transformation is not equivalent to the regular standardization of the first component alone. Using a regular standardization a single component can be transformed to a standard normal, but the resulting parameters of the second component will lose the symmetry which is crucial for equating the curvature functions of the two mixtures, as detailed in the proof of Theorem 4. Also, note that $\mu_2^*$ and $\Sigma_2^*$ in Theorem 4 are well defined, because the variance matrices $\Sigma_1$ and $\Sigma_2$ are both positive definite.

Note that the two components are interchangeable and the strategy is to scale the whole

mixture by the covariance of the component whose mean is translated to the origin. Before

moving on to the next result, we provide an application of Theorem 4. For easy visualization

we will use contour plots of a two dimensional mixture. This example will also serve the

purpose of providing a geometric intuition of the proof of Theorem 4. First, note that it

is easy to check that geometrically shifting the means of both the components by the same

vector is equivalent to changing the origin of the reference frame of the contour plot. This

implies that the modal features and hence the number of modes remain unchanged after

simple translation. So we concentrate on the changes of the contour plot strictly under the

operation of scaling defined in Theorem 4 by taking µ1 = 0.

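Example 1. Consider the mixture density with the following parameters:

$$\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma_1 = \begin{pmatrix} 3.899 & -4.691 \\ -4.691 & 5.698 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 4 \\ -4 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 1.04 & -0.3 \\ -0.3 & 0.29 \end{pmatrix}.$$

Applying the transformation defined in Theorem 4, the parameters of the two components after scaling are given by:

$$\mu_1^* = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma_1^* = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \mu_2^* = \begin{pmatrix} 4.272 \\ -0.394 \end{pmatrix}, \quad \Sigma_2^* = \begin{pmatrix} 18.80 & 4.743 \\ 4.743 & 1.25 \end{pmatrix}.$$

Figure 1 gives the density contour plots before (left panel) and after (right panel) the transformation. Although the contour shapes and the locations of the modes have changed, the number of modes and the number of saddle points remain unchanged.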
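The scaled covariance in Example 1 can be checked numerically. The sketch below reads Theorem 4 as $\Sigma_2^* = \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2}$ and evaluates it for the printed (rounded) covariance matrices. The helper names are hypothetical, and the closed-form square root used here is specific to 2 × 2 symmetric positive definite matrices; it is not claimed to be the authors' computation.

```python
import math

def mat_mul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_inv(a):
    """Inverse of a 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def spd_sqrt(a):
    """Symmetric square root of a 2x2 symmetric positive definite matrix.

    Uses the closed form sqrt(A) = (A + s I) / t with s = sqrt(det A) and
    t = sqrt(trace A + 2 s), which holds only in the 2x2 case.
    """
    s = math.sqrt(a[0][0] * a[1][1] - a[0][1] * a[1][0])
    t = math.sqrt(a[0][0] + a[1][1] + 2.0 * s)
    return [[(a[0][0] + s) / t, a[0][1] / t],
            [a[1][0] / t, (a[1][1] + s) / t]]

def scaled_covariance(sigma1, sigma2):
    """Sigma_2^* = Sigma_2^{1/2} Sigma_1^{-1} Sigma_2^{1/2}, as read from Theorem 4."""
    root = spd_sqrt(sigma2)
    return mat_mul(mat_mul(root, mat_inv(sigma1)), root)

# Covariance matrices of Example 1, rounded as printed in the text.
sigma1 = [[3.899, -4.691], [-4.691, 5.698]]
sigma2 = [[1.04, -0.3], [-0.3, 0.29]]

sigma2_star = scaled_covariance(sigma1, sigma2)
for row in sigma2_star:
    print([round(v, 3) for v in row])
```

With the rounded inputs the result agrees with the printed $\Sigma_2^*$ to within roughly one percent; small differences are expected because $\Sigma_1$ is nearly singular and therefore sensitive to rounding. The transformed mean is not recomputed in this sketch.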

Note that under the transformation both components are scaled, and in this example the component centered at zero is scaled to have the identity covariance and the covariance of the other component is scaled appropriately. This is easily visible from the contour plots in Figure 1, where the elongated elliptical component in the left panel with the origin as the center is transformed into a spherical component with the same center. Of course the change in means and covariances of the components has changed the location of the three modes, but as the theorem suggests the number of modes is strictly preserved between the mixtures.

Figure 1: Contour plots (axes x and y) for the bivariate normal mixture of Example 1 in (a) the original parameters and (b) the transformed parameters.

The contour plots in Figure 1 are not available unless D = 2, so we provide an alternative graphical display showing the invariance of modes. We compare the ridgeline elevation of the two mixtures in Example 1. Recall that the ridgeline elevation for a two-component mixture is simply the height of the mixture density along the ridgeline manifold defined in (1), but it carries the full modality information for mixtures in any dimension. Figure 2 displays the ridgeline elevation plot before and after the transformation. Again note that though the shapes of the elevation plots differ, the number of up-down oscillations of the curves in the left and right panels of Figure 2 is exactly the same. In both cases the ridgeline elevation plot confirms the presence of three modes.

3.2 Invariance of modality under rotation

By Theorem 4 the topography of any D dimensional mixture can be studied using mixtures

of the form NM(µ1 = 0, Σ1 = I,µ2, Σ2). But uncovering the topography, even when one

component has an arbitrary covariance matrix, is difficult. In this section we seek to provide

a further simplification, which will allow us to find the number of modes of an arbitrary

mixture by studying the modes of another mixture, one component of which is a standard

normal and the other component is a normal with diagonal covariance matrix.

Before we state the result, recall that the maximum number of modes of a two-component normal mixture is uniquely determined by the number of roots between 0 and 1 of $q(\alpha)$ given in (4), and for any mixture $q(\alpha)$ is uniquely defined by $p(\alpha)$. So we will first provide a simplification of the expression for $p(\alpha)$ for mixtures of the form $NM(0, I, \mu_2, \Sigma_2)_D$ and then state the rotation invariance theorem.

Figure 2: Ridgeline function with respect to the arc distance (density against arc length) for the bivariate normal mixture of Example 1 in (a) the original parameters and (b) the transformed parameters.

Result 2. For a mixture of the form $NM(0, I, \mu_2, \Sigma_2)_D$, the term $p(\alpha)$ in (3) can be expressed in terms of the eigenvalues and eigenvectors of $\Sigma_2$ in the following way:

$$p(\alpha) = \sum_{i=1}^{D} \frac{c_i}{[\bar{\alpha}(\lambda_i - 1) + 1]^3}, \tag{5}$$

where $c_i = \lambda_i (\mu_2' \xi_i)^2$, and the $\lambda_i$'s and $\xi_i$'s are the eigenvalues and corresponding eigenvectors of the matrix $\Sigma_2$.

Proof. See Appendix.
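The Appendix proof is not reproduced in this section. The following is only a sketch of why a formula of this shape arises from (3), assuming the spectral decomposition $\Sigma_2 = \sum_i \lambda_i \xi_i \xi_i'$ with orthonormal eigenvectors $\xi_i$ and using $\bar{\alpha} = 1 - \alpha$ as in (1). With $\mu_1 = 0$ and $\Sigma_1 = I$, every matrix in (3) is diagonal in the basis $\{\xi_i\}$:

```latex
\begin{aligned}
S_\alpha &= \bar{\alpha} I + \alpha \Sigma_2^{-1}
          = \sum_{i=1}^{D} \frac{\bar{\alpha}\lambda_i + \alpha}{\lambda_i}\,\xi_i\xi_i', \\
S_\alpha^{-1}\Sigma_2^{-1} &= \sum_{i=1}^{D} \frac{1}{\bar{\alpha}\lambda_i + \alpha}\,\xi_i\xi_i', \\
p(\alpha) &= \mu_2'\,\bigl(S_\alpha^{-1}\Sigma_2^{-1}\bigr)^{2} S_\alpha^{-1}\,\mu_2
           = \sum_{i=1}^{D} \frac{\lambda_i\,(\mu_2'\xi_i)^2}{(\bar{\alpha}\lambda_i + \alpha)^{3}}
           = \sum_{i=1}^{D} \frac{c_i}{[\bar{\alpha}(\lambda_i - 1) + 1]^{3}},
\end{aligned}
```

where the last step uses $\bar{\alpha}\lambda_i + \alpha = \bar{\alpha}(\lambda_i - 1) + 1$. Each denominator stays positive on [0,1] because it moves between 1 and $\lambda_i > 0$.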

We will now state the following property of invariance of mixture modality under rotation.

Theorem 5. The modality of the mixture $NM(0, I, \mu_2, \Sigma_2)_D$ is the same as that of the mixture $NM(0, I, \mu_0, \Lambda)_D$, with $\mu_0^T = (\mu_2' \xi_1, \mu_2' \xi_2, \dots, \mu_2' \xi_D)$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_D)$, where $(\lambda_i, \xi_i)$, $i = 1, \dots, D$, are the eigenvalue-eigenvector pairs of $\Sigma_2$.

Proof. Using $\mu_0$ and $\Lambda$ in Result 2 it is easy to check that the $p(\alpha)$ of the mixtures $NM(0, I, \mu_2, \Sigma_2)_D$ and $NM(0, I, \mu_0, \Lambda)_D$ have the same expression, and hence $q(\alpha)$ has the same number of roots, which implies that the two mixtures have the same modality.


For illustration, we will now apply the rotation described in Theorem 5 to the scaled version

of Example 1 whose first component is a standard normal. Example 1 gives the numerical

values of the parameters after scaling and Figure 3 shows the contour plots of the mixtures

before and after rotation.

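Example 2. (Continuation of Example 1) Applying the rotation transformation described in Theorem 5 to the mixture with parameters

\[
\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad
\Sigma_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\mu_2 = \begin{pmatrix} 4.272 \\ -0.394 \end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix} 18.80 & 4.743 \\ 4.743 & 1.25 \end{pmatrix},
\]

we get the mixture with parameters

\[
\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad
\Sigma_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\mu_0 = \begin{pmatrix} 4.472 \\ -1 \end{pmatrix}, \quad
\Lambda = \begin{pmatrix} 20 & 0 \\ 0 & 0.05 \end{pmatrix}. \tag{6}
\]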

The contour plot in Figure 3(a) depicts the unrotated mixture NM(0, I,µ2, Σ2), whereas Figure 3(b) shows the contours of the rotated mixture NM(0, I,µ0, Λ).

Algebraically the rotation to achieve the diagonal covariance of the second component

is equivalent to using the orthonormal matrix P , whose columns are the eigenvectors of

covariance matrix Σ2, to rotate the random variable. In fact, in two dimensions it has a very

simple interpretation. We simply rotate the mixture contour around the origin (0, 0), such

that the major axis of the ellipse from contour of the second component is parallel to the

x-axis. This will automatically set the minor axis parallel to the y-axis resulting in a diagonal

covariance matrix of the second component (see Figure 3). Note that this rotation does not

affect the covariance matrix of the first component as it remains an identity matrix.
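The rotation can be sketched numerically. The snippet below is a minimal illustration, assuming NumPy is available; it uses the rounded parameter values quoted in Example 2 and makes no claim to reproduce the displayed digits of µ0 and Λ. The variable names (`P`, `mu0`, `Lambda`, `c_original`, `c_rotated`) are chosen here for illustration only.

```python
import numpy as np

# Scaled second-component parameters as quoted (rounded) in Example 2.
mu2 = np.array([4.272, -0.394])
Sigma2 = np.array([[18.80, 4.743],
                   [4.743, 1.25]])

# eigh returns the eigenvalues in ascending order and the
# orthonormal eigenvectors as the columns of P.
lam, P = np.linalg.eigh(Sigma2)

# Rotating by P: the mean becomes P' mu2, the second covariance
# becomes the diagonal matrix P' Sigma2 P, and the identity
# covariance of the first component is left unchanged.
mu0 = P.T @ mu2
Lambda = P.T @ Sigma2 @ P
Sigma1_rotated = P.T @ np.eye(2) @ P

# The coefficients c_i = lambda_i * (mu2' xi_i)^2 of Result 2 are the
# same before and after the rotation, so p(alpha) is unchanged.
c_original = lam * (P.T @ mu2) ** 2
c_rotated = np.diag(Lambda) * mu0 ** 2

print("eigenvalues:", lam)
print("rotated mean:", mu0)
```

Because `c_original` and `c_rotated` coincide, the expression (5) for p(α) is identical for the two parametrizations, which is the content of Theorem 5.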

Finally we combine Theorem 4 and Theorem 5 to state the following corollary.

Corollary 1. The modality of an arbitrary two-component normal mixture is equal to that of a mixture of the form NM(0, I,µ0, Λ), where Λ is diagonal.

Proof. First apply Theorem 4 to scale any mixture to the form NM(0, I,µ, Σ) and then

apply Theorem 5 to rotate it to the form NM(0, I,µ0, Λ).

4 Number of modes of a two-component multivariate normal mixture

In this section we will first focus our attention on exploring the modality of normal mixtures of the simplified form NM(0, I,µ, Λ)D. We restrict ourselves to this small class of mixtures because we have already shown in Section 3 that the modality of any two-component normal mixture is equivalent to the modality of a corresponding mixture of the form NM(0, I,µ, Λ)D.


[Figure 3, panels (a) and (b): contour plots in the (x, y) plane.]

Figure 3: Contour plots for the bivariate normal mixture of Example 2 in (a) before and (b) after rotation.

First we will show that the maximum number of modes is a function of d, the number of

distinct diagonal entries in Λ, by first showing that the maximum number of modes is less

than or equal to (d + 1), and then by showing that the upper bound (d + 1) is achievable. It

is easy to check that d can be equal to the dimension D and thus we arrive at the final result

on the upper bound of the number of modes of an arbitrary D dimensional mixture.

4.1 Upper bound on the number of modes of a two-component normal mixture

Recall that the number of modes can be directly enumerated using the number of solutions of $q(\alpha) = 1 - \alpha(1-\alpha)p(\alpha) = 0$ within the range [0,1]. Using the simplified form of $p(\alpha)$ given in (5) for mixtures of the form $NM(0, I, \mu, \Lambda)_D$ we can write this equation as

\[
q(\alpha) = 1 - \alpha(1-\alpha) \sum_{i=1}^{D} \frac{c_i}{[\alpha(\lambda_i - 1) + 1]^3} = 0,
\]

where the $\lambda_i$'s are the diagonal elements of $\Lambda$ and $c_i = \lambda_i \mu_i^2$.

To find the roots of q(α), we first state the following Lemma.

Lemma 1. The number of solutions of

\[
q(\alpha) = 1 - \alpha(1-\alpha) \sum_{i=1}^{D} \frac{c_i}{[\alpha(\lambda_i - 1) + 1]^3} = 0,
\]

where $\alpha \in [0, 1]$, is exactly equal to the number of non-negative solutions of the equation

\[
q^*(t) = 1 - t(t+1) \sum_{i=1}^{D} \frac{c_i}{(t + \lambda_i)^3} = 0.
\]

Proof. Define $\alpha = \frac{1}{t+1}$; then $t \in [0, \infty)$ corresponds to $\alpha \in (0, 1]$, and it is easy to check that $q(\alpha) = q^*(t)$. The remaining endpoint $\alpha = 0$ is never a solution, since $q(0) = 1$.
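The identity behind the change of variable can be checked term by term. With $\alpha = 1/(t+1)$ we have

\[
1 - \alpha = \frac{t}{t+1}, \qquad
\alpha(1-\alpha) = \frac{t}{(t+1)^2}, \qquad
\alpha(\lambda_i - 1) + 1 = \frac{t + \lambda_i}{t+1},
\]

so that each summand transforms as

\[
\alpha(1-\alpha)\,\frac{c_i}{[\alpha(\lambda_i - 1) + 1]^3}
= \frac{t}{(t+1)^2} \cdot \frac{c_i\,(t+1)^3}{(t+\lambda_i)^3}
= t(t+1)\,\frac{c_i}{(t+\lambda_i)^3}.
\]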

This simple change of variable from α to t allows us to relate the number of modes to

the positive solutions of q∗(t) instead of the more difficult problem of finding solutions in

the restricted interval [0, 1] for q(α). This simplification will enable us to find the upper

bound of the number of modes and also allow us to recursively construct extra modes in

extra dimensions.

We will now use the mixture density given in (2) to illustrate the result in Lemma 1.

Example 3. (Continuation of Examples 1 and 2) After scaling and rotation the modality of Example 1 is equivalent to that of the mixture with parameters

\[
\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad
\Sigma_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\mu_2 = \begin{pmatrix} 4.472 \\ -1 \end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix} 20 & 0 \\ 0 & 0.05 \end{pmatrix}.
\]

For the above mixture

\[
q(\alpha) = 1 - \alpha(1-\alpha) \left[ \frac{400}{(19\alpha + 1)^3} + \frac{0.05}{(-0.95\alpha + 1)^3} \right].
\]

Using the change of variable $\alpha = \frac{1}{t+1}$ we have

\[
q^*(t) = 1 - t(t+1) \left[ \frac{0.05}{(t + 0.05)^3} + \frac{400}{(t + 20)^3} \right].
\]

Solving the equation $q(\alpha) = 0$, the 4 solutions in the range [0,1] are

\[
\alpha_1 = 0.0029474, \quad \alpha_2 = 0.1391142, \quad \alpha_3 = 0.8608858, \quad \alpha_4 = 0.9970526;
\]

while the equation $q^*(t) = 0$ also has 4 non-negative solutions, which are

\[
t_1 = 337.6199, \quad t_2 = 6.189139, \quad t_3 = 0.1615732, \quad t_4 = 0.00296281.
\]

As a visual aid we have also presented the curves $q(\alpha)$ and $q^*(t)$ along with their zero crossings in Figure 4. As we are only interested in the positive solutions of $q^*(t)$, we have changed the axis of $t$ to $\log(t)$ to accommodate the wide range of $t$. In fact the solutions for Example 3 in log scale are symmetric, and they are

\[
\log(t_1) = 5.821, \quad \log(t_2) = 1.822, \quad \log(t_3) = -1.822, \quad \log(t_4) = -5.821.
\]
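The roots in Example 3 can be located numerically. The following is a minimal sketch in plain Python, assuming a simple grid scan followed by bisection; the helper names `bisect_root` and `roots_on_grid`, the grid sizes and the iteration count are choices made here for illustration and are not taken from the paper.

```python
def q(alpha):
    """q(alpha) for Example 3, with c = (400, 0.05) and lambda = (20, 0.05)."""
    s = 400.0 / (19.0 * alpha + 1.0) ** 3 + 0.05 / (1.0 - 0.95 * alpha) ** 3
    return 1.0 - alpha * (1.0 - alpha) * s


def q_star(t):
    """q*(t), obtained from q(alpha) through alpha = 1 / (t + 1)."""
    s = 0.05 / (t + 0.05) ** 3 + 400.0 / (t + 20.0) ** 3
    return 1.0 - t * (t + 1.0) * s


def bisect_root(f, lo, hi, iters=200):
    """Plain bisection on a bracket [lo, hi] over which f changes sign."""
    f_lo_positive = f(lo) > 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def roots_on_grid(f, grid):
    """Return one root for every grid cell in which f changes sign."""
    found = []
    for a, b in zip(grid[:-1], grid[1:]):
        if (f(a) > 0) != (f(b) > 0):
            found.append(bisect_root(f, a, b))
    return found


n = 4000
# Uniform grid on [0, 1] for alpha, logarithmic grid on [1e-4, 1e4] for t.
alpha_grid = [i / n for i in range(n + 1)]
t_grid = [10.0 ** (-4.0 + 8.0 * i / n) for i in range(n + 1)]

alpha_roots = roots_on_grid(q, alpha_grid)
t_roots = roots_on_grid(q_star, t_grid)

print("roots of q(alpha):", alpha_roots)
print("roots of q*(t):  ", sorted(t_roots, reverse=True))
```

A grid scan can miss roots that lie closer together than the grid spacing, so the counts it reports are lower bounds in general; here they are confirmed by the upper bound of Lemma 2.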


Figure 4: Plots for (a) q(α) against α and (b) q∗(t) against log(t) for the mixture given in Example 3.

Now we state the important result relating the number of non-negative solutions of q∗(t) = 0, and hence the number of modes, to the number of unique diagonal entries of Λ, which equals the number of distinct eigenvalues of Σ2.

Lemma 2. Consider mixtures of type NM(0, I,µ, Σ2)D. Suppose Σ2 has d (d ≤ D) distinct

eigenvalues, then irrespective of the value of µ there are at most 2d non-negative solutions

for the corresponding q∗(t) = 0.

Proof. Let the d distinct eigenvalues of Σ2 be λ1, · · · , λd. Let us denote the upper bound of the number of real roots of q∗(t) by O and the lower bound of the number of its negative roots by N. We are interested in finding an upper bound for the number of non-negative roots, i.e., O − N. We will calculate the two bounds in two separate steps. Within each step we will consider two separate cases: one where all the eigenvalues are distinct from 1 and the other where one of the d distinct eigenvalues is equal to 1.

• Step 1. To bound the number of real roots of the rational function $q^*(t)$ from above, we transform it to a polynomial function, whose roots are easier to enumerate.

Case 1: If $\lambda_i \neq 1$ for all $i = 1, \ldots, d$, the multiplier for converting $q^*(t) = 0$ into a polynomial equation is $\prod_{i=1}^{d}(t + \lambda_i)^3$, and as the polynomial $q^*(t)\prod_{i=1}^{d}(t + \lambda_i)^3$ has degree $3d$, we have $O = 3d$.

Case 2: On the other hand, if $\lambda_i = 1$ for one $i \in \{1, \ldots, d\}$, the multiplier for converting $q^*(t) = 0$ into a polynomial equation is $\prod_{i=1}^{d}(t + \lambda_i)^3 / (t + 1)$, and the polynomial $q^*(t)\prod_{i=1}^{d}(t + \lambda_i)^3 / (t + 1)$ has degree $3d - 1$, giving $O = 3d - 1$.

Hence, the equation $q^*(t) = 0$ has at most $O$ real solutions, where

\[
O = \begin{cases}
3d & \text{if } \lambda_i \neq 1 \ \ \forall i \in \{1, \ldots, d\}; \\
3d - 1 & \text{if } \lambda_i = 1 \text{ for one } i \in \{1, \ldots, d\}.
\end{cases} \tag{7}
\]

• Step 2. To find the lower bound on the number of negative roots we first note the following:

\[
q^*(t) = 0
\implies \frac{1}{t(t+1)} = \sum_{i=1}^{D} \frac{c_i}{(t + \lambda_i)^3}
\implies \frac{1}{t} = \frac{1}{t+1} + \sum_{i=1}^{D} \frac{c_i}{(t + \lambda_i)^3}.
\]

Thus the solutions of $q^*(t) = 0$ are the crossings of the two curves $\frac{1}{t}$ and

\[
r(t) = \frac{1}{t+1} + \sum_{i=1}^{D} \frac{c_i}{(t + \lambda_i)^3}
\]

(see Figure 5 for an illustration). Let us denote the right limit of a function $f$ at a point $t$, $\lim_{x \to t^+} f(x)$, by $f(t+)$. Similarly we denote the left limit, $\lim_{x \to t^-} f(x)$, by $f(t-)$. Notice that $r(t)$ is a rational function and $c_i \geq 0$, $\lambda_i > 0$. Thus for each $i = 1, 2, \ldots, d$ we have a vertical asymptote at $t = -\lambda_i$, i.e., $r((-\lambda_i)+) = +\infty$ and $r((-\lambda_i)-) = -\infty$. Additionally we have $r((-1)+) = +\infty$ and $r((-1)-) = -\infty$ (see the dashed lines representing the asymptotes in Figure 5). This implies that $r(t)$ has several disjoint branches. A branch lying between two neighboring vertical asymptotes travels from $+\infty$ to $-\infty$, whereas the curve $1/t$ is finite and continuous on the negative axis, so such a branch has to cross the curve $1/t$ at least once. Now we discuss the two distinct cases.

Case 1: If $\lambda_i \neq 1$ for all $i = 1, \ldots, d$, the graph of $r(t)$ has $d + 1$ vertical asymptotes, one each at $-\lambda_1, \ldots, -\lambda_d$ and $-1$. This gives rise to $d + 2$ disjoint branches, among which the $d$ intermediate branches each have at least one crossing with the curve $\frac{1}{t}$. This gives at least $d$ negative roots of $q^*(t)$, and hence $N = d$.

Case 2: On the other hand, if $\lambda_i = 1$ for one $i \in \{1, \ldots, d\}$, then there are only $(d - 1)$ distinct eigenvalues different from 1, and the graph of $r(t)$ now has $(d + 1)$ branches, among which the $d - 1$ intermediate branches give rise to at least $d - 1$ negative solutions, and hence $N = d - 1$.


Hence, the equation $q^*(t) = 0$ has at least $N$ negative solutions, where

\[
N = \begin{cases}
d & \text{if } \lambda_i \neq 1 \ \ \forall i \in \{1, \ldots, d\}; \\
d - 1 & \text{if } \lambda_i = 1 \text{ for one } i \in \{1, \ldots, d\}.
\end{cases} \tag{8}
\]

Combining (7) and (8) we see that in both cases there can be at most $(O - N) = 2d$ non-negative solutions of the equation $q^*(t) = 0$.
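The two counting steps of the proof can be illustrated on the mixture of Example 3 (d = 2, no eigenvalue equal to 1). The sketch below assumes NumPy and builds the polynomial of Step 1 explicitly with `numpy.poly1d`; the helper name `cube` and the tolerance used to decide whether a computed root is real are choices made here, not part of the paper.

```python
import numpy as np

lam = [0.05, 20.0]   # distinct eigenvalues, none equal to 1 (Case 1)
c = [0.05, 400.0]    # c_i = lambda_i * mu_i**2 for Example 3
d = len(lam)


def cube(l):
    """The polynomial (t + l)^3 as a poly1d object."""
    return np.poly1d([1.0, l]) ** 3


# Multiplier of Step 1: prod_i (t + lambda_i)^3.
full = np.poly1d([1.0])
for l in lam:
    full = full * cube(l)

# sum_i c_i * prod_{j != i} (t + lambda_j)^3.
tail = np.poly1d([0.0])
for i in range(d):
    term = np.poly1d([c[i]])
    for j in range(d):
        if j != i:
            term = term * cube(lam[j])
    tail = tail + term

# q*(t) * prod_i (t + lambda_i)^3, using t(t + 1) = t^2 + t.
poly = full - np.poly1d([1.0, 1.0, 0.0]) * tail

roots = poly.roots
is_real = np.abs(roots.imag) < 1e-8 * (1.0 + np.abs(roots))
real_roots = roots[is_real].real

n_nonneg = int(np.sum(real_roots >= 0))
n_neg = int(np.sum(real_roots < 0))
print("degree:", poly.order, "non-negative roots:", n_nonneg, "negative roots:", n_neg)
```

For this example the polynomial has degree 3d = 6, and the count of non-negative roots can be compared directly with the bound O − N = 2d of Lemma 2.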

[Figure 5: the curves $1/t$ and $r(t)$ plotted against $t$ on the negative axis.]

Figure 5: Plots showing the vertical asymptotes of $r(t) = \frac{1}{t+1} + \frac{1}{(t+2)^3} + \frac{1}{(t+4)^3} + \frac{1}{(t+8)^3} + \frac{1}{(t+9)^3}$ and its crossings with the curve $1/t$.

Finally we state the main theorem of this paper giving us the upper bound on the number

of modes of a mixture of two normal components.

Theorem 6. The number of modes of the normal mixture $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$ is at most $(d + 1)$, where $d$ is the number of distinct eigenvalues of the matrix $\Sigma_2^* = \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2}$, and hence the number of distinct eigenvalues of the matrix ratio of the covariance matrices $\Sigma_2$ and $\Sigma_1$, denoted by $\Sigma_1^{-1}\Sigma_2$.

Proof. By Theorem 4 the modality of the mixture $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$ is the same as that of the mixture $NM(0, I, \mu_2^*, \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2})_D$, where $\mu_2^*$ is a vector of dimension $D$. Now using Lemma 2 we know that the corresponding $q^*(t)$, and hence $q(\alpha)$, will have at most $2d$ roots. Finally, using Result 1 we can show that $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$ has at most $\frac{2d}{2} + 1 = d + 1$ modes.

To show the second part, note that if $\lambda$ is an eigenvalue of the matrix $\Sigma_2^* = \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2}$, then $\lambda$ satisfies the equation $|\Sigma_2^* - \lambda I| = 0$. On the other hand,

\[
|\Sigma_2 \Sigma_1^{-1} - \lambda I|
= |\Sigma_2^{1/2} \Sigma_2^* \Sigma_2^{-1/2} - \lambda I|
= |\Sigma_2^{1/2}| \cdot |\Sigma_2^* - \lambda I| \cdot |\Sigma_2^{-1/2}|
= |\Sigma_2^* - \lambda I|.
\]

Hence, $\lambda$ is an eigenvalue of the matrix $\Sigma_2^*$ if and only if $\lambda$ is an eigenvalue of the matrix $\Sigma_2 \Sigma_1^{-1}$, which implies the second part of the Theorem.

Theorem 7. Any $D$ dimensional normal mixture $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$ has at most $D + 1$ modes.

Proof. $\Sigma_2^* = \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2}$ has $D$ eigenvalues, hence $d \leq D$. Using this inequality in Theorem 6 completes the proof.
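In practice the bound of Theorem 6 can be read off from the two covariance matrices alone. The following is a rough sketch, assuming NumPy; the function name `mode_bound` and the relative tolerance `rtol` used to decide when two computed eigenvalues count as equal are hypothetical choices made here, and the tolerance in particular would need care with real data.

```python
import numpy as np


def mode_bound(Sigma1, Sigma2, rtol=1e-8):
    """Return (d, d + 1) for the covariance ratio Sigma1^{-1} Sigma2.

    d is the number of numerically distinct eigenvalues and d + 1 the
    upper bound on the number of modes stated in Theorem 6.
    """
    # Sigma1^{-1} Sigma2 is similar to a symmetric positive definite
    # matrix, so its eigenvalues are real and positive; only round-off
    # imaginary parts are discarded here.
    ratio = np.linalg.solve(Sigma1, Sigma2)
    eig = np.sort(np.linalg.eigvals(ratio).real)
    # Count the eigenvalues that differ from their predecessor by more
    # than the relative tolerance.
    d = 1 + int(np.sum(np.diff(eig) > rtol * eig[1:]))
    return d, d + 1


# Covariances of the three-dimensional mixture in (9).
Sigma1 = np.diag([1.0, 1.0, 0.05])
Sigma2 = np.diag([0.05, 1.0, 1.0])
bound_example4 = mode_bound(Sigma1, Sigma2)

# A proportional pair, Sigma2 = 3 * Sigma1, with an arbitrary Sigma1.
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])
bound_proportional = mode_bound(S, 3.0 * S)

print(bound_example4, bound_proportional)
```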

4.2 Existence of D + 1 modes in D dimensions

In this subsection we will show that it is always possible to find a mixture in any dimension

which will attain D + 1 modes. First we provide two examples for D=2 and D=3 where the

upper bound is achieved.

Remark 2. Example 1, with D = 2 and eigenvalues 20 and 0.05, achieves the upper bound on the number of modes for a two-dimensional mixture.

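Example 4. Consider the three dimensional example with 4 modes given in Ray and Lindsay (2005), with the parameters being

\[
\mu_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad
\Sigma_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0.05 \end{pmatrix}, \quad
\mu_2 = \begin{pmatrix} 1/\sqrt{2} \\ 2 \\ 1/\sqrt{2} \end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix} 0.05 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \tag{9}
\]

A straightforward calculation based on Theorem 4 shows that $\Sigma_2^*$ has eigenvalues 0.05, 1 and 20, i.e., $D = d = 3$. This mixture density has 4 modes, which again achieves the upper bound $(D + 1)$.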

Though we have come up with examples achieving the upper bound in two and three dimensions, it is not easy to come up with such pathological examples in higher dimensions. Hence we will design a construction method which allows one to construct one extra mode from each additional dimension. Starting from the fact that one can construct a mixture with two modes in one dimension (or using the examples in D=2 and D=3), one can use the recursive relation to construct the parameters of a mixture in D dimensions which will have D + 1 modes.

Recall that Theorem 6 shows that in D dimensions the equation q∗(t) = 0, can have at

most 2D non-negative solutions, which in turn implies that the corresponding mixture can

achieve at most D + 1 modes. Therefore, to achieve one extra mode in D + 1 dimensions

we just need to choose the parameters of the mixture such that the corresponding q∗(t) = 0

achieves two extra non-negative solutions. The following Lemma provides the construction

method to find the two extra solutions of q∗(t) = 0 starting from any dimension D.

Lemma 3. Let $\{(c_i, \lambda_i), i = 1, 2, \ldots, D\}$ be such that the equation

\[
y(t, D) = 1 - t(t+1) \sum_{i=1}^{D} \frac{c_i}{(t + \lambda_i)^3} = 0
\]

has $2D$ non-negative solutions. Then one can always find a pair of scalars $(c_{D+1}, \lambda_{D+1})$ such that

\[
y(t, D+1) = 1 - t(t+1) \sum_{i=1}^{D+1} \frac{c_i}{(t + \lambda_i)^3} = 0
\]

has $2D + 2$ non-negative solutions.

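Proof. Note that $y(t, D) = 0$ is the same equation as $q^*(t) = 0$ in $D$ dimensions. For $t > 0$ it is convenient to divide by $t(t+1)$ and, with a slight abuse of notation, to write

\[
y(t, D) = \frac{1}{t(t+1)} - \sum_{i=1}^{D} \frac{c_i}{(t + \lambda_i)^3},
\]

which has the same zeros and the same sign as the original expression on the positive axis. This is the form used in the rest of the proof and in Example 5.

Since $y(t, D) = 0$ has $2D$ non-negative solutions, and $y(t, D)$ is positive both near $t = 0$ and for large $t$, $y(t, D)$ changes sign $2D$ times on the positive axis of $t$. Let $y(t, D)$ be positive at points $t_0, t_2, \ldots, t_{2D} = a$, and negative at points $t_1, t_3, \ldots, t_{2D-1}$, such that

\[
0 < t_0 < t_1 < t_2 < \cdots < t_{2D-1} < t_{2D} = a.
\]

First we choose $y_0 > 0$ such that $y_0 (a + \lambda)^3 < y(t_j, D)(t_j + \lambda)^3$ for $j$ even, and for all eigenvalues $\lambda > 0$. It can be verified that such a $y_0$ always exists.

Then we choose $t_{2D+1} > a$ such that $\frac{1}{t_{2D+1}(t_{2D+1} + 1)} < \frac{y_0}{8}$, and then we choose $\lambda_{D+1} > \max\{\lambda_1, \ldots, \lambda_D\}$ such that $\frac{t_{2D+1} + \lambda_{D+1}}{a + \lambda_{D+1}} < 2$, which will ensure that

\[
\frac{(t_{2D+1} + \lambda_{D+1})^3}{t_{2D+1}(t_{2D+1} + 1)(a + \lambda_{D+1})^3} < y_0. \tag{10}
\]

Now define $c_{D+1} = y_0 (a + \lambda_{D+1})^3$.

With the chosen pair $(c_{D+1}, \lambda_{D+1})$ we have

\[
Y(t_j) = y(t_j, D) - \frac{c_{D+1}}{(t_j + \lambda_{D+1})^3}
\begin{cases}
> 0, & \text{for } j \text{ even}; \\
< 0, & \text{for } j \text{ odd}.
\end{cases}
\]

For $j$ even this is the defining inequality of $y_0$; for $j$ odd a positive quantity is subtracted from a negative one. That is, $Y(t) = y(t, D) - \frac{c_{D+1}}{(t + \lambda_{D+1})^3}$ has the same sign as $y(t, D)$ at the points $t_0, t_1, \ldots, t_{2D}$, which means that $Y(t) = 0$ has $2D$ positive solutions which are all less than $a = t_{2D}$.

On the other hand, we have

\[
\begin{aligned}
Y(t_{2D+1}) &= y(t_{2D+1}, D) - \frac{c_{D+1}}{(t_{2D+1} + \lambda_{D+1})^3} \\
&< \frac{1}{t_{2D+1}(t_{2D+1} + 1)} - \frac{c_{D+1}}{(t_{2D+1} + \lambda_{D+1})^3} \\
&= \frac{1}{t_{2D+1}(t_{2D+1} + 1)} - \frac{y_0 (a + \lambda_{D+1})^3}{(t_{2D+1} + \lambda_{D+1})^3} < 0,
\end{aligned}
\]

where the last inequality holds because of the inequality (10). Hence $Y(t)$ is negative at the point $t_{2D+1} > a$, while $Y(t) > 0$ for all sufficiently large $t$ (since $t(t+1)Y(t) \to 1$ as $t \to \infty$). So $Y(t) = y(t, D+1) = 0$ has two more solutions than $y(t, D) = 0$, both of which are greater than $a$.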

Remark 3. Note that the proof of the above lemma provides only one method of constructing the two extra non-negative solutions. These solutions are not unique.

The following corollary provides the recursive construction method for constructing extra

modes when the dimension of mixture is increased by unity.

Corollary 2. If a mixture of two normals in $D$ dimensions has $D + 1$ modes, one can choose the parameters of the extra dimension such that the resulting $(D + 1)$-dimensional normal mixture will have $D + 2$ modes.

Proof. Use Theorems 4 and 5 to re-parametrize the mixture to the form $NM(0, I, \mu, \Lambda)_D$, where $\mu = (\mu_1, \ldots, \mu_D)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_D)$, and then use Lemma 3 with $c_i = \lambda_i \mu_i^2$ to compute $(c_{D+1}, \lambda_{D+1})$. The new mixture

\[
NM\big(0, I, \mu = (\mu_1, \ldots, \mu_D, \mu_{D+1}), \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_D, \lambda_{D+1})\big)_{D+1},
\]

with $\mu_{D+1} = \sqrt{c_{D+1} / \lambda_{D+1}}$, will have $D + 2$ modes.

We now apply the method described in Corollary 2 to construct a 4-dimensional example

with 5 modes, starting from the 3-dimensional case in Example 4.

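Example 5. We first apply Theorem 3 to transform the 3-dimensional normal mixture given in (9) into the form $NM(0, I, \mu_2, \Lambda)_{D=3}$, where

\[
\mu_2 = \begin{pmatrix} 1/\sqrt{2} \\ 2 \\ \sqrt{10} \end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix} 0.05 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 20 \end{pmatrix}. \tag{11}
\]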


$\Sigma_2$ has $d = 3$ eigenvalues: $\lambda_1 = 0.05$, $\lambda_2 = 1$, $\lambda_3 = 20$, with corresponding $c_i$'s given by $c_1 = 0.025$, $c_2 = 4$, $c_3 = 200$.

Note that the equation

\[
q^*(t) = y(t, 3) = \frac{1}{t(t+1)} - \sum_{i=1}^{3} \frac{c_i}{(t + \lambda_i)^3} = 0
\]

has 6 positive solutions: 0.00723058, 0.148304, 0.444807, 2.24817, 6.74291 and 138.301.

Now we take $0 < t_0 = 0.005 < t_1 = 0.1 < t_2 = 0.3 < t_3 = 1 < t_4 = 3 < t_5 = 30 < t_6 = 200 = a$ such that $y(t)$ is positive at the points $t_0, t_2, t_4, t_6$, and negative at the points $t_1, t_3, t_5$.

Now choose $y_0 = 7 \times 10^{-10}$; then $y_0 (a + \lambda)^3 < y(t_j)(t_j + \lambda)^3$ for all even $j$ and eigenvalues $\lambda$. Now take $t_7 = 107000 > a = 200$ such that $\frac{1}{t_7(t_7 + 1)} < \frac{y_0}{8}$.

Let $\lambda_4 = 120000$; then $\frac{t_7 + \lambda_4}{a + \lambda_4} < 2$. Let $c_4 = y_0 (a + \lambda_4)^3 = 1215658$, i.e., the last component of the new 4-dimensional mean is $\mu_4 = \sqrt{c_4 / \lambda_4} = 3.182842$.

This gives a 4-dimensional normal mixture $NM(0, I, \mu_2^{new}, \Sigma_2^{new})_{D=4}$, with

\[
\mu_2^{new} = \begin{pmatrix} 1/\sqrt{2} \\ 2 \\ \sqrt{10} \\ 3.182842 \end{pmatrix}, \quad
\Sigma_2^{new} = \begin{pmatrix} 0.05 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 20 & 0 \\ 0 & 0 & 0 & 120000 \end{pmatrix}.
\]

The corresponding equation

\[
q^*(t) = 1 - t(t+1) \sum_{i=1}^{4} \frac{c_i}{(t + \lambda_i)^3} = 0
\]

has eight positive solutions, as follows: 0.00723058, 0.148304, 0.444807, 2.24817, 6.74291, 138.304, 82616.8 and 799211, which implies the existence of five modes.

Figure 6 shows the q∗(t) for the four dimensional example along with the eight non-

negative zero crossings. Among the eight crossings the two on the right are obtained using

the construction method in Corollary 2.
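The construction of Example 5 can be replayed in a few lines. The sketch below is plain Python and only re-uses the numerical choices quoted in the example (a, y0, t7, λ4); the helper names `y` and `count_sign_changes` and the logarithmic grid are assumptions made here for illustration. Sign changes on a grid give a lower bound on the number of positive roots, which Lemma 2 complements with the upper bound 2d.

```python
def y(t, c, lam):
    """y(t, D) = 1 - t(t + 1) * sum_i c_i / (t + lambda_i)^3."""
    return 1.0 - t * (t + 1.0) * sum(ci / (t + li) ** 3 for ci, li in zip(c, lam))


def count_sign_changes(f, lo_exp=-4.0, hi_exp=8.0, n=24000):
    """Count sign changes of f on a log-spaced grid from 10**lo_exp to 10**hi_exp."""
    changes = 0
    previous = f(10.0 ** lo_exp) > 0
    for i in range(1, n + 1):
        current = f(10.0 ** (lo_exp + (hi_exp - lo_exp) * i / n)) > 0
        if current != previous:
            changes += 1
        previous = current
    return changes


# Three-dimensional starting point of Example 5.
c3 = [0.025, 4.0, 200.0]
lam3 = [0.05, 1.0, 20.0]

# Numerical choices quoted in Example 5.
a, y0, t7, lam4 = 200.0, 7e-10, 107000.0, 120000.0

# The two conditions imposed on t_{2D+1} and lambda_{D+1} in Lemma 3.
cond_t = 1.0 / (t7 * (t7 + 1.0)) < y0 / 8.0
cond_lam = lam4 > max(lam3) and (t7 + lam4) / (a + lam4) < 2.0

c4 = y0 * (a + lam4) ** 3      # new coefficient c_{D+1}
mu4 = (c4 / lam4) ** 0.5       # new mean coordinate sqrt(c_{D+1} / lambda_{D+1})

roots_3d = count_sign_changes(lambda t: y(t, c3, lam3))
roots_4d = count_sign_changes(lambda t: y(t, c3 + [c4], lam3 + [lam4]))

print("conditions hold:", cond_t, cond_lam)
print("c4 =", c4, "mu4 =", mu4)
print("sign changes in 3 and 4 dimensions:", roots_3d, roots_4d)
```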

Remark 4. The construction process in Lemma 3 is designed to add two more positive so-

lutions to equation q∗(t) = 0, when the dimension is increased, by adding another term in

the summation, without perturbing the original non-negative solutions too much. In Exam-

ple 5 we started with six roots in three dimensions and constructed two extra roots in four

dimensions. Among the six roots the first five remained exactly the same as the original ones

(according to our precision), and the sixth one is only shifted by a small magnitude (0.001).


Figure 6: Plot of q∗(t), which has eight positive roots, along with the zero crossings. Here q∗(t) is plotted with respect to log(t) because of the large range of t.

Finally we arrive at the main theorem of the paper, Theorem 1, which proves the tightness of the bound given in Theorem 7, using the following argument.

Proof of Theorem 1. The upper bound has already been shown in Theorem 7. To show that

this bound can be achieved we show the construction of mixtures with D + 1 modes in any

dimension. In one dimension two normals with equal variance will have two modes if the

distance between their means is more than two times the common standard deviation. Now

one can use Corollary 2 repeatedly to construct one extra mode per dimension resulting in

exactly D + 1 modes in D dimensions.
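The one-dimensional starting point of the recursion is easy to check numerically. The sketch below is plain Python and assumes equal mixing weights, an assumption made here for illustration since the proof does not state the weights; the function names `mixture_density` and `count_modes` and the grid resolution are likewise hypothetical choices.

```python
import math


def mixture_density(x, mean1, mean2, sd, weight=0.5):
    """Density of weight * N(mean1, sd^2) + (1 - weight) * N(mean2, sd^2) at x."""
    def phi(m):
        z = (x - m) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return weight * phi(mean1) + (1.0 - weight) * phi(mean2)


def count_modes(mean1, mean2, sd, n=20001):
    """Count local maxima of the equal-weight mixture density on a fine grid."""
    lo = min(mean1, mean2) - 4.0 * sd
    hi = max(mean1, mean2) + 4.0 * sd
    f = [mixture_density(lo + (hi - lo) * i / (n - 1), mean1, mean2, sd)
         for i in range(n)]
    # A grid point is a local maximum if the density rises into it and
    # does not rise out of it (ties on the right are counted once).
    return sum(1 for i in range(1, n - 1) if f[i - 1] < f[i] >= f[i + 1])


# Means three standard deviations apart, and one and a half apart.
modes_far = count_modes(0.0, 3.0, 1.0)
modes_near = count_modes(0.0, 1.5, 1.0)
print(modes_far, modes_near)
```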

4.3 Special Cases

The result given in Theorem 1 is the most general modality theorem available for a two-

component normal mixture. Many previous modality results can be stated as special cases of

this generalized result. In the corollaries which follow we show that our modality result can

be used to duplicate some of the univariate and multivariate results found in the literature.

The study of the case D = 1, i.e., the mixture of two univariate normals, can be traced back to the early 20th century. For example, Helguero (1904) discussed the equal variance case, and Robertson and Fryer (1969) discussed the unequal variance case; both showed that there exist at most 2 modes for the univariate normal mixture. Note that in one dimension the two variances are always either equal or proportional to one another, and our result also shows that at most two modes are achievable. Some results on the mixture of two higher-dimensional normals with equal or proportional variances were developed later. A recent result from Ray and Lindsay (2005) shows that for any dimension, a two-component normal mixture with proportional variances can have at most two modes. Our result confirms the result from Ray and Lindsay (2005), although with a different methodology.

Corollary 3. In any dimension the mixture of two normal components with equal or proportional variance ($\Sigma_2 = c\Sigma_1$ for a scalar $c > 0$) can have at most two modes.

Proof. By Theorem 6 the maximum number of modes is one more than the number of distinct eigenvalues, $d$, of $\Sigma_2^* = \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2}$. For the equal or proportional case

\[
\Sigma_2^* = \begin{cases}
I & \text{if } \Sigma_2 = \Sigma_1; \\
cI & \text{if } \Sigma_2 = c\Sigma_1.
\end{cases}
\]

In both cases all the eigenvalues are equal, so $d = 1$ and the mixture can have at most two modes.
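The form of $\Sigma_2^*$ in the proportional case follows in one line from its definition; the step is spelled out here for completeness, using only $\Sigma_2 = c\Sigma_1$:

\[
\Sigma_1^{-1} = c\,\Sigma_2^{-1}
\quad\Longrightarrow\quad
\Sigma_2^* = \Sigma_2^{1/2}\,\big(c\,\Sigma_2^{-1}\big)\,\Sigma_2^{1/2} = c\,I .
\]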

Now we discuss some of the examples stated in Ray and Lindsay (2005). Both the two

dimensional example with three modes with parameters given in Example 3 and the three

dimensional example in Example 4 with four modes were stated earlier as mere examples of

existence of more than two modes. But our results show that they actually achieve the upper

bound possible within their respective dimensions. Moreover the construction method of the

examples in Ray and Lindsay (2005) was not easily generalizable in higher dimensions, but

our construction algorithm described in Lemma 3 provides an easy strategy for constructing

such examples.

5 Conclusion and discussion

In this paper we have developed a powerful theory for understanding the topography of a

multivariate normal mixture model. The results on the upper bound are mainly focused on

the two-component case, where we can provide the clear upper bound of D + 1 for any D

dimensional normal mixtures. Moreover, for any dimension one can produce a mixture which

attains the upper bound. In this paper, we have also verified that the number of modes for a two-component D-dimensional normal mixture $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$ is bounded above by one more than the number of distinct eigenvalues of the ratio matrix $\Sigma_2 \Sigma_1^{-1}$, irrespective of the means.

In the process of doing this analysis, we have not discussed how these new bounds and

construction methods might be used for statistical purposes. We think that there is a wide

area of application for these results. Given a parameter structure one can easily estimate

the upper bound of the number of modes, which might be of enormous help for many clustering methods. The construction method might become handy for Bayesian prior elicitation.


Finally the results give us a clear understanding of the interplay of component means and

variances in shaping up the topography of mixtures which may be easily generalizable to

mixtures of other distributions.

We also note that there are still a number of open mathematical questions. For example, mixtures of T-distributions are often used as a robust alternative to mixtures of normals, but there are no available results on the number of modes of a mixture of T's. One should note that the contours of the T and the normal, which determine the number of modes, display very similar topographical structure, and so one might be able to borrow the results on the topography of normals for exploring the topography of T mixtures. In fact, using this intuition one can then easily generalize the results to any elliptical distribution.

Finally, our results on upper bound are mainly derived for K = 2. It would therefore

be useful to establish relationships between the modality structure of the pairs of densities

in a mixture and the overall modality of the entire mixture of K > 2 components. This

generalization becomes challenging even when K = 3 resulting in the ridgeline manifold of

two dimensions which may involve finding the roots of an equation of two variables.

Acknowledgments: We thank Dr. David Fried of the Department of Mathematics and

Statistics at Boston University for his assistance in solving the algebraic problems for this

paper.

A Proof of Theorems and Results

A.1 Proof of Theorem 4

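We only need to check that the function $p(\alpha)$ is the same for the two mixtures $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$ and $NM(0, I, \mu_2^*, \Sigma_2^*)_D$. Throughout, $\bar\alpha = 1 - \alpha$.

First note that

\[
S_\alpha = \alpha \Sigma_1^{-1} + \bar\alpha \Sigma_2^{-1}
= \Sigma_2^{-1/2} \big( \alpha \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2} + \bar\alpha I \big) \Sigma_2^{-1/2}.
\]

Thus

\[
S_\alpha^{-1} = \Sigma_2^{1/2} \big( \alpha \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2} + \bar\alpha I \big)^{-1} \Sigma_2^{1/2},
\]

which implies

\[
\Sigma_2^{-1/2} S_\alpha^{-1} \Sigma_2^{-1/2} = \big( \alpha \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2} + \bar\alpha I \big)^{-1}.
\]

Now for the mixture $NM(\mu_1, \Sigma_1, \mu_2, \Sigma_2)_D$,

\[
\begin{aligned}
p(\alpha) &= (\mu_2 - \mu_1)' \Sigma_1^{-1} S_\alpha^{-1} \Sigma_2^{-1} S_\alpha^{-1} \Sigma_2^{-1} S_\alpha^{-1} \Sigma_1^{-1} (\mu_2 - \mu_1) \\
&= (\mu_2 - \mu_1)' \Sigma_1^{-1} \Sigma_2^{1/2} \big( \Sigma_2^{-1/2} S_\alpha^{-1} \Sigma_2^{-1/2} \big)^3 \Sigma_2^{1/2} \Sigma_1^{-1} (\mu_2 - \mu_1) \\
&= (\mu_2 - \mu_1)' \Sigma_1^{-1} \Sigma_2^{1/2} \big( \alpha \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2} + \bar\alpha I \big)^{-3} \Sigma_2^{1/2} \Sigma_1^{-1} (\mu_2 - \mu_1).
\end{aligned}
\]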


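For the transformed mixture $NM(0, I, \mu_2^*, \Sigma_2^*)_D$, writing $\bar\alpha = 1 - \alpha$ and substituting $\mu_2^* = (\Sigma_2^*)^{1/2} \Sigma_2^{-1/2} (\mu_2 - \mu_1)$ and $\Sigma_2^* = \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2}$,

\[
\begin{aligned}
p^*(\alpha) &= (\mu_2^*)' (\Sigma_2^*)^{1/2} \big( \alpha \Sigma_2^* + \bar\alpha I \big)^{-3} (\Sigma_2^*)^{1/2} \mu_2^* \\
&= (\mu_2 - \mu_1)' \Sigma_1^{-1} \Sigma_2^{1/2} \big( \alpha \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2} + \bar\alpha I \big)^{-3} \Sigma_2^{1/2} \Sigma_1^{-1} (\mu_2 - \mu_1) \\
&= p(\alpha).
\end{aligned}
\]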

A.2 Proof of Result 2

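Proof. Let $(\lambda_i, \xi_i)$, $i = 1, \ldots, D$, be the eigenvalue, eigenvector pairs of $\Sigma_2$. As $\Sigma_2$ is a positive definite matrix, all $\lambda_i > 0$. Then the matrix $(\alpha \Sigma_2 + \bar\alpha I)$, with $\bar\alpha = 1 - \alpha$, has eigenvalue, eigenvector pairs given by $(\gamma_i, \xi_i)$, where $\gamma_i = \alpha \lambda_i + \bar\alpha = \alpha(\lambda_i - 1) + 1 > 0$. Similarly $(\alpha \Sigma_2 + \bar\alpha I)^{-1}$ has eigenvectors $\xi_i$ with corresponding eigenvalues $1/\gamma_i$, and using the spectral decomposition of matrices we can write

\[
(\alpha \Sigma_2 + \bar\alpha I)^{-1} = \sum_{i=1}^{D} \frac{1}{\gamma_i} \xi_i \xi_i'.
\]

Moreover, as the eigenvectors $\xi_i$ are orthonormal and $(\alpha \Sigma_2 + \bar\alpha I)$ is symmetric, we have

\[
\begin{aligned}
p(\alpha) &= \mu_2' \Sigma_2^{1/2} (\alpha \Sigma_2 + \bar\alpha I)^{-3} \Sigma_2^{1/2} \mu_2 \\
&= \mu_2' \Sigma_2^{1/2} \left\{ \sum_{i=1}^{D} \frac{1}{\gamma_i} \xi_i \xi_i' \right\}^3 \Sigma_2^{1/2} \mu_2 \\
&= \mu_2' \Sigma_2^{1/2} \left( \sum_{i=1}^{D} \frac{1}{\gamma_i^3} \xi_i \xi_i' \right) \Sigma_2^{1/2} \mu_2 \\
&= \sum_{i=1}^{D} \frac{1}{\gamma_i^3} (\mu_2' \Sigma_2^{1/2} \xi_i)^2 \\
&= \sum_{i=1}^{D} \frac{1}{[\alpha(\lambda_i - 1) + 1]^3} (\mu_2' \Sigma_2^{1/2} \xi_i)^2 \\
&= \sum_{i=1}^{D} \frac{c_i}{[\alpha(\lambda_i - 1) + 1]^3},
\end{aligned}
\]

where $c_i = (\mu_2' \Sigma_2^{1/2} \xi_i)^2 = \lambda_i (\mu_2' \xi_i)^2$, as $\Sigma_2^{1/2}$ has eigenvalues $\sqrt{\lambda_i}$ with corresponding eigenvectors $\xi_i$.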
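Result 2 can also be checked numerically by comparing the matrix expression for p(α) with the eigenvalue sum in (5). The sketch below assumes NumPy and uses a randomly generated positive definite matrix and mean vector, which are hypothetical test inputs chosen here and not values from the paper; the function names `p_sum` and `p_matrix` are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
A = rng.normal(size=(D, D))
Sigma2 = A @ A.T + D * np.eye(D)   # hypothetical positive definite covariance
mu2 = rng.normal(size=D)           # hypothetical mean vector

# Eigenvalues and orthonormal eigenvectors (columns of xi) of Sigma2.
lam, xi = np.linalg.eigh(Sigma2)
c = lam * (xi.T @ mu2) ** 2        # c_i = lambda_i * (mu2' xi_i)^2


def p_sum(alpha):
    """Right-hand side of (5): sum_i c_i / [alpha (lambda_i - 1) + 1]^3."""
    return float(np.sum(c / (alpha * (lam - 1.0) + 1.0) ** 3))


def p_matrix(alpha):
    """mu2' Sigma2^{1/2} (alpha Sigma2 + (1 - alpha) I)^{-3} Sigma2^{1/2} mu2."""
    root = xi @ np.diag(np.sqrt(lam)) @ xi.T   # symmetric square root of Sigma2
    M = np.linalg.inv(alpha * Sigma2 + (1.0 - alpha) * np.eye(D))
    v = root @ mu2
    return float(v @ M @ M @ M @ v)


for alpha in (0.0, 0.3, 0.7, 1.0):
    print(alpha, p_sum(alpha), p_matrix(alpha))
```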

References

J. Behboodian. On the modes of a mixture of two normal distributions. Technometrics, 12:131–139, 1970.

J. O. Berger. Statistical decision theory and Bayesian analysis. Springer Series in Statistics. Springer-Verlag,

New York, second edition, 1985. ISBN 0-387-96098-8.


M. A. Carreira-Perpinan and C. K. I. Williams. On the number of modes of a Gaussian mixture. In Scale-Space Methods in Computer Vision, Lecture Notes in Computer Science, volume 2695, pages 625–640. Springer-Verlag, 2003.

J. Chen and X. Tan. Inference for multivariate normal mixtures. J. Multivariate Anal.,

100(7):1367–1383, 2009. ISSN 0047-259X. doi: 10.1016/j.jmva.2008.12.005. URL

http://dx.doi.org/10.1016/j.jmva.2008.12.005.

P. Coretto and C. Hennig. A simulation study to compare robust clustering methods based on mixtures.

Advances in Data Analysis and Classification, June 2010. ISSN 1862-5347. doi: 10.1007/s11634-010-0065-4.

URL http://dx.doi.org/10.1007/s11634-010-0065-4.

J. Dannemann and H. Holzmann. Likelihood ratio testing for hidden Markov models under non-standard

conditions. Scand. J. Statist., 35(2):309–321, 2008. ISSN 0303-6898. doi: 10.1111/j.1467-9469.2007.00587.x.

URL http://dx.doi.org/10.1111/j.1467-9469.2007.00587.x.

I. Eisenberger. Genesis of bimodal distributions. Technometrics, 6:357–363, 1964.

S. Fruhwirth-Schnatter. Finite mixture and Markov switching models. Springer Series in Statistics. Springer,

New York, 2006. ISBN 978-0-387-32909-3; 0-387-32909-9.

F. d. Helguero. Sui massimi delle curve dimorfiche. Biometrika, 3:85–98, 1904.

C. Hennig. Ridgeline plot and clusterwise stability as tools for merging Gaussian mixture components. Classification as a tool for research. Springer, Berlin, accepted for publication, 2010a.

C. Hennig. Methods for merging gaussian mixture components. Advances in Data Analysis

and Classification, January 2010b. ISSN 1862-5347. doi: 10.1007/s11634-010-0058-3. URL

http://dx.doi.org/10.1007/s11634-010-0058-3.

H. Holzmann and S. Vollmer. A likelihood ratio test for bimodality in two-component mixtures with application

to regional income distribution in the EU. Advances in Statistical Analysis, 92(1):57–69, 2008.

I. Kakiuchi. Unimodality conditions of the distribution of a mixture of two distributions. Kobe University Mathematics Seminar Notes, 9:315–325, 1981.

J. H. B. Kemperman. Mixtures with a limited number of modal intervals. The Annals of Statistics, 19:

2120–2144, 1991.

E. L. Lehmann and G. Casella. Theory of point estimation. Springer Texts in Statistics. Springer-Verlag, New

York, second edition, 1998. ISBN 0-387-98502-6.

J. Li, S. Ray, and B. G. Lindsay. A nonparametric statistical approach to clustering via mode identification.

J. Mach. Learn. Res., 8:1687–1723 (electronic), 2007. ISSN 1532-4435.

W. Li. A study of an active approach to speaker and task adaptation based on automatic analysis of vocabulary

confusability. PhD thesis, The University of Hong Kong, 2007.

B. G. Lindsay. The geometry of mixture likelihoods, Part II: The exponential family. The Annals of Statistics,

11:783–792, 1983.

B. G. Lindsay, M. Markatou, S. Ray, K. Yang, and S.-C. Chen. Quadratic distances on probabilities: a unified

foundation. Ann. Statist., 36(2):983–1006, 2008. ISSN 0090-5364.


G. McLachlan and D. Peel. Finite mixture models. Wiley Series in Probability and Statistics: Applied Probability and Statistics. Wiley-Interscience, New York, 2000. ISBN 0-471-00626-2. doi: 10.1002/0471721182. URL http://dx.doi.org/10.1002/0471721182.

V. Melnykov and R. Maitra. Finite mixture models and model-based clustering. Statistics Surveys, 4:80–116,

2010.

S. Ray and B. G. Lindsay. The topography of multivariate normal mixtures. Ann.

Statist., 33(5):2042–2065, 2005. ISSN 0090-5364. doi: 10.1214/009053605000000417. URL

http://dx.doi.org/10.1214/009053605000000417.

S. Ray and B. G. Lindsay. Model selection in high dimensions: a quadratic-risk-based approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):95–118, 2008. doi: 10.1111/j.1467-9868.2007.00623.x. URL http://www.blackwell-synergy.com/doi/abs/10.1111/j.1467-9868.2007.00623.x.

C. A. Robertson and J. G. Fryer. Some descriptive properties of normal mixtures. Skandinavisk Aktuarietidskrift, 69:137–146, 1969.

A. Scott et al. A POMDP framework for coordinated guidance of autonomous UAVs for multitarget tracking.

EURASIP Journal on Advances in Signal Processing, 2009, 2009.

G. Sfikas, C. Constantinopoulos, A. Likas, and N. Galatsanos. An analytic distance metric for Gaussian

mixture models with application in image retrieval. Artificial Neural Networks: Formal Models and Their

Applications-ICANN 2005, pages 835–840, 2005.

D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical analysis of finite mixture distributions.

Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley &

Sons Ltd., Chichester, 1985. ISBN 0-471-90763-4.

Surajit Ray

Department of Mathematics and Statistics

Boston University

111 Cummington Street, Boston, MA 02215, USA

E-mail: [email protected]

Dan Ren

Department of Mathematics and Statistics

Boston University

111 Cummington Street, Boston, MA 02215, USA

E-mail: [email protected]
