
Learning mixtures by simplifying kernel density estimators

Olivier Schwander⋆ and Frank Nielsen⋆†

⋆ Laboratoire d'Informatique, École Polytechnique, Palaiseau, France
† Sony Computer Science Laboratories Inc., Tokyo, Japan

{schwander,nielsen}@lix.polytechnique.fr

Abstract. Gaussian mixture models are a widespread tool for modeling various and complex probability density functions. They can be estimated by various means, often using Expectation-Maximization or Kernel Density Estimation. In addition to these well known algorithms, new and promising stochastic modeling methods include Dirichlet Process mixtures and k-Maximum Likelihood Estimators. Most of the methods, including Expectation-Maximization, lead to compact models but may be expensive to compute. On the other hand, Kernel Density Estimation yields large models which are computationally cheap to build. In this paper we present new methods to get high-quality models that are both compact and fast to compute. This is accomplished by simplifying the Kernel Density Estimator. The simplification is a clustering method based on k-means-like algorithms. Like all k-means algorithms, our method relies on divergences and centroid computations, and we use two different divergences (and their associated centroids), Bregman and Fisher-Rao. Along with the description of the algorithms, we describe the pyMEF library, which is a Python library designed for the manipulation of mixtures of exponential families. Unlike most of the other existing tools, this library allows the use of any exponential family instead of being limited to a particular distribution. This genericity allows one to rapidly explore the different available exponential families in order to choose the one best suited for a particular application. We evaluate the proposed algorithms by building mixture models on examples from a bio-informatics application. The quality of the resulting models is measured in terms of log-likelihood and of Kullback-Leibler divergence.

Keywords: Kernel Density Estimation, simplification, Expectation-Maximization, k-means, Bregman, Fisher-Rao

1 Introduction

Statistical methods are nowadays commonplace in modern signal processing. There are basically two major approaches for modeling experimental data by probability distributions: we may either consider a semi-parametric modeling by a finite mixture model learnt using the Expectation-Maximization (EM) procedure, or alternatively choose a non-parametric modeling using a Kernel Density Estimator (KDE).


On the one hand, mixture modeling requires fixing or learning the number of components but provides a useful compact representation of the data. On the other hand, KDE finely describes the underlying empirical distribution at the expense of a dense model size. In this paper, we present a novel statistical modeling method that efficiently simplifies a KDE model with respect to an underlying distance between Gaussian kernels. We consider the Fisher-Rao metric and the Kullback-Leibler divergence. Since the underlying Fisher-Rao geometry of Gaussians is hyperbolic without a closed-form equation for the centroids, we rather adopt a close approximation that bears the name of hyperbolic model centroid, and show its use in a single-step clustering method. We report on experiments that show that the KDE simplification paradigm is a competitive approach over the classical EM, in terms of both processing time and quality.

In Section 2, we present generic results about exponential families: definition, Legendre transform, various forms of parametrization and associated Bregman divergences. These preliminary notions allow us to introduce the Bregman hard clustering algorithm for the simplification of mixtures.

In Section 3, we present the mixture models and we briefly describe some algorithms to build them.

In Section 4, we introduce tools for the simplification of mixture models. We begin with the well known Bregman Hard Clustering and then present our new tool, the Model Hard Clustering [23], which makes use of an expression of the Fisher-Rao distance for the univariate Gaussian distribution. The Fisher-Rao distance is expressed using the Poincaré hyperbolic distance and the associated centroids are computed with model centroids. Moreover, since an iterative algorithm may be too slow in time-critical applications, we introduce a one-step clustering method which consists in removing the iterative part of a traditional k-means and keeping only the first step of the computation. This method is shown experimentally to achieve the same approximation quality (in terms of log-likelihood) at the cost of a small increase in the number of components of the mixtures.

In Section 5, we describe our new software library pyMEF, aimed at the manipulation of mixtures of exponential families. The goal of this library is to unify the various tools used to build mixtures, which are usually limited to one kind of exponential family. The use of the library is further explained with a short tutorial.

In Section 6, we study experimentally the performance of our methods through two applications. First we give a simple example of the modeling of the intensity histogram of an image, which shows that the proposed methods are competitive in terms of log-likelihood. Second, a real-world application in bio-informatics is presented where the models built by the proposed methods are compared to reference state-of-the-art models built using Dirichlet Process Mixtures.


2 Exponential families

2.1 Definition and examples

A wide range of usual probability density functions belong to the class of exponential families: the Gaussian distribution, but also the Beta, Gamma and Rayleigh distributions and many more. An exponential family is a set of probability mass or probability density functions admitting the following canonical decomposition:

p(x; θ) = exp(〈t(x), θ〉 − F (θ) + k(x)) (1)

with

– t(x) the sufficient statistic,
– θ the natural parameters,
– 〈·, ·〉 the inner product,
– F the log-normalizer,
– k(x) the carrier measure.

The log-normalizer characterizes the exponential family [5]. It is a strictly convex and differentiable function which is equal to:

F(θ) = log ∫_x exp(〈t(x), θ〉 + k(x)) dx    (2)

The next paragraphs detail the decomposition of some common distributions.

Univariate Gaussian distribution The normal distribution is an exponential family: the usual formulation of the density function

f(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))    (3)

matches the canonical decomposition of the exponential families with

– t(x) = (x, x²),
– (θ1, θ2) = (μ/σ², −1/(2σ²)),
– F(θ1, θ2) = −θ1²/(4θ2) + (1/2) log(−π/θ2),
– k(x) = 0.
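The decomposition above can be checked numerically. The short sketch below (plain Python using only the standard math module; the helper names are ours, not a library API) converts the source parameters (μ, σ²) to natural parameters, evaluates the canonical form exp(〈t(x), θ〉 − F(θ) + k(x)), and verifies that it coincides with the usual Gaussian density.

import math

def gaussian_to_natural(mu, sigma2):
    # Source parameters (mu, sigma^2) -> natural parameters (theta1, theta2).
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def log_normalizer(theta1, theta2):
    # F(theta) for the univariate Gaussian, as given above.
    return -theta1 ** 2 / (4.0 * theta2) + 0.5 * math.log(-math.pi / theta2)

def canonical_pdf(x, theta1, theta2):
    # exp(<t(x), theta> - F(theta) + k(x)) with t(x) = (x, x^2) and k(x) = 0.
    return math.exp(theta1 * x + theta2 * x ** 2 - log_normalizer(theta1, theta2))

def usual_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

mu, sigma2 = 1.5, 0.7
theta = gaussian_to_natural(mu, sigma2)
for x in (-1.0, 0.0, 2.0):
    assert abs(canonical_pdf(x, *theta) - usual_pdf(x, mu, sigma2)) < 1e-12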


Multivariate Gaussian distribution The multivariate normal distribution (d is the dimension of the space of the observations)

f(x; μ, Σ) = (1/((2π)^(d/2) √det(Σ))) exp(−(x − μ)ᵀ Σ⁻¹ (x − μ)/2)    (4)

can be described using the canonical parameters as follows:

– t(x) = (x, −xxᵀ),
– (θ1, θ2) = (Σ⁻¹μ, (1/2)Σ⁻¹),
– F(θ1, θ2) = (1/4) tr(θ2⁻¹ θ1 θ1ᵀ) − (1/2) log det θ2 + (d/2) log π,
– k(x) = 0.

2.2 Dual parametrization

The natural parameter space used in the previous section admits a dual space. This dual parametrization of the exponential families comes from the properties of the log-normalizer. Since it is a strictly convex and differentiable function, it admits a dual representation by the Legendre-Fenchel transform:

F*(η) = sup_θ {〈θ, η〉 − F(θ)}    (5)

We get the maximum for η = ∇F(θ). The parameters η are called expectation parameters since η = E[t(x)].

The gradients of F and of its dual F* are inverse functions of each other:

∇F = (∇F*)⁻¹    (6)

and F* itself can be computed by:

F* = ∫ (∇F)⁻¹ + constant.    (7)

Notice that this integral is often difficult to compute and the convex conjugate F* of F may not be known in closed form. We can bypass the anti-derivative operation by plugging into Eq. (5) the optimal value ∇F(θ*) = η (that is, θ* = (∇F)⁻¹(η)). We get

F*(η) = 〈(∇F)⁻¹(η), η〉 − F((∇F)⁻¹(η))    (8)

This requires taking the reciprocal gradient (∇F)⁻¹ = ∇F*, but allows us to discard the constant of integration in Eq. (7).

Thus a member of an exponential family can be described equivalently with the natural parameters or with the dual expectation parameters.
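For the univariate Gaussian both conversions are available in closed form. The sketch below (an illustration with our own helper names, not pyMEF code) implements ∇F and its inverse ∇F* and checks that η = E[t(x)] = (μ, μ² + σ²) and that the two gradients are reciprocal.

def grad_F(theta1, theta2):
    # eta = grad F(theta): expectation parameters of the univariate Gaussian.
    return (-theta1 / (2.0 * theta2),
            theta1 ** 2 / (4.0 * theta2 ** 2) - 1.0 / (2.0 * theta2))

def grad_F_star(eta1, eta2):
    # theta = grad F*(eta): the inverse conversion, expectation -> natural.
    sigma2 = eta2 - eta1 ** 2
    return (eta1 / sigma2, -1.0 / (2.0 * sigma2))

mu, sigma2 = 1.5, 0.7
theta = (mu / sigma2, -1.0 / (2.0 * sigma2))
eta = grad_F(*theta)
assert abs(eta[0] - mu) < 1e-12 and abs(eta[1] - (mu ** 2 + sigma2)) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(theta, grad_F_star(*eta)))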


2.3 Bregman divergences

The Kullback-Leibler (KL) divergence between two members of the same exponential family can be computed in closed form using a bijection between Bregman divergences and exponential families. Bregman divergences are a family of divergences parametrized by the set of strictly convex and differentiable functions F:

BF(p‖q) = F(p) − F(q) − 〈p − q, ∇F(q)〉    (9)

F is a strictly convex and differentiable function called the generator of the Bregman divergence.

The family of Bregman divergences generalizes many classical divergences, for example:

– the squared Euclidean distance, for F(x) = x²,
– the Kullback-Leibler (KL) divergence, with the Shannon negative entropy F(x) = Σ_{i=1}^d xi log xi (also called Shannon information).

Banerjee et al. [2] showed that Bregman divergences are in bijection with the exponential families through the generator F. This bijection allows one to compute the Kullback-Leibler divergence between two members of the same exponential family:

KL(p(x, θ1), p(x, θ2)) = ∫_x p(x, θ1) log (p(x, θ1)/p(x, θ2)) dx    (10)
                       = BF(θ2, θ1)    (11)

where F is the log-normalizer of the exponential family and the generator of the associated Bregman divergence.

Thus, computing the Kullback-Leibler divergence between two members of the same exponential family amounts to computing a Bregman divergence between their natural parameters (with swapped order).
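As an illustration of Eq. (11), the following sketch computes the Kullback-Leibler divergence between two univariate Gaussians twice: once with the well-known closed form and once as the Bregman divergence BF(θ2, θ1) generated by the log-normalizer of Section 2.1. The two values agree up to numerical precision; the function names are ours, not a library API.

import math

def F(theta1, theta2):
    # Log-normalizer of the univariate Gaussian family.
    return -theta1 ** 2 / (4.0 * theta2) + 0.5 * math.log(-math.pi / theta2)

def grad_F(theta1, theta2):
    return (-theta1 / (2.0 * theta2),
            theta1 ** 2 / (4.0 * theta2 ** 2) - 1.0 / (2.0 * theta2))

def bregman(p, q):
    # B_F(p || q) = F(p) - F(q) - <p - q, grad F(q)>.
    gq = grad_F(*q)
    return F(*p) - F(*q) - sum((a - b) * g for a, b, g in zip(p, q, gq))

def natural(mu, sigma2):
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def kl_gaussians(mu1, s1, mu2, s2):
    # Closed-form KL( N(mu1, s1) : N(mu2, s2) ), with s denoting sigma^2.
    return 0.5 * (math.log(s2 / s1) + (s1 + (mu1 - mu2) ** 2) / s2 - 1.0)

theta1, theta2 = natural(0.0, 1.0), natural(2.0, 0.5)
# Eq. (11): KL(p(x, theta1), p(x, theta2)) = B_F(theta2, theta1).
assert abs(kl_gaussians(0.0, 1.0, 2.0, 0.5) - bregman(theta2, theta1)) < 1e-12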

2.4 Bregman centroids

Except for the squared Euclidean distance and the squared Mahalanobis distance, Bregman divergences are not symmetrical. This leads to two sided definitions for Bregman centroids:

– the left-sided one

  cL = arg min_x Σ_i ωi BF(x, pi)    (12)

– and the right-sided one

  cR = arg min_x Σ_i ωi BF(pi, x)    (13)


These two centroids are centroids by optimization, that is, the unique solutions of an optimization problem. Using this principle and various symmetrizations of the KL divergence, we can design symmetrized Bregman centroids:

– Jeffreys-Bregman divergences:

  SF(p, q) = (BF(p, q) + BF(q, p)) / 2    (14)

– Jensen-Bregman divergences [18]:

  JF(p, q) = (BF(p, (p+q)/2) + BF(q, (p+q)/2)) / 2    (15)

– Skew Jensen-Bregman divergences [18]:

  JF^(α)(p, q) = α BF(p, αp + (1−α)q) + (1−α) BF(q, αp + (1−α)q)    (16)

Closed-form formulas are known for the left- and right-sided centroids [2]:

cR = arg min_x Σ_i ωi BF(pi, x)    (17)
   = Σ_{i=1}^n ωi pi    (18)

cL = arg min_x Σ_i ωi BF(x, pi)    (19)
   = ∇F*(Σ_i ωi ∇F(pi))    (20)
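These closed forms translate directly into code. The sketch below (generic helpers, not a library API) computes the right-sided centroid as the weighted arithmetic mean of the parameters and the left-sided one through ∇F and its inverse, instantiated here with the univariate Gaussian conversions of Section 2.2.

def grad_F(t1, t2):
    # Univariate Gaussian: natural -> expectation parameters (Section 2.2).
    return (-t1 / (2.0 * t2), t1 ** 2 / (4.0 * t2 ** 2) - 1.0 / (2.0 * t2))

def grad_F_star(e1, e2):
    # Univariate Gaussian: expectation -> natural parameters.
    s = e2 - e1 ** 2
    return (e1 / s, -1.0 / (2.0 * s))

def right_centroid(points, weights):
    # c_R = sum_i w_i p_i (Eq. 18): coordinate-wise weighted mean.
    return tuple(sum(w * p[d] for w, p in zip(weights, points))
                 for d in range(len(points[0])))

def left_centroid(points, weights):
    # c_L = grad F*( sum_i w_i grad F(p_i) ) (Eq. 20).
    images = [grad_F(*p) for p in points]
    mean = tuple(sum(w * g[d] for w, g in zip(weights, images))
                 for d in range(len(images[0])))
    return grad_F_star(*mean)

# Natural parameters of N(0, 1) and N(2, 0.5), with equal weights.
p1, p2 = (0.0, -0.5), (4.0, -1.0)
w = (0.5, 0.5)
c_R = right_centroid([p1, p2], w)   # weighted mean of the natural parameters
c_L = left_centroid([p1, p2], w)    # average taken in expectation coordinates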

3 Mixture Models

3.1 Statistical mixtures

Mixture models are a widespread tool for modeling complex data in a wide variety of domains, from image processing to medical data analysis through speech recognition. This success is due to the capacity of these models to estimate the probability density function (pdf) of complex random variables. For a mixture f of n components, the probability density function takes the form:

f(x) = Σ_{i=1}^n ωi g(x; θi)    (21)

where ωi denotes the weight of component i (Σ ωi = 1) and θi are the parameters of the exponential family g.


Gaussian mixture models (GMM) are a universal special case used in the large majority of mixture model applications:

f(x) = Σ_{i=1}^n ωi g(x; μi, σi²)    (22)

Each component g(x; μi, σi²) is a normal distribution, either univariate or multivariate.

Even if GMMs are the most used mixture models, mixtures of exponential families like Gamma, Beta or Rayleigh distributions are common in some fields [14,12].

3.2 Getting mixtures

We present here some well-known algorithms to build mixtures. For more details, please refer to the references cited in the next paragraphs.

Expectation-Maximization The most common tool for the estimation of the parameters of a mixture model is the Expectation-Maximization (EM) algorithm [8]. It maximizes the likelihood of the density estimation by iteratively computing the expectation of the log-likelihood using the current estimate of the parameters (E step) and by updating the parameters in order to maximize the log-likelihood (M step).

Although originally considered for Mixtures of Gaussians (MoGs), Expectation-Maximization has been extended by Banerjee et al. [2] to learn mixtures of arbitrary exponential families.

The pitfall is that this method leads only to a local maximum of the log-likelihood. Moreover, the number of components is difficult to choose.

Dirichlet Process Mixtures To avoid the problem of choosing the number of components, it has been proposed to use a mixture model with an infinite number of components. It can be done with a Dirichlet process mixture (DPM) [20], which uses a Dirichlet process to build priors for the mixing proportions of the components. If one needs a finite mixture, it is easy to sort the components according to their weights ωi and to keep only the components above some threshold. The main drawback is that building the model requires evaluating a Dirichlet process using a Markov Chain Monte-Carlo method (for example with the Metropolis algorithm), which is computationally costly.

Kernel Density Estimation The kernel density estimator (KDE) [19] (also known as the Parzen windows method) avoids the problem of choosing the number of components by using one component (a Gaussian kernel) centered on each point of the dataset. All the components share the same weight and, since the μi parameters come directly from the data points, the only remaining parameters are the σi, which are chosen equal to a constant called the bandwidth. The critical part of the algorithm is the choice of the bandwidth: many studies have been devoted to automatically tuning this parameter (see [25] for a comprehensive survey) but it can also be chosen by hand depending on the dataset. Since there is one Gaussian component per point in the data set, a mixture built with a kernel density estimator is difficult to manipulate: the size is large and common operations are slow (evaluation of the density, random sampling, etc.) since it is necessary to loop over all the components of the mixture.
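To make the size issue concrete, the following sketch builds a KDE as a plain list of (weight, μ, σ) triples, one Gaussian kernel per sample with a hand-picked bandwidth; evaluating the density then loops over every component, which is exactly the cost discussed above. The names are illustrative and not pyMEF's API.

import math

def build_kde(samples, bandwidth):
    # One Gaussian kernel per data point, uniform weights, common bandwidth.
    w = 1.0 / len(samples)
    return [(w, x, bandwidth) for x in samples]

def mixture_pdf(mixture, x):
    # Density of a Gaussian mixture given as (weight, mu, sigma) triples.
    total = 0.0
    for w, mu, sigma in mixture:        # the loop grows with the dataset size
        z = (x - mu) / sigma
        total += w * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return total

samples = [0.1, 0.4, 0.5, 2.3, 2.7, 3.1]
kde = build_kde(samples, bandwidth=0.3)   # 6 samples -> 6 components
density = mixture_pdf(kde, 0.5)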

Pros and cons The main drawbacks of the EM algorithm are the risk of converging to a local optimum and the number of iterations needed to find this optimum. While it may be costly, this time is only spent during the learning step. On the other hand, learning a KDE is nearly free but evaluating the associated pdf is costly since we need to loop over each component of the mixture. Given the typical size of a dataset (a 120×120 image leads to 14400 components), the mixture can be unsuitable for time-critical applications. Dirichlet process mixtures usually give high precision models which are very useful in some applications [3], but at a computational cost which is not affordable in most applications.

Since mixtures with a low number of components have proved their capacity to model complex data (Figure 1), it would be useful to build such a mixture while avoiding the costly learning step of EM or DPM.

4 Simplification of kernel density estimators

4.1 Bregman Hard Clustering

The Bregman Hard Clustering algorithm is an extension of the celebrated k-means clustering algorithm to the class of Bregman divergences [2]. It has been proposed in Garcia et al. [10] to use this method for the simplification of mixtures of exponential families. Similarly to the Lloyd k-means algorithm, the goal is to minimize the following cost function, for the simplification of an n-component mixture into a k-component mixture (with k < n):

L = min_{θ'1,...,θ'k} Σ_{1≤j≤k} Σ_i BF(θ'j, θi)    (23)

where F is the log-normalizer of the considered exponential family, the θi are the natural parameters of the source mixture and the θ'j are the natural parameters of the target mixture.

With the bijection between exponential families and Bregman divergences, the cost function L can be written in terms of the Kullback-Leibler divergence:

L = min_{c1,...,ck} Σ_{1≤j≤k} Σ_i KL(xi, cj)    (24)

where the xi are the components of the original mixture and the cj are the components of the target mixture. With this reformulation, the Bregman Hard Clustering is shown to be a k-means with the Kullback-Leibler divergence (instead of the usual L2-based distance). As in the L2 version, the k-means involves two steps: assignment and centroid update. The centroids of the clusters are computed here using the closed-form formulas presented in Section 2.4.

Though left-sided, right-sided and symmetrized formulations of this optimization problem can be used, it has been shown experimentally in [10] that the right-sided Bregman Hard Clustering performs better in terms of Kullback-Leibler error. This experimental result is explained theoretically by a theorem stating that the right-sided centroid is the best single-component approximation of a mixture model, in terms of Kullback-Leibler divergence. This result was introduced by Pelletier [1]; a complete and more precise proof is given in the following section.
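A minimal version of this simplification loop is sketched below: components are represented by their natural parameters, the assignment step uses KL(xi, cj) computed as the Bregman divergence BF(θcj, θi), and the centroid update is the weighted average taken in expectation coordinates, i.e. the closed form of Eq. (20). The univariate Gaussian helpers are inlined; this is an illustrative sketch under those conventions, not pyMEF's implementation.

import math, random

def F(t1, t2):
    return -t1 ** 2 / (4.0 * t2) + 0.5 * math.log(-math.pi / t2)

def grad_F(t1, t2):
    return (-t1 / (2.0 * t2), t1 ** 2 / (4.0 * t2 ** 2) - 1.0 / (2.0 * t2))

def grad_F_star(e1, e2):
    s = e2 - e1 ** 2
    return (e1 / s, -1.0 / (2.0 * s))

def bregman(p, q):
    gq = grad_F(*q)
    return F(*p) - F(*q) - sum((a - b) * g for a, b, g in zip(p, q, gq))

def kl_hard_clustering(thetas, weights, k, iterations=20):
    # k-means on natural parameters, with KL(x_i : c_j) = B_F(theta_cj, theta_i).
    centroids = random.sample(thetas, k)
    for _ in range(iterations):
        # Assignment step: nearest centroid in the Kullback-Leibler sense.
        labels = [min(range(k), key=lambda j: bregman(centroids[j], t)) for t in thetas]
        # Update step: weighted average in expectation coordinates (Eq. 20).
        for j in range(k):
            cluster = [(w, grad_F(*t)) for w, t, l in zip(weights, thetas, labels) if l == j]
            if not cluster:
                continue
            wsum = sum(w for w, _ in cluster)
            eta = tuple(sum(w * e[d] for w, e in cluster) / wsum for d in range(2))
            centroids[j] = grad_F_star(*eta)
    return centroids

# Simplify a toy 6-component "KDE" (natural parameters of N(mu, 0.09)) down to k = 2.
mus = [0.1, 0.4, 0.5, 2.3, 2.7, 3.1]
thetas = [(m / 0.09, -1.0 / 0.18) for m in mus]
weights = [1.0 / len(thetas)] * len(thetas)
simplified = kl_hard_clustering(thetas, weights, 2)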

4.2 Kullback-Leibler centroids as geometric projections

Pelletier proved ([1], Theorem 4.1) that the right-sided KL barycenter p̄* can be interpreted as the information-theoretic projection of the mixture model distribution p̃ ∈ P onto the model exponential family sub-manifold EF:

p̄* = arg min_{p ∈ EF} KL(p̃ : p)    (25)

Since the mixture of exponential families is not an exponential family (p̃ ∉ EF),¹ it yields a neat interpretation: the best KL approximation of a mixture of components of the same exponential family is the exponential family member defined using the right-sided KL barycenter of the mixture parameters.

Let θi^j, for j ∈ {1, ..., d}, be the d coordinates in the primal coordinate system of the parameter θi.

Let us write for short θ = θ(p) and θ̄* = θ(p̄*) for the natural coordinates of p and p̄*, respectively. Similarly, denote by η = η(p), η̄ = η(p̄), and η̄* = η(p̄*) the dual moment coordinates of p, p̄ and p̄*, respectively.

We have

KL(p̃ : p) = ∫ p̃(x) log (p̃(x)/p(x)) dx    (26)
           = Ep̃[log p̃] − Ep̃[log p]    (27)
           = Ep̃[log p̃] − Ep̃[〈θ, t(x)〉 − F(θ) + k(x)]    (28)
           = Ep̃[log p̃] + F(θ) − 〈θ, Ep̃[t(x)]〉 − Ep̃[k(x)]    (29)

since Ep̃[F(θ)] = F(θ) ∫ p̃(x) dx = F(θ).

Using the fact that Ep̃[t(x)] = E_{Σ_{i=1}^n wi pF(x;θi)}[t(x)] = Σ_{i=1}^n wi E_{pF(x;θi)}[t(x)] = Σ_{i=1}^n wi ηi = η̄*, it follows that

1 The product of exponential families is an exponential family.


Fig. 1. Top to bottom, left to right: original image, original histogram, raw KDE (14400 components) and simplified mixture (8 components). Even with very few components compared to the mixture produced by the KDE, the simplified mixture still reproduces very well the shape of the histogram.


KL(p̃ : p) = Ep̃[log p̃] + F(θ) − Ep̃[k(x)] − 〈θ, Σ_{i=1}^n wi ηi〉    (30)
           = Ep̃[log p̃] + F(θ) − Ep̃[k(x)] − 〈θ, η̄*〉.    (31)

Let us now add for mathematical convenience the neutralized sum F(θ̄*) + 〈θ̄*, η̄*〉 − F(θ̄*) − 〈θ̄*, η̄*〉 = 0 to the former equation.

Since

KL(p̄* : p) = BF(θ : θ̄*) = F(θ) − F(θ̄*) − 〈θ − θ̄*, η̄*〉,    (32)

and

KL(p̃ : p̄*) = Ep̃[log p̃] − Ep̃[k(x)] + F(θ̄*) − 〈θ̄*, η̄*〉,    (33)

we end up with the following Pythagorean sum:

KL(p̃ : p) = Ep̃[log p̃] + F(θ) − Ep̃[k(x)] − 〈η̄*, θ〉 + F(θ̄*) + 〈θ̄*, η̄*〉 − F(θ̄*) − 〈θ̄*, η̄*〉    (34)–(35)

KL(p̃ : p) = KL(p̄* : p) + KL(p̃ : p̄*)    (36)

This expression is therefore minimized for KL(p̄* : p) = 0 (since KL(p̄* : p) ≥ 0), that is for p = p̄*. The closest distribution of EF to p̃ ∈ P is given by the dual barycenter. In other words, the distribution p̄* is the right-sided KL projection of the mixture model onto the model sub-manifold. Geometrically speaking, it is the projection of p̃ via the mixture connection: the m-connection. Figure 2 illustrates the projection operation.

Fig. 2. Projection operation from the mixture manifold to the model exponential family sub-manifold.


This theoretically explains why the right-sided KL centroid (i.e., the left-sided Bregman centroid) is preferred for simplifying mixtures [16] emanating from a kernel density estimator.
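The Pythagorean decomposition of Eq. (36) can also be verified numerically on a small example. The sketch below builds a two-component Gaussian mixture p̃, computes its moment-matched Gaussian p̄* (the right-sided KL barycenter, η̄* = Σ wi ηi) and checks, for an arbitrary Gaussian p, that KL(p̃ : p) = KL(p̄* : p) + KL(p̃ : p̄*). It assumes SciPy is available for the numerical integration; the helper names are ours.

import math
from scipy.integrate import quad

def npdf(x, mu, s2):
    return math.exp(-(x - mu) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)

def kl_gauss(mu1, s1, mu2, s2):
    # Closed-form KL between two univariate Gaussians (s denotes sigma^2).
    return 0.5 * (math.log(s2 / s1) + (s1 + (mu1 - mu2) ** 2) / s2 - 1.0)

# The mixture p~ and an arbitrary Gaussian p.
w, comps = [0.4, 0.6], [(-1.0, 0.5), (2.0, 1.0)]          # components (mu_i, sigma_i^2)
p_mix = lambda x: sum(wi * npdf(x, m, s) for wi, (m, s) in zip(w, comps))
mu_p, s2_p = 0.5, 4.0

# Moment-matched Gaussian p_bar*: eta_bar* = sum_i w_i eta_i.
mu_star = sum(wi * m for wi, (m, s) in zip(w, comps))
ex2_star = sum(wi * (m ** 2 + s) for wi, (m, s) in zip(w, comps))
s2_star = ex2_star - mu_star ** 2

def kl_mix_to_gauss(mu, s2):
    # KL(p~ : N(mu, s2)) by numerical quadrature.
    return quad(lambda x: p_mix(x) * math.log(p_mix(x) / npdf(x, mu, s2)), -15, 15)[0]

lhs = kl_mix_to_gauss(mu_p, s2_p)                                      # KL(p~ : p)
rhs = kl_gauss(mu_star, s2_star, mu_p, s2_p) + kl_mix_to_gauss(mu_star, s2_star)
assert abs(lhs - rhs) < 1e-6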

Fig. 3. Right-sided (dashed line) and left-sided (dotted line) Kullback-Leibler centroids of a 2-component Gaussian mixture model. The left-sided centroid focuses on the highest mode of the mixture while the right-sided one tries to cover the supports of all the components. Pelletier's result says the right-sided centroid is the closest Gaussian to the mixture.

4.3 Model Hard Clustering

The statistical manifold of the parameters of exponential families can be studied through the framework of Riemannian geometry. It has been proved by Čencov [6] that the Fisher-Rao metric is the only meaningful Riemannian metric on the statistical manifold:

I(θ) = [gij] = E[ (∂ log p / ∂θi) (∂ log p / ∂θj) ]    (37)

The Fisher-Rao distance (FRD) between two distributions is computed using the length of the geodesic path between the two points on the statistical manifold:

FRD(p(x; θ1), p(x; θ2)) = min_{θ(t)} ∫_0^1 √( (dθ/dt)ᵀ I(θ) (dθ/dt) ) dt    (38)

with θ such that θ(0) = θ1 and θ(1) = θ2.

This integral is not known in the general case and is usually difficult to compute (see [21] for a numerical approximation in the case of the Gamma distribution).


However, it is known that in the case of a normal distribution the Fisher-Rao metric yields a hyperbolic geometry [13,7].

For univariate Gaussians, a closed-form formula for the Fisher-Rao distance can be given, using the Poincaré hyperbolic distance in the Poincaré upper half-plane:

FRD(f(x; μp, σp²), f(x; μq, σq²)) = √2 ln [ ( |(μp/√2, σp) − (μq/√2, −σq)| + |(μp/√2, σp) − (μq/√2, σq)| ) / ( |(μp/√2, σp) − (μq/√2, −σq)| − |(μp/√2, σp) − (μq/√2, σq)| ) ]    (39)

where | · | denotes the L2 Euclidean distance.
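Equation (39) is straightforward to implement. In the sketch below (our own function, not a library call), each Gaussian is mapped to the point (μ/√2, σ) of the upper half-plane represented as a complex number, so that the complex conjugate plays the role of (μq/√2, −σq).

import math

def fisher_rao_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    # Fisher-Rao distance between N(mu_p, sigma_p^2) and N(mu_q, sigma_q^2), Eq. (39).
    zp = complex(mu_p / math.sqrt(2.0), sigma_p)   # point in the Poincare upper half-plane
    zq = complex(mu_q / math.sqrt(2.0), sigma_q)
    a = abs(zp - zq.conjugate())                   # distance to (mu_q / sqrt(2), -sigma_q)
    b = abs(zp - zq)
    return math.sqrt(2.0) * math.log((a + b) / (a - b))

d = fisher_rao_gaussian(0.0, 1.0, 2.0, 0.5)
assert abs(fisher_rao_gaussian(0.0, 1.0, 0.0, 1.0)) < 1e-12   # identical Gaussians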

In order to perform the k-means iterations using the Fisher-Rao distance, we need to define centroids on the hyperbolic space. Model centroids, introduced by Galperin [9] and successfully used in [22] for hyperbolic centroidal Voronoi tessellations, are a way to define centroids in the three kinds of constant curvature spaces (namely Euclidean, hyperbolic and spherical). For a d-dimensional curved space, the construction starts with finding a (d + 1)-dimensional model in the Euclidean space. For a 2D hyperbolic space, it will be the Minkowski model, that is the upper sheet of the hyperboloid −x² − y² + z² = 1.

Fig. 4. Computation of the centroid c given the system (ω1, p1), (ω2, p2).


First, each point p (with coordinates (xp, yp)) lying on the Klein disk is embedded in the Minkowski model:

xp′ = xp / √(1 − xp² − yp²),  yp′ = yp / √(1 − xp² − yp²),  zp′ = 1 / √(1 − xp² − yp²)    (40)

Next the center of mass of the points is computed:

c″ = Σ_i ωi p′i    (41)

This point needs to be normalized to lie on the Minkowski model, so we look for the intersection between the vector Oc″ and the hyperboloid:

c′ = c″ / √(−xc″² − yc″² + zc″²)    (42)

From this point in the Minkowski model, we can use the reverse transform in order to get a point in the original Klein disk [17]:

xc = xc′ / zc′,  yc = yc′ / zc′    (43)

Although this scheme gives the centroid of points located on the Klein disk, it is not sufficient since the parameters of the Gaussian distribution live in the Poincaré upper half-plane [7]. Thus we need to convert points from one model to another, using the Poincaré disk as an intermediate step. For a point (a, b) on the half-plane, let z = a + ib; the mapping with the Poincaré disk is:

z′ = (z − i) / (z + i),  z = i(z′ + 1) / (1 − z′)    (44)

And for a point p on the Poincaré disk, the mapping with a point k on the Klein disk is:

p = ((1 − √(1 − 〈k, k〉)) / 〈k, k〉) k,  k = (2 / (1 + 〈p, p〉)) p    (45)
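Putting the previous steps together, the sketch below computes a model centroid of univariate Gaussians: each (μ, σ) is sent to the Poincaré upper half-plane, mapped to the Poincaré disk by Eq. (44) and to the Klein disk by Eq. (45), embedded into the Minkowski model by Eq. (40), averaged and renormalized by Eqs. (41)-(42), and finally mapped back to the half-plane through Eqs. (43)-(45). The function names are ours; this is a sketch of the construction, not pyMEF code.

import math

def halfplane_to_disk(z):
    # Poincare upper half-plane -> Poincare disk, Eq. (44).
    return (z - 1j) / (z + 1j)

def disk_to_halfplane(zp):
    return 1j * (zp + 1) / (1 - zp)

def poincare_to_klein(p):
    # Poincare disk -> Klein disk, Eq. (45): k = 2p / (1 + <p, p>).
    return 2 * p / (1 + abs(p) ** 2)

def klein_to_poincare(k):
    n = abs(k) ** 2
    return k if n == 0 else (1 - math.sqrt(1 - n)) / n * k

def klein_to_minkowski(k):
    # Klein disk -> upper sheet of -x^2 - y^2 + z^2 = 1, Eq. (40).
    s = math.sqrt(1 - abs(k) ** 2)
    return (k.real / s, k.imag / s, 1.0 / s)

def model_centroid(gaussians, weights):
    # Model centroid of univariate Gaussians given as (mu, sigma) pairs.
    pts = [klein_to_minkowski(poincare_to_klein(halfplane_to_disk(
               complex(mu / math.sqrt(2.0), sigma))))
           for mu, sigma in gaussians]
    cx, cy, cz = (sum(w * p[d] for w, p in zip(weights, pts)) for d in range(3))  # Eq. (41)
    norm = math.sqrt(-cx ** 2 - cy ** 2 + cz ** 2)
    cxp, cyp, czp = cx / norm, cy / norm, cz / norm       # Eq. (42): back onto the hyperboloid
    k = complex(cxp / czp, cyp / czp)                     # Eq. (43): back to the Klein disk
    z = disk_to_halfplane(klein_to_poincare(k))
    return z.real * math.sqrt(2.0), z.imag                # back to (mu, sigma)

centroid_mu, centroid_sigma = model_centroid([(0.0, 1.0), (2.0, 0.5)], [0.5, 0.5])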

5 Software library

5.1 Presentation

Several tools are already available to build mixture models, either for mixtures of Gaussian distributions or for mixtures of other distributions. But these tools are usually dedicated to a particular family of distributions.

In order to provide a unified and powerful framework for the manipulation of arbitrary mixture models, we develop pyMEF, a Python library dedicated to the mixtures of exponential families.

Given the success of Gaussian mixture models, numerous other software packages are already available to deal with them:


– some R packages: MCLUST (http://www.stat.washington.edu/mclust/) and MIX (http://icarus.math.mcmaster.ca/peter/mix/),
– MIXMOD [4], which also works on multinomial distributions and provides bindings for Matlab and Scilab,
– PyMIX [11], another Python library which goes beyond simple mixtures with Context-specific independence mixtures and dependence trees,
– scikits.learn, a Python module for machine learning (http://scikit-learn.sf.net),
– jMEF [16,10], which is the only other library dealing with mixtures of exponential families, written in Java.

Although exponential families other than normal distributions have been successfully used in the literature (see [12] for an example with the Beta distribution), this was done using an implementation specific to the underlying distribution. The improvement of libraries such as jMEF and pyMEF is to introduce genericity: changing the exponential family simply means changing a parameter of the Bregman Soft Clustering (equivalent to performing an EM task), and not completely rewriting the algorithm.

Moreover, the choice of a good distribution is a difficult problem in itself, and is often settled experimentally, by looking at the shape of the histogram or by comparing a performance score (the log-likelihood or any meaningful score in the considered application) computed with mixtures of various distributions. It is worthwhile here to use a unified framework instead of different libraries from various sources with various interfaces.

The goal of the pyMEF library is to provide a consistent framework with various algorithms to build mixtures (Bregman Soft Clustering) and various information-theoretic simplification methods (Bregman Hard Clustering, Burbea-Rao Hard Clustering [15], Fisher Hard Clustering) along with some widespread exponential families:

– univariate Gaussian,
– multivariate Gaussian,
– Generalized Gaussian,
– multinomial,
– Rayleigh,
– Laplacian.

Another goal of pyMEF is to be easily extensible, and more distributions are planned, like:

– Dirichlet,
– Gamma,
– Von Mises-Fisher.


5.2 Extending pyMEF

The set of available exponential families can be easily extended by users. Following the principle of the Flash Cards introduced in [16] for jMEF, it is sufficient to implement in a Python class the functions describing the distribution (a minimal skeleton is sketched after the list):

– the core of the family (the log-normalizer F and its gradient ∇F, the carrier measure k and the sufficient statistic t),
– the dual characterization with the Legendre dual of F (F* and ∇F*),
– the conversions between the three parameter spaces (source to natural, natural to expectation, expectation to source, and their reciprocals).
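As an illustration, such a flash card for the univariate Gaussian could look like the skeleton below. The method names are hypothetical and only mirror the description above; the exact pyMEF interface is not reproduced here.

import math

class UnivariateGaussianCard(object):
    # Hypothetical flash card listing the functions an exponential family must provide.

    # Core of the family.
    def t(self, x):                          # sufficient statistic
        return (x, x * x)

    def k(self, x):                          # carrier measure
        return 0.0

    def F(self, theta):                      # log-normalizer
        t1, t2 = theta
        return -t1 ** 2 / (4.0 * t2) + 0.5 * math.log(-math.pi / t2)

    def grad_F(self, theta):
        t1, t2 = theta
        return (-t1 / (2.0 * t2), t1 ** 2 / (4.0 * t2 ** 2) - 1.0 / (2.0 * t2))

    # Dual characterization.
    def F_star(self, eta):
        e1, e2 = eta
        return -0.5 * math.log(2.0 * math.pi * math.e * (e2 - e1 ** 2))

    def grad_F_star(self, eta):
        e1, e2 = eta
        s = e2 - e1 ** 2
        return (e1 / s, -1.0 / (2.0 * s))

    # Conversions between the three parameter spaces.
    def source_to_natural(self, source):     # (mu, sigma^2) -> theta
        mu, s = source
        return (mu / s, -1.0 / (2.0 * s))

    def natural_to_expectation(self, theta):
        return self.grad_F(theta)

    def expectation_to_source(self, eta):    # eta -> (mu, sigma^2)
        e1, e2 = eta
        return (e1, e2 - e1 ** 2)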

5.3 An example with a Gaussian Mixture Model

We present here a basic example of a pyMEF session. The following can be used interactively in the Python toplevel or be part of a larger program. This allows both a rapid exploration of a dataset and the development of a real application with the same tools.

We begin by loading the required modules:

import numpy
from matplotlib import pyplot

from pyMEF.Build import BregmanSoftClustering, KDE
from pyMEF.Simplify import BregmanHardClustering
from pyMEF.Families import UnivariateGaussian

An example dataset (6550 samples) is loaded using standard numpy functions:

data = numpy.loadtxt("data.txt")
data = data.reshape(data.shape[0], 1)

An 8-component mixture model is built on this dataset using the Bregman Soft Clustering algorithm (also known as EM in the Gaussian case):

em = BregmanSoftClustering(data, 8, UnivariateGaussian, ())
mm_em = em.run()

Another mixture is built using Kernel Density Estimation (leading to a 6550-component mixture).

mm_kde = KDE(data, UnivariateGaussian, ())

This very large model is then simplified into an 8-component mixture with the Bregman Hard Clustering algorithm:

kmeans = BregmanHardClustering(mm_kde, 8)
mm_s = kmeans.run()

We finally compute the log-likelihood of the models (original and simplified).


print "EM: " , mm_em. l ogL ik e l i h ood ( data )print "KDE: " , mm_kde. l o gL i k e l i h ood ( data )print " S imp l i f i e d KDE: " , mm_s. l o gL i k e l i h ood ( data )

For illustration purposes (see Figure 5), we plot the histogram of the original data and the three computed models (pyMEF does not provide any display functions; we rely instead on the powerful matplotlib² library).

pyplot.subplot(2, 2, 1)
pyplot.hist(data, 1000)
pyplot.xlim(0, 20)

x = numpy.arange(0, 20, 0.1)

pyplot.subplot(2, 2, 2)
pyplot.plot(x, mm_em(x))

pyplot.subplot(2, 2, 3)
pyplot.plot(x, mm_kde(x))

pyplot.subplot(2, 2, 4)
pyplot.plot(x, mm_s(x))

pyplot.show()

A real application would obviously use multiple runs of the soft and hard clustering algorithms to avoid being trapped in a bad local optimum, since both are local optimization methods.

In this example, the Bregman Soft Clustering gives the best result in terms of log-likelihood (Table 1) but the model is visually not really satisfying (there are several local maxima near the first mode of the histogram, instead of just one mode). The models relying on Kernel Density Estimation give a slightly worse log-likelihood but are visually more convincing. The important point is the quality of the simplified model: while having far fewer components (8 instead of 6550), the simplified model is nearly identical to the original KDE (both visually and in terms of log-likelihood).

Model             Log-likelihood
EM                -18486.7957123
KDE               -18985.4483699
Simplified KDE    -19015.0604457

Table 1. Log-likelihood of the three computed models. EM still gives the best value and the simplified KDE has nearly the same log-likelihood as the original KDE.

2 http://matplotlib.sourceforge.net/


Fig. 5. Output from the pyMEF demo. Top-left, the histogram of the data; top-right, the model computed by EM; bottom-left, the one from KDE; bottom-right, the simplified KDE. Visual appearance is quite bad for EM while it is very good for both KDE and simplified KDE, even with far fewer components in the simplified version.


5.4 Examples with other exponential families

Although the Gaussian case is the most widespread and the most universal one, many other exponential families are useful in particular applications. We present here two examples implemented in pyMEF using the formulas detailed in [16].

Rayleigh distribution Rayleigh mixture models are used in the field of Intravascular UltraSound imaging [24] for segmentation and classification tasks. We present in Figure 6 an example of the learning of a Rayleigh mixture model on a synthetic dataset built from a 5-component mixture of Rayleigh distributions. The graphics shown in this figure have been generated with the following script (for the sake of brevity, we omit here the loops used to select the best model among several tries). Notice how similar this code is to the previous example, showing the genericity of our library: using different exponential families for the mixtures is just a matter of changing one parameter in the program.

import sys, numpy

from pyMEF import MixtureModel
from pyMEF.Build import BregmanSoftClustering
from pyMEF.Simplify import BregmanHardClustering
from pyMEF.Families import Rayleigh

# Original mixture
k = 5
mm = MixtureModel(k, Rayleigh, ())
mm[0].source((1.,))
mm[1].source((10.,))
mm[2].source((3.,))
mm[3].source((5.,))
mm[4].source((7.,))

# Data sample
data = mm.rand(10000)

# Bregman Soft Clustering, k=5
em5 = BregmanSoftClustering(data, 5, Rayleigh, ())
em5.run()
mm_em5 = em5.mixture()

# Bregman Soft Clustering, k=32 + simplification
em32 = BregmanSoftClustering(data, 32, Rayleigh, ())
em32.run()
mm_em32 = em32.mixture()

kmeans5 = BregmanHardClustering(mm_em32, 5)
kmeans5.run()
mm_simplified = kmeans5.mixture()

Laplace distribution Although Laplace distributions are exponential families only when their mean is zero, zero-mean Laplacian mixture models are used in various applications. Figure 7 presents the same experiments as Figure 6 and has been generated with exactly the same script, just by replacing all occurrences of the word Rayleigh by the word CenteredLaplace.


Fig. 6. Rayleigh mixture models. The top left figure is the true mixture (synthetic data) and the top right one is the histogram of 10000 samples drawn from the true mixture. The bottom left figure is a mixture built with the Bregman Soft Clustering algorithm (with 5 components) and the bottom right one is a mixture built by first getting a 32-component mixture with Bregman Soft Clustering and then simplifying it to a 5-component mixture with the Bregman Hard Clustering algorithm.


Fig. 7. Laplace mixture models. The top left figure is the true mixture (synthetic data) and the top right one is the histogram of 10000 samples drawn from the true mixture. The bottom left figure is a mixture built with the Bregman Soft Clustering algorithm (with 5 components) and the bottom right one is a mixture built by first getting a 32-component mixture with Bregman Soft Clustering and then simplifying it to a 5-component mixture with the Bregman Hard Clustering algorithm.


6 Applications

6.1 Experiments on images

We study here the quality, in terms of log-likelihood, and the computation time of the proposed methods compared to a baseline Expectation-Maximization algorithm. The source distribution is the intensity histogram of the famous Lena image (see Figure 1). As explained in Section 4.1, for the Kullback-Leibler divergence we report only results for right-sided centroids, since they perform better (as indicated by the theory) than the two other flavors and have the same computation cost. The third and fourth methods are the Model centroid, both with a full k-means and with only one iteration.

The top part of Figure 8 shows the evolution of the log-likelihood as a function of the number of components k. First, we see that all the algorithms perform nearly the same and converge very quickly to a maximum value (the KL curve is merged with the EM one).

The Kullback-Leibler divergence and the Fisher-Rao metric perform similarly, but they are rather different from a theoretical standpoint: KL assumes an underlying flat geometry while Fisher-Rao is related to the curved hyperbolic geometry of Gaussian distributions. However, at infinitesimal scale (or on dense compact clusters) they behave the same.

The bottom part of Figure 8 describes the running time (in seconds) as a function of k. Despite the fact that the quality of the mixtures is nearly identical, the costs are very different. Kullback-Leibler divergence is very slow (even in closed form, the formulas are quite complex to calculate). While achieving the same log-likelihood, the model centroid is the fastest method, significantly faster than EM.

While being slower to converge when k increases, the one-step model clustering still performs well and is roughly two times faster than a complete k-means. The initialization is random: we do not use k-means++ here since its cost during initialization cancels the benefit of performing only one step.

6.2 Prediction of 3D structures of RNA molecules

RNA molecules play an important role in many biological processes. The understanding of the functions of these molecules depends on the study of their 3D structure. A common approach is to use knowledge-based potentials built from inter-atomic distances coming from experimentally determined structures. Recent works use mixture models [3] to model the distribution of the inter-atomic distances.

In the original work presented in [3] the authors use Dirichlet Process Mixtures to build the mixture models. This gives high quality mixtures, both in terms of log-likelihood and in the context of the application, but with a high computational cost which is not affordable for building thousands of mixtures.

We study here the effectiveness of our proposed simplified mixtures compared to reference high quality mixtures built with Dirichlet Process Mixtures.


Fig. 8. Log-likelihood of the simplified models (top, as a function of the number of components) and computation time in seconds (bottom), for Expectation-Maximization, Bregman Hard Clustering, Model Hard Clustering and one-step Model Hard Clustering. All the algorithms reach the same log-likelihood maximum with quite few components (but the one-step model centroid needs a few more components than all the others). Model centroid based clusterings are the fastest methods; Kullback-Leibler clustering is even slower than EM due to the computational cost of the KL distance and centroids.


We evaluate the quality of our simplified models in an absolute way, with the log-likelihood, and in a relative way, with the Kullback-Leibler divergence between a mixture built with the Dirichlet Process Mixture and a simplified mixture.

A more detailed study of this topic is presented in [26].

Method                                  Log-likelihood
DPM                                     -18420.6999452
KDE                                     -18985.4483699
KDE + Bregman Hard Clustering           -18998.3203038
KDE + Model Hard Clustering             -18974.0717664
KDE + One-step Model Hard Clustering    -19322.2443988

Table 2. Log-likelihood of the models built by the state-of-the-art Dirichlet Process Mixture, by Kernel Density Estimation, and by our new simplified models. DPM is better but the proposed simplification methods perform as well as the KDE.

KL     DPM     KDE     BHC     MHC     One-step MHC
DPM    0.0     0.051   0.060   0.043   0.066
KDE    0.090   0.0     0.018   0.002   0.016

Table 3. Kullback-Leibler divergence matrix for models built by Dirichlet Process Mixture (DPM), by Kernel Density Estimation (KDE), by the Bregman Hard Clustering (BHC), by the Model Hard Clustering (MHC) and by the one-step Model Hard Clustering. We limit the rows of the table to DPM and KDE only since, by the nature of the Kullback-Leibler divergence, the left term is supposed to be the "true" distribution and the right term the estimated distribution (the left term comes from the rows and the right term from the columns).

Both DPM and KDE produce high quality models (see Table 2): for the first with high computational cost, for the second with a high number of components. Moreover, these two models are very close for the Kullback-Leibler divergence: this means that one may choose between the two algorithms depending on the most critical point, time or size, in their application.

Simplified models get nearly identical log-likelihood values. Only the one-step Model Hard Clustering leads to a significant loss in likelihood.

Simplified models using Bregman and Model Hard Clustering are both close to the reference DPM model and to the original KDE (Table 3). Moreover, the Model Hard Clustering outperforms the Bregman Hard Clustering in both cases. As expected, the one-step Model Hard Clustering is the furthest: it will depend on the application whether the decrease in computation time is worth the loss in quality.


7 Conclusion

We presented a novel modeling paradigm which is both fast and accurate. From Kernel Density Estimates, which are precise but difficult to use due to their size, we are able to build new models which achieve the same approximation quality while being both faster to compute and compact. We introduced a new mixture simplification method, the Model Hard Clustering, which relies on the Fisher-Rao metric to perform the simplification. Since closed-form formulas are not known in the general case, we exploit the underlying hyperbolic geometry, allowing us to use the Poincaré hyperbolic distance and the Model centroids, which are a notion of centroids in constant curvature spaces.

Models simplified by the Bregman Hard Clustering and by the Model Hard Clustering both have a quality comparable to models built by Expectation-Maximization or by Kernel Density Estimation. But the Model Hard Clustering does not only give very high quality models, it is also faster than the usual Expectation-Maximization. The quality of the models simplified by the Model Hard Clustering justifies the use of the Model centroids as a substitute for the Fisher-Rao centroids.

Both Model and Bregman Hard Clustering are also competitive with state-of-the-art approaches in a bio-informatics application for the modeling of the 3D structure of an RNA molecule, giving models which are very close, in terms of Kullback-Leibler divergence, to reference models built with Dirichlet Process Mixtures.

Acknowledgments. The authors would like to thank Julie Bernauer (INRIA team Amib, LIX, École Polytechnique) for insightful discussions about the bio-informatics application of our work and for providing us with the presented dataset. FN (5793b870) would like to thank Dr Kitano and Dr Tokoro for their support.


References

1. B. Pelletier. Informative barycentres in statistics. Annals of the Institute of Statistical Mathematics, 57(4):767-780, December 2005.

2. A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705-1749, 2005.

3. J. Bernauer, X. Huang, A.Y.L. Sim, and M. Levitt. Fully differentiable coarse-grained and all-atom knowledge-based potentials for RNA structure evaluation. RNA, 17(6):1066, 2011.

4. C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. Model-based cluster and discriminant analysis with the MIXMOD software. Computational Statistics & Data Analysis, 51(2):587-600, 2006.

5. L. D. Brown. Fundamentals of statistical exponential families: with applications in statistical decision theory. IMS, 1986.

6. N. N. Čencov. Statistical decision rules and optimal inference, volume 53 of Translations of Mathematical Monographs. American Mathematical Society, Providence, R.I., 1982. Translation from the Russian edited by Lev J. Leifman.

7. S.I.R. Costa, S.A. Santos, and J.E. Strapasson. Fisher information matrix and hyperbolic geometry. In Information Theory Workshop, 2005 IEEE, page 3 pp., Aug.-Sept. 2005.

8. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1-38, 1977.

9. G.A. Galperin. A concept of the mass center of a system of material points in the constant curvature spaces. Communications in Mathematical Physics, 154(1):63-84, 1993.

10. V. Garcia, F. Nielsen, and R. Nock. Levels of details for Gaussian mixture models. Computer Vision - ACCV 2009, pages 514-525, 2010.

11. B. Georgi, I.G. Costa, and A. Schliep. PyMix - the Python mixture package - a tool for clustering of heterogeneous biological data. BMC Bioinformatics, 11(1):9, 2010.

12. Y. Ji, C. Wu, P. Liu, J. Wang, and K.R. Coombes. Applications of beta-mixture models in bioinformatics. Bioinformatics, 21(9):2118, 2005.

13. R. E. Kass and P. W. Vos. Geometrical Foundations of Asymptotic Inference. John Wiley & Sons, September 1987.

14. I. Mayrose, N. Friedman, and T. Pupko. A Gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics, 21(Suppl 2), 2005.

15. F. Nielsen, S. Boltz, and O. Schwander. Bhattacharyya clustering with applications to mixture simplifications. In IEEE International Conference on Pattern Recognition (ICPR'10), Istanbul, Turkey, 2010.

16. F. Nielsen and V. Garcia. Statistical exponential families: A digest with flash cards. arXiv:0911.4863, November 2009.

17. F. Nielsen and R. Nock. Hyperbolic Voronoi diagrams made easy. arXiv:0903.3287, March 2009.

18. F. Nielsen and R. Nock. Jensen-Bregman Voronoi diagrams and centroidal tessellations. In Voronoi Diagrams in Science and Engineering (ISVD), 2010 International Symposium on, pages 56-65. IEEE, 2010.

19. E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.


20. C.E. Rasmussen. The infinite Gaussian mixture model. Advances in Neural Information Processing Systems, 12:554-560, 2000.

21. F. Reverter and J.M. Oller. Computing the Rao distance for Gamma distributions. Journal of Computational and Applied Mathematics, 157(1):155-167, 2003.

22. G. Rong, M. Jin, and X. Guo. Hyperbolic centroidal Voronoi tessellation. In Proceedings of the 14th ACM Symposium on Solid and Physical Modeling, SPM '10, pages 117-126, New York, NY, USA, 2010. ACM.

23. O. Schwander and F. Nielsen. Model centroids for the simplification of kernel density estimators. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, March 2012.

24. J. C. Seabra, F. Ciompi, O. Pujol, J. Mauri, P. Radeva, and J. Sanches. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Transactions on Biomedical Engineering, 58(5):1314-1324, May 2011.

25. S.J. Sheather and M.C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society. Series B (Methodological), 53(3):683-690, 1991.

26. A. Y. L. Sim, O. Schwander, M. Levitt, and J. Bernauer. Evaluating mixture models for building RNA knowledge-based potentials. Journal of Bioinformatics and Computational Biology (to appear), 2012.