An Empirical Comparison of EM Initialization Methods and Model Choice Criteria for Mixtures of Skew-Normal Distributions∗

José Raimundo Gomes Pereira, Celso Rômulo Barbosa Cabral, Leyne Abuim de Vasconcelos Marques, José Mir Justino da Costa

Departamento de Estatística – Universidade Federal do Amazonas – Brazil
Abstract
We investigate, via simulation study, the performance of the EM algorithm for maximum likelihood estimation in finite mixtures of skew-normal distributions with component specific parameters. The study takes into account the initialization method, the number of iterations needed to attain a fixed stopping rule and the ability of some classical model choice criteria to estimate the correct number of mixture components. The results show that the algorithm produces quite reasonable estimates when using the method of moments to obtain the starting points and that, combined with the AIC, BIC, ICL or EDC criteria, it represents a good alternative to estimate the number of components of the mixture. Exceptions occur in the estimation of the skewness parameters, notably when the sample size is relatively small, and in some classical problematic cases, as when the mixture components are poorly separated.

Key Words: EM algorithm; Skew-normal distribution; Finite mixture of distributions.
1 Introduction
We define the (finite) mixture of the densities f_1, ..., f_k as the density g(x) = ∑_{i=1}^k p_i f_i(x), where p_1, ..., p_k are unknown positive numbers, named mixing weights, satisfying ∑_{i=1}^k p_i = 1. We call f_i the i-th component of the mixture.

Finite mixtures have been widely used as a powerful tool to model heterogeneous data and to approximate complicated probability densities presenting multimodality, skewness and heavy tails. These models have been applied in several areas like genetics, image processing, medicine and economics. For comprehensive surveys, see McLachlan & Peel (2000) and Frühwirth-Schnatter (2006).
Maximum likelihood estimation in finite mixtures is a research area with several challenging aspects. There are nontrivial issues, like lack of identifiability and saddle regions surrounding the possible local maxima of the likelihood. Another problem is that the likelihood may be unbounded, which happens, for example, when the components are normal densities.
∗The authors acknowledge the partial financial support from CAPES, CNPq and FAPEAM. Email addresses: [email protected] (José Raimundo Gomes Pereira), [email protected] (Celso Rômulo Barbosa Cabral), [email protected] (Leyne Abuim de Vasconcelos Marques), [email protected] (José Mir Justino da Costa).
There is a large literature on mixtures of normal distributions; some references can be found in the above-mentioned books. In this work we consider mixtures of skew-normal (SN) distributions, as defined by Azzalini (1985). This distribution is an extension of the normal distribution that accommodates asymmetry.
The standard algorithm for maximum likelihood estimation in finite mixtures is the Expectation Maximization (EM) algorithm of Dempster et al. (1977); see also McLachlan & Krishnan (2008). It is well known that it has slow convergence and that its performance is strongly dependent on the stopping rule and starting points. For normal mixtures, several authors have computationally investigated the performance of the EM algorithm by taking into account initial values (Karlis & Xekalaki, 2003; Biernacki et al., 2003) and asymptotic properties (Nityasuddhi & Böhning, 2003). See also Dias & Wedel (2004) for a broader simulation experiment comparing EM with SEM and MCMC algorithms.
Although there are some proposals to overcome the unboundedness problem in the normal mixture case, involving constrained optimization and alternative algorithms – see Hathaway (1985), Ingrassia (2004) and Yao (2010), for example – it is interesting to investigate the performance of the (unrestricted) EM algorithm in the presence of skewness in the component distributions, since algorithms of this kind have been presented in recent works such as Lin et al. (2007a), Lin et al. (2007b), Basso et al. (2009), Lin (2009a), Lin (2009b) and Lin & Lin (2009).
The main goal of this work is to study the performance of the estimates produced by the EM algorithm, taking into account two different approaches to obtain initial values – the method of moments and a random initialization method – the number of iterations needed to attain a fixed stopping rule, and the ability of some classical model choice criteria (AIC, BIC, ICL and EDC) to estimate the correct number of mixture components. For each initialization method, the consistency of the estimates is analyzed by computing their bias and mean squared errors over repeatedly generated samples, and the mean number of iterations is observed. Fixing the method of initialization, repeated samples of finite mixtures of SN distributions are generated with different numbers of components, and the percentage of times a given criterion chooses the correct number of components is observed. The work is restricted to the univariate case.
The remainder of the paper is organized as follows. In Sections 2, 3 and 4, for the sake of completeness, we give brief sketches of the SN distribution, of the skew-normal mixture model and of estimation via the EM algorithm, respectively. The simulation study based on the analysis of the initialization methods and number of iterations is presented in Section 5. The study based on the analysis of the model choice criteria is presented in Section 6. Finally, in order to compare results obtained by the several initialization and model choice methods, two real data sets are analyzed in Section 7.
2 The Skew-normal Distribution
The skew-normal distribution, introduced by Azzalini (1985), is given by the density

SN(y | µ, σ², λ) = 2 N(y | µ, σ²) Φ(λ (y − µ)/σ),

where N(· | µ, σ²) denotes the univariate normal density with mean µ ∈ ℝ and variance σ² > 0, and Φ(·) is the distribution function of the standard normal distribution. In this definition, µ ∈ ℝ, λ ∈ ℝ and σ² are parameters regulating location, skewness and scale, respectively. For a random variable Y with this distribution, we use the notation Y ∼ SN(µ, σ², λ).
For an SN random variable Y, a convenient stochastic representation is given next. It can be used to simulate realizations of Y and to implement the EM-type algorithm presented below. The proof follows easily from Henze (1986).
Lemma 1. A random variable Y ∼ SN(µ, σ², λ) has a stochastic representation given by

Y = µ + σδT + σ(1 − δ²)^{1/2} T₁,

where δ = λ/√(1 + λ²), T = |T₀|, T₀ and T₁ are independent standard normal random variables, and |·| denotes absolute value.
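Lemma 1 translates directly into a sampler for the SN distribution. The following is a minimal Python sketch of this idea (the function name `rsn` and the use of NumPy are our own choices, not from the paper):

```python
import numpy as np

def rsn(n, mu, sigma2, lam, rng=None):
    """Draw n realizations of Y ~ SN(mu, sigma2, lam) via Lemma 1:
    Y = mu + sigma*delta*|T0| + sigma*sqrt(1 - delta^2)*T1."""
    rng = np.random.default_rng(rng)
    sigma = np.sqrt(sigma2)
    delta = lam / np.sqrt(1.0 + lam**2)
    t = np.abs(rng.standard_normal(n))   # T = |T0| ~ HN(0, 1)
    t1 = rng.standard_normal(n)          # T1 ~ N(0, 1), independent of T0
    return mu + sigma * delta * t + sigma * np.sqrt(1.0 - delta**2) * t1
```

For large n, the sample mean and variance should approach µ + σδ√(2/π) and σ²(1 − (2/π)δ²), the SN moments used later in the paper.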
In what follows, in order to reduce computational difficulties related to the implementation of the algorithms used for estimation, we use the parametrization

Γ = (1 − δ²)σ²  and  ∆ = σδ,

which was first suggested by Bayes & Branco (2007). Note that (λ, σ²) → (∆, Γ) is a one-to-one mapping. To recover λ and σ², we use

λ = ∆/√Γ  and  σ² = ∆² + Γ.
Then, it follows easily from Lemma 1 that

Y | T = t ∼ N(µ + ∆t, Γ)  and  T ∼ HN(0, 1),   (1)

where HN(0, 1) denotes the half-normal distribution with parameters 0 and 1.

The expectation, variance and skewness coefficient of Y ∼ SN(µ, σ², λ) are respectively given by

E(Y) = µ + σδ√(2/π),  Var(Y) = σ²(1 − (2/π)δ²),  γ(Y) = κδ³ / (1 − (2/π)δ²)^{3/2},   (2)

where κ = ((4 − π)/2)(2/π)^{3/2}; see Azzalini (2005, Lemma 2).
3 The FM-SN Model
In this section we give a brief introduction to the finite mixture of SN distributions, hereafter the FM-SN model. More details can be found in Basso et al. (2009), where a more general class of distributions is considered, and references therein.
The FM-SN model is defined by considering a random sample y = (y_1, ..., y_n)ᵀ from a mixture of SN densities given by

g(y_j | Θ) = ∑_{i=1}^k p_i SN(y_j | θ_i),  j = 1, ..., n,   (3)

where p_i ≥ 0, i = 1, ..., k, are the mixing probabilities, ∑_{i=1}^k p_i = 1, θ_i = (µ_i, σ_i², λ_i)ᵀ is the specific vector of parameters for component i, and Θ = ((p_1, ..., p_k)ᵀ, θ_1ᵀ, ..., θ_kᵀ)ᵀ is the vector with all parameters.
For each j consider a latent classification random variable Z_j taking values in {1, ..., k}, such that

y_j | Z_j = i ∼ SN(θ_i),  P(Z_j = i) = p_i,  i = 1, ..., k;  j = 1, ..., n.
Then it is straightforward to prove, integrating out Z_j, that y_j has density (3). If we combine this result with (1), we have the following stochastic representation for the FM-SN model:

y_j | T_j = t_j, Z_j = i ∼ N(µ_i + ∆_i t_j, Γ_i),

T_j ∼ HN(0, 1),

P(Z_j = i) = p_i,  i = 1, ..., k;  j = 1, ..., n,

where

Γ_i = (1 − δ_i²)σ_i²,  ∆_i = σ_i δ_i,  δ_i = λ_i/√(1 + λ_i²),  i = 1, ..., k.   (4)
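The hierarchical representation above also gives a direct way to simulate FM-SN data. Below is a minimal Python sketch of that recipe (function name and NumPy conventions are our own, not the paper's):

```python
import numpy as np

def r_fmsn(n, p, mu, sigma2, lam, rng=None):
    """Sample n observations from the FM-SN model through the hierarchy:
    Z_j ~ Categorical(p), T_j ~ HN(0,1), y_j | T_j, Z_j=i ~ N(mu_i + Delta_i*T_j, Gamma_i)."""
    rng = np.random.default_rng(rng)
    p, mu, sigma2, lam = (np.asarray(a, dtype=float) for a in (p, mu, sigma2, lam))
    delta = lam / np.sqrt(1.0 + lam**2)
    Delta = np.sqrt(sigma2) * delta        # Delta_i = sigma_i * delta_i, eq. (4)
    Gamma = (1.0 - delta**2) * sigma2      # Gamma_i = (1 - delta_i^2) * sigma_i^2
    z = rng.choice(len(p), size=n, p=p)    # latent component labels Z_j
    t = np.abs(rng.standard_normal(n))     # T_j ~ HN(0, 1)
    y = rng.normal(mu[z] + Delta[z] * t, np.sqrt(Gamma[z]))
    return y, z
```

Returning the labels z alongside y is a convenience for checking clustering performance; the observed data consist of y alone.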
4 Estimation
4.1 An EM-type Algorithm
In this section, for the sake of completeness, we present an EM-type algorithm for estimation of the parameters of an FM-SN distribution. This algorithm was presented before in Basso et al. (2009), where a wider class of distributions that includes the FM-SN one is considered.
Here we obtain estimates of the parameters of the FM-SN model using a faster extension of EM called the ECM algorithm (Meng & Rubin, 1993). When applying it to the FM-SN model, we obtain a simple set of closed form expressions to update a current estimate of the vector Θ, as we will see below. It is important to emphasize that this procedure differs from the algorithm presented by Lin et al. (2007b), because in our case the updating equations for the component skewness parameter have a closed form.
In what follows we consider the parametrization (4). Let

Θ^(m) = ((p_1^(m), ..., p_k^(m))ᵀ, (θ_1^(m))ᵀ, ..., (θ_k^(m))ᵀ)ᵀ

be the current estimate (at the m-th iteration of the algorithm) of Θ, where θ_i^(m) = (µ_i^(m), ∆_i^(m), Γ_i^(m))ᵀ. The E-step of the algorithm consists in evaluating the expected complete-data log-likelihood – known as the Q-function – defined as

Q(Θ | Θ^(m)) = E[ℓ_c(Θ) | y, Θ^(m)],

where ℓ_c(Θ) is the complete-data log-likelihood function, given by

ℓ_c(Θ) = c + ∑_{j=1}^n ∑_{i=1}^k z_ij (log p_i − (1/2) log Γ_i − (1/(2Γ_i))(y_j − µ_i − ∆_i t_j)²),

where z_ij is the indicator function of the set (Z_j = i) and c is a constant that does not depend on Θ.

The M-step consists in maximizing the Q-function over Θ. As the M-step turns out to be analytically intractable, we use, alternatively, the ECM algorithm, an extension that essentially replaces it with a sequence of conditional maximization (CM) steps.
The following scheme is used to obtain an updated value Θ^(m+1). More details about the conditional expectations involved in the computation of the Q-function and the related maximization steps can be found in Basso et al. (2009). Here, φ denotes the standard normal density.

E-step: Given a current estimate Θ^(m), compute z_ij^(m), s_1ij^(m) and s_2ij^(m), for i = 1, ..., k and j = 1, ..., n, where

z_ij^(m) = p_i^(m) SN(y_j | θ_i^(m)) / ∑_{l=1}^k p_l^(m) SN(y_j | θ_l^(m)),   (5)

s_1ij^(m) = z_ij^(m) [ µ_Tij^(m) + (φ(µ_Tij^(m)/σ_Ti^(m)) / Φ(µ_Tij^(m)/σ_Ti^(m))) σ_Ti^(m) ],

s_2ij^(m) = z_ij^(m) [ (µ_Tij^(m))² + (σ_Ti^(m))² + (φ(µ_Tij^(m)/σ_Ti^(m)) / Φ(µ_Tij^(m)/σ_Ti^(m))) µ_Tij^(m) σ_Ti^(m) ],

µ_Tij^(m) = (∆_i^(m) / (Γ_i^(m) + (∆_i^(m))²)) (y_j − µ_i^(m)),

σ_Ti^(m) = (Γ_i^(m) / (Γ_i^(m) + (∆_i^(m))²))^{1/2}.
CM-steps: Update Θ^(m) by maximizing Q(Θ | Θ^(m)) over Θ, which leads to the following closed form expressions:

p_i^(m+1) = n⁻¹ ∑_{j=1}^n z_ij^(m),

µ_i^(m+1) = ∑_{j=1}^n (y_j z_ij^(m) − ∆_i^(m) s_1ij^(m)) / ∑_{j=1}^n z_ij^(m),

Γ_i^(m+1) = ∑_{j=1}^n (z_ij^(m)(y_j − µ_i^(m+1))² − 2(y_j − µ_i^(m+1)) ∆_i^(m) s_1ij^(m) + (∆_i^(m))² s_2ij^(m)) / ∑_{j=1}^n z_ij^(m),

∆_i^(m+1) = ∑_{j=1}^n (y_j − µ_i^(m+1)) s_1ij^(m) / ∑_{j=1}^n s_2ij^(m).
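The E- and CM-steps above can be collected into a single update function. The following Python code is our own illustrative sketch of these equations (array layout, helper names and the erf-based normal cdf are our choices; this is not the authors' implementation):

```python
import numpy as np
from math import erf

def _Phi(x):
    """Standard normal cdf via the error function."""
    return np.vectorize(lambda v: 0.5 * (1.0 + erf(v / np.sqrt(2.0))))(x)

def _phi(x):
    """Standard normal density."""
    return np.exp(-0.5 * np.asarray(x)**2) / np.sqrt(2.0 * np.pi)

def sn_pdf(y, mu, Gamma, Delta):
    """SN density under the (mu, Gamma, Delta) parametrization of (4):
    sigma^2 = Delta^2 + Gamma, lambda = Delta / sqrt(Gamma)."""
    sigma2 = Delta**2 + Gamma
    lam = Delta / np.sqrt(Gamma)
    z = (y - mu) / np.sqrt(sigma2)
    return 2.0 * _phi(z) / np.sqrt(sigma2) * _Phi(lam * z)

def ecm_step(y, p, mu, Gamma, Delta):
    """One E-step plus CM-steps for the FM-SN model; parameters are length-k arrays."""
    y = np.asarray(y, dtype=float)[:, None]          # n x 1 column
    dens = p * sn_pdf(y, mu, Gamma, Delta)           # n x k component densities
    z = dens / dens.sum(axis=1, keepdims=True)       # posterior z_ij, eq. (5)
    muT = Delta / (Gamma + Delta**2) * (y - mu)      # mu_Tij
    sT = np.sqrt(Gamma / (Gamma + Delta**2))         # sigma_Ti
    r = _phi(muT / sT) / _Phi(muT / sT)              # phi/Phi ratio from the E-step
    s1 = z * (muT + r * sT)                          # s_1ij
    s2 = z * (muT**2 + sT**2 + r * muT * sT)         # s_2ij
    nz = z.sum(axis=0)
    p_new = nz / len(y)
    mu_new = (y * z - Delta * s1).sum(axis=0) / nz
    Gamma_new = (z * (y - mu_new)**2 - 2.0 * (y - mu_new) * Delta * s1
                 + Delta**2 * s2).sum(axis=0) / nz
    Delta_new = ((y - mu_new) * s1).sum(axis=0) / s2.sum(axis=0)
    return p_new, mu_new, Gamma_new, Delta_new
```

Iterating `ecm_step` from a starting point and monitoring the observed-data log-likelihood reproduces the scheme described above; by ECM theory the log-likelihood is non-decreasing across iterations.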
The algorithm iterates between the E and CM steps until a suitable convergence rule is satisfied. Some classical stopping rules are: (i) stop if ||Θ^(m+1) − Θ^(m)|| is sufficiently small; (ii) stop if |ℓ(Θ^(m+1)) − ℓ(Θ^(m))| is small enough; or (iii) stop if |ℓ(Θ^(m+1))/ℓ(Θ^(m)) − 1| is small enough, where ℓ(Θ) is the actual log-likelihood. In this work we use rule (ii).
4.2 Problems with Estimation in Finite Mixtures
4.2.1 Unbounded Likelihood
An important issue is the existence of the maximum likelihood estimator for finite mixture models. It is well known that the likelihood of normal mixtures can be unbounded – see Frühwirth-Schnatter (2006, Chapter 6), for example. FM-SN models also have this feature. For instance, let y_i, i = 1, ..., n, be a random sample from a mixture of two SN densities, SN(µ, σ_1², λ_1) with weight p ∈ (0, 1) and SN(µ, σ_2², λ_2). Fixing p, σ_1², λ_1 and λ_2, the likelihood function L(µ, σ_2²) evaluated at µ = y_1 satisfies

L(y_1, σ_2²) = g(σ_2²) ∏_{i=2}^n (c_i + h_i(σ_2²)) > g(σ_2²) ∏_{i=2}^n c_i,

where

g(σ_2²) = p(2πσ_1²)^{−1/2} + (1 − p)(2πσ_2²)^{−1/2},

c_i = p SN(y_i | y_1, σ_1², λ_1),

h_i(σ_2²) = (1 − p) SN(y_i | y_1, σ_2², λ_2),  i = 2, ..., n.

The strict inequality is valid because c_i and h_i(σ_2²) are positive. Now, we can choose a sequence of positive numbers (σ_2²)_m, m ∈ ℕ, such that lim_{m→∞} (σ_2²)_m = 0. Then, because ∏_{i=2}^n c_i > 0 and lim_{m→∞} g((σ_2²)_m) = +∞, we have that lim_{m→∞} L(y_1, (σ_2²)_m) = +∞.

Thus, following Nityasuddhi & Böhning (2003), we refer to the estimates produced by the algorithm of Section 4.1 as “EM estimates”, that is, some sort of solution of the score equation, instead of “maximum likelihood estimates”.
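The divergence argument can be checked numerically. The sketch below is our own illustration, with an arbitrary evenly spaced "sample" and arbitrarily fixed values of p, σ_1², λ_1 and λ_2; it evaluates the mixture log-likelihood at µ = y_1 for a decreasing sequence of σ_2² values:

```python
import numpy as np
from math import erf

def sn_pdf(y, mu, sigma2, lam):
    """Azzalini's SN density 2 N(y|mu, sigma2) Phi(lam (y - mu)/sigma)."""
    s = np.sqrt(sigma2)
    z = (y - mu) / s
    Phi = 0.5 * (1.0 + np.vectorize(erf)(lam * z / np.sqrt(2.0)))
    return 2.0 * np.exp(-0.5 * z**2) / (s * np.sqrt(2.0 * np.pi)) * Phi

y = np.linspace(-3.0, 3.0, 21)             # a fixed "sample"; y[0] plays the role of y_1
p, s1, l1, l2 = 0.4, 1.0, 2.0, -1.0        # fixed p, sigma_1^2, lambda_1, lambda_2

def loglik(mu, s2):
    dens = p * sn_pdf(y, mu, s1, l1) + (1.0 - p) * sn_pdf(y, mu, s2, l2)
    return np.log(dens).sum()

# fixing mu = y_1 and letting sigma_2^2 -> 0, the log-likelihood diverges
vals = [loglik(y[0], s2) for s2 in (1.0, 1e-2, 1e-4, 1e-8)]
```

As σ_2² shrinks, the term of the likelihood at y_1 behaves like (2πσ_2²)^{−1/2}, so `vals` eventually grows without bound.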
4.2.2 Non-identifiability
Another nontrivial issue is the lack of identifiability. Strictly speaking, finite mixtures are always non-identifiable, because an arbitrary permutation of the labels of the component parameters leads to the same finite mixture distribution. In the finite mixture context, a more flexible concept of identifiability is used, defined as follows: the class of FM-SN models is identifiable if distinct component densities correspond to distinct mixtures. More specifically, if we have two representations of the same mixture distribution, say

∑_{i=1}^k p′_i SN(· | µ′_i, (σ′_i)², λ′_i) = ∑_{i=1}^k p_i SN(· | µ_i, σ_i², λ_i),

then

p′_i = p_ρ(i),  µ′_i = µ_ρ(i),  (σ′_i)² = σ²_ρ(i),  λ′_i = λ_ρ(i)

for some permutation ρ of the indexes 1, ..., k (Titterington et al., 1985, Chapter 3).

The identifiability of the normal mixture model was first verified by Yakowitz & Spragins (1968), but it is interesting to note that subsequent discussions in the related literature concerning mixtures of Student-t distributions – see, for example, Peel & McLachlan (2000), Shoham (2002), Shoham et al. (2003) and Lin et al. (2004) – did not present a formal proof of its identifiability.
The above remarks suggest some caution when using maximum likelihood estimation in the finite mixture case. One way to circumvent the unboundedness problem is constrained optimization of the likelihood, imposing conditions on the component variances in order to obtain global maximization. See Hathaway (1985), Ingrassia (2004), Ingrassia & Rocci (2007) and Greselin & Ingrassia (2009), for example. The non-identifiability problem is not a major one if we are interested only in the likelihood values, which are robust to label switching. This is the case, for example, when density estimation is the main goal.
In our study we investigate, as in Nityasuddhi & Böhning (2003), only the performance of the EM algorithm when component specific (that is, unrestricted) parameters of the mixture are considered.
5 A Simulation Study of Initial Values for the EM Algorithm
5.1 A Study of Consistency Through the Analysis of Bias and MSE
The EM algorithm is an efficient standard tool for maximum likelihood estimation in finite mixtures of distributions. It is well known that its performance is strongly dependent on the choice of the convergence criterion and starting points. In this work, we do not consider the stopping rule issue; a detailed discussion of this theme can be found in McLachlan & Peel (2000). Our fixed rule is to stop the process at stage m when

|ℓ(Θ^(m+1)) − ℓ(Θ^(m))| < c,

where we fixed c = 10⁻⁵.

The choice of starting values for the EM algorithm plays a big role in parameter estimation. In the mixture context, its importance is amplified because, as noted by Dias & Wedel (2004), there are various saddle regions surrounding the possible local maxima of the likelihood function, and the EM algorithm can be trapped in some of these subsets of the parameter space.
In this work, we carry out a simulation study in order to compare some methods to obtain starting points for the algorithm proposed in Section 4.1. Previous related work, considering normal mixtures, can be found in Karlis & Xekalaki (2003) and Biernacki et al. (2003). An interesting question is to investigate, in a similar fashion, the performance of the EM algorithm when a skewness parameter for each component density is considered.
We consider the following methods to obtain initial values:
The True Values Method (TVM): consists in initializing the algorithm with the distributional parameter values used to generate the artificial random samples. Of course, the best results are expected when using this method. It has been included in our experiment to play a gold standard role.
The Random Values Method (RVM): to fit the FM-SN model with k components, we first divide the generated random sample into k sub-samples. To do this, the k-means method is employed – see Johnson & Wichern (2007) for details about this clustering method. Let ϕ_i be sub-sample i. Consider the following points artificially generated from uniform distributions over the specified intervals:

ξ_i^(0) ∼ U(min{ϕ_i}, max{ϕ_i}),

ω_i^(0) ∼ U(0, Var{ϕ_i}),   (6)

γ_i^(0) ∼ U(−0.9953, 0.9953),

where min{ϕ_i}, max{ϕ_i} and Var{ϕ_i} denote, respectively, the minimum, the maximum and the sample variance of ϕ_i, i = 1, ..., k. These quantities are taken as rough estimates of the mean, variance and skewness coefficient associated with sub-population i (from which sub-sample i is supposedly collected). The suggested interval for γ_i^(0) is motivated by the fact that the range of the skewness coefficient in SN models is (−0.9953, 0.9953).
The starting points for the specific component location, scale and skewness parameters are given respectively by

µ_i^(0) = ξ_i^(0) − √(2/π) δ(λ_i^(0)) σ_i^(0),

σ_i^(0) = √( ω_i^(0) / (1 − (2/π) δ²(λ_i^(0))) ),   (7)

λ_i^(0) = ± √( π(γ_i^(0))^{2/3} / (2^{1/3}(4 − π)^{2/3} − (π − 2)(γ_i^(0))^{2/3}) ),

where δ(λ_i^(0)) = λ_i^(0)/√(1 + (λ_i^(0))²), i = 1, ..., k, and the sign of λ_i^(0) is the same as that of γ_i^(0). They are obtained by replacing E(Y), Var(Y) and γ(Y) in (2) with their respective estimators in (6) and solving the resulting equations in µ_i, σ_i and λ_i. The initial values for the weights p_i are obtained as

(p_1^(0), ..., p_k^(0)) ∼ Dirichlet(1, ..., 1)

– a Dirichlet distribution with all parameters equal to 1, that is, a uniform distribution over the unit simplex {(p_1, ..., p_k); p_i ≥ 0, ∑_{i=1}^k p_i = 1}.
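As an illustration, the RVM recipe can be sketched as follows. The function name `rvm_start`, the hand-rolled one-dimensional k-means loop and the slight clipping of γ away from the boundary are our own simplifications, not the paper's procedure (which uses the standard k-means of Johnson & Wichern, 2007):

```python
import numpy as np

def rvm_start(y, k, rng=None):
    """Random Values Method sketch: cluster the sample into k groups with a
    naive 1-d k-means, draw rough moments uniformly as in (6), map through (7)."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    # naive Lloyd iterations on 1-d data
    centers = np.sort(rng.choice(y, k, replace=False))
    for _ in range(20):
        labels = np.argmin(np.abs(y[:, None] - centers), axis=1)
        centers = np.array([y[labels == i].mean() for i in range(k)])
    params = []
    for i in range(k):
        sub = y[labels == i]
        xi = rng.uniform(sub.min(), sub.max())          # rough location, eq. (6)
        om = rng.uniform(0.0, sub.var())                # rough variance
        ga = rng.uniform(-0.9953, 0.9953)               # rough skewness coefficient
        ga = np.clip(ga, -0.995, 0.995)                 # our choice: stay off the boundary
        a = np.abs(ga)**(2.0 / 3.0)
        lam = np.sign(ga) * np.sqrt(np.pi * a /
              (2**(1.0/3.0) * (4 - np.pi)**(2.0/3.0) - (np.pi - 2) * a))
        delta = lam / np.sqrt(1.0 + lam**2)
        sigma = np.sqrt(om / (1.0 - 2.0 * delta**2 / np.pi))
        mu = xi - np.sqrt(2.0 / np.pi) * delta * sigma  # invert (2) as in (7)
        params.append((mu, sigma**2, lam))
    p0 = rng.dirichlet(np.ones(k))                      # uniform weights on the simplex
    return p0, params
```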
Method of Moments (MM): in this case the initial values are obtained using equations (7), but replacing ξ_i^(0), ω_i^(0) and γ_i^(0) with the mean, variance and skewness coefficient of sub-sample i, respectively. Let n be the sample size and n_i be the size of sub-sample i. The initial values for the weights are given by

p_i^(0) = n_i / n,  i = 1, ..., k.
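A sketch of the MM starting values for one sub-sample, inverting the moment equations (2) through (7). The function name and the clipping of the sample skewness to stay inside the admissible SN range are our own choices:

```python
import numpy as np

def mm_start(sub):
    """Method-of-moments starting values (mu0, sigma2_0, lambda0) for one
    sub-sample, plugging its sample moments into the inversion (7)."""
    sub = np.asarray(sub, dtype=float)
    xbar, s2 = sub.mean(), sub.var()
    g = np.mean((sub - xbar)**3) / s2**1.5              # sample skewness coefficient
    g = np.clip(g, -0.99, 0.99)                         # keep inside (-0.9953, 0.9953)
    a = np.abs(g)**(2.0 / 3.0)
    lam0 = np.sign(g) * np.sqrt(np.pi * a /
           (2**(1.0/3.0) * (4 - np.pi)**(2.0/3.0) - (np.pi - 2) * a))
    delta = lam0 / np.sqrt(1.0 + lam0**2)
    sigma0 = np.sqrt(s2 / (1.0 - 2.0 * delta**2 / np.pi))
    mu0 = xbar - np.sqrt(2.0 / np.pi) * delta * sigma0
    return mu0, sigma0**2, lam0
```

On a large sub-sample actually drawn from an SN component, these values should land close to the component parameters, which is what makes MM a reasonable initializer.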
To compare the methods, we generated samples from the FM-SN model with k = 2 and k = 3 components. For each fixed sample we obtained estimates of the parameters using the algorithm presented in Section 4.1, initialized by each proposed method.
Another important issue is the degree of heterogeneity of the components. For k = 2 we considered the "moderately separated" (2MS), "well separated" (2WS) and "poorly separated" (2PS) cases. For k = 3 we considered the "two poorly separated and one well separated" (3PWS) and "three well separated" (3WWS) cases. These degrees of heterogeneity were obtained informally, based on the location parameter values. The main reason to consider them as an important factor in our study is that the convergence of the EM algorithm is typically affected when the components overlap largely – see Park & Ozeki (2009) and the references therein.
Figures 1 and 2 show some histograms exemplifying these degrees of heterogeneity. Tables 1 and 2 present the parameter values used in the study.
Figure 1: Histograms of FM-SN data: (a) 2MS, (b) 2WS and (c) 2PS.
Table 1: FM-SN parameters with k = 2.

Case   µ1   µ2   σ1²  σ2²  λ1  λ2   p1   p2
2MS     5   20     9   16   6  −4  0.6  0.4
2WS     5   40     9   16   6  −4  0.6  0.4
2PS     5   15     9   16   6  −4  0.6  0.4
All the computations were made using the R system (R Development Core Team, 2009). Sample sizes were fixed as n = 500, 1000, 5000 and 10000. For each combination of parameters and sample size, 500 samples from the FM-SN model were artificially generated. Then we computed the bias and mean squared
Figure 2: Histograms of FM-SN data: (a) 3PWS and (b) 3WWS.
Table 2: FM-SN parameters with k = 3.

Case    µ1   µ2   µ3   σ1²  σ2²  σ3²  λ1  λ2  λ3   p1   p2   p3
3PWS     5   20   28     9   16   16   6  −4   4  0.4  0.3  0.3
3WWS     5   30   38     9   16   16   6  −4   4  0.4  0.3  0.3
error (MSE) over all samples. For µ_i they are defined as

bias = (1/500) ∑_{j=1}^{500} (µ̂_i^(j) − µ_i)  and  MSE = (1/500) ∑_{j=1}^{500} (µ̂_i^(j) − µ_i)²,

respectively, where µ̂_i^(j) is the estimate of µ_i when the data is sample j. Definitions for the other parameters
are obtained by analogy.

As a note about implementation, an expected consequence of the non-identifiability discussed in Section 4.2.2 is the permutation of the component labels when using the k-means method to perform an initial clustering of the data. This has serious implications when the main subject is the consistency of the estimates. To overcome this problem we adopted an order restriction on the initial values of the location parameters.
Tables 3 and 4 present, respectively, the bias and MSE of the estimates in the 2MS case. With two moderately separated components and using the TVM and MM methods, the convergence of the estimates is evident, as we can conclude by observing the decreasing values of bias and MSE when the sample size increases. They also show that the estimates of the weights p_i and of the location parameters µ_i have lower bias and MSE. On the other hand, investigating the MSE values, we can note a different pattern of (slower) convergence to zero for the skewness parameter estimates. This is possibly due to well known inferential problems related to the skewness parameter (DiCiccio & Monti, 2004), suggesting the use of larger samples in order to attain the consistency property.
When we analyze the performances of the initialization methods, we can see that MM and TVM perform similarly (well), presenting better results than the RVM method for all sample sizes and parameters. When using the RVM we can see that, in general, the absolute values of the biases of the estimates of σ_i² and λ_i are very large compared with those obtained when using the other methods. In general, according to our criteria, we can conclude that the MM method presented a satisfactory performance in all situations.
The bias and MSE of the estimates for the 2WS case are presented in Tables 5 and 6, respectively. As in the 2MS case, their values decrease when the sample size increases (using the methods TVM and MM), although at a slower rate. Comparing the initialization methods, we can see again the poor performance of RVM, notably when estimating σ_i² and λ_i. The performances of MM and TVM are similar and satisfactory. Thus, we can say that, in general, the conclusions made for the 2MS case are still valid here.
We present the results for the 2PS case in Tables 7 and 8. Bias and MSE are larger than in the 2MS and 2WS cases (for all sample sizes) when using MM and RVM. Also, the consistency of the estimates seems to be unsatisfactory, clearly not attained in the σ_i² and λ_i cases. According to the related literature, such drawbacks of the algorithm are expected when the population presents a remarkable homogeneity. An exception occurs when the initial values are close to the true parameter values; see, for example, McLachlan & Peel (2000) and the above-mentioned work of Park & Ozeki (2009).
Results for the 3PWS case are shown in Tables 9 and 10. It seems that consistency is achieved for p_i, µ_i and σ_i², using TVM and MM. However, this is not the behavior in the λ_i case. This instability is common to all initialization methods, according to the MSE criterion. Using the RVM method we obtained, as before, larger values of bias and MSE.
Finally, Tables 11 and 12 present the results for the 3WWS case. Concerning the estimates of p_i and µ_i, very satisfactory results are obtained, with small values of bias and MSE when using TVM and MM. The values of the MSE of σ_i² exhibit a decreasing behavior when the sample size increases. On the other hand, although we are in the well separated case, the values of bias and MSE of λ_i are larger, notably when using RVM as the initialization method.
Concluding this section, we can say that, as a general rule, the MM and TVM methods presented very similar, good results for the initialization of the EM algorithm. Considering TVM as the gold standard, this means that MM can be seen as a good alternative for real applications. Given this agreement, our study suggests that the consistency property holds for all EM estimates – although convergence may be slower for the scale parameters – except for the skewness parameter, indicating that a sample size larger than 5000 is necessary to achieve consistency in this case. The study also suggests that the degree of heterogeneity of the population has a remarkable influence on the quality of the estimates.
5.2 Density Estimation Analysis
In this section we investigate the density estimation issue, that is, the point estimation of ℓ(Θ), the log-likelihood evaluated at the true value of the parameter. We considered FM-SN models with two components and restricted ourselves to the cases 2MS, 2WS and 2PS, with sample sizes n = 100, 500, 1000 and 5000. The following measure was used to compare the several initialization methods:
d_r(M) = |(ℓ(Θ) − ℓ^(M)(Θ̂)) / ℓ(Θ)| × 100,

where ℓ^(M)(Θ̂) is the log-likelihood evaluated at the EM estimate Θ̂ obtained using initialization method M. According to this criterion, an initialization method M is better than M′ if d_r(M) < d_r(M′). Obviously, we expect the lowest d_r values for the TVM method, the gold standard one.
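The criterion is just a relative log-likelihood distance; as a sketch (the helper name `dr` is our own, not from the paper):

```python
def dr(ll_true, ll_fit):
    """Relative distance (in percent) between the log-likelihood at the
    true parameter value and at the EM estimate obtained with method M."""
    return abs((ll_true - ll_fit) / ll_true) * 100.0
```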
Table 3: Bias of the estimates – k = 2, moderately separated (2MS). [Table entries for methods TVM, RVM and MM at n = 500, 1000, 5000 and 10000 are not recoverable from this extraction.]
Table 4: MSE of the estimates – k = 2, moderately separated (2MS). [Table entries not recoverable from this extraction.]
Table 5: Bias of the estimates – k = 2, well separated (2WS). [Table entries not recoverable from this extraction.]
Table 6: MSE of the estimates – k = 2, well separated (2WS). [Table entries not recoverable from this extraction.]
Table 7: Bias of the estimates – k = 2, poorly separated (2PS). [Table entries not recoverable from this extraction.]
Table 8: MSE of the estimates - k = 2, poorly separated (2PS)
[MSE of the estimates of µ1, µ2, σ²1, σ²2, λ1, λ2, p1 and p2, for the TVM, RVM and MM methods, with n = 500, 1000, 5000, 10000; the numerical entries are illegible in the source.]
Table 9: Bias of the estimates - k = 3, two poorly separated and one well separated (3PWS)
[Bias of the estimates of µ1, µ2, µ3, σ²1, σ²2, σ²3, λ1, λ2, λ3, p1, p2 and p3, for the TVM, RVM and MM methods, with n = 500, 1000, 5000, 10000; the numerical entries are illegible in the source.]
Table 10: MSE of the estimates - k = 3, two poorly separated and one well separated (3PWS)
[MSE of the estimates of µ1, µ2, µ3, σ²1, σ²2, σ²3, λ1, λ2, λ3, p1, p2 and p3, for the TVM, RVM and MM methods, with n = 500, 1000, 5000, 10000; the numerical entries are illegible in the source.]
Table 11: Bias of the estimates - k = 3, the three well separated (3WWS)
[Bias of the estimates of µ1, µ2, µ3, σ²1, σ²2, σ²3, λ1, λ2, λ3, p1, p2 and p3, for the TVM, RVM and MM methods, with n = 500, 1000, 5000, 10000; the numerical entries are illegible in the source.]
Table 12: MSE of the estimates - k = 3, the three well separated (3WWS)
[MSE of the estimates of µ1, µ2, µ3, σ²1, σ²2, σ²3, λ1, λ2, λ3, p1, p2 and p3, for the TVM, RVM and MM methods, with n = 500, 1000, 5000, 10000; the numerical entries are illegible in the source.]
Table 13 presents the means and standard deviations of dr over 500 samples in the 2MS case. For TVM and MM, we can see that these values decrease when the sample size increases. The RVM presented the lowest mean value only when n = 100; for larger sample sizes its performance was inferior. In addition, for all sample sizes considered, it presented the largest standard deviation.
Table 13: Means and standard deviations (×10−4) of dr. 2MS case.
Method
TVM RVM MM
n Mean SD Mean SD Mean SD
100 1.50 15.90 1.20 21.00 1.50 15.96
500 0.27 2.87 0.87 15.13 0.27 2.87
1000 0.13 1.44 0.84 15.71 0.13 1.44
5000 0.03 0.29 0.88 17.16 0.03 0.29
Table 14: Means and standard deviations (×10−4) of dr. 2WS case.
Method
TVM RVM MM
n Mean SD Mean SD Mean SD
100 1.48 15.33 1.21 17.32 1.47 15.05
500 0.25 2.58 1.23 19.09 0.25 2.58
1000 0.12 1.26 1.40 20.97 0.12 1.26
5000 0.03 0.25 1.60 22.72 0.03 0.25
For the 2WS case, MM and TVM again present a similar performance (see Table 14), with a clear monotone decrease as the sample size increases. Except for the RVM, all methods have smaller standard deviations in this well separated case, perhaps due to the improved ability of the k-means method to cluster this heterogeneous data set.
Table 15: Means and standard deviations (×10−4) of dr. 2PS case.
Method
TVM RVM MM
n Mean SD Mean SD Mean SD
100 1.70 18.53 1.29 22.52 1.30 19.19
500 0.30 3.18 0.44 6.54 0.36 6.99
1000 0.15 1.68 0.43 8.53 0.40 7.83
5000 0.02 0.21 0.39 8.76 0.48 5.61
Results for the 2PS case are shown in Table 15. In this case the MM method shows the opposite pattern: we do not observe a monotone behavior for dr, and the standard deviations are larger than those observed in the 2MS and 2WS cases. It is reasonable to suppose that the homogeneous structure of the data negatively influences the ability of the k-means method to cluster it.
21
The main message is that the MM method seems to be suitable when we are interested in the estimation of the likelihood values, with some caution when the population is highly homogeneous.
5.3 Number of Iterations
It is well known that one of the major drawbacks of the EM algorithm is its slow convergence. The problem becomes more serious when the starting values are badly chosen (McLachlan & Krishnan, 2008). Consequently, an important issue is the investigation of the number of iterations necessary for the convergence of the algorithm. Here we consider the 2MS, 2WS and 2PS cases, with k = 2 and n = 100, 500, 1000, 5000. For each combination of parameters and sample size, 500 samples were generated artificially, the number of iterations was recorded, and the means and standard deviations of this quantity were computed. The simulation results are reported in Tables 16, 17, and 18.
Table 16: Means and standard deviations of number of iterations. 2MS case
Method
TVM RVM MM
n Mean SD Mean SD Mean SD
100 2359.46 121.13 1612.09 100.79 2293.78 120.02
500 551.10 15.45 640.23 18.41 573.62 15.26
1000 468.65 8.17 666.58 11.06 501.72 7.67
5000 449.28 6.27 845.64 8.66 501.32 5.49
Table 17: Means and standard deviations of number of iterations. 2WS case.
Method
TVM RVM MM
n Mean SD Mean SD Mean SD
100 1519.10 105.98 942.32 79.56 1520.42 105.26
500 209.42 4.44 369.35 4.26 238.49 5.02
1000 189.85 2.60 403.22 3.88 220.24 2.68
5000 176.04 1.72 577.10 7.02 215.61 2.06
Results suggest that, in the moderately and well separated cases (Tables 16 and 17, respectively), using TVM and MM the mean number of iterations decreases as the sample size increases, but the same is not true when RVM is adopted as the initialization method.
The 2PS case is illustrated in Table 18. As expected, we observe poor behavior, possibly due to the population homogeneity, as commented before. An interesting fact is that, compared with MM, RVM has a smaller mean number of iterations.
6 A Simulation Study of Model Choice
One can argue that an arbitrary density can always be approximated by a finite mixture of normal distributions; see McLachlan & Peel (2000, Chapter 1), for example. However, simply increasing the number of components is not enough to obtain a suitable fit. The main question is how to achieve this using
Table 18: Means and standard deviations of number of iterations. 2PS case.
Method
TVM RVM MM
n Mean SD Mean SD Mean SD
100 2470.30 105.94 1539.45 93.29 1713.71 90.74
500 1061.02 39.90 1090.39 38.42 1095.16 23.07
1000 900.66 30.85 1063.38 23.97 1135.90 15.85
5000 700.77 12.53 1274.86 19.16 1499.98 15.90
the smallest number of mixture components without overfitting the data. One possible approach is to use some criterion function and compute
k̂ = argmin{C(Θ̂(k)); k ∈ {kmin, ..., kmax}},
where C(Θ̂(k)) is the criterion function evaluated at the EM estimate Θ̂(k), obtained by fitting the FM-SN model with k components to the data, and kmin and kmax are fixed positive integers.
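As a concrete illustration, the selection rule above amounts to evaluating the criterion at each candidate k and taking the minimizer. A minimal sketch (the function name is ours, not from the paper):

```python
def choose_k(criterion_values):
    """criterion_values: dict mapping a candidate number of components k
    to the criterion value C(Theta_hat(k)); returns the minimizing k."""
    return min(criterion_values, key=criterion_values.get)
```

For instance, with criterion values {2: 205.2, 3: 206.1, 4: 210.0} the rule selects k = 2.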
Our main purpose in this section is to investigate the ability of some classical criteria to estimate the correct number of mixture components. We consider the Akaike Information Criterion (AIC) (Akaike, 1974), the Bayesian Information Criterion (BIC) (Schwarz, 1978), the Efficient Determination Criterion (EDC) (Bai et al., 1989) and the Integrated Completed Likelihood Criterion (ICL) (Biernacki et al., 2000). AIC, BIC and EDC have the form
−2ℓ(Θ̂) + dk cn,
where ℓ(·) is the actual log-likelihood, dk is the number of free parameters to be estimated under the model with k components, and the penalty term cn is a convenient sequence of positive numbers. We have cn = 2 for AIC and cn = log(n) for BIC. For the EDC criterion, cn is chosen so that it satisfies the conditions
cn/n → 0 and cn/log log(n) → ∞
when n → ∞. Here we compare the following alternatives:

cn = 0.2√n, cn = 0.2 log(n), cn = 0.2 n/log(n), and cn = 0.5√n.
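To make these penalties concrete, a small sketch computing the common −2ℓ + dk·cn form for the choices above. The function names are ours; for a univariate k-component FM-SN model one would have dk = 4k − 1 (k locations, k scale parameters, k skewness parameters and k − 1 free weights):

```python
import math

def criterion(loglik, d_k, c_n):
    """Generic criterion of the form -2*loglik + d_k * c_n."""
    return -2.0 * loglik + d_k * c_n

def penalty_terms(n):
    """The penalty sequences c_n compared in the text."""
    return {
        "AIC": 2.0,
        "BIC": math.log(n),
        "EDC, cn=0.2*sqrt(n)": 0.2 * math.sqrt(n),
        "EDC, cn=0.2*log(n)": 0.2 * math.log(n),
        "EDC, cn=0.2*n/log(n)": 0.2 * n / math.log(n),
        "EDC, cn=0.5*sqrt(n)": 0.5 * math.sqrt(n),
    }
```

Note that cn = 0.2 log(n) penalizes more lightly than AIC for n below roughly e^10, which is consistent with its weaker performance in Table 19.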
The ICL is defined as −2ℓ*(Θ̂) + dk log(n),
where ℓ*(·) is the integrated log-likelihood of the sample and of the indicator latent variables, given by

ℓ*(Θ) = ∑_{i=1}^{k} ∑_{j∈Ci} log(pi SN(yj|θi)),

where Ci is a set of indexes defined as follows: j belongs to Ci if, and only if, the observation yj is allocated to component i by the following clustering process. After the FM-SN model with k components is fitted using the EM algorithm, we obtain ẑij, the estimated posterior probability that observation yj belongs to the ith component of the mixture (see equation (5)). If q = argmax_i{ẑij}, we allocate yj to component q.
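The allocation step and the resulting ICL can be sketched as follows. This is our own simplified outline, where `post[i][j]` stands for the estimated ẑij and `logdens[i][j]` for log SN(yj|θ̂i):

```python
import math

def icl(post, weights, logdens, d_k):
    """post[i][j]: estimated posterior probability that observation j
    belongs to component i; weights[i]: mixing weight p_i;
    logdens[i][j]: log SN(y_j | theta_i).
    Each observation is allocated to its maximum-posterior component,
    then ICL = -2 * l_star + d_k * log(n)."""
    k, n = len(weights), len(post[0])
    l_star = 0.0
    for j in range(n):
        q = max(range(k), key=lambda i: post[i][j])  # MAP allocation of y_j
        l_star += math.log(weights[q]) + logdens[q][j]
    return -2.0 * l_star + d_k * math.log(n)
```

Because the allocation hardens the posterior probabilities, ICL penalizes poorly separated components more heavily than BIC does.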
In this study we simulated samples of the FM-SN model with k = 3, p1 = p2 = p3 = 1/3, µ1 = 5, µ2 = 20, µ3 = 28, σ²1 = 9, σ²2 = 16, σ²3 = 16, λ1 = 6, λ2 = −4 and λ3 = 4, and considered the sample sizes n = 200, 300, 500, 1000, 5000. Figure 3 shows a typical sample of size 1000 following this specified setup.
Figure 3: Histogram of a FM-SN sample with k = 3 and n = 1000
For each generated sample (with the number of components fixed at 3) we fitted the FM-SN model with k = 2, k = 3 and k = 4, using the EM algorithm initialized by the method of moments. For each fitted model the criteria AIC, BIC, ICL and EDC were computed. We repeated this procedure 500 times and obtained the percentage of times each criterion chose the correct number of components. The results are reported in Table 19.
Table 19: Percentage of times the true model (with k = 3) is chosen using different criteria
n AIC BIC ICL EDC (cn = 0.2log(n)) EDC (cn = 0.2√n) EDC (cn = 0.2n/log(n)) EDC (cn = 0.5√n)
200 94.2 99.2 99.2 77.8 98.4 99.4 99.4
300 94.0 98.8 98.8 78.2 98.4 98.8 98.8
500 95.8 99.8 99.8 86.4 99.8 99.8 99.8
1000 96.2 100.0 100.0 88.5 100.0 100.0 100.0
5000 95.6 100.0 100.0 92.8 100.0 100.0 100.0
We can see that BIC and ICL have a better performance than AIC for all sample sizes. Except for AIC, the rates increase as the sample size increases. This possible drawback of AIC may be due to the fact that its penalty term does not take the sample size into account. Results for BIC and ICL were similar, while EDC showed some dependence on the term cn. In general, we can say that BIC and ICL have equivalent abilities to choose the correct number of components and that, depending on the choice of cn, EDC can be worse than AIC or as good as ICL and BIC.
7 Applications to Real Data Sets
Here we apply the modeling approach suggested in the previous sections to the analysis of two real data sets. That is, we model the data using a FM-SN model, estimating parameters through the proposed EM-type algorithm, with initial values obtained by the method of moments and by the random values method, and fixing as stopping rule |ℓ(Θ(k+1)) − ℓ(Θ(k))| < 10⁻⁵.
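The stopping rule can be encapsulated in a generic EM driver. Below is a hedged sketch with our own names, where `update` stands for one full E-plus-M step and `loglik` for the observed-data log-likelihood ℓ(·):

```python
def run_em(update, loglik, theta, tol=1e-5, max_iter=100000):
    """Iterate EM updates until |l(theta^(k+1)) - l(theta^(k))| < tol,
    returning the final parameter value and the iteration count."""
    ll_old = loglik(theta)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        theta = update(theta)
        ll_new = loglik(theta)
        if abs(ll_new - ll_old) < tol:
            break
        ll_old = ll_new
    return theta, iterations
```

The same driver works for any EM variant, since only the `update` callable depends on the model.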
7.1 GDP data set
Figure 4: Histogram of the GDP data set and fitted FM-SN densities with 2 (solid line) and 3 (dotted line) components
In this application we consider observations of the gross domestic product (GDP) per capita (PPP US$) in 1998 from 174 countries. This data set appeared before in Dias & Wedel (2004), where a logarithmic transformation was applied before fitting a mixture of two normal distributions, in order to eliminate the clear strong skewness. In Figure 4 we have a histogram of the data, suggesting a bimodal or trimodal pattern. Thus, a mixture of 2 or 3 SN distributions seems to be a natural candidate to model these data, avoiding a possibly unnecessary data transformation. For this application, each observation in the original data set was divided by 10³.
Table 20 presents the initial values obtained by the MM and RVM methods, and Table 21 shows the EM estimates. We can verify that, besides an apparent label switching, there are differences between the starting points and the EM estimates, which are more marked for the skewness parameters λi. In this case we have a small sample, and some care is needed when interpreting these results, because of the deficiencies of the algorithm in estimating this parameter, commented on in the previous sections.
As in Lin et al. (2007a), we applied the Kolmogorov-Smirnov (K-S) test to evaluate the goodness of fit under the different initialization methods of the algorithm. The p-values are presented in Table 22 and indicate a better fit using the model with 3 components, regardless of the initialization method.
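For reference, the K-S statistic compares the empirical CDF with the fitted model CDF. A minimal sketch of the statistic itself (the fitted FM-SN CDF would be supplied as `cdf`; code and names are ours):

```python
def ks_statistic(data, cdf):
    """D_n = sup_x |F_n(x) - F(x)|, evaluated at the order statistics,
    where F_n is the empirical CDF and F the fitted model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        # the supremum is attained just before or at each jump of F_n
        d = max(d, abs((i + 1) / n - fx), abs(fx - i / n))
    return d
```

A p-value would then be obtained from the Kolmogorov distribution of √n·D_n, as done by standard implementations of the test.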
Table 20: GDP data: initial values obtained by MM and RVM methods
Method
Parameter k RVM MM
µi 2 ( 1.8365; 0.3911) (0.0047; 1.6078)
3 (2.4609; 0.9195; 0.0903 ) (1.7354; 0.6208; 0.0896 )
σ2i 2 (0.3617; 0.1542) ( 0.2168; 0.4216)
3 (0.0445; 0.0660; 0.0007) (0.3689; 0.1730; 0.0522 )
λi 2 (1.0619; -3.9085) (11.5960; 1.5634)
3 (-2.2742; -3.2476; 0.6745 ) (3.3291; 2.4475; 2.1934)
pi 2 (0.5419; 0.4580) ( 0.7873; 0.2126)
3 (0.3566; 0.2274; 0.4158) (0.1724; 0.2068; 0.6206)
ℓ(Θ) 2 -260.8486 -106.5715
3 -417.0599 -111.6361
Table 21: GDP data: EM estimates
Method
Parameter k RVM MM
µi 2 (1.4124; 0.3043) (0.0452; 1.8755)
3 (1.9950; 0.4773; 0.0456) (1.7915; 0.3968; 0.0454)
σ2i 2 (0.5620; 0.0380) (0.2056; 0.2897)
3 (0.4937; 0.0370; 0.0137) (0.3102; 0.1840; 0.0521 )
λi 2 (0.2926; -0.0665) (2387.4927; 0.2941 )
3 (-0.4830; -0.0053; 1825.5701) (0.7189; 7.4099; 2127.1691 )
pi 2 (0.3388; 0.6611) (0.7799; 0.2200 )
3 (0.2857; 0.3813; 0.3329) (0.2059; 0.2565; 0.5375 )
ℓ(Θ) 2 -123.0055 -95.6176
3 -94.4735 -92.0709
Fixing MM as the initialization method, we computed the model choice criteria AIC, BIC and EDC (with cn = 0.2√n) for the models with 2 and 3 components. Table 23 shows the results, together with the log-likelihood evaluated at the EM estimates and the number of iterations needed for convergence. Although the model with 3 components has a larger log-likelihood value, it is penalized by the criteria because of its higher number of parameters. Surprisingly, the model with 2 components was rejected by the K-S test.
Figure 4 also presents the plug-in densities (that is, the FM-SN densities with 2 and 3 components in which the parameters have been replaced by the EM estimates) superimposed on a single set of coordinate axes. Based on the graphical visualization, it appears that the model with 2 components provides a quite reasonable and better fit.
Table 22: GDP data: Kolmogorov-Smirnov test
2 components 3 components
RVM MM RVM MM
K-S statistic 0.0687 0.0966 0.0365 0.0396
p-value 0.0250 0.0190 0.1940 0.1820
Table 23: GDP data: criteria values and number of iterations
k ℓ(Θ) AIC BIC EDC Number of iterations
2 −94.95 205.24 227.35 214.32 7960
3 −92.08 206.15 240.90 220.43 7164
7.2 Old Faithful Geyser data set
Now we consider the famous Old Faithful Geyser data taken from Silverman (1986), consisting of 272 univariate measurements of eruption lengths (in minutes) of the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA. It has been analyzed by many authors, like Lin et al. (2007b), who fitted FM-SN and FM-NOR models with two components, for example.
The histogram of this data set, shown in Figure 5, clearly suggests the fit of a bimodal distribution with two heterogeneous asymmetric sub-populations. We made an analysis similar to that of the previous example, fitting FM-SN models with 2 and 3 components. Results are displayed in Tables 24, 25 and 26. In this case the
Figure 5: Histogram of the Old Faithful Geyser data set and fitted FM-SN densities with 2 (solid line) and 3 (dotted line) components
K-S test and the criteria agree on the choice of the model with 2 components.
Table 24: Old Faithful data: initial values obtained by MM and RVM methods
Method
Parameter k RVM MM
µi 2 (4.7851; 2.8597) (4.6874; 1.6713)
3 (5.7732; 3.5815; 4.5693 ) (4.4258; 4.7396; 4.6198)
σ2i 2 (0.0801; 0.0571) (0.3125; 0.2236)
3 (1.4072; 1.4842; 0.8971) (2.3636; 2.5202; 2.7715 )
λi 2 (1.4607; -1.5402) (-1.7855; 122.9431)
3 (-3.4674; -3.0006; -2.8209) (-1.4248; -2.3205; -1.8137)
pi 2 (0.7718; 0.2281) (0.6397; 0.3602)
3 (0.2499; 0.1149; 0.6351) (0.3308; 0.3566; 0.3125)
ℓ(Θ) 2 -1886.2170 -274.9906
3 -3228.8101 -411.5698
Table 25: Old Faithful data: EM estimates
Method
Parameter k RVM MM
µi 2 (4.8193; 2.0185) (4.7994; 1.7264)
3 (5.1021; 1.9993; 4.7367) (2.0002; 4.9903; 4.6653 )
σ2i 2 (0.5333; 0.0477) (0.4688; 0.1451)
3 (1.4799; 0.0444; 0.3476) (0.0442; 1.0963; 0.2489 )
λi 2 (-3.9733; -0.0731) (-3.4798; 5.8263)
3 (-1.6523e+03; -3.4013e-03; -3.6525e+00) (-0.0109; -10.2598; -2.3059 )
pi 2 (0.6584; 0.3415) (0.6511; 0.3488)
3 (0.1883; 0.3340; 0.4776) (0.3344; 0.3005; 0.3650 )
ℓ(Θ) 2 -226.4929 -257.5662
3 -264.4344 -265.2770
8 Final Conclusion
In this work we presented a simulation study to investigate the performance of the EM algorithm for maximum likelihood estimation in finite mixtures of skew-normal distributions with component specific parameters. The results show that the algorithm produces quite reasonable estimates, in the sense of consistency and of the total number of iterations, when the method of moments is used to obtain the starting points. The study also suggested that the random initialization method used here is not a reasonable procedure. When the EM estimates were used to compute some model choice criteria, namely the AIC, BIC, ICL and EDC criteria (the latter depending on the penalization term used), a good alternative to estimate the number of components of the mixture was obtained. On the other hand, these patterns do not hold when the mixture components are poorly separated, notably for the skewness parameter estimates, which, in addition, showed
Table 26: Old Faithful data: Kolmogorov-Smirnov test
2 components 3 components
RVM MM RVM MM
K-S statistic 0.0491 0.03569 0.0481 0.0485
p-value 0.1060 0.1730 0.0940 0.1010
Table 27: Old Faithful data: criteria values and number of iterations
k ℓ(Θ) AIC BIC EDC Number of iterations
2 −257.57 529.13 554.37 538.22 805
3 −257.26 536.52 576.19 550.81 1641
a performance strongly dependent on large samples. Possible extensions of this work include the multivariate case and wider families of skewed distributions, like the class of skew-normal independent distributions (Lachos et al., 2010).
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171–178.

Azzalini, A. (2005). The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32, 159–188.

Bai, Z. D., Krishnaiah, P. R. & Zhao, L. C. (1989). On rates of convergence of efficient detection criteria in signal processing with white noise. IEEE Transactions on Information Theory, 35, 380–388.

Basso, R. M., Lachos, V. H., Cabral, C. R. B. & Ghosh, P. (2009). Robust mixture modeling based on scale mixtures of skew-normal distributions. Computational Statistics and Data Analysis. doi:10.1016/j.csda.2009.09.031.

Bayes, C. L. & Branco, M. D. (2007). Bayesian inference for the skewness parameter of the scalar skew-normal distribution. Brazilian Journal of Probability and Statistics, 21, 141–163.

Biernacki, C., Celeux, G. & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 719–725.

Biernacki, C., Celeux, G. & Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41, 561–575.

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Dias, J. G. & Wedel, M. (2004). An empirical comparison of EM, SEM and MCMC performance for problematic Gaussian mixture likelihoods. Statistics and Computing, 14, 323–332.

DiCiccio, T. J. & Monti, A. C. (2004). Inferential aspects of the skew exponential power distribution. Journal of the American Statistical Association, 99, 439–450.

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer Verlag.

Greselin, F. & Ingrassia, S. (2009). Constrained monotone EM algorithms for mixtures of multivariate t distributions. Statistics and Computing. doi:10.1007/s11222-008-9112-9.

Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation for normal mixture models. The Annals of Statistics, 13, 795–800.

Henze, N. (1986). A probabilistic representation of the skew-normal distribution. Scandinavian Journal of Statistics, 13, 271–275.

Ingrassia, S. (2004). A likelihood-based constrained algorithm for multivariate normal mixture models. Statistical Methods and Applications, 13, 151–166.

Ingrassia, S. & Rocci, R. (2007). Constrained monotone EM algorithms for finite mixtures of multivariate Gaussians. Computational Statistics & Data Analysis, 51, 5339–5351.

Johnson, R. A. & Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Prentice Hall, 6th edition.

Karlis, D. & Xekalaki, E. (2003). Choosing initial values for the EM algorithm for finite mixtures. Computational Statistics & Data Analysis, 41, 577–590.

Lachos, V. H., Ghosh, P. & Arellano-Valle, R. B. (2010). Likelihood based inference for skew normal independent linear mixed models. Statistica Sinica, 20, 303–322.

Lin, T. & Lin, T. (2009). Supervised learning of multivariate skew normal mixture models with missing information. Computational Statistics. doi:10.1007/s00180-009-0169-5.

Lin, T. I. (2009a). Maximum likelihood estimation for multivariate skew normal mixture models. Journal of Multivariate Analysis, 100, 257–265.

Lin, T. I. (2009b). Robust mixture modeling using multivariate skew t distributions. Statistics and Computing. doi:10.1007/s11222-009-9128-9.

Lin, T. I., Lee, J. C. & Ni, H. F. (2004). Bayesian analysis of mixture modelling using the multivariate t distribution. Statistics and Computing, 14, 119–130.

Lin, T. I., Lee, J. C. & Hsieh, W. J. (2007a). Robust mixture modelling using the skew t distribution. Statistics and Computing, 17, 81–92.

Lin, T. I., Lee, J. C. & Yen, S. Y. (2007b). Finite mixture modelling using the skew normal distribution. Statistica Sinica, 17, 909–927.

McLachlan, G. J. & Krishnan, T. (2008). The EM Algorithm and Extensions. John Wiley and Sons, second edition.
McLachlan, G. J. & Peel, D. (2000). Finite Mixture Models. John Wiley and Sons.

Meng, X. L. & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80, 267–278.

Nityasuddhi, D. & Böhning, D. (2003). Asymptotic properties of the EM algorithm estimate for normal mixture models with component specific variances. Computational Statistics & Data Analysis, 41, 591–601.

Park, H. & Ozeki, T. (2009). Singularity and slow convergence of the EM algorithm for Gaussian mixtures. Neural Processing Letters, 29, 45–59.

Peel, D. & McLachlan, G. J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10, 339–348.

R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Shoham, S. (2002). Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions. Pattern Recognition, 35, 1127–1142.

Shoham, S., Fellows, M. R. & Normann, R. A. (2003). Robust, automatic spike sorting using mixtures of multivariate t-distributions. Journal of Neuroscience Methods, 127, 111–122.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons.

Yakowitz, S. J. & Spragins, J. D. (1968). On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39, 209–214.

Yao, W. (2010). A profile likelihood method for normal mixture with unequal variance. Journal of Statistical Planning and Inference. doi:10.1016/j.jspi.2010.02.004.