A Bayesian Nonparametric Approach to Factor Analysis

Rémi Piatek∗
University of Copenhagen
[email protected]

Omiros Papaspiliopoulos
ICREA–UPF

September 17, 2018
Abstract
This paper introduces a new approach for the inference of non-Gaussian factor models based on Bayesian nonparametric methods. It relaxes the usual normality assumption on the latent factors, widely used in practice, which is too restrictive in many settings. Our approach, on the contrary, does not impose any particular assumptions on the shape of the distribution of the factors, but still secures the basic requirements for the identification of the model. We design a new sampling scheme based on marginal data augmentation for the inference of mixtures of normals with location and scale restrictions. This approach is augmented by the use of a retrospective sampler, to allow for the inference of a constrained Dirichlet process mixture model for the distribution of the latent factors. We carry out a simulation study to illustrate the methodology and demonstrate its benefits. Our sampler is very efficient in recovering the distribution of the factors, and only generates models that fulfill the identification requirements. A real data example illustrates the applicability of the approach.
JEL Classification: C11; C38; C63.
Keywords: Factor models; Identification; Bayesian nonparametric methods; Dirichlet process hierarchical models; Marginal data augmentation; Retrospective sampler.
This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 600207. See acknowledgments p. 36.
∗Corresponding author: Department of Economics, University of Copenhagen, Øster Farimagsgade 5, DK–1353 Copenhagen K, Denmark. Phone: (+45) 35 32 30 35. The methodology introduced in this paper will be released as an extension to the R package BayesFM available on CRAN at https://cran.r-project.org/package=BayesFM upon publication of this article.
1 Introduction
Factor analysis has grown into a very popular and powerful tool in many fields of research,
particularly in the social sciences, where it is routinely used to aggregate large sets of variables
into smaller sets of meaningful factors. A myriad of examples relying on this data reduction
strategy can be found in the empirical literature, ranging from the extraction of latent
factors underlying macroeconomic indicators to study monetary policies or business cycles
(Bernanke et al., 2005; Forni and Gambetti, 2010), to the measurement of personality traits
and cognitive abilities and their impact on economic outcomes (see, e.g., Carneiro et al.,
2003; Heckman et al., 2006; Conti et al., 2014; Piatek and Pinger, 2016).
One of the main challenges inherent to the inference of these models is identification. To
make inference feasible and produce meaningful results, identifying assumptions are needed,
often in the form of parameter restrictions and distributional assumptions. Techniques for
dealing with such issues were developed as early as Anderson and Rubin (1956). Within the social sciences, most articles published to date assume the factors to be Gaussian, while within the Machine Learning community, non-Gaussian factor analysis is popular.
The Gaussian assumption is convenient and has a natural interpretation for many analysts.
However, it may have little justification empirically, and the model misspecification it can
induce is likely to contaminate the inference of the remaining parameters of the model.
This paper offers a more flexible approach to factor analysis that relaxes the Gaussian
assumption on the latent factors. We offer a modeling framework that allows for dependence
across factors, assumes a flexible distribution based on mixtures of Gaussians, and permits
the identification of the latent factors. Dependence across factors and identifiability are key
requirements in the applications we are interested in, as they make it possible to unravel rich latent
structures where unobserved traits can easily be interpreted—see Almlund et al. (2011) for
a discussion on this topic in personality economics. In the econometric literature, estimation
methods relying on finite mixtures of normals are commonly used (Hansen et al., 2004; Cunha
and Heckman, 2008; Cunha et al., 2010). Mixtures of normals provide an approximation
to the unknown distribution of the latent factors that can otherwise be nonparametrically
identified through appropriate parameter restrictions. These approaches therefore guarantee
identification and ensure interpretability. However, another type of misspecification can
emerge when the number of mixture components selected is not appropriate to provide a
good fit to the data.
To learn the appropriate number of mixture components from the data, instead of fixing
it a priori, Bayesian nonparametric (BNP) methods have been introduced. See Antoniak (1974), Quintana and Müller (2004), and Paisley and Carin (2009) for Dirichlet processes in general, and Neal (2000) and Papaspiliopoulos and Roberts (2008) for state-of-the-art computational methods for estimating the corresponding models. Effectively, these approaches
specify an infinite mixture of Gaussians with a specific prior distribution on the mixture
weights, and the number of active components can grow with the size of the data. Typically,
these procedures do not provide any guarantee of the formal identification of the model.
A notable exception is Yang et al. (2010), who propose a semiparametric approach we will
return to. More generally, this lack of identification becomes a major obstacle when the
inference of the structural part of the model is of main interest—e.g., if factor loadings need
to be identified to make inference on policy-relevant statistics such as elasticities or marginal
effects.
The goal of the present paper is to develop a richly parameterized and flexible distribution for the latent factors, which allows for dependence among factors while ensuring their
identifiability. We specify the distribution of the latent factors as an affine transformation
of a Dirichlet process that fixes the location and the scale of the process. We achieve this by
appropriately transforming the parameters of the mixture components. We develop a new
approach for the inference of constrained mixtures of Gaussians that relies on Marginal Data
Augmentation (MDA) methods (Meng and van Dyk, 1999; van Dyk and Meng, 2001; van
Dyk, 2010). MDA methods proceed by expanding the original constrained model, introducing extra parameters which, even though they cannot be identified from the data, facilitate sampling and make inference more efficient in terms of convergence and mixing. An interesting by-product is that the model expansion can be tailored to safeguard the identification
of the factor model. In the case of a Dirichlet process mixture model, where the number of
mixture components is free to grow infinitely to accommodate the data, the implementation
of MDA methods is not straightforward. In this article, we tackle both the finite and infinite
cases. For the former, we resort to truncations of the Dirichlet process, as in Ishwaran and
James (2002). For the latter, we apply both conditional approaches, such as retrospective
sampling ideas in Papaspiliopoulos and Roberts (2008) and Yau et al. (2011), as well as
marginal approaches as in Neal (2000), to infer the infinite-dimensional mixture model.
Interestingly, with this representation based on a mixture of normals, our model can be
reformulated as a mixture of factor analyzers (McLachlan and Peel, 2000; McLachlan et al.,
2003; Fokoue and Titterington, 2003). Despite the analogy of the two approaches, there are
fundamental differences. Mixtures of factor analyzers assume Gaussian factors and mix the
structural parameters of the model (factor loadings, intercepts and error term variances),
while our approach assumes that those are fixed across mixture components and rather mixes
the moments of the distribution of the latent factors. The two approaches correspond to two
completely different representations of the model, thus resulting in different interpretations.
They should therefore not be seen as competitive approaches, but rather as alternatives that
allow analysts to address different problems.
A relevant related and active literature is that of independent factor analysis popular in
Machine Learning, see for example Attias (1999). In that framework, identifiability is of real
concern since the latent factors are used for signal reconstruction. A key observation is that
identifiability can be partially resolved by working with certain non-Gaussian distributions
for the latent factors, a point to which we return below.
We conduct an extensive Monte Carlo study to investigate the performance of our sampler, using synthetic data sets generated from a two-factor model with a non-standard distribution for the latent factors. We implement and compare several approaches for the inference of the Dirichlet process. The results are very promising. They show that our MCMC
sampling scheme succeeds in retrieving the true underlying distribution of the latent factors,
without any a priori assumptions on the shape of the distribution. Most importantly, it does so while generating only identified models. Sampling turns out to be highly efficient thanks to the MDA procedure. The mixing of the Markov chains is indeed very good compared to what can usually be achieved in latent variable models, where convergence can be prohibitively slow and mixing poor.
To illustrate the applicability of our approach, we implement it using real data from the
British Cohort Study to extract the distribution of two latent factors capturing cognitive skills
and behavioral problems. The empirical results clearly provide evidence for non-Gaussian
factors, thus questioning standard factor analysis approaches that rely on the normality
assumption.
The algorithms developed in this article will be released as an extension to the R package
BayesFM, to allow researchers to replicate our results and also to apply our method to their
own data in a user-friendly manner.1
The baseline factor model used throughout this paper is presented in Section 2. We
briefly outline the parametric identification of the structural part of the model, then spend
some time on the nonparametric identification of the distribution of the latent factors, which
is our main focus. Section 3 introduces the Marginal Data Augmentation sampling scheme
for mixtures of normal distributions, and explains how to plug it into sampling methods
for Dirichlet process mixture models. Section 4.1 carries out our simulation study and
Section 4.2 applies it to real data. Section 5 concludes.
1. Package available on CRAN at https://cran.r-project.org/package=BayesFM. The corresponding package extension will be released upon publication of this article.
2 Specification and identification of the factor model
2.1 General model structure
The generic structure of the latent factor model we are considering is as follows. There are Q manifest variables Yi and P latent factors θi (P ≪ Q), for i = 1, . . . , N, following a linear relationship through a matrix of factor loadings Λ and a vector of intercept terms δ:
Yi = δ + Λθi + εi,   (1)

εi ∼ N(0; Σ),   Σ = diag(σ²1, . . . , σ²Q),

where Yi, δ and εi are (Q × 1) vectors, Λ is a (Q × P) matrix, and θi is a (P × 1) vector.
The error terms εi are assumed to be Gaussian for the sake of simplicity.2 The independence
of the error terms is standard in factor analysis, and implies that the factors are the only
source of correlation between the observed variables. The statistical model requires a specification for the distribution of the latent factors. A default option in the literature is that of a
Gaussian distribution. In this paper, we relax this assumption by allowing a nonparametric
specification of this distribution. Bayesian inference for this factor model also requires priors
on δ, Λ, Σ, and typically other hyperparameters as well.
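As an informal aid (not code from the paper, whose implementation is in the R package BayesFM), the generative structure of Eq. (1) can be sketched in a few lines of Python; the dimensions and parameter values are arbitrary, and the factors are drawn as Gaussians, which is precisely the default assumption this paper relaxes:

```python
import numpy as np

# Simulate from Y_i = delta + Lambda theta_i + eps_i, Eq. (1), with
# eps_i ~ N(0, Sigma) and Sigma diagonal. All values are illustrative.
rng = np.random.default_rng(0)
N, Q, P = 20000, 6, 2

delta = rng.normal(size=Q)                 # intercepts, (Q,)
Lam = rng.normal(size=(Q, P))              # factor loadings, (Q, P)
sigma2 = rng.uniform(0.5, 1.5, size=Q)     # idiosyncratic variances
theta = rng.normal(size=(N, P))            # latent factors (Gaussian default)
eps = rng.normal(size=(N, Q)) * np.sqrt(sigma2)

Y = delta + theta @ Lam.T + eps            # manifest variables, (N, Q)

# With independent errors, the factors are the only source of correlation
# across manifest variables: V(Y_i) = Lambda Lambda' + Sigma here.
implied_cov = Lam @ Lam.T + np.diag(sigma2)
print(np.abs(np.cov(Y.T) - implied_cov).max())
```

As N grows, the sample covariance of Y approaches ΛΦΛ′ + Σ (here with Φ = IP), which is exactly the quantity at the heart of the identification discussion that follows.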
The model as stated in Eq. (1) is not identified. Some identifiability issues arise because
of the specification of the structural part of the model (Section 2.2), while others are related
to the distributional assumptions on the latent factors (Section 2.3). The following sections
provide an overview of the two sources of identification issues that lead to our proposal for
an identifiable nonparametric latent factor model.
2.2 Identification of the structural part of the model
The intercept terms δ and the factor loading matrix Λ can be identified using appropriate
parameter restrictions. This can be done independently of the distribution assumed on the
latent factors, but we will also see that specific distributional assumptions can make it possible to relax some of these restrictions.
If the distribution of the latent factors belongs to a location family, such as the Gaussian, or to a mixture of distributions in a location family, such as a mixture of Gaussians, with
location parameter(s) to be estimated from the data, then δ is not identifiable. Indeed, the
distribution of Yi remains the same by adding an arbitrary constant to δ and subtracting
2. The normality of the error terms could be relaxed in a similar way to the latent factors. However, we stick to the standard Gaussian assumption in this paper for simplicity, and because the main focus is on the distribution of the latent factors.
appropriate constant(s) from the location parameter(s). This lack of identifiability can be
tackled by fixing the location of the factors, e.g., by fixing the mean of the factor distribution
to 0, such that E(θi) = 0. This constraint is straightforward to impose in the Gaussian case,
but not as trivial in the nonparametric case. In this article, we propose a distribution for
the factors that fixes their location.
The second identification problem affects the factor loadings. The bilinear form Λθi
implies that the latent factors can only be identified up to a scale transformation, since
the distribution of Yi remains unaltered if the factors are multiplied by a nonsingular scaling
matrix, and the factor loading matrix by the inverse of this matrix. This can be seen from the
expression of the overall covariance matrix of the manifest variables, which can be expressed
as ΛΦΛ′ + Σ = (ΛR−1)(RΦR′)(ΛR−1)′ + Σ, for any nonsingular (P × P )-matrix R, where
Φ ≡ V(θi) denotes the covariance matrix of the latent factors. This indeterminacy, commonly
referred to as the rotation problem, is well known since the seminal work of Thurstone (1934),
later formalized by Reiersøl (1950), Koopmans and Reiersøl (1950), and Anderson and Rubin
(1956). See also Williams (2017) for a recent revival of these questions.
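The rotation invariance described above is easy to verify numerically. The following sketch (illustrative values only, not from the paper) checks that ΛΦΛ′ + Σ is unchanged when the factors are premultiplied by a nonsingular matrix R and the loadings postmultiplied by R⁻¹:

```python
import numpy as np

# Rotation problem: (Lambda, Phi) and (Lambda R^{-1}, R Phi R') imply the
# same covariance matrix for the manifest variables. Values are arbitrary.
rng = np.random.default_rng(1)
Q, P = 5, 2
Lam = rng.normal(size=(Q, P))
Phi = np.array([[1.0, 0.3], [0.3, 1.0]])       # factor covariance
Sigma = np.diag(rng.uniform(0.5, 1.0, size=Q))  # error variances

R = rng.normal(size=(P, P))                     # any nonsingular matrix
Lam_star = Lam @ np.linalg.inv(R)
Phi_star = R @ Phi @ R.T

cov = Lam @ Phi @ Lam.T + Sigma
cov_star = Lam_star @ Phi_star @ Lam_star.T + Sigma
print(np.allclose(cov, cov_star))
```

The two parameterizations are observationally equivalent, which is why restrictions on Λ and/or Φ are needed before the loadings can be interpreted.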
This lack of identifiability has been addressed in the literature by assuming that the
latent factors are uncorrelated and have unit variances, such that Φ = IP . This requirement
has made the standard Gaussian a distribution of choice in factor analysis. However, this
assumption does not completely solve the indeterminacy problem, as the system still remains
unchanged if R is specified as an orthogonal matrix. To rule out these cases, Anderson and
Rubin (1956, p. 121) propose to use a lower triangular structure for the upper part of Λ. This
structure has become popular in factor analysis, see, e.g., Geweke and Zhou (1996), Aguilar
and West (2000), Lopes and West (2004), and Frühwirth-Schnatter and Lopes (2010).
One last identifiability issue needs to be taken care of. It arises because the sign of the latent factors and of the corresponding columns of the loading matrix can be flipped simultaneously without affecting the distribution of Yi. This property of the model implies that, without further constraints on Λ (or on the factors), the signs of the correlations between the factors are not identifiable. In our work, we deal with the sign issue by making assumptions on the sign of certain entries of the loading matrix. Computationally, we work with the sign-unconstrained model and enforce the constraints at a post-processing stage by appropriate transformations of the MCMC output, as in, e.g., Frühwirth-Schnatter and Lopes (2010) and Conti et al. (2014).
Alternatively, the scales and the signs of the latent factors can be set by constraining one
loading in each column of Λ instead of constraining the diagonal elements of Φ. This approach
has been popular in the econometrics literature, as it anchors the latent factors in
real measurements, thus facilitating interpretation (for example Cunha and Heckman, 2008;
Cunha et al., 2010, anchor the factors in earnings outcomes). Nevertheless, constraining some
factor loadings can be too restrictive in some frameworks. For example, when a stochastic
search is carried out to determine the number of latent factors and the structure of the factor
loading matrix in terms of zero and nonzero elements, it is not possible to fix any of the
loadings a priori. These approaches are becoming increasingly popular in the literature, see,
among others, Lucas et al. (2006), Carvalho et al. (2008), Frühwirth-Schnatter and Lopes
(2010), Bhattacharya and Dunson (2011), and Conti et al. (2014). In the present paper, we
rely on identifying criteria that fix the variances of the factors rather than some of the factor
loadings.
When working with correlated factors, the block lower triangular structure of Λ no longer
safeguards identification. Indeed, pre-multiplying the latent factors by a nonsingular lower
triangular matrix R and post-multiplying Λ by the inverse of R results in a model that is
observationally equivalent to the original one, since ΛR−1 also has a block lower triangular
structure. Therefore, moving from the uncorrelated to the correlated case requires adding a number of additional constraints on the factor loading matrix equal to the number of
off-diagonal elements of the covariance matrix Φ. This can be done by specifying a diagonal
matrix for the upper part of Λ, such that Λ′ = (DΛ1 Λ′2), with DΛ1 = diag(λ11, . . . , λPP), and Λ2 a full matrix that may contain additional zero elements. In this specification,
the first P manifest variables each load on a single latent factor, and are sometimes called
dedicated measurements in the literature (Conti et al., 2014; Williams, 2017). Similarly to
the uncorrelated case, the scale of the factors is set by either assuming that DΛ1 = IP , or
that Φjj = 1 and Λjj > 0, for j = 1, . . . , P .
2.3 Nonparametric identification of the distribution of the latent factors
The restrictions derived in Section 2.2 make it possible to identify the structural part of the model, i.e., δ and Λ, as well as the covariance matrix of the latent factors Φ. Importantly,
these assumptions do not depend on the distributional assumptions made on the latent
factors. They only secure the identification of the covariance matrix of the factors, and
therefore do not guarantee that the whole distribution of the factors is identified if we depart
from the Gaussian case.
In the nonparametric case, these assumptions might be over-restrictive. For example,
working with non-Gaussian latent factors can remove some identifiability problems when the
latent factors follow a mixture of Gaussians with a diagonal covariance matrix for each component that differs from the identity. This property has propelled the so-called independent component analysis and independent factor analysis, popular within Machine Learning, see, e.g., Attias (1999).
On the other hand, some nonparametric approaches might require additional restrictions
to fully identify the distribution of the factors nonparametrically. In this paper, we rely on
the identification strategy developed in Cunha et al. (2010). Their nonparametric approach
requires mild assumptions on the latent factors, and only minor additional restrictions on the
factor loading matrix: two dedicated manifest variables are needed for each factor instead
of one in the previous section.3
With two dedicated manifest variables in hand for each latent factor, such that Λ′ = (DΛ1 DΛ2 Λ′3),4 the proof for nonparametric identification of the factor distribution follows from Cunha et al. (2010). Assuming nonzero diagonal elements in DΛ1 and DΛ2, the first 2P equations can be rewritten as

W1 = θ + ω1,
W2 = θ + ω2,   (2)

with

W1 = D−1Λ1 (Y1:P − δ1:P),   ω1 = D−1Λ1 ε1:P,
W2 = D−1Λ2 (Y(P+1):(2P) − δ(P+1):(2P)),   ω2 = D−1Λ2 ε(P+1):(2P),
where the subscripts denote the elements of the corresponding subvectors (e.g., Y1:P contains the first P elements of the vector Y). The expression of the subsystem corresponding to the dedicated measurements in Eq. (2) is particularly convenient, as it makes it possible to directly use the first theorem of Cunha et al. (2010, Theorem 1, p. 893) to prove the nonparametric identification of the distribution of the factors, after having secured the identification of the intercept terms and the factor loadings as explained in the previous section. This theorem states that if W1, W2, θ, ω1 and ω2 are random vectors taking values in RP and related through the equations in Eq. (2), then the factor distribution is nonparametrically identified and can be expressed in terms of observable quantities, provided that E(ω1 | θ, ω2) = 0 and ω2 is independent of θ. The last two conditions are automatically fulfilled, since we assume the error terms to be independently normally distributed.
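The inversion of the dedicated blocks into the repeated-measurement system of Eq. (2) can be sketched as follows (a hypothetical example, not from the paper: for simplicity the model is reduced to its first 2P rows, so Λ′ = (DΛ1 DΛ2) only, and the loading values are arbitrary):

```python
import numpy as np

# With two dedicated measurements per factor, rescaling each block by the
# inverse of its diagonal loading matrix yields W1 = theta + omega1 and
# W2 = theta + omega2, as in Eq. (2). All numbers are illustrative.
rng = np.random.default_rng(2)
N, P = 8, 2
D1 = np.diag([1.5, 0.8])          # D_Lambda1: dedicated loadings, block 1
D2 = np.diag([0.7, 1.2])          # D_Lambda2: dedicated loadings, block 2
delta = rng.normal(size=2 * P)

theta = rng.normal(size=(N, P))
eps = 0.3 * rng.normal(size=(N, 2 * P))
Y = delta + np.hstack([theta @ D1.T, theta @ D2.T]) + eps

W1 = (Y[:, :P] - delta[:P]) @ np.linalg.inv(D1).T
W2 = (Y[:, P:] - delta[P:]) @ np.linalg.inv(D2).T

# Each block equals the factors plus an independent, rescaled error term:
omega1 = W1 - theta    # = D_Lambda1^{-1} eps_{1:P}
omega2 = W2 - theta    # = D_Lambda2^{-1} eps_{(P+1):(2P)}
print(np.allclose(omega1, eps[:, :P] @ np.linalg.inv(D1).T))
```

Because the errors are independent Gaussians, ω2 is independent of θ and E(ω1 | θ, ω2) = 0, the two conditions of the theorem.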
3. In most cases, the assumption of two dedicated measurements per factor is not restrictive in practice, since numerous indicators are usually available to measure the latent factors.
4. Similarly to Section 2.2, DΛ1 and DΛ2 are diagonal matrices, and Λ3 is a full matrix.
2.4 Identifiable Bayesian nonparametric correlated factor models
We build a model for the latent factors that is sufficiently constrained in its location and
scale to facilitate identifiability of the structural part of the overall model. The model
is constructed as an affine transformation of an auxiliary process, which is modeled as a
Dirichlet process Gaussian mixture model and is described below. Therefore, our approach
is a combination of Bayesian nonparametrics and econometric modeling, in order to ensure
both a flexible form for the latent factors and identifiability of the structural part of the
model. It turns out that an insightful perspective on our model is as a Gaussian mixture
model where the number of components can be learned from the data automatically, and
where the mixture parameters are constrained to ensure identifiability of the structural part
of the factor model. The induced constraints lead to a complicated posterior distribution,
but we propose marginal data augmentation methods in Section 3 to sample from it very
efficiently.
2.4.1 Modeling the distribution of the factors
In the rest of the paper we will follow a notational convention. The intercept, factor loadings and latent factors that appear in the final formulation of the (identified) factor model will be denoted by δ, Λ and θi, respectively, whereas transformations thereof carry a bar: δ̄, Λ̄ and θ̄i. These transformations might be used as intermediate variables in the construction of the final model, e.g., an intermediate θ̄i is used to define a model for factors θi with constraints on their location and scale. Below we explain the precise ways in which these transformations relate to Eq. (1).
The factor model is an affine transformation of an auxiliary process that models the distribution of the latent factors. This stochastic process is specified as the following Dirichlet process Gaussian mixture model:

θ̄i | µ̄Gi, Φ̄Gi ∼ N(µ̄Gi; Φ̄Gi),
Gi | p ∼ ∑_{k=1}^{K} pk δk(Gi),   (3)
µ̄k | Φ̄k, A0 ∼ N(0; A0Φ̄k),   (4)
Φ̄k | ν0, s0 ∼ IW(ν0; s0IP),   (5)
p1 = V1,   pk = Vk ∏_{l=1}^{k−1} (1 − Vl),   (6)
Vk ∼ Beta(1; α),   (7)

for 1 < k ≤ K.
The parameters that define the Gaussian distribution at the top level are denoted by ϑ̄k = {µ̄k, Φ̄k} and are collected in the set ϑ̄ = {ϑ̄1, ϑ̄2, . . .}. These, and the random variables V = (V1, V2, . . .), are assumed to be independent of each other. The Dirac delta function centered at k is denoted δk(·). Hence, when K = ∞, Eqs. (3) to (7) in the above hierarchy define a Dirichlet process model for ϑ̄ with a normal-inverse-Wishart base distribution parameterized by {A0, ν0, s0}, see Eqs. (4) and (5). When K < ∞, VK is set to 1 to ensure that the mixture weights sum to 1, and the resulting model is a truncated Dirichlet process for ϑ̄ (Ishwaran and James, 2001, 2002). In either case, we adopt the stick-breaking representation of the Dirichlet process (Sethuraman, 1994), as described in Eqs. (6) and (7), and we explicitly augment the model with latent variables G = (G1, . . . , GN) for the mixture group memberships. Marginalizing over these membership variables, we obtain a (potentially infinite) mixture of Gaussians for the distribution of the factors:
θ̄i ∼ ∑_{k=1}^{K} pk NP(µ̄k; Φ̄k),

with corresponding moments:

E(θ̄i) = ∑_{k=1}^{K} pk µ̄k,

V(θ̄i) = ∑_{k=1}^{K} pk ((µ̄k − µ̄)(µ̄k − µ̄)′ + Φ̄k),

where µ̄ = E(θ̄i).
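The truncated stick-breaking construction of Eqs. (3) to (7) can be sketched as follows (a simplified illustration, not the paper's sampler: the component covariances are fixed to the identity instead of being drawn from the inverse-Wishart prior of Eq. (5), and all tuning values are arbitrary):

```python
import numpy as np

# Truncated stick-breaking: draw sticks V_k ~ Beta(1, alpha), set V_K = 1,
# form weights p_k, then sample factors from the finite Gaussian mixture.
rng = np.random.default_rng(3)
K, P, N, alpha, A0 = 10, 2, 5000, 1.0, 4.0

V = rng.beta(1.0, alpha, size=K)
V[-1] = 1.0                                   # V_K = 1 so the weights sum to 1
p = V * np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])   # Eq. (6)

Phi = np.stack([np.eye(P)] * K)               # simplifying assumption (not IW draws)
mu = rng.multivariate_normal(np.zeros(P), A0 * np.eye(P), size=K)  # Eq. (4) with Phi_k = I

G = rng.choice(K, size=N, p=p)                # mixture memberships, Eq. (3)
theta_bar = np.array([rng.multivariate_normal(mu[k], Phi[k]) for k in G])

# First moment of the mixture, matching E(theta_bar) = sum_k p_k mu_k:
mean = p @ mu
print(p.sum(), theta_bar.shape)
```

The telescoping product in the weights guarantees that they sum to one, and the empirical mean of the draws matches the mixture moment formula above.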
2.4.2 Constrained version of the model
Relying on the distribution specified in Section 2.4.1 for θ̄i, we propose the following identifiable nonparametric factor model:

Yi = δ + Λθi + εi,
θi = D̄−1/2(θ̄i − µ̄),   (8)

where µ̄ and D̄ are chosen so as to constrain the location and scale of the latent factors. We treat the finite (K < ∞) and infinite (K = ∞) mixture cases separately, although both are based on the following construction:

µ̄ = ∑_{k∈K} βk µ̄k,   (9)

Φ̄ = ∑_{k∈K} βk ((µ̄k − µ̄)(µ̄k − µ̄)′ + Φ̄k),   D̄ ≡ diag(Φ̄11, . . . , Φ̄PP),   (10)
where the set of mixture indices K and the weights βk are chosen in different ways for the
finite and infinite mixture models—see below in Sections 2.4.3 and 2.4.4.
The structure of the factor loading matrix, in terms of zero restrictions, is not affected by the transformation, because the matrix D̄ used to rescale the latent factors is diagonal. This is particularly important, as zero restrictions on Λ are required for identification in our framework, see Section 2.2.5 Our construction is analogous to the one used by Yang et al. (2010), except that for the parameter transformation we only use the diagonal elements of Φ̄, while they use the Cholesky decomposition of this covariance matrix. This is an important difference between the two approaches: ours makes it possible to work with correlated factors, as the corresponding transformation preserves the zero restrictions on the factor loading matrix, while theirs is only appropriate for uncorrelated factors, because it only preserves the zero restrictions of the loading matrix if it has a block lower triangular structure.
Since the Gaussian is a location-scale family, an equivalent way to understand the proposed latent factor model is as a Gaussian mixture with linearly constrained parameters:

θi ∼ ∑_{k=1}^{K} pk NP(µk; Φk),   (11)
µk = D̄−1/2(µ̄k − µ̄),   (12)
Φk = D̄−1/2 Φ̄k D̄−1/2,   (13)

where the parameter transformations expressed in Eqs. (12) and (13) imply, by construction, that the following constraints are fulfilled in the identified model:

µ ≡ ∑_{k∈K} βk µk = 0P,   (14)
Φ ≡ ∑_{k∈K} βk (µk µ′k + Φk),   D ≡ diag(Φ11, . . . , ΦPP) = IP.   (15)

5. If sign restrictions are imposed on Λ for identification, these restrictions also remain unaffected by the expansion, since the diagonal elements of D̄ are all positive.
The prior on ϑ̄ specified in Eqs. (4) and (5) implies a prior for the corresponding constrained parameters ϑ = {ϑk}k∈K, where ϑk = {µk, Φk}. The form of this induced density is given in Proposition 2 in the Appendix, and specifically in Eq. (A1). The density does not belong to a known family and looks cumbersome. Fortunately, this density is not required in the sampling scheme, as the marginal data augmentation procedure we will use mainly relies on the expanded version of the model, which is easier to sample from. Nevertheless, one should still investigate the shape of the prior induced on the identified model, to make sure we do not work with an odd prior. To do this, it is straightforward to simulate from the prior rather than trying to work out its kernel analytically.6
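Such a prior-simulation check can be sketched as follows (a simplified illustration, not the paper's code: the weights and component covariances are drawn from convenient placeholder distributions rather than from the stick-breaking and inverse-Wishart priors, since the constraints hold by construction for any draw):

```python
import numpy as np

# Draw unconstrained mixture parameters, apply the transformations of
# Eqs. (12)-(13), and verify the constraints (14)-(15) of the identified
# model. All distributions here are placeholder choices for illustration.
rng = np.random.default_rng(4)
K, P = 4, 3
beta = rng.dirichlet(np.ones(K))                      # mixture weights (placeholder)
mu_bar = rng.normal(size=(K, P))                      # unconstrained means
A = rng.normal(size=(K, P, P))
Phi_bar = A @ A.transpose(0, 2, 1) + 0.1 * np.eye(P)  # random SPD covariances

m = beta @ mu_bar                                     # Eq. (9)
dev = mu_bar - m
Phi_tot = np.einsum("k,kij->ij", beta,
                    dev[:, :, None] * dev[:, None, :] + Phi_bar)   # Eq. (10)
D_half_inv = np.diag(1.0 / np.sqrt(np.diag(Phi_tot)))

mu = dev @ D_half_inv                                 # Eq. (12)
Phi = D_half_inv @ Phi_bar @ D_half_inv               # Eq. (13), per component

# Constraint (14): the weighted mean of the constrained components is zero.
print(np.allclose(beta @ mu, 0.0))
# Constraint (15): the weighted total covariance has a unit diagonal.
Phi_mix = np.einsum("k,kij->ij", beta, mu[:, :, None] * mu[:, None, :] + Phi)
print(np.allclose(np.diag(Phi_mix), 1.0))
```

Whatever the draw, the constrained parameters satisfy Eqs. (14) and (15) exactly, which is the point of imposing the transformation a priori rather than in post-processing.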
The Bayesian formulation of the factor model is complemented by priors on δ, Λ, Σ
and α. The concentration parameter α has a major impact on the estimated number of
components in the infinite mixture model: The larger α, the more likely new components
will be introduced into the process a priori. This parameter can therefore be tuned to
control the expansion of the Dirichlet process in terms of number of mixture components.
This is analogous to alternative nonparametric approaches, such as kernel density estimation
methods, where a smoothing parameter usually needs to be selected by the analyst to control
the level of smoothness of the estimator (e.g., bandwidth parameter). In our approach, we
prefer to learn α from the data instead of fixing it a priori, and therefore equip this parameter
with a prior distribution.
6. See Section 4.1.1 for an example.
The general structure of the prior distribution on the hyperparameters is

δ | c0 ∼ NQ(0Q; c0IQ),   (16)
Λq | d0 ∼ NP(0P; d0IP),   (17)
σ²q | a0, b0 ∼ IG(a0; b0),   (18)
α | g0, h0 ∼ G(g0; h0),   (19)
for q = 1, . . . , Q, where Λq = (λq1, . . . , λqP )′ denotes the column vector of factor loadings
corresponding to the qth row of Λ, and each single factor loading is denoted λqj, for j =
1, . . . , P .
2.4.3 Finite mixture model
In the finite mixture model, K = {1, . . . , K} with K <∞. We simply take βk = pk in terms
of the generic model structure in Eqs. (9) and (10). Therefore, Eqs. (14) and (15) are by
construction equivalent to E(θi) = 0P and diag (V(θi)) = ιP , where ιP is the vector of length
P that contains only 1s.
2.4.4 Infinite mixture model
We could repeat the above construction for K → ∞, but each expression in Eqs. (9) and (10) would require an infinite summation, which would make the resulting model computationally intractable. Instead, we use the ingredients of the retrospective sampling methodology of Papaspiliopoulos and Roberts (2008) to define the µ̄ and D̄ required in Eqs. (8) to (10). The construction now also involves the allocation variables Gi.
In the Dirichlet process mixture model, the number of mixture components K is nominally infinite, but in practice only a finite number N of observations is available to be allocated to the mixture groups. Therefore, only a finite number of mixture groups will contain observations, the remaining ones being empty mixture components. We introduce some notation and divide the set of mixture component indices I into two distinct groups, the group of non-empty (“alive”) mixture components I(al), and the group of “dead” components I(d):

I = {1, 2, . . .},
I(al) = {k ∈ I : Nk > 0},
I(d) = {k ∈ I : Nk = 0},

where Nk = ∑_{i=1}^{N} 1{Gi = k}, for k = 1, 2, . . ., such that I = I(al) ∪ I(d).
Using the generic notation introduced in Eqs. (9) and (10), the set of mixture indices is defined as K = I(al), and the weights as βk ≡ wk = Nk/N, which measure the observed frequency with which individuals are allocated to mixture component k. By construction, the weights depend on the configuration of the allocation variables G, with wk > 0 for k ∈ I(al), wk = 0 for all k ∈ I(d), and ∑_{k∈I(al)} wk = 1. This construction does not collapse to the one for the finite mixture when K < ∞. It does, however, fix the location and scale of the factors—not by setting their first two prior moments to 0 and to a correlation matrix, respectively, but by fixing the linear combinations in Eqs. (14) and (15) to these values.
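A minimal sketch of this bookkeeping, with hypothetical allocations (the values of G are made up for the example):

```python
import numpy as np

# Given allocation variables G_i, identify the "alive" components I^(al)
# (those with N_k > 0) and form the empirical weights w_k = N_k / N that
# play the role of beta_k in Eqs. (9)-(10) for the infinite mixture.
G = np.array([0, 0, 2, 2, 2, 5, 0, 2])   # hypothetical allocations, N = 8
N = G.size

labels, counts = np.unique(G, return_counts=True)
alive = dict(zip(labels.tolist(), counts.tolist()))  # I^(al) with counts N_k
w = {k: n / N for k, n in alive.items()}             # w_k = N_k / N

print(alive)              # {0: 3, 2: 4, 5: 1}
print(sum(w.values()))    # the weights sum to 1 by construction
```

All other components are "dead" (wk = 0) and drop out of the finite sums defining µ̄ and D̄.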
2.4.5 Related approaches in the literature
An alternative approach to dealing with the identifiability constraints is to impose them after
sampling, for instance through appropriate transformations of the MCMC output produced
with the nonidentified model. An example of this is the treatment of the sign issue discussed
earlier. This approach is often equivalent to assuming certain priors for the parameters of
the factor model, which imply a nontrivial prior dependence among them. In some cases, the
induced prior distribution can be derived analytically and may exhibit desirable properties.
For example, in the framework of a factor model, Ghosh and Dunson (2009) show that this mechanism can be used to induce heavy-tailed priors on the factor loadings, which are well-defined and more flexible than the usual normal prior.
In other cases, the implied prior dependence might be more difficult to grasp. This is, for example, the case in the approach of Yang et al. (2010), who propose a semiparametric approach to factor analysis that relies on parameter expansion. They use a model transformation similar to ours, but rely on a post-processing stage to achieve identification. This posterior transformation implies a complicated prior on the loadings, because the transformation involves a mixture of Gaussians. Instead, we impose the identifiability constraints a priori, and exploit the connection to the nonidentifiable model to build efficient marginal data augmentation algorithms. We therefore use a different prior than theirs. Another difference concerns the identification requirements on the factor loading matrix. The parameter expansion they use can only be implemented for specific patterns of zero restrictions on the factor loading matrix, such as the block lower triangular matrix proposed by Geweke and Zhou (1996), and no additional zero restrictions can be imposed below the diagonal. This occurs because they work with the Cholesky decomposition of the covariance matrix of the factors. In contrast, we only use the diagonal matrix D to transform our model, which allows for arbitrary patterns of zero elements in the loading matrix.
These differences may be very relevant when embedding the inference of the distribution of latent factors into other existing approaches, for instance to implement stochastic search algorithms on the structure of the factor loading matrix, as already mentioned in Section 2.2. These methods usually require the prior to be known analytically, and would be impaired by arbitrary zero restrictions on the factor loading matrix. In this respect, our approach would be straightforward to use for such extensions.
3 Marginal data augmentation methods for nonparametric factor models
3.1 Accelerating MCMC using nonidentifiable model formulations
Marginal Data Augmentation (MDA) methods (Meng and van Dyk, 1999) emerged in parallel
with parameter-expansion methods (Liu and Wu, 1999), as a by-product of different attempts
made to improve the convergence of the EM-algorithm (Meng and van Dyk, 1997; Liu et al.,
1998). These approaches start from the observation that introducing extra parameters into the model (called working parameters), which cannot be identified from the data but can be sampled alongside the remaining parameters of the model, can dramatically improve the convergence and mixing of the MCMC sampler. Based on this result, Meng and van Dyk (1999), van Dyk and Meng (2001), and van Dyk (2010) have formalized the mechanisms of MDA, and provided extensive examples applying these methods to a wide range of models.
These approaches have proved to be particularly efficient in models where convergence is usually very slow, to the point that it can hinder proper inference, such as latent variable models. For example, MDA methods have been successfully applied to a variety of discrete choice models, such as the multinomial probit (Imai and van Dyk, 2005; Jiao and van Dyk, 2015), the multivariate probit (Lawrence et al., 2008), and the multinomial logit (Scott, 2011), as well as to factor analysis (Ghosh and Dunson, 2009; Yang et al., 2010; Frühwirth-Schnatter and Lopes, 2010; Conti et al., 2014) and to the sampling of correlation matrices (Liu and Daniels, 2006; Liu, 2008).
MDA methods also make it possible to sample indirectly from complicated distributions that would otherwise be difficult to simulate. This feature is particularly useful in our framework: the Dirichlet process hierarchical model is challenging to simulate in its constrained version, but it can be marginally augmented to make it easier to handle. Last but not least, these methods are usually easy to implement: only a few additional working parameters need to be sampled, at a low marginal cost, and no tuning is required. Hence, we can decouple the modeling, for which we can impose constraints for identifiability, from the computation, which can be done efficiently despite the complicated posteriors the modeling implies.
3.2 Working parameters for the nonparametric factor model
We build efficient MDA algorithms for the identifiable nonparametric factor model proposed in Section 2 using the working parameters µ and D, as defined in Section 2, together with the following additional parameter transformations:
\[
\tilde{\Lambda} = \Lambda D^{-\frac{1}{2}}, \qquad
\tilde{\delta} = \delta - \tilde{\Lambda} \mu. \tag{20}
\]
The backbone of the MDA algorithm we propose is the following result about the distribution of the working parameters, which is key to the efficient MCMC implementation we introduce.
Proposition 1. Consider the parameters µ and D defined in Eqs. (9) and (10), and the one-to-one mappings from the expanded-model parameters ϑ̃ to ϑ as defined in Eqs. (12) and (13). Then, the normal-inverse-Wishart prior distribution specified on ϑ̃_k = {µ̃_k, Φ̃_k} in Eqs. (4) and (5), for k ∈ K, implies that
\[
f(\mu, D \mid \vartheta, G, s_0, \nu_0, A_0)
= f(\mu \mid D, \vartheta, G, A_0) \prod_{j=1}^{P} f(D_j \mid \vartheta, G, \nu_0, s_0),
\]
with
\[
\mu \mid D, \vartheta, G, A_0 \sim \mathcal{N}_P\left( -D^{\frac{1}{2}} E^{-1} F;\ A_0\, D^{\frac{1}{2}} E^{-1} D^{\frac{1}{2}} \right), \tag{21}
\]
\[
D_j \mid \vartheta, G, \nu_0, s_0 \sim \mathcal{IG}\left( \frac{\nu_0 |K|}{2};\ \frac{s_0\, E_{[jj]}}{2} \right), \quad \text{for } j = 1, \dots, P, \tag{22}
\]
\[
E = \sum_{k \in K} \Phi_k^{-1}, \qquad F = \sum_{k \in K} \Phi_k^{-1} \mu_k,
\]
where $E_{[jj]}$ denotes the $j$th diagonal element of $E$, $K = \{1, \dots, K\}$ in the finite mixture model, $K = I^{(al)}$ in the infinite mixture model, and $|K|$ is the cardinality of the set $K$.
Proof. See Appendix A1.
In the case of the finite mixture model, G can be dropped from the conditioning sets above. Interestingly, conditionally on ϑ, {µ, D} are independent of the mixture probabilities p_k that are used for β_k in Eqs. (9) and (10). In the infinite mixture model, however, the construction imposes a prior dependence of µ and D on G, but only through the set of alive components I^(al) and its cardinality |I^(al)| implied by G.
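Proposition 1 translates directly into a sampling step. The sketch below (a hypothetical helper, not the authors' implementation) draws D from the inverse-gamma distributions in Eq. (22) and then µ from the multivariate normal in Eq. (21), assuming the means and covariances of the components in K are collected in arrays:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_working_parameters(mu_k, Phi_k, A0, nu0, s0):
    """Draw the working parameters (mu, D) as in Proposition 1 (sketch).

    mu_k: (K, P) mixture means, Phi_k: (K, P, P) mixture covariances of
    the components in the index set K. Hypothetical helper, not the
    authors' implementation.
    """
    K, P = mu_k.shape
    E = sum(np.linalg.inv(Phi_k[k]) for k in range(K))            # E = sum_k Phi_k^{-1}
    F = sum(np.linalg.inv(Phi_k[k]) @ mu_k[k] for k in range(K))  # F = sum_k Phi_k^{-1} mu_k
    Einv = np.linalg.inv(E)
    # Eq. (22): D_j ~ IG(nu0|K|/2, s0 E_[jj]/2), drawn as scale / Gamma(shape, 1)
    D = (s0 * np.diag(E) / 2) / rng.gamma(nu0 * K / 2, size=P)
    Dh = np.diag(np.sqrt(D))
    # Eq. (21): mu | D ~ N(-D^{1/2} E^{-1} F, A0 D^{1/2} E^{-1} D^{1/2})
    mu = rng.multivariate_normal(-Dh @ Einv @ F, A0 * Dh @ Einv @ Dh)
    return mu, D
```

Both draws are from standard distributions, which is what makes this working-parameter step cheap inside the sampler.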
The other distributions we need for the implementation of the MDA algorithm are those that correspond to the parameters defined in Eq. (20). However, it is a simple consequence of their definitions and of the priors on the identifiable parameters in Eqs. (16) and (17) that
\[
f(\tilde{\delta}, \tilde{\Lambda} \mid \mu, D, c_0, d_0)
= f(\tilde{\delta} \mid \tilde{\Lambda}, \mu, D, c_0) \prod_{q=1}^{Q} f(\tilde{\Lambda}_q \mid D, d_0),
\]
with:
\[
\tilde{\delta} \mid \tilde{\Lambda}, \mu, D, c_0 \sim \mathcal{N}\left( -\tilde{\Lambda}\mu;\ c_0 I_Q \right), \tag{23}
\]
\[
\tilde{\Lambda}_q \mid D, d_0 \sim \mathcal{N}\left( 0;\ d_0 D^{-1} \right), \tag{24}
\]
for $q = 1, \dots, Q$, where $\tilde{\Lambda}_q$ denotes the column vector of factor loadings corresponding to the $q$th row of $\tilde{\Lambda}$ in the expanded model.
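These conditional priors are likewise straightforward to simulate. A minimal sketch (hypothetical helper; D is passed as the vector of its diagonal elements):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_expanded_priors(mu, D, Q, c0, d0):
    """Draw the expanded-model parameters from Eqs. (23)-(24) (sketch,
    hypothetical helper). D is the vector of diagonal elements.
    """
    P = len(mu)
    # Eq. (24): each row of Lambda-tilde ~ N(0, d0 * D^{-1})
    Lam = rng.multivariate_normal(np.zeros(P), d0 * np.diag(1.0 / D), size=Q)
    # Eq. (23): delta-tilde | Lambda-tilde ~ N(-Lambda-tilde mu, c0 I_Q)
    delta = rng.multivariate_normal(-Lam @ mu, c0 * np.eye(Q))
    return delta, Lam

# Hyperparameters as in Table 1 of the simulation study (c0 = d0 = 10):
delta, Lam = draw_expanded_priors(mu=np.zeros(2), D=np.ones(2), Q=9, c0=10.0, d0=10.0)
```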
3.3 MDA sampling scheme
The sampler is presented as Algorithm 1, in its generic form to accommodate both the finite
and the infinite mixture cases. Those two cases only differ with respect to the sampling of
the mixture parameters in step 4. Full details are provided in Algorithms 2 and 3 for this
particular step. Parameters and latent variables carry a superscript (t) only if their values are used across MCMC iterations or are kept for posterior inference. The others are auxiliary draws that are discarded at the end of the corresponding iteration. Some of them, like the working parameters µ and D, may be updated several times within a single MCMC iteration. In that case, their most up-to-date values are used in any given substep of the MCMC sampler.
The main difference between this sampling scheme and a standard Gibbs sampler for
factor models is not only the potentially infinite number of mixture components, but also
the additional working parameters that need to be sampled jointly with the parameters
of interest and with the latent variables of the model. The introduction of these working
parameters requires a transformation of the model, which is performed at the end of each iteration to move back to the identified version (van Dyk, 2010). These additional steps, however, only come at a small computational cost. The intermediate values of the working parameters are all drawn directly from standard distributions, except at step 3a, where a Metropolis-Hastings step is implemented. For full details on the sampler, see Appendix B.
Initialization is done for all parameters and latent variables that are not marginalized out
before their first update.7 Since no information on the working parameters can be retrieved
from the data, they are sampled from their conditional prior distribution the first time they
are required, in step 2a. The latent factors θ are then sampled from the identified model,
and immediately transformed to obtain their counterpart in the expanded model. This step
is equivalent to sampling directly from f(θ | Y, δ, Λ,Σ,G, ϑ).
Step 4 is done in the nonidentified model, with some important differences between the
finite and infinite mixture cases. In the finite case (Algorithm 2), the mixture parameters ϑk
of the non-empty mixture components are sampled from their posterior distribution, while
those corresponding to the empty components are sampled from their prior. Similarly, the
stick-breaking variables Vk are either sampled from their posterior or prior distribution. In
the infinite case (Algorithm 3), this procedure is not feasible. Instead, the ϑk’s and Vk’s
corresponding to the non-empty components (resp., empty components) are sampled from
their posterior distribution (resp., prior distribution) up to the last non-empty component
kmax ≡ maxi{Gi}. The mixture indicators G are then updated sequentially for each observation i = 1, . . . , N, and any new component k > kmax that may be required to increase the size of the mixture is introduced retrospectively, using the procedure of Papaspiliopoulos and Roberts (2008). In this algorithm, the variable N⋆ denotes the temporary maximum number of mixture components (N⋆ ≥ kmax), which measures how far the sampler goes in its exploration of the Dirichlet process. As noted by Papaspiliopoulos and Roberts (2008), and observed in our simulations, the algorithm can introduce large numbers of temporary mixture components (large N⋆) at the beginning of sampling, but this number usually shrinks quickly once the sampler converges to the stationary distribution.
Since the mixture parameters are all updated in the nonidentified model, the prior dependence on G that affects ϑ in the infinite case is not relevant at these stages. This dependence is later restored by the transformation in step 6. Therefore, these steps represent a standard Gibbs step in the finite case, and a standard, but nontrivial, retrospective sampling step in the infinite case.8
As an alternative to the conditional approach of the retrospective sampler, it is also
possible to use a marginal approach to update the Dirichlet process mixture model, by
integrating out the mixture probabilities p. (See for a discussion on the respective advantages
7. As for the initial number of mixture components, we start our algorithm with the true number in our simulations, and with a single component in our real data application, such that K(0) = {1}.
8. See details in Appendix B6.
Algorithm 1 Generic MDA sampler

Initialization. Assign starting values to the parameters and latent variables δ(0), Λ(0), θ(0), Σ(0), G(0), α(0), {ϑ(0)_k}_{k∈K(0)}, where K(0) is the initial set of non-empty mixture components. Mixture weights and stick-breaking variables V need no initialization, as they are not conditioned upon before their first update in step 4.

MCMC sampling. At each iteration t = 1, . . . , T, cycle through the following steps:

1) Sample Σ(t) from f(Σ | Y, δ(t−1), Λ(t−1), θ(t−1)). ▷ Eq. (B1)

2) Sample θ̃ from f(θ̃ | Y, δ(t−1), Λ(t−1), Σ(t), G(t−1), ϑ(t−1)), in steps:

   a) Sample µ and D from f(µ, D | ϑ(t−1)). ▷ Eqs. (21) and (22)

   b) Sample θ from f(θ | Y, δ(t−1), Λ(t−1), Σ(t), G(t−1), ϑ(t−1)). ▷ Eq. (B2)

   c) Compute θ̃_i = µ + D^{1/2} θ_i, for i = 1, . . . , N.

3) Sample δ(t), Λ(t) from f(δ, Λ | Y, θ̃, Σ(t), G(t−1), ϑ(t−1)) in steps:

   a) Sample µ, D from f(µ, D | θ̃, G(t−1), ϑ(t−1)). ▷ Eqs. (B4) and (B6)

   b) Sample Λ̃ from f(Λ̃ | Y, θ̃, Σ(t), µ, D). ▷ Eq. (B7)

   c) Sample δ̃ from f(δ̃ | Y, θ̃, Σ(t), Λ̃, µ, D). ▷ Eq. (B8)

   d) Compute and save Λ(t) = Λ̃ D^{1/2} and δ(t) = δ̃ + Λ̃ µ.

4) Sample ϑ̃, V(t) and G(t) from their conditional distributions, and compute the corresponding weights {β(t)_k}_{k∈K}. This is done differently for the finite and infinite mixture cases, see Algorithms 2 and 3, respectively.

5) Sample α(t) from f(α | G(t)). ▷ Eq. (B15)

6) Compute µ and D as in Eqs. (9) and (10), using ϑ̃ and {β(t)_k}_{k∈K} generated in step 4. Apply the transformation in Eqs. (12) and (13) to produce the parameters ϑ(t) corresponding to the identified model. Transform the latent factors back to the identified model as θ(t)_i = D^{−1/2}(θ̃_i − µ), for i = 1, . . . , N.

Post-processing. Perform a sign switch on the factor loading matrix, mixture means and mixture covariances, to ensure that the model is identified with respect to the signs of the latent factors and factor loadings (Frühwirth-Schnatter and Lopes, 2010; Conti et al., 2014).
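The post-processing sign switch can be sketched as follows for the loading matrix alone (the function is ours, for illustration; the full procedure flips the corresponding mixture means and covariances of each factor at the same time):

```python
import numpy as np

def sign_switch(Lambda_draws):
    """Flip signs so that the first nonzero loading in each column of
    Lambda is positive at every iteration. Sketch of the sign switch for
    the loading matrix only; in the complete procedure the mixture means
    and covariances of each flipped factor are switched accordingly.
    """
    out = []
    for Lam in Lambda_draws:
        Lam = Lam.copy()
        for q in range(Lam.shape[1]):
            nz = np.flatnonzero(Lam[:, q])        # indices of nonzero loadings
            if nz.size and Lam[nz[0], q] < 0:     # first nonzero loading negative?
                Lam[:, q] = -Lam[:, q]            # flip the whole column
        out.append(Lam)
    return out

# A draw whose second column starts with a negative loading gets flipped:
draws = sign_switch([np.array([[0.0, -2.0], [3.0, 1.0]])])
```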
Algorithm 2 Sampling the mixture parameters in the finite mixture case

Step 4 of Algorithm 1 consists of the following Gibbs steps:

a) Sample ϑ̃_k from f(ϑ̃_k | θ̃, G(t−1)) if N_k > 0, ▷ Eqs. (B9) and (B10)
   or from its prior if N_k = 0, for k = 1, . . . , K, ▷ Eqs. (4) and (5)
   with $N_k = \sum_{i=1}^{N} \mathbf{1}\{G_i = k\}$.

b) Sample V(t)_k from f(V_k | G(t−1), α(t−1)), for k = 1, . . . , K − 1. ▷ Eq. (B14)
   Set V(t)_K = 1.

c) Compute the resulting mixture weights p_k, ▷ Eq. (6)
   and set β(t)_k = p_k, for k = 1, . . . , K.

d) Sample G(t) from f(G | θ̃, p(t), ϑ̃). ▷ Eq. (B11)
and drawbacks of the conditional and marginal approaches.) In our Monte Carlo experiment
in Section 4.1.3, we consider Algorithms 7 and 8 of Neal (2000), and compare the results to
those obtained with the retrospective sampler.
The parameter transformation carried out in step 6 guarantees that the mixture parameters fulfill the identification requirements exactly at each step of the MCMC sampler. Importantly, the parameters and latent variables that are affected by the expansion are always sampled simultaneously with the working parameters. This ensures that the sampling scheme preserves the prior distribution of the parameters in the identified model, and does not distort the posterior distribution, as would happen if sampling were done conditional on the working parameters.
4 Illustrations with synthetic and real data
We run our sampler on simulated and real data to investigate how our approach performs,
and also compare the results to those obtained from different algorithms.
4.1 Simulation study
In Section 4.1.2, we test our algorithm on synthetic data generated from our generic model
in Eq. (1). We use the retrospective sampler for the inference of the infinite version of the
Dirichlet process in this first exercise. We then repeat the experiment in Section 4.1.3 in the
framework of a Monte Carlo study, to gauge the efficiency of our method and to compare it
to alternative algorithms.
Algorithm 3 Sampling the mixture parameters in the infinite mixture case

Step 4 of Algorithm 1 is done retrospectively:

a) Set kmax ≡ maxi{Gi}.

b) Sample ϑ̃_k from f(ϑ̃_k | θ̃, G(t−1)) if k ∈ I^(al), ▷ Eqs. (B9) and (B10)
   or from its prior if k ∈ I^(d), for k = 1, . . . , kmax. ▷ Eqs. (4) and (5)

c) Sample V(t)_k from f(V_k | G(t−1), α(t−1)), for k = 1, . . . , kmax. ▷ Eq. (B14)

d) Compute the resulting mixture weights p_k, for k = 1, . . . , kmax. ▷ Eq. (6)

e) Update G. If necessary, introduce new mixture components retrospectively. Set g = G(t−1) and cycle through the following steps, for i = 1, . . . , N, in random order:

   (i) Synchronize N⋆ and maxi{Gi}.

   (ii) Sample U_i ∼ U(0; 1).

   (iii) For j = 1, . . . , kmax + 1, evaluate: ▷ Eq. (B12)
   \[
   \sum_{l=0}^{j-1} q_i(g, l) < U_i \le \sum_{l=1}^{j} q_i(g, l),
   \]
   where q_i(g, l) is the probability mass function of assigning observation i, currently in mixture group g_i, to mixture component l, while the other observations remain assigned to their respective groups. By convention, q_i(g, 0) = 0 for all i = 1, . . . , N. See details in Appendix B6.2.

   (iv) If the condition is verified for some j ≤ N⋆: set g_i = j with probability α_i{g, g(i, j)}. ▷ Eq. (B13)
   Otherwise, leave g_i unchanged, set i ← i + 1 and go to step (i).

   (v) If the condition is not verified for any j ≤ N⋆: set N⋆ = N⋆ + 1 and j = N⋆.
   Sample V(t)_j and ϑ̃_j from their priors. ▷ Eqs. (4), (5) and (7)
   Compute $p_j = V^{(t)}_j \prod_{l=1}^{j-1} (1 - V^{(t)}_l)$ and go to step (iii).

f) Set G(t) = g and β(t)_k = N_k/N for k ∈ I^(al).
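Step e(v) extends the stick-breaking representation retrospectively: further stick-breaking variables are drawn from their prior (Beta(1, α) in the usual stick-breaking construction of the Dirichlet process), and the implied weights are appended. A minimal sketch of this extension (hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(2)

def extend_sticks(V, alpha, n_new):
    """Extend the stick-breaking representation with n_new prior draws
    V_k ~ Beta(1, alpha), and return all implied weights
    p_k = V_k * prod_{l<k} (1 - V_l). Hypothetical helper, for illustration.
    """
    V = np.concatenate([np.asarray(V, dtype=float),
                        rng.beta(1.0, alpha, size=n_new)])
    # p_1 = V_1, p_k = V_k * (1-V_1) * ... * (1-V_{k-1})
    p = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V, p

# Start from two sticks and add three more components retrospectively:
V, p = extend_sticks([0.5, 0.5], alpha=1.0, n_new=3)
```

The weights always stay nonnegative and sum to less than one, the remaining mass belonging to the components not yet instantiated.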
4.1.1 Setup of the experiment
Data generation. A data set with N = 2, 000 observations on Q = 9 manifest variables
is simulated with P = 2 latent factors, using the following values for the structural part of
the model:9
\[
\delta' = \begin{pmatrix} 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \end{pmatrix},
\]
\[
\Lambda' = \begin{pmatrix}
1.0 & 0.9 & 0.8 & 0.0 & 0.0 & 0.0 & 0.8 & 0.6 & 0.4 \\
0.0 & 0.0 & 0.0 & 1.0 & 0.9 & 0.8 & 0.4 & 0.6 & 0.8
\end{pmatrix}, \tag{25}
\]
\[
\Sigma = \operatorname{diag}\begin{pmatrix} 0.05 & 0.20 & 0.40 & 0.05 & 0.20 & 0.40 & 0.05 & 0.20 & 0.40 \end{pmatrix}.
\]
Each factor has three dedicated measurements, and the last three measurements load on both
factors. This type of structure is very common in the social sciences, where some particular
tests are designed to measure specific traits (think of an IQ test), while others capture several
features simultaneously (e.g., personality tests measuring self-esteem and self-confidence at
the same time). The idiosyncratic variances in Σ are unbalanced to vary the proportion of
noise affecting each measurement. The intercept terms are set to 0 to allow comparison with
the Gibbs sampler on the unrestricted Dirichlet process mixture model later in this section,
but these zero restrictions are not required in our approach.
The distribution of the latent factors is specified as a mixture of three Gaussian distri-
butions, parametrized as follows in the expanded version of the model:
\[
p_1 = 0.4, \qquad p_2 = 0.3, \qquad p_3 = 0.3,
\]
\[
\mu_1 = \begin{pmatrix} 0 & 0 \end{pmatrix}, \qquad
\mu_2 = \begin{pmatrix} 1.4 & -1.4 \end{pmatrix}, \qquad
\mu_3 = \begin{pmatrix} 1.4 & 1.4 \end{pmatrix},
\]
\[
\Phi_1 = \begin{pmatrix} 0.7 & 0.0 \\ 0.0 & 0.7 \end{pmatrix}, \qquad
\Phi_2 = \begin{pmatrix} 0.8 & 0.4 \\ 0.4 & 0.8 \end{pmatrix}, \qquad
\Phi_3 = \begin{pmatrix} 0.8 & -0.4 \\ -0.4 & 0.8 \end{pmatrix}.
\]
The mixture parameters are transformed according to Eqs. (12) and (13), using µ and D as
defined in Eqs. (9) and (10) with βk = pk, for k = 1, 2, 3, to standardize the latent factors
to have zero means and unit variances. The resulting joint distribution of the factors is
displayed in Fig. 1. It is not unlikely to encounter such a distribution in practice, where for
a low level of the first trait θ1, the population has a unimodal distribution conditional on the
other trait (θ2 | θ1), while this conditional distribution becomes bimodal on the other end of
9. This number of observations is similar to that of our real data set used in Section 4.2.
Figure 1: True joint distribution of the latent factors in the simulation study.

[Figure: three-dimensional surface and contour plot of the joint density of factor 1 and factor 2, both ranging from −2 to 2.]
the distribution of the first trait θ1. Common methods traditionally used in the empirical literature (i.e., standard factor analysis) cannot uncover such features of the data, and may thus generate misleading results.
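To make the standardization of the simulated factors concrete, the following sketch generates factors from the three-component mixture of Section 4.1.1 and rescales them to zero mean and unit variance, assuming that µ and D in Eqs. (9) and (10) reduce here to the mixture mean and the diagonal of the mixture covariance (law of total variance); this is an illustration, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(3)

# Expanded-model mixture from the simulation setup (Section 4.1.1).
p = np.array([0.4, 0.3, 0.3])
mu = np.array([[0.0, 0.0], [1.4, -1.4], [1.4, 1.4]])
Phi = np.array([[[0.7, 0.0], [0.0, 0.7]],
                [[0.8, 0.4], [0.4, 0.8]],
                [[0.8, -0.4], [-0.4, 0.8]]])

# Mixture mean m = sum_k p_k mu_k, and diagonal D of the mixture covariance
# sum_k p_k (Phi_k + mu_k mu_k') - m m' (law of total variance).
m = p @ mu
S = sum(p[k] * (Phi[k] + np.outer(mu[k], mu[k])) for k in range(3)) - np.outer(m, m)
D = np.diag(S)

# Simulate factors in the expanded model and standardize them to zero
# mean and unit variance, mirroring the transformation in Eqs. (12)-(13).
N = 2000
G = rng.choice(3, size=N, p=p)
theta = np.array([rng.multivariate_normal(mu[k], Phi[k]) for k in G])
theta = (theta - m) / np.sqrt(D)
```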
Identification. As explained in Section 2.2, each latent factor needs at least two dedicated
measurements to achieve nonparametric identification. The true factor loading matrix Λ
specified in Eq. (25) has three measurements loading exclusively on each factor, so it is sufficient to constrain these zero elements to their true values to identify the model nonparametrically.
Prior specification. The hyperparameters used in our simulation study are specified in
Table 1. The joint prior distribution of the mixture parameters µk and Φk depends on several
hyperparameters and has a complicated expression in the identified model, see Section 2.4.2
and Eq. (A1). It can, however, easily be simulated to understand the role of these prior
parameters.10 The two most important ones are the scale A0 of the prior covariance matrix
of the mixture means, and the number of degrees of freedom ν0 of the mixture covariance
matrices.
10. To do so, the mixture parameters are first sampled in the expanded version of the model (which is straightforward to do) and then transformed through Eqs. (12) and (13) to obtain the corresponding parameters in the identified version of the model.
Table 1: Hyperparameter specification in the simulation study.

Parameters                              Hyperparameter values
Intercept terms δ                       c0 = 10.0
Factor loadings Λ                       d0 = 10.0
Idiosyncratic variances Σ               a0 = 2, b0 = 1.0
Mixture means µk                        A0 = 1.0
Mixture covariance matrices Φk          ν0 = 3, s0 = 1.0
Concentration parameter α               g0 = 1.0, h0 = 1.0
Figure 2 shows the prior distributions of the parameters of the first mixture component
in the identified model, as well as of the corresponding correlation between the latent factors,
for different values of A0 and ν0 and keeping the remaining prior parameters fixed. While
both parameters have an impact on the prior of the correlation between the factors (see
left column, where larger values of A0 and smaller values of ν0 induce a larger correlation)
and on the mixture variances (right column), mixture means are only influenced by A0
(middle column). This scale parameter A0 affects mixture means and variances in opposite
directions: larger values of A0 imply more diffuse priors for the mixture means and priors for the mixture variances that are more concentrated towards zero, and vice versa, because of the identification restrictions that tie these parameters together. For the variance of the mixture component, the peak of the prior distribution observed at 1 in the right column of Fig. 2 is due to the cases where a single mixture component is simulated, which happens with prior probability 11% in this setup (see the right panel of Fig. 3). In these simulations, the concentration parameter α of the Dirichlet process is simulated from its prior as well, and therefore has an impact on the induced prior of the number of mixture components. Figure 3 shows the corresponding prior density of α, as well as the resulting numbers of mixture components (both displayed in gray), using a sample size N = 2,000.
Based on this simulation of the prior distribution, our specification in Table 1 appears rather noninformative. The correlation between the latent factors, with its inverted-U-shaped distribution, is bounded away from extreme cases of perfect collinearity, but the prior is still broad enough to allow a wide range of correlations. The prior on the concentration parameter α, which has been used in previous studies (see, e.g., Yau et al., 2011), favors rather small numbers of mixture components, without being too informative about this number.
MCMC tuning and inference. We run our algorithm with the retrospective sampler
for the infinite Dirichlet process (i.e., Algorithms 1 and 3), for a total number of 120, 000
Figure 2: Induced prior distribution on the parameters of the first mixture component in the identified model (k = 1), for different values of A0 and of ν0.

[Figure: two rows of three density panels (factor correlation, mixture component mean, mixture component variance); the top row varies A0 ∈ {0.1, 1, 10, 50} with ν0 = 3, the bottom row varies ν0 ∈ {2, 3, 4, 5} with A0 = 1.]

Notes: This figure shows the induced prior distribution of the correlation between the two factors, of the first mixture mean µ1[1], and of the mixture variance Φ1[11] of the first mixture component (k = 1). The last two columns would look similar for µ1[2] and Φ1[22], because of the symmetry of the prior. Prior parameters are specified as in Table 1, except for A0 and ν0, which are varied as indicated in the legends. The concentration parameter α and the corresponding number of mixture components are simulated from their priors using N = 2,000 observations. Simulations done with 100,000 random draws.
iterations, and discard the first 20,000 as a burn-in period. A sign switch is performed a posteriori on the factor loading matrix, mixture means and mixture covariances, to ensure that the model is identified with respect to the signs of the latent factors and factor loadings (see Frühwirth-Schnatter and Lopes, 2010; Conti et al., 2014). More precisely, signs are switched such that the first nonzero element in each column of Λ is always positive across MCMC iterations. This simple transformation is innocuous for the interpretation of the results.
4.1.2 Simulation results
First, we look at how the concentration parameter of the Dirichlet process is inferred from
the data, and at the number of mixture components generated by the algorithm. Figure 3
plots the posterior distributions of α (left panel) and of the number of non-empty mixture
components (right panel), against their corresponding prior distributions. A learning process
is clearly operating, as the posterior of α is concentrated around its mode at 0.42, and looks
different from the prior. The true number of mixture components (K = 3) is sampled
by the algorithm with posterior probability 0.135, which in this particular data set is not
the highest one. Models with larger numbers of mixture components are often visited,
but this overfitting is mostly due to small mixture components introduced during sampling
and reflects the noise in the data. The numbers of observations in the six largest mixture
components are, respectively, equal to 829, 606, 467, 71, 18 and 5, thus showing that three
mixture components dominate.
To get an idea of the fit of the estimated distribution to the true distribution of the latent factors, we rely on their posterior predictive distribution. More precisely, we plot this distribution, which is bivariate in our example, over a grid of $L_1 L_2$ pairs of points $x_{l_1,l_2} = (x^{1}_{l_1}, x^{2}_{l_2})'$, for $l_1 = 1, \dots, L_1$ and $l_2 = 1, \dots, L_2$. For each pair of points, we compute the probability density function of the mixture of Gaussians corresponding to Eq. (11), repeating this for each MCMC iteration $t = 1, \dots, T$:
\[
f^{(t)}(x_{l_1,l_2}) = \sum_{k \in K^{(t)}} \beta_k^{(t)}\, \phi\big( x_{l_1,l_2};\ \mu_k^{(t)}, \Phi_k^{(t)} \big),
\]
where $\phi(\,\cdot\,; \mu, \Phi)$ is the probability density function of the multivariate normal distribution with mean $\mu$ and covariance matrix $\Phi$. The set of mixture indices $K^{(t)}$, and thus also the number of mixture components, may change across MCMC iterations. In the infinite mixture case used in this experiment, only the alive mixture components are used, such that $\beta_k^{(t)} = N_k^{(t)}/N$, where $N_k^{(t)}$ is the number of observations allocated to mixture component $k$ at iteration $t$. Similarly, the finite mixture case would be accommodated by setting the weights $\beta_k^{(t)}$ equal
Figure 3: Posterior vs. prior distributions of the concentration parameter α of the Dirichlet process and of the number of non-empty mixture components in the simulation study. Model with N = 2,000 observations.

[Figure: left panel, density of the concentration parameter α; right panel, probabilities of the numbers of non-empty mixture components (1 to 17); each panel contrasts posterior and prior.]

Notes: Prior specified with g0 = h0 = 1 and simulated with 100,000 random draws.
to the sampled mixture probabilities $p_k$, for $k = 1, \dots, K$. We then average $f^{(t)}(x_{l_1,l_2})$ over all MCMC iterations for each grid point $(l_1, l_2)$, and also compute 95% highest posterior density intervals. The advantage of this procedure over the traditional approach, which would directly draw future values of θ from the posterior predictive distribution of the factors, is that it allows the conditional distribution of the latent factors to be displayed by fixing one of the two dimensions.
The corresponding results are displayed in Fig. 4. The top panel of this figure shows
that the algorithm recovers the true distribution of the latent factors quite well in comparison to Fig. 1.11 The bottom six panels plot different slices of the joint distribution, which are proportional to the corresponding conditional distributions f(θ1 | θ2) and f(θ2 | θ1) for different values of θ1 and θ2, together with the corresponding 95% highest posterior density intervals. Overall, the fit appears to be very good.
To gain more insights into the performance of our approach, we now repeat this experi-
ment and compare the results to those obtained from alternative approaches.
11. Note that for the joint distribution of the factors in the top panel, we do not show the highest posterior density intervals, as this would make the three-dimensional figure too difficult to read.
Figure 4: Joint posterior distribution of the latent factors in the simulation study. Model with N = 2,000 observations.

[Figure: top panel, three-dimensional surface of the joint posterior density of the two factors; below, six panels showing the slices f(θ1, θ2 = −1), f(θ1, θ2 = 0), f(θ1, θ2 = 1) and f(θ2, θ1 = −1), f(θ2, θ1 = 0), f(θ2, θ1 = 1), each comparing the posterior with the true density and 95% highest posterior density intervals.]
4.1.3 Monte Carlo experiment
We carry out a Monte Carlo experiment using the setup laid out above for models with
N = 100 and 2, 000 observations, for a total number of 100 replications each. We compare
the results obtained from the five following settings, where the one used in Section 4.1.1
corresponds to P5a:
P1 Gibbs sampler on the unrestricted truncated Dirichlet process (K = 5), using a post-processing stage to restore identification.12

P2 Truncated Dirichlet process Gaussian mixture model with MDA (K = 5), using the mixture weights p for the computation of the working parameters.

P3 Truncated Dirichlet process Gaussian mixture model with MDA (K = 5), using the observed mixture frequencies w for the computation of the working parameters.

P4 Gibbs sampler on the unrestricted infinite Dirichlet process combined with the retrospective sampler. Parameter restrictions on the intercept terms (δ = 0) and on two factor loadings (λ11 = λ42 = 1) set the scale and location of the factors.

P5 Infinite Dirichlet process Gaussian mixture model with MDA, using either the retrospective sampler (P5a) or Algorithms 7 and 8 of Neal (2000) (P5b and P5c, respectively).
These five approaches imply five different prior distributions, hence the labels P1–P5. Settings P2, P3 and P5 correspond to the methods introduced in the present paper. All approaches are intrinsically different, not only in how they preserve the original prior in the identified model or induce a different one, but also in how they achieve identification, and in whether they approximate the Dirichlet process (truncated versions) or deal directly with the infinite case. For all methods relying on a truncated version of the Dirichlet process, we use an upper bound of 5 mixture components.
All these settings allow inference on the full model with location and scale restrictions on the distribution of the latent factors to achieve identification, except P4: as it is not possible to sample the mixture parameters sequentially while ensuring at the same time that the identification restrictions on the means and variances of the factors are fulfilled, we instead set the intercept terms δ to zero and fix one element in each column of the factor loading matrix, i.e., λ11 = λ42 = 1. These types of restrictions, widely used in practice, are sufficient for identification but put an additional burden on the model. For example, it
12. Similar to Yang et al. (2010), but using only the variances of the factors as working parameters, for identification reasons; see Section 2.4.2.
might be too restrictive in some applications to fix some of the loadings to 1 a priori (see the discussion in Section 2.2).
To assess the results of the different approaches, we compare their efficiency in Figs. 5 and 6. These boxplots summarize inefficiency factors for selected parameters of the model and for the deviance of the estimated distributions of the latent factors θ and of the manifest variables Y. The inefficiency factor is a popular statistic used to monitor the mixing of the Markov chain. Computed as the inverse of the relative numerical efficiency (Geweke, 1989), it measures the number of MCMC iterations required by the sampler to provide the same numerical accuracy as a hypothetical independent and identically distributed (iid) sample from the target distribution.13 Unfortunately, the inefficiency factor is notoriously difficult to estimate, and can be unstable depending on the available number of MCMC iterations; see, for example, the discussion in Sokal (1997). We investigated this problem and found that although it can produce outliers, as seen in Figs. 5 and 6, it does not affect the overall picture or the general conclusions of our simulation study.
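To illustrate how this statistic behaves, the following Python sketch (our own illustrative code, not part of the paper's Fortran implementation; all names are ours) estimates the inefficiency factor of a chain from its empirical autocorrelations, truncating the sum at the first non-positive lag. More careful window choices are discussed in Sokal (1997).

```python
import numpy as np

def inefficiency_factor(chain, max_lag=None):
    """Inefficiency factor: 1 + 2 * sum of autocorrelations,
    truncated at the first non-positive empirical autocorrelation."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    if max_lag is None:
        max_lag = n // 10
    x = x - x.mean()
    var = np.dot(x, x) / n
    factor = 1.0
    for lag in range(1, max_lag + 1):
        rho = np.dot(x[:-lag], x[lag:]) / (n * var)
        if rho <= 0:  # crude initial-positive-sequence cutoff
            break
        factor += 2.0 * rho
    return factor

rng = np.random.default_rng(0)
# iid draws: inefficiency factor close to 1.
if_iid = inefficiency_factor(rng.standard_normal(50_000))
# AR(1) chain with phi = 0.9: theoretical value (1 + phi) / (1 - phi) = 19.
ar = np.empty(50_000)
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
if_ar = inefficiency_factor(ar)
```

An iid chain yields a value near 1, while the strongly autocorrelated AR(1) chain yields a much larger value, which is exactly the instability-versus-mixing trade-off the boxplots summarize.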
The deviance is a function of several relevant model parameters that summarizes the
accuracy of the approximation of the corresponding distributions. It has been used as a
measure of fit by Neal (2000), Green and Richardson (2001), and Papaspiliopoulos and
Roberts (2008) in their comparison studies. It is calculated as:
\[
D(\theta) = -2\sum_{i=1}^{N} \log\left[\sum_{k\in\mathcal{I}(a_l)} \frac{N_k}{N}\, f(\theta_i \mid \mu_k, \Phi_k)\right],
\qquad
D(Y) = -2\sum_{i=1}^{N} \log\left[\sum_{k\in\mathcal{I}(a_l)} \frac{N_k}{N}\, f(Y_i \mid \delta, \Lambda, \Sigma, \mu_k, \Phi_k)\right],
\]
using the true values of the simulated factors in the computation of D(θ). These two deviances can be evaluated at each iteration of the MCMC sampler, using the corresponding draws of the model parameters. We thereby obtain a posterior distribution of these two statistics, which we use to compute the corresponding inefficiency factors.
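A minimal sketch of the deviance computation for one MCMC draw, assuming a generic set of mixture means, covariances, and occupation counts N_k (the function names are ours, and the log-sum-exp trick is used for numerical stability):

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Naive multivariate normal log-density."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.inv(cov) @ diff)

def deviance(theta, means, covs, counts):
    """D(theta) = -2 sum_i log sum_k (N_k / N) f(theta_i | mu_k, Phi_k)."""
    N = counts.sum()
    total = 0.0
    for th in theta:
        logs = np.array([np.log(nk / N) + mvn_logpdf(th, mu, cov)
                         for nk, mu, cov in zip(counts, means, covs)])
        top = logs.max()  # log-sum-exp for numerical stability
        total += top + np.log(np.exp(logs - top).sum())
    return -2.0 * total

# Toy evaluation with a single bivariate standard-normal component.
rng = np.random.default_rng(1)
theta = rng.standard_normal((200, 2))
D = deviance(theta, means=[np.zeros(2)], covs=[np.eye(2)],
             counts=np.array([200]))
```

Evaluating this quantity at every iteration, with the corresponding parameter draws, produces the posterior distribution of the deviance used in the inefficiency-factor comparisons.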
Figures 5 and 6 reveal that overall, some of the approaches provide comparable results,
but others show some marked differences that are worth highlighting and explaining. First,
the inefficiency factors are very similar across approaches for the structural part of the model,
as can be seen from the three top panels for the selected parameters of the factor model.
The only exceptions are settings P1 and P4 that rely on a Gibbs sampler for the unrestricted
13. Lower inefficiency factors are better. For example, with an inefficiency factor of 5, 50,000 draws are required to provide a numerical accuracy equivalent to the one that could ideally be obtained with 10,000 iid draws.
Figure 5: Boxplots of inefficiency factors for selected parameters and statistics of interest. Monte Carlo experiments with 100 data sets of N = 100 observations each.
[Six boxplot panels: inefficiency factors for δ2, λ21, and σ22 (top row) and for α, D(Y), and D(θ) (bottom row), plotted across settings P1–P5c; the legend distinguishes truncated from infinite Dirichlet process samplers.]
Notes: Boxplots in the style of Tukey (1977): 25/50/75th percentiles (box), 1.5 inter-quartile range (whiskers), and outliers (dots). Inefficiency factors for one intercept term (δ2), one factor loading (λ21), one idiosyncratic variance (σ22), for the concentration parameter α of the Dirichlet process, and for the deviance of the distribution of the manifest variables (D(Y)) and the distribution of the latent factors (D(θ)). Settings: P1: Unrestricted truncated Dirichlet process with post-processing stage; P2: Truncated Dirichlet process Gaussian mixture model using mixture weights p for computation of working parameters; P3: Same as P2 but using observed mixture frequencies w; P4: Gibbs sampler on unrestricted infinite Dirichlet process, with parameter restrictions on intercept terms and two loadings for identification; P5: Infinite Dirichlet process Gaussian mixture model with MDA with retrospective sampler (P5a), Algorithm 7 (P5b) or 8 (P5c) of Neal (2000). See beginning of Section 4.1.3 for full details. For P4, the intercepts are fixed to 0 for identification purposes, hence the lack of boxplot for δ2. Monte Carlo experiments based on 100 replications, using the same 100 data sets across the seven different settings.
Figure 6: Boxplots of inefficiency factors for selected parameters and statistics of interest. Monte Carlo experiments with 100 data sets of N = 2000 observations each.
[Six boxplot panels: inefficiency factors for δ2, λ21, and σ22 (top row) and for α, D(Y), and D(θ) (bottom row), plotted across settings P1–P5c; the legend distinguishes truncated from infinite Dirichlet process samplers.]
Notes: Boxplots in the style of Tukey (1977): 25/50/75th percentiles (box), 1.5 inter-quartile range (whiskers), and outliers (dots). Inefficiency factors for one intercept term (δ2), one factor loading (λ21), one idiosyncratic variance (σ22), for the concentration parameter α of the Dirichlet process, and for the deviance of the distribution of the manifest variables (D(Y)) and the distribution of the latent factors (D(θ)). Settings: P1: Unrestricted truncated Dirichlet process with post-processing stage; P2: Truncated Dirichlet process Gaussian mixture model using mixture weights p for computation of working parameters; P3: Same as P2 but using observed mixture frequencies w; P4: Gibbs sampler on unrestricted infinite Dirichlet process, with parameter restrictions on intercept terms and two loadings for identification; P5: Infinite Dirichlet process Gaussian mixture model with MDA with retrospective sampler (P5a), Algorithm 7 (P5b) or 8 (P5c) of Neal (2000). See beginning of Section 4.1.3 for full details. For P4, the intercepts are fixed to 0 for identification purposes, hence the lack of boxplot for δ2. Monte Carlo experiments based on 100 replications, using the same 100 data sets across the seven different settings.
Dirichlet process: inefficiency appears to be three to six times lower for the factor loading λ21 in the experiments with N = 2,000 observations, but less so with N = 100. This efficiency loss affecting our approaches (P2/P3/P5) can be explained by the marginal data augmentation scheme we rely on, where the prior distribution of the working parameters is a conditional prior distribution: it includes the mixture parameters of the identified model ϑ in its conditioning set, see Eqs. (21) and (22). This does not invalidate the approach (van Dyk, 2010), but slightly deteriorates the efficiency of the algorithm. This is the price to pay for a sampler that preserves the original prior distribution of the structural part of the factor model (especially for the loadings, contrary to P1), and that does not require additional parameter restrictions on these parameters (contrary to P4). This computational cost, however, turns out to be rather modest given the benefits of our approaches.
The efficiency of the samplers for the nonparametric part of the model can be assessed from the three bottom panels of Figs. 5 and 6. The concentration parameter α of the Dirichlet process influences the number of mixture components. Therefore, the corresponding inefficiency factor gives an idea of how well the sampler manages to introduce and remove mixture components to fit the data nonparametrically. Clearly, the approaches based on an infinite Dirichlet process are outperformed by those using a truncated version of the process. This result, however, is not reflected in the approximation of the distributions of the latent factors θ and of the manifest variables Y: the corresponding deviances D(θ) and D(Y) are fairly similar across approaches, with the exception of P2 (truncated Dirichlet process using the mixture probabilities p to compute the working parameters), which becomes much more efficient when N increases. The downside of setting P2 seems to be a larger variability of the efficiency for the factor loadings and idiosyncratic variances, as revealed by the higher boxes for λ21 and σ22 in Fig. 6. This could indicate a trade-off between the efficiency of the sampler for the inference of the structural part of the model (factor model) and its efficiency for the inference of the nonparametric part (distribution of the latent factors).
Finally, it is interesting to note that the conditional approach (P5a, retrospective sampler) and the marginal approaches (P5b/P5c, Algorithms 7 and 8 of Neal, 2000) provide similar results in terms of deviance of the estimated distributions of θ and Y, independently of the number of observations. However, the latter are slightly more efficient when it comes to the inference of the number of mixture components, as shown by the inefficiency factors of the concentration parameter α. This observation was also made by Papaspiliopoulos and Roberts (2008). It also appears that increasing the number of observations improves the efficiency of the sampler for the conditional approaches (P4/P5a), whereas the marginal approaches (P5b/P5c) remain stable when N changes (compare the bottom right panels of Figs. 5 and 6).
These simulation results shed light on the properties of our approach, how it compares to alternative approaches, and how the different versions of our algorithm perform. In this respect, they provide guidance on the choice of the version to use, depending on the requirements of the model to be estimated (e.g., relevance of using an infinite Dirichlet process rather than a truncated one, importance of using a conditional vs. a marginal approach). Most importantly, bear in mind that the algorithms we propose are the only ones that safeguard the identification of the model without resorting to additional restrictions, while preserving the original prior distribution of the model parameters at the same time.
4.2 Empirical example
Many empirical applications in economics rely on the assumption of normality of the latent
factors. This usually makes inference straightforward to carry out, and facilitates interpre-
tation. It is, however, reasonable to question the relevance of this assumption in practice.
To illustrate this problem, we estimate a simple factor model using data from the British
Cohort Study (BCS).
Data. The British Cohort Study (BCS) is a longitudinal survey that follows all babies born in one particular week of April 1970 in the United Kingdom. It includes a large number of measurements on cognitive abilities, socio-emotional traits, and behavioral and physical development at different stages of the life cycle of the surveyed individuals, and therefore represents a unique opportunity for psychologists and economists to study human capital development. This data set has been used in economics, for example, by Conti et al. (2014) and Uysal (2015).
For the sake of simplicity, we restrict our analysis to two dimensions and focus on cognitive
ability and behavioral problems in this section. The first factor is measured by 7 test scores,
while the second is captured by 16 measurements related to the Rutter and Conners scales.
The sample contains 2,080 individuals.
Inference. We run our algorithm with the retrospective sampler for the inference of the
infinite Dirichlet process mixture model (P5a). We do not incorporate any strong prior
information into the model and use the same prior specification as in our simulation study,
see Table 1. To identify the structural part of the model, a dedicated structure is assumed,
where the cognitive measurements load on the first factor and the behavioral problem measurements load on the second one. We therefore create two clusters of measurements, and the underlying latent factors are allowed to be correlated. The sampler is run for 1,020,000 iterations, and the first 20,000 are discarded as a burn-in period. The sampling is repeated ten times with different starting values. The large number of MCMC iterations and the different starting values are used to make sure convergence is achieved. In the following, we only present the results from a single run to keep the figures simple; the results are virtually identical across the ten runs.

Figure 7: Posterior joint distribution of the latent factors in the empirical example with the BCS data.
[Surface plot of the joint posterior density of factor 1 (cognition) and factor 2 (behavioral problems), both shown over roughly the range −2 to 2.]
Empirical results. Figure 7 shows the joint posterior distribution of the latent factors.
This distribution exhibits several modes, and a fat tail for the factor θ2 capturing behavioral
problems. The multiplicity of modes can also be seen in Fig. 8, which shows that the
algorithm visits models with 6 mixture components most often, with a posterior probability of
0.17. Overall, the retrospective sampler switches very quickly between models with different
numbers of mixture components, as shown by the trace plot at the bottom of this figure.
Models containing between 3 and 35 mixture components are visited, but the sampler favors
relatively sparse solutions, as models with up to 10 mixture components are produced with
posterior probability 79%. This result is confirmed by looking at the posterior distribution
of the concentration parameter α of the Dirichlet process, shown in the upper left panel of
this figure. Its mode is equal to 0.73, which shows evidence for a rather small number of
mixture components compared to the number of observations.
This simple example reveals that the normality assumption is likely to be violated in this data set. Relaxing this distributional assumption allows the sampler to explore alternative solutions with non-standard distributions that are supported by the data. The misspecification of the model that results from the standard Gaussian assumption can potentially contaminate the interpretation of the results, and can also affect estimation if this model is subsequently used to measure the impact of these latent factors on economic outcomes.
Figure 8: BCS data: Posterior distribution of the concentration parameter of the Dirichlet process, and posterior distribution and trace plot of the number of non-empty mixture components.
5 Conclusion
This paper introduces a new approach to factor analysis with non-normal factors that draws on the literature on Bayesian nonparametric methods. It extends these approaches by placing the formal identification of the factor model at the core of the inferential procedure, guaranteeing that the algorithm only produces identified models during sampling. This is achieved by implementing a new sampling scheme for mixtures of normals with location and scale restrictions based on marginal data augmentation, combined with a retrospective MCMC sampler for the Dirichlet process mixture model.
A simulation study is carried out and provides very encouraging results. The sampler successfully retrieves the distribution of the latent factors nonparametrically, and exhibits good properties in terms of convergence and mixing. A real data example illustrates the relevance of the methodology: the latent factors extracted from the data appear to be highly non-normal. This provides evidence that the normality assumption can be questioned in practice. We therefore advocate relaxing this assumption whenever possible, and leave it for further research to investigate the impact a potential misspecification may have on the results.
Combining our Bayesian nonparametric approach with other approaches in factor analysis
has great potential to allow for the inference of richer structures that can better explain
empirical data. Embedding a nonparametric approach into a structural model, however,
raises some questions about the properties of the resulting sampler, especially in terms of
efficiency. These important questions are not limited to our particular setup, but are likely
to arise in any framework where a structural model is augmented with a nonparametric
approach for the estimation of the unknown distribution of one of its components. These
questions are currently being investigated further in ongoing projects.
Acknowledgments
This paper was previously circulated under the title “A Bayesian Nonparametric Approach
to Factor Analysis with Non-Gaussian Factors”. It was presented at the European Seminar on Bayesian Econometrics (ESOBE, Venice, Italy), at the 69th European Meeting of the Econometric Society (ESEM, Geneva, Switzerland), at the World Meeting of the International Society for Bayesian Analysis (ISBA 2016, Sardinia, Italy), at the Department of Economics Seminar at The University of Sydney (Australia), at the Bayesian Analysis and Modeling Summer Workshop at The University of Melbourne (Australia), and at the Research Workshop of the Centre for Applied Microeconometrics (CAM, Copenhagen, Denmark). The authors are very grateful for all the comments received at these conferences and seminars, which helped substantially improve the paper.
Computations were made with our own code written in Fortran 2008, combined with the R programming language (R Core Team, 2017).14 Graphics were generated with the R package ggplot2 (Wickham, 2009).
Remi Piatek’s research was funded by the Danish Council for Independent Research and
the Marie Curie programme COFUND under the European Union’s Seventh Framework
Programme for research, technological development and demonstration, Grant-ID DFF—
4091-00246.
A Prior distribution
Proposition 2. Consider the parameters µ and D defined in Eqs. (9) and (10) and the one-to-one mappings between the expanded-model parameters and ϑ as defined in Eqs. (12) and (13). Then, the normal-inverse-Wishart prior distribution specified on ϑk = {µk, Φk} in Eqs. (4) and (5), for k ∈ K, where in the finite case K = {1, . . . , K} and in the infinite case K = I(a_l), implies that

\[
f(\vartheta \mid \nu_0, A_0, \beta) \propto
\Bigg|\sum_{k\in\mathcal{K}} \Phi_k^{-1}\Bigg|^{-\frac{1}{2}}
\Bigg|\prod_{k\in\mathcal{K}} \Phi_k\Bigg|^{-\frac{\nu_0+P+2}{2}}
\Bigg(\prod_{j=1}^{P} \sum_{k\in\mathcal{K}} \big[\Phi_k^{-1}\big]_{[jj]}\Bigg)^{-\frac{|\mathcal{K}|\nu_0}{2}}
\tag{A1}
\]
\[
\times \exp\Bigg\{-\frac{1}{2A_0}\Bigg[\sum_{k\in\mathcal{K}} \mu_k'\Phi_k^{-1}\mu_k
- \Bigg(\sum_{k\in\mathcal{K}} \Phi_k^{-1}\mu_k\Bigg)'
\Bigg(\sum_{k\in\mathcal{K}} \Phi_k^{-1}\Bigg)^{-1}
\Bigg(\sum_{k\in\mathcal{K}} \Phi_k^{-1}\mu_k\Bigg)\Bigg]\Bigg\}
\times \mathbb{1}\{\mu = 0\}\,\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\},
\]

where [·]_{[jj]} denotes the jth diagonal element of the corresponding matrix, |K| is the cardinality of the set K, and the conditions in the two indicator functions in the last line enforce the constraints on the location and scale of the latent factors, see Eqs. (14) and (15).
Note that the dependence on the mixture weights β = (β1, . . . , βK) is hidden in the
constraints imposed via the indicator functions. Also, note that this density does not depend
on the scaling parameter s0 of the inverse-Wishart distribution of the covariance matrices
in the auxiliary model. This parameter controls the degree of inflation of the parameters in
the augmented model, but has no influence on the prior distribution of the parameters in
the identified model.
14. The methodology introduced in this paper will be released as an extension to the R package BayesFM
available on CRAN at https://cran.r-project.org/package=BayesFM upon publication of this article.
A1 Proof of Propositions 1 and 2
Induced joint prior distribution. The joint distribution of {µ, D, µK, ΦK} is derived from the distribution of the mixture parameters in the expanded model using a transformation of random variables. The restrictions in Eqs. (14) and (15) imply that:

\[
f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})
= f(\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})\,
\underbrace{f(\mu_k \mid \mu, D, \mu_{-k}, \Phi_{\mathcal{K}})}_{\mathbb{1}\{\mu = 0_P\}}\,
\underbrace{f(\Phi_k^{D} \mid \mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})}_{\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\}}
\tag{A2}
\]

where µK = {µk}k∈K and ΦK = {Φk}k∈K, and Φ_k^D and Φ_k^L denote, respectively, the diagonal elements and the lower triangular part (excluding the diagonal elements) of Φk. The first density is obtained from the change of variables (µK, ΦK) → (µ, D, µ−k, Φ−k, Φ_k^L), such that:

\[
f(\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})
= f(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,
\mathcal{J}\{(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}}) \to (\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})\},
\tag{A3}
\]

where J{(·) → (·)} is the Jacobian of the corresponding transformation.
Since the mixture parameters {µk, Φk} are assumed to be independent across mixture components and to follow a normal-inverse-Wishart distribution for each k ∈ K in the expanded model, see Eqs. (4) and (5), the joint distribution of the corresponding parameters in the identified model µK and ΦK and of the working parameters µ and D is derived from Eqs. (A2) and (A3) as follows, and without loss of generality:15

\[
\begin{aligned}
f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})
&= f(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,
\mathcal{J}\{(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}}) \to (\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})\}
\times \mathbb{1}\{\mu = 0_P\}\,\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\},\\
&\propto \prod_{k\in\mathcal{K}} |\Phi_k|^{-\frac{1}{2}}
\exp\Big\{-\frac{1}{2A_0}\,\mu_k'\Phi_k^{-1}\mu_k\Big\}\,
|\Phi_k|^{-\frac{\nu_0+P+1}{2}}
\exp\Big\{-\frac{s_0}{2}\operatorname{tr}\big(\Phi_k^{-1}\big)\Big\}
\times \prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|(P+2)-3}{2}}\,
\mathbb{1}\{\mu = 0_P\}\,\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\},\\
&\propto \exp\Big\{-\frac{1}{2A_0}\Big(\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\Big) D^{-\frac{1}{2}}\mu
+ 2\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\mu_k\Big)\Big)\Big\}
&& \text{(A4)}\\
&\quad\times \prod_{j=1}^{P} D_j^{-\frac{|\mathcal{K}|\nu_0+1}{2}-1}
\exp\Big\{-\frac{s_0}{2}\sum_{k\in\mathcal{K}}\big[\Phi_k^{-1}\big]_{[jj]} D_j^{-1}\Big\}
&& \text{(A5)}\\
&\quad\times \prod_{k\in\mathcal{K}} |\Phi_k|^{-\frac{\nu_0+P+2}{2}}
\exp\Big\{-\frac{\mu_k'\Phi_k^{-1}\mu_k}{2A_0}\Big\}
\times \mathbb{1}\{\mu = 0_P\}\,\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\},
\end{aligned}
\]

where the Jacobian of the transformation is proportional to \(\prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|(P+2)-3}{2}}\), see Appendix A2.
The kernel can be factorized as

\[
f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})
= f(\mu \mid D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,
f(D \mid \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,
f(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}}),
\]

and the three distributions on the right-hand side can be retrieved as follows. The conditional distribution of µ is obtained from Eq. (A4), which is the kernel of a Gaussian distribution:

\[
\mu \mid D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}}, A_0 \sim
\mathcal{N}\Bigg(
-D^{\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\Big)^{-1}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\mu_k\Big);\;
A_0\, D^{\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\Big)^{-1} D^{\frac{1}{2}}
\Bigg).
\]

15. Because of the identification constraints on the mixture means and variances, the mean and variance of one mixture component are redundant and can be discarded. Which component is discarded does not affect the results.
The conditional distribution of D is obtained by integrating out µ, using the kernel in Eq. (A5) and completing the normalizing constant that depends on D in Eq. (A4):

\[
\begin{aligned}
f(D \mid \mu_{\mathcal{K}}, \Phi_{\mathcal{K}}, \nu_0, s_0)
&\propto \int f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,\mathrm{d}\mu,\\
&\propto |D|^{\frac{1}{2}} \prod_{j=1}^{P} D_j^{-\frac{|\mathcal{K}|\nu_0+1}{2}-1}
\exp\Big\{-\frac{s_0}{2}\sum_{k\in\mathcal{K}}\big[\Phi_k^{-1}\big]_{[jj]} D_j^{-1}\Big\},\\
&\propto \prod_{j=1}^{P} D_j^{-\frac{|\mathcal{K}|\nu_0}{2}-1}
\exp\Big\{-\frac{s_0}{2}\sum_{k\in\mathcal{K}}\big[\Phi_k^{-1}\big]_{[jj]} D_j^{-1}\Big\},
\end{aligned}
\]

which results in a product of kernels of inverse-Gamma distributions:

\[
D_j \mid \Phi_{\mathcal{K}}, \nu_0, s_0 \sim
\mathcal{IG}\Bigg(\frac{|\mathcal{K}|\nu_0}{2};\;
\frac{s_0}{2}\sum_{k\in\mathcal{K}}\big[\Phi_k^{-1}\big]_{[jj]}\Bigg),
\]

for j = 1, . . . , P.
Finally, the kernel of the marginal distribution of the mixture parameters in the identified model is obtained by integrating both µ and D out of the joint distribution:

\[
f(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}} \mid A_0, \nu_0) \propto
\iint f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,\mathrm{d}\mu\,\mathrm{d}D,
\]

which produces the kernel in Eq. (A1).
A2 Jacobian of the transformation
The Jacobian corresponding to the change of variables that moves from the expanded model to the identified model can be derived in several steps. Because of the restrictions on the parameters of the identified model (µ = 0_P and diag(Φ) = ι_P), one of the mixture means and the diagonal elements of one of the covariance matrices are redundant in the parameter transformation and can be left aside in the derivation. The subscript −k indicates that the kth element of the corresponding set is left out, e.g., µ−k = {µl | l ∈ K, l ≠ k}. We denote by Φ_k^L the lower triangular elements of Φk, excluding the diagonal elements. Without loss of generality, we derive the Jacobian for the case where the mean and the diagonal elements of the covariance matrix of the kth mixture component are left aside:

\[
\begin{aligned}
\mathcal{J}\{(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}}) \to (\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})\}
&= \Big(\frac{1}{p_k}\Big)^{P}
\times \Big(\frac{1}{p_k}\Big)^{\frac{P(P+1)}{2}}
\times \prod_{j=1}^{P} D_j^{\frac{P-1}{2}}
\times \prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|-1}{2}}
\times \prod_{j=1}^{P} D_j^{\frac{(P+1)(|\mathcal{K}|-1)}{2}}
\times p_k^{\frac{P(P-1)}{2}}
&& \text{(A6)}\\
&= \Big(\frac{1}{p_k}\Big)^{2P} \prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|(P+2)-3}{2}},\\
&\propto \prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|(P+2)-3}{2}},
\end{aligned}
\]

where line (A6) collects the Jacobians of the elementary changes of variables into which the full transformation decomposes; the factor corresponding to the step that introduces the working parameters D is derived as in Zhang et al. (2006).
B Details on MCMC Sampler
This appendix provides technical details on the MCMC sampler. These steps are presented in generic form and apply to both the finite and the infinite case, where in the former K = {1, . . . , K} and in the latter K = I(a_l), and |K| is the cardinality of K.
B1 Sampling the idiosyncratic variances (step 1)
The inverse-Gamma prior in Eq. (18) provides the following posterior, for q = 1, . . . , Q:

\[
\sigma_q^2 \mid Y, \theta, \delta, \Lambda, a_0, b_0 \sim
\mathcal{IG}\Bigg(a_0 + \frac{N}{2};\;
b_0 + \frac{1}{2}\sum_{i=1}^{N}\big(Y_{qi} - \delta_q - \Lambda_q'\theta_i\big)^2\Bigg).
\tag{B1}
\]
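A minimal sketch of this update, assuming the residuals Y_{qi} − δ_q − Λ_q'θ_i have already been computed (the helper name draw_idio_var is ours); an inverse-Gamma draw is obtained as the reciprocal of a Gamma draw:

```python
import numpy as np

def draw_idio_var(rng, resid_q, a0, b0):
    """sigma_q^2 | ... ~ IG(a0 + N/2, b0 + 0.5 * sum of squared residuals).
    numpy's Gamma is parametrized by (shape, scale), so scale = 1/rate."""
    N = len(resid_q)
    shape = a0 + 0.5 * N
    rate = b0 + 0.5 * np.dot(resid_q, resid_q)
    return 1.0 / rng.gamma(shape, 1.0 / rate)

# Toy residuals with unit variance: posterior draws concentrate near 1.
rng = np.random.default_rng(7)
resid = rng.standard_normal(500)
draws = np.array([draw_idio_var(rng, resid, a0=2.0, b0=1.0)
                  for _ in range(200)])
```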
41
B2 Sampling the latent factors (step 2b)
For i = 1, . . . , N :
θi | Yi, Gi, ϑ, δ, Λ,Σ ∼ N (Bθibθi ; Bθi) , B−1θi = Λ′Σ−1Λ+ Φ−1Gi , (B2)
bθi = Λ′Σ−1(Yi − δ) + Φ−1GiµGi .
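The conditional draw of (B2) can be sketched as follows, with illustrative toy values for the structural parameters (all names and dimensions are ours, not the paper's):

```python
import numpy as np

def draw_factor(rng, y, delta, Lam, Sigma_inv, mu_g, Phi_g_inv):
    """One draw of theta_i from N(B b, B), where
    B^{-1} = Lam' Sigma^{-1} Lam + Phi_g^{-1} and
    b = Lam' Sigma^{-1} (y - delta) + Phi_g^{-1} mu_g."""
    B_inv = Lam.T @ Sigma_inv @ Lam + Phi_g_inv
    B = np.linalg.inv(B_inv)
    b = Lam.T @ Sigma_inv @ (y - delta) + Phi_g_inv @ mu_g
    chol = np.linalg.cholesky(B)
    return B @ b + chol @ rng.standard_normal(len(b))

# Toy setup: Q = 3 manifest variables, P = 2 factors (values illustrative).
rng = np.random.default_rng(2)
Lam = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 1.0]])
Sigma_inv = np.eye(3)            # unit idiosyncratic precisions
delta = np.zeros(3)
mu_g = np.zeros(2)
Phi_g_inv = np.eye(2)
y = np.array([1.0, 0.5, -1.0])
theta_draw = draw_factor(rng, y, delta, Lam, Sigma_inv, mu_g, Phi_g_inv)
```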
B3 Sampling the working parameters conditional on the latent
factors in the expanded model (step 3a)
The joint conditional distribution of the working parameters, given their prior distributions expressed in Eqs. (21) and (22), is proportional to:

\[
\begin{aligned}
p(\mu, D \mid \theta, G, \vartheta)
&\propto p(\theta \mid \mu, D, G, \vartheta)\, p(\mu \mid D, \vartheta)\, p(D \mid \vartheta),\\
&\propto \prod_{i=1}^{N} \big| D^{\frac{1}{2}}\Phi_{G_i} D^{\frac{1}{2}} \big|^{-\frac{1}{2}}
\exp\Big\{-\frac{1}{2}\sum_{i=1}^{N}
\big(\theta_i - \mu - D^{\frac{1}{2}}\mu_{G_i}\big)'
\big(D^{\frac{1}{2}}\Phi_{G_i} D^{\frac{1}{2}}\big)^{-1}
\big(\theta_i - \mu - D^{\frac{1}{2}}\mu_{G_i}\big)\Big\}\\
&\quad\times |D|^{-\frac{1}{2}}
\exp\Big\{-\frac{1}{2A_0}\Big[\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\Big) D^{-\frac{1}{2}}\mu
+ 2\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\mu_k\Big)\Big]\Big\}\\
&\quad\times |D|^{-\frac{|\mathcal{K}|\nu_0}{2}-1}
\exp\Big\{-\frac{s_0}{2}\sum_{k\in\mathcal{K}}\operatorname{tr}\big(\Phi_k^{-1} D^{-1}\big)\Big\},\\
&\propto \exp\Big\{-\frac{1}{2}\Big[\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}(N_k + A_0^{-1})\Phi_k^{-1}\Big) D^{-\frac{1}{2}}\mu
&& \text{(B3)}\\
&\qquad\qquad - 2\mu' D^{-\frac{1}{2}}\sum_{k\in\mathcal{K}}\Phi_k^{-1}
\Big(\Big[D^{-\frac{1}{2}}\sum_{i\in I_k}\theta_i\Big] - (N_k + A_0^{-1})\mu_k\Big)\Big]\Big\}\\
&\quad\times |D|^{-\frac{|\mathcal{K}|\nu_0+N+1}{2}-1}
\exp\Big\{-\frac{1}{2}\sum_{k\in\mathcal{K}}\operatorname{tr}\Big(D^{-\frac{1}{2}}\Phi_k^{-1} D^{-\frac{1}{2}}
\Big[\sum_{i\in I_k}\theta_i\theta_i' + s_0 I_P\Big]\Big)
+ \sum_{k\in\mathcal{K}}\mu_k'\Phi_k^{-1} D^{-\frac{1}{2}}\sum_{i\in I_k}\theta_i\Big\}.
\end{aligned}
\]
This provides the kernel of a normal distribution for µ conditional on D and on the remaining parameters:

\[
\mu \mid \theta, D, G, \vartheta \sim
\mathcal{N}\Big(D^{\frac{1}{2}} B_2 \big(B_1(D) - B_3\big);\;
D^{\frac{1}{2}} B_2 D^{\frac{1}{2}}\Big),
\tag{B4}
\]

with:

\[
B_1(D) = \sum_{k\in\mathcal{K}} \Phi_k^{-1} D^{-\frac{1}{2}} \sum_{i\in I_k} \theta_i,
\qquad
B_2^{-1} = \sum_{k\in\mathcal{K}} (N_k + A_0^{-1})\,\Phi_k^{-1},
\qquad
B_3 = \sum_{k\in\mathcal{K}} (N_k + A_0^{-1})\,\Phi_k^{-1}\mu_k.
\]
As for the other working parameters D, the kernel of their conditional distribution is obtained by integrating µ out of the joint distribution, completing the normalizing constant of Eq. (B3):

\[
\begin{aligned}
p(D \mid \theta, G, \vartheta)
&= \int p(\mu, D \mid \theta, G, \vartheta)\,\mathrm{d}\mu,\\
&\propto |D|^{-\frac{|\mathcal{K}|\nu_0+N}{2}-1}
\exp\Big\{\frac{1}{2} B_1(D)' B_2 \big(B_1(D) - 2B_3\big)
- \frac{1}{2}\sum_{k\in\mathcal{K}}\operatorname{tr}\Big(D^{-\frac{1}{2}}\Phi_k^{-1} D^{-\frac{1}{2}}
\Big[\sum_{i\in I_k}\theta_i\theta_i' + s_0 I_P\Big]\Big)\\
&\qquad + \sum_{k\in\mathcal{K}}\mu_k'\Phi_k^{-1} D^{-\frac{1}{2}}\sum_{i\in I_k}\theta_i\Big\},
&& \text{(B5)}
\end{aligned}
\]

which is not the kernel of a known distribution. However, D can be simulated with a Metropolis-Hastings step.
Metropolis-Hastings step to sample D. As a proposal distribution for each of the diagonal elements j = 1, . . . , P of D, a log-normal distribution is used, parametrized such that its mode is equal to D_j:

\[
D_j^{\star} \mid (D_j, \rho^2) \sim \ln\mathcal{N}\big(\ln D_j + \rho^2;\; \rho^2\big),
\qquad
q(D^{\star} \mid D, \rho^2) \propto \prod_{j=1}^{P} \frac{1}{D_j^{\star}}
\exp\Big\{-\frac{1}{2\rho^2}\big(\ln D_j^{\star} - \ln D_j - \rho^2\big)^2\Big\}.
\tag{B6}
\]

The P proposed values D⋆ are accepted as new draws for D with probability:

\[
\alpha(D^{\star} \mid D) = \min\Big\{1;\;
\frac{f(D^{\star} \mid \theta, G, \vartheta)}{f(D \mid \theta, G, \vartheta)}\,
\frac{q(D \mid D^{\star}, \rho^2)}{q(D^{\star} \mid D, \rho^2)}\Big\},
\]

where the first ratio can be computed using Eq. (B5), while the second ratio, after some algebra, simplifies to

\[
\ln \frac{q(D \mid D^{\star}, \rho^2)}{q(D^{\star} \mid D, \rho^2)}
= \sum_{j=1}^{P}\big(\ln D_j - \ln D_j^{\star}\big).
\]

The parameter ρ² is a tuning parameter that influences the acceptance rate of the Metropolis-Hastings algorithm. We use ρ² = 1/N in our applications.
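The proposal and acceptance mechanics can be sketched as follows. Note that the target density below is a simple stand-in (independent inverse-Gamma kernels), not the kernel of Eq. (B5), which depends on the full set of mixture parameters; it is used only to exercise the update, and all names are ours.

```python
import numpy as np

def propose_D(rng, D, rho2):
    """Log-normal proposal with mode at the current D_j:
    D*_j ~ lnN(ln D_j + rho2, rho2)."""
    return np.exp(np.log(D) + rho2 + np.sqrt(rho2) * rng.standard_normal(len(D)))

def mh_step_D(rng, D, log_target, rho2):
    """One Metropolis-Hastings update of D; the log proposal ratio
    simplifies to sum_j (ln D_j - ln D*_j)."""
    D_star = propose_D(rng, D, rho2)
    log_alpha = (log_target(D_star) - log_target(D)
                 + np.sum(np.log(D) - np.log(D_star)))
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return D_star, True
    return D, False

def log_target(D):
    # Stand-in target (NOT Eq. B5): independent inverse-Gamma kernels.
    return np.sum(-3.0 * np.log(D) - 1.0 / D)

rng = np.random.default_rng(3)
D = np.ones(2)
accepts = 0
for _ in range(2000):
    D, acc = mh_step_D(rng, D, log_target, rho2=0.1)
    accepts += acc
accept_rate = accepts / 2000
```

In the actual sampler, log_target would evaluate the kernel of Eq. (B5), and ρ² = 1/N controls the acceptance rate.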
B4 Sampling the intercept terms and factor loadings (step 3b-c)
In the expanded model, the prior distributions specified in Eqs. (23) and (24) result in the following conditional distributions for each vector of factor loadings Λq and intercept term δq corresponding to manifest variable q = 1, . . . , Q:

\[
\Lambda_q \mid Y_q, \theta, \sigma_q^2, \mu, D \sim
\mathcal{N}\big(B_{\Lambda_q} b_{\Lambda_q};\; B_{\Lambda_q}\big),
\tag{B7}
\]
\[
\delta_q \mid Y_q, \theta, \Lambda_q, \sigma_q^2, \mu, D \sim
\mathcal{N}\big(B_{\delta_q} b_{\delta_q};\; B_{\delta_q}\big),
\tag{B8}
\]

with:

\[
B_{\delta_q}^{-1} = \frac{1}{c_0} + \frac{N}{\sigma_q^2},
\qquad
b_{\delta_q} = \frac{1}{\sigma_q^2}\sum_{i=1}^{N}\big(Y_{qi} - \Lambda_q'\theta_i\big)
- \frac{\Lambda_q'\mu}{c_0},
\]
\[
B_{\Lambda_q}^{-1} = \frac{\theta'\theta}{\sigma_q^2} + \frac{\mu\mu'}{c_0}
+ \frac{D}{d_0} - B_{\delta_q}\, b_q b_q',
\qquad
b_{\Lambda_q} = \frac{1}{\sigma_q^2}\Bigg(\theta' Y_q
- b_q B_{\delta_q}\Big(\sum_{i=1}^{N} Y_{qi}\Big)\Bigg),
\]

where \(b_q = c_0^{-1}\mu + \sigma_q^{-2}\sum_{i=1}^{N}\theta_i\).
B5 Sampling the parameters of the non-empty mixture compo-
nents in the expanded model (step 4)
The conjugate normal-inverse-Wishart prior distribution specified on the mixture parameters in Eqs. (4) and (5) results in the following posterior distribution for the non-empty mixture components:

\[
\Phi_k \mid \theta, G \sim
\mathcal{IW}\Bigg(\nu_0 + N_k;\;
s_0 I_P + \sum_{i\in I_k}\theta_i\theta_i'
- \frac{\big(\sum_{i\in I_k}\theta_i\big)\big(\sum_{i\in I_k}\theta_i\big)'}{N_k + A_0^{-1}}\Bigg),
\tag{B9}
\]
\[
\mu_k \mid \Phi_k, \theta, G \sim
\mathcal{N}\Bigg(\frac{\sum_{i\in I_k}\theta_i}{N_k + A_0^{-1}};\;
\frac{\Phi_k}{N_k + A_0^{-1}}\Bigg),
\tag{B10}
\]

where I_k = {i ∈ I : G_i = k}, with I = {1, . . . , N}, is the set of indices of the observations belonging to mixture group k, and N_k = card(I_k) is the number of observations in mixture group k.
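A sketch of one draw from (B9)–(B10), assuming an integer degrees-of-freedom parameter so the inverse-Wishart draw can be obtained from an outer-product Wishart draw on the inverse scale (the function name is ours):

```python
import numpy as np

def draw_niw_posterior(rng, theta_k, nu0, s0, A0):
    """Draw (mu_k, Phi_k) from the normal-inverse-Wishart conditional
    posterior of a non-empty component (cf. Eqs. B9-B10)."""
    Nk, P = theta_k.shape
    kappa = Nk + 1.0 / A0
    sum_theta = theta_k.sum(axis=0)
    scale = (s0 * np.eye(P) + theta_k.T @ theta_k
             - np.outer(sum_theta, sum_theta) / kappa)
    df = nu0 + Nk  # assumed integer here for the simple Wishart construction
    # If W ~ Wishart(df, scale^{-1}), then W^{-1} ~ IW(df, scale).
    chol = np.linalg.cholesky(np.linalg.inv(scale))
    Z = chol @ rng.standard_normal((P, df))
    Phi_k = np.linalg.inv(Z @ Z.T)
    mu_k = rng.multivariate_normal(sum_theta / kappa, Phi_k / kappa)
    return mu_k, Phi_k

rng = np.random.default_rng(4)
theta_k = rng.standard_normal((50, 2))   # toy factor draws in group k
mu_k, Phi_k = draw_niw_posterior(rng, theta_k, nu0=5, s0=1.0, A0=1.0)
```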
B6 Sampling the mixture group indicators (step 4)
B6.1 Finite mixture case
Each observation i = 1, . . . , N is allocated to mixture group k with probability

\[
p\big(G_i = k \mid \theta_i, p_k, \vartheta_k\big) \propto
p_k\, |\Phi_k|^{-\frac{1}{2}}\,
\phi_P\Big(\Phi_k^{-\frac{1}{2}}(\theta_i - \mu_k)\Big),
\tag{B11}
\]

where φP(·) denotes the probability density function of the P-variate standard normal distribution.
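The allocation probabilities of (B11), i.e. p_k times the normal density N(θ_i | μ_k, Φ_k), can be computed stably on the log scale, as in this sketch (names are ours):

```python
import numpy as np

def allocation_probs(theta_i, weights, means, covs):
    """Posterior allocation probabilities p(G_i = k) ∝ p_k N(theta_i | mu_k, Phi_k),
    normalized on the log scale."""
    logs = []
    for p_k, mu_k, Phi_k in zip(weights, means, covs):
        diff = theta_i - mu_k
        _, logdet = np.linalg.slogdet(Phi_k)
        quad = diff @ np.linalg.inv(Phi_k) @ diff
        logs.append(np.log(p_k) - 0.5 * (logdet + quad))
    logs = np.array(logs)
    logs -= logs.max()              # subtract max before exponentiating
    probs = np.exp(logs)
    return probs / probs.sum()

# A point at the first component's mean is allocated there almost surely.
probs = allocation_probs(
    np.array([0.0, 0.0]),
    weights=[0.5, 0.5],
    means=[np.zeros(2), np.full(2, 5.0)],
    covs=[np.eye(2), np.eye(2)],
)
```

The group indicator G_i is then drawn from a categorical distribution with these probabilities.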
B6.2 Infinite mixture case
In the infinite case, we implement Algorithm 2 of Papaspiliopoulos and Roberts (2008, p. 176) to update the mixture group indicators and to introduce new mixture components on the fly. The parameters of the corresponding new mixture components are sampled from their prior distribution retrospectively as they become required.

More precisely, in Algorithm 3 the probability of assigning an observation i to mixture component l, while the other individuals remain assigned to their respective groups, is proportional to:

\[
q_i(g, l) \propto
\begin{cases}
p_l\, f\big(\theta_i^{(t)} \mid \vartheta_l^{(t)}\big) & \text{if } l \leq k_{\max},\\[4pt]
p_l\, M_i(g) & \text{if } l > k_{\max},
\end{cases}
\tag{B12}
\]

where g ≡ G^{(t−1)}, k_max ≡ max_i{G_i} denotes the last non-empty component of the mixture, and f(· | ϑ) is the probability density function of the multivariate normal distribution parametrized by ϑ. The user-defined function M_i(g) is chosen as M_i(g) = max_{l ≤ k_max}{f(θ_i | ϑ_l)}.16 The normalizing constant c_i(g) of the mixture mass probabilities in Eq. (B12) is equal to:

\[
c_i(g) = \sum_{l=1}^{k_{\max}} p_l\, f\big(\theta_i^{(t)} \mid \vartheta_l^{(t)}\big)
+ M_i(g)\Bigg(1 - \sum_{l=1}^{k_{\max}} p_l\Bigg).
\]

The acceptance probability of the Metropolis-Hastings move in step (iv) of Algorithm 3 is computed as:

\[
\alpha_i\{g, g(i, j)\} =
\begin{cases}
1 & \text{if } \max\{g(i, j)\} = k_{\max},\\[4pt]
\min\Bigg\{1,\;
\dfrac{c_i(g)\, M_i(g(i, j))}{c_i(g(i, j))\, f(\theta_i \mid \vartheta_{g_i})}\Bigg\}
& \text{if } \max\{g(i, j)\} < k_{\max},\\[10pt]
\min\Bigg\{1,\;
\dfrac{c_i(g)\, f(\theta_i \mid \vartheta_j)}{c_i(g(i, j))\, M_i(g)}\Bigg\}
& \text{if } j > k_{\max},
\end{cases}
\tag{B13}
\]

where g(i, j) is identical to the vector g except for its ith element, which is set to j, and g_i denotes the ith element of g. This acceptance probability depends on whether an existing mixture group is proposed (first two cases, when j ≤ k_max), in which case the dimension of the Dirichlet process does not change, or whether a new mixture group is proposed for incorporation into the process (last case).
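The assignment masses of (B12) and the normalizing constant c_i(g) can be sketched as follows, using a univariate normal density as an illustrative stand-in for f(θ_i | ϑ_l) (all names and values are ours):

```python
import numpy as np

def npdf(mu):
    """Unit-variance univariate normal density, a stand-in for f(theta_i | vartheta_l)."""
    return lambda x: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def retro_masses(theta_i, weights, dens, kmax):
    """Masses of Eq. (B12): p_l * f(theta_i | vartheta_l) for l <= kmax,
    and p_l * M_i(g) beyond, with M_i(g) the maximum density over the
    non-empty components; also returns the normalizing constant c_i(g)."""
    f_vals = np.array([dens[l](theta_i) for l in range(kmax)])
    M_i = f_vals.max()
    active = np.array(weights[:kmax]) * f_vals
    # Mass beyond kmax collapses to M_i(g) times the leftover stick weight.
    c_i = active.sum() + M_i * (1.0 - sum(weights[:kmax]))
    return active, M_i, c_i

# Two non-empty components; the remaining stick mass 0.2 lies beyond kmax.
active, M_i, c_i = retro_masses(
    0.0, weights=[0.5, 0.3, 0.2], dens=[npdf(0.0), npdf(3.0)], kmax=2)
```

The choice M_i(g) = max_l f(θ_i | ϑ_l) dominates the density of any retrospectively sampled new component, which is what makes the Metropolis-Hastings correction of (B13) valid.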
B7 Sampling the stick-breaking variables (step 4)
Each random variable V_k underlying the stick-breaking process is updated as

\[
V_k \mid G, \alpha \sim
\mathrm{Beta}\Bigg(N_k + 1;\;
\alpha + N - \sum_{j=1}^{k} N_j\Bigg),
\tag{B14}
\]

where N_k denotes the number of observations assigned to mixture group k. In the finite mixture case, this is done for k = 1, . . . , K − 1, while V_K = 1. In the infinite mixture case, this conditional distribution collapses to the prior in Eq. (7) for k ≥ k_max.
16. Following Papaspiliopoulos and Roberts (2008, p. 176).
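The stick-breaking update of (B14) for the truncated (finite) case, together with the implied mixture weights p_k = V_k ∏_{j<k}(1 − V_j), can be sketched as follows (names are ours):

```python
import numpy as np

def update_sticks(rng, counts, alpha):
    """Draw V_k | G, alpha ~ Beta(N_k + 1, alpha + N - sum_{j<=k} N_j)
    and return the implied weights p_k = V_k * prod_{j<k}(1 - V_j)."""
    K = len(counts)
    N = counts.sum()
    cum = np.cumsum(counts)
    V = np.empty(K)
    for k in range(K - 1):
        V[k] = rng.beta(counts[k] + 1, alpha + N - cum[k])
    V[K - 1] = 1.0          # truncated case: the last stick takes the rest
    p = np.empty(K)
    remaining = 1.0
    for k in range(K):
        p[k] = V[k] * remaining
        remaining *= 1.0 - V[k]
    return V, p

rng = np.random.default_rng(5)
counts = np.array([40, 30, 20, 10])     # toy group occupation counts
V, p = update_sticks(rng, counts, alpha=1.0)
```

Setting V_K = 1 guarantees that the truncated weights sum to one; in the infinite case the sticks beyond k_max are instead drawn from the Beta(1, α) prior.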
B8 Sampling the concentration parameter α (step 5)
Following Escobar and West (1995), the Gamma prior distribution specified on α in Eq. (19) results in a posterior that is a mixture of two Gamma distributions:

\[
\eta \mid \alpha, K^{+} \sim \mathrm{Beta}(\alpha + 1;\; N),
\]
\[
\alpha \mid \eta, K^{+} \sim
\pi_\eta\, \mathcal{G}\big(g_0 + K^{+};\; h_0 - \log(\eta)\big)
+ (1 - \pi_\eta)\, \mathcal{G}\big(g_0 + K^{+} - 1;\; h_0 - \log(\eta)\big),
\tag{B15}
\]

with π_η/(1 − π_η) = (g_0 + K^+ − 1)/(N(h_0 − log(η))), and where K^+ denotes the number of non-empty mixture components.
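The two-step Escobar-West update can be sketched as follows (the function name is ours; numpy's Gamma sampler takes a scale parameter, i.e. the reciprocal of the rate):

```python
import numpy as np

def update_alpha(rng, alpha, n_comp, N, g0, h0):
    """Escobar-West (1995) update of the DP concentration parameter:
    draw the auxiliary eta | alpha, then alpha from a two-component
    mixture of Gamma distributions."""
    eta = rng.beta(alpha + 1.0, N)
    odds = (g0 + n_comp - 1.0) / (N * (h0 - np.log(eta)))
    pi_eta = odds / (1.0 + odds)
    shape = g0 + n_comp if rng.uniform() < pi_eta else g0 + n_comp - 1.0
    rate = h0 - np.log(eta)
    return rng.gamma(shape, 1.0 / rate)  # numpy uses scale = 1/rate

# Repeated draws of the conditional, with toy values for N and K+.
rng = np.random.default_rng(6)
draws = np.array([update_alpha(rng, 1.0, n_comp=6, N=2000, g0=2.0, h0=2.0)
                  for _ in range(500)])
```

With few non-empty components relative to N, the draws concentrate on small values of α, consistent with the sparse solutions observed in the empirical example.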
References
Aguilar, O., and M. West. 2000. “Bayesian Dynamic Factor Models and Portfolio Allocation.”
Journal of Business & Economic Statistics 18 (3): 338–357. doi:10.1080/07350015.
2000.10524875.
Almlund, M., A. L. Duckworth, J. J. Heckman, and T. Kautz. 2011. “Personality Psychology and Economics.” Chap. 1 in Handbook of the Economics of Education, edited by E. A. Hanushek, S. Machin, and L. Woessmann, 4:1–181. North-Holland, Elsevier. doi:10.1016/B978-0-444-53444-6.00001-8.
Anderson, T. W., and H. Rubin. 1956. “Statistical Inference in Factor Analysis.” Chap. 3 in
Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probabil-
ity, edited by J. Neyman, 5:111–150. Berkeley: University of California Press.
Antoniak, C. E. 1974. “Mixtures of Dirichlet Processes with Applications to Bayesian Non-
parametric Problems.” The Annals of Statistics 2 (6): 1152–1174. doi:10.1214/aos/
1176342871.
Attias, H. 1999. “Independent Factor Analysis.” Neural Computation 11 (4): 803–51. doi:10.
1162/089976699300016458.
Bernanke, B. S., J. Boivin, and P. Eliasz. 2005. “Measuring the Effects of Monetary Policy: A
Factor-Augmented Vector Autoregressive (FAVAR) Approach.” The Quarterly Journal
of Economics 120 (1): 387–422. doi:10.1162/0033553053327452.
Bhattacharya, A., and D. B. Dunson. 2011. “Sparse Bayesian Infinite Factor Models.”
Biometrika 98 (2): 291–306. doi:10.1093/biomet/asr013.
Carneiro, P., K. T. Hansen, and J. J. Heckman. 2003. “Estimating Distributions of Treatment
Effects with an Application to the Returns to Schooling and Measurement of the Effects
of Uncertainty on College Choice.” International Economic Review 44 (2): 361–422.
doi:10.1111/1468-2354.t01-1-00074.
Carvalho, C. M., J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West. 2008. “High-
Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics.” Jour-
nal of the American Statistical Association 103 (484): 1438–1456. doi:10.1198/016214508000000869.
Conti, G., S. Frühwirth-Schnatter, J. J. Heckman, and R. Piatek. 2014. “Bayesian Exploratory Factor Analysis.” Journal of Econometrics 183 (1): 31–57. doi:10.1016/j.jeconom.2014.06.008.
Cunha, F., and J. J. Heckman. 2008. “Formulating, Identifying and Estimating the Technol-
ogy of Cognitive and Noncognitive Skill Formation.” Journal of Human Resources 43
(4): 738–782. doi:10.1353/jhr.2008.0019.
Cunha, F., J. J. Heckman, and S. M. Schennach. 2010. “Estimating the Technology of Cog-
nitive and Noncognitive Skill Formation.” Econometrica 78 (3): 883–931. doi:10.3982/
ECTA6551.
Escobar, M. D., and M. West. 1995. “Bayesian Density Estimation and Inference Using
Mixtures.” Journal of the American Statistical Association 90 (430): 577–588. doi:10.
2307/2291069.
Fokoué, E., and D. M. Titterington. 2003. “Mixtures of Factor Analysers. Bayesian Estimation and Inference by Stochastic Simulation.” Machine Learning 50:73–94. doi:10.1023/A:1020297828025.
Forni, M., and L. Gambetti. 2010. “The Dynamic Effects of Monetary Policy: A Structural
Factor Model Approach.” Journal of Monetary Economics 57 (2): 203–216. doi:10.
1016/j.jmoneco.2009.11.009.
Frühwirth-Schnatter, S., and H. F. Lopes. 2010. “Parsimonious Bayesian Factor Analysis when the Number of Factors is Unknown.” Working Paper, The University of Chicago Booth School of Business.
Geweke, J. F. 1989. “Bayesian Inference in Econometric Models Using Monte Carlo Integra-
tion.” Econometrica 57 (6): 1317–1339. doi:10.2307/1913710.
Geweke, J. F., and G. Zhou. 1996. “Measuring the Pricing Error of the Arbitrage Pricing
Theory.” Review of Financial Studies 9 (2): 557–587. doi:10.1093/rfs/9.2.557.
Ghosh, J., and D. B. Dunson. 2009. “Default Prior Distributions and Efficient Posterior
Computation in Bayesian Factor Analysis.” Journal Of Computational And Graphical
Statistics 18 (2): 306–320. doi:10.1198/jcgs.2009.07145.
Green, P. J., and S. Richardson. 2001. “Modelling Heterogeneity With and Without the Dirichlet Process.” Scandinavian Journal of Statistics 28 (2): 355–375. doi:10.1111/1467-9469.00242.
Hansen, K. T., J. J. Heckman, and K. J. Mullen. 2004. “The Effect of Schooling and Ability
on Achievement Test Scores.” Journal of Econometrics 121 (1-2): 39–98. doi:10.1016/
j.jeconom.2003.10.011.
Heckman, J. J., J. Stixrud, and S. Urzua. 2006. “The Effects of Cognitive and Noncognitive
Abilities on Labor Market Outcomes and Social Behavior.” Journal of Labor Economics
24 (3): 411–482. doi:10.1086/504455.
Imai, K., and D. A. van Dyk. 2005. “A Bayesian Analysis of the Multinomial Probit Model
using Marginal Data Augmentation.” Journal of Econometrics 124 (2): 311–334. doi:10.
1016/j.jeconom.2004.02.002.
Ishwaran, H., and L. F. James. 2001. “Gibbs Sampling Methods for Stick-Breaking Pri-
ors.” Journal of the American Statistical Association 96 (453): 161–173. doi:10.1198/
016214501750332758.
———. 2002. “Approximate Dirichlet Process Computing in Finite Normal Mixtures.” Journal of Computational and Graphical Statistics 11 (3): 508–532. doi:10.1198/106186002411.
Jiao, X., and D. A. van Dyk. 2015. “A Corrected and More Efficient Suite of MCMC Samplers for the Multinomial Probit Model.” Working Paper: 1–20. arXiv: 1504.07823.
Koopmans, T. C., and O. Reiersøl. 1950. “The Identification of Structural Characteristics.”
The Annals of Mathematical Statistics 21 (2): 165–181. doi:10.1214/aoms/1177729837.
Lawrence, E., D. Bingham, C. Liu, and V. N. Nair. 2008. “Bayesian Inference for Multivariate
Ordinal Data Using Parameter Expansion.” Technometrics 50 (2): 182–191. doi:10.
1198/004017008000000064.
Liu, C., D. B. Rubin, and Y. N. Wu. 1998. “Parameter Expansion to Accelerate EM: The PX-EM Algorithm.” Biometrika 85 (4): 755–770. doi:10.1093/biomet/85.4.755.
Liu, J. S., and Y. N. Wu. 1999. “Parameter Expansion for Data Augmentation.” Journal
of the American Statistical Association 94 (448): 1264–1274. doi:10.1080/01621459.
1999.10473879.
Liu, X. 2008. “Parameter Expansion for Sampling a Correlation Matrix: An Efficient GPX-
RPMH Algorithm.” Journal of Statistical Computation and Simulation 78 (11): 1065–
1076. doi:10.1080/00949650701519635.
Liu, X., and M. J. Daniels. 2006. “A New Algorithm for Simulating a Correlation Matrix
Based on Parameter Expansion and Reparameterization.” Journal of Computational
and Graphical Statistics 15 (4): 897–914. doi:10.1198/106186006X160681.
Lopes, H. F., and M. West. 2004. “Bayesian Model Assessment in Factor Analysis.” Statistica
Sinica 14:41–67.
Lucas, J. E., C. M. Carvalho, Q. Wang, A. Bild, J. Nevins, and M. West. 2006. “Sparse
Statistical Modelling in Gene Expression Genomics.” In Bayesian Inference for Gene
Expression and Proteomics, edited by K. A. Do, P. Müller, and M. Vannucci, 155–176.
Cambridge University Press.
McLachlan, G. J., and D. Peel. 2000. “Mixtures of Factor Analyzers.” Chap. 8 in Finite
Mixture Models, 238–256. John Wiley & Sons, Inc. doi:10.1002/0471721182.ch8.
McLachlan, G. J., D. Peel, and R. W. Bean. 2003. “Modelling High-Dimensional Data by
Mixtures of Factor Analyzers.” Computational Statistics and Data Analysis 41 (3-4):
379–388. doi:10.1016/S0167-9473(02)00183-4.
Meng, X.-L., and D. A. van Dyk. 1997. “The EM Algorithm — An Old Folk-Song Sung to a Fast New Tune (with Discussion).” Journal of the Royal Statistical Society, Series B 59 (3): 511–567. doi:10.1111/1467-9868.00082.
———. 1999. “Seeking Efficient Data Augmentation Schemes via Conditional and Marginal Augmentation.” Biometrika 86 (2): 301–320. doi:10.1093/biomet/86.2.301.
Neal, R. M. 2000. “Markov Chain Sampling Methods for Dirichlet Process Mixture Mod-
els.” Journal of Computational and Graphical Statistics 9 (2): 249–265. doi:10.1080/
10618600.2000.10474879.
Paisley, J., and L. Carin. 2009. “Nonparametric Factor Analysis with Beta Process Priors.” In
Proceedings of the 26th International Conference on Machine Learning, 1–8. Montreal,
Canada: ACM Press. doi:10.1145/1553374.1553474.
Papaspiliopoulos, O., and G. O. Roberts. 2008. “Retrospective Markov Chain Monte Carlo
Methods for Dirichlet Process Hierarchical Models.” Biometrika 95 (1): 169–186. doi:10.
1093/biomet/asm086.
Piatek, R., and P. Pinger. 2016. “Maintaining (Locus of) Control? Data Combination for
the Identification and Inference of Factor Structure Models.” Journal of Applied Econo-
metrics 31 (4): 734–755. doi:10.1002/jae.2456.
Quintana, F. A., and P. Müller. 2004. “Nonparametric Bayesian Data Analysis.” Statistical Science 19 (1): 95–110. doi:10.1214/088342304000000017.
R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna,
Austria.
Reiersøl, O. 1950. “On the Identifiability of Parameters in Thurstone’s Multiple Factor Anal-
ysis.” Psychometrika 15 (2): 121–149. doi:10.1007/BF02289197.
Scott, S. L. 2011. “Data Augmentation, Frequentist Estimation, and the Bayesian Analysis
of Multinomial Logit Models.” Statistical Papers 52 (1): 87–109. doi:10.1007/s00362-
009-0205-0.
Sethuraman, J. 1994. “A Constructive Definition of Dirichlet Priors.” Statistica Sinica 4 (2):
639–650.
Sokal, A. D. 1997. “Monte Carlo Methods in Statistical Mechanics: Foundations and New
Algorithms.” In Functional Integration (Cargese, 1996), edited by C. Dewitt-Morette
and A. Folacci, 361:131–192. Nato Science Series B. Springer US. doi:10.1007/978-1-
4899-0319-8.
Thurstone, L. L. 1934. “The Vectors of Mind.” Psychological Review 41 (1): 1–32. doi:10.1037/h0075959.
Tukey, J. W. 1977. Exploratory Data Analysis. Pearson.
Uysal, D. S. 2015. “Doubly Robust Estimation of Causal Effects with Multivalued Treat-
ments: An Application to the Returns to Schooling.” Journal of Applied Econometrics
30:763–786. doi:10.1002/jae.2386.
Van Dyk, D. A. 2010. “Marginal Markov Chain Monte Carlo Methods.” Statistica Sinica 20
(4): 1423–1454.
Van Dyk, D. A., and X.-L. Meng. 2001. “The Art of Data Augmentation.” Journal of Com-
putational and Graphical Statistics 10 (1): 1–50. doi:10.1198/10618600152418584.
Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Williams, B. D. 2017. “Identification of the Linear Factor Model.” Working Paper.
Yang, M., D. B. Dunson, and D. Baird. 2010. “Semiparametric Bayes Hierarchical Models
with Mean and Variance Constraints.” Computational Statistics & Data Analysis 54
(9): 2172–2186. doi:10.1016/j.csda.2010.03.025.
Yau, C., O. Papaspiliopoulos, G. O. Roberts, and C. C. Holmes. 2011. “Bayesian Non-
Parametric Hidden Markov Models with Applications in Genomics.” Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 73 (1): 37–57. doi:10.1111/
j.1467-9868.2010.00756.x.
Zhang, X., W. J. Boscardin, and T. R. Belin. 2006. “Sampling Correlation Matrices in
Bayesian Models With Correlated Latent Variables.” Journal of Computational and
Graphical Statistics 15 (4): 880–896. doi:10.1198/106186006X160050.