A Bayesian Nonparametric Approach to Factor Analysis

Rémi Piatek∗
University of Copenhagen
[email protected]

Omiros Papaspiliopoulos
ICREA–UPF

September 17, 2018
Abstract
This paper introduces a new approach for the inference of non-Gaussian factor models based on Bayesian nonparametric methods. It relaxes the usual normality assumption on the latent factors, widely used in practice, which is too restrictive in many settings. Our approach, on the contrary, does not impose any particular assumptions on the shape of the distribution of the factors, but still secures the basic requirements for the identification of the model. We design a new sampling scheme based on marginal data augmentation for the inference of mixtures of normals with location and scale restrictions. This approach is augmented by the use of a retrospective sampler, to allow for the inference of a constrained Dirichlet process mixture model for the distribution of the latent factors. We carry out a simulation study to illustrate the methodology and demonstrate its benefits. Our sampler is very efficient in recovering the distribution of the factors, and only generates models that fulfill the identification requirements. A real data example illustrates the applicability of the approach.
JEL Classification: C11; C38; C63.
Keywords: Factor models; Identification; Bayesian nonparametric methods; Dirichlet process hierarchical models; Marginal data augmentation; Retrospective sampler.
This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 600207. See acknowledgments p. 36.
∗Corresponding author: Department of Economics, University of Copenhagen, Øster Farimagsgade 5, DK–1353 Copenhagen K, Denmark. Phone: (+45) 35 32 30 35. The methodology introduced in this paper will be released as an extension to the R package BayesFM available on CRAN at https://cran.r-project.org/package=BayesFM upon publication of this article.
1 Introduction
Factor analysis has grown into a very popular and powerful tool in many fields of research,
particularly in the social sciences, where it is routinely used to aggregate large sets of variables
into smaller sets of meaningful factors. A myriad of examples relying on this data reduction
strategy can be found in the empirical literature, ranging from the extraction of latent
factors underlying macroeconomic indicators to study monetary policies or business cycles
(Bernanke et al., 2005; Forni and Gambetti, 2010), to the measurement of personality traits
and cognitive abilities and their impact on economic outcomes (see, e.g., Carneiro et al.,
2003; Heckman et al., 2006; Conti et al., 2014; Piatek and Pinger, 2016).
One of the main challenges inherent to the inference of these models is identification. To
make inference feasible and produce meaningful results, identifying assumptions are needed,
often in the form of parameter restrictions and distributional assumptions. Techniques for
dealing with such issues were developed as early as Anderson and Rubin (1956). Within the social sciences, most articles published to date assume the factors to be Gaussian, while within the Machine Learning community, non-Gaussian factor analysis is popular.
The Gaussian assumption is convenient and has a natural interpretation for many analysts.
However, it may have little justification empirically, and the model misspecification it can
induce is likely to contaminate the inference of the remaining parameters of the model.
This paper offers a more flexible approach to factor analysis that relaxes the Gaussian
assumption on the latent factors. We offer a modeling framework that allows for dependence
across factors, assumes a flexible distribution based on mixtures of Gaussians, and permits
the identification of the latent factors. Dependence across factors and identifiability are key
requirements in the applications we are interested in, as they make it possible to unravel rich latent
structures where unobserved traits can easily be interpreted—see Almlund et al. (2011) for
a discussion on this topic in personality economics. In the econometric literature, estimation
methods relying on finite mixtures of normals are commonly used (Hansen et al., 2004; Cunha
and Heckman, 2008; Cunha et al., 2010). Mixtures of normals provide an approximation
to the unknown distribution of the latent factors that can otherwise be nonparametrically
identified through appropriate parameter restrictions. These approaches therefore guarantee
identification and ensure interpretability. However, another type of misspecification can
emerge when the number of mixture components selected is not appropriate to provide a
good fit to the data.
To learn the appropriate number of mixture components from the data, instead of fixing
it a priori, Bayesian nonparametric (BNP) methods have been introduced. See Antoniak (1974), Quintana and Müller (2004), and Paisley and Carin (2009) for Dirichlet processes in general, and Neal (2000) and Papaspiliopoulos and Roberts (2008) for state-of-the-art computational methods for estimating the corresponding models. Effectively, these approaches
specify an infinite mixture of Gaussians with a specific prior distribution on the mixture
weights, and the number of active components can grow with the size of the data. Typically,
these procedures do not provide any guarantee of the formal identification of the model.
A notable exception is Yang et al. (2010), who propose a semiparametric approach we will
return to. More generally, this lack of identification becomes a major obstacle when the
inference of the structural part of the model is of main interest—e.g., if factor loadings need
to be identified to make inference on policy-relevant statistics such as elasticities or marginal
effects.
The goal of the present paper is to develop a richly parameterized and flexible distribution for the latent factors, which allows for dependence among factors while ensuring their
identifiability. We specify the distribution of the latent factors as an affine transformation
of a Dirichlet process that fixes the location and the scale of the process. We achieve this by
appropriately transforming the parameters of the mixture components. We develop a new
approach for the inference of constrained mixtures of Gaussians that relies on Marginal Data
Augmentation (MDA) methods (Meng and van Dyk, 1999; van Dyk and Meng, 2001; van
Dyk, 2010). MDA methods proceed by expanding the original constrained model, introducing extra parameters which, even though they cannot be identified from the data, facilitate sampling and make inference more efficient in terms of convergence and mixing. An interesting by-product is that the model expansion can be tailored to safeguard the identification
of the factor model. In the case of a Dirichlet process mixture model, where the number of
mixture components is free to grow infinitely to accommodate the data, the implementation
of MDA methods is not straightforward. In this article, we tackle both the finite and infinite
cases. For the former, we resort to truncations of the Dirichlet process, as in Ishwaran and
James (2002). For the latter, we apply both conditional approaches, such as retrospective
sampling ideas in Papaspiliopoulos and Roberts (2008) and Yau et al. (2011), as well as
marginal approaches as in Neal (2000), to infer the infinite-dimensional mixture model.
Interestingly, with this representation based on a mixture of normals, our model can be
reformulated as a mixture of factor analyzers (McLachlan and Peel, 2000; McLachlan et al.,
2003; Fokoue and Titterington, 2003). Despite the analogy of the two approaches, there are
fundamental differences. Mixtures of factor analyzers assume Gaussian factors and mix the
structural parameters of the model (factor loadings, intercepts and error term variances),
while our approach assumes that those are fixed across mixture components and rather mixes
the moments of the distribution of the latent factors. The two approaches correspond to two
completely different representations of the model, thus resulting in different interpretations.
They should therefore not be seen as competitive approaches, but rather as alternatives that
allow analysts to address different problems.
A relevant related and active literature is that of independent factor analysis popular in
Machine Learning, see for example Attias (1999). In that framework, identifiability is of real
concern since the latent factors are used for signal reconstruction. A key observation is that
identifiability can be partially resolved by working with certain non-Gaussian distributions
for the latent factors, a point to which we return below.
We conduct an extensive Monte Carlo study to investigate the performance of our sampler, using synthetic data sets generated from a two-factor model with a non-standard distribution for the latent factors. We implement and compare several approaches for the inference of the Dirichlet process. The results are very promising. They show that our MCMC
sampling scheme succeeds in retrieving the true underlying distribution of the latent factors,
without any a priori assumptions on the shape of the distribution. Most importantly, it does so while generating only identified models. Sampling turns out to be highly efficient thanks to the MDA procedure. The mixing of the Markov chains is indeed very good compared to what can usually be achieved in latent variable models, where convergence can be prohibitively slow and mixing poor.
To illustrate the applicability of our approach, we implement it using real data from the
British Cohort Study to extract the distribution of two latent factors capturing cognitive skills
and behavioral problems. The empirical results clearly provide evidence for non-Gaussian
factors, thus questioning standard factor analysis approaches that rely on the normality
assumption.
The algorithms developed in this article will be released as an extension to the R package
BayesFM, to allow researchers to replicate our results and also to apply our method to their
own data in a user-friendly manner.1
The baseline factor model used throughout this paper is presented in Section 2. We
briefly outline the parametric identification of the structural part of the model, then spend
some time on the nonparametric identification of the distribution of the latent factors, which
is our main focus. Section 3 introduces the Marginal Data Augmentation sampling scheme
for mixtures of normal distributions, and explains how to plug it into sampling methods
for Dirichlet process mixture models. Section 4.1 carries out our simulation study and
Section 4.2 applies it to real data. Section 5 concludes.
1. Package available on CRAN at https://cran.r-project.org/package=BayesFM. The corresponding package extension will be released upon publication of this article.
2 Specification and identification of the factor model
2.1 General model structure
The generic structure of the latent factor model we are considering is as follows. There are Q manifest variables Yi and P latent factors θi (P ≪ Q), for i = 1, . . . , N, following a linear relationship through a matrix of factor loadings Λ and a vector of intercept terms δ:
Yi = δ + Λθi + εi,   (1)

εi ∼ N(0; Σ),   Σ = diag(σ²1, . . . , σ²Q),

where Yi, δ and εi are (Q × 1) vectors, Λ is a (Q × P) matrix, and θi is a (P × 1) vector.
The error terms εi are assumed to be Gaussian for the sake of simplicity.2 The independence
of the error terms is standard in factor analysis, and implies that the factors are the only
source of correlation between the observed variables. The statistical model requires a specification for the distribution of the latent factors. A default option in the literature is that of a
Gaussian distribution. In this paper, we relax this assumption by allowing a nonparametric
specification of this distribution. Bayesian inference for this factor model also requires priors
on δ, Λ, Σ, and typically other hyperparameters as well.
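As an informal aid (not code from the paper, whose implementation is in the R package BayesFM), the generative structure of Eq. (1) can be sketched in a few lines of Python; the dimensions and parameter values are arbitrary, and the factors are drawn as Gaussians, which is precisely the default assumption this paper relaxes:

```python
import numpy as np

# Simulate from Y_i = delta + Lambda theta_i + eps_i, Eq. (1), with
# eps_i ~ N(0, Sigma) and Sigma diagonal. All values are illustrative.
rng = np.random.default_rng(0)
N, Q, P = 20000, 6, 2

delta = rng.normal(size=Q)                 # intercepts, (Q,)
Lam = rng.normal(size=(Q, P))              # factor loadings, (Q, P)
sigma2 = rng.uniform(0.5, 1.5, size=Q)     # idiosyncratic variances
theta = rng.normal(size=(N, P))            # latent factors (Gaussian default)
eps = rng.normal(size=(N, Q)) * np.sqrt(sigma2)

Y = delta + theta @ Lam.T + eps            # manifest variables, (N, Q)

# With independent errors, the factors are the only source of correlation
# across manifest variables: V(Y_i) = Lambda Lambda' + Sigma here.
implied_cov = Lam @ Lam.T + np.diag(sigma2)
print(np.abs(np.cov(Y.T) - implied_cov).max())
```

As N grows, the sample covariance of Y approaches ΛΦΛ′ + Σ (here with Φ = IP), which is exactly the quantity at the heart of the identification discussion that follows.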
The model as stated in Eq. (1) is not identified. Some identifiability issues arise because
of the specification of the structural part of the model (Section 2.2), while others are related
to the distributional assumptions on the latent factors (Section 2.3). The following sections
provide an overview of the two sources of identification issues that lead to our proposal for
an identifiable nonparametric latent factor model.
2.2 Identification of the structural part of the model
The intercept terms δ and the factor loading matrix Λ can be identified using appropriate
parameter restrictions. This can be done independently of the distribution assumed on the
latent factors, but we will also see that specific distributional assumptions can make it possible to relax some of these restrictions.
If the distribution of the latent factors belongs to a location family, such as the Gaussian, or to a mixture of distributions in a location family, such as a mixture of Gaussians, with
location parameter(s) to be estimated from the data, then δ is not identifiable. Indeed, the
distribution of Yi remains the same by adding an arbitrary constant to δ and subtracting
2. The normality of the error terms could be relaxed in a similar way to the latent factors. However, we stick to the standard Gaussian assumption in this paper for simplicity, and because the main focus is on the distribution of the latent factors.
appropriate constant(s) from the location parameter(s). This lack of identifiability can be
tackled by fixing the location of the factors, e.g., by fixing the mean of the factor distribution
to 0, such that E(θi) = 0. This constraint is straightforward to impose in the Gaussian case,
but not as trivial in the nonparametric case. In this article, we propose a distribution for
the factors that fixes their location.
The second identification problem affects the factor loadings. The bilinear form Λθi
implies that the latent factors can only be identified up to a scale transformation, since
the distribution of Yi remains unaltered if the factors are multiplied by a nonsingular scaling
matrix, and the factor loading matrix by the inverse of this matrix. This can be seen from the
expression of the overall covariance matrix of the manifest variables, which can be expressed
as ΛΦΛ′ + Σ = (ΛR−1)(RΦR′)(ΛR−1)′ + Σ, for any nonsingular (P × P )-matrix R, where
Φ ≡ V(θi) denotes the covariance matrix of the latent factors. This indeterminacy, commonly
referred to as the rotation problem, is well known since the seminal work of Thurstone (1934),
later formalized by Reiersøl (1950), Koopmans and Reiersøl (1950), and Anderson and Rubin
(1956). See also Williams (2017) for a recent revival of these questions.
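The rotation invariance described above is easy to verify numerically. The following sketch (illustrative values only, not from the paper) checks that ΛΦΛ′ + Σ is unchanged when the factors are premultiplied by a nonsingular matrix R and the loadings postmultiplied by R⁻¹:

```python
import numpy as np

# Rotation problem: (Lambda, Phi) and (Lambda R^{-1}, R Phi R') imply the
# same covariance matrix for the manifest variables. Values are arbitrary.
rng = np.random.default_rng(1)
Q, P = 5, 2
Lam = rng.normal(size=(Q, P))
Phi = np.array([[1.0, 0.3], [0.3, 1.0]])       # factor covariance
Sigma = np.diag(rng.uniform(0.5, 1.0, size=Q))  # error variances

R = rng.normal(size=(P, P))                     # any nonsingular matrix
Lam_star = Lam @ np.linalg.inv(R)
Phi_star = R @ Phi @ R.T

cov = Lam @ Phi @ Lam.T + Sigma
cov_star = Lam_star @ Phi_star @ Lam_star.T + Sigma
print(np.allclose(cov, cov_star))
```

The two parameterizations are observationally equivalent, which is why restrictions on Λ and/or Φ are needed before the loadings can be interpreted.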
This lack of identifiability has been addressed in the literature by assuming that the
latent factors are uncorrelated and have unit variances, such that Φ = IP . This requirement
has made the standard Gaussian a distribution of choice in factor analysis. However, this
assumption does not completely solve the indeterminacy problem, as the system still remains
unchanged if R is specified as an orthogonal matrix. To rule out these cases, Anderson and
Rubin (1956, p. 121) propose to use a lower triangular structure for the upper part of Λ. This
structure has become popular in factor analysis, see, e.g., Geweke and Zhou (1996), Aguilar
and West (2000), Lopes and West (2004), and Frühwirth-Schnatter and Lopes (2010).
One last identifiability issue needs to be taken care of. It arises because the sign of the latent factors and of the corresponding columns of the loading matrix can be flipped simultaneously without affecting the distribution of Yi. This property of the model implies that, without further constraints on Λ (or on the factors), the signs of the correlations between the factors are not identifiable. In our work, we deal with the sign issue by making assumptions on the sign of certain entries of the loading matrix. Computationally, we work with the sign-unconstrained model and enforce the constraints at a post-processing stage by appropriate transformations of the MCMC output, as in, e.g., Frühwirth-Schnatter and Lopes (2010) and Conti et al. (2014).
Alternatively, the scales and the signs of the latent factors can be set by constraining one
loading in each column of Λ instead of constraining the diagonal elements of Φ. This approach
has been popular in the econometrics literature, as it anchors the latent factors in
real measurements, thus facilitating interpretation (for example Cunha and Heckman, 2008;
Cunha et al., 2010, anchor the factors in earnings outcomes). Nevertheless, constraining some
factor loadings can be too restrictive in some frameworks. For example, when a stochastic
search is carried out to determine the number of latent factors and the structure of the factor
loading matrix in terms of zero and nonzero elements, it is not possible to fix any of the
loadings a priori. These approaches are becoming increasingly popular in the literature, see,
among others, Lucas et al. (2006), Carvalho et al. (2008), Frühwirth-Schnatter and Lopes
(2010), Bhattacharya and Dunson (2011), and Conti et al. (2014). In the present paper, we
rely on identifying criteria that fix the variances of the factors rather than some of the factor
loadings.
When working with correlated factors, the block lower triangular structure of Λ no longer
safeguards identification. Indeed, pre-multiplying the latent factors by a nonsingular lower
triangular matrix R and post-multiplying Λ by the inverse of R results in a model that is
observationally equivalent to the original one, since ΛR−1 also has a block lower triangular
structure. Therefore, moving from the uncorrelated to the correlated case requires adding a number of additional constraints on the factor loading matrix equal to the number of
off-diagonal elements of the covariance matrix Φ. This can be done by specifying a diagonal
matrix for the upper part of Λ, such that Λ′ = (DΛ1 Λ′2), with DΛ1 = diag(λ11, . . . , λPP), and Λ2 a full matrix that may contain additional zero elements. In this specification,
the first P manifest variables each load on a single latent factor, and are sometimes called
dedicated measurements in the literature (Conti et al., 2014; Williams, 2017). Similarly to
the uncorrelated case, the scale of the factors is set by either assuming that DΛ1 = IP , or
that Φjj = 1 and Λjj > 0, for j = 1, . . . , P .
2.3 Nonparametric identification of the distribution of the latent factors
The restrictions derived in Section 2.2 make it possible to identify the structural part of the model, i.e., δ and Λ, as well as the covariance matrix of the latent factors Φ. Importantly,
these assumptions do not depend on the distributional assumptions made on the latent
factors. They only secure the identification of the covariance matrix of the factors, and
therefore do not guarantee that the whole distribution of the factors is identified if we depart
from the Gaussian case.
In the nonparametric case, these assumptions might be over-restrictive. For example,
working with non-Gaussian latent factors can remove some identifiability problems when the
latent factors follow a mixture of Gaussians with a diagonal covariance matrix for each component that differs from the identity. This property has propelled the so-called independent component analysis and independent factor analysis, popular within Machine Learning, see, e.g., Attias (1999).
On the other hand, some nonparametric approaches might require additional restrictions
to fully identify the distribution of the factors nonparametrically. In this paper, we rely on
the identification strategy developed in Cunha et al. (2010). Their nonparametric approach
requires mild assumptions on the latent factors, and only minor additional restrictions on the
factor loading matrix: two dedicated manifest variables are needed for each factor instead
of one in the previous section.3
With two dedicated manifest variables in hand for each latent factor, such that Λ′ = (DΛ1 DΛ2 Λ′3),4 the proof for nonparametric identification of the factor distribution follows from Cunha et al. (2010). Assuming nonzero diagonal elements in DΛ1 and DΛ2, the first 2P equations can be rewritten as

W1 = θ + ω1,
W2 = θ + ω2,   (2)

with

W1 = D−1Λ1 (Y1:P − δ1:P),   ω1 = D−1Λ1 ε1:P,
W2 = D−1Λ2 (Y(P+1):(2P) − δ(P+1):(2P)),   ω2 = D−1Λ2 ε(P+1):(2P),
where the subscripts denote the elements of the corresponding subvectors (e.g., Y1:P contains the first P elements of the vector Y). The expression of the subsystem corresponding to the dedicated measurements in Eq. (2) is particularly convenient, as it makes it possible to directly use the first theorem of Cunha et al. (2010, Theorem 1, p. 893) to prove the nonparametric identification of the distribution of the factors, after having secured the identification of the intercept terms and the factor loadings as explained in the previous section. This theorem states that if W1, W2, θ, ω1 and ω2 are random vectors taking values in RP and related through the equations in Eq. (2), then the factor distribution is nonparametrically identified and can be expressed in terms of observable quantities, provided that E(ω1 | θ, ω2) = 0 and ω2 is independent of θ. The last two conditions are automatically fulfilled, since we assume the error terms to be independently normally distributed.
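The inversion of the dedicated blocks into the repeated-measurement system of Eq. (2) can be sketched as follows (a hypothetical example, not from the paper: for simplicity the model is reduced to its first 2P rows, so Λ′ = (DΛ1 DΛ2) only, and the loading values are arbitrary):

```python
import numpy as np

# With two dedicated measurements per factor, rescaling each block by the
# inverse of its diagonal loading matrix yields W1 = theta + omega1 and
# W2 = theta + omega2, as in Eq. (2). All numbers are illustrative.
rng = np.random.default_rng(2)
N, P = 8, 2
D1 = np.diag([1.5, 0.8])          # D_Lambda1: dedicated loadings, block 1
D2 = np.diag([0.7, 1.2])          # D_Lambda2: dedicated loadings, block 2
delta = rng.normal(size=2 * P)

theta = rng.normal(size=(N, P))
eps = 0.3 * rng.normal(size=(N, 2 * P))
Y = delta + np.hstack([theta @ D1.T, theta @ D2.T]) + eps

W1 = (Y[:, :P] - delta[:P]) @ np.linalg.inv(D1).T
W2 = (Y[:, P:] - delta[P:]) @ np.linalg.inv(D2).T

# Each block equals the factors plus an independent, rescaled error term:
omega1 = W1 - theta    # = D_Lambda1^{-1} eps_{1:P}
omega2 = W2 - theta    # = D_Lambda2^{-1} eps_{(P+1):(2P)}
print(np.allclose(omega1, eps[:, :P] @ np.linalg.inv(D1).T))
```

Because the errors are independent Gaussians, ω2 is independent of θ and E(ω1 | θ, ω2) = 0, the two conditions of the theorem.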
3. In most cases, the assumption of two dedicated measurements per factor is not restrictive in practice, since numerous indicators are usually available to measure the latent factors.
4. Similarly to Section 2.2, DΛ1 and DΛ2 are diagonal matrices, and Λ3 is a full matrix.
2.4 Identifiable Bayesian nonparametric correlated factor models
We build a model for the latent factors that is sufficiently constrained in its location and
scale to facilitate identifiability of the structural part of the overall model. The model
is constructed as an affine transformation of an auxiliary process, which is modeled as a
Dirichlet process Gaussian mixture model and is described below. Therefore, our approach
is a combination of Bayesian nonparametrics and econometric modeling, in order to ensure
both a flexible form for the latent factors and identifiability of the structural part of the
model. It turns out that an insightful perspective on our model is as a Gaussian mixture
model where the number of components can be learned from the data automatically, and
where the mixture parameters are constrained to ensure identifiability of the structural part
of the factor model. The induced constraints lead to a complicated posterior distribution,
but we propose marginal data augmentation methods in Section 3 to sample from it very
efficiently.
2.4.1 Modeling the distribution of the factors
In the rest of the paper we will follow a notational convention. The intercept, factor loadings and latent factors that appear in the final formulation of the (identified) factor model will be denoted by δ, Λ and θi, respectively, whereas transformations thereof carry a bar: δ̄, Λ̄ and θ̄i. These transformations might be used as intermediate variables in the construction of the final model, e.g., an intermediate θ̄i is used to define a model for factors θi with constraints on their location and scale. Below we explain the precise ways in which these transformations relate to Eq. (1).
The factor model is an affine transformation of an auxiliary process that models the distribution of the latent factors. This stochastic process is specified as the following Dirichlet process Gaussian mixture model:

θ̄i | µ̄Gi, Φ̄Gi ∼ N(µ̄Gi; Φ̄Gi),
Gi | p ∼ ∑_{k=1}^{K} pk δk(Gi),   (3)
µ̄k | Φ̄k, A0 ∼ N(0; A0Φ̄k),   (4)
Φ̄k | ν0, s0 ∼ IW(ν0; s0IP),   (5)
p1 = V1,   pk = Vk ∏_{l=1}^{k−1} (1 − Vl),   (6)
Vk ∼ Beta(1; α),   (7)

for 1 < k ≤ K.
The parameters that define the Gaussian distribution at the top level are denoted by ϑ̄k = {µ̄k, Φ̄k} and are collected in the set ϑ̄ = {ϑ̄1, ϑ̄2, . . .}. These, and the random variables V = (V1, V2, . . .), are assumed to be independent of each other. The Dirac delta function centered at k is denoted δk(·). Hence, when K = ∞, Eqs. (3) to (7) in the above hierarchy define a Dirichlet process model for ϑ̄ with a normal-inverse-Wishart base distribution parameterized by {A0, ν0, s0}, see Eqs. (4) and (5). When K < ∞, VK is set to 1 to ensure that the mixture weights sum to 1, and the resulting model is a truncated Dirichlet process for ϑ̄ (Ishwaran and James, 2001, 2002). In either case, we adopt the stick-breaking representation of the Dirichlet process (Sethuraman, 1994), as described in Eqs. (6) and (7), and we explicitly augment the model with latent variables G = (G1, . . . , GN) for the mixture group memberships. Marginalizing over these membership variables, we obtain a (potentially infinite) mixture of Gaussians for the distribution of the factors:
θ̄i ∼ ∑_{k=1}^{K} pk NP(µ̄k; Φ̄k),

with corresponding moments:

E(θ̄i) = ∑_{k=1}^{K} pk µ̄k,

V(θ̄i) = ∑_{k=1}^{K} pk ((µ̄k − µ̄)(µ̄k − µ̄)′ + Φ̄k),

where µ̄ = E(θ̄i).
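The truncated stick-breaking construction of Eqs. (3) to (7) can be sketched as follows (a simplified illustration, not the paper's sampler: the component covariances are fixed to the identity instead of being drawn from the inverse-Wishart prior of Eq. (5), and all tuning values are arbitrary):

```python
import numpy as np

# Truncated stick-breaking: draw sticks V_k ~ Beta(1, alpha), set V_K = 1,
# form weights p_k, then sample factors from the finite Gaussian mixture.
rng = np.random.default_rng(3)
K, P, N, alpha, A0 = 10, 2, 5000, 1.0, 4.0

V = rng.beta(1.0, alpha, size=K)
V[-1] = 1.0                                   # V_K = 1 so the weights sum to 1
p = V * np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])   # Eq. (6)

Phi = np.stack([np.eye(P)] * K)               # simplifying assumption (not IW draws)
mu = rng.multivariate_normal(np.zeros(P), A0 * np.eye(P), size=K)  # Eq. (4) with Phi_k = I

G = rng.choice(K, size=N, p=p)                # mixture memberships, Eq. (3)
theta_bar = np.array([rng.multivariate_normal(mu[k], Phi[k]) for k in G])

# First moment of the mixture, matching E(theta_bar) = sum_k p_k mu_k:
mean = p @ mu
print(p.sum(), theta_bar.shape)
```

The telescoping product in the weights guarantees that they sum to one, and the empirical mean of the draws matches the mixture moment formula above.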
2.4.2 Constrained version of the model
Relying on the distribution specified in Section 2.4.1 for θ̄i, we propose the following identifiable nonparametric factor model:

Yi = δ + Λθi + εi,
θi = D̄−1/2(θ̄i − µ̄),   (8)

where µ̄ and D̄ are chosen so as to constrain the location and scale of the latent factors. We treat the finite (K < ∞) and infinite (K = ∞) mixture cases separately, although both are based on the following construction:

µ̄ = ∑_{k∈K} βk µ̄k,   (9)

Φ̄ = ∑_{k∈K} βk ((µ̄k − µ̄)(µ̄k − µ̄)′ + Φ̄k),   D̄ ≡ diag(Φ̄11, . . . , Φ̄PP),   (10)
where the set of mixture indices K and the weights βk are chosen in different ways for the
finite and infinite mixture models—see below in Sections 2.4.3 and 2.4.4.
The structure of the factor loading matrix, in terms of zero restrictions, is not affected by the transformation, because the matrix D̄ used to rescale the latent factors is diagonal. This is particularly important, as zero restrictions on Λ are required for identification in our framework, see Section 2.2.5 Our construction is analogous to the one used by Yang et al. (2010), except that for the parameter transformation we only use the diagonal elements of Φ̄, while they use the Cholesky decomposition of this covariance matrix. This is an important difference between the two approaches: ours makes it possible to work with correlated factors, as the corresponding transformation preserves the zero restrictions on the factor loading matrix, while theirs is only appropriate for uncorrelated factors, because it only preserves the zero restrictions of the loading matrix if it has a block lower triangular structure.
Since the Gaussian is a location-scale family, an equivalent way to understand the proposed latent factor model is as a Gaussian mixture with linearly constrained parameters:

θi ∼ ∑_{k=1}^{K} pk NP(µk; Φk),   (11)
µk = D̄−1/2(µ̄k − µ̄),   (12)
Φk = D̄−1/2 Φ̄k D̄−1/2,   (13)

where the parameter transformations expressed in Eqs. (12) and (13) imply, by construction, that the following constraints are fulfilled in the identified model:

µ ≡ ∑_{k∈K} βk µk = 0P,   (14)
Φ ≡ ∑_{k∈K} βk (µk µ′k + Φk),   D ≡ diag(Φ11, . . . , ΦPP) = IP.   (15)

5. If sign restrictions are imposed on Λ for identification, these restrictions also remain unaffected by the expansion, since the diagonal elements of D̄ are all positive.
The prior on ϑ̄ specified in Eqs. (4) and (5) implies a prior for the corresponding constrained parameters ϑ = {ϑk}k∈K, where ϑk = {µk, Φk}. The form of this induced density is given in Proposition 2 in the Appendix, and specifically in Eq. (A1). The density does not belong to a known family and looks cumbersome. Fortunately, this density is not required in the sampling scheme, as the marginal data augmentation procedure we will use mainly relies on the expanded version of the model, which is easier to sample from. Nevertheless, one should still investigate the shape of the prior induced on the identified model, to make sure we do not work with an odd prior. To do this, it is straightforward to simulate from the prior rather than trying to work out its kernel analytically.6
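Such a prior-simulation check can be sketched as follows (a simplified illustration, not the paper's code: the weights and component covariances are drawn from convenient placeholder distributions rather than from the stick-breaking and inverse-Wishart priors, since the constraints hold by construction for any draw):

```python
import numpy as np

# Draw unconstrained mixture parameters, apply the transformations of
# Eqs. (12)-(13), and verify the constraints (14)-(15) of the identified
# model. All distributions here are placeholder choices for illustration.
rng = np.random.default_rng(4)
K, P = 4, 3
beta = rng.dirichlet(np.ones(K))                      # mixture weights (placeholder)
mu_bar = rng.normal(size=(K, P))                      # unconstrained means
A = rng.normal(size=(K, P, P))
Phi_bar = A @ A.transpose(0, 2, 1) + 0.1 * np.eye(P)  # random SPD covariances

m = beta @ mu_bar                                     # Eq. (9)
dev = mu_bar - m
Phi_tot = np.einsum("k,kij->ij", beta,
                    dev[:, :, None] * dev[:, None, :] + Phi_bar)   # Eq. (10)
D_half_inv = np.diag(1.0 / np.sqrt(np.diag(Phi_tot)))

mu = dev @ D_half_inv                                 # Eq. (12)
Phi = D_half_inv @ Phi_bar @ D_half_inv               # Eq. (13), per component

# Constraint (14): the weighted mean of the constrained components is zero.
print(np.allclose(beta @ mu, 0.0))
# Constraint (15): the weighted total covariance has a unit diagonal.
Phi_mix = np.einsum("k,kij->ij", beta, mu[:, :, None] * mu[:, None, :] + Phi)
print(np.allclose(np.diag(Phi_mix), 1.0))
```

Whatever the draw, the constrained parameters satisfy Eqs. (14) and (15) exactly, which is the point of imposing the transformation a priori rather than in post-processing.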
The Bayesian formulation of the factor model is complemented by priors on δ, Λ, Σ
and α. The concentration parameter α has a major impact on the estimated number of
components in the infinite mixture model: The larger α, the more likely new components
will be introduced into the process a priori. This parameter can therefore be tuned to
control the expansion of the Dirichlet process in terms of number of mixture components.
This is analogous to alternative nonparametric approaches, such as kernel density estimation
methods, where a smoothing parameter usually needs to be selected by the analyst to control
the level of smoothness of the estimator (e.g., bandwidth parameter). In our approach, we
prefer to learn α from the data instead of fixing it a priori, and therefore equip this parameter
with a prior distribution.
6. See Section 4.1.1 for an example.
The general structure of the prior distribution on the hyperparameters is

δ | c0 ∼ NQ(0Q; c0IQ),   (16)
Λq | d0 ∼ NP(0P; d0IP),   (17)
σ²q | a0, b0 ∼ IG(a0; b0),   (18)
α | g0, h0 ∼ G(g0; h0),   (19)
for q = 1, . . . , Q, where Λq = (λq1, . . . , λqP )′ denotes the column vector of factor loadings
corresponding to the qth row of Λ, and each single factor loading is denoted λqj, for j =
1, . . . , P .
2.4.3 Finite mixture model
In the finite mixture model, K = {1, . . . , K} with K <∞. We simply take βk = pk in terms
of the generic model structure in Eqs. (9) and (10). Therefore, Eqs. (14) and (15) are by
construction equivalent to E(θi) = 0P and diag (V(θi)) = ιP , where ιP is the vector of length
P that contains only 1s.
2.4.4 Infinite mixture model
We could repeat the above construction for K → ∞, but each expression in Eqs. (9) and (10) would require an infinite summation, which would make the resulting model computationally intractable. Instead, we use the ingredients of the retrospective sampling methodology of Papaspiliopoulos and Roberts (2008) to define the µ̄ and D̄ required in Eqs. (8) to (10). The construction now also involves the allocation variables Gi.
In the Dirichlet process mixture model, the number of mixture components K is nominally infinite, but in practice only a finite number N of observations is available to be allocated to the mixture groups. Therefore, only a finite number of mixture groups will contain observations, the remaining ones being empty mixture components. We introduce some notation and divide the set of mixture component indices I into two distinct groups, the group of non-empty (“alive”) mixture components I(al), and the group of “dead” components I(d):

I = {1, 2, . . .},
I(al) = {k ∈ I : Nk > 0},
I(d) = {k ∈ I : Nk = 0},

where Nk = ∑_{i=1}^{N} 1{Gi = k}, for k = 1, 2, . . ., such that I = I(al) ∪ I(d).
Using the generic notation introduced in Eqs. (9) and (10), the set of mixture indices is defined as K = I(al), and the weights as βk ≡ wk = Nk/N, which measure the observed frequency with which individuals are allocated to mixture component k. By construction, the weights depend on the configuration of the allocation variables G, with wk > 0 for k ∈ I(al), wk = 0 for all k ∈ I(d), and ∑_{k∈I(al)} wk = 1. This construction does not collapse to the one for the finite mixture when K < ∞. It does, however, fix the location and scale of the factors—not by setting their first two prior moments to 0 and to a correlation matrix, respectively, but by fixing the linear combinations in Eqs. (14) and (15) to these values.
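A minimal sketch of this bookkeeping, with hypothetical allocations (the values of G are made up for the example):

```python
import numpy as np

# Given allocation variables G_i, identify the "alive" components I^(al)
# (those with N_k > 0) and form the empirical weights w_k = N_k / N that
# play the role of beta_k in Eqs. (9)-(10) for the infinite mixture.
G = np.array([0, 0, 2, 2, 2, 5, 0, 2])   # hypothetical allocations, N = 8
N = G.size

labels, counts = np.unique(G, return_counts=True)
alive = dict(zip(labels.tolist(), counts.tolist()))  # I^(al) with counts N_k
w = {k: n / N for k, n in alive.items()}             # w_k = N_k / N

print(alive)              # {0: 3, 2: 4, 5: 1}
print(sum(w.values()))    # the weights sum to 1 by construction
```

All other components are "dead" (wk = 0) and drop out of the finite sums defining µ̄ and D̄.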
2.4.5 Related approaches in the literature
An alternative approach to dealing with the identifiability constraints is to impose them after
sampling, for instance through appropriate transformations of the MCMC output produced
with the nonidentified model. An example of this is the treatment of the sign issue discussed
earlier. This approach is often equivalent to assuming certain priors for the parameters of
the factor model, which imply a nontrivial prior dependence among them. In some cases, the
induced prior distribution can be derived analytically and may exhibit desirable properties.
For example, in the framework of a factor model, Ghosh and Dunson (2009) show that this mechanism can be used to induce heavy-tailed priors on the factor loadings, which are well-defined and more flexible than the usual normal prior.
In other cases, the implied prior dependence might be more difficult to grasp. This is, for example, the case in the approach of Yang et al. (2010), who propose a semiparametric approach to factor analysis that relies on parameter expansion. They use a model transformation similar to ours, but rely on a post-processing stage to achieve identification. This posterior transformation implies a complicated prior on the loadings, because the transformation involves a mixture of Gaussians. Instead, we impose the identifiability constraints a priori, and exploit the connection to the nonidentifiable model to build efficient marginal data augmentation algorithms. We therefore use a different prior than theirs. Another difference concerns the identification requirements on the factor loading matrix. The parameter expansion they use can only be implemented for specific patterns of zero restrictions on the factor loading matrix, such as the block lower triangular matrix proposed by Geweke and Zhou (1996), and no additional zero restrictions can be imposed below the diagonal. This occurs because they work with the Cholesky decomposition of the covariance matrix of the factors. In contrast, we only use the diagonal matrix D to transform our model, which allows for arbitrary patterns of zero elements in the loading matrix.
These differences may be very relevant when embedding the inference of the distribution of latent factors into other existing approaches, for instance to implement stochastic search algorithms on the structure of the factor loading matrix, as already mentioned in Section 2.2. These methods usually require the prior to be known analytically, and would be impaired by arbitrary zero restrictions on the factor loading matrix. In this respect, our approach would be straightforward to use for such extensions.
3 Marginal data augmentation methods for nonparametric factor models
3.1 Accelerating MCMC using nonidentifiable model formulations
Marginal Data Augmentation (MDA) methods (Meng and van Dyk, 1999) emerged in parallel
with parameter-expansion methods (Liu and Wu, 1999), as a by-product of different attempts
made to improve the convergence of the EM-algorithm (Meng and van Dyk, 1997; Liu et al.,
1998). These approaches start from the observation that introducing extra parameters into the model (called working parameters), which cannot be identified from the data but can be sampled alongside the remaining parameters of the model, can dramatically improve the convergence and mixing of the MCMC sampler. Based on this result, Meng and van Dyk (1999), van Dyk and Meng (2001), and van Dyk (2010) have formalized the mechanisms of MDA, and provided extensive examples applying these methods to a wide range of models.
These approaches have proved to be particularly efficient in models where convergence is usually very slow, to the point that it can hinder proper inference, such as latent variable models. For example, MDA methods have been successfully applied to a variety of discrete choice models, such as the multinomial probit (Imai and van Dyk, 2005; Jiao and van Dyk, 2015), the multivariate probit (Lawrence et al., 2008), and the multinomial logit (Scott, 2011), as well as to factor analysis (Ghosh and Dunson, 2009; Yang et al., 2010; Frühwirth-Schnatter and Lopes, 2010; Conti et al., 2014) and to the sampling of correlation matrices (Liu and Daniels, 2006; Liu, 2008).
MDA methods also make it possible to sample indirectly from complicated distributions that would otherwise be difficult to simulate. This feature is particularly useful in our framework: the Dirichlet process hierarchical model is challenging to simulate in its constrained version, but it can be marginally augmented to make it easier to handle. Last but not least, these methods are usually easy to implement: only a few additional working parameters need to be sampled, at a low marginal cost, and no tuning is required. Hence, we can decouple the modeling, for which we can impose constraints for identifiability, from the computation, which can be done efficiently despite the complicated posteriors the modeling implies.
3.2 Working parameters for the nonparametric factor model
We build efficient MDA algorithms for the identifiable nonparametric factor model proposed in Section 2 using the working parameters µ and D, as defined in Section 2, together with the following additional parameter transformations:
\[
\tilde{\Lambda} = \Lambda D^{-\frac{1}{2}}, \qquad
\tilde{\delta} = \delta - \tilde{\Lambda} \mu. \tag{20}
\]
The backbone of the MDA algorithm we propose is the following result about the distribution of the working parameters, which is key to the efficient MCMC implementation we introduce.
Proposition 1. Consider the parameters µ and D defined in Eqs. (9) and (10), and the one-to-one mappings from the expanded-model parameters ϑ̃ to ϑ as defined in Eqs. (12) and (13). Then, the normal-inverse-Wishart prior distribution specified on ϑ̃_k = {µ̃_k, Φ̃_k} in Eqs. (4) and (5), for k ∈ K, implies that
\[
f(\mu, D \mid \vartheta, G, s_0, \nu_0, A_0)
= f(\mu \mid D, \vartheta, G, A_0) \prod_{j=1}^{P} f(D_j \mid \vartheta, G, \nu_0, s_0),
\]
with
\[
\mu \mid D, \vartheta, G, A_0 \sim \mathcal{N}_P\left( -D^{\frac{1}{2}} E^{-1} F;\ A_0\, D^{\frac{1}{2}} E^{-1} D^{\frac{1}{2}} \right), \tag{21}
\]
\[
D_j \mid \vartheta, G, \nu_0, s_0 \sim \mathcal{IG}\left( \frac{\nu_0 |K|}{2};\ \frac{s_0\, E_{[jj]}}{2} \right), \quad \text{for } j = 1, \dots, P, \tag{22}
\]
\[
E = \sum_{k \in K} \Phi_k^{-1}, \qquad F = \sum_{k \in K} \Phi_k^{-1} \mu_k,
\]
where $E_{[jj]}$ denotes the $j$th diagonal element of $E$, $K = \{1, \dots, K\}$ in the finite mixture model, $K = I^{(al)}$ in the infinite mixture model, and $|K|$ is the cardinality of the set $K$.
Proof. See Appendix A1.
In the case of the finite mixture model, G can be dropped from the conditioning sets above. Interestingly, conditionally on ϑ, {µ, D} are independent of the mixture probabilities p_k that are used for β_k in Eqs. (9) and (10). In the infinite mixture model, however, the construction imposes a prior dependence of µ and D on G, but only through the set of alive components I^(al) and its cardinality |I^(al)| implied by G.
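Proposition 1 translates directly into a sampling step. The sketch below (a hypothetical helper, not the authors' implementation) draws D from the inverse-gamma distributions in Eq. (22) and then µ from the multivariate normal in Eq. (21), assuming the means and covariances of the components in K are collected in arrays:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_working_parameters(mu_k, Phi_k, A0, nu0, s0):
    """Draw the working parameters (mu, D) as in Proposition 1 (sketch).

    mu_k: (K, P) mixture means, Phi_k: (K, P, P) mixture covariances of
    the components in the index set K. Hypothetical helper, not the
    authors' implementation.
    """
    K, P = mu_k.shape
    E = sum(np.linalg.inv(Phi_k[k]) for k in range(K))            # E = sum_k Phi_k^{-1}
    F = sum(np.linalg.inv(Phi_k[k]) @ mu_k[k] for k in range(K))  # F = sum_k Phi_k^{-1} mu_k
    Einv = np.linalg.inv(E)
    # Eq. (22): D_j ~ IG(nu0|K|/2, s0 E_[jj]/2), drawn as scale / Gamma(shape, 1)
    D = (s0 * np.diag(E) / 2) / rng.gamma(nu0 * K / 2, size=P)
    Dh = np.diag(np.sqrt(D))
    # Eq. (21): mu | D ~ N(-D^{1/2} E^{-1} F, A0 D^{1/2} E^{-1} D^{1/2})
    mu = rng.multivariate_normal(-Dh @ Einv @ F, A0 * Dh @ Einv @ Dh)
    return mu, D
```

Both draws are from standard distributions, which is what makes this working-parameter step cheap inside the sampler.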
The other distributions we need for the implementation of the MDA algorithm are those that correspond to the parameters defined in Eq. (20). However, it is a simple consequence of their definitions and of the priors on the identifiable parameters in Eqs. (16) and (17) that
\[
f(\tilde{\delta}, \tilde{\Lambda} \mid \mu, D, c_0, d_0)
= f(\tilde{\delta} \mid \tilde{\Lambda}, \mu, D, c_0) \prod_{q=1}^{Q} f(\tilde{\Lambda}_q \mid D, d_0),
\]
with:
\[
\tilde{\delta} \mid \tilde{\Lambda}, \mu, D, c_0 \sim \mathcal{N}\left( -\tilde{\Lambda}\mu;\ c_0 I_Q \right), \tag{23}
\]
\[
\tilde{\Lambda}_q \mid D, d_0 \sim \mathcal{N}\left( 0;\ d_0 D^{-1} \right), \tag{24}
\]
for $q = 1, \dots, Q$, where $\tilde{\Lambda}_q$ denotes the column vector of factor loadings corresponding to the $q$th row of $\tilde{\Lambda}$ in the expanded model.
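These conditional priors are likewise straightforward to simulate. A minimal sketch (hypothetical helper; D is passed as the vector of its diagonal elements):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_expanded_priors(mu, D, Q, c0, d0):
    """Draw the expanded-model parameters from Eqs. (23)-(24) (sketch,
    hypothetical helper). D is the vector of diagonal elements.
    """
    P = len(mu)
    # Eq. (24): each row of Lambda-tilde ~ N(0, d0 * D^{-1})
    Lam = rng.multivariate_normal(np.zeros(P), d0 * np.diag(1.0 / D), size=Q)
    # Eq. (23): delta-tilde | Lambda-tilde ~ N(-Lambda-tilde mu, c0 I_Q)
    delta = rng.multivariate_normal(-Lam @ mu, c0 * np.eye(Q))
    return delta, Lam

# Hyperparameters as in Table 1 of the simulation study (c0 = d0 = 10):
delta, Lam = draw_expanded_priors(mu=np.zeros(2), D=np.ones(2), Q=9, c0=10.0, d0=10.0)
```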
3.3 MDA sampling scheme
The sampler is presented as Algorithm 1, in its generic form to accommodate both the finite
and the infinite mixture cases. Those two cases only differ with respect to the sampling of
the mixture parameters in step 4. Full details are provided in Algorithms 2 and 3 for this
particular step. Parameters and latent variables carry a superscript (t) only if their values are used across MCMC iterations or are kept for posterior inference. The others are auxiliary draws that are discarded at the end of the corresponding iteration. Some of them, like the working parameters µ and D, may be updated several times within a single MCMC iteration. In that case, their most up-to-date values are used in any given substep of the MCMC sampler.
The main difference between this sampling scheme and a standard Gibbs sampler for
factor models is not only the potentially infinite number of mixture components, but also
the additional working parameters that need to be sampled jointly with the parameters
of interest and with the latent variables of the model. The introduction of these working
parameters requires a transformation of the model, which is performed at the end of each iteration to move back to the identified version (van Dyk, 2010). These additional steps, however, only come at a small computational cost. The intermediate values of the working parameters are all drawn directly from standard distributions, except at step 3a, where a Metropolis-Hastings step is implemented. For full details on the sampler, see Appendix B.
Initialization is done for all parameters and latent variables that are not marginalized out
before their first update.7 Since no information on the working parameters can be retrieved
from the data, they are sampled from their conditional prior distribution the first time they
are required, in step 2a. The latent factors θ are then sampled from the identified model,
and immediately transformed to obtain their counterpart in the expanded model. This step
is equivalent to sampling directly from f(θ | Y, δ, Λ,Σ,G, ϑ).
Step 4 is done in the nonidentified model, with some important differences between the
finite and infinite mixture cases. In the finite case (Algorithm 2), the mixture parameters ϑk
of the non-empty mixture components are sampled from their posterior distribution, while
those corresponding to the empty components are sampled from their prior. Similarly, the
stick-breaking variables Vk are either sampled from their posterior or prior distribution. In
the infinite case (Algorithm 3), this procedure is not feasible. Instead, the ϑk’s and Vk’s
corresponding to the non-empty components (resp., empty components) are sampled from
their posterior distribution (resp., prior distribution) up to the last non-empty component
kmax ≡ maxi{Gi}. The mixture indicators G are then updated sequentially for each observation i = 1, . . . , N, and any new component k > kmax that may be required to increase the size of the mixture is introduced retrospectively, using the procedure of Papaspiliopoulos and Roberts (2008). In this algorithm, the variable N⋆ denotes the temporary maximum number of mixture components (N⋆ ≥ kmax), which measures how far the sampler goes in its exploration of the Dirichlet process. As noted by Papaspiliopoulos and Roberts (2008), and observed in our simulations, the algorithm can introduce large numbers of temporary mixture components (large N⋆) at the beginning of sampling, but this number usually shrinks quickly once the sampler converges to the stationary distribution.
Since the mixture parameters are all updated in the nonidentified model, the prior dependence on G that affects ϑ in the infinite case is not relevant at these stages. This dependence is later restored by the transformation in step 6. Therefore, these steps represent a standard Gibbs step in the finite case, and a standard, but nontrivial, retrospective sampling step in the infinite case.8
As an alternative to the conditional approach of the retrospective sampler, it is also
possible to use a marginal approach to update the Dirichlet process mixture model, by
integrating out the mixture probabilities p. (See for a discussion on the respective advantages
7. As for the initial number of mixture components, we start our algorithm with the true number in our simulations, and with a single component in our real data application, such that K(0) = {1}.
8. See details in Appendix B6.
Algorithm 1 Generic MDA sampler

Initialization. Assign starting values to the parameters and latent variables δ(0), Λ(0), θ(0), Σ(0), G(0), α(0), {ϑ(0)_k}_{k∈K(0)}, where K(0) is the initial set of non-empty mixture components. Mixture weights and stick-breaking variables V need no initialization, as they are not conditioned upon before their first update in step 4.

MCMC sampling. At each iteration t = 1, . . . , T, cycle through the following steps:

1) Sample Σ(t) from f(Σ | Y, δ(t−1), Λ(t−1), θ(t−1)). ▷ Eq. (B1)

2) Sample θ̃ from f(θ̃ | Y, δ(t−1), Λ(t−1), Σ(t), G(t−1), ϑ(t−1)), in steps:

   a) Sample µ and D from f(µ, D | ϑ(t−1)). ▷ Eqs. (21) and (22)

   b) Sample θ from f(θ | Y, δ(t−1), Λ(t−1), Σ(t), G(t−1), ϑ(t−1)). ▷ Eq. (B2)

   c) Compute θ̃_i = µ + D^{1/2} θ_i, for i = 1, . . . , N.

3) Sample δ(t), Λ(t) from f(δ, Λ | Y, θ̃, Σ(t), G(t−1), ϑ(t−1)) in steps:

   a) Sample µ, D from f(µ, D | θ̃, G(t−1), ϑ(t−1)). ▷ Eqs. (B4) and (B6)

   b) Sample Λ̃ from f(Λ̃ | Y, θ̃, Σ(t), µ, D). ▷ Eq. (B7)

   c) Sample δ̃ from f(δ̃ | Y, θ̃, Σ(t), Λ̃, µ, D). ▷ Eq. (B8)

   d) Compute and save Λ(t) = Λ̃ D^{1/2} and δ(t) = δ̃ + Λ̃ µ.

4) Sample ϑ̃, V(t) and G(t) from their conditional distributions, and compute the corresponding weights {β(t)_k}_{k∈K}. This is done differently for the finite and infinite mixture cases, see Algorithms 2 and 3, respectively.

5) Sample α(t) from f(α | G(t)). ▷ Eq. (B15)

6) Compute µ and D as in Eqs. (9) and (10), using ϑ̃ and {β(t)_k}_{k∈K} generated in step 4. Apply the transformation in Eqs. (12) and (13) to produce the parameters ϑ(t) corresponding to the identified model. Transform the latent factors back to the identified model as θ(t)_i = D^{−1/2}(θ̃_i − µ), for i = 1, . . . , N.

Post-processing. Perform a sign switch on the factor loading matrix, mixture means and mixture covariances, to ensure that the model is identified with respect to the signs of the latent factors and factor loadings (Frühwirth-Schnatter and Lopes, 2010; Conti et al., 2014).
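The post-processing sign switch can be sketched as follows for the loading matrix alone (the function is ours, for illustration; the full procedure flips the corresponding mixture means and covariances of each factor at the same time):

```python
import numpy as np

def sign_switch(Lambda_draws):
    """Flip signs so that the first nonzero loading in each column of
    Lambda is positive at every iteration. Sketch of the sign switch for
    the loading matrix only; in the complete procedure the mixture means
    and covariances of each flipped factor are switched accordingly.
    """
    out = []
    for Lam in Lambda_draws:
        Lam = Lam.copy()
        for q in range(Lam.shape[1]):
            nz = np.flatnonzero(Lam[:, q])        # indices of nonzero loadings
            if nz.size and Lam[nz[0], q] < 0:     # first nonzero loading negative?
                Lam[:, q] = -Lam[:, q]            # flip the whole column
        out.append(Lam)
    return out

# A draw whose second column starts with a negative loading gets flipped:
draws = sign_switch([np.array([[0.0, -2.0], [3.0, 1.0]])])
```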
Algorithm 2 Sampling the mixture parameters in the finite mixture case

Step 4 of Algorithm 1 consists of the following Gibbs steps:

a) Sample ϑ̃_k from f(ϑ̃_k | θ̃, G(t−1)) if N_k > 0, ▷ Eqs. (B9) and (B10)
   or from its prior if N_k = 0, for k = 1, . . . , K, ▷ Eqs. (4) and (5)
   with $N_k = \sum_{i=1}^{N} \mathbf{1}\{G_i = k\}$.

b) Sample V(t)_k from f(V_k | G(t−1), α(t−1)), for k = 1, . . . , K − 1. ▷ Eq. (B14)
   Set V(t)_K = 1.

c) Compute the resulting mixture weights p_k, ▷ Eq. (6)
   and set β(t)_k = p_k, for k = 1, . . . , K.

d) Sample G(t) from f(G | θ̃, p(t), ϑ̃). ▷ Eq. (B11)
and drawbacks of the conditional and marginal approaches.) In our Monte Carlo experiment
in Section 4.1.3, we consider Algorithms 7 and 8 of Neal (2000), and compare the results to
those obtained with the retrospective sampler.
The parameter transformation carried out in step 6 guarantees that the mixture parameters fulfill the identification requirements exactly at each step of the MCMC sampler. Importantly, the parameters and latent variables that are affected by the expansion are always sampled simultaneously with the working parameters. This ensures that the sampling scheme preserves the prior distribution of the parameters in the identified model, and does not distort the posterior distribution, as would happen if sampling were done conditional on the working parameters.
4 Illustrations with synthetic and real data
We run our sampler on simulated and real data to investigate how our approach performs,
and also compare the results to those obtained from different algorithms.
4.1 Simulation study
In Section 4.1.2, we test our algorithm on synthetic data generated from our generic model
in Eq. (1). We use the retrospective sampler for the inference of the infinite version of the
Dirichlet process in this first exercise. We then repeat the experiment in Section 4.1.3 in the
framework of a Monte Carlo study, to gauge the efficiency of our method and to compare it
to alternative algorithms.
Algorithm 3 Sampling the mixture parameters in the infinite mixture case

Step 4 of Algorithm 1 is done retrospectively:

a) Set kmax ≡ maxi{Gi}.

b) Sample ϑ̃_k from f(ϑ̃_k | θ̃, G(t−1)) if k ∈ I^(al), ▷ Eqs. (B9) and (B10)
   or from its prior if k ∈ I^(d), for k = 1, . . . , kmax. ▷ Eqs. (4) and (5)

c) Sample V(t)_k from f(V_k | G(t−1), α(t−1)), for k = 1, . . . , kmax. ▷ Eq. (B14)

d) Compute the resulting mixture weights p_k, for k = 1, . . . , kmax. ▷ Eq. (6)

e) Update G. If necessary, introduce new mixture components retrospectively. Set g = G(t−1) and cycle through the following steps, for i = 1, . . . , N, in random order:

   (i) Synchronize N⋆ and maxi{Gi}.

   (ii) Sample U_i ∼ U(0; 1).

   (iii) For j = 1, . . . , kmax + 1, evaluate: ▷ Eq. (B12)
   \[
   \sum_{l=0}^{j-1} q_i(g, l) < U_i \le \sum_{l=1}^{j} q_i(g, l),
   \]
   where q_i(g, l) is the probability mass function of assigning observation i, currently in mixture group g_i, to mixture component l, while the other observations remain assigned to their respective groups. By convention, q_i(g, 0) = 0 for all i = 1, . . . , N. See details in Appendix B6.2.

   (iv) If the condition is verified for some j ≤ N⋆: set g_i = j with probability α_i{g, g(i, j)}. ▷ Eq. (B13)
   Otherwise, leave g_i unchanged, set i ← i + 1 and go to step (i).

   (v) If the condition is not verified for any j ≤ N⋆: set N⋆ = N⋆ + 1 and j = N⋆.
   Sample V(t)_j and ϑ̃_j from their priors. ▷ Eqs. (4), (5) and (7)
   Compute $p_j = V^{(t)}_j \prod_{l=1}^{j-1} (1 - V^{(t)}_l)$ and go to step (iii).

f) Set G(t) = g and β(t)_k = N_k/N for k ∈ I^(al).
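Step e(v) extends the stick-breaking representation retrospectively: further stick-breaking variables are drawn from their prior (Beta(1, α) in the usual stick-breaking construction of the Dirichlet process), and the implied weights are appended. A minimal sketch of this extension (hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(2)

def extend_sticks(V, alpha, n_new):
    """Extend the stick-breaking representation with n_new prior draws
    V_k ~ Beta(1, alpha), and return all implied weights
    p_k = V_k * prod_{l<k} (1 - V_l). Hypothetical helper, for illustration.
    """
    V = np.concatenate([np.asarray(V, dtype=float),
                        rng.beta(1.0, alpha, size=n_new)])
    # p_1 = V_1, p_k = V_k * (1-V_1) * ... * (1-V_{k-1})
    p = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V, p

# Start from two sticks and add three more components retrospectively:
V, p = extend_sticks([0.5, 0.5], alpha=1.0, n_new=3)
```

The weights always stay nonnegative and sum to less than one, the remaining mass belonging to the components not yet instantiated.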
4.1.1 Setup of the experiment
Data generation. A data set with N = 2, 000 observations on Q = 9 manifest variables
is simulated with P = 2 latent factors, using the following values for the structural part of
the model:9
\[
\delta' = \begin{pmatrix} 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \end{pmatrix},
\]
\[
\Lambda' = \begin{pmatrix}
1.0 & 0.9 & 0.8 & 0.0 & 0.0 & 0.0 & 0.8 & 0.6 & 0.4 \\
0.0 & 0.0 & 0.0 & 1.0 & 0.9 & 0.8 & 0.4 & 0.6 & 0.8
\end{pmatrix}, \tag{25}
\]
\[
\Sigma = \operatorname{diag}\begin{pmatrix} 0.05 & 0.20 & 0.40 & 0.05 & 0.20 & 0.40 & 0.05 & 0.20 & 0.40 \end{pmatrix}.
\]
Each factor has three dedicated measurements, and the last three measurements load on both
factors. This type of structure is very common in the social sciences, where some particular
tests are designed to measure specific traits (think of an IQ test), while others capture several
features simultaneously (e.g., personality tests measuring self-esteem and self-confidence at
the same time). The idiosyncratic variances in Σ are unbalanced to vary the proportion of
noise affecting each measurement. The intercept terms are set to 0 to allow comparison with
the Gibbs sampler on the unrestricted Dirichlet process mixture model later in this section,
but these zero restrictions are not required in our approach.
The distribution of the latent factors is specified as a mixture of three Gaussian distri-
butions, parametrized as follows in the expanded version of the model:
\[
p_1 = 0.4, \qquad p_2 = 0.3, \qquad p_3 = 0.3,
\]
\[
\mu_1 = \begin{pmatrix} 0 & 0 \end{pmatrix}, \qquad
\mu_2 = \begin{pmatrix} 1.4 & -1.4 \end{pmatrix}, \qquad
\mu_3 = \begin{pmatrix} 1.4 & 1.4 \end{pmatrix},
\]
\[
\Phi_1 = \begin{pmatrix} 0.7 & 0.0 \\ 0.0 & 0.7 \end{pmatrix}, \qquad
\Phi_2 = \begin{pmatrix} 0.8 & 0.4 \\ 0.4 & 0.8 \end{pmatrix}, \qquad
\Phi_3 = \begin{pmatrix} 0.8 & -0.4 \\ -0.4 & 0.8 \end{pmatrix}.
\]
The mixture parameters are transformed according to Eqs. (12) and (13), using µ and D as
defined in Eqs. (9) and (10) with βk = pk, for k = 1, 2, 3, to standardize the latent factors
to have zero means and unit variances. The resulting joint distribution of the factors is
displayed in Fig. 1. It is not unlikely to encounter such a distribution in practice, where for
a low level of the first trait θ1, the population has a unimodal distribution conditional on the
other trait (θ2 | θ1), while this conditional distribution becomes bimodal on the other end of
9. This number of observations is similar to that of our real data set used in Section 4.2.
Figure 1: True joint distribution of the latent factors in the simulation study.

[Figure: three-dimensional surface and contour plot of the joint density of factor 1 and factor 2, both ranging from −2 to 2.]
the distribution of the first trait θ1. Common methods traditionally used in the empirical literature (i.e., standard factor analysis) cannot uncover such features of the data, and may thus generate misleading results.
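To make the standardization of the simulated factors concrete, the following sketch generates factors from the three-component mixture of Section 4.1.1 and rescales them to zero mean and unit variance, assuming that µ and D in Eqs. (9) and (10) reduce here to the mixture mean and the diagonal of the mixture covariance (law of total variance); this is an illustration, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(3)

# Expanded-model mixture from the simulation setup (Section 4.1.1).
p = np.array([0.4, 0.3, 0.3])
mu = np.array([[0.0, 0.0], [1.4, -1.4], [1.4, 1.4]])
Phi = np.array([[[0.7, 0.0], [0.0, 0.7]],
                [[0.8, 0.4], [0.4, 0.8]],
                [[0.8, -0.4], [-0.4, 0.8]]])

# Mixture mean m = sum_k p_k mu_k, and diagonal D of the mixture covariance
# sum_k p_k (Phi_k + mu_k mu_k') - m m' (law of total variance).
m = p @ mu
S = sum(p[k] * (Phi[k] + np.outer(mu[k], mu[k])) for k in range(3)) - np.outer(m, m)
D = np.diag(S)

# Simulate factors in the expanded model and standardize them to zero
# mean and unit variance, mirroring the transformation in Eqs. (12)-(13).
N = 2000
G = rng.choice(3, size=N, p=p)
theta = np.array([rng.multivariate_normal(mu[k], Phi[k]) for k in G])
theta = (theta - m) / np.sqrt(D)
```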
Identification. As explained in Section 2.2, each latent factor needs at least two dedicated
measurements to achieve nonparametric identification. The true factor loading matrix Λ
specified in Eq. (25) has three measurements loading exclusively on each factor, so it is sufficient to constrain these zero elements to their true values to identify the model nonparametrically.
Prior specification. The hyperparameters used in our simulation study are specified in
Table 1. The joint prior distribution of the mixture parameters µk and Φk depends on several
hyperparameters and has a complicated expression in the identified model, see Section 2.4.2
and Eq. (A1). It can, however, easily be simulated to understand the role of these prior
parameters.10 The two most important ones are the scale A0 of the prior covariance matrix
of the mixture means, and the number of degrees of freedom ν0 of the mixture covariance
matrices.
10. To do so, the mixture parameters are first sampled in the expanded version of the model (which is straightforward to do) and then transformed through Eqs. (12) and (13) to obtain the corresponding parameters in the identified version of the model.
Table 1: Hyperparameter specification in the simulation study.

Parameters                              Hyperparameter values
Intercept terms δ                       c0 = 10.0
Factor loadings Λ                       d0 = 10.0
Idiosyncratic variances Σ               a0 = 2, b0 = 1.0
Mixture means µk                        A0 = 1.0
Mixture covariance matrices Φk          ν0 = 3, s0 = 1.0
Concentration parameter α               g0 = 1.0, h0 = 1.0
Figure 2 shows the prior distributions of the parameters of the first mixture component
in the identified model, as well as of the corresponding correlation between the latent factors,
for different values of A0 and ν0 and keeping the remaining prior parameters fixed. While
both parameters have an impact on the prior of the correlation between the factors (see
left column, where larger values of A0 and smaller values of ν0 induce a larger correlation)
and on the mixture variances (right column), mixture means are only influenced by A0
(middle column). This scale parameter A0 affects mixture means and variances in opposite
directions: larger values of A0 imply more diffuse priors for the mixture means and priors for the mixture variances that are more concentrated towards zero, and vice versa, because of the identification restrictions that tie these parameters together. For the variance of the mixture component, the peak of the prior distribution observed at 1 in the right column of Fig. 2 is due to the cases where a single mixture component is simulated, which happens with prior probability 11% in this setup (see the right panel of Fig. 3). In these simulations, the concentration parameter α of the Dirichlet process is simulated from its prior as well, and therefore has an impact on the induced prior of the number of mixture components. Figure 3 shows the corresponding prior density of α, as well as the resulting numbers of mixture components (both displayed in gray), using a sample size N = 2,000.
Based on this simulation of the prior distribution, our specification in Table 1 appears rather noninformative. The correlation between the latent factors, with its inverted-U-shaped distribution, is bounded away from extreme cases of perfect collinearity, but the prior is still broad enough to allow a wide range of correlations. The prior on the concentration parameter α, which has been used in previous studies (see, e.g., Yau et al., 2011), favors rather small numbers of mixture components, without being too informative about this number.
MCMC tuning and inference. We run our algorithm with the retrospective sampler
for the infinite Dirichlet process (i.e., Algorithms 1 and 3), for a total number of 120, 000
Figure 2: Induced prior distribution on the parameters of the first mixture component in the identified model (k = 1), for different values of A0 and of ν0.

[Figure: two rows of three density panels (factor correlation, mixture component mean, mixture component variance); the top row varies A0 ∈ {0.1, 1, 10, 50} with ν0 = 3, the bottom row varies ν0 ∈ {2, 3, 4, 5} with A0 = 1.]

Notes: This figure shows the induced prior distribution of the correlation between the two factors, of the first mixture mean µ1[1], and of the mixture variance Φ1[11] of the first mixture component (k = 1). The last two columns would look similar for µ1[2] and Φ1[22], because of the symmetry of the prior. Prior parameters are specified as in Table 1, except for A0 and ν0, which are varied as indicated in the legends. The concentration parameter α and the corresponding number of mixture components are simulated from their priors using N = 2,000 observations. Simulations done with 100,000 random draws.
iterations, and discard the first 20,000 as a burn-in period. A sign switch is performed a posteriori on the factor loading matrix, mixture means and mixture covariances, to ensure that the model is identified with respect to the signs of the latent factors and factor loadings (see Frühwirth-Schnatter and Lopes, 2010; Conti et al., 2014). More precisely, signs are switched such that the first nonzero element in each column of Λ is always positive across MCMC iterations. This simple transformation is innocuous for the interpretation of the results.
4.1.2 Simulation results
First, we look at how the concentration parameter of the Dirichlet process is inferred from
the data, and at the number of mixture components generated by the algorithm. Figure 3
plots the posterior distributions of α (left panel) and of the number of non-empty mixture
components (right panel), against their corresponding prior distributions. A learning process
is clearly operating, as the posterior of α is concentrated around its mode at 0.42, and looks
different from the prior. The true number of mixture components (K = 3) is sampled
by the algorithm with posterior probability 0.135, which in this particular data set is not
the highest one. Models with larger numbers of mixture components are often visited,
but this overfitting is mostly due to small mixture components introduced during sampling
and reflects the noise in the data. The numbers of observations in the six largest mixture
components are, respectively, equal to 829, 606, 467, 71, 18 and 5, thus showing that three
mixture components dominate.
To get an idea of the fit of the estimated distribution to the true distribution of the latent factors, we rely on their posterior predictive distribution. More precisely, we plot this distribution, which is bivariate in our example, over a grid of $L_1 L_2$ pairs of points $x_{l_1,l_2} = (x^{1}_{l_1}, x^{2}_{l_2})'$, for $l_1 = 1, \dots, L_1$ and $l_2 = 1, \dots, L_2$. For each pair of points, we compute the probability density function of the mixture of Gaussians corresponding to Eq. (11), repeating this for each MCMC iteration $t = 1, \dots, T$:
\[
f^{(t)}(x_{l_1,l_2}) = \sum_{k \in K^{(t)}} \beta_k^{(t)}\, \phi\big( x_{l_1,l_2};\ \mu_k^{(t)}, \Phi_k^{(t)} \big),
\]
where $\phi(\,\cdot\,; \mu, \Phi)$ is the probability density function of the multivariate normal distribution with mean $\mu$ and covariance matrix $\Phi$. The set of mixture indices $K^{(t)}$, and thus also the number of mixture components, may change across MCMC iterations. In the infinite mixture case used in this experiment, only the alive mixture components are used, such that $\beta_k^{(t)} = N_k^{(t)}/N$, where $N_k^{(t)}$ is the number of observations allocated to mixture component $k$ at iteration $t$. Similarly, the finite mixture case would be accommodated by setting the weights $\beta_k^{(t)}$ equal
Figure 3: Posterior vs. prior distributions of the concentration parameter α of the Dirichlet process and of the number of non-empty mixture components in the simulation study. Model with N = 2,000 observations.

[Figure: left panel, density of the concentration parameter α; right panel, probabilities of the numbers of non-empty mixture components (1 to 17); each panel contrasts posterior and prior.]

Notes: Prior specified with g0 = h0 = 1 and simulated with 100,000 random draws.
to the sampled mixture probabilities $p_k$, for $k = 1, \dots, K$. We then average $f^{(t)}(x_{l_1,l_2})$ over all MCMC iterations for each grid point $(l_1, l_2)$, and also compute 95% highest posterior density intervals. The advantage of this procedure over the traditional approach, which would directly draw future values of θ from the posterior predictive distribution of the factors, is that it allows the conditional distribution of the latent factors to be displayed by fixing one of the two dimensions.
The corresponding results are displayed in Fig. 4. The top panel of this figure shows
that the algorithm recovers the true distribution of the latent factors quite well in comparison to Fig. 1.11 The bottom six panels plot different slices of the joint distribution, which are proportional to the corresponding conditional distributions f(θ1 | θ2) and f(θ2 | θ1) for different values of θ1 and θ2, together with the corresponding 95% highest posterior density intervals. Overall, the fit appears to be very good.
To gain more insights into the performance of our approach, we now repeat this experi-
ment and compare the results to those obtained from alternative approaches.
11. Note that for the joint distribution of the factors in the top panel, we do not show the highest posterior density intervals, as this would make the three-dimensional figure too difficult to read.
Figure 4: Joint posterior distribution of the latent factors in the simulation study. Model with N = 2,000 observations.

[Figure: top panel, three-dimensional surface of the joint posterior density of the two factors; below, six panels showing the slices f(θ1, θ2 = −1), f(θ1, θ2 = 0), f(θ1, θ2 = 1) and f(θ2, θ1 = −1), f(θ2, θ1 = 0), f(θ2, θ1 = 1), each comparing the posterior with the true density and 95% highest posterior density intervals.]
4.1.3 Monte Carlo experiment
We carry out a Monte Carlo experiment using the setup laid out above for models with
N = 100 and 2, 000 observations, for a total number of 100 replications each. We compare
the results obtained from the five following settings, where the one used in Section 4.1.1
corresponds to P5a:
P1 Gibbs sampler on the unrestricted truncated Dirichlet process (K = 5), using a post-processing stage to restore identification.12

P2 Truncated Dirichlet process Gaussian mixture model with MDA (K = 5), using the mixture weights p for the computation of the working parameters.

P3 Truncated Dirichlet process Gaussian mixture model with MDA (K = 5), using the observed mixture frequencies w for the computation of the working parameters.

P4 Gibbs sampler on the unrestricted infinite Dirichlet process combined with the retrospective sampler. Parameter restrictions on the intercept terms (δ = 0) and on two factor loadings (λ11 = λ42 = 1) set the scale and location of the factors.

P5 Infinite Dirichlet process Gaussian mixture model with MDA, using either the retrospective sampler (P5a) or Algorithms 7 and 8 of Neal (2000) (P5b and P5c, respectively).
These five approaches imply five different prior distributions, hence the labels P1–P5. Settings P2, P3 and P5 correspond to the methods introduced in the present paper. All approaches are intrinsically different, not only in how they preserve the original prior in the identified model or induce a different one, but also in how they achieve identification, and in whether they approximate the Dirichlet process (truncated versions) or deal directly with the infinite case. For all methods relying on a truncated version of the Dirichlet process, we use an upper bound of 5 mixture components.
All these settings allow inference on the full model with location and scale restrictions on the distribution of the latent factors to achieve identification, except P4: as it is not possible to sample the mixture parameters sequentially while ensuring at the same time that the identification restrictions on the means and variances of the factors are fulfilled, we instead set the intercept terms δ to zero and fix one element in each column of the factor loading matrix, i.e., λ11 = λ42 = 1. These types of restrictions, widely used in practice, are sufficient for identification but put an additional burden on the model. For example, it
12. Similar to Yang et al. (2010), but using only the variances of the factors as working parameters, for identification reasons; see Section 2.4.2.
might be too restrictive in some applications to fix some of the loadings to 1 a priori (see the discussion in Section 2.2).
To assess the results of the different approaches, we compare their efficiency in Figs. 5 and 6. These boxplots summarize inefficiency factors for selected parameters of the model and for the deviance of the estimated distributions of the latent factors θ and of the manifest variables Y. The inefficiency factor is a popular statistic used to monitor the mixing of the Markov chain. Computed as the inverse of the relative numerical efficiency (Geweke, 1989), it measures the number of MCMC iterations required by the sampler to provide the same numerical accuracy as a hypothetical independent and identically distributed (iid) sample from the target distribution.13 Unfortunately, the inefficiency factor is notoriously difficult to estimate, and can be unstable depending on the available number of MCMC iterations; see, for example, the discussion in Sokal (1997). We investigated this problem and found that although it can produce outliers, as seen in Figs. 5 and 6, it does not affect the overall picture or the general conclusions of our simulation study.
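To illustrate how this statistic behaves, the following Python sketch (our own illustrative code, not part of the paper's Fortran implementation; all names are ours) estimates the inefficiency factor of a chain from its empirical autocorrelations, truncating the sum at the first non-positive lag. More careful window choices are discussed in Sokal (1997).

```python
import numpy as np

def inefficiency_factor(chain, max_lag=None):
    """Inefficiency factor: 1 + 2 * sum of autocorrelations,
    truncated at the first non-positive empirical autocorrelation."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    if max_lag is None:
        max_lag = n // 10
    x = x - x.mean()
    var = np.dot(x, x) / n
    factor = 1.0
    for lag in range(1, max_lag + 1):
        rho = np.dot(x[:-lag], x[lag:]) / (n * var)
        if rho <= 0:  # crude initial-positive-sequence cutoff
            break
        factor += 2.0 * rho
    return factor

rng = np.random.default_rng(0)
# iid draws: inefficiency factor close to 1.
if_iid = inefficiency_factor(rng.standard_normal(50_000))
# AR(1) chain with phi = 0.9: theoretical value (1 + phi) / (1 - phi) = 19.
ar = np.empty(50_000)
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
if_ar = inefficiency_factor(ar)
```

An iid chain yields a value near 1, while the strongly autocorrelated AR(1) chain yields a much larger value, which is exactly the instability-versus-mixing trade-off the boxplots summarize.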
The deviance is a function of several relevant model parameters that summarizes the
accuracy of the approximation of the corresponding distributions. It has been used as a
measure of fit by Neal (2000), Green and Richardson (2001), and Papaspiliopoulos and
Roberts (2008) in their comparison studies. It is calculated as:
\[
D(\theta) = -2\sum_{i=1}^{N} \log\left[\sum_{k\in\mathcal{I}(a_l)} \frac{N_k}{N}\, f(\theta_i \mid \mu_k, \Phi_k)\right],
\qquad
D(Y) = -2\sum_{i=1}^{N} \log\left[\sum_{k\in\mathcal{I}(a_l)} \frac{N_k}{N}\, f(Y_i \mid \delta, \Lambda, \Sigma, \mu_k, \Phi_k)\right],
\]
using the true values of the simulated factors in the computation of D(θ). These two deviances can be evaluated at each iteration of the MCMC sampler, using the corresponding draws of the model parameters. We thereby obtain a posterior distribution of these two statistics, which we use to compute the corresponding inefficiency factors.
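A minimal sketch of the deviance computation for one MCMC draw, assuming a generic set of mixture means, covariances, and occupation counts N_k (the function names are ours, and the log-sum-exp trick is used for numerical stability):

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Naive multivariate normal log-density."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.inv(cov) @ diff)

def deviance(theta, means, covs, counts):
    """D(theta) = -2 sum_i log sum_k (N_k / N) f(theta_i | mu_k, Phi_k)."""
    N = counts.sum()
    total = 0.0
    for th in theta:
        logs = np.array([np.log(nk / N) + mvn_logpdf(th, mu, cov)
                         for nk, mu, cov in zip(counts, means, covs)])
        top = logs.max()  # log-sum-exp for numerical stability
        total += top + np.log(np.exp(logs - top).sum())
    return -2.0 * total

# Toy evaluation with a single bivariate standard-normal component.
rng = np.random.default_rng(1)
theta = rng.standard_normal((200, 2))
D = deviance(theta, means=[np.zeros(2)], covs=[np.eye(2)],
             counts=np.array([200]))
```

Evaluating this quantity at every iteration, with the corresponding parameter draws, produces the posterior distribution of the deviance used in the inefficiency-factor comparisons.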
Figures 5 and 6 reveal that overall, some of the approaches provide comparable results,
but others show some marked differences that are worth highlighting and explaining. First,
the inefficiency factors are very similar across approaches for the structural part of the model,
as can be seen from the three top panels for the selected parameters of the factor model.
The only exceptions are settings P1 and P4 that rely on a Gibbs sampler for the unrestricted
13. Lower inefficiency factors are better. For example, with an inefficiency factor of 5, 50,000 draws are required to provide a numerical accuracy equivalent to the one that could ideally be obtained with 10,000 iid draws.
Figure 5: Boxplots of inefficiency factors for selected parameters and statistics of interest. Monte Carlo experiments with 100 data sets of N = 100 observations each.
[Six boxplot panels: inefficiency factors for δ2, λ21, and σ22 (top row) and for α, D(Y), and D(θ) (bottom row), plotted across settings P1–P5c; the legend distinguishes truncated from infinite Dirichlet process samplers.]
Notes: Boxplots in the style of Tukey (1977): 25/50/75th percentiles (box), 1.5 inter-quartile range (whiskers), and outliers (dots). Inefficiency factors for one intercept term (δ2), one factor loading (λ21), one idiosyncratic variance (σ22), for the concentration parameter α of the Dirichlet process, and for the deviance of the distribution of the manifest variables (D(Y)) and the distribution of the latent factors (D(θ)). Settings: P1: Unrestricted truncated Dirichlet process with post-processing stage; P2: Truncated Dirichlet process Gaussian mixture model using mixture weights p for computation of working parameters; P3: Same as P2 but using observed mixture frequencies w; P4: Gibbs sampler on unrestricted infinite Dirichlet process, with parameter restrictions on intercept terms and two loadings for identification; P5: Infinite Dirichlet process Gaussian mixture model with MDA with retrospective sampler (P5a), Algorithm 7 (P5b) or 8 (P5c) of Neal (2000). See beginning of Section 4.1.3 for full details. For P4, the intercepts are fixed to 0 for identification purposes, hence the lack of boxplot for δ2. Monte Carlo experiments based on 100 replications, using the same 100 data sets across the seven different settings.
Figure 6: Boxplots of inefficiency factors for selected parameters and statistics of interest. Monte Carlo experiments with 100 data sets of N = 2000 observations each.
[Six boxplot panels: inefficiency factors for δ2, λ21, and σ22 (top row) and for α, D(Y), and D(θ) (bottom row), plotted across settings P1–P5c; the legend distinguishes truncated from infinite Dirichlet process samplers.]
Notes: Boxplots in the style of Tukey (1977): 25/50/75th percentiles (box), 1.5 inter-quartile range (whiskers), and outliers (dots). Inefficiency factors for one intercept term (δ2), one factor loading (λ21), one idiosyncratic variance (σ22), for the concentration parameter α of the Dirichlet process, and for the deviance of the distribution of the manifest variables (D(Y)) and the distribution of the latent factors (D(θ)). Settings: P1: Unrestricted truncated Dirichlet process with post-processing stage; P2: Truncated Dirichlet process Gaussian mixture model using mixture weights p for computation of working parameters; P3: Same as P2 but using observed mixture frequencies w; P4: Gibbs sampler on unrestricted infinite Dirichlet process, with parameter restrictions on intercept terms and two loadings for identification; P5: Infinite Dirichlet process Gaussian mixture model with MDA with retrospective sampler (P5a), Algorithm 7 (P5b) or 8 (P5c) of Neal (2000). See beginning of Section 4.1.3 for full details. For P4, the intercepts are fixed to 0 for identification purposes, hence the lack of boxplot for δ2. Monte Carlo experiments based on 100 replications, using the same 100 data sets across the seven different settings.
Dirichlet process: inefficiency appears to be three to six times lower for the factor loading λ21 in the experiments with N = 2,000 observations, but less so with N = 100. This efficiency loss affecting our approaches (P2/P3/P5) can be explained by the marginal data augmentation scheme we rely on, where the prior distribution of the working parameters is a conditional prior distribution: it includes the mixture parameters of the identified model ϑ in its conditioning set, see Eqs. (21) and (22). This does not invalidate the approach (van Dyk, 2010), but slightly deteriorates the efficiency of the algorithm. This is the price to pay for a sampler that preserves the original prior distribution of the structural part of the factor model (especially for the loadings, contrary to P1), and that does not require additional parameter restrictions on these parameters (contrary to P4). This computational cost, however, turns out to be rather modest given the benefits of our approaches.
The efficiency of the samplers for the nonparametric part of the model can be assessed from the three bottom panels of Figs. 5 and 6. The concentration parameter α of the Dirichlet process influences the number of mixture components. Therefore, the corresponding inefficiency factor gives an idea of how well the sampler manages to introduce and remove mixture components to fit the data nonparametrically. Clearly, the approaches based on an infinite Dirichlet process are outperformed by those using a truncated version of the process. This result, however, is not reflected in the approximation of the distributions of the latent factors θ and of the manifest variables Y: the corresponding deviances D(θ) and D(Y) are fairly similar across approaches, with the exception of P2 (truncated Dirichlet process using the mixture probabilities p to compute the working parameters), which becomes much more efficient when N increases. The downside of setting P2 seems to be a larger variability of the efficiency for the factor loadings and idiosyncratic variances, as revealed by the higher boxes for λ21 and σ22 in Fig. 6. This could indicate a trade-off between the efficiency of the sampler for the inference of the structural part of the model (factor model) and its efficiency for the inference of the nonparametric part (distribution of the latent factors).
Finally, it is interesting to note that the conditional approach (P5a, retrospective sampler) and the marginal approaches (P5b/P5c, Algorithms 7 and 8 of Neal, 2000) provide similar results in terms of deviance of the estimated distributions of θ and Y, independently of the number of observations. However, the latter are slightly more efficient when it comes to the inference of the number of mixture components, as shown by the inefficiency factors of the concentration parameter α. This observation was also made by Papaspiliopoulos and Roberts (2008). It also appears that increasing the number of observations improves the efficiency of the sampler for the conditional approaches (P4/P5a), whereas the marginal approaches (P5b/P5c) remain stable when N changes (compare the bottom right panels of Figs. 5 and 6).
These simulation results shed light on the properties of our approach, how it compares to alternative approaches, and how the different versions of our algorithm perform. In this respect, they provide guidance on the choice of the version to use, depending on the requirements of the model to be estimated (e.g., relevance of using an infinite Dirichlet process rather than a truncated one, importance of using a conditional vs. a marginal approach). Most importantly, bear in mind that the algorithms we propose are the only ones that safeguard the identification of the model without resorting to additional restrictions, while preserving the original prior distribution of the model parameters at the same time.
4.2 Empirical example
Many empirical applications in economics rely on the assumption of normality of the latent
factors. This usually makes inference straightforward to carry out, and facilitates interpre-
tation. It is, however, reasonable to question the relevance of this assumption in practice.
To illustrate this problem, we estimate a simple factor model using data from the British
Cohort Study (BCS).
Data. The British Cohort Study (BCS) is a longitudinal survey that follows all babies born in one particular week of April 1970 in the United Kingdom. It includes a large number of measurements on cognitive abilities, socio-emotional traits, and behavioral and physical development at different stages of the life cycle of the surveyed individuals, and therefore represents a unique opportunity for psychologists and economists to study human capital development. This data set has been used in economics, for example, by Conti et al. (2014) and Uysal (2015).
For the sake of simplicity, we restrict our analysis to two dimensions and focus on cognitive
ability and behavioral problems in this section. The first factor is measured by 7 test scores,
while the second is captured by 16 measurements related to the Rutter and Conners scales.
The sample contains 2,080 individuals.
Inference. We run our algorithm with the retrospective sampler for the inference of the
infinite Dirichlet process mixture model (P5a). We do not incorporate any strong prior
information into the model and use the same prior specification as in our simulation study,
see Table 1. To identify the structural part of the model, a dedicated structure is assumed,
where the cognitive measurements load on the first factor and the behavioral problem measurements load on the second one. We therefore create two clusters of measurements, and the underlying latent factors are allowed to be correlated. The sampler is run for 1,020,000 iterations, and the first 20,000 are discarded as a burn-in period. The sampling is repeated ten times with different starting values. The large number of MCMC iterations and the different starting values are used to make sure convergence is achieved. In the following, we only present the results from a single run to keep the figures simple; the results are virtually identical across the ten runs.

Figure 7: Posterior joint distribution of the latent factors in the empirical example with the BCS data.
[Surface plot of the joint posterior density of factor 1 (cognition) and factor 2 (behavioral problems), both shown over roughly the range −2 to 2.]
Empirical results. Figure 7 shows the joint posterior distribution of the latent factors.
This distribution exhibits several modes, and a fat tail for the factor θ2 capturing behavioral
problems. The multiplicity of modes can also be seen in Fig. 8, which shows that the
algorithm visits models with 6 mixture components most often, with a posterior probability of
0.17. Overall, the retrospective sampler switches very quickly between models with different
numbers of mixture components, as shown by the trace plot at the bottom of this figure.
Models containing between 3 and 35 mixture components are visited, but the sampler favors
relatively sparse solutions, as models with up to 10 mixture components are produced with
posterior probability 79%. This result is confirmed by looking at the posterior distribution
of the concentration parameter α of the Dirichlet process, shown in the upper left panel of
this figure. Its mode is equal to 0.73, which shows evidence for a rather small number of
mixture components compared to the number of observations.
This simple example reveals that the normality assumption is likely to be violated in this data set. Relaxing this distributional assumption allows the sampler to explore alternative solutions with non-standard distributions that are supported by the data. The misspecification of the model that results from the standard Gaussian assumption can potentially contaminate the interpretation of the results, and can also affect estimation if this model is subsequently used to measure the impact of these latent factors on economic outcomes.
Figure 8: BCS data: Posterior distribution of the concentration parameter of the Dirichlet process, and posterior distribution and trace plot of the number of non-empty mixture components.
5 Conclusion
This paper introduces a new approach to factor analysis with non-normal factors that draws on the literature on Bayesian nonparametric methods. It extends these approaches by placing the formal identification of the factor model at the core of the inferential procedure, guaranteeing that the algorithm only produces identified models during sampling. This is achieved by implementing a new sampling scheme for mixtures of normals with location and scale restrictions based on marginal data augmentation, combined with a retrospective MCMC sampler for the Dirichlet process mixture model.
A simulation study is carried out and provides very encouraging results. The sampler successfully retrieves the distribution of the latent factors nonparametrically, and exhibits good properties in terms of convergence and mixing. A real data example illustrates the relevance of the methodology: the latent factors extracted from the data appear to be highly non-normal. This provides evidence that the normality assumption can be questioned in practice. We therefore advocate relaxing this assumption whenever possible, and leave it for further research to investigate the impact a potential misspecification may have on the results.
Combining our Bayesian nonparametric approach with other approaches in factor analysis
has great potential to allow for the inference of richer structures that can better explain
empirical data. Embedding a nonparametric approach into a structural model, however,
raises some questions about the properties of the resulting sampler, especially in terms of
efficiency. These important questions are not limited to our particular setup, but are likely
to arise in any framework where a structural model is augmented with a nonparametric
approach for the estimation of the unknown distribution of one of its components. These
questions are currently being investigated further in ongoing projects.
Acknowledgments
This paper was previously circulated under the title “A Bayesian Nonparametric Approach
to Factor Analysis with Non-Gaussian Factors”. It was presented at the European Seminar on Bayesian Econometrics (ESOBE, Venice, Italy), at the 69th European Meeting of the Econometric Society (ESEM, Geneva, Switzerland), at the World Meeting of the International Society for Bayesian Analysis (ISBA 2016, Sardinia, Italy), at the Department of Economics Seminar at The University of Sydney (Australia), at the Bayesian Analysis and Modeling Summer Workshop at The University of Melbourne (Australia), and at the Research Workshop of the Centre for Applied Microeconometrics (CAM, Copenhagen, Denmark). The authors are very grateful for all the comments received at these conferences and seminars, which helped substantially improve the paper.
Computations were made with our own code written in Fortran 2008, combined with the R programming language (R Core Team, 2017).14 Graphics were generated with the R package ggplot2 (Wickham, 2009).
Remi Piatek’s research was funded by the Danish Council for Independent Research and
the Marie Curie programme COFUND under the European Union’s Seventh Framework
Programme for research, technological development and demonstration, Grant-ID DFF—
4091-00246.
A Prior distribution
Proposition 2. Consider the parameters µ and D defined in Eqs. (9) and (10) and the one-to-one mappings between the expanded-model parameters and ϑ as defined in Eqs. (12) and (13). Then, the normal-inverse-Wishart prior distribution specified on ϑk = {µk, Φk} in Eqs. (4) and (5), for k ∈ K, where in the finite case K = {1, . . . , K} and in the infinite case K = I(a_l), implies that

\[
f(\vartheta \mid \nu_0, A_0, \beta) \propto
\Bigg|\sum_{k\in\mathcal{K}} \Phi_k^{-1}\Bigg|^{-\frac{1}{2}}
\Bigg|\prod_{k\in\mathcal{K}} \Phi_k\Bigg|^{-\frac{\nu_0+P+2}{2}}
\Bigg(\prod_{j=1}^{P} \sum_{k\in\mathcal{K}} \big[\Phi_k^{-1}\big]_{[jj]}\Bigg)^{-\frac{|\mathcal{K}|\nu_0}{2}}
\tag{A1}
\]
\[
\times \exp\Bigg\{-\frac{1}{2A_0}\Bigg[\sum_{k\in\mathcal{K}} \mu_k'\Phi_k^{-1}\mu_k
- \Bigg(\sum_{k\in\mathcal{K}} \Phi_k^{-1}\mu_k\Bigg)'
\Bigg(\sum_{k\in\mathcal{K}} \Phi_k^{-1}\Bigg)^{-1}
\Bigg(\sum_{k\in\mathcal{K}} \Phi_k^{-1}\mu_k\Bigg)\Bigg]\Bigg\}
\times \mathbb{1}\{\mu = 0\}\,\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\},
\]

where [·]_{[jj]} denotes the jth diagonal element of the corresponding matrix, |K| is the cardinality of the set K, and the conditions in the two indicator functions in the last line enforce the constraints on the location and scale of the latent factors, see Eqs. (14) and (15).
Note that the dependence on the mixture weights β = (β1, . . . , βK) is hidden in the
constraints imposed via the indicator functions. Also, note that this density does not depend
on the scaling parameter s0 of the inverse-Wishart distribution of the covariance matrices
in the auxiliary model. This parameter controls the degree of inflation of the parameters in
the augmented model, but has no influence on the prior distribution of the parameters in
the identified model.
14. The methodology introduced in this paper will be released as an extension to the R package BayesFM
available on CRAN at https://cran.r-project.org/package=BayesFM upon publication of this article.
A1 Proof of Propositions 1 and 2
Induced joint prior distribution. The joint distribution of {µ, D, µK, ΦK} is derived from the distribution of the mixture parameters in the expanded model using a transformation of random variables. The restrictions in Eqs. (14) and (15) imply that:

\[
f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})
= f(\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})\,
\underbrace{f(\mu_k \mid \mu, D, \mu_{-k}, \Phi_{\mathcal{K}})}_{\mathbb{1}\{\mu = 0_P\}}\,
\underbrace{f(\Phi_k^{D} \mid \mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})}_{\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\}}
\tag{A2}
\]

where µK = {µk}k∈K and ΦK = {Φk}k∈K, and Φ_k^D and Φ_k^L denote, respectively, the diagonal elements and the lower triangular part (excluding the diagonal elements) of Φk. The first density is obtained from the change of variables (µK, ΦK) → (µ, D, µ−k, Φ−k, Φ_k^L), such that:

\[
f(\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})
= f(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,
\mathcal{J}\{(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}}) \to (\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})\},
\tag{A3}
\]

where J{(·) → (·)} is the Jacobian of the corresponding transformation.
Since the mixture parameters {µk, Φk} are assumed to be independent across mixture components and to follow a normal-inverse-Wishart distribution for each k ∈ K in the expanded model, see Eqs. (4) and (5), the joint distribution of the corresponding parameters in the identified model µK and ΦK and of the working parameters µ and D is derived from Eqs. (A2) and (A3) as follows, and without loss of generality:15

\[
\begin{aligned}
f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})
&= f(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,
\mathcal{J}\{(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}}) \to (\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})\}
\times \mathbb{1}\{\mu = 0_P\}\,\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\},\\
&\propto \prod_{k\in\mathcal{K}} |\Phi_k|^{-\frac{1}{2}}
\exp\Big\{-\frac{1}{2A_0}\,\mu_k'\Phi_k^{-1}\mu_k\Big\}\,
|\Phi_k|^{-\frac{\nu_0+P+1}{2}}
\exp\Big\{-\frac{s_0}{2}\operatorname{tr}\big(\Phi_k^{-1}\big)\Big\}
\times \prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|(P+2)-3}{2}}\,
\mathbb{1}\{\mu = 0_P\}\,\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\},\\
&\propto \exp\Big\{-\frac{1}{2A_0}\Big(\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\Big) D^{-\frac{1}{2}}\mu
+ 2\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\mu_k\Big)\Big)\Big\}
&& \text{(A4)}\\
&\quad\times \prod_{j=1}^{P} D_j^{-\frac{|\mathcal{K}|\nu_0+1}{2}-1}
\exp\Big\{-\frac{s_0}{2}\sum_{k\in\mathcal{K}}\big[\Phi_k^{-1}\big]_{[jj]} D_j^{-1}\Big\}
&& \text{(A5)}\\
&\quad\times \prod_{k\in\mathcal{K}} |\Phi_k|^{-\frac{\nu_0+P+2}{2}}
\exp\Big\{-\frac{\mu_k'\Phi_k^{-1}\mu_k}{2A_0}\Big\}
\times \mathbb{1}\{\mu = 0_P\}\,\mathbb{1}\{\operatorname{diag}(\Phi) = \iota_P\},
\end{aligned}
\]

where the Jacobian of the transformation is proportional to \(\prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|(P+2)-3}{2}}\), see Appendix A2.
The kernel can be factorized as

\[
f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})
= f(\mu \mid D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,
f(D \mid \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,
f(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}}),
\]

and the three distributions on the right-hand side can be retrieved as follows. The conditional distribution of µ is obtained from Eq. (A4), which is the kernel of a Gaussian distribution:

\[
\mu \mid D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}}, A_0 \sim
\mathcal{N}\Bigg(
-D^{\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\Big)^{-1}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\mu_k\Big);\;
A_0\, D^{\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\Big)^{-1} D^{\frac{1}{2}}
\Bigg).
\]

15. Because of the identification constraints on the mixture means and variances, the mean and variance of one mixture component are redundant and can be discarded. Which component is discarded does not affect the results.
The conditional distribution of D is obtained by integrating out µ, using the kernel in Eq. (A5) and completing the normalizing constant that depends on D in Eq. (A4):

\[
\begin{aligned}
f(D \mid \mu_{\mathcal{K}}, \Phi_{\mathcal{K}}, \nu_0, s_0)
&\propto \int f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,\mathrm{d}\mu,\\
&\propto |D|^{\frac{1}{2}} \prod_{j=1}^{P} D_j^{-\frac{|\mathcal{K}|\nu_0+1}{2}-1}
\exp\Big\{-\frac{s_0}{2}\sum_{k\in\mathcal{K}}\big[\Phi_k^{-1}\big]_{[jj]} D_j^{-1}\Big\},\\
&\propto \prod_{j=1}^{P} D_j^{-\frac{|\mathcal{K}|\nu_0}{2}-1}
\exp\Big\{-\frac{s_0}{2}\sum_{k\in\mathcal{K}}\big[\Phi_k^{-1}\big]_{[jj]} D_j^{-1}\Big\},
\end{aligned}
\]

which results in a product of kernels of inverse-Gamma distributions:

\[
D_j \mid \Phi_{\mathcal{K}}, \nu_0, s_0 \sim
\mathcal{IG}\Bigg(\frac{|\mathcal{K}|\nu_0}{2};\;
\frac{s_0}{2}\sum_{k\in\mathcal{K}}\big[\Phi_k^{-1}\big]_{[jj]}\Bigg),
\]

for j = 1, . . . , P.
Finally, the kernel of the marginal distribution of the mixture parameters in the identified model is obtained by integrating both µ and D out of the joint distribution:

\[
f(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}} \mid A_0, \nu_0) \propto
\iint f(\mu, D, \mu_{\mathcal{K}}, \Phi_{\mathcal{K}})\,\mathrm{d}\mu\,\mathrm{d}D,
\]

which produces the kernel in Eq. (A1).
A2 Jacobian of the transformation
The Jacobian corresponding to the change of variables that moves from the expanded model to the identified model can be derived in several steps. Because of the restrictions on the parameters of the identified model (µ = 0_P and diag(Φ) = ι_P), one of the mixture means and the diagonal elements of one of the covariance matrices are redundant in the parameter transformation and can be left aside in the derivation. The subscript −k indicates that the kth element of the corresponding set is left out, e.g., µ−k = {µl | l ∈ K, l ≠ k}. We denote by Φ_k^L the lower triangular elements of Φk, excluding the diagonal elements. Without loss of generality, we derive the Jacobian for the case where the mean and the diagonal elements of the covariance matrix of the kth mixture component are left aside:

\[
\begin{aligned}
\mathcal{J}\{(\mu_{\mathcal{K}}, \Phi_{\mathcal{K}}) \to (\mu, D, \mu_{-k}, \Phi_{-k}, \Phi_k^{L})\}
&= \Big(\frac{1}{p_k}\Big)^{P}
\times \Big(\frac{1}{p_k}\Big)^{\frac{P(P+1)}{2}}
\times \prod_{j=1}^{P} D_j^{\frac{P-1}{2}}
\times \prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|-1}{2}}
\times \prod_{j=1}^{P} D_j^{\frac{(P+1)(|\mathcal{K}|-1)}{2}}
\times p_k^{\frac{P(P-1)}{2}}
&& \text{(A6)}\\
&= \Big(\frac{1}{p_k}\Big)^{2P} \prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|(P+2)-3}{2}},\\
&\propto \prod_{j=1}^{P} D_j^{\frac{|\mathcal{K}|(P+2)-3}{2}},
\end{aligned}
\]

where line (A6) collects the Jacobians of the elementary changes of variables into which the full transformation decomposes; the factor corresponding to the step that introduces the working parameters D is derived as in Zhang et al. (2006).
B Details on MCMC Sampler
This appendix provides technical details on the MCMC sampler. These steps are presented in generic form and apply to both the finite and the infinite case, where in the former K = {1, . . . , K} and in the latter K = I(a_l), and |K| is the cardinality of K.
B1 Sampling the idiosyncratic variances (step 1)
The inverse-Gamma prior in Eq. (18) provides the following posterior, for q = 1, . . . , Q:

\[
\sigma_q^2 \mid Y, \theta, \delta, \Lambda, a_0, b_0 \sim
\mathcal{IG}\Bigg(a_0 + \frac{N}{2};\;
b_0 + \frac{1}{2}\sum_{i=1}^{N}\big(Y_{qi} - \delta_q - \Lambda_q'\theta_i\big)^2\Bigg).
\tag{B1}
\]
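A minimal sketch of this update, assuming the residuals Y_{qi} − δ_q − Λ_q'θ_i have already been computed (the helper name draw_idio_var is ours); an inverse-Gamma draw is obtained as the reciprocal of a Gamma draw:

```python
import numpy as np

def draw_idio_var(rng, resid_q, a0, b0):
    """sigma_q^2 | ... ~ IG(a0 + N/2, b0 + 0.5 * sum of squared residuals).
    numpy's Gamma is parametrized by (shape, scale), so scale = 1/rate."""
    N = len(resid_q)
    shape = a0 + 0.5 * N
    rate = b0 + 0.5 * np.dot(resid_q, resid_q)
    return 1.0 / rng.gamma(shape, 1.0 / rate)

# Toy residuals with unit variance: posterior draws concentrate near 1.
rng = np.random.default_rng(7)
resid = rng.standard_normal(500)
draws = np.array([draw_idio_var(rng, resid, a0=2.0, b0=1.0)
                  for _ in range(200)])
```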
41
B2 Sampling the latent factors (step 2b)
For i = 1, . . . , N :
θi | Yi, Gi, ϑ, δ, Λ,Σ ∼ N (Bθibθi ; Bθi) , B−1θi = Λ′Σ−1Λ+ Φ−1Gi , (B2)
bθi = Λ′Σ−1(Yi − δ) + Φ−1GiµGi .
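The conditional draw of (B2) can be sketched as follows, with illustrative toy values for the structural parameters (all names and dimensions are ours, not the paper's):

```python
import numpy as np

def draw_factor(rng, y, delta, Lam, Sigma_inv, mu_g, Phi_g_inv):
    """One draw of theta_i from N(B b, B), where
    B^{-1} = Lam' Sigma^{-1} Lam + Phi_g^{-1} and
    b = Lam' Sigma^{-1} (y - delta) + Phi_g^{-1} mu_g."""
    B_inv = Lam.T @ Sigma_inv @ Lam + Phi_g_inv
    B = np.linalg.inv(B_inv)
    b = Lam.T @ Sigma_inv @ (y - delta) + Phi_g_inv @ mu_g
    chol = np.linalg.cholesky(B)
    return B @ b + chol @ rng.standard_normal(len(b))

# Toy setup: Q = 3 manifest variables, P = 2 factors (values illustrative).
rng = np.random.default_rng(2)
Lam = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 1.0]])
Sigma_inv = np.eye(3)            # unit idiosyncratic precisions
delta = np.zeros(3)
mu_g = np.zeros(2)
Phi_g_inv = np.eye(2)
y = np.array([1.0, 0.5, -1.0])
theta_draw = draw_factor(rng, y, delta, Lam, Sigma_inv, mu_g, Phi_g_inv)
```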
B3 Sampling the working parameters conditional on the latent
factors in the expanded model (step 3a)
The joint conditional distribution of the working parameters, given their prior distributions expressed in Eqs. (21) and (22), is proportional to:

\[
\begin{aligned}
p(\mu, D \mid \theta, G, \vartheta)
&\propto p(\theta \mid \mu, D, G, \vartheta)\, p(\mu \mid D, \vartheta)\, p(D \mid \vartheta),\\
&\propto \prod_{i=1}^{N} \big| D^{\frac{1}{2}}\Phi_{G_i} D^{\frac{1}{2}} \big|^{-\frac{1}{2}}
\exp\Big\{-\frac{1}{2}\sum_{i=1}^{N}
\big(\theta_i - \mu - D^{\frac{1}{2}}\mu_{G_i}\big)'
\big(D^{\frac{1}{2}}\Phi_{G_i} D^{\frac{1}{2}}\big)^{-1}
\big(\theta_i - \mu - D^{\frac{1}{2}}\mu_{G_i}\big)\Big\}\\
&\quad\times |D|^{-\frac{1}{2}}
\exp\Big\{-\frac{1}{2A_0}\Big[\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\Big) D^{-\frac{1}{2}}\mu
+ 2\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}\Phi_k^{-1}\mu_k\Big)\Big]\Big\}\\
&\quad\times |D|^{-\frac{|\mathcal{K}|\nu_0}{2}-1}
\exp\Big\{-\frac{s_0}{2}\sum_{k\in\mathcal{K}}\operatorname{tr}\big(\Phi_k^{-1} D^{-1}\big)\Big\},\\
&\propto \exp\Big\{-\frac{1}{2}\Big[\mu' D^{-\frac{1}{2}}\Big(\sum_{k\in\mathcal{K}}(N_k + A_0^{-1})\Phi_k^{-1}\Big) D^{-\frac{1}{2}}\mu
&& \text{(B3)}\\
&\qquad\qquad - 2\mu' D^{-\frac{1}{2}}\sum_{k\in\mathcal{K}}\Phi_k^{-1}
\Big(\Big[D^{-\frac{1}{2}}\sum_{i\in I_k}\theta_i\Big] - (N_k + A_0^{-1})\mu_k\Big)\Big]\Big\}\\
&\quad\times |D|^{-\frac{|\mathcal{K}|\nu_0+N+1}{2}-1}
\exp\Big\{-\frac{1}{2}\sum_{k\in\mathcal{K}}\operatorname{tr}\Big(D^{-\frac{1}{2}}\Phi_k^{-1} D^{-\frac{1}{2}}
\Big[\sum_{i\in I_k}\theta_i\theta_i' + s_0 I_P\Big]\Big)
+ \sum_{k\in\mathcal{K}}\mu_k'\Phi_k^{-1} D^{-\frac{1}{2}}\sum_{i\in I_k}\theta_i\Big\}.
\end{aligned}
\]
This provides the kernel of a normal distribution for µ conditional on D and on the remaining parameters:

\[
\mu \mid \theta, D, G, \vartheta \sim
\mathcal{N}\Big(D^{\frac{1}{2}} B_2 \big(B_1(D) - B_3\big);\;
D^{\frac{1}{2}} B_2 D^{\frac{1}{2}}\Big),
\tag{B4}
\]

with:

\[
B_1(D) = \sum_{k\in\mathcal{K}} \Phi_k^{-1} D^{-\frac{1}{2}} \sum_{i\in I_k} \theta_i,
\qquad
B_2^{-1} = \sum_{k\in\mathcal{K}} (N_k + A_0^{-1})\,\Phi_k^{-1},
\qquad
B_3 = \sum_{k\in\mathcal{K}} (N_k + A_0^{-1})\,\Phi_k^{-1}\mu_k.
\]
As for the other working parameters D, the kernel of their conditional distribution is obtained by integrating µ out of the joint distribution, completing the normalizing constant of Eq. (B3):

\[
\begin{aligned}
p(D \mid \theta, G, \vartheta)
&= \int p(\mu, D \mid \theta, G, \vartheta)\,\mathrm{d}\mu,\\
&\propto |D|^{-\frac{|\mathcal{K}|\nu_0+N}{2}-1}
\exp\Big\{\frac{1}{2} B_1(D)' B_2 \big(B_1(D) - 2B_3\big)
- \frac{1}{2}\sum_{k\in\mathcal{K}}\operatorname{tr}\Big(D^{-\frac{1}{2}}\Phi_k^{-1} D^{-\frac{1}{2}}
\Big[\sum_{i\in I_k}\theta_i\theta_i' + s_0 I_P\Big]\Big)\\
&\qquad + \sum_{k\in\mathcal{K}}\mu_k'\Phi_k^{-1} D^{-\frac{1}{2}}\sum_{i\in I_k}\theta_i\Big\},
&& \text{(B5)}
\end{aligned}
\]

which is not the kernel of a known distribution. However, D can be simulated with a Metropolis-Hastings step.
Metropolis-Hastings step to sample D. As a proposal distribution for each of the diagonal elements j = 1, . . . , P of D, a log-normal distribution is used, parametrized such that its mode is equal to D_j:

\[
D_j^{\star} \mid (D_j, \rho^2) \sim \ln\mathcal{N}\big(\ln D_j + \rho^2;\; \rho^2\big),
\qquad
q(D^{\star} \mid D, \rho^2) \propto \prod_{j=1}^{P} \frac{1}{D_j^{\star}}
\exp\Big\{-\frac{1}{2\rho^2}\big(\ln D_j^{\star} - \ln D_j - \rho^2\big)^2\Big\}.
\tag{B6}
\]

The P proposed values D⋆ are accepted as new draws for D with probability:

\[
\alpha(D^{\star} \mid D) = \min\Big\{1;\;
\frac{f(D^{\star} \mid \theta, G, \vartheta)}{f(D \mid \theta, G, \vartheta)}\,
\frac{q(D \mid D^{\star}, \rho^2)}{q(D^{\star} \mid D, \rho^2)}\Big\},
\]

where the first ratio can be computed using Eq. (B5), while the second ratio, after some algebra, simplifies to

\[
\ln \frac{q(D \mid D^{\star}, \rho^2)}{q(D^{\star} \mid D, \rho^2)}
= \sum_{j=1}^{P}\big(\ln D_j - \ln D_j^{\star}\big).
\]

The parameter ρ² is a tuning parameter that influences the acceptance rate of the Metropolis-Hastings algorithm. We use ρ² = 1/N in our applications.
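The proposal and acceptance mechanics can be sketched as follows. Note that the target density below is a simple stand-in (independent inverse-Gamma kernels), not the kernel of Eq. (B5), which depends on the full set of mixture parameters; it is used only to exercise the update, and all names are ours.

```python
import numpy as np

def propose_D(rng, D, rho2):
    """Log-normal proposal with mode at the current D_j:
    D*_j ~ lnN(ln D_j + rho2, rho2)."""
    return np.exp(np.log(D) + rho2 + np.sqrt(rho2) * rng.standard_normal(len(D)))

def mh_step_D(rng, D, log_target, rho2):
    """One Metropolis-Hastings update of D; the log proposal ratio
    simplifies to sum_j (ln D_j - ln D*_j)."""
    D_star = propose_D(rng, D, rho2)
    log_alpha = (log_target(D_star) - log_target(D)
                 + np.sum(np.log(D) - np.log(D_star)))
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return D_star, True
    return D, False

def log_target(D):
    # Stand-in target (NOT Eq. B5): independent inverse-Gamma kernels.
    return np.sum(-3.0 * np.log(D) - 1.0 / D)

rng = np.random.default_rng(3)
D = np.ones(2)
accepts = 0
for _ in range(2000):
    D, acc = mh_step_D(rng, D, log_target, rho2=0.1)
    accepts += acc
accept_rate = accepts / 2000
```

In the actual sampler, log_target would evaluate the kernel of Eq. (B5), and ρ² = 1/N controls the acceptance rate.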
B4 Sampling the intercept terms and factor loadings (step 3b-c)
In the expanded model, the prior distributions specified in Eqs. (23) and (24) result in the following conditional distributions for each vector of factor loadings Λq and intercept term δq corresponding to manifest variable q = 1, . . . , Q:

\[
\Lambda_q \mid Y_q, \theta, \sigma_q^2, \mu, D \sim
\mathcal{N}\big(B_{\Lambda_q} b_{\Lambda_q};\; B_{\Lambda_q}\big),
\tag{B7}
\]
\[
\delta_q \mid Y_q, \theta, \Lambda_q, \sigma_q^2, \mu, D \sim
\mathcal{N}\big(B_{\delta_q} b_{\delta_q};\; B_{\delta_q}\big),
\tag{B8}
\]

with:

\[
B_{\delta_q}^{-1} = \frac{1}{c_0} + \frac{N}{\sigma_q^2},
\qquad
b_{\delta_q} = \frac{1}{\sigma_q^2}\sum_{i=1}^{N}\big(Y_{qi} - \Lambda_q'\theta_i\big)
- \frac{\Lambda_q'\mu}{c_0},
\]
\[
B_{\Lambda_q}^{-1} = \frac{\theta'\theta}{\sigma_q^2} + \frac{\mu\mu'}{c_0}
+ \frac{D}{d_0} - B_{\delta_q}\, b_q b_q',
\qquad
b_{\Lambda_q} = \frac{1}{\sigma_q^2}\Bigg(\theta' Y_q
- b_q B_{\delta_q}\Big(\sum_{i=1}^{N} Y_{qi}\Big)\Bigg),
\]

where \(b_q = c_0^{-1}\mu + \sigma_q^{-2}\sum_{i=1}^{N}\theta_i\).
B5 Sampling the parameters of the non-empty mixture compo-
nents in the expanded model (step 4)
The conjugate normal-inverse-Wishart prior distribution specified on the mixture parameters in Eqs. (4) and (5) results in the following posterior distribution for the non-empty mixture components:

\[
\Phi_k \mid \theta, G \sim
\mathcal{IW}\Bigg(\nu_0 + N_k;\;
s_0 I_P + \sum_{i\in I_k}\theta_i\theta_i'
- \frac{\big(\sum_{i\in I_k}\theta_i\big)\big(\sum_{i\in I_k}\theta_i\big)'}{N_k + A_0^{-1}}\Bigg),
\tag{B9}
\]
\[
\mu_k \mid \Phi_k, \theta, G \sim
\mathcal{N}\Bigg(\frac{\sum_{i\in I_k}\theta_i}{N_k + A_0^{-1}};\;
\frac{\Phi_k}{N_k + A_0^{-1}}\Bigg),
\tag{B10}
\]

where I_k = {i ∈ I : G_i = k}, with I = {1, . . . , N}, is the set of indices of the observations belonging to mixture group k, and N_k = card(I_k) is the number of observations in mixture group k.
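A sketch of one draw from (B9)–(B10), assuming an integer degrees-of-freedom parameter so the inverse-Wishart draw can be obtained from an outer-product Wishart draw on the inverse scale (the function name is ours):

```python
import numpy as np

def draw_niw_posterior(rng, theta_k, nu0, s0, A0):
    """Draw (mu_k, Phi_k) from the normal-inverse-Wishart conditional
    posterior of a non-empty component (cf. Eqs. B9-B10)."""
    Nk, P = theta_k.shape
    kappa = Nk + 1.0 / A0
    sum_theta = theta_k.sum(axis=0)
    scale = (s0 * np.eye(P) + theta_k.T @ theta_k
             - np.outer(sum_theta, sum_theta) / kappa)
    df = nu0 + Nk  # assumed integer here for the simple Wishart construction
    # If W ~ Wishart(df, scale^{-1}), then W^{-1} ~ IW(df, scale).
    chol = np.linalg.cholesky(np.linalg.inv(scale))
    Z = chol @ rng.standard_normal((P, df))
    Phi_k = np.linalg.inv(Z @ Z.T)
    mu_k = rng.multivariate_normal(sum_theta / kappa, Phi_k / kappa)
    return mu_k, Phi_k

rng = np.random.default_rng(4)
theta_k = rng.standard_normal((50, 2))   # toy factor draws in group k
mu_k, Phi_k = draw_niw_posterior(rng, theta_k, nu0=5, s0=1.0, A0=1.0)
```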
B6 Sampling the mixture group indicators (step 4)
B6.1 Finite mixture case
Each observation i = 1, . . . , N is allocated to mixture group k with probability

\[
p\big(G_i = k \mid \theta_i, p_k, \vartheta_k\big) \propto
p_k\, |\Phi_k|^{-\frac{1}{2}}\,
\phi_P\Big(\Phi_k^{-\frac{1}{2}}(\theta_i - \mu_k)\Big),
\tag{B11}
\]

where φP(·) denotes the probability density function of the P-variate standard normal distribution.
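The allocation probabilities of (B11), i.e. p_k times the normal density N(θ_i | μ_k, Φ_k), can be computed stably on the log scale, as in this sketch (names are ours):

```python
import numpy as np

def allocation_probs(theta_i, weights, means, covs):
    """Posterior allocation probabilities p(G_i = k) ∝ p_k N(theta_i | mu_k, Phi_k),
    normalized on the log scale."""
    logs = []
    for p_k, mu_k, Phi_k in zip(weights, means, covs):
        diff = theta_i - mu_k
        _, logdet = np.linalg.slogdet(Phi_k)
        quad = diff @ np.linalg.inv(Phi_k) @ diff
        logs.append(np.log(p_k) - 0.5 * (logdet + quad))
    logs = np.array(logs)
    logs -= logs.max()              # subtract max before exponentiating
    probs = np.exp(logs)
    return probs / probs.sum()

# A point at the first component's mean is allocated there almost surely.
probs = allocation_probs(
    np.array([0.0, 0.0]),
    weights=[0.5, 0.5],
    means=[np.zeros(2), np.full(2, 5.0)],
    covs=[np.eye(2), np.eye(2)],
)
```

The group indicator G_i is then drawn from a categorical distribution with these probabilities.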
B6.2 Infinite mixture case
In the infinite case, we implement Algorithm 2 of Papaspiliopoulos and Roberts (2008, p. 176) to update the mixture group indicators and to introduce new mixture components on the fly. The parameters of the corresponding new mixture components are sampled from their prior distribution retrospectively as they become required.

More precisely, in Algorithm 3 the probability of assigning an observation i to mixture component l, while the other individuals remain assigned to their respective groups, is proportional to:

\[
q_i(g, l) \propto
\begin{cases}
p_l\, f\big(\theta_i^{(t)} \mid \vartheta_l^{(t)}\big) & \text{if } l \leq k_{\max},\\[4pt]
p_l\, M_i(g) & \text{if } l > k_{\max},
\end{cases}
\tag{B12}
\]

where g ≡ G^{(t−1)}, k_max ≡ max_i{G_i} denotes the last non-empty component of the mixture, and f(· | ϑ) is the probability density function of the multivariate normal distribution parametrized by ϑ. The user-defined function M_i(g) is chosen as M_i(g) = max_{l ≤ k_max}{f(θ_i | ϑ_l)}.16 The normalizing constant c_i(g) of the mixture mass probabilities in Eq. (B12) is equal to:

\[
c_i(g) = \sum_{l=1}^{k_{\max}} p_l\, f\big(\theta_i^{(t)} \mid \vartheta_l^{(t)}\big)
+ M_i(g)\Bigg(1 - \sum_{l=1}^{k_{\max}} p_l\Bigg).
\]

The acceptance probability of the Metropolis-Hastings move in step (iv) of Algorithm 3 is computed as:

\[
\alpha_i\{g, g(i, j)\} =
\begin{cases}
1 & \text{if } \max\{g(i, j)\} = k_{\max},\\[4pt]
\min\Bigg\{1,\;
\dfrac{c_i(g)\, M_i(g(i, j))}{c_i(g(i, j))\, f(\theta_i \mid \vartheta_{g_i})}\Bigg\}
& \text{if } \max\{g(i, j)\} < k_{\max},\\[10pt]
\min\Bigg\{1,\;
\dfrac{c_i(g)\, f(\theta_i \mid \vartheta_j)}{c_i(g(i, j))\, M_i(g)}\Bigg\}
& \text{if } j > k_{\max},
\end{cases}
\tag{B13}
\]

where g(i, j) is identical to the vector g except for its ith element, which is set to j, and g_i denotes the ith element of g. This acceptance probability depends on whether an existing mixture group is proposed (first two cases, when j ≤ k_max), in which case the dimension of the Dirichlet process does not change, or whether a new mixture group is proposed for incorporation into the process (last case).
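The assignment masses of (B12) and the normalizing constant c_i(g) can be sketched as follows, using a univariate normal density as an illustrative stand-in for f(θ_i | ϑ_l) (all names and values are ours):

```python
import numpy as np

def npdf(mu):
    """Unit-variance univariate normal density, a stand-in for f(theta_i | vartheta_l)."""
    return lambda x: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def retro_masses(theta_i, weights, dens, kmax):
    """Masses of Eq. (B12): p_l * f(theta_i | vartheta_l) for l <= kmax,
    and p_l * M_i(g) beyond, with M_i(g) the maximum density over the
    non-empty components; also returns the normalizing constant c_i(g)."""
    f_vals = np.array([dens[l](theta_i) for l in range(kmax)])
    M_i = f_vals.max()
    active = np.array(weights[:kmax]) * f_vals
    # Mass beyond kmax collapses to M_i(g) times the leftover stick weight.
    c_i = active.sum() + M_i * (1.0 - sum(weights[:kmax]))
    return active, M_i, c_i

# Two non-empty components; the remaining stick mass 0.2 lies beyond kmax.
active, M_i, c_i = retro_masses(
    0.0, weights=[0.5, 0.3, 0.2], dens=[npdf(0.0), npdf(3.0)], kmax=2)
```

The choice M_i(g) = max_l f(θ_i | ϑ_l) dominates the density of any retrospectively sampled new component, which is what makes the Metropolis-Hastings correction of (B13) valid.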
B7 Sampling the stick-breaking variables (step 4)
Each random variable V_k underlying the stick-breaking process is updated as

\[
V_k \mid G, \alpha \sim
\mathrm{Beta}\Bigg(N_k + 1;\;
\alpha + N - \sum_{j=1}^{k} N_j\Bigg),
\tag{B14}
\]

where N_k denotes the number of observations assigned to mixture group k. In the finite mixture case, this is done for k = 1, . . . , K − 1, while V_K = 1. In the infinite mixture case, this conditional distribution collapses to the prior in Eq. (7) for k ≥ k_max.
16. Following Papaspiliopoulos and Roberts (2008, p. 176).
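The stick-breaking update of (B14) for the truncated (finite) case, together with the implied mixture weights p_k = V_k ∏_{j<k}(1 − V_j), can be sketched as follows (names are ours):

```python
import numpy as np

def update_sticks(rng, counts, alpha):
    """Draw V_k | G, alpha ~ Beta(N_k + 1, alpha + N - sum_{j<=k} N_j)
    and return the implied weights p_k = V_k * prod_{j<k}(1 - V_j)."""
    K = len(counts)
    N = counts.sum()
    cum = np.cumsum(counts)
    V = np.empty(K)
    for k in range(K - 1):
        V[k] = rng.beta(counts[k] + 1, alpha + N - cum[k])
    V[K - 1] = 1.0          # truncated case: the last stick takes the rest
    p = np.empty(K)
    remaining = 1.0
    for k in range(K):
        p[k] = V[k] * remaining
        remaining *= 1.0 - V[k]
    return V, p

rng = np.random.default_rng(5)
counts = np.array([40, 30, 20, 10])     # toy group occupation counts
V, p = update_sticks(rng, counts, alpha=1.0)
```

Setting V_K = 1 guarantees that the truncated weights sum to one; in the infinite case the sticks beyond k_max are instead drawn from the Beta(1, α) prior.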
B8 Sampling the concentration parameter α (step 5)
Following Escobar and West (1995), the Gamma prior distribution specified on α in Eq. (19) results in a posterior that is a mixture of two Gamma distributions:

\[
\eta \mid \alpha, K^{+} \sim \mathrm{Beta}(\alpha + 1;\; N),
\]
\[
\alpha \mid \eta, K^{+} \sim
\pi_\eta\, \mathcal{G}\big(g_0 + K^{+};\; h_0 - \log(\eta)\big)
+ (1 - \pi_\eta)\, \mathcal{G}\big(g_0 + K^{+} - 1;\; h_0 - \log(\eta)\big),
\tag{B15}
\]

with π_η/(1 − π_η) = (g_0 + K^+ − 1)/(N(h_0 − log(η))), and where K^+ denotes the number of non-empty mixture components.
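The two-step Escobar-West update can be sketched as follows (the function name is ours; numpy's Gamma sampler takes a scale parameter, i.e. the reciprocal of the rate):

```python
import numpy as np

def update_alpha(rng, alpha, n_comp, N, g0, h0):
    """Escobar-West (1995) update of the DP concentration parameter:
    draw the auxiliary eta | alpha, then alpha from a two-component
    mixture of Gamma distributions."""
    eta = rng.beta(alpha + 1.0, N)
    odds = (g0 + n_comp - 1.0) / (N * (h0 - np.log(eta)))
    pi_eta = odds / (1.0 + odds)
    shape = g0 + n_comp if rng.uniform() < pi_eta else g0 + n_comp - 1.0
    rate = h0 - np.log(eta)
    return rng.gamma(shape, 1.0 / rate)  # numpy uses scale = 1/rate

# Repeated draws of the conditional, with toy values for N and K+.
rng = np.random.default_rng(6)
draws = np.array([update_alpha(rng, 1.0, n_comp=6, N=2000, g0=2.0, h0=2.0)
                  for _ in range(500)])
```

With few non-empty components relative to N, the draws concentrate on small values of α, consistent with the sparse solutions observed in the empirical example.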
References
Aguilar, O., and M. West. 2000. “Bayesian Dynamic Factor Models and Portfolio Allocation.”
Journal of Business & Economic Statistics 18 (3): 338–357. doi:10.1080/07350015.
2000.10524875.
Almlund, M., A. L. Duckworth, J. J. Heckman, and T. Kautz. 2011. “Personality Psychology and Economics.” Chap. 1 in Handbook of the Economics of Education, edited by E. A. Hanushek, S. Machin, and L. Woessmann, 4:1–181. North-Holland, Elsevier. doi:10.1016/B978-0-444-53444-6.00001-8.
Anderson, T. W., and H. Rubin. 1956. “Statistical Inference in Factor Analysis.” Chap. 3 in
Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probabil-
ity, edited by J. Neyman, 5:111–150. Berkeley: University of California Press.
Antoniak, C. E. 1974. “Mixtures of Dirichlet Processes with Applications to Bayesian Non-
parametric Problems.” The Annals of Statistics 2 (6): 1152–1174. doi:10.1214/aos/
1176342871.
Attias, H. 1999. “Independent Factor Analysis.” Neural Computation 11 (4): 803–51. doi:10.
1162/089976699300016458.
Bernanke, B. S., J. Boivin, and P. Eliasz. 2005. “Measuring the Effects of Monetary Policy: A
Factor-Augmented Vector Autoregressive (FAVAR) Approach.” The Quarterly Journal
of Economics 120 (1): 387–422. doi:10.1162/0033553053327452.
Bhattacharya, A., and D. B. Dunson. 2011. “Sparse Bayesian Infinite Factor Models.”
Biometrika 98 (2): 291–306. doi:10.1093/biomet/asr013.
Carneiro, P., K. T. Hansen, and J. J. Heckman. 2003. “Estimating Distributions of Treatment
Effects with an Application to the Returns to Schooling and Measurement of the Effects
of Uncertainty on College Choice.” International Economic Review 44 (2): 361–422.
doi:10.1111/1468-2354.t01-1-00074.
Carvalho, C. M., J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West. 2008. “High-
Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics.” Jour-
nal of the American Statistical Association 103 (484): 1438–1456. doi:10.1198/016214508000000869.
Conti, G., S. Frühwirth-Schnatter, J. J. Heckman, and R. Piatek. 2014. “Bayesian Exploratory Factor Analysis.” Journal of Econometrics 183 (1): 31–57. doi:10.1016/j.jeconom.2014.06.008.
Cunha, F., and J. J. Heckman. 2008. “Formulating, Identifying and Estimating the Technol-
ogy of Cognitive and Noncognitive Skill Formation.” Journal of Human Resources 43
(4): 738–782. doi:10.1353/jhr.2008.0019.
Cunha, F., J. J. Heckman, and S. M. Schennach. 2010. “Estimating the Technology of Cog-
nitive and Noncognitive Skill Formation.” Econometrica 78 (3): 883–931. doi:10.3982/
ECTA6551.
Escobar, M. D., and M. West. 1995. “Bayesian Density Estimation and Inference Using
Mixtures.” Journal of the American Statistical Association 90 (430): 577–588. doi:10.
2307/2291069.
Fokoué, E., and D. M. Titterington. 2003. “Mixtures of Factor Analysers. Bayesian Estimation and Inference by Stochastic Simulation.” Machine Learning 50:73–94. doi:10.1023/A:1020297828025.
Forni, M., and L. Gambetti. 2010. “The Dynamic Effects of Monetary Policy: A Structural
Factor Model Approach.” Journal of Monetary Economics 57 (2): 203–216. doi:10.
1016/j.jmoneco.2009.11.009.
Frühwirth-Schnatter, S., and H. F. Lopes. 2010. “Parsimonious Bayesian Factor Analysis when the Number of Factors is Unknown.” Working Paper, The University of Chicago Booth School of Business.
Geweke, J. F. 1989. “Bayesian Inference in Econometric Models Using Monte Carlo Integra-
tion.” Econometrica 57 (6): 1317–1339. doi:10.2307/1913710.
Geweke, J. F., and G. Zhou. 1996. “Measuring the Pricing Error of the Arbitrage Pricing
Theory.” Review of Financial Studies 9 (2): 557–587. doi:10.1093/rfs/9.2.557.
Ghosh, J., and D. B. Dunson. 2009. “Default Prior Distributions and Efficient Posterior
Computation in Bayesian Factor Analysis.” Journal Of Computational And Graphical
Statistics 18 (2): 306–320. doi:10.1198/jcgs.2009.07145.
Green, P. J., and S. Richardson. 2001. “Modelling Heterogeneity With and Without the Dirichlet Process.” Scandinavian Journal of Statistics 28 (2): 355–375. doi:10.1111/1467-9469.00242.
Hansen, K. T., J. J. Heckman, and K. J. Mullen. 2004. “The Effect of Schooling and Ability
on Achievement Test Scores.” Journal of Econometrics 121 (1-2): 39–98. doi:10.1016/
j.jeconom.2003.10.011.
Heckman, J. J., J. Stixrud, and S. Urzua. 2006. “The Effects of Cognitive and Noncognitive
Abilities on Labor Market Outcomes and Social Behavior.” Journal of Labor Economics
24 (3): 411–482. doi:10.1086/504455.
Imai, K., and D. A. van Dyk. 2005. “A Bayesian Analysis of the Multinomial Probit Model
using Marginal Data Augmentation.” Journal of Econometrics 124 (2): 311–334. doi:10.
1016/j.jeconom.2004.02.002.
Ishwaran, H., and L. F. James. 2001. “Gibbs Sampling Methods for Stick-Breaking Pri-
ors.” Journal of the American Statistical Association 96 (453): 161–173. doi:10.1198/
016214501750332758.
———. 2002. “Approximate Dirichlet Process Computing in Finite Normal Mixtures.” Journal of Computational and Graphical Statistics 11 (3): 508–532. doi:10.1198/106186002411.
Jiao, X., and D. A. van Dyk. 2015. “A Corrected and More Efficient Suite of MCMC Samplers for the Multinomial Probit Model.” Working Paper: 1–20. arXiv: 1504.07823.
Koopmans, T. C., and O. Reiersøl. 1950. “The Identification of Structural Characteristics.”
The Annals of Mathematical Statistics 21 (2): 165–181. doi:10.1214/aoms/1177729837.
Lawrence, E., D. Bingham, C. Liu, and V. N. Nair. 2008. “Bayesian Inference for Multivariate
Ordinal Data Using Parameter Expansion.” Technometrics 50 (2): 182–191. doi:10.
1198/004017008000000064.
Liu, C., D. B. Rubin, and Y. N. Wu. 1998. “Parameter Expansion to Accelerate EM: The PX-EM Algorithm.” Biometrika 85 (4): 755–770. doi:10.1093/biomet/85.4.755.
Liu, J. S., and Y. N. Wu. 1999. “Parameter Expansion for Data Augmentation.” Journal
of the American Statistical Association 94 (448): 1264–1274. doi:10.1080/01621459.
1999.10473879.
Liu, X. 2008. “Parameter Expansion for Sampling a Correlation Matrix: An Efficient GPX-
RPMH Algorithm.” Journal of Statistical Computation and Simulation 78 (11): 1065–
1076. doi:10.1080/00949650701519635.
Liu, X., and M. J. Daniels. 2006. “A New Algorithm for Simulating a Correlation Matrix
Based on Parameter Expansion and Reparameterization.” Journal of Computational
and Graphical Statistics 15 (4): 897–914. doi:10.1198/106186006X160681.
Lopes, H. F., and M. West. 2004. “Bayesian Model Assessment in Factor Analysis.” Statistica
Sinica 14:41–67.
Lucas, J. E., C. M. Carvalho, Q. Wang, A. Bild, J. Nevins, and M. West. 2006. “Sparse
Statistical Modelling in Gene Expression Genomics.” In Bayesian Inference for Gene
Expression and Proteomics, edited by K. A. Do, P. Müller, and M. Vannucci, 155–176.
Cambridge University Press.
McLachlan, G. J., and D. Peel. 2000. “Mixtures of Factor Analyzers.” Chap. 8 in Finite
Mixture Models, 238–256. John Wiley & Sons, Inc. doi:10.1002/0471721182.ch8.
McLachlan, G. J., D. Peel, and R. W. Bean. 2003. “Modelling High-Dimensional Data by
Mixtures of Factor Analyzers.” Computational Statistics and Data Analysis 41 (3-4):
379–388. doi:10.1016/S0167-9473(02)00183-4.
Meng, X.-L., and D. A. van Dyk. 1997. “The EM Algorithm — An Old Folk-Song Sung to a Fast New Tune (with Discussion).” Journal of the Royal Statistical Society, Series B 59 (3): 511–567. doi:10.1111/1467-9868.00082.
———. 1999. “Seeking Efficient Data Augmentation Schemes via Conditional and Marginal Augmentation.” Biometrika 86 (2): 301–320. doi:10.1093/biomet/86.2.301.
Neal, R. M. 2000. “Markov Chain Sampling Methods for Dirichlet Process Mixture Mod-
els.” Journal of Computational and Graphical Statistics 9 (2): 249–265. doi:10.1080/
10618600.2000.10474879.
Paisley, J., and L. Carin. 2009. “Nonparametric Factor Analysis with Beta Process Priors.” In
Proceedings of the 26th International Conference on Machine Learning, 1–8. Montreal,
Canada: ACM Press. doi:10.1145/1553374.1553474.
Papaspiliopoulos, O., and G. O. Roberts. 2008. “Retrospective Markov Chain Monte Carlo
Methods for Dirichlet Process Hierarchical Models.” Biometrika 95 (1): 169–186. doi:10.
1093/biomet/asm086.
Piatek, R., and P. Pinger. 2016. “Maintaining (Locus of) Control? Data Combination for
the Identification and Inference of Factor Structure Models.” Journal of Applied Econo-
metrics 31 (4): 734–755. doi:10.1002/jae.2456.
Quintana, F. A., and P. Müller. 2004. “Nonparametric Bayesian Data Analysis.” Statistical Science 19 (1): 95–110. doi:10.1214/088342304000000017.
R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna,
Austria.
Reiersøl, O. 1950. “On the Identifiability of Parameters in Thurstone’s Multiple Factor Anal-
ysis.” Psychometrika 15 (2): 121–149. doi:10.1007/BF02289197.
Scott, S. L. 2011. “Data Augmentation, Frequentist Estimation, and the Bayesian Analysis
of Multinomial Logit Models.” Statistical Papers 52 (1): 87–109. doi:10.1007/s00362-
009-0205-0.
Sethuraman, J. 1994. “A Constructive Definition of Dirichlet Priors.” Statistica Sinica 4 (2):
639–650.
Sokal, A. D. 1997. “Monte Carlo Methods in Statistical Mechanics: Foundations and New
Algorithms.” In Functional Integration (Cargese, 1996), edited by C. Dewitt-Morette
and A. Folacci, 361:131–192. Nato Science Series B. Springer US. doi:10.1007/978-1-
4899-0319-8.
Thurstone, L. L. 1934. “The Vectors of Mind.” Psychological Review 41 (1): 1–32. doi:10.1037/h0075959.
Tukey, J. W. 1977. Exploratory Data Analysis. Pearson.
Uysal, D. S. 2015. “Doubly Robust Estimation of Causal Effects with Multivalued Treat-
ments: An Application to the Returns to Schooling.” Journal of Applied Econometrics
30:763–786. doi:10.1002/jae.2386.
Van Dyk, D. A. 2010. “Marginal Markov Chain Monte Carlo Methods.” Statistica Sinica 20
(4): 1423–1454.
Van Dyk, D. A., and X.-L. Meng. 2001. “The Art of Data Augmentation.” Journal of Com-
putational and Graphical Statistics 10 (1): 1–50. doi:10.1198/10618600152418584.
Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Williams, B. D. 2017. “Identification of the Linear Factor Model.” Working Paper.
Yang, M., D. B. Dunson, and D. Baird. 2010. “Semiparametric Bayes Hierarchical Models
with Mean and Variance Constraints.” Computational Statistics & Data Analysis 54
(9): 2172–2186. doi:10.1016/j.csda.2010.03.025.
Yau, C., O. Papaspiliopoulos, G. O. Roberts, and C. C. Holmes. 2011. “Bayesian Non-
Parametric Hidden Markov Models with Applications in Genomics.” Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 73 (1): 37–57. doi:10.1111/
j.1467-9868.2010.00756.x.
Zhang, X., W. J. Boscardin, and T. R. Belin. 2006. “Sampling Correlation Matrices in
Bayesian Models With Correlated Latent Variables.” Journal of Computational and
Graphical Statistics 15 (4): 880–896. doi:10.1198/106186006X160050.