
Distributed, partially collapsed MCMC for Bayesian nonparametrics

Avinava Dubey∗ (Carnegie Mellon Univ.)    Michael M. Zhang∗ (Princeton University)    Eric P. Xing (Carnegie Mellon Univ.)    Sinead A. Williamson (Univ. of Texas Austin)

Abstract

Bayesian nonparametric (BNP) models provide elegant methods for discovering underlying latent features within a data set, but inference in such models can be slow. We exploit the fact that completely random measures, in terms of which commonly used models like the Dirichlet process and the beta-Bernoulli process can be expressed, are decomposable into independent sub-measures. We use this decomposition to partition the latent measure into a finite measure containing only instantiated components, and an infinite measure containing all other components. We then select different inference algorithms for the two components: uncollapsed samplers mix well on the finite measure, while collapsed samplers mix well on the infinite, sparsely occupied tail. The resulting hybrid algorithm can be applied to a wide class of models, and can be easily distributed to allow scalable inference without sacrificing asymptotic convergence guarantees.

1 INTRODUCTION

Bayesian nonparametric (BNP) models are a flexible class of models whose complexity adapts to the data under consideration. BNP models place priors on infinite-dimensional objects, such as partitions with infinitely many blocks; matrices with infinitely many columns; or discrete measures with infinitely many atoms. A finite set of observations is assumed to be generated from a finite, but random, subset of these components, allowing flexibility in the underlying dimensionality and providing the ability to incorporate previously unseen properties as our dataset grows.

∗ denotes equal contribution

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

While the flexibility of these models is a good fit for large, complex data sets, distributing existing inference algorithms across multiple machines is challenging. If we explicitly represent subsets of the underlying infinite-dimensional object, for example using a slice sampler, we can face high memory requirements and slow convergence. Conversely, if we integrate out the infinite-dimensional object, we run into problems due to induced global dependencies.

Moreover, a key goal of distributed algorithms is to minimize communication between agents. This can be achieved by breaking the algorithm into sub-algorithms that can be run independently on different agents. In practice, we usually cannot split an MCMC sampler on a Bayesian hierarchical model into entirely independent sub-algorithms, since there are typically some global dependencies implied by the hierarchy. Instead, we make use of conditional independencies to temporarily partition our algorithm.

Contributions: In this paper, we propose a distributable sampler for models derived from completely random measures, which unifies exact parallel inference for a wide class of Bayesian nonparametric priors, including the popularly used Dirichlet process (Ferguson, 1973) and the beta-Bernoulli process (Griffiths and Ghahramani, 2011). After reviewing the appropriate background material, we first introduce general recipes for (non-distributed) partially collapsed samplers appropriate for a wide range of BNP models, focusing on the beta-Bernoulli process and the Dirichlet process as exemplars. We then demonstrate that these methods can be easily extended to a distributed setting. Finally, we provide experimental results for our hybrid and distributed samplers on Dirichlet process and beta-Bernoulli process inference.

2 RELATED WORK

Completely random measures (CRMs, Kingman, 1967) are random measures that assign independent masses to disjoint subsets of a space. For example, the gamma process assigns a gamma-distributed mass to each subset.


Table 1: Comparison of various parallel and distributed inference algorithms proposed for BNPs. Other CRMs include the gamma-Poisson process, the beta-negative binomial process, etc.

Method                            Data size   Exact  Parallel  Distributed  Beta-Bernoulli  Other CRMs  DP  HDP  Pitman-Yor
Smyth et al. 2009                 1M          ✗      ✓         ✓            ✗               ✗           ✓   ✓    ✗
Doshi-Velez and Ghahramani 2009   0.01M       ✓      ✗         ✗            ✓               ✗           ✗   ✗    ✗
Doshi-Velez et al. 2009b          0.1M        ✗      ✓         ✓            ✓               ✗           ✗   ✗    ✗
Williamson et al. 2013            1M          ✓      ✓         ✗            ✗               ✗           ✓   ✓    ✗
Chang and Fisher 2013             0.3M        ✓      ✓         ✗            ✗               ✗           ✓   ✓    ✗
Dubey et al. 2014                 10M         ✓      ✓         ✗            ✗               ✗           ✗   ✗    ✓
Ge et al. 2015                    0.1M        ✓      ✓         ✓            ✗               ✗           ✓   ✓    ✗
Yerebakan and Dundar 2017         0.05M       ✓      ✓         ✓            ✗               ✗           ✓   ✗    ✗
This paper                        1M          ✓      ✓         ✓            ✓               ✓           ✓   ✓    ✓

Other examples include the beta process (Hjort, 1990) and the Poisson process. The distribution of a CRM is completely determined by its Lévy measure, which controls the size and location of atoms.

Many nonparametric distributions can be expressed in terms of CRMs. For example, if we sample $B = \sum_{k=1}^{\infty} \mu_k \delta_{\theta_k}$ from a (homogeneous) beta process, and generate a sequence of subsets $Z_i$ where $\theta_k \in Z_i$ with probability $\mu_k$, then we obtain an exchangeable distribution over sequences of subsets known as the beta-Bernoulli process (Thibaux and Jordan, 2007), which is related to the Indian buffet process (IBP, Ghahramani and Griffiths, 2005). If we sample $G$ from a gamma process on $\Omega$ with base measure $\alpha H$, then $D(\cdot) = G(\cdot)/G(\Omega)$ is distributed according to a Dirichlet process with concentration parameter $\alpha$ and base measure $H$.
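As a concrete illustration of these two constructions, the following sketch (not from the paper; the truncation level, hyperparameter values, and base measure are assumptions made purely for simulation) draws finite approximations of a beta-Bernoulli feature matrix and of a Dirichlet process obtained by normalizing gamma-distributed masses.

import numpy as np

rng = np.random.default_rng(0)
K_trunc, n, alpha = 50, 10, 5.0    # truncation level, observations, mass parameter (assumed values)

# Finite approximation of the beta-Bernoulli process (standard finite IBP model):
# mu_k ~ Beta(alpha / K_trunc, 1), then Z[i, k] ~ Bernoulli(mu_k).
mu = rng.beta(alpha / K_trunc, 1.0, size=K_trunc)
Z = (rng.random((n, K_trunc)) < mu).astype(int)

# Dirichlet process via a normalized gamma process: gamma masses on K_trunc atoms
# drawn from the base measure H (standard normal here), normalized to sum to one.
G = rng.gamma(alpha / K_trunc, 1.0, size=K_trunc)
weights = G / G.sum()              # D(.) = G(.) / G(Omega)
atoms = rng.normal(size=K_trunc)   # atom locations theta_k ~ H
assignments = rng.choice(K_trunc, size=n, p=weights)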

Inference in such models tends to fall into three categories: uncollapsed samplers that alternate between sampling the latent measure and the assignments (Ishwaran and Zarepour, 2002; Paisley and Carin, 2009; Zhou et al., 2009; Walker, 2007; Teh et al., 2007); collapsed samplers where the latent measure is integrated out (Ishwaran and James, 2011; Neal, 2000; Ghahramani and Griffiths, 2005); and optimization-based methods that work with approximating distributions where the parameters are assumed to have a mean-field distribution (Blei and Jordan, 2006; Doshi-Velez et al., 2009a).

Collapsed methods often mix slowly due to the dependency between assignments, while blocked updates mean uncollapsed methods typically have good mixing properties at convergence (Ishwaran and James, 2011). Uncollapsed methods are often slow to incorporate new components, since they typically rely on sampling unoccupied components from the prior. In high dimensions, such components are unlikely to be close to the data. Conversely, collapsed methods can make use of the data when introducing new points, which tends to lead to faster convergence (Neal, 2000).

Other methods incorporate both uncollapsed and collapsed sampling, resulting in a "hybrid", partially collapsed sampler, although such approaches have been restricted to specific models. Doshi-Velez and Ghahramani (2009) introduced a linear-time accelerated Gibbs sampler for conjugate IBPs that effectively marginalizes over the latent factors, while more recently Yerebakan and Dundar (2017) developed a sampler that partially marginalizes the latent random measure for DPs. These methods can be seen as special cases of our hybrid framework (Section 3), but do not generalize to the distributed setting.

Several inference algorithms allow computation to be distributed across multiple machines, although again, such algorithms are specific to a single model. The approximate collapsed algorithm of Smyth et al. (2009) is only developed for Dirichlet process-based models, and lacks asymptotic convergence guarantees. Distributed split-merge methods have been developed for Dirichlet process-based models, but not extended to more general nonparametric models (Chang and Fisher, 2013, 2014). Partition-based algorithms based on properties of CRMs have been developed for Dirichlet process- and Pitman-Yor process-based models (Williamson et al., 2013; Dubey et al., 2014), but it is unclear how to extend them to other model families. A low-communication, distributed-memory slice sampler has been developed for the Dirichlet process, but since it is based on an uncollapsed method it will tend to perform poorly in high dimensions (Ge et al., 2015). Doshi-Velez et al. (2009b) developed an approximate distributed inference algorithm for the Indian buffet process which is superficially similar to our distributed beta-Bernoulli sampler. However, their approach allows all processors to add new features, which will lead to overestimating the number of features. We contrast these methods in Table 1.

3 HYBRID INFERENCE FOR CRM-BASED MODELS

By definition, completely random measures can be decomposed into independent random measures. If the CRM has been transformed in some manner, we can often still decompose the resulting random measure into independent random measures; for example, a normalized random measure can be decomposed as a mixture of normalized random measures. Such representations allow us to decompose our inference algorithms, and use different inference techniques on the constituent measures.

As discussed in Section 2, collapsed and uncollapsed methods both have advantages and disadvantages. Loosely, collapsed methods are good at adding new components and exploring the tail of the distribution, while uncollapsed methods offer better mixing in established clusters and easy parallelization. We make use of the decomposition properties of CRMs to partition our model into two components: one containing (finitely many) components currently associated with multiple observations, and one containing the infinite tail of components.

3.1 Models Constructed Directly From CRMs

Consider a generic hierarchical model,
$$M := \sum_{k=1}^{\infty} \mu_k \delta_{\theta_k} \sim \mathrm{CRM}(\nu(d\mu)H(d\theta)), \qquad Z_{i,k} \sim f(\mu_k), \qquad X_i = \sum_{k=1}^{\infty} g(Z_{i,k}, \theta_k) + \varepsilon_i, \qquad (1)$$
where $\nu$ is a measure on $\mathbb{R}_+$; $H$ is a measure on the space of parameters $\Theta$; $g(\cdot, \cdot)$ is some deterministic transformation such that $g(0, \theta) = 0$ for all $\theta \in \Theta$; $\varepsilon_i$ is a noise term; and $f(\cdot)$ is a likelihood that forms a conjugate pair with $M$, i.e., the posterior distribution $P(M^* \mid Z)$ is a CRM in the same family. The indices $i = 1, \ldots, n$ refer to the observations and the indices $k = 1, 2, \ldots$ refer to the features. This framework includes exchangeable feature allocation models such as the beta-Bernoulli process (Ghahramani and Griffiths, 2005; Thibaux and Jordan, 2007), the infinite gamma-Poisson feature model (Titsias, 2008), and the beta-negative binomial process (Zhou et al., 2012; Broderick et al., 2014). We assume, as is the case in these examples, that both collapsed and uncollapsed posterior inference algorithms can be described. We also assume for simplicity that the prior contains no fixed-location atoms, although this assumption could be relaxed (see Broderick et al., 2018).

Algorithm 1 Hybrid Beta-Bernoulli Sampler
 1: while not converged do
 2:   Select J
 3:   Sample µ_k ∼ Beta(m_k, n − m_k + c), ∀ k ≤ J
 4:   Sample θ_k ∼ p(θ_k | H, Z, X), ∀ k ≤ J
 5:   for i = 1, . . . , N do
 6:     Sample {Z_{i,k}}_{k ≤ J} according to Eq. 4
 7:     Sample {Z_{i,k}}_{k > J} according to Eq. 5
 8:     Metropolis-Hastings sample Z_{i,k'} for k' ∈ {k : Z_{i,k} = 1, Σ_{j≠i} Z_{j,k} = 0}
 9:   end for
10: end while

Lemma 1 (Broderick et al. 2018). If $M \sim \mathrm{CRM}(\nu(d\mu)H(d\theta))$ and $Z_{i,k} \sim f(\mu_k)$ for $i = 1, \ldots, n$, and if $M$ and $f$ form a conjugate pair, then the posterior $P(M^* \mid \{Z_i\})$ can be decomposed into two CRMs, each with known distribution. If $K$ is the number of features for which $\sum_i Z_{i,k} > 0$, the first, $M^*_{\le K} = \sum_{k=1}^{K} \mu^*_k \delta_{\theta_k}$, is a finite measure with fixed-location atoms at the locations $\{\theta_k : \sum_i Z_{i,k} > 0\}$. The distribution over the corresponding weights is proportional to $\nu(d\mu)\prod_{i=1}^{n} f(Z_{i,k} \mid \mu)$. The second, with infinitely many random-location atoms, has Lévy measure $\nu(d\mu)H(d\theta)\,(f(0 \mid \mu))^n$.

Based on Lemma 1, we partition $M^*$ into a finite CRM $M^*_{\le J} = \sum_{k=1}^{J} \mu^*_k \delta_{\theta_k}$, for some $J \le K$, that contains all, or a subset of, the fixed-location atoms; and an infinite CRM $M^*_{>J}$ that contains the remaining atoms. We use an uncollapsed sampler to sample $M^*_{\le J} \mid \{Z_{i,k}\}_{k \le K}$, and then sample $\{Z_{i,k}\}_{k \le J} \mid X, M^*_{\le J}$. Then, we use a collapsed sampler to sample the allocations $\{Z_{i,k} : k > J\}$. The size $J$ should be changed periodically to ensure $J \le K$, so that we avoid explicitly instantiating atoms that are not associated with data. In our experiments, we set $J = K$ at the beginning of each iteration.

Example 1: Beta-Bernoulli Process. As a specific example, consider the beta-Bernoulli process. Let $B := \sum_{k=1}^{\infty} \mu_k \delta_{\theta_k} \sim \mathrm{BetaP}(\alpha, c, H)$ be a homogeneous beta process (Thibaux and Jordan, 2007), and let $Z_{i,k} \sim \mathrm{Bernoulli}(\mu_k)$. The posterior is given by
$$B^* \mid Z \sim \mathrm{BetaP}\left(\frac{c\alpha + \sum_k m_k}{c + n},\; c + n,\; \frac{c\alpha H + \sum_k m_k \delta_{\theta_k}}{c\alpha + \sum_k m_k}\right)$$
where $m_k = \sum_{i=1}^{N} Z_{i,k}$. In this case, the following lemma helps us decompose the posterior distribution:

Lemma 2 (Thibaux and Jordan 2007). If $K$ is the number of features where $\sum_i Z_{i,k} > 0$ and $J \le K$, we can decompose the posterior distribution of the beta-Bernoulli process as $B^* = B^*_{\le J} + B^*_{>J}$, where

$$B^*_{\le J} \sim \mathrm{BetaP}\left(\frac{\sum_{k=1}^{J} m_k}{c + n},\; c + n,\; \frac{\sum_{k \le J} m_k \delta_{\theta_k}}{\sum_{k=1}^{J} m_k}\right), \qquad B^*_{>J} \sim \mathrm{BetaP}\left(\frac{c\alpha}{c + n},\; c + n,\; \frac{c\alpha H + \sum_{k > J} m_k \delta_{\theta_k}}{c\alpha + \sum_{k > J} m_k}\right). \qquad (2)$$

We note that the atom sizes of $B^*_{\le J}$ are $\mathrm{Beta}(m_k, n - m_k + c)$ random variables. This allows us to split the beta-Bernoulli process into two independent feature selection mechanisms: one with a finite number of currently instantiated features, and one with an unbounded number of features.
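To make this split concrete, the short sketch below (illustrative only; the allocation matrix, the number of candidate features, and the hyperparameter values are assumptions) computes the occupied features from a binary allocation matrix and draws the instantiated atom weights from their Beta(m_k, n − m_k + c) conditionals, leaving the tail to the collapsed updates described next.

import numpy as np

rng = np.random.default_rng(1)
n, c = 100, 1.0                                   # observations and beta-process concentration (assumed values)
Z = (rng.random((n, 12)) < 0.3).astype(int)       # stand-in allocation matrix with 12 candidate features

m = Z.sum(axis=0)                                 # per-feature counts m_k
occupied = np.flatnonzero(m > 0)                  # features with at least one allocation
J = occupied.size                                 # set J = K, the number of occupied features

# Uncollapsed part: instantiate weights mu_k ~ Beta(m_k, n - m_k + c) for the J occupied
# features; the tail B*_{>J} is never instantiated and is handled by collapsed updates.
mu_instantiated = rng.beta(m[occupied], n - m[occupied] + c)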

The likelihood of a given data point, given the latent variables and the other data points (written as $P(X_i \mid -)$ for space reasons), can be written as
$$p(X_i \mid -) = \int p(X_i \mid Z_i, \theta_1, \ldots, \theta_J, \ldots, \theta_K)\, p(\theta_{J+1}, \ldots, \theta_K \mid Z_{-i}, X_{-i})\, d\theta_{J+1} \cdots d\theta_K. \qquad (3)$$
When working with conjugate likelihoods, we can typically evaluate the integral term in Equation 3 analytically (see Ghahramani and Griffiths (2005) for the case of a Gaussian prior and linear-Gaussian likelihood with $J = 0$). If this is not possible, we can sample $\theta_k$ for $J < k \le K$ as auxiliary variables (Doshi-Velez and Ghahramani, 2009).

This formulation of the likelihood, combined with the partitioning of the Bernoulli process described in Equation 2, gives us the hybrid sampler, which we summarize in Algorithm 1. For each data point $X_i$, we sample $Z_{i,k}$ in a three-step manner. For $k \le J$,
$$P(Z_{i,k} = z \mid -) \propto \begin{cases} \mu_k\, p(X_i \mid Z_{i,k} = 1, -) & z = 1 \\ (1 - \mu_k)\, p(X_i \mid Z_{i,k} = 0, -) & z = 0, \end{cases} \qquad (4)$$
where $p(X_i \mid Z_{i,k} = 1, -)$ is given by Equation 3 with $Z_{i,k}$ set to 1. For $k > J$ and $m_k > 0$, we have
$$P(Z_{i,k} = z \mid Z_{-(i,k)}, X_i) \propto \begin{cases} m_k\, p(X_i \mid Z_{i,k} = 1, -) & z = 1 \\ (n - m_k)\, p(X_i \mid Z_{i,k} = 0, -) & z = 0, \end{cases} \qquad (5)$$
where $p(X_i \mid Z_{i,k} = 0, -)$ is given by Equation 3 with $Z_{i,k}$ set to 0. Finally, we propose adding $\mathrm{Poisson}(\alpha/n)$ new features, accepting using a Metropolis-Hastings step. Once we have sampled the $Z_{i,k}$, for every instantiated feature $k \le J$ we sample $\mu_k \sim \mathrm{Beta}(m_k, n - m_k + c)$ and its corresponding parameter $\theta_k \sim p(\theta_k \mid H, Z, X)$.
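As an illustration of Equations 4 and 5, the sketch below computes the two-point conditional for a single Z_{i,k} under a user-supplied log-likelihood; the log-likelihood callables and their arguments are assumptions of ours, not the paper's interface.

import numpy as np

def sample_zik_instantiated(rng, mu_k, loglik_if_one, loglik_if_zero):
    """Eq. 4: instantiated features k <= J use the sampled weight mu_k."""
    logp1 = np.log(mu_k) + loglik_if_one
    logp0 = np.log1p(-mu_k) + loglik_if_zero
    p1 = 1.0 / (1.0 + np.exp(logp0 - logp1))   # normalize the two unnormalized log-probabilities
    return int(rng.random() < p1)

def sample_zik_tail(rng, m_k, n, loglik_if_one, loglik_if_zero):
    """Eq. 5: previously-seen tail features k > J use the collapsed counts m_k."""
    logp1 = np.log(m_k) + loglik_if_one
    logp0 = np.log(n - m_k) + loglik_if_zero
    p1 = 1.0 / (1.0 + np.exp(logp0 - logp1))
    return int(rng.random() < p1)

A full sweep would call these for every (i, k), followed by the Metropolis-Hastings proposal of Poisson(α/n) new features described above.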

We note that similar algorithms can easily be derived for other nonparametric latent feature models, such as those based on the infinite gamma-Poisson process (Thibaux and Jordan, 2007) and the beta-negative binomial process (Zhou et al., 2012; Broderick et al., 2018).

Algorithm 2 Hybrid DPMM Sampler
1: while not converged do
2:   Select J
3:   Sample B* ∼ Beta(n̄, n − n̄ + α)
4:   Sample (π_1, . . . , π_J) ∼ Dir(m_1, . . . , m_J)
5:   For k ≤ J, sample θ_k ∼ p(θ_k | H, {X_i : Z_i = k})
6:   Sample Z_i, ∀ i = 1, . . . , n, using Equation 7
7: end while

3.2 Models Based on Transformations of Random Measures

While applying transformations to CRMs means the posterior is no longer directly decomposable, in some cases we can still apply the general ideas above.

Example 2: Dirichlet Process. As noted in Section 2, the Dirichlet process with concentration parameter $\alpha$ and base measure $H$ can be constructed by normalizing a gamma process with base measure $\alpha H$. If the Dirichlet process is used as the prior in a mixture model (DPMM), the posterior distribution conditioned on the cluster allocations $Z_1, \ldots, Z_n$, having $K$ unique clusters, is again a Dirichlet process:
$$D^* \mid Z_1, \ldots, Z_n \sim \mathrm{DP}\left(\alpha + n,\; \frac{\alpha H + \sum_{k \le K} m_k \delta_{\theta_k}}{n + \alpha}\right) \qquad (6)$$
where $m_k = \sum_i \delta(Z_i, k)$ and $K$ is the number of clusters with $m_k > 0$. In this case too, the following lemma helps us decompose the posterior.

Lemma 3. Assuming $J \le K$, and $\bar{n} = \sum_{k \le J} m_k$, we can decompose the posterior of the DP as $D^* \mid Z_1, \ldots, Z_n = B^* D^*_{\le J} + (1 - B^*) D^*_{>J}$, where
$$D^*_{\le J} \mid Z_1, \ldots, Z_n \sim \mathrm{DP}\left(\bar{n},\; \frac{\sum_{k \le J} m_k \delta_{\theta_k}}{\bar{n}}\right), \qquad D^*_{>J} \mid Z_1, \ldots, Z_n \sim \mathrm{DP}\left(\alpha + n - \bar{n},\; \frac{\alpha H + \sum_{k > J} m_k \delta_{\theta_k}}{\alpha + n - \bar{n}}\right), \qquad B^* \sim \mathrm{Beta}(\bar{n},\; n - \bar{n} + \alpha).$$

Proof. This is a direct extension of the fact that the Dirichlet process has Dirichlet-distributed marginals (Ferguson, 1973). See Chapter 3 of Ghosh and Ramamoorthi (2003) for a detailed analysis.

We note that the posterior atom weights $(\pi_1, \ldots, \pi_J)$ for the finite component are distributed according to $\mathrm{Dirichlet}(m_1, \ldots, m_J)$, and can easily be sampled as part of an uncollapsed sampler. Conditioned on $\{\pi_k, \theta_k : k \le J\}$ and $B^*$, we can sample the cluster allocation $Z_i$ of point $X_i$ as
$$P(Z_i = k \mid -) \propto \begin{cases} B^* \pi_k f(X_i; \theta_k) & k \le J \\ (1 - B^*) \frac{m_k}{\sum_{j > J} m_j + \alpha} f_k(X_i) & J < k \le K \\ (1 - B^*) \frac{\alpha}{\sum_{j > J} m_j + \alpha} f_H(X_i) & k = K + 1 \end{cases} \qquad (7)$$
where $f(X_i; \theta_k)$ is the likelihood for each mixing component; $f_k(X_i) = \int_\Theta f(X_i; \theta)\, p(\theta \mid \{X_j : Z_j = k, j \ne i\})\, d\theta$ is the conditional probability of $X_i$ given the other members of the $k$th cluster; and $f_H(X_i) = \int_\Theta f(X_i; \theta)\, H(d\theta)$. This procedure is summarized in Algorithm 2.
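A sketch of the allocation step in Equation 7 might look as follows (illustrative only; the marginal-likelihood helpers f_k and f_H are hypothetical stand-ins for the model's conjugate predictive densities).

import numpy as np

def sample_allocation(rng, x_i, B, pi, theta, m_tail, alpha, f, f_k, f_H):
    """Eq. 7: allocation of x_i under the hybrid DPMM sampler (sketch).
    pi, theta   -- weights/locations of the J instantiated clusters (numpy array / list)
    m_tail      -- numpy array of counts m_k for the tail clusters J < k <= K
    f, f_k, f_H -- likelihood, tail-cluster predictive, prior predictive (assumed callables)
    """
    J, K_tail = len(pi), len(m_tail)
    probs = np.empty(J + K_tail + 1)
    probs[:J] = B * pi * np.array([f(x_i, th) for th in theta])                   # k <= J
    denom = m_tail.sum() + alpha
    probs[J:J + K_tail] = (1 - B) * m_tail / denom * \
        np.array([f_k(x_i, k) for k in range(K_tail)])                            # J < k <= K
    probs[-1] = (1 - B) * alpha / denom * f_H(x_i)                                # new cluster
    probs /= probs.sum()
    return rng.choice(J + K_tail + 1, p=probs)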

Example 3: Pitman-Yor processes. The Pitman-Yor process (Perman et al., 1992; Pitman and Yor, 1997) is a distribution over probability measures, parameterized by $0 \le \sigma < 1$ and $\alpha > -\sigma$, that is obtained from a $\sigma$-stable CRM via a change of measure and normalization. Provided $\alpha > 0$, it can also be represented as a Dirichlet process mixture of normalized $\sigma$-stable CRMs (Lemma 22, Pitman and Yor, 1997). This representation allows us to decompose the posterior distribution into a beta mixture of a finite-dimensional Pitman-Yor process and an infinite-dimensional Pitman-Yor process. We provide more details in Section A of the supplement.

Example 4: Hierarchical Dirichlet processes. We can decompose the hierarchical Dirichlet process (HDP, Teh et al., 2006) in a manner comparable to the Dirichlet process, allowing our hybrid sampler to be used on the HDP. For space reasons, we defer discussion to Section B of the supplement.

4 DISTRIBUTED INFERENCE FOR CRM-BASED MODELS

The sampling algorithms in Section 3 can easily be adapted to a distributed setting, where data are partitioned across P machines and communication is limited. In this setting, we set J = K (i.e., the number of currently instantiated features) after every communication step. We instantiate the finite measure ($M^*_{\le J}$ in the case of CRMs, $D^*_{\le J}$ for DPMMs), with globally shared atom sizes and locations, on all processors.

We then randomly select one of the $P$ processors by sampling $P^* \sim \mathrm{Uniform}(1, \ldots, P)$. On the remaining $P - 1$ processors, we sample the allocations $Z_i$ using restricted Gibbs sampling (Neal, 2000), enforcing $Z_{i,k} = 0$ for $k > J$. When working with DPMMs, this means that we only need to calculate cluster probabilities that depend on the instantiated measure $D^*_{\le J}$. In the CRM case, it means that we avoid the integral in Equation 3, and hence avoid any dependence on $Z_{-i}$ or $X_{-i}$ when conditioning on $M^*_{\le J}$. In both cases, this means that we do not need knowledge of the feature allocations on other processors, and can sample the $Z_i$ independently of each other.

Algorithm 3 Distributed Beta-Bernoulli Sampler
 1: procedure Local(X_i, Z_i, P)
      ▷ Global variables: J, P*, {θ_k, µ_k}_{k=1}^J, n
 2:   for k ≤ J do
 3:     Sample Z_{i,k} according to (4)
 4:   end for
 5:   if P = P* then
 6:     For k > J, sample Z_{i,k} according to (5)
 7:     Sample Poisson(α/n) new features
 8:   end if
 9: end procedure
10: procedure Global(X_i, Z_i)
11:   Gather feature counts m_k and parameter sufficient statistics Ψ_k from all processors.
12:   Let J be the number of instantiated features.
13:   For k : m_k > 0, sample µ_k ∼ Beta(m_k, n − m_k + c) and θ_k ∼ p(θ_k | Ψ_k, H)
14:   Sample P* ∼ Uniform(1, . . . , P)
15: end procedure

On the remaining processor $P^*$, we sample the $Z_i$ using unrestricted Gibbs sampling. Let $\mathcal{P}^*$ be the set of indices of data points on processor $P^*$. Since we know that $Z_{j,k} = 0$ for all $j \notin \mathcal{P}^*$, we can calculate the full sufficient statistics for features $k > J$ without knowledge of the data or latent features on other processors. While $P(Z_i \mid -)$ does depend on $\{Z_j, X_j : j \in \mathcal{P}^*\}$, it is independent of $\{Z_j, X_j : j \notin \mathcal{P}^*\}$ conditioned on $D^*_{\le J}$ (or $M^*_{\le J}$) plus the fact that $Z_{j,k} = 0$ for all $j \notin \mathcal{P}^*$, so it can be calculated without further information from the other processors.

At each global step, we gather the sufficient statistics from all instantiated clusters, from both the finite component $M^*_{\le J}$ / $D^*_{\le J}$ and the infinite component $M^*_{>J}$ / $D^*_{>J}$, and sample parameters for those clusters. We then create a new partition, redefining $J$ as the current number of instantiated component parameters. In the case of the DPMM, we also resample $B^* \sim \mathrm{Beta}(n, \alpha)$. We summarize the distributed algorithm for the special cases of the beta-Bernoulli process and the DPMM in Algorithms 3 and 4, and for the PYMM in Algorithm 6 in the supplement.
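A sketch of the local step on each worker (restricted Gibbs on the P − 1 ordinary processors, unrestricted Gibbs on the selected processor P*) is shown below; the function names, the tail_state object, and the data layout are assumptions of ours, not the released implementation.

import numpy as np

def local_step_dpmm(rng, X_local, B, pi, theta, is_selected, tail_state, alpha, f, f_k, f_H):
    """One local sweep of the distributed DPMM sampler (illustrative sketch)."""
    z = []
    for x in X_local:
        if is_selected:
            # Unrestricted Gibbs on processor P*: tail clusters and new clusters are
            # reachable, reusing the sample_allocation sketch above (Eq. 7).
            k = sample_allocation(rng, x, B, pi, theta, tail_state.counts, alpha, f, f_k, f_H)
            tail_state.update(k, x)          # hypothetical bookkeeping of tail statistics, local to P*
        else:
            # Restricted Gibbs elsewhere: only the J instantiated clusters are available,
            # so only the globally shared (B, pi, theta) are needed.
            probs = pi * np.array([f(x, th) for th in theta])
            k = rng.choice(len(pi), p=probs / probs.sum())
        z.append(k)
    return z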

4.1 Warm-start Heuristics

In our distributed approach, only 1/P of the data points are eligible to start a new cluster or feature, meaning that the rate at which new clusters or features are added will decrease with the number of processors.


Algorithm 4 Distributed DPMM Sampler
 1: procedure Local(X_i, Z_i)
      ▷ Global variables: J, P*, {θ_k, π_k}_{k=1}^J, B*
 2:   if P = P* then
 3:     Sample Z_i according to (7)
 4:   else
 5:     P(Z_i = k) ∝ π_k f(X_i; θ_k)
 6:   end if
 7: end procedure
 8: procedure Global(X_i, Z_i)
 9:   Gather cluster counts m_k and parameter sufficient statistics Ψ_k from all processors.
10:   Sample B* ∼ Beta(n, α)
11:   Let J be the number of instantiated clusters.
12:   Sample (π_1, . . . , π_J) ∼ Dir(m_1, . . . , m_J)
13:   For k : m_k > 0, sample θ_k ∼ p(θ_k | Ψ_k, H)
14:   Sample P* ∼ Uniform(1, . . . , P)
15: end procedure

This can lead to slow convergence if we start with too few clusters. To avoid this problem, we initialize our algorithm by allowing all processors to instantiate new clusters. At each global step, we decrease the number of randomly selected processors eligible to instantiate new clusters, until we end up with a single processor. This approach tends to over-estimate the number of clusters. However, the procedure acts in a manner similar to simulated annealing, encouraging large moves early in the algorithm but gradually decreasing the excess stochasticity until we are running the correct sampler. We note that a sampler in which multiple processors instantiate new clusters is not a correct sampler; correctness is recovered once we revert to a single processor proposing new features, because an MCMC sampler's stationary distribution is invariant to its starting position.
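One simple way to implement such a warm-start schedule is sketched below (an assumption of ours, not the paper's exact schedule): the number of processors allowed to propose new clusters shrinks at every global step until only one remains.

def num_proposing_processors(global_step, total_global_steps, P, warm_fraction=0.125):
    """Warm-start schedule: start with all P processors proposing new clusters, and
    decay to a single proposer after the warm-up fraction of the run (assumed linear decay)."""
    warm_steps = max(1, int(warm_fraction * total_global_steps))
    if global_step >= warm_steps:
        return 1                                    # exact regime: one proposer only
    frac_remaining = 1.0 - global_step / warm_steps
    return max(1, int(round(P * frac_remaining)))   # shrink toward a single proposer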

5 EXPERIMENTAL EVALUATION

While our primary contribution is the development of distributed algorithms, we first consider, in Section 5.1, the performance of the hybrid algorithms developed in Section 3 in a non-distributed setting. We show that this performance extends to the distributed setting, and offers impressive scaling, in Section 5.2.

5.1 Evaluating the Hybrid Sampler

We begin by considering the performance of the hybrid samplers introduced in Section 3 in a non-distributed setting. For this, we focus on the Dirichlet process, since there exist a number of collapsed and uncollapsed inference algorithms; we expect similar results under other models.

Figure 1: Synthetic data experiments. Comparison of F1 scores over iterations for the collapsed, uncollapsed and hybrid samplers for growing dimensions.

We compare the hybrid sampler of Algorithm 2 with a standard collapsed Gibbs sampler and an uncollapsed sampler based on Algorithm 8 of Neal (2000). Algorithm 8 collapses occupied clusters and instantiates a subset of unoccupied clusters; we modify it to instantiate the atoms associated with unoccupied clusters. Concretely, at each iteration, we sample weights for the $K$ instantiated clusters plus $U$ uninstantiated clusters as $(\pi_1, \ldots, \pi_K, \pi_{K+1}, \ldots, \pi_{K+U}) \sim \mathrm{Dir}\left(m_1, \ldots, m_K, \frac{\alpha}{U}, \ldots, \frac{\alpha}{U}\right)$, and sample locations for the uninstantiated clusters from the base measure $H$. Note that this method can be distributed easily.
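The weight-sampling step of this uncollapsed comparison sampler can be sketched as follows (illustrative; the number of auxiliary clusters U and the base measure used in the usage example are assumptions).

import numpy as np

def sample_uncollapsed_weights(rng, counts, alpha, U, base_sampler):
    """Draw (pi_1,...,pi_{K+U}) ~ Dir(m_1,...,m_K, alpha/U,...,alpha/U) and new locations from H."""
    dir_params = np.concatenate([np.asarray(counts, dtype=float), np.full(U, alpha / U)])
    pi = rng.dirichlet(dir_params)
    new_locations = [base_sampler(rng) for _ in range(U)]   # theta ~ H for uninstantiated clusters
    return pi, new_locations

# Example usage with a standard-normal base measure (assumed for illustration):
# pi, new_locs = sample_uncollapsed_weights(np.random.default_rng(0),
#                                           counts=[40, 35, 25], alpha=1.0, U=3,
#                                           base_sampler=lambda r: r.normal(size=8))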

Figure 1 shows convergence plots for the three algorithms. The data set is a $D$-dimensional synthetic data set consisting of 100 observations from a Gaussian mixture with 2 true mixture components, centered at $5 \times \mathbf{1}_D$ and $-5 \times \mathbf{1}_D$, with identity covariance matrices.
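Such data can be generated along the following lines (a sketch under our reading of the description above; the component balance and random seed are assumptions).

import numpy as np

def make_synthetic(D, n=100, seed=0):
    """Two well-separated Gaussian components at +5*1_D and -5*1_D with identity covariance."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)                               # component assignment (assumed balanced on average)
    means = np.where(labels[:, None] == 0, 5.0, -5.0) * np.ones(D)    # broadcast to (n, D)
    return means + rng.normal(size=(n, D)), labels

X, y = make_synthetic(D=16)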

While the three algorithms perform comparably on low-dimensional data, as the dimension increases the performance of the uncollapsed sampler degrades much more than that of the collapsed sampler. This is because in high dimensions, it is unlikely that a proposed parameter will be near our data, so the associated likelihood of any given data point will be low. This is in contrast to the collapsed setting, where we integrate over all possible locations. While the hybrid method performs worse in high dimensions than the collapsed method, it outperforms the uncollapsed method.

The synthetic data in Figure 1 has fairly low-dimensional structure, so we do not see negative effects due to the poor mixing of the collapsed sampler. Next, we evaluate the algorithms on the CIFAR-100 dataset (Krizhevsky, 2009). We used PCA to reduce the dimension of the data to between 8 and 64, and plot the test-set log likelihood over time in Figure 2. Each marker represents a single iteration.


Figure 2: Test log-likelihood with time on CIFAR-100 dataset as the dimensionality increases from 8 to 64.

We see that the uncollapsed sampler requires more iterations to converge than the collapsed sampler; however, since each of its iterations takes less time, its wall-clock time to convergence is in some cases lower. The hybrid method has an iteration time comparable to the collapsed sampler but, in general, converges faster. We see that, even without taking advantage of parallelization, the hybrid method is a compelling competitor to purely collapsed and purely uncollapsed algorithms.

5.2 Evaluating the Distributed Samplers

Here, we show that the distributed inference algorithms introduced in Section 4 allow inference in BNP models to be scaled to large datasets without sacrificing accuracy. We focus on two cases: the beta-Bernoulli process (Algorithm 3) and the Dirichlet process (Algorithm 4).¹

5.2.1 Beta-Bernoulli Process

We evaluate the beta-Bernoulli sampler on synthetic data based on the "Cambridge" dataset used in the original IBP paper (Griffiths and Ghahramani, 2011), where each data point is the superposition of a randomly selected subset of four binary features of dimension 36, plus Gaussian noise with standard deviation 0.5.² We model this data using a linear-Gaussian likelihood, with $Z \sim \text{Beta-Bern}(\alpha, 1)$, $A_k \sim \mathrm{Normal}(0, \sigma_A^2 I)$, and $X_n \sim \mathrm{Normal}(\sum_k z_{nk} A_k, \sigma_X^2 I)$. We initialized the sampler with a single feature and ran it on 1,000 total observations with a synchronization step every 5 iterations, distributing over 1, 4, 8, 16, 32, 64 and 128 processors.

We first evaluate the hybrid algorithm under a "cold start", where only one processor is allowed to introduce new features for the entire duration of the sampler. In the top left plot of Figure 3, we see that the cold start results in slow convergence of the test-set log likelihood for large numbers of processors. We can see in the bottom left plot of Figure 3 that the number of features grows very slowly, as only one of the P processors is allowed to propose new features in the exact setting.

¹ Our code is available online at https://github.com/michaelzhang01/hybridBNP
² See Figure 6 in the supplement for more details.

Next, we explore warm-start initialization, as described in Section 4.1. For the first one-eighth of the total number of MCMC iterations, all processors can propose new features; after this we revert to the standard algorithm. The top central plot of Figure 3 shows predictive log likelihood over time, and the bottom central plot shows the number of features. We see that convergence is significantly improved relative to the cold-start experiments. Since we revert to the asymptotically correct sampler, the final number of features is generally close to the true number of features, 4.³ Additionally, we see that the convergence rate increases monotonically in the number of processors.

Next, we allowed all processors to propose new features for the entire duration ("always-hot"). This setting approximately replicates the behavior of the parallel IBP sampler of Doshi-Velez et al. (2009b). In the top right plot of Figure 3, we can see that all experiments converge to roughly the same test log likelihood. However, the number of features introduced (bottom right plot) is much greater than in the warm-start experiment, and grows with the number of processors. Moreover, the difference in convergence rates between processor counts is not as dramatic as in the warm-start trials.

Next, we demonstrate the scalability of our distributed algorithm on a massive synthetic example, showing that it can be used for large-scale latent feature models. We generate one million "Cambridge" synthetic data points, as described for the previous experiments, and distribute the data over 256 processors. This represents the largest experiment run for a beta-Bernoulli process algorithm (the next largest being 100,000 data points, in Doshi-Velez et al. 2009b). We limited the sampler to run for one day, in which it completed 860 MCMC iterations. In Figure 4, we see in the test-set log likelihood trace plot that we can converge to a local mode fairly quickly in a massively distributed setting.

³ Note that BNP models are not guaranteed to find the correct number of features in the posterior; see Miller and Harrison (2013).


Figure 3: Top row, left to right: test-set log likelihood (y-axis) on synthetic data without warm-start initialization, with warm start, and with all processors proposing features. The x-axis is CPU wall time in seconds, on a log scale. Bottom row, left to right: number of instantiated features (y-axis) over iterations with cold start, warm start, and all processors introducing features.

Figure 4: Test set log likelihood trace plot for a millionobservation “Cambridge” data set.


5.2.2 Dirichlet Process

Our distributed inference framework can also speed up inference in a DP mixture of Gaussians, using the version described in Algorithm 4. We used a dataset containing the top 64 principal components of the CIFAR-100 dataset, as described in Section 5.1. We compared against two existing distributed inference algorithms for the Dirichlet process mixture model, chosen to represent models based on both uncollapsed and collapsed samplers: 1) a DP variant of the asynchronous sampler of Smyth et al. (2009), an approximate collapsed method; and 2) the distributed slice sampler of Ge et al. (2015), an uncollapsed method.

Figure 5 shows that, when distributed over eight processors, our algorithm converges faster than the two comparison methods, showing that the high-quality performance seen in Section 5.1 extends to the distributed setting. Further, in Figure 5 we see a roughly linear speed-up in convergence as we increase the number of processors from 1 to 8.


Figure 5: Comparison of CIFAR-100 test log-likelihood over time, against baseline methods (left) and with varying numbers of processors (right).

6 CONCLUSION

We have proposed a general inference framework for a wide variety of BNP models. We use the inherent decomposability of the underlying completely random measures to partition the latent random measure into a finite-dimensional component that represents the majority of the data, and an infinite-dimensional component that represents the mostly uninstantiated tail. This allows us to take advantage of the inherent parallelizability of the uncollapsed sampler on the finite partition, and of the better performance of the collapsed sampler in proposing new components. The proposed hybrid inference method can be easily distributed over multiple machines, providing provably correct inference for many BNP models. Experiments show that, for both the DP and the beta-Bernoulli process, our proposed distributed hybrid sampler converges faster than the comparison methods.


References

Blei, D. M. and Jordan, M. I. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143.

Broderick, T., Mackey, L., Paisley, J., and Jordan, M. I. (2014). Combinatorial clustering and the beta negative binomial process. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):290–306.

Broderick, T., Wilson, A. C., and Jordan, M. I. (2018). Posteriors, conjugacy, and exponential families for completely random measures. Bernoulli, 24(4B):3181–3221.

Chang, J. and Fisher, J. W. (2013). Parallel sampling of DP mixture models using sub-cluster splits. In Advances in Neural Information Processing Systems, pages 620–628.

Chang, J. and Fisher, J. W. (2014). Parallel sampling of HDPs using sub-cluster splits. In Advances in Neural Information Processing Systems, pages 235–243.

Doshi-Velez, F. and Ghahramani, Z. (2009). Accelerated sampling for the Indian buffet process. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 273–280. ACM.

Doshi-Velez, F., Miller, K., Van Gael, J., and Teh, Y. W. (2009a). Variational inference for the Indian buffet process. In Artificial Intelligence and Statistics, pages 137–144.

Doshi-Velez, F., Mohamed, S., Ghahramani, Z., and Knowles, D. A. (2009b). Large scale nonparametric Bayesian inference: Data parallelisation in the Indian buffet process. In Advances in Neural Information Processing Systems, pages 1294–1302.

Dubey, A., Williamson, S., and Xing, E. P. (2014). Parallel Markov chain Monte Carlo for Pitman-Yor mixture models. In Uncertainty in Artificial Intelligence, pages 142–151.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.

Ge, H., Chen, Y., Wan, M., and Ghahramani, Z. (2015). Distributed inference for Dirichlet process mixture models. In Proceedings of the 32nd International Conference on Machine Learning, pages 2276–2284.

Ghahramani, Z. and Griffiths, T. L. (2005). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.

Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics. Springer.

Griffiths, T. L. and Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224.

Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, pages 1259–1294.

Ishwaran, H. and James, L. F. (2011). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association.

Ishwaran, H. and Zarepour, M. (2002). Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283.

Kingman, J. F. C. (1967). Completely random measures. Pacific Journal of Mathematics, 21(1):59–78.

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images.

Miller, J. W. and Harrison, M. T. (2013). A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pages 199–206.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.

Paisley, J. and Carin, L. (2009). Nonparametric factor analysis with beta process priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 777–784. ACM.

Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39.

Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme. In Statistics, Probability and Game Theory, pages 245–267. Institute of Mathematical Statistics.

Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25(2):855–900.

Smyth, P., Welling, M., and Asuncion, A. U. (2009). Asynchronous distributed learning of topic models. In Advances in Neural Information Processing Systems, pages 81–88.

Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In International Conference on Artificial Intelligence and Statistics, pages 556–563.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In International Conference on Artificial Intelligence and Statistics, pages 564–571.

Titsias, M. K. (2008). The infinite gamma-Poisson feature model. In Advances in Neural Information Processing Systems, pages 1513–1520.

Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36(1):45–54.

Weyrauch, B., Heisele, B., Huang, J., and Blanz, V. (2004). Component-based face recognition with 3D morphable models. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 85–85. IEEE.

Williamson, S., Dubey, A., and Xing, E. (2013). Parallel Markov chain Monte Carlo for nonparametric mixture models. In Proceedings of the 30th International Conference on Machine Learning, pages 98–106.

Yerebakan, H. Z. and Dundar, M. (2017). Partially collapsed parallel Gibbs sampler for Dirichlet process mixture models. Pattern Recognition Letters, 90:22–27.

Zhou, M., Chen, H., Ren, L., Sapiro, G., Carin, L., and Paisley, J. W. (2009). Non-parametric Bayesian dictionary learning for sparse image representations. In Advances in Neural Information Processing Systems, pages 2295–2303.

Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. In Artificial Intelligence and Statistics, pages 1462–1471.


Distributed, partially collapsed MCMC for Bayesian Nonparametrics: Supplement

A Hybrid Sampler for the Pitman-Yor Mixture Model (PYMM)

In this section, we expand upon Example 3 in Section 3.

Example 3: Pitman-Yor processes. The Pitman-Yor process (Perman et al. (1992); Pitman and Yor (1997)) is a distribution over probability measures, parametrized by a discount parameter $0 \le \sigma < 1$, a concentration parameter $\alpha > -\sigma$, and a base measure $H$. While the Pitman-Yor process is not a normalized CRM, it can be derived from a $\sigma$-stable CRM via a change of measure and normalization. When the discount parameter is zero, we recover the Dirichlet process. As the discount parameter increases, we get increasingly heavy-tailed distributions over the atom sizes in the resulting probability measure.

Lemma 4. If $D \sim \mathrm{PY}(\alpha, \sigma, H)$ with $\alpha > 0$, and $Z_i \sim D$, then the posterior distribution $P(D^* \mid Z_1, \ldots, Z_n)$ is described by
$$\begin{aligned} D^*_{\le J} &\sim \mathrm{PY}\left(\bar{n} - J\sigma,\; 0,\; \frac{\sum_{k \le J}(m_k - \sigma)\delta_{\theta_k}}{\bar{n} - J\sigma}\right) \\ D^*_{>J} &\sim \mathrm{PY}\left(\alpha + n - \bar{n} + J\sigma,\; \sigma,\; \frac{(\alpha + K\sigma)H + \sum_{k > J}(m_k - \sigma)\delta_{\theta_k}}{\alpha + n - \bar{n} + J\sigma}\right) \\ B &\sim \mathrm{Beta}(\bar{n} - J\sigma,\; \alpha + n - \bar{n} + J\sigma) \\ D^* &= B D^*_{\le J} + (1 - B) D^*_{>J} \end{aligned} \qquad (8)$$
where $K$ is the number of occupied clusters, $J \le K$, and $\bar{n} = \sum_{k=1}^{J} m_k$.

Proof. The proof is a direct consequence of Lemma 22 in Pitman and Yor (1997) and Theorem 1 in Dubey et al. (2014). The special case $J = K$ is presented in Corollary 20 of Pitman (1996).

We note that the posterior atom weights $(\pi_1, \ldots, \pi_J)$ for the finite component $D^*_{\le J}$ are distributed according to $\mathrm{Dirichlet}(m_1 - \sigma, \ldots, m_J - \sigma)$, and can easily be sampled as part of an uncollapsed sampler. Conditioned on $\{\pi_k, \theta_k : k \le J\}$ and $B^*$, we can sample the cluster allocation $Z_i$ of point $X_i$ as
$$P(Z_i = k \mid -) \propto \begin{cases} B^* \pi_k f(X_i; \theta_k) & k \le J \\ (1 - B^*) \frac{m_k - \sigma}{n - \bar{n} + \alpha + J\sigma} f_k(X_i) & J < k \le K \\ (1 - B^*) \frac{\alpha + K\sigma}{n - \bar{n} + \alpha + J\sigma} f_H(X_i) & k = K + 1 \end{cases} \qquad (9)$$
where $f(X_i; \theta_k)$ is the likelihood for each mixing component; $f_k(X_i) = \int_\Theta f(X_i; \theta)\, p(\theta \mid \{X_j : Z_j = k, j \ne i\})\, d\theta$ is the conditional probability of $X_i$ given the other members of the $k$th cluster; and $f_H(X_i) = \int_\Theta f(X_i; \theta)\, H(d\theta)$. This procedure is summarized in Algorithm 5.
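Equation 9 can be implemented analogously to the DP case; a sketch follows (illustrative only, with the same hypothetical predictive-density helpers f, f_k and f_H as in the DP sketch of the main text).

import numpy as np

def sample_py_allocation(rng, x_i, B, pi, theta, m_tail, alpha, sigma, n, n_bar, f, f_k, f_H):
    """Eq. 9: allocation of x_i under the hybrid Pitman-Yor mixture sampler (sketch).
    pi, theta -- weights/locations of the J instantiated clusters (numpy array / list)
    m_tail    -- numpy array of counts m_k for the tail clusters J < k <= K
    """
    J, K_tail = len(pi), len(m_tail)
    denom = n - n_bar + alpha + J * sigma
    probs = np.empty(J + K_tail + 1)
    probs[:J] = B * pi * np.array([f(x_i, th) for th in theta])                 # k <= J
    probs[J:J + K_tail] = (1 - B) * (m_tail - sigma) / denom * \
        np.array([f_k(x_i, k) for k in range(K_tail)])                          # J < k <= K
    probs[-1] = (1 - B) * (alpha + (J + K_tail) * sigma) / denom * f_H(x_i)     # new cluster, K = J + K_tail
    probs /= probs.sum()
    return rng.choice(J + K_tail + 1, p=probs)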

We can similarly derive the distributed sampler for the PYMM, shown in Algorithm 6.


Algorithm 5 Hybrid PYMM Sampler
1: while not converged do
2:   Select J
3:   Sample B* ∼ Beta(n̄ − Jσ, n − n̄ + α + Jσ)
4:   Sample (π_1, . . . , π_J) ∼ Dir(m_1 − σ, . . . , m_J − σ)
5:   For k ≤ J, sample θ_k ∼ p(θ_k | H, {X_i : Z_i = k})
6:   For each data point X_i, sample Z_i according to Equation 9
7: end while

Algorithm 6 Distributed PYMM Sampler
 1: procedure Local(X_i, Z_i)
      ▷ Global variables: J, P*, {θ_k, π_k}_{k=1}^J, B*
 2:   if P = P* then
 3:     Sample Z_i according to (9)
 4:   else
 5:     P(Z_i = k) ∝ π_k f(X_i; θ_k)
 6:   end if
 7: end procedure
 8: procedure Global(X_i, Z_i)
 9:   Gather cluster counts m_k and parameter sufficient statistics Ψ_k from all processors.
10:   Let J be the number of instantiated clusters.
11:   Sample B* ∼ Beta(n − Jσ, α + Jσ)
12:   Sample (π_1, . . . , π_J) ∼ Dir(m_1 − σ, . . . , m_J − σ)
13:   For k : m_k > 0, sample θ_k ∼ p(θ_k | Ψ_k, H)
14:   Sample P* ∼ Uniform(1, . . . , P)
15: end procedure

B Hybrid Sampler for Hierarchical Dirichlet Processes

In this section, we expand upon Example 4 in Section 3.

Example 4: Hierarchical Dirichlet Process. Hierarchical Dirichlet processes (HDPs, Teh et al., 2006) extend the DP to model grouped data. The hierarchical Dirichlet process is a distribution over probability distributions $D_s$, $s = 1, \ldots, S$, each of which is conditionally distributed according to a DP. These distributions are coupled using a discrete common base measure $H$, itself distributed according to a DP. Each distribution $D_s$ can be used to model a collection of observations $\mathbf{x}_s := \{x_{si}\}_{i=1}^{N_s}$, where
$$H \sim \mathrm{DP}(\alpha, D_0), \qquad D_s \sim \mathrm{DP}(\gamma, H), \qquad \theta_{si} \sim D_s, \qquad x_{si} \sim f(\theta_{si}), \qquad (10)$$
for $s = 1, \ldots, S$ and $i = 1, \ldots, N_s$.

We consider a Chinese restaurant franchise (CRF, Teh et al., 2006) representation of the HDP, where each data point is represented by a customer, each atom in $D_s$ is represented by a table, and each atom location in the support of $H$ is represented by a dish. Let $x_{si}$ represent the $i$th customer in the $s$th restaurant; let $t_{si}$ be the table assignment of customer $x_{si}$; and let $k_{st}$ be the dish assigned to table $t$ in restaurant $s$. Let $m_{sk}$ denote the number of tables in restaurant $s$ serving dish $k$, and $n_{stk}$ denote the number of customers in restaurant $s$ at table $t$ having dish $k$.

Lemma 5. Conditioned on the table/dish assignment counts $m_{\cdot k} = \sum_s m_{sk}$, the posterior distribution $P(H^* \mid \{t_{si}\}, \{k_{st}\})$ can be written as
$$H^* = B^* H^*_{\le J} + (1 - B^*) H^*_{>J}$$
where
$$\begin{aligned} H^*_{\le J} \mid m_{\cdot 1}, \ldots, m_{\cdot K} &\sim \mathrm{DP}\left(\bar{m},\; \frac{\sum_{k \le J} m_{\cdot k}\delta_{\phi_k}}{\bar{m}}\right) \\ H^*_{>J} \mid m_{\cdot 1}, \ldots, m_{\cdot K} &\sim \mathrm{DP}\left(\alpha + m - \bar{m},\; \frac{\alpha D_0 + \sum_{k > J} m_{\cdot k}\delta_{\phi_k}}{\alpha + m - \bar{m}}\right) \\ B^* &\sim \mathrm{Beta}(\bar{m},\; m - \bar{m} + \alpha), \end{aligned}$$
where $K$ is the total number of instantiated dishes; $J \le K$; $m = \sum_{k=1}^{K} m_{\cdot k}$; and $\bar{m} = \sum_{k=1}^{J} m_{\cdot k}$.

Proof. This is a direct extension of Lemma 3, applied to the top-level Dirichlet process H.

We can therefore construct a hybrid sampler, where $H^*_{\le J} := \sum_{k=1}^{J} \beta_k \delta_{\phi_k}$ is represented via $(\beta_1, \ldots, \beta_J) \sim \mathrm{Dir}(m_{\cdot 1}, \ldots, m_{\cdot J})$ with corresponding atom locations $\phi_k$ drawn with density proportional to $h(\phi_k)\prod_{s,i : k_{s t_{si}} = k} f(x_{si} \mid \phi_k)$ (where $h$ denotes the density of $D_0$), and $H^*_{>J}$ is represented using a Chinese restaurant process. We can then sample the table allocations according to

$$P(t_{si} = t \mid \mathbf{t}^{-si}, \mathbf{k}, -) \propto \begin{cases} \frac{n^{-si}_{st\cdot}}{n^{-si}_{s\cdot\cdot} + \gamma}\, f(x_{si}; \phi_{k_{st}}) & \text{if } t \text{ previously used and } k_{st} \le J \\ B^*\, \frac{\gamma}{n^{-si}_{s\cdot\cdot} + \gamma}\, \beta_{k_{st}}\, f(x_{si}; \phi_{k_{st}}) & \text{if } t \text{ is a new table and } k_{st} \le J \\ (1 - B^*)\, \frac{\gamma}{n^{-si}_{s\cdot\cdot} + \gamma}\, \frac{m_{\cdot k_{st}}}{\sum_{k > J} m_{\cdot k} + \alpha}\, f(x_{si} \mid \mathbf{t}^{-si}, \mathbf{k}, -; D_0) & \text{if } t \text{ is new and } J < k_{st} \le K \\ (1 - B^*)\, \frac{\gamma}{n^{-si}_{s\cdot\cdot} + \gamma}\, \frac{\alpha}{\sum_{k > J} m_{\cdot k} + \alpha}\, f(x_{si}; D_0) & \text{if } t \text{ is new and } k_{st} = K + 1 \end{cases} \qquad (11)$$

and sample each table's dish according to

$$P(k_{st} = k \mid \mathbf{t}, \mathbf{k}^{-st}) \propto \begin{cases} B^* \beta_k\, p(\{x_{si} : t_{si} = t\} \mid \phi_k) & k \le J \\ (1 - B^*)\, \frac{m_{\cdot k}}{\sum_{k' > J} m_{\cdot k'} + \alpha}\, p(\{x_{si} : t_{si} = t\} \mid D_0, \mathbf{X}) & J < k \le K \\ (1 - B^*)\, \frac{\alpha}{\sum_{k' > J} m_{\cdot k'} + \alpha}\, p(\{x_{si} : t_{si} = t\} \mid D_0) & k = K + 1 \end{cases} \qquad (12)$$

The hybrid sampler is summarized in Algorithm 7.

Algorithm 7 Hybrid HDP Sampler
1: while not converged do
2:   Select J
3:   Sample B* ∼ Beta(m̄, m − m̄ + α)
4:   Sample (β_1, . . . , β_J) ∼ Dir(m_{·1}, . . . , m_{·J})
5:   For k ≤ J, sample φ_k ∼ p(φ_k | D_0, {x_si : k_{s t_si} = k})
6:   For each data point x_si, sample t_si according to Equation 11
7:   Sample the dishes k_st according to Equation 12
8: end while

If we ensure that the data associated with each "restaurant" or group is located on the same processor, we can extend this hybrid algorithm to a distributed setting, as described in Algorithm 8.

C Further IBP Empirical Results

For the "Cambridge" data sets described in the main paper, we generated images based on a superposition of the four features in the top row of Figure 6, and then flattened each image to create a 36-dimensional vector. The bottom row of Figure 6 shows some sample data points.

In addition to synthetic data, we also evaluated the distributed beta-Bernoulli process sampler on a real-world data set, the CBCL MIT face dataset (Weyrauch et al., 2004). This data set consists of 2,429 images of 19 × 19-pixel faces.

Algorithm 8 Distributed HDP Sampler
 1: procedure Local(X_si, t_si, k_st)
      ▷ Global variables: J, P*, {φ_k, β_k}_{k=1}^J, B*
 2:   if P = P* then
 3:     Sample t_si according to (11)
 4:     Sample k_st according to (12)
 5:   else
 6:     Sample tables according to
          P(t_si = t | n^{-si}_{st·}) ∝ n^{-si}_{st·} f(x_si; φ_{k_st}) if t is occupied, and ∝ γ β_{k_st} f(x_si; φ_{k_st}) if t is new   (13)
 7:     Sample dishes according to
          P(k_st = k | β_k, t_si, X) ∝ β_k p({x_si : t_si = t} | φ_k)   (14)
 8:   end if
 9: end procedure
10: procedure Global(X_si, t_si, k_st)
11:   Gather cluster counts m_k and parameter sufficient statistics Ψ_k from all processors.
12:   Sample B* ∼ Beta(m̄, α + m − m̄)
13:   Let J be the number of instantiated clusters.
14:   Sample (β_1, . . . , β_J) ∼ Dir(m_{·1}, . . . , m_{·J})
15:   For k ≤ J, sample φ_k ∼ p(φ_k | D_0, {x_si : k_{s t_si} = k})
16:   Sample P* ∼ Uniform(1, . . . , P)
17: end procedure

We distributed the data across 32 processors and ran the sampler in parallel for 500 iterations. Figure 8 shows the test-set log likelihood of the sampler over time, and Figure 7 shows the features learned by the hybrid IBP method; we can clearly see that our method discovers the underlying facial features in this data set.

Figure 6: Top: The true features present in the synthetic data set. Bottom: Examples of observations in thesynthetic data set.


Figure 7: Learned features from the hybrid IBP for the CBCL face data set.

Figure 8: Test set log-likelihood trace plot for CBCL face data set.