
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 187–195, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics

Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

Jun Zhu, Xun Zheng, Bo Zhang
Department of Computer Science and Technology
TNLIST Lab and State Key Lab of Intelligent Technology and Systems
Tsinghua University, Beijing, China

{dcszj,dcszb}@tsinghua.edu.cn; [email protected]

Abstract

Supervised topic models with a logistic likelihood have two issues that potentially limit their practical use: 1) response variables are usually over-weighted by document word counts; and 2) existing variational inference methods make strict mean-field assumptions. We address these issues by: 1) introducing a regularization constant to better balance the two parts based on an optimization formulation of Bayesian inference; and 2) developing a simple Gibbs sampling algorithm by introducing auxiliary Polya-Gamma variables and collapsing out Dirichlet variables. Our augment-and-collapse sampling algorithm has analytical forms of each conditional distribution without making any restricting assumptions and can be easily parallelized. Empirical results demonstrate significant improvements on prediction performance and time efficiency.

1 Introduction

As widely adopted in supervised latent Dirichlet allocation (sLDA) models (Blei and McAuliffe, 2010; Wang et al., 2009), one way to improve the predictive power of LDA is to define a likelihood model for the widely available document-level response variables, in addition to the likelihood model for document words. For example, the logistic likelihood model is commonly used for binary or multinomial responses. By imposing some priors, posterior inference is done with Bayes' rule. Though powerful, one issue that could limit the use of existing logistic supervised LDA models is that they treat the document-level response variable as one additional word via a normalized likelihood model. Although some special treatment is carried out on defining the likelihood of the single response variable, it is normally of a much smaller scale than the likelihood of the usually tens or hundreds of words in each document. As noted by Halpern et al. (2012) and observed in our experiments, this model imbalance could result in a weak influence of response variables on the topic representations and thus unsatisfactory prediction performance. Another difficulty that arises when dealing with categorical response variables is that the commonly used normal priors are no longer conjugate to the logistic likelihood and thus lead to hard inference problems. Existing approaches rely on variational approximation techniques which normally make strict mean-field assumptions.

To address the above issues, we present two improvements. First, we present a general framework of Bayesian logistic supervised topic models with a regularization parameter to better balance response variables and words. Technically, instead of doing standard Bayesian inference via Bayes' rule, which requires a normalized likelihood model, we propose to do regularized Bayesian inference (Zhu et al., 2011; Zhu et al., 2013b) via solving an optimization problem, where the posterior regularization is defined as an expectation of a logistic loss, a surrogate loss for the expected misclassification error; and a regularization parameter is introduced to balance the surrogate classification loss (i.e., the response log-likelihood) and the word likelihood. The general formulation subsumes standard sLDA as a special case.

Second, to solve the intractable posterior inference problem of the generalized Bayesian logistic supervised topic models, we present a simple Gibbs sampling algorithm by exploring the ideas of data augmentation (Tanner and Wong, 1987; van Dyk and Meng, 2001; Holmes and Held, 2006). More specifically, we extend Polson's method for Bayesian logistic regression (Polson et al., 2012) to the generalized logistic supervised topic models, which are much more challenging due to the presence of non-trivial latent variables.


Technically, we introduce a set of Polya-Gamma variables, one per document, to reformulate the generalized logistic pseudo-likelihood model (with the regularization parameter) as a scale mixture, where the mixture component is conditionally normal for classifier parameters. Then, we develop a simple and efficient Gibbs sampling algorithm with analytic conditional distributions and without Metropolis-Hastings accept/reject steps. For Bayesian LDA models, we can also explore the conjugacy of the Dirichlet-Multinomial prior-likelihood pairs to collapse out the Dirichlet variables (i.e., topics and mixing proportions) and do collapsed Gibbs sampling, which can have better mixing rates (Griffiths and Steyvers, 2004). Finally, our empirical results on real data sets demonstrate significant improvements on time efficiency. The classification performance is also significantly improved by using appropriate regularization parameters. We also provide a parallel implementation with GraphLab (Gonzalez et al., 2012), which shows great promise in our preliminary studies.

The paper is structured as follows. Sec. 2 introduces logistic supervised topic models as a general optimization problem. Sec. 3 presents Gibbs sampling algorithms with data augmentation. Sec. 4 presents experiments. Sec. 5 concludes.

2 Logistic Supervised Topic Models

We now present the generalized Bayesian logistic supervised topic models.

2.1 The Generalized Models

We consider binary classification with a training set D = {(w_d, y_d)}_{d=1}^D, where the response variable Y takes values from the output space Y = {0, 1}. A logistic supervised topic model consists of two parts — an LDA model (Blei et al., 2003) for describing the words W = {w_d}_{d=1}^D, where w_d = {w_dn}_{n=1}^{N_d} denote the words within document d, and a logistic classifier for considering the supervising signal y = {y_d}_{d=1}^D. Below, we introduce each of them in turn.

LDA: LDA is a hierarchical Bayesian model that posits each document as an admixture of K topics, where each topic Φ_k is a multinomial distribution over a V-word vocabulary. For document d, the generating process is

1. draw a topic proportion θ_d ∼ Dir(α)

2. for each word n = 1, 2, . . . , N_d:

   (a) draw a topic¹ z_dn ∼ Mult(θ_d)

   (b) draw the word w_dn ∼ Mult(Φ_{z_dn})

where Dir(·) is a Dirichlet distribution; Mult(·) is a multinomial distribution; and Φ_{z_dn} denotes the topic selected by the non-zero entry of z_dn. For fully-Bayesian LDA, the topics are random samples from a Dirichlet prior, Φ_k ∼ Dir(β).
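For concreteness, the generative process above can be simulated with a few lines of code. The sketch below is our own illustration (function and variable names such as generate_corpus are not from the paper) of fully-Bayesian LDA with Dir(β) topics and Dir(α) topic proportions.

```python
import numpy as np

def generate_corpus(D, N_d, K, V, alpha, beta, seed=0):
    """Simulate documents from the fully-Bayesian LDA generative process."""
    rng = np.random.default_rng(seed)
    Phi = rng.dirichlet(np.full(V, beta), size=K)        # topics Phi_k ~ Dir(beta)
    docs, topics = [], []
    for _ in range(D):
        theta_d = rng.dirichlet(np.full(K, alpha))       # theta_d ~ Dir(alpha)
        z_d = rng.choice(K, size=N_d, p=theta_d)         # z_dn ~ Mult(theta_d)
        w_d = np.array([rng.choice(V, p=Phi[z]) for z in z_d])  # w_dn ~ Mult(Phi_{z_dn})
        docs.append(w_d)
        topics.append(z_d)
    return docs, topics, Phi

docs, topics, Phi = generate_corpus(D=5, N_d=50, K=4, V=100, alpha=1.0, beta=0.01)
```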

Let z_d = {z_dn}_{n=1}^{N_d} denote the set of topic assignments for document d. Let Z = {z_d}_{d=1}^D and Θ = {θ_d}_{d=1}^D denote all the topic assignments and mixing proportions for the entire corpus. LDA infers the posterior distribution p(Θ, Z, Φ|W) ∝ p_0(Θ, Z, Φ) p(W|Z, Φ), where p_0(Θ, Z, Φ) = (∏_d p(θ_d|α) ∏_n p(z_dn|θ_d)) ∏_k p(Φ_k|β) is the joint distribution defined by the model. As noticed in (Jiang et al., 2012), the posterior distribution by Bayes' rule is equivalent to the solution of an information theoretical optimization problem

  min_{q(Θ,Z,Φ)}  KL(q(Θ, Z, Φ) ∥ p_0(Θ, Z, Φ)) − E_q[log p(W|Z, Φ)]
  s.t.: q(Θ, Z, Φ) ∈ P,   (1)

where KL(q∥p) is the Kullback-Leibler divergence from q to p and P is the space of probability distributions.

Logistic classifier: To consider binary supervising information, a logistic supervised topic model (e.g., sLDA) builds a logistic classifier using the topic representations as input features

  p(y = 1|η, z̄) = exp(η^⊤ z̄) / (1 + exp(η^⊤ z̄)),   (2)

where z̄ is a K-vector with z̄_k = (1/N) ∑_{n=1}^N I(z_n^k = 1), and I(·) is an indicator function that equals 1 if the predicate holds and 0 otherwise. If the classifier weights η and topic assignments z are given, the prediction rule is

  ŷ|η,z̄ = I(p(y = 1|η, z̄) > 0.5) = I(η^⊤ z̄ > 0).   (3)

Since both η and Z are hidden variables, we propose to infer a posterior distribution q(η, Z) that has the minimal expected log-logistic loss

  R(q(η, Z)) = −∑_d E_q[log p(y_d|η, z_d)],   (4)

which is a good surrogate loss for the expected misclassification loss, ∑_d E_q[I(ŷ|η,z̄_d ≠ y_d)], of a Gibbs classifier that randomly draws a model η from the posterior distribution and makes predictions (McAllester, 2003; Germain et al., 2009). In fact, this choice is motivated by the observation that the logistic loss has been widely used as a convex surrogate loss for the misclassification loss (Rosasco et al., 2004) in the task of fully observed binary classification.

¹A K-dimensional binary vector with only one entry equal to 1.


Also, note that the logistic classifier and the LDA likelihood are coupled by sharing the latent topic assignments z. The strong coupling makes it possible to learn a posterior distribution that can describe the observed words well and make accurate predictions.

Regularized Bayesian Inference: To integrate the above two components for hybrid learning, a logistic supervised topic model solves the joint Bayesian inference problem

  min_{q(η,Θ,Z,Φ)}  L(q(η, Θ, Z, Φ)) + c R(q(η, Z))   (5)
  s.t.: q(η, Θ, Z, Φ) ∈ P,

where L(q) = KL(q ∥ p_0(η, Θ, Z, Φ)) − E_q[log p(W|Z, Φ)] is the objective for doing standard Bayesian inference with the classifier weights η; p_0(η, Θ, Z, Φ) = p_0(η) p_0(Θ, Z, Φ); and c is a regularization parameter balancing the influence from response variables and words.

In general, we define the pseudo-likelihood for the supervision information as

  ψ(y_d|z_d, η) = p^c(y_d|η, z_d) = {exp(η^⊤ z̄_d)}^{c y_d} / (1 + exp(η^⊤ z̄_d))^c,   (6)

which is un-normalized if c ≠ 1. But, as we shall see, this un-normalization does not affect our subsequent inference. Then, the generalized inference problem (5) of logistic supervised topic models can be written in the "standard" Bayesian inference form (1)

  min_{q(η,Θ,Z,Φ)}  L(q(η, Θ, Z, Φ)) − E_q[log ψ(y|Z, η)]   (7)
  s.t.: q(η, Θ, Z, Φ) ∈ P,

where ψ(y|Z, η) = ∏_d ψ(y_d|z_d, η). It is easy to show that the optimum solution of problem (5), or the equivalent problem (7), is the posterior distribution with supervising information, i.e.,

  q(η, Θ, Z, Φ) = p_0(η, Θ, Z, Φ) p(W|Z, Φ) ψ(y|η, Z) / ϕ(y, W),

where ϕ(y, W) is the normalization constant that makes q a distribution. We can see that when c = 1, the model reduces to the standard sLDA, which in practice has the imbalance issue that the response variable (which can be viewed as one additional word) is usually dominated by the words. This imbalance was noticed in (Halpern et al., 2012). We will see that c can make a big difference later.
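To make the role of c concrete, the following sketch (our own code and names, not the authors') evaluates log ψ(y_d|z_d, η) of Eq. (6) in log space; with c = 1 it reduces to the ordinary log-logistic likelihood of standard sLDA, while a larger c scales up the contribution of the single response relative to the N_d word-likelihood terms.

```python
import numpy as np

def log_pseudo_likelihood(y_d, zbar_d, eta, c=1.0):
    """log psi(y_d | z_d, eta) = c*y_d*omega_d - c*log(1 + exp(omega_d)),
    with omega_d = eta^T zbar_d and zbar_d the mean topic-assignment vector."""
    omega_d = float(eta @ zbar_d)
    return c * y_d * omega_d - c * np.logaddexp(0.0, omega_d)

eta = np.array([1.5, -0.5, 0.2])
zbar_d = np.array([0.6, 0.3, 0.1])
# c multiplies the (log) contribution of the response, counter-balancing the
# N_d word-likelihood terms of the same document.
print(log_pseudo_likelihood(1, zbar_d, eta, c=1.0),
      log_pseudo_likelihood(1, zbar_d, eta, c=25.0))
```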

Comparison with MedLDA: The above formulation of logistic supervised topic models as an instance of regularized Bayesian inference provides a direct comparison with the max-margin supervised topic model (MedLDA) (Jiang et al., 2012), which has the same form of optimization problem. The difference lies in the posterior regularization, for which MedLDA uses a hinge loss of an expected classifier while the logistic supervised topic model uses an expected log-logistic loss. Gibbs MedLDA (Zhu et al., 2013a) is another max-margin model that adopts the expected hinge loss as posterior regularization. As we shall see in the experiments, by using appropriate regularization constants, logistic supervised topic models achieve comparable performance with max-margin methods. We note that the relationship between a logistic loss and a hinge loss has been discussed extensively in various settings (Rosasco et al., 2004; Globerson et al., 2007). But the presence of latent variables poses additional challenges in carrying out a formal theoretical analysis of these surrogate losses (Lin, 2001) in the topic model setting.

2.2 Variational Approximation Algorithms

The commonly used normal prior for η is non-conjugate to the logistic likelihood, which makes the posterior inference hard. Moreover, the latent variables Z make the inference problem harder than that of Bayesian logistic regression models (Chen et al., 1999; Meyer and Laud, 2002; Polson et al., 2012). Previous algorithms to solve problem (5) rely on variational approximation techniques. It is easy to show that the variational method (Wang et al., 2009) is a coordinate descent algorithm to solve problem (5) with the additional fully-factorized constraint q(η, Θ, Z, Φ) = q(η) (∏_d q(θ_d) ∏_n q(z_dn)) ∏_k q(Φ_k) and a variational approximation to the expectation of the log-logistic likelihood, which is intractable to compute directly. Note that the non-Bayesian treatment of η as unknown parameters in (Wang et al., 2009) results in an EM algorithm, which still needs to make strict mean-field assumptions together with a variational bound of the expectation of the log-logistic likelihood. In this paper, we consider the full Bayesian treatment, which can incorporate prior distributions in a principled way and infer the posterior covariance.

3 A Gibbs Sampling Algorithm

Now, we present a simple and efficient Gibbs sampling algorithm for the generalized Bayesian logistic supervised topic models.


3.1 Formulation with Data Augmentation

Since the logistic pseudo-likelihood ψ(y|Z, η) is not conjugate with normal priors, it is not easy to derive the sampling algorithms directly. Instead, we develop our algorithms by introducing auxiliary variables, which lead to a scale mixture of Gaussian components and analytic conditional distributions for automatic Bayesian inference without an accept/reject ratio. Our algorithm represents a first attempt to extend Polson's approach (Polson et al., 2012) to deal with highly non-trivial Bayesian latent variable models. Let us first introduce the Polya-Gamma variables.

Definition 1 (Polson et al., 2012) A random variable X has a Polya-Gamma distribution, denoted by X ∼ PG(a, b), if

  X = (1/(2π²)) ∑_{i=1}^∞ g_i / ((i − 1/2)² + b²/(4π²)),

where a, b > 0 and each g_i ∼ G(a, 1) is an independent Gamma random variable.
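Definition 1 immediately suggests a naive way to draw (approximate) Polya-Gamma variates by truncating the infinite sum; the sketch below is our own illustration of the construction and is not the exact sampler used in the paper (the authors use the exponentially tilted Jacobi sampler of Polson et al., available in the R package BayesLogit).

```python
import numpy as np

def sample_pg_approx(a, b, trunc=200, rng=None):
    """Approximate draw from PG(a, b) by truncating the infinite sum of Definition 1."""
    if rng is None:
        rng = np.random.default_rng()
    i = np.arange(1, trunc + 1)
    g = rng.gamma(shape=a, scale=1.0, size=trunc)        # g_i ~ Gamma(a, 1)
    return np.sum(g / ((i - 0.5) ** 2 + b ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

# Sanity check: the mean of PG(a, 0) is a/4.
draws = [sample_pg_approx(1.0, 0.0) for _ in range(2000)]
print(np.mean(draws))  # should be close to 0.25
```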

Let ω_d = η^⊤ z̄_d. Then, using the ideas of data augmentation (Tanner and Wong, 1987; Polson et al., 2012), we can show that the generalized pseudo-likelihood can be expressed as

  ψ(y_d|z_d, η) = (1/2^c) e^{κ_d ω_d} ∫_0^∞ exp(−λ_d ω_d²/2) p(λ_d|c, 0) dλ_d,

where κ_d = c(y_d − 1/2) and λ_d is a Polya-Gamma variable with parameters a = c and b = 0. This result indicates that the posterior distribution of the generalized Bayesian logistic supervised topic models, i.e., q(η, Θ, Z, Φ), can be expressed as the marginal of a higher dimensional distribution that includes the augmented variables λ. The complete posterior distribution is

  q(η, λ, Θ, Z, Φ) = p_0(η, Θ, Z, Φ) p(W|Z, Φ) ϕ(y, λ|Z, η) / ψ(y, W),

where the pseudo-joint distribution of y and λ is

  ϕ(y, λ|Z, η) = ∏_d exp(κ_d ω_d − λ_d ω_d²/2) p(λ_d|c, 0).

3.2 Inference with Collapsed Gibbs Sampling

Although we can do Gibbs sampling to infer the complete posterior distribution q(η, λ, Θ, Z, Φ), and thus q(η, Θ, Z, Φ) by ignoring λ, the mixing rate would be slow due to the large sample space. One way to effectively improve mixing rates is to integrate out the intermediate variables (Θ, Φ) and build a Markov chain whose equilibrium distribution is the marginal distribution q(η, λ, Z). We propose to use collapsed Gibbs sampling, which has been successfully used in LDA (Griffiths and Steyvers, 2004). For our model, the collapsed posterior distribution is

  q(η, λ, Z) ∝ p_0(η) p(W, Z|α, β) ϕ(y, λ|Z, η)
             = p_0(η) ∏_{k=1}^K [δ(C_k + β)/δ(β)] ∏_{d=1}^D [δ(C_d + α)/δ(α) × exp(κ_d ω_d − λ_d ω_d²/2) p(λ_d|c, 0)],

where δ(x) = ∏_{i=1}^{dim(x)} Γ(x_i) / Γ(∑_{i=1}^{dim(x)} x_i); C_k^t is the number of times the term t is assigned to topic k over the whole corpus and C_k = {C_k^t}_{t=1}^V; and C_d^k is the number of times that terms are associated with topic k within the d-th document and C_d = {C_d^k}_{k=1}^K.

Then, the conditional distributions used in collapsed Gibbs sampling are as follows.

For η: for the commonly used isotropic Gaussian prior p_0(η) = ∏_k N(η_k; 0, ν²), we have

  q(η|Z, λ) ∝ p_0(η) ∏_d exp(κ_d ω_d − λ_d ω_d²/2) = N(η; µ, Σ),   (8)

where the posterior mean is µ = Σ(∑_d κ_d z̄_d) and the covariance is Σ = (I/ν² + ∑_d λ_d z̄_d z̄_d^⊤)^{−1}. We can easily draw a sample from a K-dimensional multivariate Gaussian distribution. The inverse can be robustly done using Cholesky decomposition, an O(K³) procedure. Since K is normally not large, the inversion can be done efficiently.
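As a concrete illustration of the draw in Eq. (8), the sketch below (our own helper; zbar denotes the D×K matrix whose rows are the mean topic-assignment vectors z̄_d) forms the posterior precision and mean and samples η via a Cholesky factorization, the O(K³) step mentioned above.

```python
import numpy as np

def sample_eta(zbar, kappa, lam, nu2=1.0, rng=None):
    """Draw eta ~ N(mu, Sigma) as in Eq. (8).
    zbar: (D, K) mean topic-assignment vectors; kappa: (D,) with kappa_d = c*(y_d - 0.5);
    lam: (D,) Polya-Gamma variables; nu2: prior variance."""
    if rng is None:
        rng = np.random.default_rng()
    K = zbar.shape[1]
    precision = np.eye(K) / nu2 + (zbar * lam[:, None]).T @ zbar   # Sigma^{-1}
    mu = np.linalg.solve(precision, zbar.T @ kappa)                # mu = Sigma * sum_d kappa_d zbar_d
    L = np.linalg.cholesky(precision)                              # precision = L L^T
    eps = rng.standard_normal(K)
    return mu + np.linalg.solve(L.T, eps)                          # L^{-T} eps has covariance Sigma
```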

For Z: The conditional distribution of Z is

  q(Z|η, λ) ∝ ∏_{k=1}^K [δ(C_k + β)/δ(β)] ∏_{d=1}^D [δ(C_d + α)/δ(α) × exp(κ_d ω_d − λ_d ω_d²/2)].

By canceling common factors, we can derive the local conditional of one variable z_dn as:

  q(z_dn^k = 1 | Z_¬, η, λ, w_dn = t)
    ∝ (C_{k,¬n}^t + β_t)(C_{d,¬n}^k + α_k) / (∑_t C_{k,¬n}^t + ∑_{t=1}^V β_t)
      × exp(γ κ_d η_k − λ_d (γ² η_k² + 2γ(1 − γ) η_k Λ_dn^k) / 2),   (9)

where C_{·,¬n}^· indicates that term n is excluded from the corresponding document or topic; γ = 1/N_d; and Λ_dn^k = (1/(N_d − 1)) ∑_{k′} η_{k′} C_{d,¬n}^{k′} is the discriminant function value without word n. We can see that the first term is from the LDA model for observed word counts and the second term is from the supervising signal y.
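The per-token draw of Eq. (9) can be sketched as follows; the code and argument names are our own (alpha and beta are the Dirichlet hyperparameter vectors, and both count arrays already exclude the current token).

```python
import numpy as np

def sample_topic(t, C_k_excl, C_d_excl, beta, alpha, eta, kappa_d, lam_d, N_d, rng=None):
    """Draw z_dn as in Eq. (9). C_k_excl: (K, V) topic-term counts excluding token n;
    C_d_excl: (K,) doc-topic counts for document d excluding token n."""
    if rng is None:
        rng = np.random.default_rng()
    gamma = 1.0 / N_d
    # discriminant value without word n (guard against single-word documents)
    Lam = (eta @ C_d_excl) / max(N_d - 1, 1)
    lda_part = (C_k_excl[:, t] + beta[t]) * (C_d_excl + alpha) \
               / (C_k_excl.sum(axis=1) + beta.sum())
    sup_part = np.exp(gamma * kappa_d * eta
                      - lam_d * (gamma ** 2 * eta ** 2
                                 + 2 * gamma * (1 - gamma) * eta * Lam) / 2)
    p = lda_part * sup_part
    return int(rng.choice(len(p), p=p / p.sum()))
```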

For λ: Finally, the conditional distribution of the augmented variables λ is

  q(λ_d|Z, η) ∝ exp(−λ_d ω_d²/2) p(λ_d|c, 0) = PG(λ_d; c, ω_d),   (10)

which is a Polya-Gamma distribution. The equality is achieved by using the construction definition of the general PG(a, b) class through an exponential tilting of the PG(a, 0) density (Polson et al., 2012). To draw samples from the Polya-Gamma distribution, we adopt the efficient method² proposed in (Polson et al., 2012), which draws samples from the closely related exponentially tilted Jacobi distribution.

Algorithm 1 for collapsed Gibbs sampling
 1: Initialization: set λ = 1 and randomly draw z_dn from a uniform distribution.
 2: for m = 1 to M do
 3:   draw a classifier from the distribution (8)
 4:   for d = 1 to D do
 5:     for each word n in document d do
 6:       draw the topic using distribution (9)
 7:     end for
 8:     draw λ_d from distribution (10)
 9:   end for
10: end for

With the above conditional distributions, we can construct a Markov chain which iteratively draws samples of η using Eq. (8), Z using Eq. (9) and λ using Eq. (10), with an initial condition. In our experiments, we initially set λ = 1 and randomly draw Z from a uniform distribution. In training, we run the Markov chain for M iterations (i.e., the burn-in stage), as outlined in Algorithm 1. Then, we draw a sample η as the final classifier to make predictions on testing data. As we shall see, the Markov chain converges to stable prediction performance with a few burn-in iterations.
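Putting the three conditionals together, Algorithm 1 can be sketched end-to-end as below. This is a schematic re-implementation under our own conventions, reusing the illustrative helpers sample_eta, sample_topic and sample_pg_approx sketched earlier; it is not the authors' C++/GraphLab implementation.

```python
import numpy as np

def gibbs_slda_train(docs, y, K, V, alpha, beta, c=1.0, nu2=1.0, M=100, seed=0):
    """Schematic collapsed Gibbs sampler for the generalized logistic supervised
    topic model (Algorithm 1), using the helpers sketched earlier."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    alpha_vec, beta_vec = np.full(K, alpha), np.full(V, beta)
    kappa = c * (np.asarray(y, dtype=float) - 0.5)       # kappa_d = c*(y_d - 1/2)
    Z = [rng.integers(K, size=len(w)) for w in docs]     # random initialization
    C_k, C_d = np.zeros((K, V)), np.zeros((D, K))
    for d, (w, z) in enumerate(zip(docs, Z)):
        np.add.at(C_k, (z, w), 1)
        np.add.at(C_d[d], z, 1)
    lam = np.ones(D)                                     # lambda initialized to 1
    for _ in range(M):                                   # burn-in iterations
        zbar = C_d / C_d.sum(axis=1, keepdims=True)
        eta = sample_eta(zbar, kappa, lam, nu2, rng)     # Eq. (8)
        for d, w in enumerate(docs):
            N_d = len(w)
            for n, t in enumerate(w):
                k_old = Z[d][n]
                C_k[k_old, t] -= 1; C_d[d, k_old] -= 1   # exclude token n
                k_new = sample_topic(t, C_k, C_d[d], beta_vec, alpha_vec,
                                     eta, kappa[d], lam[d], N_d, rng)  # Eq. (9)
                Z[d][n] = k_new
                C_k[k_new, t] += 1; C_d[d, k_new] += 1
            omega_d = eta @ C_d[d] / N_d
            lam[d] = sample_pg_approx(c, omega_d, rng=rng)             # Eq. (10)
    return eta, C_k
```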

3.3 Prediction

To apply the classifier η on testing data, we need to infer their topic assignments. We take the approach in (Zhu et al., 2012; Jiang et al., 2012), which uses a point estimate of the topics Φ from training data and makes predictions based on them. Specifically, we use the MAP estimate Φ̂ to replace the probability distribution p(Φ). For the Gibbs sampler, an estimate of Φ̂ using the samples is φ̂_kt ∝ C_k^t + β_t. Then, given a testing document w, we infer its latent components z using Φ̂ as p(z_n = k|z_¬n) ∝ φ̂_{k w_n}(C_¬n^k + α_k), where C_¬n^k is the number of times that the terms in this document w are assigned to topic k with the n-th term excluded.

²The basic sampler was implemented in the R package BayesLogit. We implemented the sampling algorithm in C++ together with our topic model sampler.
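A sketch of this test-time procedure (our own code; it reuses the topic-term counts C_k from the training sketch above, holds the MAP estimate of Φ fixed, and applies the prediction rule of Eq. (3)):

```python
import numpy as np

def predict(w_test, eta, C_k, alpha, beta, n_sweeps=20, seed=0):
    """Classify a test document: hold the MAP estimate of Phi fixed, Gibbs-sample the
    document's topic assignments, then apply the prediction rule of Eq. (3)."""
    rng = np.random.default_rng(seed)
    K = C_k.shape[0]
    phi_hat = (C_k + beta) / (C_k + beta).sum(axis=1, keepdims=True)   # phi_kt propto C_k^t + beta_t
    z = rng.integers(K, size=len(w_test))
    counts = np.bincount(z, minlength=K)
    for _ in range(n_sweeps):
        for n, t in enumerate(w_test):
            counts[z[n]] -= 1                                          # exclude the n-th term
            p = phi_hat[:, t] * (counts + alpha)                       # p(z_n = k | z_{-n})
            z[n] = rng.choice(K, p=p / p.sum())
            counts[z[n]] += 1
    zbar = counts / len(w_test)
    return int(eta @ zbar > 0)                                         # prediction rule (3)
```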

4 Experiments

We present empirical results and sensitivity analysis to demonstrate the efficiency and prediction performance³ of the generalized logistic supervised topic models on the 20Newsgroups (20NG) data set, which contains about 20,000 postings within 20 news groups. We follow the same setting as in (Zhu et al., 2012) and remove a standard list of stop words for both binary and multi-class classification. For all the experiments, we use the standard normal prior p_0(η) (i.e., ν² = 1) and the symmetric Dirichlet priors α = (α/K) 1, β = 0.01 × 1, where 1 is a vector with all entries being 1. For each setting, we report the average performance and the standard deviation with five randomly initialized runs.

4.1 Binary classification

Following the same setting in (Lacoste-Julien et al., 2009; Zhu et al., 2012), the task is to distinguish postings of the newsgroup alt.atheism from those of the group talk.religion.misc. The training set contains 856 documents and the test set contains 569 documents. We compare the generalized logistic supervised LDA using Gibbs sampling (denoted by gSLDA) with various competitors, including the standard sLDA using variational mean-field methods (denoted by vSLDA) (Wang et al., 2009), the MedLDA model using variational mean-field methods (denoted by vMedLDA) (Zhu et al., 2012), and the MedLDA model using collapsed Gibbs sampling algorithms (denoted by gMedLDA) (Jiang et al., 2012). We also include the unsupervised LDA using collapsed Gibbs sampling as a baseline, denoted by gLDA. For gLDA, we learn a binary linear SVM on its topic representations using SVMLight (Joachims, 1999). The results of DiscLDA (Lacoste-Julien et al., 2009) and the linear SVM on raw bag-of-words features were reported in (Zhu et al., 2012). For gSLDA, we compare two versions — the standard sLDA with c = 1 and the sLDA with a well-tuned c value. To distinguish them, we denote the latter by gSLDA+. We set c = 25 for gSLDA+, and set α = 1 and M = 100 for both gSLDA and gSLDA+. As we shall see, gSLDA is insensitive to α, c and M in a wide range.

³Due to the space limit, the topic visualization (similar to that of MedLDA) is deferred to a longer version.


[Figure 1: Accuracy, training time (in log-scale) and testing time on the 20NG binary data set. Panels: (a) accuracy, (b) training time (seconds), (c) testing time (seconds), each plotted against the number of topics, for gSLDA, gSLDA+, vSLDA, vMedLDA, gMedLDA and gLDA+SVM.]

Fig. 1 shows the performance of the different methods with various numbers of topics. For accuracy, we can draw two conclusions: 1) without making restricting assumptions on the posterior distributions, gSLDA achieves higher accuracy than vSLDA, which uses strict variational mean-field approximation; and 2) by using the regularization constant c to improve the influence of supervision information, gSLDA+ achieves much better classification results, in fact comparable with those of the MedLDA models, since they have a similar mechanism to improve the influence of supervision by tuning a regularization constant. The fact that gLDA+SVM performs better than the standard gSLDA is due to the same reason: the SVM part of gLDA+SVM can well capture the supervision information to learn a classifier for good prediction, while the standard sLDA cannot well balance the influence of supervision. In contrast, the well-balanced gSLDA+ model successfully outperforms the two-stage approach, gLDA+SVM, by performing topic discovery and prediction jointly.⁴

⁴The variational sLDA with a well-tuned c is significantly better than the standard sLDA, but a bit inferior to gSLDA+.

For training time, both gSLDA and gSLDA+ are very efficient, e.g., about two orders of magnitude faster than vSLDA and about one order of magnitude faster than vMedLDA. For testing time, gSLDA and gSLDA+ are comparable with gMedLDA and the unsupervised gLDA, but faster than the variational vMedLDA and vSLDA, especially when K is large.

4.2 Multi-class classification

We perform multi-class classification on the 20NG data set with all the 20 categories. For multi-class classification, one possible extension is to use a multinomial logistic regression model for categorical variables Y by using topic representations z as input features. However, it is non-trivial to develop a Gibbs sampling algorithm using a similar data augmentation idea, due to the presence of latent variables and the nonlinearity of the soft-max function. In fact, this is harder than multinomial Bayesian logistic regression, which can be done via a coordinate strategy (Polson et al., 2012). Here, we apply the binary gSLDA to do the multi-class classification, following the "one-vs-all" strategy, which has been shown effective (Rifkin and Klautau, 2004), to provide some preliminary analysis. Namely, we learn 20 binary gSLDA models and aggregate their predictions by taking the most likely ones as the final predictions. We again evaluate two versions of gSLDA — the standard gSLDA with c = 1 and the improved gSLDA+ with a well-tuned c value. Since gSLDA is also insensitive to α and c for the multi-class task, we set α = 5.6 for both gSLDA and gSLDA+, and set c = 256 for gSLDA+. The number of burn-in iterations is set as M = 40, which is sufficiently large to get stable results, as we shall see.

Fig. 2 shows the accuracy and training time. We can see that: 1) by using Gibbs sampling without restricting assumptions, gSLDA performs better than the variational vSLDA that uses strict mean-field approximation; 2) due to the imbalance between the single supervision and a large set of word counts, gSLDA doesn't outperform the decoupled approach, gLDA+SVM; and 3) if we increase the value of the regularization constant c, supervision information can be better captured to infer predictive topic representations, and gSLDA+ performs much better than gSLDA. In fact, gSLDA+ is even better than the MedLDA that uses mean-field approximation, while being comparable with the MedLDA using collapsed Gibbs sampling. Finally, we should note that the improvement in accuracy might be due to the different strategies for building the multi-class classifiers. But given the performance gain in the binary task, we believe that the Gibbs sampling algorithm without factorization assumptions is the main factor for the improved performance.


[Figure 2: Multi-class classification. Panels: (a) accuracy, (b) training time (seconds, log-scale), each plotted against the number of topics, for gSLDA, gSLDA+, vSLDA, vMedLDA, gMedLDA and gLDA+SVM, with parallel-gSLDA and parallel-gSLDA+ additionally shown in (b).]

Table 1: Split of training time over various steps. Numbers in parentheses are percentages of the total training time.

        SAMPLE λ            SAMPLE η         SAMPLE Z
K=20    2841.67 (65.80%)     7.70 (0.18%)    1455.25 (34.02%)
K=30    2417.95 (56.10%)    10.34 (0.24%)    1888.78 (43.66%)
K=40    2393.77 (49.00%)    14.66 (0.30%)    2476.82 (50.70%)
K=50    2161.09 (43.67%)    16.33 (0.33%)    2771.26 (56.00%)

For training time, gSLDA models are about 10 times faster than the variational vSLDA. Table 1 shows in detail the percentages of the training time (see the numbers in brackets) spent at each sampling step for gSLDA+. We can see that: 1) sampling the global variables η is very efficient, while sampling the local variables (λ, Z) is much more expensive; and 2) sampling λ is relatively stable as K increases, while sampling Z takes more time as K becomes larger. But the good news is that our Gibbs sampling algorithm can be easily parallelized to speed up the sampling of local variables, following similar architectures as in LDA.

A Parallel Implementation: GraphLab is a graph-based programming framework for parallel computing (Gonzalez et al., 2012). It provides a high-level abstraction of parallel tasks by expressing data dependencies with a distributed graph. GraphLab implements a GAS (gather, apply, scatter) model, where the data required to compute a vertex (edge) are gathered along its neighboring components, and modification of a vertex (edge) will trigger its adjacent components to recompute their values. Since GAS has been successfully applied to several machine learning algorithms⁵, including Gibbs sampling of LDA, we choose it as a preliminary attempt to parallelize our Gibbs sampling algorithm. A systematic investigation of parallel computation with various architectures is interesting, but beyond the scope of this paper.

For our task, since there is no coupling among the 20 binary gSLDA classifiers, we can learn them in parallel. This suggests an efficient hybrid multi-core/multi-machine implementation, which can avoid the time consumption of IPC (i.e., inter-process communication). Namely, we run our experiments on a cluster with 20 nodes, where each node is equipped with two 6-core CPUs (2.93GHz). Each node is responsible for learning one binary gSLDA classifier with a parallel implementation on its 12 cores. For each binary gSLDA model, we construct a bipartite graph connecting training documents with the corresponding terms. The graph works as follows: 1) the edges contain the token counts and topic assignments; 2) the vertices contain individual topic counts and the augmented variables λ; 3) the global topic counts and η are aggregated from the vertices periodically, and the topic assignments and λ are sampled asynchronously during the GAS phases. Once started, sampling and signaling will propagate over the graph. One thing to note is that since we cannot directly measure the number of iterations of an asynchronous model, we estimate it with the total number of topic samplings, which is again aggregated periodically, divided by the number of tokens. We denote the parallel models by parallel-gSLDA (c = 1) and parallel-gSLDA+ (c = 256). From Fig. 2 (b), we can see that the parallel gSLDA models are about two orders of magnitude faster than their sequential counterparts, which is very promising. Also, the prediction performance is not sacrificed, as we shall see in Fig. 4.

⁵http://docs.graphlab.org/toolkits.html

4.3 Sensitivity analysis

Burn-In: Fig. 3 shows the performance of gSLDA+ with different numbers of burn-in steps for binary classification. When M = 0 (see the left-most points), the models are built on random topic assignments. We can see that the classification performance increases quickly and converges to the stable optimum with about 20 burn-in steps. The training time increases about linearly in general when using more burn-in steps. Moreover, the training time increases linearly as K increases. In the previous experiments, we set M = 100.

Fig. 4 shows the performance of gSLDA+ and its parallel implementation (i.e., parallel-gSLDA+) for multi-class classification with different numbers of burn-in steps. We can see that when the number of burn-in steps is larger than 20, the performance of gSLDA+ is quite stable. Again, in the log-log scale, since the slopes of the lines in Fig. 4 (b) are close to the constant 1, the training time grows about linearly as the number of burn-in steps increases.


[Figure 3: Performance of gSLDA+ with different burn-in steps for binary classification. Panels: (a) train/test accuracy, (b) training time (seconds), with curves for K = 5, 10, 20. The left-most points are for the settings with no burn-in.]

[Figure 4: Performance of gSLDA+ and parallel-gSLDA+ with different burn-in steps for multi-class classification. Panels: (a) accuracy, (b) training time (seconds, log-scale), with curves for K = 20, 30, 40, 50. The left-most points are for the settings with no burn-in.]

Even when we use 40 or 60 burn-in steps, the training time is still competitive compared with the variational vSLDA. For parallel-gSLDA+ using GraphLab, the training is consistently about two orders of magnitude faster. Meanwhile, the classification performance is also comparable with that of gSLDA+ when the number of burn-in steps is larger than 40. In the previous experiments, we have set M = 40 for both gSLDA+ and parallel-gSLDA+.

Regularization constant c: Fig. 5 shows the performance of gSLDA in the binary classification task with different c values. We can see that in a wide range, e.g., from 9 to 100, the performance is quite stable for all three K values. But for the standard sLDA model, i.e., c = 1, both the training accuracy and the test accuracy are low, which indicates that sLDA doesn't fit the supervision data well. When c becomes larger, the training accuracy gets higher, but the model doesn't seem to over-fit and the generalization performance is stable. In the above experiments, we set c = 25. For multi-class classification, we have similar observations and set c = 256 in the previous experiments.

Dirichlet prior α: Fig. 6 shows the performance of gSLDA on the binary task with different α values. We report two cases, with c = 1 and c = 9. We can see that the performance is quite stable in a wide range of α values, e.g., from 0.1 to 10.

[Figure 5: Performance of gSLDA for binary classification with different c values. Panels: (a) train/test accuracy, (b) training time (seconds), with curves for K = 5, 10, 20.]

[Figure 6: Accuracy of gSLDA for binary classification with different α values in two settings with c = 1 and c = 9, with curves for K = 5, 10, 15, 20.]

We also note that the change of α does not affect the training time much.

5 Conclusions and Discussions

We present two improvements to Bayesian logistic supervised topic models, namely, a general formulation by introducing a regularization parameter to avoid model imbalance, and a highly efficient Gibbs sampling algorithm without restricting assumptions on the posterior distributions by exploring the idea of data augmentation. The algorithm can also be parallelized. Empirical results for both binary and multi-class classification demonstrate significant improvements over the existing logistic supervised topic models. Our preliminary results with GraphLab have shown promise in parallelizing the Gibbs sampling algorithm.

For future work, we plan to carry out more careful investigations, e.g., using various distributed architectures (Ahmed et al., 2012; Newman et al., 2009; Smola and Narayanamurthy, 2010), to make the sampling algorithm highly scalable to deal with massive data corpora. Moreover, the data augmentation technique can be applied to deal with other types of response variables, such as count data with a negative-binomial likelihood (Polson et al., 2012).

Acknowledgments

This work is supported by National Key Foundation R&D Projects (Nos. 2013CB329403, 2012CB316301), Tsinghua Initiative Scientific Research Program No. 20121088071, Tsinghua National Laboratory for Information Science and Technology, and the 221 Basic Research Plan for Young Faculties at Tsinghua University.

References

A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. 2012. Scalable inference in latent variable models. In International Conference on Web Search and Data Mining (WSDM).

D.M. Blei and J.D. McAuliffe. 2010. Supervised topic models. arXiv:1003.0783v1.

D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.

M. Chen, J. Ibrahim, and C. Yiannoutsos. 1999. Prior elicitation, variable selection and Bayesian computation for logistic regression models. Journal of the Royal Statistical Society, Ser. B, (61):223–242.

P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. 2009. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning (ICML), pages 353–360.

A. Globerson, T. Koo, X. Carreras, and M. Collins. 2007. Exponentiated gradient algorithms for log-linear structured prediction. In ICML, pages 305–312.

J.E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI).

T.L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences (PNAS), pages 5228–5235.

Y. Halpern, S. Horng, L. Nathanson, N. Shapiro, and D. Sontag. 2012. A comparison of dimensionality reduction techniques for unstructured clinical text. In ICML 2012 Workshop on Clinical Data Analysis.

C. Holmes and L. Held. 2006. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–168.

Q. Jiang, J. Zhu, M. Sun, and E.P. Xing. 2012. Monte Carlo methods for maximum margin supervised topic models. In Advances in Neural Information Processing Systems (NIPS).

T. Joachims. 1999. Making large-scale SVM learning practical. MIT Press.

S. Lacoste-Julien, F. Sha, and M.I. Jordan. 2009. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems (NIPS), pages 897–904.

Y. Lin. 2001. A note on margin-based loss functions in classification. Technical Report No. 1044, University of Wisconsin.

D. McAllester. 2003. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21.

M. Meyer and P. Laud. 2002. Predictive variable selection in generalized linear models. Journal of the American Statistical Association, 97(459):859–871.

D. Newman, A. Asuncion, P. Smyth, and M. Welling. 2009. Distributed algorithms for topic models. Journal of Machine Learning Research (JMLR), (10):1801–1828.

N.G. Polson, J.G. Scott, and J. Windle. 2012. Bayesian inference for logistic models using Polya-Gamma latent variables. arXiv:1205.0310v1.

R. Rifkin and A. Klautau. 2004. In defense of one-vs-all classification. Journal of Machine Learning Research (JMLR), (5):101–141.

L. Rosasco, E. De Vito, A. Caponnetto, M. Piana, and A. Verri. 2004. Are loss functions all the same? Neural Computation, (16):1063–1076.

A. Smola and S. Narayanamurthy. 2010. An architecture for parallel topic models. Very Large Data Bases (VLDB), 3(1-2):703–710.

M.A. Tanner and W.-H. Wong. 1987. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association (JASA), 82(398):528–540.

D. van Dyk and X. Meng. 2001. The art of data augmentation. Journal of Computational and Graphical Statistics (JCGS), 10(1):1–50.

C. Wang, D.M. Blei, and F.F. Li. 2009. Simultaneous image classification and annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

J. Zhu, N. Chen, and E.P. Xing. 2011. Infinite latent SVM for classification and multi-task learning. In Advances in Neural Information Processing Systems (NIPS), pages 1620–1628.

J. Zhu, A. Ahmed, and E.P. Xing. 2012. MedLDA: maximum margin supervised topic models. Journal of Machine Learning Research (JMLR), (13):2237–2278.

J. Zhu, N. Chen, H. Perkins, and B. Zhang. 2013a. Gibbs max-margin topic models with fast sampling algorithms. In International Conference on Machine Learning (ICML).

J. Zhu, N. Chen, and E.P. Xing. 2013b. Bayesian inference with posterior regularization and applications to infinite latent SVMs. arXiv:1210.1766v2.
