Deep Topic Models for Multi-label Learning

Rajat Panda∗†   Ankit Pensia∗†   Nikhil Mehta   Mingyuan Zhou   Piyush Rai
Goldman Sachs   UW-Madison   Duke University   UT-Austin   IIT Kanpur

Abstract

We present a probabilistic framework for multi-label learning based on a deep generative model for the binary label vector associated with each observation. Our generative model learns deep multi-layer latent embeddings of the binary label vector, which are conditioned on the input features of the observation. The model also has an interesting interpretation in terms of a deep topic model, with each label vector representing a bag-of-words document, with the input features being its meta-data. In addition to capturing the structural properties of the label space (e.g., a near-low-rank label matrix), the model also offers a clean, geometric interpretation. In particular, the nonlinear classification boundaries learned by the model can be seen as the union of multiple convex polytopes. Our model admits simple and scalable inference via efficient Gibbs sampling or an EM algorithm. We compare our model with state-of-the-art baselines for multi-label learning on benchmark data sets, and also report some interesting qualitative results.

1 INTRODUCTION

Multi-label learning (Gibaja and Ventura, 2014, 2015) refers to the problem of annotating an object with a subset of relevant labels from a potentially massive set of labels. Multi-label learning problems in modern-day applications such as recommender systems involve very high-dimensional feature and label spaces, and are also dubbed "extreme" multi-label learning problems (Babbar and Schölkopf, 2017; Jain et al., 2016; Prabhu and Varma, 2014). Since, in such problem settings, the labels tend to be related to each other, the naïve approach of learning a classifier for each label independently is usually sub-optimal, and suitable structural assumptions need to be imposed, e.g., learning an embedding of the label matrix with a low-rank (Rai et al., 2015; Yu et al., 2014) or near-low-rank (Bhatia et al., 2015; Xu et al., 2016) assumption. Other notable approaches for multi-label learning include tree-based methods (Jain et al., 2016; Prabhu and Varma, 2014) and carefully regularized one-vs-all approaches (Babbar and Schölkopf, 2017). We provide a more detailed discussion of prior work on multi-label learning in the Related Work section.

∗ Equal contribution. † Majority of this work was done when Rajat and Ankit were at IIT Kanpur.

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

While some of the recently proposed approaches to multi-label learning have led to impressive results on benchmark data sets, these methods still have some key limitations. In particular, most of the existing methods are based on linear models (e.g., linear embeddings of label vectors, or linear classifiers for label prediction), which may not be sufficiently expressive to learn complex classification boundaries if the underlying problem requires nonlinear classifiers. Moreover, most of the existing methods are non-probabilistic in nature, and lack a proper generative model for the data, which may be useful in exploiting the rich structure in the high-dimensional features and label vectors. A probabilistic, generative framework is also appealing since it opens the door to extensions such as semi-supervised learning (Kingma et al., 2014) or active learning (Kapoor et al., 2012; Vasisht et al., 2014).

With these desiderata as our motivation, we present a probabilistic, Bayesian framework based on a deep generative model for the binary label vectors. Our generative model is built using a Bernoulli-Poisson likelihood model (Rai et al., 2015; Zhou, 2015) for each entry of the label vector and constructs a deep hierarchy of gamma-distributed, non-negative, interpretable, low-dimensional embeddings for the label vector. This construction is appealing for several reasons: (1) The deep generative model enables learning nonlinear embeddings of the label vector; (2) The model can be seen as learning a set of topics over the label space, which implicitly amounts to learning an (overlapping) clustering of the labels, thereby imposing an additional structural assumption in our multi-label learning model; (3) As we show, the resulting classification rule for our model can be seen as learning a nonlinear classifier for each label, with the decision boundaries having an intuitive geometric interpretation; (4) The model admits efficient inference via Gibbs sampling or expectation maximization, with a computational cost that scales in the number of nonzeros in the label matrix.

In our experiments, our model compares favorably with state-of-the-art methods and outperforms these methods especially on difficult multi-label learning tasks for which linear models do not suffice. Moreover, our model can also be used as a scalable deep topic model for document collections, where each document also has some associated meta-data given in the form of arbitrary feature vectors (Mimno and McCallum, 2008).

2 DEEP GENERATIVE MODEL FOR NONLINEAR MULTI-LABEL LEARNING

In multi-label learning, we assume that we are given N training examples {(x_1, y_1), ..., (x_N, y_N)}, with x_n ∈ R^D and y_n = [y_{1,n}, ..., y_{L,n}] ∈ {0,1}^L, n = 1, ..., N. The goal is to learn a model that can predict the label vector y_* ∈ {0,1}^L for a new test input x_* ∈ R^D. Often, instead of the actual binary labels, we are just interested in predicting the relative scores for the L labels (or the probability of each binary label y_{ℓ,*}, ℓ = 1, ..., L, being equal to 1).

We assume that each observation (x_n, y_n) is associated with a deep hierarchy of latent factors u^{(t)}_n ∈ R^{K_t}_+, t = 1, ..., T, which generates y_n as follows: the layer-1 latent factor u^{(1)}_n generates an L-dimensional latent count vector m_n ∈ Z^L_+ from a Poisson distribution, and we then generate the L × 1 binary label vector y_n via a thresholding operation. These two steps can be written as

m_n \mid \xi_n \sim \mathrm{Poisson}(\xi_n) \quad \text{and} \quad y_n \mid m_n = \mathbb{I}[m_n > 0] \qquad (1)

where I[m > 0] is equal to 1 if m > 0 and 0 otherwise. The Poisson rate parameter ξ_n is defined as ξ_n = Σ_{k_1=1}^{K_1} λ^{(1)}_{k_1} u^{(1)}_{n,k_1} v^{(1)}_{k_1} = V^{(1)} Λ^{(1)} u^{(1)}_n, which is a weighted combination of K_1 "basis vectors" V^{(1)} = [v^{(1)}_1, ..., v^{(1)}_{K_1}], where V^{(1)} ∈ R^{L×K_1}_+. Note that Poisson(·) and I[·] in Eq. 1 denote point-wise operations defined on vectors, i.e., m_{ℓ,n} ∼ Poisson(ξ_{ℓ,n}) and y_{ℓ,n} = I[m_{ℓ,n} > 0]. The likelihood model given by Eq. (1) is known as the Bernoulli-Poisson (BP) link (Rai et al., 2015; Zhou, 2015) and is preferred over the commonly used logistic/probit likelihood since it only requires evaluating the likelihood for the nonzero entries in y_n (Rai et al., 2015; Zhou, 2015). This is especially appealing for multi-label learning problems since the label vector tends to be highly sparse. Also note that, for the BP likelihood, we can write y_{ℓ,n} | ξ_{ℓ,n} ∼ Bernoulli[1 − exp(−ξ_{ℓ,n})].

We can view V^{(1)} = [v^{(1)}_1, ..., v^{(1)}_{K_1}] ∈ R^{L×K_1}_+ as a set of K_1 topics, with each of its columns representing an (unnormalized) distribution over the L labels. As we will show later, our model actually learns a deep hierarchy of topics, with V^{(1)} denoting the set of layer-1 topics.
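As a concrete illustration of the BP link, the following is a minimal numpy sketch of Eq. (1) for a single observation; the sizes and variable names (V1, lam1, u1) are illustrative assumptions, not the paper's settings or released code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's settings).
L, K1 = 6, 3
V1 = rng.gamma(1.0, 1.0, size=(L, K1))   # layer-1 "topics" V^(1) over the L labels
lam1 = rng.gamma(1.0, 1.0, size=K1)      # diagonal of Lambda^(1)
u1 = rng.gamma(1.0, 1.0, size=K1)        # layer-1 latent factor u^(1)_n for one example

# Eq. (1): Poisson rate, latent counts, thresholded binary labels.
xi = V1 @ (lam1 * u1)                    # xi_n = V^(1) Lambda^(1) u^(1)_n
m = rng.poisson(xi)                      # m_{l,n} ~ Poisson(xi_{l,n}), element-wise
y = (m > 0).astype(int)                  # y_{l,n} = I[m_{l,n} > 0]

# Marginally, y_{l,n} ~ Bernoulli(1 - exp(-xi_{l,n})); BP log-likelihood of y:
loglik = np.sum(np.log1p(-np.exp(-xi[y == 1]))) - np.sum(xi[y == 0])
```

In the augmented inference of Section 4, latent counts need to be instantiated only for the nonzero entries of y, which is where the computational savings from label sparsity come from.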

Fig. 1 shows the graphical model in plate notation. For brevity, we do not show the global parameters V^{(1)}, Λ^{(1)} that the latent count vector m_n depends on, as well as the other global parameters that the latent factors u^{(t)}_n ∈ R^{K_t}_+, t = 1, ..., T, in the deep hierarchy depend on (we collectively denote those by {β^{(t)}}_{t=1}^{T}). We also condition the latent factors {u^{(t)}_n}_{t=1}^{T} on the input features x_n ∈ R^D.

We now describe the rest of the generative model, which we showed concisely in Fig. 1. First note that the latent factors {u^{(t)}_n}_{t=1}^{T} form a deep hierarchy with the following generative story:

u^{(T)}_{n,k_T} \sim \mathrm{Gamma}\big(r_{k_T},\; \exp(x_n' w^{(T+1)}_{k_T})\big), \qquad k_T = 1,\ldots,K_T \qquad (2)

\ldots

u^{(t)}_{n,k_t} \sim \mathrm{Gamma}\big(V^{(t+1)}_{k_t,:}\,\Lambda^{(t+1)} u^{(t+1)}_n,\; \exp(x_n' w^{(t+1)}_{k_t})\big), \qquad k_t = 1,\ldots,K_t \qquad (3)

\ldots

u^{(1)}_{n,k_1} \sim \mathrm{Gamma}\big(V^{(2)}_{k_1,:}\,\Lambda^{(2)} u^{(2)}_n,\; \exp(x_n' w^{(2)}_{k_1})\big), \qquad k_1 = 1,\ldots,K_1 \qquad (4)

Note that, in the above generative model, the shape parameter of the gamma prior on each component of the latent factor u^{(t)}_n, for layers t = 1, ..., T−1, depends on u^{(t+1)}_n, the latent factors of the layer above. The gamma scale parameter is a function of the inputs x_n via a regression model with weight vectors w^{(t+1)}_{k_t} ∈ R^D, k_t = 1, ..., K_t. For the top layer T, the shape parameters are given by r_{k_T} ∈ R_+, k_T = 1, ..., K_T. Also note that V^{(t+1)}_{k_t,:} is a row vector of size K_{t+1}, and Λ^{(t+1)} is a diagonal matrix of size K_{t+1} × K_{t+1}. We assume the following priors on these parameters:

\lambda^{(t)}_{k_t} \sim \mathrm{Gamma}\big(a^{(t)}_\lambda,\; 1/b^{(t)}_\lambda\big), \qquad k_t = 1,\ldots,K_t \qquad (5)

v^{(t)}_{k_{t-1},k_t} \sim \mathrm{Gamma}\big(a^{(t)}_v,\; 1/b^{(t)}_v\big), \qquad k_{t-1} = 1,\ldots,K_{t-1},\; k_t = 1,\ldots,K_t \qquad (6)

Figure 1: (a) Our generative model in plate notation. Note: some of the global variables are not shown for brevity; {β^{(t)}}_{t=1}^{T} collectively denote the global parameters that {u^{(t)}_n}_{t=1}^{T} depend on. The latent factors are also conditioned on the input features x_n (explained in more detail when we describe the full generative model). (b) Visualization as a Bayesian network.

Each of the regression weight vectors w^{(t+1)}_{k_t} ∈ R^D is given a Gaussian prior with zero mean and a diagonal precision matrix Γ^{(t+1)}. We assume a gamma prior on the diagonal entries of the precision matrix, which allows us to learn the relevance of different features in the input x_n.

We also place gamma priors on each of the gamma shape parameters r_{k_T} needed at the topmost layer. Note that the deep stacked construction of the latent factors is similar in spirit to other deep Bayesian generative models, such as deep exponential families (Ranganath et al., 2015) and gamma belief networks (Zhou et al., 2015), that define a deep hierarchy of latent variables using a top-down generative model.
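To summarize the generative story, here is a minimal sketch of ancestral sampling through Eqs. (2)-(4) followed by the BP link of Eq. (1) for a single input; the depth, layer sizes, hyperparameter values, and variable names are illustrative assumptions rather than the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (not the paper's): L labels, pyramid K_1 > K_2, depth T = 2.
L, D, T = 8, 5, 2
K = {0: L, 1: 4, 2: 2}

x = rng.normal(size=D)                              # input features x_n
r = rng.gamma(1.0, 1.0, size=K[T])                  # top-layer gamma shapes r_{k_T}
V = {t: rng.gamma(1.0, 1.0, size=(K[t - 1], K[t])) for t in range(1, T + 1)}   # V^(t)
lam = {t: rng.gamma(1.0, 1.0, size=K[t]) for t in range(1, T + 1)}             # diag(Lambda^(t))
W = {t: 0.1 * rng.normal(size=(K[t], D)) for t in range(1, T + 1)}             # row k_t is w^(t+1)_{k_t}

u = {}
# Eq. (2): top layer, shape r_{k_T}, scale exp(x' w^(T+1)_{k_T}).
u[T] = rng.gamma(r, np.exp(W[T] @ x))
# Eq. (3): lower layers, shape V^(t+1)_{k_t,:} Lambda^(t+1) u^(t+1)_n, scale exp(x' w^(t+1)_{k_t}).
for t in range(T - 1, 0, -1):
    u[t] = rng.gamma(V[t + 1] @ (lam[t + 1] * u[t + 1]), np.exp(W[t] @ x))
# Eq. (1): Bernoulli-Poisson link at the label layer.
xi = V[1] @ (lam[1] * u[1])
y = (rng.poisson(xi) > 0).astype(int)
```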

The proposed deep hierarchy (Eqs. (2)-(4)) of latent factors that eventually generates the binary label vector y_n via Eq. (1) is appealing as it enables learning a nonlinear embedding of y_n. This is in contrast with most of the existing methods, which can only learn a linear embedding (Rai et al., 2015; Yu et al., 2014).

In addition to learning a nonlinear embedding via a deep hierarchy of latent factors, our model has several other distinguishing properties: (1) it allows conditioning the latent factors in each layer on arbitrary features (which in our case are x_n, since our goal is to learn a multi-label learning model); and (2) the overall model has a nice geometric interpretation (Sec. 3) in terms of defining nonlinear classification boundaries, which can be seen as the union of multiple convex polytopes.

For the architectural choice of the deep network, we choose K_t ≪ K_{t−1}, which leads to a pyramid-like structure for the deep network. The pyramid-like structure of the network ensures low intrinsic dimensionality of the label vectors and, at the same time, leads to parsimony in the number of model parameters to be learned, while still being expressive enough.

Another appealing aspect of our framework is that it is not limited to multi-label learning. Note that the deep generative construction for the label vectors can also be seen as learning a (deep) topic model over a document collection, where the documents are also associated with arbitrary meta-data (Mimno and McCallum, 2008). To see this connection, note that the label vector y_n ∈ {0,1}^L can be thought of as a bag-of-words representation of a document (the labels being the words) and x_n ∈ R^D can be thought of as additional features or "meta-data" associated with this document. Our generative model can leverage both the document as well as the meta-data to infer a deep hierarchy of topics and a deep topic-based representation {u^{(t)}_n}_{t=1}^{T} of each document y_n. The hierarchy of topics is given by the matrices {V^{(t)}}_{t=1}^{T}, with V^{(t)} ∈ R^{K_{t−1}×K_t}_+ representing a set of K_t topics, each topic being a distribution over K_{t−1} "tokens". At layer 1 (K_0 = L), these tokens are the L labels themselves; at higher layers, the tokens are the topics of the layer below, so each layer-t topic acts as a "super-topic". Deep topic models have been explored recently using the idea of gamma belief networks (Zhou et al., 2015). However, a key difference is that our model is able to additionally condition on the document meta-data x_n via a regression model.

3 MODEL PROPERTIES AND GEOMETRIC INTERPRETATION

Our deep generative framework for multi-label learning has several appealing properties. Firstly, predicting the labels for a test example x does not require inferring its latent factors {u^{(t)}}_{t=1}^{T}. Instead, the expression for the label probabilities can be obtained in closed form, and it only depends on the test input x and the global parameters of our model. This leads to faster inference at test time.

To elaborate on this, note that, given a new input x and the global parameters, we can easily marginalize out the local variable u^{(t)}_{n,k_t} of every node in the deep hierarchy, in a recursive fashion, by using the moment-generating function of the gamma distribution (Rai et al., 2015; Zhou, 2016). The marginalization scheme associates a confidence score q^{(t)}_{k_t,l} with every node u^{(t)}_{n,k_t} that depends on all its children nodes and the corresponding outgoing edges. The confidence score for a node in the first layer (t = 1) is given by

q^{(1)}_{k_1,l} = \log\big(1 + \lambda^{(1)}_{k_1} v^{(1)}_{l,k_1} \exp(x' w^{(2)}_{k_1})\big) \qquad (7)

The confidence scores of the nodes in the upper layers (t > 1) can then be defined recursively as

q^{(t)}_{k_t,l} = \log\Big[1 + \lambda^{(t)}_{k_t} \sum_{k_{t-1}=1}^{K_{t-1}} v^{(t)}_{k_{t-1},k_t}\, q^{(t-1)}_{k_{t-1},l} \times \exp\big(x' w^{(t+1)}_{k_t}\big)\Big] \qquad (8)

The predictive distribution of the l-th label in the label vector y is then calculated by aggregating the confidence scores of all the K_T nodes in the top layer as

P(y_l = 1 \mid x) = 1 - \exp\Big(-\sum_{k_T=1}^{K_T} r_{k_T}\, q^{(T)}_{k_T,l}\Big) \qquad (9)

For the single-layer case (T = 1), the above expression is equivalent to the predictive distribution of the Bernoulli-Poisson multi-label learning model from (Rai et al., 2015). However, in the multi-layer construction of our model, Eq. (9) also has an interesting geometric interpretation. It can be observed from the above expression that if there exists at least one confident path from the top layer to the label l, then the probability of this label being 1 will be large. A confident path is essentially a sequence of nodes [l, k'_1, k'_2, ..., k'_T, r_{k'_T}] in the deep hierarchy such that the model is confident at each step t, that is, the edge weight (λ_{k'_{t+1}} v_{k'_t,k'_{t+1}}) as well as the regression-model-based score (x' w^{(t+1)}_{k'_t}) is sufficiently large. Consequently, the region of the input space satisfying x' w^{(t+1)}_{k'_t} > c_t ∀ t, for layer-specific thresholds c_t, is a convex-polytope-like region of the feature space that encloses positive labels.
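The test-time prediction rule of Eqs. (7)-(9) is just a bottom-up pass over the learned global parameters; below is an illustrative numpy sketch reusing the V, lam, W, r layout of the sampling sketch above (the function name and array shapes are our assumptions).

```python
import numpy as np

def predict_label_probs(x, V, lam, W, r, T):
    """Closed-form P(y_l = 1 | x) from Eqs. (7)-(9) for a single test input x.

    V[t]: (K_{t-1}, K_t) with K_0 = L; lam[t]: (K_t,) diagonal of Lambda^(t);
    W[t]: (K_t, D) whose row k_t is w^(t+1)_{k_t}; r: (K_T,) top-layer shapes.
    """
    gate = {t: np.exp(W[t] @ x) for t in range(1, T + 1)}     # exp(x' w^(t+1)_{k_t})
    # Eq. (7): layer-1 scores, q[k_1, l] over all labels l.
    q = np.log1p(lam[1][:, None] * V[1].T * gate[1][:, None])
    # Eq. (8): recurse upwards; at layer t, q has shape (K_t, L).
    for t in range(2, T + 1):
        q = np.log1p(lam[t][:, None] * (V[t].T @ q) * gate[t][:, None])
    # Eq. (9): aggregate the K_T top-layer scores for each label.
    return 1.0 - np.exp(-(r @ q))
```

Calling predict_label_probs(x, V, lam, W, r, T) then returns the length-L vector of label probabilities; for T = 1 the loop is skipped and the expression reduces to the single-layer predictive distribution mentioned above.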

Another interesting geometric interpretation, and an explanation of our model's ability to learn highly nonlinear classification boundaries, comes from the fact that the model can be seen as a union of (∏_{t=0}^{T} K_t) committees of T experts, where experts are shared across committees. Moreover, each expert uses a linear hyperplane to generate a score. For predicting a specific label, the model uses a union of (∏_{t=1}^{T} K_t) committees of experts. All the experts of at least one committee must agree simultaneously to generate a high probability of success, and therefore each committee resembles a convex polytope-like region of positive labels. Note that all the labels share (∏_{t=2}^{T} K_t) experts in their committees. Moreover, the K_1 hyperplanes are shifted in a parallel fashion to generate K_0 · K_1 label-specific experts. Sharing experts leads to a highly flexible model without increasing the number of parameters.

4 INFERENCE

Although our deep generative model is not natively conjugate, we are able to leverage auxiliary-variable augmentation techniques (Rai et al., 2015; Zhou, 2016; Zhou et al., 2015) to obtain a model that is locally conjugate and admits simple inference via closed-form Gibbs sampling or expectation-maximization. For each training example (x_n, y_n), we augment latent counts m^{(t+1)}_{n,k_t,k_{t+1}} corresponding to each edge in the deep hierarchy (see Fig. 1b). Using these latent counts, we associate two latent counts with each node u^{(t)}_{n,k_t}: (1) an outgoing latent count m^{(t)}_{n,·,k_t}; and (2) an incoming latent count m^{(t+1)}_{n,k_t,·}. The outgoing latent count is defined as the sum of the latent counts associated with all the outgoing edges from that node, m^{(t)}_{n,·,k_t} = Σ_{k_{t−1}=1}^{K_{t−1}} m^{(t)}_{n,k_{t−1},k_t}. Similarly, the incoming latent count associated with node u^{(t)}_{n,k_t}, that is m^{(t+1)}_{n,k_t,·}, is defined as Σ_{k_{t+1}=1}^{K_{t+1}} m^{(t+1)}_{n,k_t,k_{t+1}}. The latent count m^{(1)}_{n,l} described in Eq. 1 can be thought of as an incoming latent count m^{(1)}_{n,k_0,·} for a node k_0 of the 0th layer (i.e., the observed label vector layer).

The latent counts m^{(1)}_{n,k_0,·} are drawn from a truncated Poisson distribution as

(m^{(1)}_{n,k_0,\cdot} \mid -) \sim y_{k_0,n} \cdot \mathrm{Pois}_{+}\Big(\sum_{k_1} \lambda^{(1)}_{k_1} v^{(1)}_{k_0,k_1} u^{(1)}_{n,k_1}\Big) \qquad (10)

The latent counts of the successive layer can be computed recursively: in particular, given m^{(t+1)}_{n,k_t,·} for a layer t, we can obtain m^{(t+1)}_{n,k_t,k_{t+1}} using the Poisson–multinomial equivalence (Zhou et al., 2015):

\big(\ldots, m^{(t+1)}_{n,k_t,k_{t+1}}, \ldots\big) \sim \mathrm{Mult}\left(m^{(t+1)}_{n,k_t,\cdot} \;\middle|\; \ldots, \frac{\lambda^{(t+1)}_{k_{t+1}}\, v^{(t+1)}_{k_t,k_{t+1}}\, u^{(t+1)}_{n,k_{t+1}}}{\sum_{k_{t+1}} \lambda^{(t+1)}_{k_{t+1}}\, v^{(t+1)}_{k_t,k_{t+1}}\, u^{(t+1)}_{n,k_{t+1}}}, \ldots\right) \qquad (11)

Eq. (11) and Eq. (12) hold trivially for t = 0, and they also hold for t > 1 by using the Chinese Restaurant Table–negative binomial (CRT-NB) equivalence together with the Poisson–SumLog equivalence (Zhou and Carin, 2015).

The subsequent m^{(t+1)}_{n,·,k_{t+1}} can be obtained by aggregating the m^{(t+1)}_{n,k_t,k_{t+1}} over k_t. By the additive property of the Poisson distribution, we have

m^{(t+1)}_{n,\cdot,k_{t+1}} \sim \mathrm{Pois}\big(u^{(t+1)}_{n,k_{t+1}}\, q^{(t+1)}_{n,k_{t+1}}\big) \qquad (12)

where q^{(t+1)}_{n,k_{t+1}} can be thought of as a proportionality factor similar to q^{(t)}_{k_t,l} in Eq. (8), and is defined as q^{(t+1)}_{n,k_{t+1}} = λ^{(t+1)}_{k_{t+1}} Σ_{k_t} v^{(t+1)}_{k_t,k_{t+1}} log(1 + q^{(t)}_{n,k_t} exp(x_n' w^{(t+1)}_{k_t})), with q^{(0)}_{n,k_0} = exp(1) − 1 and w^{(1)}_{k_0} = 0.

Once the outgoing latent count m^{(t)}_{n,·,k_t} of a node is available, its incoming latent count m^{(t+1)}_{n,k_t,·} can be calculated as m^{(t+1)}_{n,k_t,·} ∼ CRT(m^{(t)}_{n,·,k_t}, r^{(t)}_{n,k_t}), where r^{(t)}_{n,k_t} = Σ_{k_{t+1}=1}^{K_{t+1}} v^{(t+1)}_{k_t,k_{t+1}} λ^{(t+1)}_{k_{t+1}} u^{(t+1)}_{n,k_{t+1}} when t < T, and r^{(t)}_{n,k_t} = r_{k_T} otherwise. This procedure is repeated to find the latent counts at each layer.
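The CRT draws above can be simulated directly from the definition of the Chinese Restaurant Table distribution, under which CRT(n, r) is a sum of n Bernoulli variables; a small sketch (the helper name is ours) follows.

```python
import numpy as np

def sample_crt(n, r, rng=None):
    """Draw l ~ CRT(n, r) as l = sum_{i=1}^{n} Bernoulli(r / (r + i - 1)).

    Here n plays the role of the node's outgoing count m^(t)_{n,.,k_t} and r the
    role of r^(t)_{n,k_t}; the draw gives the incoming count m^(t+1)_{n,k_t,.}.
    """
    rng = rng or np.random.default_rng()
    if n == 0:
        return 0
    i = np.arange(1, n + 1)
    return int(rng.binomial(1, r / (r + i - 1)).sum())
```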

Given these latent counts, we can now leverage Poisson-gamma conjugacy to find the posterior distributions of the local variables u^{(t)}_{n,k_t} and the global variables v^{(t)}_{k_{t−1},k_t} and λ^{(t)}_{k_t}, as follows:

u^{(t)}_{n,k_t} \sim \mathrm{Gamma}\left(r^{(t)}_{n,k_t} + m^{(t)}_{n,\cdot,k_t},\; \frac{\exp(x_n' w^{(t+1)}_{k_t})}{\exp(x_n' w^{(t+1)}_{k_t})\, q^{(t)}_{n,k_t} + 1}\right) \qquad (13)

v^{(t)}_{k_{t-1},k_t} \mid - \sim \mathrm{Gamma}\left(a^{(t)}_v + \sum_{n=1}^{N} m^{(t)}_{n,k_{t-1},k_t},\; \Big(b^{(t)}_v + \lambda^{(t)}_{k_t} \sum_n u^{(t)}_{n,k_t}\, \psi^{(t-1)}_{n,k_{t-1}}\Big)^{-1}\right) \qquad (14)

\lambda^{(t)}_{k_t} \mid - \sim \mathrm{Gamma}\left(a^{(t)}_\lambda + \sum_{n=1}^{N} m^{(t)}_{n,\cdot,k_t},\; \Big(b^{(t)}_\lambda + \sum_{k_{t-1}} \sum_n v^{(t)}_{k_{t-1},k_t}\, u^{(t)}_{n,k_t}\, \psi^{(t-1)}_{n,k_{t-1}}\Big)^{-1}\right) \qquad (15)

where ψ^{(t−1)}_{n,k_{t−1}} = log(1 + q^{(t−1)}_{n,k_{t−1}} exp(x_n' w^{(t)}_{k_{t−1}})).

Although the posterior of the regression weight w^{(t+1)}_{k_t} does not have a closed-form solution, Poisson-gamma marginalization leads to a negative binomial distribution, which can be transformed into a Gaussian likelihood using Pólya-Gamma augmentation (Polson et al., 2013). Given the Pólya-Gamma variables

(\omega^{(t)}_{n,k_t} \mid -) \sim \mathrm{PG}\big(m^{(t)}_{n,\cdot,k_t} + r^{(t)}_{n,k_t},\; x_n' w^{(t+1)}_{k_t} + \ln q^{(t)}_{n,k_t}\big) \qquad (16)

the posterior of w^{(t+1)}_{k_t} can be written as N(μ^{(t+1)}_{w_{k_t}}, Σ^{(t+1)}_{w_{k_t}}), where

\Sigma^{(t+1)}_{w_{k_t}} = \left((\Gamma^{(t+1)})^{-1} + \sum_{n=1}^{N} \omega^{(t)}_{n,k_t}\, x_n x_n'\right)^{-1} \qquad (17)

and

\mu^{(t+1)}_{w_{k_t}} = \Sigma^{(t+1)}_{w_{k_t}} \left[\sum_{n=1}^{N} \Big(-\omega^{(t)}_{n,k_t} \ln q^{(t)}_{n,k_t} + 0.5\,\big(m^{(t)}_{n,\cdot,k_t} - r^{(t)}_{n,k_t}\big)\Big)\, x_n\right] \qquad (18)

We omit the equations for hyperparameter inferencefor the sake of brevity.
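For concreteness, here is a sketch of the w update implied by Eqs. (16)-(18), in the EM flavor used in our experiments, where the Pólya-Gamma variable is replaced by its conditional expectation E[PG(b, c)] = (b / (2c)) tanh(c / 2); the function signature and array layout are our own illustrative choices, not the paper's code.

```python
import numpy as np

def update_w(X, m_out, r, log_q, w, prior_prec):
    """One closed-form update of w^(t+1)_{k_t} following Eqs. (16)-(18).

    X: (N, D) inputs; m_out: (N,) outgoing counts m^(t)_{n,.,k_t}; r: (N,) shapes
    r^(t)_{n,k_t}; log_q: (N,) values ln q^(t)_{n,k_t}; w: (D,) current weights;
    prior_prec: (D, D) prior precision term appearing in Eq. (17).
    EM-style: omega is set to its Polya-Gamma conditional mean rather than sampled.
    """
    b = m_out + r
    c = X @ w + log_q
    omega = np.where(np.abs(c) < 1e-8, b / 4.0, (b / (2.0 * c)) * np.tanh(c / 2.0))
    Sigma = np.linalg.inv(prior_prec + X.T @ (omega[:, None] * X))   # Eq. (17)
    mu = Sigma @ (X.T @ (-omega * log_q + 0.5 * (m_out - r)))        # Eq. (18)
    return mu, Sigma
```

In a Gibbs sampler one would instead draw omega from PG(b, c) and then sample w from N(mu, Sigma).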

5 RELATED WORK

A prominent class of methods for multi-label learning has been based on learning low-dimensional embeddings of label vectors (Bhatia et al., 2015; Kapoor et al., 2012; Rai et al., 2015; Yu et al., 2014). These methods perform multi-label learning by assuming that the label vectors have a low-dimensional representation (label embeddings), which amounts to the label matrix being a low-rank matrix. However, these methods are usually limited to learning linear (or locally linear) embeddings. Another way to incorporate nonlinearity in the embeddings is by using kernel-based label embeddings (Li and Guo, 2015). However, kernel-based methods tend to be slow at training as well as test time, due to the requirement of storing all the training data.

Probabilistic/Bayesian models for multi-label learning are relatively fewer. These are usually based on latent factor models for the binary label vectors. However, the inference cost for these models is usually prohibitive (Kapoor et al., 2012), and these models do not scale to large data sets. Since the label vectors are high-dimensional but extremely sparse in nature (i.e., have very few nonzeros), it is desirable to have models whose computational cost scales in the number of nonzeros in the label matrix. In (Rai et al., 2015), a Bayesian latent factor model (BMLPL) based on a Bernoulli-Poisson-gamma generative model was proposed, which exploited the label matrix sparsity to design efficient inference algorithms for multi-label learning. Our model is similar in spirit to the work in (Rai et al., 2015). However, it is significantly more general and considerably more flexible due to the deep generative model on the label vectors. The BMLPL model proposed in (Rai et al., 2015) is a special case of our model when the number of layers is equal to 1. We use this model as one of the baselines in our experiments. In contrast to the model in (Rai et al., 2015), our deep generative model allows learning nonlinear label embeddings and provides an interesting geometric interpretation/explanation for the nonlinearity of our model (in terms of what kind of nonlinear decision boundaries it learns), while still enjoying the simplicity of the inference procedure. The inference cost in our model still scales in the number of nonzeros in the label matrix, which is appealing for multi-label learning problems involving large label matrices.

Among other related work, (Cissé et al., 2016) recently proposed a deep architecture for multi-label learning. This model clusters the labels into a Markov Blanket Chain and then leverages these clusters to train a deep neural network. While this work is also similar in spirit to our model, there are a few key differences: (1) it has a feedforward architecture and lacks a proper generative model for the labels; (2) it cannot leverage the label matrix sparsity; and (3) it lacks the nice geometric interpretability offered by our approach. Moreover, unlike the model in (Cissé et al., 2016), our model can be applied not only to multi-label learning problems, but also to topic modeling tasks where the documents have meta-data associated with them. Recent work has also explored multi-label learning for text documents using text CNNs (Liu et al., 2017). However, to the best of our knowledge, none of the existing deep learning models, including (Cissé et al., 2016; Liu et al., 2017), are based on generative models.

Deep generative models based on the Poisson-gamma latent factor construction (Zhou et al., 2015) have also been proposed for modeling high-dimensional count data (e.g., documents represented as vectors of word counts). These models have also been extended to multimodal settings (Henao et al., 2016). Although the model proposed in (Henao et al., 2016) can be used in the multi-label setting, it is (1) limited to count-valued features, (2) difficult to do inference on at test time, and (3) does not provide any geometric interpretation, unlike our model. Our framework can be seen as a generalization of these models in which we condition the latent factors on arbitrary feature vectors, and the deep architecture naturally leads to a nonlinear classification model from the feature space to the label space. To the best of our knowledge, ours is the first fully Bayesian method, based on a deep generative model, for the problem of multi-label learning that can handle arbitrary types of features, while also leveraging the sparsity of the label matrix for computational speed-ups.

6 Experiments

In this section, we present quantitative and qualitative results for our model. First, inspired by the geometric interpretation of the decision boundaries, we generate a synthetic dataset to show that shallow models cannot compete with deep models on a nonlinear dataset. Then, we benchmark our model against other state-of-the-art models on the benchmark datasets. Moreover, we illustrate the ability of our model to learn a hierarchical topic model over documents.

We have implemented an EM algorithm (Scott and Sun, 2013) to generate the results in the following experiments. We use the fact that simple closed-form expressions for the expectations of the local variables are available, including the Pólya-Gamma and Chinese Restaurant Table random variables. Note that a straightforward implementation of the updates of w^{(t+1)}_{k_t} would require the inversion of a D × D matrix. We avoid this computationally heavy step by computing an approximate solution to the following linear system for w^{(t+1)}_{k_t} using only a few iterations of the conjugate gradient method (Bertsekas, 1999) (see Eqs. (17), (18)):

(\Sigma^{(t+1)}_{w_{k_t}})^{-1}\, w^{(t+1)}_{k_t} = d^{(t+1)}_{k_t}, \quad \text{where} \quad d^{(t+1)}_{k_t} = \sum_{n=1}^{N} \Big(-\omega^{(t)}_{n,k_t} \ln q^{(t)}_{n,k_t} + 0.5\,\big(m^{(t)}_{n,\cdot,k_t} - r^{(t)}_{n,k_t}\big)\Big)\, x_n.

We run our EM-CGS algorithm for 500 iterations, and the algorithm converged in all these cases.
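A sketch of the conjugate-gradient shortcut described above, using scipy's solver and a matrix-free operator so the D × D matrix is never formed or inverted (names, shapes, and the iteration count are illustrative assumptions):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_w_cg(X, omega, d, w_old, prior_prec_diag, n_iters=5):
    """Approximately solve (Sigma^(t+1)_{w_{k_t}})^{-1} w = d^(t+1)_{k_t}.

    X: (N, D) inputs; omega: (N,) Polya-Gamma expectations; d: (D,) right-hand
    side from Eq. (18); w_old: (D,) previous estimate used as a warm start;
    prior_prec_diag: (D,) diagonal prior-precision term from Eq. (17).
    """
    D = X.shape[1]
    # Matrix-vector product with Sigma^-1 in O(ND), using its sum-of-outer-products form.
    matvec = lambda v: prior_prec_diag * v + X.T @ (omega * (X @ v))
    A = LinearOperator((D, D), matvec=matvec, dtype=X.dtype)
    w_new, _ = cg(A, d, x0=w_old, maxiter=n_iters)   # only a few CG iterations
    return w_new
```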

6.1 Synthetic Dataset

We generate a synthetic dataset to validate the claim made in Section 3 that the model learns a union of multiple convex polytopes. With a model of depth T, each of the inferred polytopes will be the intersection of T half-spaces. Therefore, if the region (in the covariate space R^D) corresponding to y_l = 1, for a label l, is the intersection of multiple half-spaces, then the performance of a shallow model will saturate after a particular width. With this idea, we generate synthetic data with L = 10, D = 50, and 5000 instances. The label y_{n,l} is set to 1 if x_n lies in the convex polytope defined by the 3 hyperplanes w^{(3)}_{k_l}, w^{(2)}_{k'_l}, and w^{(1)}_{k''_l}. The hyperplanes are shared across labels such that the numbers of distinct w^{(1)}_k, w^{(2)}_k, and w^{(3)}_k are 4, 2, and 2, respectively. The results are reported in Table 1, which clearly shows that in the absence of a sufficiently deep model that captures the intrinsic dimensionality of the data, the performance suffers adversely.
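A data-generating sketch in the spirit of this construction is given below; the sizes follow the text (L = 10, D = 50, 5000 instances, hyperplane pools of size 4, 2, and 2), while the actual hyperplanes and their assignment to labels are random assumptions, not the ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sizes from the text; everything else is an illustrative assumption.
N, D, L = 5000, 50, 10
X = rng.normal(size=(N, D))

# Shared hyperplane pools: 4 distinct w^(1), 2 distinct w^(2), 2 distinct w^(3).
pools = [rng.normal(size=(4, D)), rng.normal(size=(2, D)), rng.normal(size=(2, D))]
assign = [rng.integers(p.shape[0], size=L) for p in pools]   # one hyperplane per pool per label

# y_{n,l} = 1 iff x_n lies in the intersection of that label's three half-spaces.
Y = np.ones((N, L), dtype=int)
for p, a in zip(pools, assign):
    Y &= (X @ p[a].T > 0).astype(int)
```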

Table 1: Performance on synthetic data with varying depth and width of the deep architecture

Layer sizes [K_1, K_2, K_3]:   [4]      [10]     [4,2]    [4,2,1]   [4,2,2]
AUC:                           0.898    0.909    0.930    0.930     0.935

Table 2: Statistics of the data sets. D̄ and L̄ denote the average number of non-zero elements per example in the features and labels, respectively.

Data set     D      L      N       L̄       D̄
bibtex       1836   159    4880    2.40    68.74
delicious    500    983    12920   19.03   18.17
eurlex       5000   3993   17413   5.30    236.69

Table 3: AUC-ROC scores on real datasets (DBMLPL is our model, shown for depth T = 1, 2, 3)

Dataset      CPLST   BCS     WSABIE   LEML    BMLPL   ADIOS   DBMLPL T=1   T=2     T=3
bibtex       0.888   0.861   0.918    0.904   0.921   0.876   0.927        0.930   0.932
delicious    0.883   0.800   0.856    0.889   0.895   0.904   0.899        0.903   0.906
eurlex       -       -       0.865    0.946   0.952   0.877   0.956        0.961   0.962


6.2 Real Datasets

We first report our results on standard benchmark datasets for multi-label learning. We use three benchmark datasets (Yu et al., 2014) — Bibtex, Eurlex-4k, and Delicious — to compare our model with other state-of-the-art models. Note that all these datasets have highly sparse features and labels (see Table 2).

The choice of the number of layers and the sizes of these layers is the most important decision that has to be made while using this model, as is further shown in Section 6.1. It is worth mentioning that, between two architectures with the same number of nodes, the model with greater depth has fewer parameters in our case. We restrict the size of the t-th layer, for t ∈ {2, 3}, to half the size of the (t−1)-th layer. The results for DBMLPL are reported only for architectures with the total number of nodes between 0.1L and 0.5L. For the rest of the models, we picked their best performance with an embedding size between 0.2L and L. We compare our model with depth T = 1, 2, and 3 against the other state-of-the-art baselines in Table 3.

Among the baselines used in Table 3, BMLPL (Rai et al., 2015) is a Bayesian model based on learning low-dimensional embeddings of the labels, conditioned on the features (similar to our model). It can be thought of as a special case of our model with a single layer. LEML (Yu et al., 2014) minimizes various loss functions, such as the logistic, squared, and hinge losses, for low-dimensional embedding in an ERM framework. WSABIE (Weston et al., 2011) tries to learn a joint embedding of both the features and the labels. BCS (Kapoor et al., 2012) is also a Bayesian model, which uses a random projection of the labels and a regression of the features against that random projection in a single probabilistic framework. CPLST (Chen and Lin, 2012) also learns label embeddings, which are conditioned on the features, but uses the Hamming loss as its loss function. We also compare our model with a state-of-the-art deep learning architecture, ADIOS (Cissé et al., 2016), which leverages the relationships among labels by creating a deep output space to perform multi-label learning. As shown in Table 3, our model outperforms the various baselines on all the datasets. Also, the classification accuracies improve as we increase the number of layers.

We would like to highlight that our baselines consist of very strong, state-of-the-art Bayesian models and deep learning models for multi-label learning, and our model yields moderate but consistent performance gains across all the datasets. Moreover, our experiment on the carefully simulated difficult nonlinear dataset (Table 1) demonstrates that the model is able to learn difficult nonlinear boundaries, which single-layer models like BMLPL cannot.

Figure 2: Topics and Super topics inferred from eurlex data

6.3 Qualitative Analysis: Deep Topic Modeling

The model explicitly imposes a low-dimensional structure on the labels through the factor-loading matrix V^{(1)}Λ^{(1)}. Moreover, each layer t learns an unnormalized distribution over the layer below (t−1) through V^{(t)}Λ^{(t)}, with the 0th layer being the observed labels. We use the eurlex-4k dataset, which contains about 5000 documents related to European Union law, where each document is annotated with possibly multiple tags. There are a total of 3993 different labels, i.e., tags, spanning diverse areas. We use this dataset to test the efficacy of the deep topic modeling aspect of our model. We set K_1 = 20 and K_2 = 5 for our model, and hence expect to learn 20 topics and 5 super-topics over the topics. The obtained topics and super-topics are shown in Fig. 2, where we show the top few labels for each topic as well as the top 4 topics for each super-topic. The model groups semantically similar labels in each topic and is able to learn super-topics such as "Health and Food".

7 Conclusion and Discussion

We have presented a deep generative model for nonlinear multi-label learning. Our Bayesian model is based on learning a deep hierarchy of gamma-distributed latent factors that represent the embeddings of binary label vectors. The model is built using a clean, deep generative construction, and admits a simple and efficient inference procedure due to full local conjugacy. Moreover, the sparsity of the label vectors further reduces the computational cost of inference. The model also offers a nice geometric interpretation, which explains its effectiveness in learning complex nonlinear decision boundaries.

While multi-layer neural networks can also be tried for multi-label learning (in fact, the ADIOS baseline we compared against is precisely that!), taking a generative approach enables us to construct a likelihood model that is appropriate for high-dimensional binary label vectors, and also to leverage the sparsity of the label vectors for computational efficiency via the Bernoulli-Poisson link. Extensions to semi-supervised and active learning would be interesting future directions.

Although, in this paper, we used a Gibbs sampling based inference procedure, other inference methods such as stochastic variational inference can allow applying our model on very large-scale data sets. Developing such inference algorithms for our model will be an interesting direction of future work. Finally, our fully Bayesian framework also opens the door to other interesting extensions such as performing active learning (Vasisht et al., 2014) to acquire the most informative labels.

Acknowledgements: PR acknowledges support from the Visvesvaraya Young Faculty Fellowship by DST India and the P.K. Kelkar Young Faculty Fellowship by IIT Kanpur, and grants from IBM Research, Microsoft Research India, and Google.

References

R. Babbar and B. Schölkopf. DiSMEC: Distributed sparse machines for extreme multi-label classification. In WSDM, 2017.

D. P. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, 1999.

K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain. Sparse local embeddings for extreme multi-label classification. In NIPS, 2015.

Y.-N. Chen and H.-T. Lin. Feature-aware label space dimension reduction for multi-label classification. In NIPS, 2012.

M. Cissé, M. Al-Shedivat, and S. Bengio. ADIOS: Architectures deep in output space. In ICML, 2016.

E. Gibaja and S. Ventura. Multilabel learning: A review of the state of the art and ongoing research. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2014.

E. Gibaja and S. Ventura. A tutorial on multilabel learning. ACM Comput. Surv., 2015.

R. Henao, J. T. Lu, J. E. Lucas, J. Ferranti, and L. Carin. Electronic health record analysis via deep Poisson factor models. The Journal of Machine Learning Research, 17(1):6422–6453, 2016.

H. Jain, Y. Prabhu, and M. Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In KDD, 2016.

A. Kapoor, R. Viswanathan, and P. Jain. Multilabel classification using Bayesian compressed sensing. In NIPS, 2012.

D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.

X. Li and Y. Guo. Multi-label classification with feature-aware non-linear label space transformation. In IJCAI, 2015.

J. Liu, W.-C. Chang, Y. Wu, and Y. Yang. Deep learning for extreme multi-label text classification. In SIGIR, 2017.

D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI, 2008.

N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349, 2013. doi: 10.1080/01621459.2013.829001.

Y. Prabhu and M. Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, 2014.

P. Rai, C. Hu, R. Henao, and L. Carin. Large-scale Bayesian multi-label learning via topic-based label embeddings. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pages 3222–3230, Cambridge, MA, USA, 2015. MIT Press.

R. Ranganath, L. Tang, L. Charlin, and D. Blei. Deep exponential families. In AISTATS, 2015.

J. G. Scott and L. Sun. Expectation-maximization for logistic regression, 2013.

D. Vasisht, A. Damianou, M. Varma, and A. Kapoor. Active learning for sparse Bayesian multilabel classification. In KDD, 2014.

J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, 2011.

C. Xu, D. Tao, and C. Xu. Robust extreme multi-label learning. In KDD, 2016.

H.-F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.

M. Zhou. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, 2015.

M. Zhou. Softplus regressions and convex polytopes. ArXiv e-prints, Aug. 2016.

M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell., 37:307–320, 2015.

M. Zhou, Y. Cong, and B. Chen. The Poisson gamma belief network. In Advances in Neural Information Processing Systems 28, pages 3043–3051. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5645-the-poisson-gamma-belief-network.pdf.