
Collaborative Neighborhoods

Diego Represas, David Dindi
diegorep,[email protected]

December 13, 2014

    Abstract

Finding relevant research publications is a growing problem for researchers across all fields. Online platforms such as Mendeley and CiteULike have attempted to address this need by providing researchers the ability to share relevant articles with one another. These platforms have further sought to extend their capabilities by recommending articles to users based on each user's past interactions and preferences. The models that underlie these methods, however, are unable to provide recommendations on an item for which no prior interactions have been observed. To solve this issue, we propose Collaborative Neighborhoods (CN). CN combines elements of Collaborative Topic Regression [0] and Nearest Neighbor Models [1] to provide meaningful recommendations in both the presence and absence of past user interactions with items. We assess the performance of CN on a dataset of 129,531 articles sourced from PubMed, and demonstrate that our model provides more accurate recommendations than extant recommendation frameworks.

    1 Introduction

Researchers today are inundated with information; 2 million peer-reviewed articles are published every year [2]. The difficulty of finding relevant articles amidst this abundance of information has prompted citation management platforms like Mendeley and CiteULike to implement recommendation systems to aid researchers in the search [13]. These systems recommend articles based on implicit user preferences learned from a user's past article associations (likes, shares, or additions to one's personal archive).

Several collaborative filtering recommendation models have been proposed to aid in the implicit learning of user preferences. Collaborative Topic Regression (CTR) [0] has been the most promising model thus far. CTR applies topic modeling to augment the item feature vector used in traditional collaborative filtering. The limitation of CTR, however, is its representation of users by latent features alone. In the absence of information on past user-article associations, these user latent features cannot be accurately predicted. Consequently, CTR performs poorly on users that have few article associations.

We address CTR's shortcomings with CN. CN not only applies topic modeling to augment the latent feature representation of items, but does so as well for the latent feature representation of users. We begin by representing users and items by their associated topic distributions. We then proceed by learning latent variables that offset these distributions using past observations of user-article interactions. These offsets capture hidden interests that an author may have in fields outside their main area of research. Given that CN attributes both explicit and implicit content features to every user, it is capable both of understanding hidden preferences and of providing recommendations to users for whom there is insufficient information to learn implicit preferences.

    2 Background

A basic approach to recommending text items has been to do so based on content similarity. Such methods employ probabilistic models and similarity scoring to define an article's content [3,4,5,6], or matrix factorization methods such as content-based Collaborative Filtering (CBCF) to provide recommendations to users [7,8,9]. For our particular problem, an article j's content can be thought of as a topic distribution vector, θ_j ∈ R^K, across K topics. When a new item is introduced, a similarity function (e.g. cosine similarity) is used to determine the k items that are most similar to the new item. The new item is then recommended to users who in the past have rated any of the k items favourably.
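A minimal sketch of such a neighborhood recommender (our illustration, not code from the cited systems; the array shapes and the choice of k are assumptions):

```python
import numpy as np

def recommend_new_item(theta_new, thetas, ratings, k=10):
    """Score a new article against the catalog by cosine similarity of
    topic distributions, then return users who rated any of the k
    nearest catalog items favourably.

    theta_new : (K,) topic distribution of the new article
    thetas    : (J, K) topic distributions of existing articles
    ratings   : (I, J) binary user-item interaction matrix
    """
    sims = thetas @ theta_new / (
        np.linalg.norm(thetas, axis=1) * np.linalg.norm(theta_new) + 1e-12)
    neighbors = np.argsort(-sims)[:k]   # the k most similar existing items
    # Candidate users: anyone who interacted with at least one neighbor.
    users = np.where(ratings[:, neighbors].sum(axis=1) > 0)[0]
    return neighbors, users
```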

Despite their intuitive appeal, similarity and neighborhood models (such as k-Nearest Neighbors) are inadequate for providing recommendations in research literature. This inadequacy stems from the fact that content similarity alone is not sufficient to determine whether or not an author would cite an article. A neighborhood model, for instance, would recommend an article that is of little intrinsic value to a user solely because the article contains topics that the user has referenced in the past. This dependence on explicit features prevents neighborhood models from capturing the hidden preferences of a user.

Some of the most popular market recommendation systems, including the one used by Mendeley, are based on Collaborative Filtering (CF) methods using latent factor models [10,11,12,13,14]. In these models, the rating that a user i attributes to item j can be predicted through the rating function r̂_{i,j} = u_i^T v_j, where u_i ∈ R^K is the latent factor vector for user i and v_j ∈ R^K is the latent factor vector for item j. An effective approach for implicit feedback datasets is to translate the continuous-valued rating into the implicit space by setting the preference variable as follows [1,12]:

$$\hat{p}_{i,j} = \begin{cases} 1 & \hat{r}_{i,j} > 0 \\ 0 & \hat{r}_{i,j} = 0 \end{cases} \qquad (1)$$

Because not all values of r̂_{i,j} > 0 are equally likely to predict a user-item interaction, a confidence variable c_{i,j} = 1 + α f(r_{i,j}) is introduced to measure the confidence of observing p_{i,j} = 1. The constant α is a scaling parameter and f(r_{i,j}) is an empirical function that depends on the dataset. The latent factor vectors u_i and v_j are then computed by minimizing the objective function:

$$\min_{u,v} \sum_{i,j} c_{i,j}\,\big(p_{i,j} - u_i^T v_j\big)^2 + \lambda\Big(\sum_i \|u_i\|^2 + \sum_j \|v_j\|^2\Big) \qquad (2)$$

The following update rules for both latent vectors are then derived from the objective function:

$$u_i \leftarrow (V C^i V^T + \lambda I)^{-1} V C^i P_i, \qquad v_j \leftarrow (U C^j U^T + \lambda I)^{-1} U C^j P_j \qquad (3)$$

where U, V are the user and item latent factor matrices, C is the confidence variable matrix, and P is the preference variable matrix. Updates are thus performed through an alternating least-squares scheme.
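As a concrete illustration, the sketch below builds P and C from a raw interaction matrix and performs one half-sweep of eq. (3). It is ours, not the reference implementation: we assume the linear choice f(r) = r (the text leaves f dataset-dependent) and use dense NumPy for clarity, whereas practical systems exploit sparsity.

```python
import numpy as np

def implicit_feedback(R, alpha=40.0):
    """Eq. (1) and the confidence variable, assuming f(r) = r."""
    P = (R > 0).astype(float)  # p = 1 wherever an interaction was observed
    C = 1.0 + alpha * R        # confidence grows with interaction strength
    return P, C

def als_half_sweep(X, C_rows, P_rows, lam):
    """Solve (X C^i X^T + lam I) u_i = X C^i p_i for every row i, as in
    eq. (3). X is the fixed (K, N) factor matrix: item factors when
    updating users, user factors when updating items.
    """
    K = X.shape[0]
    out = np.zeros((C_rows.shape[0], K))
    for i in range(C_rows.shape[0]):
        Ci = np.diag(C_rows[i])              # C^i as a diagonal matrix
        A = X @ Ci @ X.T + lam * np.eye(K)
        out[i] = np.linalg.solve(A, X @ Ci @ P_rows[i])
    return out

# With U stored as (I, K) and V as (J, K), alternate to convergence:
#   U = als_half_sweep(V.T, C, P, lam)
#   V = als_half_sweep(U.T, C.T, P.T, lam)
```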

Recommender systems based on latent factor models have been shown to provide better recommendations than neighborhood methods [15,12]. However, because CF relies on observing prior user-item interactions to provide recommendations, it is unable to recommend articles that have not been previously cited or "liked" by researchers.

To address the shortcomings of CF, David M. Blei and Chong Wang recently developed Collaborative Topic Regression (CTR) [0]. Much like CF, CTR assigns latent features to every user and item based on prior user-item interactions. However, CTR differs from CF in its usage of Latent Dirichlet Allocation (LDA) to construct topic distributions (drawn from β = β_{1:K} unique topics) for every item. Rather than describing v_j as an exclusively latent factor-based vector, Blei et al. choose to describe the item vector as v_j = θ_j + ε_j, where θ_j ∈ R^K represents the LDA-derived topic distribution of item j, and ε_j is a latent variable that captures fluctuations away from the topic distribution. These latent offsets augment the content similarity-based approach through the flexibility they afford the model to understand hidden features of items and users. Since regularization for the topic-enhanced items must inherently differ from that for the purely latent users, separate parameters λ_u and λ_v are applied to the user and item feature vectors respectively. The resulting log-likelihood function is intractable; a type of Expectation Maximization algorithm is therefore used to arrive at the optimal parameters.

$$\mathcal{L} = -\frac{\lambda_u}{2}\sum_i u_i^T u_i - \frac{\lambda_v}{2}\sum_j (v_j - \theta_j)^T(v_j - \theta_j) + \sum_j \sum_n \log\Big(\sum_k \theta_{jk}\,\beta_{k,w_{jn}}\Big) - \sum_{i,j} \frac{c_{ij}}{2}\big(r_{ij} - u_i^T v_j\big)^2 \qquad (4)$$

The derived update rules for u and v then become:

$$u_i \leftarrow (V C^i V^T + \lambda_u I)^{-1} V C^i P_i, \qquad v_j \leftarrow (U C^j U^T + \lambda_v I)^{-1}\big(U C^j P_j + \lambda_v \theta_j\big) \qquad (5)$$

Given u and v, the topic proportions are learned by variational inference; a family of distributions on the latent variables is generated, q(θ, z | γ, φ) = q(θ | γ) ∏_n q(z_n | φ_n), and Jensen's inequality is applied to find a tight lower bound for the log-likelihood function below.

$$\mathcal{L} \geq -\frac{\lambda_v}{2}(v_j - \theta_j)^T(v_j - \theta_j) + \sum_n \sum_k \phi_{jnk}\big(\log \theta_{jk}\,\beta_{k,w_{jn}} - \log \phi_{jnk}\big) \qquad (6)$$

The free variational parameters upon which the distributions are generated are optimized by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior. The resulting update equations for determining θ and β are given below:

$$\phi_{ni} \propto \beta_{i,w_n}\exp\{E_q[\log\theta_i \mid \gamma]\} \qquad (7)$$

$$\gamma_i = \alpha_i + \sum_n \phi_{ni} \qquad (8)$$
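A sketch of these coordinate-ascent updates for a single document (ours; standard LDA variational inference, with E_q[log θ_i | γ] expanded via the digamma function):

```python
import numpy as np
from scipy.special import digamma

def variational_update(beta, word_ids, alpha, gamma, n_iter=50):
    """Coordinate ascent for eqs. (7)-(8) on one document.

    beta     : (K, V) topic-word probabilities
    word_ids : (N,) vocabulary index of each word in the document
    alpha    : (K,) Dirichlet prior
    gamma    : (K,) initial variational Dirichlet parameter
    """
    for _ in range(n_iter):
        # eq. (7): phi_nk proportional to beta_{k,w_n} * exp(E_q[log theta_k])
        log_theta = digamma(gamma) - digamma(gamma.sum())
        phi = beta[:, word_ids].T * np.exp(log_theta)   # (N, K)
        phi /= phi.sum(axis=1, keepdims=True)           # normalize per word
        # eq. (8): gamma_k = alpha_k + sum_n phi_nk
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```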

The predicted rating variable r̂_{i,j} for in-matrix predictions is then computed by:

$$\hat{r}_{i,j} = u_i^T v_j = u_i^T(\theta_j + \epsilon_j) \qquad (9)$$

When an item is new and in-matrix prediction is not possible, the rating variable is computed ignoring the latent offset:

$$\hat{r}_{i,j} = u_i^T \theta_j \qquad (10)$$

Consequently, users are still represented by latent factors, but items are represented by a combination of their topic distribution and a latent factor offset, the latter aiming to capture variables conducive to a user-item interaction that are not related to the item's topic. CTR was shown to perform better as a recommender system than Collaborative Filtering using both latent factor and content-based methods.
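A small sketch of how prediction switches between the two modes (ours; `epsilon` denotes the learned item offsets):

```python
import numpy as np

def predict_ratings(U, theta, epsilon=None):
    """CTR-style predicted ratings. In-matrix (eq. 9): items are
    theta + epsilon. Out-of-matrix (eq. 10): a new item has no learned
    offset, so only its topic distribution is used.

    U : (I, K) user latent factors; theta, epsilon : (J, K)
    """
    V = theta if epsilon is None else theta + epsilon
    return U @ V.T  # (I, J) matrix of r_hat values
```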

3 Collaborative Neighborhoods

Given the performance improvement CTR saw by introducing topic modeling for the items, we developed an algorithm that also introduces topic modeling for the user vectors to increase prediction accuracy. In this section we describe the resulting algorithm, Collaborative Neighborhoods.

As in CTR, our users are I researchers and our items are J scientific articles. The preference variable p_{i,j} ∈ {0, 1} indicates whether or not user i cited article j. The preference variable is computed from the predicted ratings as in eq. (1), and the predicted rating variable remains r̂_{i,j} = u_i^T v_j.

We now introduce a way to represent topic proportions for both users and items. We denote θ_j as the topic distribution for item j, where each θ_j is drawn from β = β_{1:K} unique topics. Conversely, we introduce ψ_i as the topic distribution for user i, with each ψ_i drawn from β′ = β′_{1:L}. In our experiment, we chose ψ_i to be the average topic space of user i, but there are several other strategies one could use to represent the user. Since matrix factorization requires u, v ∈ R^K (the user and item factor matrices have to align), both θ and ψ have to be drawn from the same topic space or, alternatively, from topic models of the same magnitude (|β′| = |β|). Since finding the optimal values of ψ, θ, u, and v given β, β′ becomes intractable, our algorithm will only be tested in the case where β = β′, so we can assume ψ ≈ θ and use the same EM-style algorithm as in CTR to optimize for the topic distributions. The most important difference is that our update rules for u_i, v_j now become:

$$u_i \leftarrow (V C^i V^T + \lambda_u I)^{-1}\big(V C^i P_i + \lambda_u \psi_i\big), \qquad v_j \leftarrow (U C^j U^T + \lambda_v I)^{-1}\big(U C^j P_j + \lambda_v \theta_j\big) \qquad (11)$$

Our algorithm thus additionally incorporates topic modeling for the users in the same way that CTR does for the items.
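A sketch of one full CN sweep under the β = β′ assumption (ours; dense NumPy, with ψ and θ in the same K-dimensional topic space). The only change from the CTR updates is the λ_u ψ_i term added to the user right-hand side:

```python
import numpy as np

def cn_sweep(U, V, C, P, psi, theta, lam_u, lam_v):
    """One sweep of eq. (11). U: (I, K) users, V: (J, K) items,
    C, P: (I, J) confidence/preference matrices, psi: (I, K) user topic
    distributions, theta: (J, K) item topic distributions."""
    K = U.shape[1]
    for i in range(U.shape[0]):
        Ci = np.diag(C[i])
        A = V.T @ Ci @ V + lam_u * np.eye(K)
        U[i] = np.linalg.solve(A, V.T @ Ci @ P[i] + lam_u * psi[i])
    for j in range(V.shape[0]):
        Cj = np.diag(C[:, j])
        A = U.T @ Cj @ U + lam_v * np.eye(K)
        V[j] = np.linalg.solve(A, U.T @ Cj @ P[:, j] + lam_v * theta[j])
    return U, V
```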

    4 Experimental Study

Dataset: A total of 1.2 million articles was retrieved from the PubMed open access dataset. We parsed the title, abstract, authors, keywords, and citations associated with each article. We removed articles missing any one of the fields of interest to obtain a dataset of 129,531 scientific articles.

Articles: For each article, we concatenated its title, abstract, and keywords into a document. We then removed all English stop words and built a vocabulary consisting of the X words that appeared in our corpus more than once. Next, we used a Hierarchical Dirichlet Process (HDP) model on our dataset to determine the number K of topics contained within, and subsequently ran LDA to determine the K-vector topic distribution for each article.
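One plausible version of this preprocessing pipeline, sketched with gensim (the paper does not name its tooling; `load_documents` is a hypothetical loader, and we approximate the "appeared more than once" rule with a document-frequency threshold):

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel, LdaModel
from gensim.parsing.preprocessing import STOPWORDS

docs = load_documents()  # hypothetical: one "title abstract keywords" string per article

# Tokenize and drop English stop words.
tokenized = [[w for w in doc.lower().split() if w not in STOPWORDS] for doc in docs]

vocab = Dictionary(tokenized)
vocab.filter_extremes(no_below=2, no_above=1.0, keep_n=None)  # words in >= 2 documents
corpus = [vocab.doc2bow(doc) for doc in tokenized]

hdp = HdpModel(corpus, id2word=vocab)  # nonparametric pass to estimate K
K = len(hdp.get_topics())
lda = LdaModel(corpus, id2word=vocab, num_topics=K)

# theta[j]: the K-vector topic distribution of article j.
theta = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
```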

Users: 1.5 million unique authors were identified in the initial open access dataset. User-article interactions were generated by mapping users to all the citations that they had across all their publications. We removed self-citations (where an author cites their past work) and filtered out users that had cited fewer than 10 papers in our dataset. Our final user-item matrix consisted of 26,152 users by 129,531 articles, with an average of 16.9 interactions (citations) per user and 12.5 publications per user. Lastly, we assigned a topic distribution to each user by taking the average topic distribution of all papers they had co-authored. Because of computational constraints (our tests routinely took up to 36 hours on Stanford's Barley machines due to the alternating least-squares step), all tests were run on a random fraction of the original dataset consisting of 35,341 articles.
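The user profiles are then a simple average (a sketch; `authorship` is an assumed user-by-article 0/1 matrix recording who wrote what):

```python
import numpy as np

def user_topic_profiles(authorship, theta):
    """psi_i = mean topic distribution over the papers user i co-authored.

    authorship : (I, J) binary matrix, 1 where user i wrote article j
    theta      : (J, K) article topic distributions from LDA
    """
    papers_per_user = authorship.sum(axis=1, keepdims=True)
    return (authorship @ theta) / np.maximum(papers_per_user, 1)  # (I, K)
```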

    4.1 Evaluation

Cross-Validation: The training and testing sets for all tests were split as follows. From the original M × N matrix containing all user interactions, a total of m = M/20 user row indices and n = N/20 item column indices were randomly selected. Interactions belonging to the m randomly selected users were separated into an m × (N − n) matrix for user out-of-matrix cross-validation. Interactions belonging to the n randomly selected items were likewise separated into an (M − m) × n matrix for item out-of-matrix cross-validation. Interactions in both the m selected rows and the n selected columns were discarded. The remaining bulk of the interactions was assigned to an (M − m) × (N − n) matrix. Of these interactions, 10% were randomly selected and placed in a separate (M − m) × (N − n) matrix for in-matrix cross-validation. The other 90% were used for training the models. The user and item topic matrices were split and distributed according to where their representative rows/columns were assigned during the shuffle. We saw no significant changes in our training and testing recall values when implementing multiple folds of this matrix shuffle split.
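A sketch of this shuffle split (ours; index bookkeeping only, with `rng` a seeded NumPy generator):

```python
import numpy as np

def shuffle_split(R, rng, frac=20, heldout=0.1):
    """Split an (M, N) interaction matrix as described above."""
    M, N = R.shape
    users_out = rng.choice(M, M // frac, replace=False)  # m user rows
    items_out = rng.choice(N, N // frac, replace=False)  # n item columns
    rows = np.setdiff1d(np.arange(M), users_out)
    cols = np.setdiff1d(np.arange(N), items_out)
    user_cv = R[np.ix_(users_out, cols)]   # m x (N - n): user out-of-matrix CV
    item_cv = R[np.ix_(rows, items_out)]   # (M - m) x n: item out-of-matrix CV
    core = R[np.ix_(rows, cols)].copy()    # (M - m) x (N - n) bulk
    i, j = np.nonzero(core)
    pick = rng.random(i.size) < heldout    # hold out ~10% for in-matrix CV
    test = np.zeros_like(core)
    test[i[pick], j[pick]] = core[i[pick], j[pick]]
    core[i[pick], j[pick]] = 0             # the other 90% trains the models
    return core, test, user_cv, item_cv
```

Interactions in the intersection of the m selected rows and n selected columns are simply never read, matching the discard step above.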

In-matrix predictions: Following Blei and Wang's prior work, we established our measures of accuracy to be standard precision and recall within the first 1-200 highest-scored predictions. For in-matrix recommendations, recall was the number of articles recommended to a user that were also in the testing set, over the total number of articles in that user's testing set.

Out-of-matrix predictions: For out-of-matrix recommendations, recall was the number of articles recommended to a user that were also in the out-of-matrix set, over the total number of articles in that user's out-of-matrix set.
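Both measures reduce to the same computation per user (a sketch, with `scores` the predicted ratings over candidate articles and `test` the held-out interactions):

```python
import numpy as np

def recall_at_m(scores, test, m=200):
    """Fraction of a user's held-out articles appearing among the m
    highest-scored recommendations."""
    top = np.argsort(-scores)[:m]   # indices of the top-m predictions
    hits = np.intersect1d(top, np.nonzero(test)[0]).size
    return hits / max(np.count_nonzero(test), 1)
```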

    4.2 Settings

Blei and Wang established that the independence of the topic distributions θ_j relative to the user vectors allows the optimal topic distributions to be found before beginning to optimize for u_i and v_j = ε_j + θ_j. Consequently, we used HDP to find an optimal topic number of K = 200 and also set the dimensions of our latent vectors equal to K. We set α = 1 and iterated over values of λ_v, λ_u ∈ {.01, .1, 1, 10, 100} to find that λ_v = 10, λ_u = .01 gave optimal results for CTR using 25 training epochs. Values of λ_v = 10, λ_u = .1 gave optimal results for CN. As prior literature had already shown that CTR outperforms CF, we simply used K = 200, λ_u = λ_v = 0.01 for our CF iterations to provide a point of reference without optimizing the parameters.
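The search itself is a small grid sweep (a sketch; `train_cn` and `mean_recall` are hypothetical stand-ins for our training loop and the recall measure of Section 4.1):

```python
import itertools

lambdas = [0.01, 0.1, 1, 10, 100]
results = {}
for lam_u, lam_v in itertools.product(lambdas, lambdas):
    model = train_cn(K=200, lam_u=lam_u, lam_v=lam_v, epochs=25)  # hypothetical
    results[(lam_u, lam_v)] = mean_recall(model, m=200)           # hypothetical
best_lam_u, best_lam_v = max(results, key=results.get)
```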

4.3 Results

Comparisons: Our algorithm performed better than Collaborative Filtering and comparably to Collaborative Topic Regression when performing in-matrix recommendations. This is consistent with prior research on the performance of CTR. In our particular dataset, we found that recall using CN was almost identical to that measured using CTR during the first predictions, and consistently higher than CTR's in the ranges after 20 predictions. This could mean that our algorithm generally performs better at in-matrix recommendations than CTR, but it could also be attributed to the more direct topical connections of our dataset.

For item out-of-matrix predictions, our algorithm demonstrated considerably higher recall than CTR, indicating that the value of our algorithm is strongest when performing cold-start recommendations. We attempted to perform user out-of-matrix predictions multiple times with no more success than random recommendations. We attribute this failure to the strategy we used to establish the user topic space, as an average over a user's publications is likely to be similar to too many publications without much specificity.

Examining User Profiles: We can explicitly analyze user profiles from the LDA representation of topics that was generated by averaging the topic distribution across all of a user's publications. With CN, we are able to extend the analysis further by analyzing the magnitude of each latent offset from the author's topic distribution. These offsets represent large deviations of an author's preferences from their core topics of research. For instance, a financial engineer might apply electrical engineering signal processing techniques to filter time series data in his/her research. More generally, these offsets capture passive interests that can allow recommender systems to make more informed judgments on what to recommend. In our dataset, we observe that the topic associated with words such as south, variation, india, asia, geographic, and madagascar exhibits the largest offsets.

Examining Article Characteristics: Similarly, CN allows us to understand which topics attract broad interest from researchers across a diversity of fields. We accomplish this analysis by calculating the average magnitude of each item latent feature offset. Topics with large offset magnitudes on the item side tend to be topics that have been cited by users in a variety of fields. In our dataset, we observe topics associated with words such as world, skin, biodiversity, and communities to have the largest offsets.

    5 Conclusions

We demonstrate in this study the ability of Collaborative Neighborhoods to predict the citations made by researchers in the medical literature. We obtain results that are superior to traditional Collaborative Filtering and Collaborative Topic Regression in in-matrix and out-of-matrix predictions. Furthermore, we demonstrate the ability of our model to develop a semantic understanding of an author's preferences, as well as to identify the types of articles that enjoy the widest reception across a variety of fields. Our method can be generally applied to any user-item recommendation problem where there is sufficient information about each user. Future work will focus on eliciting further meaning from the derived latent features, and on augmenting the ability of CN to recommend items to users whose prior interactions have not been observed.

References

[0] Wang, Chong and David M. Blei. "Collaborative Topic Modeling for Recommending Scientific Articles." Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2011, 448-456.

[1] Hu, Y., Y. Koren, and C. Volinsky. "Collaborative filtering for implicit feedback datasets." Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on. 2008.

[2] "STM International Association of Scientific, Technical and Medical Publishers." STM. N.p., n.d. Web. 13 Dec. 2014.

[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.

[4] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei. Reading tea leaves: How humans interpret topic models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 288-296, 2009.

[5] S. M. Gerrish and D. M. Blei. Predicting legislative roll calls from text. In Proceedings of the 28th Annual International Conference on Machine Learning, ICML '11, 2011.

[6] D. Agarwal and B.-C. Chen. fLDA: matrix factorization through latent Dirichlet allocation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 91-100, New York, NY, USA, 2010. ACM.

[7] M. J. Pazzani and D. Billsus. Content-based recommendation systems.

[8] Burke, R.: Hybrid Web Recommender Systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) The Adaptive Web: Methods and Strategies of Web Personalization. LNCS, vol. 4321, pp. 377-408. Springer, Heidelberg (2007).

[9] Mooney, R. J., and Roy, L. 2000. Content-based book recommending using learning for text categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, 195-204.

[10] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880-887. ACM, 2008.

[11] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. Advances in Neural Information Processing Systems, 20:1257-1264, 2008.

[12] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30-37, 2009.

[13] D. Agarwal and B.-C. Chen. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 19-28, New York, NY, USA, 2009. ACM.

[14] K. Yu, J. Lafferty, S. Zhu, and Y. Gong. Large-scale collaborative prediction using a nonparametric random effects model. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1185-1192, New York, NY, USA, 2009. ACM.

[15] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
