
Knowledge-Based Systems 85 (2015) 234–244


Journal homepage: www.elsevier.com/locate/knosys

Compressed knowledge transfer via factorization machine for heterogeneous collaborative recommendation

http://dx.doi.org/10.1016/j.knosys.2015.05.009
0950-7051/© 2015 Elsevier B.V. All rights reserved.

⇑ Corresponding author.
E-mail addresses: [email protected] (W. Pan), [email protected] (Z. Liu), [email protected] (Z. Ming), [email protected] (H. Zhong), [email protected] (X. Wang), [email protected] (C. Xu).

Weike Pan a, Zhuode Liu a, Zhong Ming a,⇑, Hao Zhong b, Xin Wang b, Congfu Xu b

a College of Computer Science and Software Engineering, Shenzhen University, China
b Institute of Artificial Intelligence, College of Computer Science, Zhejiang University, China

Article info

Article history:
Received 11 August 2014
Received in revised form 10 February 2015
Accepted 8 May 2015
Available online 15 May 2015

Keywords:
Collaborative recommendation
Heterogeneous feedbacks
Factorization machine
Compressed knowledge
Transfer learning

Abstract

Collaborative recommendation has attracted various research works in recent years. However, an important problem setting, i.e., "a user examined several items but only rated a few", has not received much attention yet. We coin this problem heterogeneous collaborative recommendation (HCR) from the perspective of users' heterogeneous feedbacks of implicit examinations and explicit ratings. In order to fully exploit such different types of feedbacks, we propose a novel and generic solution called compressed knowledge transfer via factorization machine (CKT-FM). Specifically, we assume that the compressed knowledge of user homophily and item correlation, i.e., user groups and item sets behind two types of feedbacks, are similar and then design a two-step transfer learning solution including compressed knowledge mining and integration. Our solution is able to transfer high quality knowledge via noise reduction, to model rich pairwise interactions among individual-level and cluster-level entities, and to adapt the potential inconsistent knowledge from implicit feedbacks to explicit feedbacks. Furthermore, the analysis on time complexity and space complexity shows that our solution is much more efficient than the state-of-the-art method for heterogeneous feedbacks. Extensive empirical studies on two large data sets show that our solution is significantly better than the state-of-the-art non-transfer learning method w.r.t. recommendation accuracy, and is much more efficient than that of leveraging the raw implicit examinations directly instead of compressed knowledge w.r.t. CPU time and memory usage. Hence, our CKT-FM strikes a good balance between effectiveness and efficiency of knowledge transfer in HCR.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Recommendation functionality has been widely implemented as a default module in various Internet services such as YouTube's video recommendation and Amazon's book recommendation. Factorization based collaborative recommendation algorithms with low-rank assumptions have dominated in various recommendation scenarios due to their applicability and high accuracy. Most factorization based methods focus on homogeneous user feedbacks, e.g., explicit ratings in matrix factorization [28,31] and implicit feedbacks in Bayesian personalized ranking (BPR) [27]. However, few works have studied a very common problem setting, in which "a user examined several items but only rated a few". This setting is called heterogeneous collaborative recommendation (HCR) and considers different types of users' feedbacks, including implicit examinations (e.g., browsing and clicks) and explicit ratings. In a typical recommendation system, implicit feedbacks are usually more abundant and thus have a potential to help alleviate the sparsity problem of users' explicit ratings.

For the HCR problem, the most well-known method is probably the SVD++ model [11], which could be mimicked by factorization machine (FM) [24]. SVD++ and FM combine two types of feedbacks in a principled way via changing the prediction rule defined on one (user, item, rating) triple in explicit feedbacks to that defined on both the triple and all items examined by the user. Because implicit feedbacks are usually much more numerous than explicit feedbacks, leveraging raw implicit feedbacks will increase the time cost and space cost significantly, which may make these methods inapplicable in some real-world recommendation scenarios. The increase of time and space cost is also observed in our empirical studies in Section 4.

In order to leverage the implicit feedbacks in a more efficient and effective way, we address the HCR problem from a novel transfer learning perspective [19], in which we take explicit feedbacks as target data and implicit feedbacks as auxiliary data. Technically, we propose a novel two-step transfer learning solution, i.e., compressed knowledge transfer via factorization machine (CKT-FM), for knowledge sharing between auxiliary data and target data. In our first step, we mine compressed knowledge of user homophily (i.e., user groups) and item correlation (i.e., item sets) from auxiliary implicit feedbacks, which is expected to be more parsimonious than the raw implicit feedbacks. In our second step, we design an integrative knowledge transfer solution via expanding the design matrix of factorization machine, which incorporates the compressed knowledge of user groups and item sets into the target data in a seamless manner. We then conduct extensive empirical studies on two large data sets and obtain significantly better results via our CKT-FM than the state-of-the-art method without knowledge transfer. Furthermore, our CKT-FM is much more efficient than the method leveraging raw implicit examinations w.r.t. CPU time and memory usage.

Table 1. Some notations used in the paper.

| Group | Notation | Description |
|---|---|---|
| (i) | $n, m$ | Number of users and items |
| | $\mathbb{R}$ | Rating range, e.g., $\{0.5, 1, \ldots, 5\}$ |
| | $R \in \{\mathbb{R} \cup ?\}^{n \times m}$ | Numerical rating matrix |
| | $E \in \{1, ?\}^{n \times m}$ | Unary examination matrix of implicit feedbacks |
| | $E_b \in \{1, 0\}^{n \times m}$ | Binary examination matrix converted from $E$ |
| | $u$ | User ID |
| | $i, i'$ | Item ID |
| | $r_{ui} \in \mathbb{R} \cup ?$, $e_{ui} \in \{1, ?\}$ | Rating and examination of user $u$ on item $i$ |
| | $p, p_e$ | Number of ratings and examinations |
| (ii) | $d$ | Number of latent dimensions in SVD |
| | $U \in \mathbb{R}^{n \times d}$ | Users' latent preferences |
| | $V \in \mathbb{R}^{m \times d}$ | Items' latent features |
| (iii) | $g, s$ | Number of user groups and item sets |
| | $G \in \{0,1\}^{n \times g}$, $S \in \{0,1\}^{m \times s}$ | Membership matrix of users and items |
| | $G_u, S_i$ | User group of user $u$ and item set of item $i$ |
| (iv) | $(u \cdot i)$ | Individual-level interaction between $u$ and $i$ |
| | $(G_u \cdot S_i)$ | Cluster-level interaction between $G_u$ and $S_i$ |
| | $(G_u \cdot i)$, $(u \cdot S_i)$ | Hybrid interaction between $G_u$ and $i$, and $u$ and $S_i$ |
| | $(G_u \cdot u)$, $(S_i \cdot i)$ | Adaptation from $G_u$ to $u$, and from $S_i$ to $i$ |
| (v) | $X \in \{0,1\}^{p \times (n+m)}$ | Design or feature matrix |
| | $x_{ui} \in \{0,1\}^{1 \times (n+m)}$ | Design or feature vector of triple $(u, i, r_{ui})$ |
| | $r \in \mathbb{R}^{p \times 1}$ | Rating vector |
| | $\tilde{X} \in \{0,1\}^{p \times (n+m+g+s)}$ | Expanded design or feature matrix |
| | $f$ | Number of latent dimensions of FM |

We summarize our main contributions as follows: (i) we propose a novel and generic compressed knowledge transfer solution via factorization machine (CKT-FM) for heterogeneous collaborative recommendation; and (ii) we conduct extensive empirical studies on two large data sets and show that our CKT-FM is significantly better than the state-of-the-art non-transfer learning method w.r.t. recommendation accuracy, and is much more efficient than that of leveraging the raw implicit examinations directly instead of compressed knowledge w.r.t. CPU time and memory usage.

We organize the paper as follows. First, we provide some background information, including a formal definition of the studied problem and a description of factorization machine, in Section 2. Second, we describe our two-step knowledge transfer solution in detail in Section 3. Third, we conduct extensive empirical studies and detailed analysis in Section 4. Fourth, we discuss some existing works on closely related topics in Section 5. Finally, we conclude this paper with some future directions.

2. Background

2.1. Problem definition

In our studied heterogeneous collaborative recommendation (HCR) problem, we have $n$ users and $m$ items in the target data, for which we have observed some explicit feedbacks of ratings, e.g., $r_{ui}$ for user $u$'s graded preference on item $i$. Besides the target explicit feedbacks, we also have some auxiliary data of implicit examination records such as users' actions of browsing and clicks. We use $R = [r_{ui}]_{n \times m}$ and $E = [e_{ui}]_{n \times m}$ to denote the explicit ratings and implicit examinations, respectively.

Our goal is then to design an effective and efficient knowledge transfer solution to transfer knowledge from the auxiliary implicit feedbacks $E$ to the target explicit feedbacks $R$, in order to address the sparsity problem of graded preferences in the target data. Note that the users and items are the same in both target data and auxiliary data, which can thus be categorized as a frontal-side transfer learning setting [23], rather than the two-side [22], user-side [7] or item-side [29] knowledge transfer setting. We illustrate our studied problem in Fig. 1, in particular the left part (implicit examinations) and the right part (explicit ratings).

Fig. 1. Illustration of compressed knowledge transfer (CKT) via factorization machine (FM) for heterogeneous collaborative recommendation (HCR).

We put some commonly used notations in Table 1. They include (i) feedbacks, (ii) latent variables, (iii) compressed knowledge, (iv) pairwise interactions, and (v) variables in factorization machine. Please refer to this table for the descriptions of the notations used in this paper.

2.2. Factorization machine

The main idea of factorization machine (FM) [24] is to represent the (user, item) rating matrix $R$ in a new form, including a design matrix $X$ and a target vector $r$,

$$\mathrm{FM}(R) \rightarrow \mathrm{FM}(X, r).$$

Specifically, $X$ and $r$ are associated with $p$ feature vectors and $p$ ratings, respectively, i.e., $X = [x_{ui}]_{p \times 1} \in \{0,1\}^{p \times (n+m)}$ and $r = [r_{ui}]_{p \times 1} \in \mathbb{R}^{p \times 1}$, where $p$ is the number of observed explicit ratings in $R$. A typical (user, item, rating) triple, i.e., $(u, i, r_{ui})$, is represented as $(x_{ui}, r_{ui})$, where $x_{ui} \in \{0,1\}^{1 \times (n+m)}$ is a feature vector with the $u$th and $(n+i)$th entries being 1 and all other entries being 0. Note that such a representation of $x_{ui}$ is usually called dummy coding. The rating $r_{ui}$ is then put in the corresponding location of the target vector $r$.
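The dummy coding described above can be sketched in a few lines of Python. This is only an illustration: the function names `dummy_code` and `build_design` and the toy triples are assumptions made here, and dense lists are used for readability where a real implementation would store the design matrix sparsely.

```python
def dummy_code(u, i, n, m):
    """Return x_ui in {0,1}^(n+m) for a 1-based user ID u and item ID i."""
    if not (1 <= u <= n and 1 <= i <= m):
        raise ValueError("u or i out of range")
    x = [0] * (n + m)
    x[u - 1] = 1        # the u-th entry marks the user
    x[n + i - 1] = 1    # the (n+i)-th entry marks the item
    return x


def build_design(triples, n, m):
    """Turn (u, i, r_ui) triples into a design matrix X and a target vector r."""
    X = [dummy_code(u, i, n, m) for u, i, _ in triples]
    r = [rating for _, _, rating in triples]
    return X, r


# Hypothetical toy data: n = 3 users, m = 2 items, p = 2 observed ratings.
triples = [(1, 2, 4.0), (3, 1, 2.5)]
X, r = build_design(triples, n=3, m=2)
```

Each row of `X` has exactly two non-zero entries, so only their positions need to be stored in practice.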

With this new representation, FM [24] then models pairwise interactions for every two non-zero features of each feature vector $x_{ui}$ via two latent vectors, one latent vector for one feature. The most commonly used formulation of FM is the second order pairwise interactions with factorized variables, which inherits the advantages of support vector machine (SVM) [4] and matrix factorization (MF) [28]. Specifically, the rating of user $u$ on item $i$ in FM is approximated as follows [24],

$$\hat{r}_{ui} = w(0) + \sum_{j=1}^{n+m} w(j)\, x_{ui}(j) + \sum_{j=1}^{n+m} \sum_{j'=j+1}^{n+m} x_{ui}(j)\, x_{ui}(j')\, v(j)\, v(j')^T,$$

where the scalars $w(j) \in \mathbb{R}$, $j = 0, 1, \ldots, n+m$, and vectors $v(j) \in \mathbb{R}^{1 \times f}$, $j = 1, 2, \ldots, n+m$, are model parameters to be learned. Once the model parameters have been learned, we can estimate each user's preferences on each item, which can then be used for personalized recommendation. Note that the above formula can be reformulated to result in an efficient linear time complexity computation [24].
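To make the prediction rule and its linear-time reformulation concrete, the following Python sketch evaluates the second-order FM formula in two ways: directly as a double sum over feature pairs, and via the standard identity that rewrites the pairwise term as $\frac{1}{2}\sum_{k=1}^{f}\big[(\sum_j v(j)_k x(j))^2 - \sum_j (v(j)_k x(j))^2\big]$. The function names and the parameter values are made up for illustration; this is not the libFM implementation.

```python
def fm_predict_naive(x, w0, w, V):
    """Second-order FM prediction via the direct double sum over feature pairs."""
    d = len(x)
    y = w0 + sum(w[j] * x[j] for j in range(d))
    for j in range(d):
        for k in range(j + 1, d):
            dot = sum(a * b for a, b in zip(V[j], V[k]))
            y += x[j] * x[k] * dot
    return y


def fm_predict_linear(x, w0, w, V):
    """Same prediction, computed in time linear in the number of non-zeros."""
    nz = [j for j, xj in enumerate(x) if xj != 0]
    y = w0 + sum(w[j] * x[j] for j in nz)
    for k in range(len(V[0])):
        s = sum(V[j][k] * x[j] for j in nz)            # sum of v_jk * x_j
        s2 = sum((V[j][k] * x[j]) ** 2 for j in nz)    # sum of squares
        y += 0.5 * (s * s - s2)
    return y


# Hypothetical parameters for n = 2 users, m = 2 items, f = 2 latent dimensions.
x = [1, 0, 0, 1]                       # user 1 rates item 2
w0 = 3.0
w = [0.1, -0.2, 0.3, 0.4]
V = [[1.0, 0.5], [0.0, 1.0], [2.0, -1.0], [0.5, 2.0]]

naive = fm_predict_naive(x, w0, w, V)
linear = fm_predict_linear(x, w0, w, V)
```

With dummy coding only the user and item entries are non-zero, so the pairwise term reduces to the inner product of the user's and the item's latent vectors, as in matrix factorization.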

One of the most attractive aspects of FM is its high flexibility to integrate various auxiliary data via expanding the design matrix with additional columns, such as temporal information, user demographics, item descriptions and auxiliary feedbacks [16,24]. However, one of its major limitations also arises from such straightforward expansions, in particular low efficiency, because raw auxiliary data is usually much larger than the target data. That is also our motivation to design a compressed knowledge transfer solution when leveraging knowledge from auxiliary implicit feedbacks via FM. With compressed knowledge transfer, we expect to obtain more accurate predictions than FM on explicit ratings only, and to achieve more efficient knowledge transfer than FM with raw auxiliary data regarding time and space complexity. Hence, we expect to have a well-balanced solution between effectiveness and efficiency in exploiting heterogeneous feedbacks.

3. Compressed knowledge transfer via factorization machine

Our proposed compressed knowledge transfer (CKT) solution contains two major steps of compressed knowledge mining and compressed knowledge integration. Specifically, in the first step, we aim to reduce the noise effect of implicit feedbacks and then mine some compressed knowledge of user groups and item sets; and in the second step, we propose to transfer the mined compressed knowledge via feature expansion of factorization machine. We will describe these two steps in detail in the following two subsections.

3.1. Compressed knowledge mining

For convenience of notation, we replace the missing values in $E$ with 0s, and thus have a full binary matrix $E_b \in \{1,0\}^{n \times m}$. Note that we do not need to store the full binary matrix in memory, and represent it in a parsimonious way via recording the 1s only in $E_b$. For the auxiliary implicit feedbacks, we first adopt singular value decomposition (SVD; see footnote 1) to learn users' latent preferences and items' latent features,

$$U', B, V' \leftarrow \mathrm{SVD}(E_b, d), \qquad (1)$$

where $d$ means that we only keep the $d$ largest singular values and their corresponding singular vectors. Note that SVD has the effect of noise reduction [22], which is helpful for knowledge mining from the uncertain implicit feedbacks. With the factorized variables, we use $U = U' B^{1/2} \in \mathbb{R}^{n \times d}$ and $V = V' B^{1/2} \in \mathbb{R}^{m \times d}$ to denote the users' latent preferences and items' latent features, respectively.

1 http://www.mathworks.com/help/matlab/ref/svds.html

In HCR, the semantic meanings of auxiliary implicit examinations and target explicit ratings are very different, i.e., an examination record $(u, i)$ and a rating record $(u, i, r_{ui})$ represent different levels of uncertainty about the user's preferences. However, the user homophily, such as user groups, in the two data sets is usually similar, because two users that have similar browsing behaviors in the auxiliary data are likely to have similar tastes in the target data. Similarly, the item correlations behind the two types of feedbacks are also likely to be similar. With this assumption, we apply k-means clustering (see footnote 2) to the users' latent preferences $U = U' B^{1/2}$ and items' latent features $V = V' B^{1/2}$ in order to mine the user groups and item sets, respectively. We represent the mining process as follows,

$$G \leftarrow \text{k-means}(U, g), \quad S \leftarrow \text{k-means}(V, s), \qquad (2)$$

where $g$ denotes the number of user groups and $s$ denotes the number of item sets, and $G \in \{0,1\}^{n \times g}$ with $G \mathbf{1}_{g \times 1} = \mathbf{1}_{n \times 1}$ and $S \in \{0,1\}^{m \times s}$ with $S \mathbf{1}_{s \times 1} = \mathbf{1}_{m \times 1}$ are the membership matrices for users and items, respectively. The homophily and correlation among users and items are thus encoded in $G$ and $S$, because two users' (or two items') membership vectors will be the same if they belong to the same group (or set). As compared with the raw data of implicit feedbacks, the knowledge of user groups and item sets in $G$ and $S$ is much compressed, because the number of groups and sets is usually much smaller than the number of users and items, respectively, i.e., $g \ll n$ and $s \ll m$. For this reason, we call the first step of our solution compressed knowledge mining, which is illustrated in the middle part of Fig. 1 (denoted as "Compressed knowledge").
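The clustering step and the resulting hard membership matrix can be illustrated with a small Python sketch. It assumes the latent vectors (rows of $U$) are already available, uses a plain Lloyd-style k-means with caller-supplied initial centers rather than the MATLAB routine referenced in the text, and all names and numbers are hypothetical.

```python
def kmeans_labels(points, init_centers, iters=100):
    """Plain Lloyd iterations; returns one cluster index per point."""
    centers = [list(c) for c in init_centers]   # copy, do not mutate the input
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:                # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                         # keep old center if cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def membership_matrix(labels, k):
    """One-hot rows: entry (row, c) is 1 iff that user (or item) is in cluster c."""
    return [[1 if l == c else 0 for c in range(k)] for l in labels]


# Hypothetical latent preferences for n = 4 users with d = 2, and g = 2 groups.
U = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]]
labels = kmeans_labels(U, init_centers=[U[0], U[2]])
G = membership_matrix(labels, k=2)
```

Every row of `G` sums to 1, which is the constraint $G \mathbf{1}_{g \times 1} = \mathbf{1}_{n \times 1}$, and users in the same group share an identical row.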

3.2. Compressed knowledge integration

In a typical factorization model [28], we usually focus on modeling pairwise interactions between a user and an item if the corresponding (user, item, rating) triple $(u, i, r_{ui})$ is observed. However, when the explicit ratings are sparse, such individual-level interaction between a user $u$ and an item $i$, denoted as $(u \cdot i)$, may not be reliable for characterizing users' preferences. As a response, we propose to integrate and transfer the mined compressed knowledge, i.e., user group $G_u$ of user $u$ and item set $S_i$ of item $i$, to the target task of preference learning of user $u$ on item $i$ via factorization machine [4]. Specifically, we design three new types of interactions, including (i) cluster-level interaction between user groups and item sets, (ii) hybrid interaction between user groups and items (or between users and item sets), and (iii) preference adaptation from user groups in auxiliary data to users in target data (or feature adaptation from item sets to items), which are inspired by the feature engineering process of factorization machine [4] in the context of HCR.

First, we will describe the cluster-level interaction and hybrid interaction.

• Cluster-level interaction $(G_u \cdot S_i)$: Cluster-level interaction as defined on user groups and item sets is a smoothed version of individual-level interaction, which may help alleviate the sparsity problem to some extent. Specifically, $(G_u \cdot S_i)$ aims to approximate the rating $r_{ui}$ via modeling the interaction between a user group and an item set which are mined from the auxiliary implicit feedbacks. Note that the cluster-level rating pattern in [6,12,13,18] is a $g$ by $s$ non-negative matrix, which is thus different from our membership matrices, i.e., $G \in \{0,1\}^{n \times g}$ and $S \in \{0,1\}^{m \times s}$. As compared with the codebook in [6,12,13,18], the user groups $G$ and item sets $S$ in our solution are expected to transfer more knowledge and to model richer cluster-level interaction.
• Hybrid interaction $(G_u \cdot i)$ and $(u \cdot S_i)$: Hybrid interaction is a mix of individual-level interaction and cluster-level interaction. Specifically, $(G_u \cdot i)$ is for the preference of the group that user $u$ belongs to on the item $i$, and $(u \cdot S_i)$ is for user $u$'s overall preference on the set that item $i$ belongs to. Interactions between a user group and an item have been explored for recommendation with implicit feedbacks only, e.g., group preference based Bayesian personalized ranking (GBPR) [20]. However, the user groups in GBPR [20] are not mined from auxiliary data and fixed, but are randomly constructed during the learning procedure, which is thus not able to model real hybrid interaction since there is no membership matrix. Furthermore, the studied problem setting in GBPR [20] is different from our HCR in Fig. 1. Similar to the aforementioned cluster-level interaction, the hybrid interaction can also be considered as a smoothing approach and thus may help alleviate the sparsity problem.

Fig. 2. Illustration of the six pairwise interactions with mined compressed knowledge of a user group and an item set. The solid line is for individual-level interaction, the dashed lines are for cluster-level interaction and hybrid interaction, and the arrows are for preference or feature adaptation.

2 http://www.mathworks.com/help/stats/kmeans.html
3 http://www.libfm.org/

So far, we have assumed that the compressed knowledge mined from the auxiliary implicit examinations is the same as that from the target explicit ratings. However, there may still be some inconsistency between the hidden preferences of users in the two data sets. In order to mitigate such inconsistency, an adaptation is usually adopted in knowledge transfer methods [19,22]. Hence, we further propose two novel pairwise interactions, i.e., $(G_u \cdot u)$ and $(S_i \cdot i)$, in order to adapt the user preferences and item features from the auxiliary data to the target data in a principled way. Specifically, $(G_u \cdot u)$ is for the consistency modeling between the group $G_u$'s preference as reflected in the auxiliary examination records and user $u$'s preference in the target explicit ratings. Similarly, $(S_i \cdot i)$ is for the consistency modeling between the latent features of item set $S_i$ in the auxiliary data and those of item $i$ in the target data. For example, a strong interaction between group $G_u$ and user $u$ means that the group's preference and the user's preference are consistent, or that the user homophily is similar in the two data sets.

Finally, we have two families of interactions, one for preference modeling and one for preference or feature adaptation,

Preference: $(u \cdot i)$, $(G_u \cdot S_i)$, $(G_u \cdot i)$, $(u \cdot S_i)$;
Adaptation: $(G_u \cdot u)$, $(S_i \cdot i)$.

It is interesting to see that the above six interactions are actually all possible pairwise interactions among four entities, i.e., $u$, $i$, $G_u$, and $S_i$, which are shown in Fig. 2. Note that when $G_u$ and $S_i$ are hard membership vectors, i.e., one user (or item) belongs to one and only one group (or set), all these interactions can be modeled exactly via FM [24] by expanding the design matrix,

$$X \in \{0,1\}^{p \times (n+m)} \rightarrow \tilde{X} \in \{0,1\}^{p \times (n+m+g+s)}, \qquad (3)$$

where $g$ and $s$ are the number of user groups and item sets, respectively, and each original design or feature vector $x_{ui}$ in $X$ will then be extended to $\tilde{x}_{ui} \in \{0,1\}^{1 \times (n+m+g+s)}$. Note that we do not introduce normalization on the appended compressed knowledge as usually adopted by SVD++ [11], because each user (or item) belongs to only one group (or set). We may only transfer compressed knowledge of user groups or item sets rather than both, which will then result in a shorter feature vector, i.e., $\tilde{x}_{ui} \in \{0,1\}^{1 \times (n+m+g)}$ for user groups and $\tilde{x}_{ui} \in \{0,1\}^{1 \times (n+m+s)}$ for item sets. We will study the empirical performance of transferring compressed knowledge of user groups, item sets and both in Section 4.
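As a rough illustration of the expansion in Eq. (3), the Python sketch below builds one expanded feature vector and lists the pairs of active features that a second-order FM would then model. The helper names and the toy group and set assignments are assumptions for this example only.

```python
from itertools import combinations


def expand(u, i, group_of_user, set_of_item, n, m, g, s):
    """Build the expanded vector in {0,1}^(n+m+g+s) for 1-based IDs."""
    x = [0] * (n + m + g + s)
    x[u - 1] = 1                                  # user u
    x[n + i - 1] = 1                              # item i
    x[n + m + group_of_user[u] - 1] = 1           # user group G_u
    x[n + m + g + set_of_item[i] - 1] = 1         # item set S_i
    return x


def active_pairs(x):
    """All pairs of non-zero positions, i.e., the interactions FM will model."""
    nz = [j for j, xj in enumerate(x) if xj]
    return list(combinations(nz, 2))


# Hypothetical hard memberships: n = 3 users, m = 2 items, g = 2 groups, s = 2 sets.
group_of_user = {1: 1, 2: 1, 3: 2}
set_of_item = {1: 2, 2: 1}

x_tilde = expand(3, 1, group_of_user, set_of_item, n=3, m=2, g=2, s=2)
pairs = active_pairs(x_tilde)
```

With hard memberships each row has four non-zero entries, so there are exactly six active pairs, matching the four preference interactions and two adaptation interactions listed above.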

With the expanded design matrix $\tilde{X}$ that integrates compressed knowledge $G$ and $S$ via introducing several interactions for preference modeling and adaptation, we deploy the available implementation of factorization machine [24] (i.e., the libFM software; see footnote 3) for further learning and prediction,

$$\mathrm{FM}(R, G, S) \rightarrow \mathrm{FM}(\tilde{X}, r). \qquad (4)$$

3.3. Algorithm

We depict the above two major steps of compressed knowledge mining and compressed knowledge integration in Fig. 3, which contains four specific components of denoising, clustering, incorporation and factorization. Specifically, we first apply singular value decomposition to the converted binary examination matrix so as to denoise the raw examination records and extract latent variables, which are then used by k-means clustering to mine some compressed knowledge of user groups and item sets. After that, we transfer the mined compressed knowledge via integrating them into the target design matrix. Finally, we factorize the expanded design matrix via factorization machine. From Fig. 3, we can also see that our CKT-FM is quite general and flexible, because we may derive a new solution with an alternative algorithm for a typical component. For example, we may use a different clustering algorithm for the clustering component.

As for time complexity, our CKT-FM is much more efficient than FM with raw implicit feedbacks, because the design matrix in CKT-FM, i.e., $\{0,1\}^{p \times (n+m+g+s)}$, is much smaller than that in FM with raw implicit feedbacks, i.e., $\{0,1\}^{p \times (n+m+m)}$, which mimics SVD++ [11]. Furthermore, the pairwise interactions and model parameters of CKT-FM are also far fewer than those of FM and SVD++. Our empirical results on CPU time and memory usage also confirm this analysis. Note that the step of denoising via SVD is rather efficient for our implicit feedback matrix with most entries being 0. The k-means clustering algorithm is also very efficient, and we find that it converges well in fewer than 300 iterations.
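A back-of-the-envelope comparison may help make this argument tangible. The Python sketch below uses the MovieLens10M sizes reported in Table 2, while the values of `g`, `s` and `f` are hypothetical placeholders chosen only for illustration. It also assumes that FM with raw implicit feedbacks has one non-zero entry per examined item in each row (in addition to the user and item indicators), which is how FM is usually set up to mimic SVD++.

```python
import math


def fm_parameter_count(num_columns, f):
    """One global bias, plus one weight and one f-dim latent vector per column."""
    return 1 + num_columns + num_columns * f


n, m = 71567, 10681        # MovieLens10M users and items (Table 2)
p_e = 4000022              # auxiliary implicit examinations (Table 2)
g, s, f = 100, 100, 20     # hypothetical values, NOT taken from the paper

cols_ckt = n + m + g + s   # columns of the expanded design matrix in CKT-FM
cols_raw = n + m + m       # columns when raw examinations are appended

params_ckt = fm_parameter_count(cols_ckt, f)
params_raw = fm_parameter_count(cols_raw, f)

# Non-zeros per row: CKT-FM always has 4 (u, i, G_u, S_i); the raw variant has
# 2 plus the number of items examined by the user (p_e / n on average).
nnz_ckt = 4
nnz_raw = 2 + p_e / n

pairs_ckt = math.comb(nnz_ckt, 2)
pairs_raw = nnz_raw * (nnz_raw - 1) / 2
```

Under these assumptions the largest difference lies in the number of active pairwise interactions per row, which is fixed at six for CKT-FM but grows quadratically with the number of examined items in the raw variant.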


Fig. 3. The algorithm of CKT-FM (compressed knowledge transfer via factorization machine).

Table 2. Description of MovieLens10M ($n = 71{,}567$, $m = 10{,}681$) and Flixter ($n = 147{,}612$, $m = 48{,}794$) used in the experiments.

| Data set | Record number | Ratio ($p_e/p$) |
|---|---|---|
| MovieLens10M | | |
| Explicit (training) | $p = \{5, 10, 15\} \times 71{,}567$ | |
| Implicit (training) | $p_e = 4{,}000{,}022$ | 11.2, 5.6, 3.7 |
| Explicit (test) | 2,000,010 | |
| Flixter | | |
| Explicit (training) | $p = \{5, 10, 15\} \times 147{,}612$ | |
| Implicit (training) | $p_e = 3{,}278{,}431$ | 4.4, 2.2, 1.5 |
| Explicit (test) | 1,639,215 | |


4. Experimental results

4.1. Data sets and evaluation metric

MovieLens10M. MovieLens10M (see footnote 4) is a public recommendation data set with $n = 71{,}567$ users, $m = 10{,}681$ items, and 10,000,000 ratings in $\{0.5, 1, \ldots, 5\}$. As far as we know, there are no publicly available data including both explicit feedbacks and implicit feedbacks. In order to simulate the problem setting shown in Fig. 1, we follow previous works [15,30] and preprocess the data as follows. First, we randomly split the (user, item, rating) triples into five sets of equal size. Second, we take one set as target explicit feedbacks for test, two sets as target explicit feedbacks for training, and the remaining two sets as auxiliary data for training. Third, we adopt a common approach [15,30] to convert all (user, item, rating) triples in auxiliary data to (user, item) pairs as implicit feedbacks via removing the rating values. Fourth, we randomly take $5n$, $10n$ and $15n$ ratings from target explicit feedbacks, so that every user has 5, 10 and 15 ratings on average. We use these data to study the effectiveness of sparsity reduction of the proposed compressed knowledge transfer solution. We then repeat the second, third and fourth steps five times and get five copies of data in order to conduct 5-fold empirical studies.
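The four preprocessing steps can be sketched in Python as follows. This is a simplified illustration on synthetic data: the function name `make_hcr_split`, the seed handling, and the decision to drop leftover records when the data size is not divisible by five are choices made for this example, not details given in the text.

```python
import random


def make_hcr_split(triples, n, per_user=5, seed=0):
    """Split (u, i, r) triples into target training, auxiliary and test data."""
    rng = random.Random(seed)
    shuffled = triples[:]
    rng.shuffle(shuffled)

    # Step 1: five sets of equal size (leftover records are dropped here).
    fifth = len(shuffled) // 5
    parts = [shuffled[k * fifth:(k + 1) * fifth] for k in range(5)]

    # Step 2: one set for test, two for target training, two as auxiliary data.
    test = parts[0]
    target_pool = parts[1] + parts[2]

    # Step 3: remove rating values to obtain implicit (user, item) pairs.
    auxiliary = [(u, i) for u, i, _ in parts[3] + parts[4]]

    # Step 4: keep per_user * n ratings, i.e., per_user ratings per user on average.
    k = min(per_user * n, len(target_pool))
    train = rng.sample(target_pool, k)
    return train, auxiliary, test


# Synthetic example: 4 users, 25 items, every (user, item) pair rated.
n_users = 4
triples = [(u, i, float((u + i) % 5 + 1))
           for u in range(1, n_users + 1) for i in range(1, 26)]
train, auxiliary, test = make_hcr_split(triples, n=n_users, per_user=5)
```

Calling the function with different seeds and with `per_user` set to 5, 10 and 15 would give the repeated copies and sparsity levels described above.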

Flixter. Flixter (see footnote 5) [9] contains $n = 147{,}612$ users, $m = 48{,}794$ items and 8,196,077 ratings in $\{0.5, 1, \ldots, 5\}$. We preprocess this data in the same way as that of the above MovieLens10M data.

To better understand the effectiveness of the proposed knowledge transfer solution, we also calculate the ratio of the number of auxiliary examination records to the number of target explicit ratings, i.e., $p_e/p$, as shown in the last column of Table 2. We can see that the ratios of the above two data sets are quite different, which is also reflected in the prediction performance in Tables 4 and 5.

4 http://grouplens.org/datasets/movielens/.
5 http://www.cs.sfu.ca/~sja25/personal/datasets/.

A formal description of the data is shown in Table 2.

Evaluation metric. For quantitative evaluation of the effectiveness of the compressed knowledge transfer solution for rating prediction, we adopt a commonly used evaluation metric called Root Mean Square Error (RMSE),

$$\mathrm{RMSE} = \sqrt{\sum_{(u,i,r_{ui}) \in T_E} (r_{ui} - \hat{r}_{ui})^2 / |T_E|},$$

where $r_{ui}$ and $\hat{r}_{ui}$ are the true and predicted ratings, respectively, and $|T_E|$ is the number of test ratings.
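The metric is straightforward to compute. The sketch below assumes the test set is a list of (user, item, rating) triples and that `predict` is any callable returning a predicted rating; both names are illustrative.

```python
import math


def rmse(test_triples, predict):
    """Root mean square error over test triples (u, i, r_ui).

    `predict` is a callable (u, i) -> predicted rating.
    """
    triples = list(test_triples)
    if not triples:
        raise ValueError("the test set is empty")
    squared_error = sum((r - predict(u, i)) ** 2 for u, i, r in triples)
    return math.sqrt(squared_error / len(triples))
```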

4.2. Baselines and parameter settings

Our proposed knowledge transfer solution is a generic framework with four specific components of denoising, clustering, incorporation and factorization via factorization machine [24]. In order to study the effect of the proposed knowledge transfer solution more directly, we choose factorization machine, which mimics SVD++ [11], as our major baseline. Note that factorization machine is a very strong baseline, which has won several international competition awards including KDD CUP 2012 [25] and ECML PKDD Discovery Challenge 2013 [3]. Due to our major motivation of sparsity reduction for explicit ratings, we also include a smoothing method called user-based average filling (UAF), i.e., $\hat{r}_{ui} = \bar{r}_{u\cdot} = \sum_{j=1}^{m} y_{uj} r_{uj} / \sum_{j=1}^{m} y_{uj}$, where $y_{uj} = 1$ if the rating $r_{uj}$ is observed and $y_{uj} = 0$ otherwise. Furthermore, we also compare CKT-FM with a basic user-based collaborative filtering (UCF) method, i.e., $\hat{r}_{ui} = \bar{r}_{u\cdot} + \sum_{w \in \mathcal{N}_u^i} s_{wu} (r_{wi} - \bar{r}_{w\cdot}) / \sum_{w \in \mathcal{N}_u^i} s_{wu}$, where $s_{wu}$ is the Pearson correlation between user $u$ and user $w$, and $\mathcal{N}_u^i$ is a neighboring set of users w.r.t. user $u$ and item $i$. Note that we use the whole neighboring set due to the sparsity of the explicit feedbacks.
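The two non-factorization baselines can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the Pearson correlation is computed over co-rated items using each user's overall mean, only positively correlated neighbours are kept so that the denominator stays positive, and a global mean is used as a fallback. All function names are hypothetical.

```python
import math


def build_profiles(train):
    """Turn (u, i, r) triples into {u: {i: r}}."""
    profiles = {}
    for u, i, r in train:
        profiles.setdefault(u, {})[i] = r
    return profiles


def user_mean(profiles, u, default):
    ratings = profiles.get(u)
    return sum(ratings.values()) / len(ratings) if ratings else default


def pearson(ratings_u, mean_u, ratings_w, mean_w):
    """Pearson correlation over co-rated items (0.0 when undefined)."""
    common = ratings_u.keys() & ratings_w.keys()
    num = sum((ratings_u[j] - mean_u) * (ratings_w[j] - mean_w) for j in common)
    den_u = math.sqrt(sum((ratings_u[j] - mean_u) ** 2 for j in common))
    den_w = math.sqrt(sum((ratings_w[j] - mean_w) ** 2 for j in common))
    return num / (den_u * den_w) if den_u > 0 and den_w > 0 else 0.0


def predict_uaf(profiles, u, global_mean):
    """User-based average filling: the user's mean rating."""
    return user_mean(profiles, u, global_mean)


def predict_ucf(profiles, u, i, global_mean):
    """User-based CF over all users who rated item i (the whole neighbourhood)."""
    mean_u = user_mean(profiles, u, global_mean)
    ratings_u = profiles.get(u, {})
    num = den = 0.0
    for w, ratings_w in profiles.items():
        if w == u or i not in ratings_w:
            continue
        mean_w = user_mean(profiles, w, global_mean)
        s_wu = pearson(ratings_u, mean_u, ratings_w, mean_w)
        if s_wu > 0:  # assumption: positive correlations only
            num += s_wu * (ratings_w[i] - mean_w)
            den += s_wu
    return mean_u + num / den if den > 0 else mean_u
```

When no neighbour qualifies, the UCF prediction falls back to the UAF value, which is one simple way to handle the very sparse settings considered here.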

For the algorithms to solve the optimization problem in factorization machine [24], we have chosen MCMC among SGD, SGDA, ALS and MCMC as implemented in the libFM software (see footnote 6). In our preliminary studies, we find that MCMC usually generates much better results with fewer parameter configurations. Since the focus of our empirical study is on the effectiveness of the proposed knowledge transfer solution rather than on designing a new optimization algorithm, we choose the MCMC algorithm to avoid external factors such as tedious parameter configurations.
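For readers who want to reproduce a comparable setup, the sketch below assembles a libFM command line for regression with MCMC. The option names (`-task`, `-method`, `-train`, `-test`, `-dim`, `-iter`, `-init_stdev`, `-out`) are those of the libFM command-line tool; the binary location, the file names and the defaults chosen here are assumptions of the example, as is the mapping of the initialization parameter to `-init_stdev`.

```python
def libfm_command(train_path, test_path, out_path, init_stdev,
                  iterations=200, latent_dim=20, binary="./libFM"):
    """Build an argument list for a libFM regression run with MCMC.

    The paths and the binary location are placeholders; adjust them to
    the local installation.
    """
    return [
        binary,
        "-task", "r",                    # regression (rating prediction)
        "-method", "mcmc",               # MCMC inference
        "-train", train_path,
        "-test", test_path,
        "-dim", f"1,1,{latent_dim}",     # global bias, linear terms, latent dimension
        "-iter", str(iterations),
        "-init_stdev", str(init_stdev),  # standard deviation for initialization
        "-out", out_path,
    ]


print(" ".join(libfm_command("train.libfm", "test.libfm", "pred.txt", 0.15)))
```

The resulting list can be passed to `subprocess.run` once the training and test files have been written in libFM's sparse text format.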

6 http://www.libfm.org/.

In order to decide an empirically good value of the initialization parameter $\sigma$ of MCMC in FM, we construct a validation set by randomly taking 1 rating per user on average from the training data of the target explicit ratings. We then use the remaining training data to train the model with different values of $\sigma \in \{0.01, 0.05, 0.10, 0.15, 0.20, 0.25\}$. The value of $\sigma$ with the best RMSE after 200 iterations on the first copy of each data set is selected and fixed for further empirical studies. We include the prediction performance and the corresponding values of $\sigma$ in Tables 4 and 5, and report the results using different iteration numbers in $\{50, 100, 150, 200\}$ in Fig. 4. For the number of latent dimensions in the factorization machine, we fix it at 20 [32]. Note that in the first step of our solution, we use $d = 100$ in SVD as shown in Eq. (1) in order to find users' latent preferences and items' latent features, and use $g = s = 200$ in k-means as shown in Eq. (2) with 300 iterations and Euclidean distance so as to mine the user groups and item sets.
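The knowledge mining step (truncated SVD followed by k-means on users and on items) can be illustrated on a toy examination matrix. This is a sketch under several assumptions: it uses a dense NumPy SVD, which would not scale to the data sets above; it scales the singular vectors by the square roots of the singular values; and it uses a deterministic farthest-point initialization for k-means, which the text does not specify. All function names are hypothetical.

```python
import numpy as np


def farthest_point_init(points, k):
    """Deterministic initial centers: start at points[0], then repeatedly
    take the point farthest from the centers chosen so far."""
    centers = [points[0]]
    d2 = ((points - points[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(d2.argmax())
        centers.append(points[nxt])
        d2 = np.minimum(d2, ((points - points[nxt]) ** 2).sum(axis=1))
    return np.array(centers, dtype=float)


def kmeans(points, k, max_iter=300):
    """Plain Lloyd iterations with squared Euclidean distance."""
    centers = farthest_point_init(points, k)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels


def mine_compressed_knowledge(E, d, g, s):
    """Return (user group label per user, item set label per item)."""
    U, sing, Vt = np.linalg.svd(np.asarray(E, dtype=float), full_matrices=False)
    root = np.sqrt(sing[:d])
    user_features = U[:, :d] * root   # rank-d, denoised user representation
    item_features = Vt[:d].T * root   # rank-d, denoised item representation
    return kmeans(user_features, g), kmeans(item_features, s)
```

With the settings reported above, one would call `mine_compressed_knowledge(E, d=100, g=200, s=200)`, replacing the dense SVD with a sparse truncated solver for matrices of realistic size.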

For convenience of comparative studies and discussions, we denote user-based average filling (UAF) with explicit ratings as UAF(R), user-based collaborative filtering (UCF) with explicit ratings as UCF(R), factorization machine (FM) with explicit ratings as FM(R), FM with explicit ratings and raw implicit examinations as FM(R, E), compressed knowledge transfer (CKT) with explicit ratings and user groups as CKT-FM(R, G), CKT with explicit ratings and item sets as CKT-FM(R, S), and CKT with explicit ratings, user groups and item sets as CKT-FM(R, G, S).
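To make the difference between these variants concrete, the sketch below writes one training row in libFM's sparse text format for FM(R), FM(R, E) and CKT-FM(R, G, S). The feature index layout (users first, then items, then the extra block), the uniform weighting of examined items and all function names are assumptions of this illustration; the actual design matrix expansion is defined earlier in the paper.

```python
def libfm_row(rating, features):
    """Format one example as 'target idx:val idx:val ...'."""
    cols = " ".join(f"{idx}:{val:g}" for idx, val in sorted(features.items()))
    return f"{rating:g} {cols}"


def features_fm_r(u, i, n):
    """FM(R): one-hot user and one-hot item."""
    return {u: 1.0, n + i: 1.0}


def features_fm_re(u, i, n, m, examined_items):
    """FM(R, E): additionally append every item examined by user u."""
    feats = features_fm_r(u, i, n)
    for j in examined_items:
        feats[n + m + j] = 1.0 / len(examined_items)  # assumed normalization
    return feats


def features_ckt_fm(u, i, n, m, g, user_group, item_set):
    """CKT-FM(R, G, S): append only the user's group and the item's set."""
    feats = features_fm_r(u, i, n)
    feats[n + m + user_group[u]] = 1.0
    feats[n + m + g + item_set[i]] = 1.0
    return feats


n, m, g = 3, 4, 2
print(libfm_row(4.0, features_fm_r(1, 2, n)))
print(libfm_row(4.0, features_fm_re(1, 2, n, m, examined_items=[0, 3])))
print(libfm_row(4.0, features_ckt_fm(1, 2, n, m, g, {1: 0}, {2: 1})))
```

The point of the comparison is the row width: a CKT-FM row always has four non-zero entries, whereas an FM(R, E) row grows with the number of items the user examined.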

4.3. Results

We study the empirical performance of CKT-FM mainly through the following three questions. First, is CKT-FM more efficient than FM with raw implicit feedbacks, i.e., FM(R, E)? Second, is CKT-FM more accurate than FM without auxiliary implicit feedbacks, i.e., FM(R)? Third, is the compressed knowledge useful? Specifically, we answer the first question in Section 4.3.1, the second question in Sections 4.3.1–4.3.3, and the third question in Section 4.3.4.

Fig. 4. Prediction performance of CKT-FM and FM on MovieLens10M and Flixter with different iteration numbers. We adopt the implementation of FM w/o block structure [24] in the experiments. Note that the number in each parentheses denotes the number of ratings per user on average. [Plots omitted: each panel shows RMSE against iteration number (50 to 200) for FM(R) and CKT-FM(R, G, S).]

4.3.1. Main results

In our preliminary studies, we find that CKT-FM(R, G, S) is much more efficient than FM(R, E), which verifies our complexity analysis in Section 3.3. In order to study the efficiency issue more precisely, we control the computing environment when measuring the CPU time and memory usage. Specifically, we conduct experiments on Windows Server 2008 with an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz (1-CPU/4-core) and 12 GB RAM, where all other non-system processes are terminated due to the high space complexity of FM(R, E). We record the CPU time and memory usage of running FM(R), FM(R, E) and CKT-FM(R, G, S) on the first copy of MovieLens10M and Flixter, and report the results in Table 3. The reported CPU time and memory usage cover the FM part only, excluding the steps of SVD and k-means, because we focus on studying the efficiency of compressed knowledge transfer as compared with that of leveraging the raw implicit feedbacks. Another reason that we do not include the cost of SVD and k-means is that they are implemented in MATLAB, which may not be comparable with the C++ implementation of FM. Note that the libFM software has two implementations, i.e., (i) FM without block structure [24] and (ii) FM with block structure [26], where the latter exploits the repeating patterns of the design matrix and is thus of low time and space complexity.

Table 3. CPU time and memory usage on running factorization machine (FM) with explicit ratings, i.e., FM(R), FM with explicit ratings and raw implicit examinations, i.e., FM(R, E), and FM with explicit ratings and compressed knowledge, i.e., CKT-FM(R, G, S), on the first copy of MovieLens10M and Flixter. The number of iterations is fixed as 200. The number of ratings per user on average is 10.

| Implementation | Data | Algorithm | CPU time (min.) | Memory usage (GB) |
|---|---|---|---|---|
| FM w/o block structure [24] | MovieLens10M | FM(R) | 15 | 0.16 |
| FM w/o block structure [24] | MovieLens10M | FM(R, E) | 915 | 4.1 |
| FM w/o block structure [24] | MovieLens10M | CKT-FM(R, G, S) | 24 | 0.20 |
| FM w/o block structure [24] | Flixter | FM(R) | 20 | 0.18 |
| FM w/o block structure [24] | Flixter | FM(R, E) | 2345 | 9.7 |
| FM w/o block structure [24] | Flixter | CKT-FM(R, G, S) | 34 | 0.23 |
| FM w/ block structure [26] | MovieLens10M | FM(R) | 6 | 0.18 |
| FM w/ block structure [26] | MovieLens10M | FM(R, E) | 17 | 0.22 |
| FM w/ block structure [26] | MovieLens10M | CKT-FM(R, G, S) | 7 | 0.18 |
| FM w/ block structure [26] | Flixter | FM(R) | 13 | 0.21 |
| FM w/ block structure [26] | Flixter | FM(R, E) | 24 | 0.25 |
| FM w/ block structure [26] | Flixter | CKT-FM(R, G, S) | 13 | 0.21 |

Table 4. Prediction performance of CKT-FM and other methods on MovieLens10M. The number in each parentheses denotes the number of ratings per user on average. The bold numbers denote the corresponding best results.

| Data | Algorithm | Parameter | RMSE |
|---|---|---|---|
| MovieLens10M (5) | UAF(R) | - | 1.0635 ± 0.0009 |
| MovieLens10M (5) | UCF(R) | - | 1.0384 ± 0.0006 |
| MovieLens10M (5) | FM(R) | $\sigma = 0.25$ | 0.8971 ± 0.0008 |
| MovieLens10M (5) | CKT-FM(R, G) | $\sigma = 0.25$ | 0.8927 ± 0.0006 |
| MovieLens10M (5) | CKT-FM(R, S) | $\sigma = 0.20$ | 0.8901 ± 0.0005 |
| MovieLens10M (5) | CKT-FM(R, G, S) | $\sigma = 0.20$ | 0.8868 ± 0.0008 |
| MovieLens10M (5) | FM(R, E) | $\sigma = 0.20$ | 0.8826 ± 0.0006 |
| MovieLens10M (10) | UAF(R) | - | 1.0280 ± 0.0007 |
| MovieLens10M (10) | UCF(R) | - | 0.9539 ± 0.0005 |
| MovieLens10M (10) | FM(R) | $\sigma = 0.20$ | 0.8707 ± 0.0008 |
| MovieLens10M (10) | CKT-FM(R, G) | $\sigma = 0.20$ | 0.8667 ± 0.0010 |
| MovieLens10M (10) | CKT-FM(R, S) | $\sigma = 0.15$ | 0.8659 ± 0.0007 |
| MovieLens10M (10) | CKT-FM(R, G, S) | $\sigma = 0.15$ | 0.8618 ± 0.0008 |
| MovieLens10M (10) | FM(R, E) | $\sigma = 0.15$ | 0.8564 ± 0.0006 |
| MovieLens10M (15) | UAF(R) | - | 1.0111 ± 0.0007 |
| MovieLens10M (15) | UCF(R) | - | 0.9233 ± 0.0007 |
| MovieLens10M (15) | FM(R) | $\sigma = 0.20$ | 0.8550 ± 0.0005 |
| MovieLens10M (15) | CKT-FM(R, G) | $\sigma = 0.15$ | 0.8505 ± 0.0007 |
| MovieLens10M (15) | CKT-FM(R, S) | $\sigma = 0.15$ | 0.8503 ± 0.0005 |
| MovieLens10M (15) | CKT-FM(R, G, S) | $\sigma = 0.15$ | 0.8462 ± 0.0007 |
| MovieLens10M (15) | FM(R, E) | $\sigma = 0.15$ | 0.8409 ± 0.0005 |

Table 5. Prediction performance of CKT-FM and other methods on Flixter. The number in each parentheses denotes the number of ratings per user on average. The bold numbers denote the corresponding best results.

| Data | Algorithm | Parameter | RMSE |
|---|---|---|---|
| Flixter (5) | UAF(R) | - | 0.9534 ± 0.0012 |
| Flixter (5) | UCF(R) | - | 0.9498 ± 0.0011 |
| Flixter (5) | FM(R) | $\sigma = 0.20$ | 0.9035 ± 0.0010 |
| Flixter (5) | CKT-FM(R, G) | $\sigma = 0.20$ | 0.9027 ± 0.0010 |
| Flixter (5) | CKT-FM(R, S) | $\sigma = 0.15$ | 0.8968 ± 0.0007 |
| Flixter (5) | CKT-FM(R, G, S) | $\sigma = 0.15$ | 0.8937 ± 0.0008 |
| Flixter (5) | FM(R, E) | $\sigma = 0.20$ | 0.8969 ± 0.0008 |
| Flixter (10) | UAF(R) | - | 0.9379 ± 0.0010 |
| Flixter (10) | UCF(R) | - | 0.9242 ± 0.0008 |
| Flixter (10) | FM(R) | $\sigma = 0.15$ | 0.8753 ± 0.0010 |
| Flixter (10) | CKT-FM(R, G) | $\sigma = 0.15$ | 0.8747 ± 0.0008 |
| Flixter (10) | CKT-FM(R, S) | $\sigma = 0.15$ | 0.8711 ± 0.0009 |
| Flixter (10) | CKT-FM(R, G, S) | $\sigma = 0.15$ | 0.8687 ± 0.0008 |
| Flixter (10) | FM(R, E) | $\sigma = 0.15$ | 0.8705 ± 0.0008 |
| Flixter (15) | UAF(R) | - | 0.9309 ± 0.0009 |
| Flixter (15) | UCF(R) | - | 0.9115 ± 0.0008 |
| Flixter (15) | FM(R) | $\sigma = 0.15$ | 0.8598 ± 0.0008 |
| Flixter (15) | CKT-FM(R, G) | $\sigma = 0.15$ | 0.8591 ± 0.0007 |
| Flixter (15) | CKT-FM(R, S) | $\sigma = 0.10$ | 0.8568 ± 0.0008 |
| Flixter (15) | CKT-FM(R, G, S) | $\sigma = 0.10$ | 0.8549 ± 0.0008 |
| Flixter (15) | FM(R, E) | $\sigma = 0.15$ | 0.8561 ± 0.0007 |

From the quantitative results of CPU time and memory usage in Table 3, we can have the following observations:

• Factorization machine with explicit ratings, i.e., FM(R), is the most efficient one, which is consistent with our analysis on the relationship between the efficiency and the size of the design matrix and the number of model parameters. The time and space cost of CKT-FM(R, G, S) is only slightly higher than that of FM(R) and is much lower than that of FM(R, E), which clearly shows that our compressed knowledge transfer solution is very efficient.
• The CPU time and memory usage of FM(R, E) on Flixter increase as compared with those on MovieLens10M. The reason is mainly that there are more items in Flixter as shown in Table 2, and according to the formula of FM (which mimics SVD++ [11]), each item examined by a certain user is appended to the design matrix, resulting in a much larger number of pairwise interactions and model parameters.
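The size of the efficiency gap can be read off Table 3 with a little arithmetic. The snippet below uses the rows for FM w/o block structure [24]; the dictionary layout and the helper name `cost_ratio` are choices made for this illustration.

```python
# (CPU time in minutes, memory in GB) from Table 3, FM w/o block structure [24].
TABLE3_NO_BLOCK = {
    "MovieLens10M": {
        "FM(R)": (15, 0.16),
        "FM(R, E)": (915, 4.1),
        "CKT-FM(R, G, S)": (24, 0.20),
    },
    "Flixter": {
        "FM(R)": (20, 0.18),
        "FM(R, E)": (2345, 9.7),
        "CKT-FM(R, G, S)": (34, 0.23),
    },
}


def cost_ratio(rows, numerator, denominator):
    """Return (CPU time ratio, memory ratio) of two algorithms."""
    (t_num, m_num), (t_den, m_den) = rows[numerator], rows[denominator]
    return t_num / t_den, m_num / m_den


for data, rows in TABLE3_NO_BLOCK.items():
    t, mem = cost_ratio(rows, "FM(R, E)", "CKT-FM(R, G, S)")
    print(f"{data}: FM(R, E) uses {t:.1f}x the CPU time "
          f"and {mem:.1f}x the memory of CKT-FM(R, G, S)")
```

Without block structure, FM(R, E) needs roughly 38 times (MovieLens10M) and 69 times (Flixter) the CPU time of CKT-FM(R, G, S), while CKT-FM(R, G, S) stays within a factor of about 1.7 of FM(R).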

We then conduct extensive empirical studies of UAF(R), UCF(R) and FM(R), and three variants of CKT-FM, i.e., CKT-FM(R, G), CKT-FM(R, S) and CKT-FM(R, G, S). We also include the results of FM(R, E) for reference although it takes much more time as shown in Table 3. The number of iterations is fixed as 200. We adopt the implementation of FM w/o block structure [24] in the experiments. We report the results in Tables 4 and 5, from which we can have the following observations:

• Factorization based methods are much better than the smoothing method (i.e., UAF) and the memory-based method (i.e., UCF), which shows that factorization machine is indeed a very strong baseline and is also consistent with various previous works.
• CKT-FM(R, G), CKT-FM(R, S) and CKT-FM(R, G, S) are all better than FM(R), UAF(R) and UCF(R), which clearly shows the usefulness of the shared compressed knowledge and the effectiveness of our knowledge transfer approach.
• CKT-FM(R, G, S) further improves the performance over CKT-FM(R, G) and CKT-FM(R, S), which shows that the compressed knowledge of user groups and that of item sets are complementary for the learning task on the target rating data.
• FM(R, E) performs well on both data sets w.r.t. the prediction accuracy (i.e., the best on MovieLens10M and among the best on Flixter) as expected. However, the time and space cost of FM(R, E) is high, while our CKT-FM(R, G, S) strikes a good balance between efficiency and effectiveness. Furthermore, our CKT-FM(R, G, S) performs better than FM(R, E) on Flixter, which shows that the step of noise reduction is helpful and that the compressed knowledge can be more useful than the raw implicit feedbacks.
• The prediction performance shows that the improvement from knowledge transfer is larger on MovieLens10M than on Flixter, which is consistent with the relative size of the ratios of auxiliary examination records to target explicit ratings, i.e., $p_e/p \in \{11.2, 5.6, 3.7\}$ for MovieLens10M and $p_e/p \in \{4.4, 2.2, 1.5\}$ for Flixter as shown in Table 2. Also, the improvement in cases with fewer ratings (e.g., 5 ratings per user on average) is larger than that in cases with more ratings (e.g., 15 ratings per user on average), which shows that our CKT-FM is helpful for sparsity alleviation in the target explicit ratings.
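The last observation can be checked directly against the mean RMSE values in Tables 4 and 5. The snippet below copies the FM(R) and CKT-FM(R, G, S) rows and computes the absolute RMSE reduction; the data structure and the helper name `rmse_gain` are illustrative.

```python
# Mean RMSE of (FM(R), CKT-FM(R, G, S)) from Tables 4 and 5,
# keyed by the average number of ratings per user.
RMSE = {
    "MovieLens10M": {5: (0.8971, 0.8868), 10: (0.8707, 0.8618), 15: (0.8550, 0.8462)},
    "Flixter": {5: (0.9035, 0.8937), 10: (0.8753, 0.8687), 15: (0.8598, 0.8549)},
}


def rmse_gain(data, ratings_per_user):
    """Absolute RMSE reduction of CKT-FM(R, G, S) over FM(R)."""
    fm, ckt = RMSE[data][ratings_per_user]
    return fm - ckt


for data in RMSE:
    gains = ", ".join(f"{k}: {rmse_gain(data, k):.4f}" for k in (5, 10, 15))
    print(f"{data} -> {gains}")
```

The reductions are 0.0103, 0.0089 and 0.0088 on MovieLens10M against 0.0098, 0.0066 and 0.0049 on Flixter, so the gain is larger on MovieLens10M at every sparsity level and largest with 5 ratings per user on both data sets.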


Overall, the results in Tables 3–5 show that transferring compressed knowledge of user groups and item sets from auxiliary implicit feedbacks to target explicit feedbacks is helpful, and that the proposed knowledge transfer solution via factorization machine is both efficient and effective.

4.3.2. Results with different iteration numbers

The results of CKT-FM and FM with different iteration numbers are shown in Fig. 4, from which we can have the following observations:

• Both CKT-FM(R, G, S) and FM(R) converge smoothly within about 200 iterations in most cases, and the prediction performance of each approach improves as the number of ratings per user on average increases in both data sets.
• CKT-FM(R, G, S) is significantly better than FM(R) on all iteration numbers in almost all cases (except when the iteration number is smaller than 100 on Flixter with 15 ratings per user on average), which again shows the advantages of CKT-FM(R, G, S) with compressed knowledge transfer.
• The special case, where the iteration number is smaller than 100 on Flixter with 15 ratings per user on average, arises from the inconsistency of the compressed knowledge of user groups and item sets between the auxiliary implicit examinations and the target explicit ratings when the learning is not yet sufficient.

4.3.3. Results on different user segmentations

In order to better understand the performance gain of our CKT-FM over the major baseline method FM, we analyze the performance of each method on different user segmentations. Specifically, we construct eight and twelve user segmentations w.r.t. different numbers of ratings in the test data, which are shown in the two tables in Figs. 5 and 6. Note that due to the random generation of training data and test data as described in Section 4.1, the distributions of user segmentations of training data and test data are similar. The results of CKT-FM and FM on different user segmentations are shown in Figs. 5 and 6, from which we can have the following observations:

• The results on active users (who have rated more items) are better than those on inactive users, and the overall performance in cases with more ratings per user on average (e.g., 15) is better than that in cases with fewer (e.g., 5 or 10), which is consistent with observations in other works [13,22].
• CKT-FM(R, G, S) is better than FM(R) on all user segmentations in all cases, which again clearly shows the advantages of our proposed knowledge transfer approach.

Fig. 5. Prediction performance of CKT-FM and FM on the first copy of MovieLens10M for different user segmentations. The number of iterations is fixed as 200. We adopt the implementation of FM w/o block structure [24] in the experiments. Note that the number in each parentheses denotes the number of ratings per user on average.

Fig. 6. Prediction performance of CKT-FM and FM on the first copy of Flixter for different user segmentations. The number of iterations is fixed as 200. We adopt the implementation of FM w/o block structure [24] in the experiments. Note that the number in each parentheses denotes the number of ratings per user on average.

In summary, the results in Figs. 4–6 clearly show that our CKT-FM converges smoothly and performs significantly better than the method using explicit feedbacks only in all cases, including different user segmentations on data with different levels of sparsity.

4.3.4. Improvement from compressed knowledge

In this section, we study two questions about the superior prediction performance of CKT-FM, in particular that of the compressed knowledge. First, is the performance improvement of CKT-FM over FM simply from using more model parameters rather than the mined knowledge? Note that the numbers of model parameters of CKT-FM and FM are $(n + m + g + s) \times f + f + 1$ and $(n + m) \times f + f + 1$, respectively, where $f$ is the latent dimension of FM. Second, is the denoising step via singular value decomposition helpful for mining high quality compressed knowledge?

In order to answer the first question, we conduct additional experiments with a comparable number of model parameters. Specifically, we fix $f = 20$ in CKT-FM as before and use $f = 21$ in FM, where the total number of model parameters in CKT-FM is now slightly fewer than that in FM. The prediction performance is shown in Table 6, from which we can clearly see that simply using more model parameters does not help much. Hence, the answer to the first question is no, i.e., the performance improvement is not from using more model parameters, but from the compressed knowledge of user groups and item sets.

Table 6. Prediction performance of CKT-FM and FM with comparable numbers of model parameters on MovieLens10M and Flixter. The number of iterations is fixed as 200. The number of ratings per user on average is 10. We adopt the implementation of FM w/o block structure [24] in the experiments. The bold numbers denote the corresponding best results.

| Data | Algorithm | Latent dimension | Parameter | RMSE |
|---|---|---|---|---|
| MovieLens10M | FM(R) | $f = 20$ | $\sigma = 0.20$ | 0.8707 ± 0.0008 |
| MovieLens10M | FM(R) | $f = 21$ | $\sigma = 0.20$ | 0.8706 ± 0.0004 |
| MovieLens10M | CKT-FM(R, G, S) | $f = 20$ | $\sigma = 0.15$ | 0.8618 ± 0.0008 |
| Flixter | FM(R) | $f = 20$ | $\sigma = 0.15$ | 0.8753 ± 0.0010 |
| Flixter | FM(R) | $f = 21$ | $\sigma = 0.15$ | 0.8753 ± 0.0007 |
| Flixter | CKT-FM(R, G, S) | $f = 20$ | $\sigma = 0.15$ | 0.8687 ± 0.0008 |
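The comparable-parameter setup can be verified numerically with the count formula as stated in the text, the data set sizes from Section 4.1 and $g = s = 200$. The function name `num_parameters` is hypothetical.

```python
def num_parameters(n, m, f, g=0, s=0):
    """Model parameter count as stated in the text:
    (n + m + g + s) * f + f + 1, with g = s = 0 for plain FM."""
    return (n + m + g + s) * f + f + 1


DATA = {"MovieLens10M": (71_567, 10_681), "Flixter": (147_612, 48_794)}

for name, (n, m) in DATA.items():
    ckt = num_parameters(n, m, f=20, g=200, s=200)
    fm20 = num_parameters(n, m, f=20)
    fm21 = num_parameters(n, m, f=21)
    print(f"{name}: FM f=20 {fm20:,} | CKT-FM f=20 {ckt:,} | FM f=21 {fm21:,}")
```

On both data sets the CKT-FM count with $f = 20$ falls between the two FM counts, so FM with $f = 21$ indeed has slightly more parameters than CKT-FM.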

For the second question on the effect of denoising, we conduct comparative studies between CKT-FM with and without singular value decomposition for noise reduction. Specifically, we remove step 1.1 in Algorithm 3 and revise step 1.2 by clustering on $E$ and $E^T$ (instead of on U0B^{1/2} and V0B^{1/2}), to obtain user groups and item sets, respectively. The prediction performance is reported in Table 7, from which we can see that removing the denoising step hurts the performance. Hence, the step of noise reduction is indeed helpful for mining high quality compressed knowledge, which gives a positive answer to the second question.

Table 7. Prediction performance of CKT-FM with and without the denoising step on MovieLens10M and Flixter. The number of iterations is fixed as 200. The number of ratings per user on average is 10. We adopt the implementation of FM w/o block structure [24] in the experiments. The bold numbers denote the corresponding best results.

| Data | Algorithm | Denoise | Parameter | RMSE |
|---|---|---|---|---|
| MovieLens10M | CKT-FM(R, G, S) | No | $\sigma = 0.20$ | 0.8642 ± 0.0007 |
| MovieLens10M | CKT-FM(R, G, S) | Yes | $\sigma = 0.15$ | 0.8618 ± 0.0008 |
| Flixter | CKT-FM(R, G, S) | No | $\sigma = 0.15$ | 0.8697 ± 0.0008 |
| Flixter | CKT-FM(R, G, S) | Yes | $\sigma = 0.15$ | 0.8687 ± 0.0008 |

5. Related work

Considering the studied problem setting and the proposed solution in this paper, we discuss some existing works on three closely related topics, including collaborative recommendation, heterogeneous collaborative recommendation, and transfer learning in collaborative recommendation.

5.1. Collaborative recommendation

Collaborative recommendation techniques are usually categorized into memory-based methods, model-based methods and hybrid methods [2]. Memory-based methods include two similar variants of user-based and item-based recommendation approaches, where the user-based approach predicts a user's preference on an item by aggregating his or her neighbors' preferences on the item. Model-based methods, e.g., matrix factorization based algorithms [11,24,28], usually learn some latent user preferences and item features with the assumption that the observed ratings are generated by such latent variables. Hybrid methods include monolithic, parallelized and pipelined variants with different ways of hybridization of some basic recommendation techniques [10].

Model-based methods are able to capture the hidden correlation among users and items and usually perform better in open competitions [11]. Memory-based methods are associated with good interpretability and maintainability, and are thus also quite popular in real deployment [14]. However, most collaborative recommendation methods are for homogeneous feedbacks, such as numerical ratings, and very few works focus on heterogeneous feedbacks such as the implicit and explicit feedbacks shown in Fig. 1.

5.2. Heterogeneous collaborative recommendation

In a recent work, a collective matrix factorization (CMF) [29] based approach is proposed to exploit both explicit feedbacks and implicit feedbacks [15], which introduces a scaling or normalization process to mitigate the heterogeneity of the two types of feedbacks. However, using the same user-specific latent preference matrix U and item-specific latent feature matrix V in CMF [15,29] for both explicit ratings and implicit examinations may still not capture the preference difference well. The expectation–maximization collaborative filtering (EMCF) algorithm [30] proposes to estimate graded preferences of implicit feedbacks iteratively, which can then be added to the explicit feedbacks. However, such an iterative solution may not be efficient for large data, especially when there are lots of raw implicit feedbacks. The SVD++ model [11] or the equivalent constrained PMF model [28] is a principled approach for modeling explicit and implicit feedbacks simultaneously by extending the basic matrix factorization method on explicit ratings with interactions from implicit examinations, which could be mimicked by factorization machine [24]. Hence, for fair comparison of both accuracy and efficiency, we use factorization machine [24], i.e., FM(R, E), in our empirical studies.
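For reference, the SVD++ prediction rule from [11] is commonly written as below. The symbols follow Koren's notation rather than this paper's and are introduced here only for illustration: $\mu$ is the global average rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are latent vectors, $N(u)$ is the set of items on which user $u$ left implicit feedback, and $y_j$ is the implicit-feedback vector of item $j$.

```latex
\hat{r}_{ui} = \mu + b_u + b_i
  + q_i^{\top} \Big( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \Big)
```

The sum over $N(u)$ is the term that grows with the number of examined items, which is the source of the cost of FM(R, E) discussed in Section 4.3.1.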

There are also some algorithms designed for specific applications, such as sequential radio channel recommendation by exploiting users' explicit and implicit feedbacks [17]. Note that the explicit feedbacks in [17] are different from the numerical ratings in our HCR, and the reinforcement learning algorithm is also not applicable to our studied problem.

5.3. Transfer learning in collaborative recommendation

Transfer learning [19] aims to improve a learning task in some target data by transferring knowledge from some related learning tasks or some related auxiliary data. Transfer learning in collaborative recommendation is a new and active research area and has witnessed significant improvement of recommendation performance in several different recommendation scenarios, including recommendation without mappings between entities in two data sets [12], recommendation with two-side implicit feedbacks [22], recommendation with frontal-side binary explicit feedbacks [21,23], etc. In this paper, we study a new problem setting, i.e., recommendation with frontal-side implicit feedbacks, which is associated with few existing works.

From the perspective of “how to transfer” in transfer learning [19], existing transfer learning approaches include adaptive, collective and integrative algorithm styles. Typically, an integrative approach introduces richer interactions between the target data and auxiliary data than adaptive and collective ones, and can usually achieve better recommendation performance. Our CKT-FM is such an integrative approach since the mined compressed knowledge is incorporated into the target learning task as a whole.

From the perspective of “what to transfer” in transfer learning [19], previous works share different types of knowledge, including covariance [1], codebook [6,12,13,18], latent features [5,8,29], etc. Note that the covariance, codebook and latent features can also be considered as some type of compressed knowledge since the raw auxiliary data are not preserved. In this paper, the mined user groups and item sets are a new type of compressed knowledge, which has not been explored before by works on transfer learning in collaborative recommendation.

In summary, we have designed a novel transfer learning solution for a new recommendation problem, i.e., an integrative transfer learning approach via factorization machine with compressed knowledge for HCR as shown in Fig. 1.

6. Conclusions and future work

In this paper, we have proposed a novel and generic solution called compressed knowledge transfer via factorization machine (CKT-FM) for heterogeneous collaborative recommendation (HCR). Our solution contains two major steps: mining compressed knowledge of user groups and item sets, and integrating the compressed knowledge via design matrix expansion in factorization machine. Extensive experimental studies on two large data sets with different levels of sparsity show that our CKT-FM is significantly better than the state-of-the-art non-transfer learning method, and is much more efficient than the method leveraging raw implicit feedbacks.

For future work, we are mainly interested in designing a single unified optimization function for compressed knowledge mining and integration in order to further improve the knowledge sharing process. We are also interested in generalizing CKT-FM from pointwise regression to pairwise or listwise ranking, aiming to optimize the top-k recommended items in a more direct manner [27,31], and in studying its performance on some real industry data.

Acknowledgements

We are grateful for the support of Natural Science Foundation of Guangdong Province No. 2014A030310268, Natural Science Foundation of SZU No. 201436, National Natural Science Foundation of China (NSFC) Nos. 61170077 and 61272303, NSF GD No. 10351806001000000, GD S&T No. 2012B091100198, S&T projects of SZ Nos. JCYJ20130326110956468 and JCYJ20120613102030248, and National Basic Research Program of China (973 Plan) No. 2010CB327903. We are also thankful to the handling Editor and Reviewers for their constructive and expert comments.

References

[1] Ryan P. Adams, George E. Dahl, Iain Murray, Incorporating side informationinto probabilistic matrix factorization using Gaussian processes, in:Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence,UAI’10, 2010, pp. 1–9.

[2] Gediminas Adomavicius, Alexander Tuzhilin, Toward the next generation ofrecommender systems: a survey of the state-of-the-art and possibleextensions, IEEE Trans. Knowl. Data Eng. 17 (2005) 734–749.

[3] Imannuel Bayer, Steffen Rendle, Factor models for recommending givennames, in: ECML PKDD Discovery Challenge Workshop, 2013.

[4] Chih-Chung Chang, Chih-Jen Lin, LIBSVM: a library for support vectormachines, ACM Trans. Intell. Syst. Technol. (ACM TIST) 2 (3) (2011) 27:1–27:27.

[5] Sotirios Chatzis, Nonparametric bayesian multitask collaborative filtering, in:Proceedings of the 22nd ACM International Conference on Information andKnowledge Management, CIKM’13, 2013, pp. 2149–2158.

[6] Sheng Gao, Hao Luo, Da Chen, Shantao Li, Patrick Gallinari, Jun Guo, Cross-domain recommendation via cluster-level latent factor model, in: Proceedingsof the 2013 European Conference on Machine Learning and KnowledgeDiscovery in Databases – Part II, 2013, pp. 161–176.

[7] Liang Hu, Jian Cao, Guandong Xu, Longbing Cao, Zhiping Gu, Can Zhu,Personalized recommendation via cross-domain triadic factorization, in:Proceedings of the 22nd International Conference on World Wide Web,WWW’13, 2013, pp. 595–606.

[8] Liang Hu, Jian Cao, Guandong Xu, Jie Wang, Zhiping Gu, Longbing Cao, Cross-domain collaborative filtering via bilinear multilevel analysis, in: Proceedingsof the 23rd International Joint Conference on Artificial Intelligence, IJCAI’13,2013, pp. 2626–2632.

[9] Mohsen Jamali, Martin Ester, A matrix factorization technique with trustpropagation for recommendation in social networks, in: Proceedings of the 4thACM Conference on Recommender Systems, RecSys’10, 2010, pp. 135–142.

[10] Dietmar Jannach, Markus Zanker, Alexander Felfernig, Gerhard Friedrich,Recommender Systems: An Introduction, first ed., Cambridge University Press,New York, NY, USA, 2010.

[11] Yehuda Koren, Factorization meets the neighborhood: a multifacetedcollaborative filtering model, in: Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining, KDD’08,2008, pp. 426–434.

[12] Bin Li, Qiang Yang, Xiangyang Xue, Can movies and books collaborate? Cross-domain collaborative filtering for sparsity reduction, in: Proceedings of the21st International Joint Conference on Artificial Intelligence, IJCAI’09, 2009, pp.2052–2057.

[13] Bin Li, Qiang Yang, Xiangyang Xue, Transfer learning for collaborative filteringvia a rating-matrix generative model, in: Proceedings of the 26th AnnualInternational Conference on Machine Learning, ICML’09, 2009, pp. 617–624.

[14] Greg Linden, Brent Smith, Jeremy York, Amazon.com recommendations: item-to-item collaborative filtering, IEEE Internet Comput. 7 (1) (2003) 76–80.

[15] Nathan N. Liu, Evan W. Xiang, Min Zhao, Qiang Yang, Unifying explicit andimplicit feedback for collaborative filtering, in: Proceedings of the 19th ACMInternational Conference on Information and Knowledge Management,CIKM’10, 2010, pp. 1445–1448.

[16] Babak Loni, Yue Shi, Martha A Larson, Alan Hanjalic, Cross-domaincollaborative filtering with factorization machines, in: Proceedings of the36th European Conference on Information Retrieval, ECIR’14, April 2014.

[17] Omar Moling, Linas Baltrunas, Francesco Ricci, Optimal radio channelrecommendations with explicit and implicit feedback, in: Proceedings of the6th ACM Conference on Recommender Systems, RecSys’12, 2012, pp. 75–82.

[18] Orly Moreno, Bracha Shapira, Lior Rokach, Guy Shani, Talmud: transferlearning for multiple domains, in: Proceedings of the 21st ACM InternationalConference on Information and Knowledge Management, CIKM’12, 2012, pp.425–434.

[19] Sinno Jialin Pan, Qiang Yang, A survey on transfer learning, IEEE Trans. Knowl.Data Eng. 22 (10) (2010) 1345–1359.

[20] Weike Pan, Li Chen, GBPR: group preference based bayesian personalizedranking for one-class collaborative filtering, in: Proceedings of the 23rdInternational Joint Conference on Artificial Intelligence, IJCAI’13, 2013, pp.2691–2697.

[21] Weike Pan, Nathan N. Liu, Evan W. Xiang, Qiang Yang, Transfer learning topredict missing ratings via heterogeneous user feedbacks, in: Proceedings ofthe 22nd International Joint Conference on Artificial Intelligence, July 2011, pp.2318–2323.

[22] Weike Pan, Evan W. Xiang, Nathan N. Liu, Qiang Yang, Transfer learning in collaborative filtering for sparsity reduction, in: Proceedings of the 24th AAAI Conference on Artificial Intelligence, AAAI’10, 2010, pp. 230–235.

[23] Weike Pan, Qiang Yang, Transfer learning in heterogeneous collaborative filtering domains, Artif. Intell. 197 (2013) 39–55.

[24] Steffen Rendle, Factorization machines with LIBFM, ACM Trans. Intell. Syst. Technol. (ACM TIST) 3 (3) (2012) 57:1–57:22.

[25] Steffen Rendle, Social network and click-through prediction with factorization machines, in: KDD-Cup Workshop, 2012.

[26] Steffen Rendle, Scaling factorization machines to relational data, in: Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB’13, VLDB Endowment, 2013, pp. 337–348.

[27] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, Lars Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI’09, 2009, pp. 452–461.

[28] Ruslan Salakhutdinov, Andriy Mnih, Probabilistic matrix factorization, in: Annual Conference on Neural Information Processing Systems, vol. 20, MIT Press, 2008, pp. 1257–1264.

[29] Ajit P. Singh, Geoffrey J. Gordon, Relational learning via collective matrix factorization, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’08, 2008, pp. 650–658.

[30] Bin Wang, Mohammadreza Rahimi, Dequan Zhou, Xin Wang, Expectation-maximization collaborative filtering with explicit and implicit feedback, in: Proceedings of the 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining – Volume Part I, PAKDD’12, 2012, pp. 604–616.

[31] Markus Weimer, Alexandros Karatzoglou, Alex Smola, Improving maximum margin matrix factorization, in: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases – Part I, ECML PKDD’08, 2008, pp. 14–14.

[32] Tom Chao Zhou, Hao Ma, Irwin King, Michael R. Lyu, TagRec: leveraging tagging wisdom for recommendation, in: Proceedings of the 2009 International Conference on Computational Science and Engineering, vol. 04, 2009, pp. 194–199.