
Generating Synthetic Data for Context-Aware Recommender Systems

Marden Pasinato COPPE/UFRJ

Rio de Janeiro, Brazil [email protected]

Carlos Eduardo Mello UFRRJ


Marie-Aude Aufaure MAS/ECP

Paris, France [email protected]

Geraldo Zimbrão COPPE/UFRJ


Abstract—Context-Aware Recommender Systems (CARS) have emerged as a way of providing more precise and interesting recommendations by using data about the context in which consumers buy goods and services. CARS consider not only the ratings given to items by consumers (users), but also the context attributes related to these ratings. Several algorithms and methods have been proposed in the literature to deal with context-aware ratings. Although there are many proposals and approaches for this kind of recommendation, adequate public datasets containing users’ context-aware ratings of items are scarce, and the few available are usually not large enough to evaluate the proposed CARS properly. One solution is to crawl this kind of data from e-commerce websites. However, crawling can be very time-consuming and is complicated by legal and privacy issues. In addition, crawled data from e-commerce websites may not be enough for a complete evaluation, since it cannot simulate all possible user behaviors and characteristics. In this article, we propose a methodology to generate synthetic datasets for context-aware recommender systems, enabling researchers and developers to create their own datasets according to the characteristics against which they want to evaluate their algorithms and methods. Our methodology enables researchers to define users’ rating behavior through the Probability Distribution Function (PDF) associated with their profiles.

Keywords—Synthetic Data Generator; Context-Aware Recommender Systems; Data Mining

I. INTRODUCTION

Over the last decade, Recommender Systems have been widely studied in both academia and industry. Given the huge number of e-commerce applications available on the Internet and of goods offered in them, providing good product recommendations to consumers has become an important feature of these applications. These systems are able to leverage e-commerce sales by converting browsers into buyers, increasing cross-selling and building consumer loyalty [1]. Therefore, academia and industry have put considerable effort into proposing algorithms and methods to improve product recommendation.

Different types of recommender systems have emerged. The methods used by recommender systems can be organized according to three main approaches: content-based, collaborative and hybrid. An orthogonal classification is also possible, dividing them into heuristic-based and model-based.

For more details about the state-of-the-art of recommender systems, Adomavicius and Tuzhilin present a good survey in [2]. The authors also present new trends and future directions.

New approaches and methods have been proposed in the literature that try to bring in more information about the context in which products were bought. Methods considering contextual information have been getting more and more attention, and context-aware recommender systems (CARS) have emerged as a new approach to improving recommendations [3]. From this perspective, context attributes should be considered whenever a user rates a product.

Although there are many proposals and approaches working on this contextual information, adequate public datasets containing users’ context attributes and product ratings are very limited, and the few available are usually not large enough to evaluate proposed CARS properly [3]. One solution is to collect this kind of data by crawling e-commerce websites. However, crawling can be very time-consuming and is complicated by legal and privacy issues. In addition, the performance of recommender systems is highly dependent on various characteristics of the dataset, so evaluating algorithms on only one or two datasets is often not sufficient [4]. Moreover, a detailed analysis can be performed by applying systematic changes to the data, which cannot be done with real data.

Several studies on generating synthetic datasets have been carried out in the database systems literature [5]. The aim of synthetic data generators is to test the correctness and the performance of algorithms, as in the TPC benchmarks [6]. These tools are often specialized and not reusable. The synthetic data must be realistic and correct in terms of size and distributions to be useful. However, not much work has been done on generating datasets for recommender system evaluation, and the existing work does not focus on context-aware recommendations.

In this work we propose a methodology to generate synthetic data in order to evaluate CARS. Our methodology aims at generating data that include the items’ ratings and the context attributes in which users gave these ratings. Moreover, we expect to provide researchers with a powerful tool that is able to generate datasets according to the users' behavior of giving ratings for items. The idea is to allow researchers to set up the user’s profile according to their needs and then the dataset can be generated following the user behavior defined by the researcher.

1st BRICS Countries Congress on Computational Intelligence

978-1-4799-3194-1/13 $31.00 © 2013 IEEE

DOI 10.1109/BRICS-CCI.&.CBIC.2013.93

2013 BRICS Congress on Computational Intelligence & 11th Brazilian Congress on Computational Intelligence

DOI 10.1109/BRICS-CCI-CBIC.2013.99

Therefore, we expect that this work can contribute to the development of context-aware recommender systems by allowing researchers to generate synthetic data for evaluating their proposed methods and algorithms. Moreover, our methodology enables several kinds of evaluations to be performed by varying the characteristics of the users’ profiles, which is useful for testing different kinds of users, such as “heavy raters”, “cold start users”, “pessimists”, etc.

This paper is organized in 4 sections, of which this is the first one. In the second section, we briefly describe some related work about synthetic data generation. In the third section, we present our proposal describing basic definitions and the algorithms used. The fourth and last section describes our conclusions so far and presents the proposed experiment that can be carried out to validate our methodology.

II. RELATED WORK

Synthetic data has been widely used by database researchers (and by researchers from other fields as well), mainly for simulation purposes. Therefore, the size and the distribution of the synthetic data must be realistic and correct, so that the simulation is as close to the real scenario as possible [6]. Several tools for generating synthetic data have emerged; however, they are often specialized and not reusable. To deal with this issue, much work has been conducted to design Synthetic Data Generators that are as general as possible.

Many Synthetic Data Generators (SDGs) have been proposed in the literature, such as [5], [6], [7], [8] and [9]. Most synthetic data is used to evaluate multidimensional models and OLAP tools. In general, these SDGs are supported by description languages, which allow us to define constraints and the exact characteristics of the data to be generated. Such SDG tools are developed to generate huge amounts of data in little time [5][9], and many of them are available as commercial products [5].

There are plenty of works in which authors propose an SDG in order to evaluate their proposed recommender systems. In [4], the authors present several of those works. They claim that, although various SDGs for evaluating the behavior of data with several attributes have been developed, most of those found in the literature are used to evaluate one specific algorithm. Moreover, as the SDG was never the main focus of those authors, it is neither described in detail nor generic enough to be reused for evaluating other algorithms.

Therefore, to the best of our knowledge, only the work proposed in [4] describes in detail an SDG for evaluating recommender systems and tries to be generic enough to be reused in other evaluations. However, that work focuses on generating data for attribute-aware recommender systems, which are quite different from context-aware recommender systems. Furthermore, even if it could be adapted to context-aware recommender systems, it would not enable researchers to model the user and the item by the Probability Distribution Function (PDF) of their ratings and context attributes, as we do in our proposed methodology.

III. PROPOSED METHODOLOGY

In this section we present our proposed methodology to generate synthetic data for evaluating CARS. In order to do so, we describe the synthetic data generator (SDG) which implements our proposal. The SDG consists of 2 profile generators, 2 profile arrays, a penalization function and a sampling algorithm organized as depicted in Figure 1.

Figure 1: Synthetic Data Generator

The Profile Generators, the Sampling Algorithm and the Penalization Function are used to define the behavior of the synthetic data. With them, we can define behaviors such as:

• how many users and items will be generated;

• how each user gives ratings to items;

• how each context attribute influences the rating;

• how many items each user will rate;

• how users choose items to rate;

• which products are more likely to receive ratings.

These and other possible attributes are supported by the use of PDFs assigned to the random variables that represent the context attributes, the ratings and the number of items evaluated. Therefore, we have a useful set of random variables that models the users’ behavior and enables us to systematically vary the model’s parameters according to the evaluation purpose.

In the following subsections we describe how each component of the SDG works. We begin by describing the user’s and item’s profiles and the role they play in the SDG. Afterwards, we present the Profile Generators and how to use them to model users and items. Finally, we describe the Penalization Function and the Sampling Algorithm.

A. User’s and Item’s Profile

Both the user’s and the item’s profiles are defined in the same way, so let us first describe a generic profile. The aim of the profile is to store the model that describes the characteristics of an entity, such as a user or an item. Profiles can therefore be defined in several ways, e.g., by simple attributes or by complex functions.

We intend to have users’ and items’ profiles in order to model the user’s behavior in:

• choosing items to rate;

• choosing an appropriate value for the rating given the context;

• simulating their preferred context.

To do that we use a set of random variables representing the context attributes and the way in which a particular user evaluates items. By modeling how a user rates items as a random variable, we are assigning a PDF of ratings to each user. Although it may seem that items’ profiles are not necessary, we need them to model the most evaluated items and to establish the PDF of the context variables of each item.

Many different kinds of PDFs can be assigned to the user’s ratings and to the context attributes, depending on what we want to model. For instance, “pessimists” or “demanding” users can be defined by a Gaussian distribution whose mean is shifted toward low rating values. The same can be done for “optimists” or “uncritical” users, with the mean shifted toward high rating values. There are also “controversial” users, who have high variance in their ratings, so we can use a Gaussian distribution with high variance or even a bimodal distribution. There are many ways to model users’ behaviors through PDFs; we just have to choose those that represent the behavior we want in the data.
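As a rough illustration of these archetypes, the sketch below discretizes Gaussian-shaped PDFs onto a 1 to 5 rating scale and builds a bimodal distribution as an equal mixture of two of them. The function names, the scale, and every mean and spread are illustrative assumptions rather than values prescribed by the methodology.

```python
import math
import random

RATINGS = [1, 2, 3, 4, 5]  # assumed rating scale


def discretized_gaussian(mean, std):
    """Probability of each rating, proportional to a Gaussian density."""
    weights = [math.exp(-0.5 * ((r - mean) / std) ** 2) for r in RATINGS]
    total = sum(weights)
    return [w / total for w in weights]


def mixture(pmf_a, pmf_b, weight_a=0.5):
    """Blend two rating distributions, e.g. to obtain a bimodal one."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pmf_a, pmf_b)]


# Hypothetical archetypes; the means and spreads are illustrative only.
ARCHETYPES = {
    "pessimist": discretized_gaussian(mean=1.5, std=0.8),
    "optimist": discretized_gaussian(mean=4.5, std=0.8),
    "controversial": mixture(
        discretized_gaussian(mean=1.0, std=0.6),
        discretized_gaussian(mean=5.0, std=0.6),
    ),
}


def expected_rating(pmf):
    """Mean rating implied by a distribution over RATINGS."""
    return sum(r * p for r, p in zip(RATINGS, pmf))


def sample_rating(pmf, rng):
    """Draw one rating from the given distribution."""
    return rng.choices(RATINGS, weights=pmf, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for name, pmf in ARCHETYPES.items():
        draws = [sample_rating(pmf, rng) for _ in range(10)]
        print(name, round(expected_rating(pmf), 2), draws)
```

Because the ratings are discrete, the Gaussian is used here only as a shape for the weights; any other normalized weight vector over the rating scale could be substituted.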

For items, we need to define the PDF of each context attribute. Though it is not directly related to the users’ behaviors, with these profiles we can determine how the rating is influenced by the context. We are aware that context is a very broad term, so, for the sake of clarity, we shall consider a trip recommender system in which users are travelers and items are destinations.

In this scenario, we can see how important the context related to items is for the recommendation. It is fair to suppose that a traveler will be much more reluctant to travel to Rio de Janeiro in winter than in summer, since the town is known worldwide for its summer attractions. A similar argument can be made about Bordeaux. Travelers would be much more willing to go there during the wine harvest than in other seasons, since the town is well known for its great wine.

Thus, some cities are more visited during some months or periods of the year, depending on their attractions. In order to design an accurate trip recommender system, we need to take this pattern into account. In our model, we have a context random variable describing the probability that an average traveler chooses a given city in each month of the year. For instance, we could model the probability of Rio de Janeiro being chosen as a travel destination by a Gaussian distribution with its peak in January and a variance spanning December, February and March (the summer months).
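The Rio de Janeiro example can be sketched as a distribution over the twelve months. The version below is a minimal sketch under one added assumption: month distance is measured on a circular calendar, so that December counts as adjacent to January. The function names and the spread of 1.5 months are hypothetical.

```python
import math

MONTHS = list(range(1, 13))


def circular_distance(month_a, month_b):
    """Number of months between two months on a circular calendar (0 to 6)."""
    diff = abs(month_a - month_b) % 12
    return min(diff, 12 - diff)


def seasonal_pmf(peak_month, std):
    """Probability of each month, proportional to a Gaussian centred on peak_month."""
    weights = [
        math.exp(-0.5 * (circular_distance(m, peak_month) / std) ** 2)
        for m in MONTHS
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical profile: peak in January, most of the mass from December to March.
rio = seasonal_pmf(peak_month=1, std=1.5)

if __name__ == "__main__":
    for month, prob in zip(MONTHS, rio):
        print(f"{month:2d} {prob:.3f}")
```

Without the circular distance, a Gaussian centred on month 1 would treat December (month 12) as the least likely month, which contradicts the intended summer peak.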

B. Profile Generators

In the profile generators we assign a PDF, with its parameters, to each variable in the user’s and item’s profiles. For instance, in the case of the trip recommender system, we begin by choosing the number of users (N), the number of destinations (M) and the number of evaluations (E) our dataset will have. For each user i, we have the following random variables describing the user’s behavior:

• d_i: the number of destinations evaluated by the user. This variable may assume integer values from 0 to M. The sum of all the d_i must be equal to E.

• r_i: the user’s “taste”, i.e., how he evaluates destinations by giving them ratings. This variable may assume integer values from 1 to 5.

• c_i: the user’s preference regarding the period of the year to travel, i.e., the user’s context preference. This variable may assume integer values from 1 to 12, pointing to the month of the year.

Likewise, for each destination j, we have the following random variables describing the destination’s features:

• u_j: the number of users that evaluated the destination. This variable may assume integer values from 0 to N. The sum of all the u_j must be equal to E.

• s_j: the period of the year in which the destination is most sought for traveling. This variable may assume integer values from 1 to 12, pointing to the month of the year.

Each of these random variables has an associated PDF that governs its behavior. The PDF can be of any sort, e.g., a Gaussian distribution, a bimodal distribution, an exponential distribution, a chi-square distribution or even a nonparametric distribution. The PDF’s parameters are also selected depending on the characteristics we want our entities to have.

For example, suppose one wants to model a “heavy rater” user, i.e., a user who rates many destinations. A feasible model of this behavior can be achieved by assigning to the variable d_i (the number of destinations the user evaluates) a Gaussian distribution with a large mean and a small variance. The same reasoning can be extended to the other random variables, in both the user’s and the destination’s profiles.

As the desired dataset may have thousands, even millions, of users, choosing the appropriate PDF parameters for each random variable and for each profile can be laborious for the researcher. Hence, in order to make the SDG easier to set up, we propose a Bayesian approach for selecting these parameters.
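A minimal sketch of the “heavy rater” idea follows: the number of destinations a user evaluates is drawn from a Gaussian, then rounded and clipped to the valid range. The helper name and all numeric values are assumptions made for illustration, and the sketch does not enforce the constraint that the per-user counts sum to E.

```python
import random


def sample_num_evaluations(mean, std, num_destinations, rng):
    """Draw how many destinations a user rates, clipped to [0, num_destinations]."""
    value = round(rng.gauss(mean, std))
    return max(0, min(num_destinations, value))


rng = random.Random(42)
M = 100  # assumed number of destinations

# Hypothetical parameters: large mean and small spread for heavy raters,
# small mean for light raters.
heavy = [sample_num_evaluations(80, 5, M, rng) for _ in range(1000)]
light = [sample_num_evaluations(3, 2, M, rng) for _ in range(1000)]

if __name__ == "__main__":
    print("heavy raters, average count:", sum(heavy) / len(heavy))
    print("light raters, average count:", sum(light) / len(light))
```

Clipping slightly distorts the distribution near the bounds; a truncated or discrete distribution would be a cleaner choice when the mean sits close to 0 or to the number of destinations.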

Input
    Number of users: N
    Number of destinations: M
    Number of evaluations: E
Profile Generators
    For each user i:
        Create user profile up_i = {d_i, r_i, c_i}
    For each destination j:
        Create destination profile dp_j = {u_j, s_j}
Sampling Algorithm
    For each user i:
        Choose d_i destinations for evaluation
        For each destination j:
            Check if the destination has achieved the maximum number of evaluations u_j
            If yes:
                Discard the destination and search for another that still needs to be evaluated
            Else:
                Assign a rating r_ij that user i “gave” to destination j
Penalization Function
    Penalize the rating r_ij according to the user’s preferred context c_i and the destination’s real context s_j

Suppose the random variable d_i (the number of destinations evaluated by user i) follows a Gaussian distribution for every user i in the dataset. Since the Gaussian distribution requires only two parameters (mean and variance), if the dataset required 1000 users, the researcher would have to set 2000 parameters manually: two for every user i (μ_i, σ_i²).

Instead, we can assume that the parameter itself is a random variable with its own PDF. Thus, the only manual requirement is to determine this PDF, since the values for the per-user mean μ_i follow straightforwardly from it. Regarding the example above, suppose the values for each parameter μ_i are given by the random variable μ and that this variable follows a Gaussian distribution determined by its mean (μ_μ) and variance (σ_μ²). Therefore, we need to select manually only the two parameters μ_μ and σ_μ² to set all the other parameters.

Having control over μ_μ and σ_μ² enables us to create the dataset under hypothetical situations that we may wish to confirm or refute. For instance, imagine that the researcher has designed an algorithm that performs well even when most users are “light raters”, i.e., when most users evaluate very few items. Finding a real context-aware dataset is already difficult; finding one with these specific characteristics is even harder. With our SDG the researcher could easily create such a dataset by assigning small values to μ_μ and σ_μ².
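The two-level scheme can be sketched as follows: each user's mean is drawn from one shared Gaussian, and that user's number of evaluations is then drawn around the user's own mean. This is a sketch under stated assumptions: the function names are hypothetical, and the per-user spread is held fixed for simplicity, although it could be drawn from its own distribution in the same way.

```python
import random


def draw_user_means(num_users, hyper_mean, hyper_std, rng):
    """Draw one per-user mean from the shared Gaussian over means."""
    return [rng.gauss(hyper_mean, hyper_std) for _ in range(num_users)]


def draw_num_evaluations(user_mean, user_std, num_destinations, rng):
    """Draw a user's evaluation count around that user's own mean."""
    value = round(rng.gauss(user_mean, user_std))
    return max(0, min(num_destinations, value))


def generate_counts(num_users, num_destinations, hyper_mean, hyper_std,
                    user_std=1.0, seed=0):
    """Set every user's count from only two manually chosen hyperparameters."""
    rng = random.Random(seed)
    means = draw_user_means(num_users, hyper_mean, hyper_std, rng)
    return [draw_num_evaluations(m, user_std, num_destinations, rng)
            for m in means]


if __name__ == "__main__":
    # Hypothetical "light rater" population: small hyper-mean and hyper-spread.
    counts = generate_counts(num_users=1000, num_destinations=50,
                             hyper_mean=2.0, hyper_std=1.0, seed=1)
    print("average evaluations per user:", sum(counts) / len(counts))
```

Changing only `hyper_mean` and `hyper_std` shifts the whole population between light and heavy raters, which is the kind of systematic variation the text describes.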

C. Penalization Function

In Context-Aware Recommender Systems, the context influences, in some way, the rating given by users. Therefore, the Penalization Function aims at penalizing the rating given by a user if the destination’s context differs too much from the user’s preferred context. The implementation of the penalization function depends on how, and how much, we expect the rating to be influenced by the context.

The Penalization Function takes as input parameters the user’s preferred context (c_i), the destination’s context (s_j) and the user’s rating (r_i). As the values chosen for r_i take into account only the mathematical model of the user’s taste, without considering the context involved, we need to introduce some variation in the rating to account for the influence of the context.

The function that calculates the penalty for the ratings according to the context can be implemented in many ways. We suggest the use of a Fuzzy Logic engine: the way the rating is penalized can then be described by fuzzy rules and fuzzy sets. This seems easier to us than using complicated mathematical functions, and it also makes it easier to explain how the context affects the ratings.
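One possible shape for such a rule base is sketched below in plain Python, without a fuzzy logic library. It assumes the context is a month of the year, measures the gap between the user's preferred month and the destination's month on a circular calendar, and combines three hypothetical rules with a zero-order Sugeno-style weighted average. The fuzzy sets, the penalty values and all function names are assumptions for illustration only.

```python
def month_distance(month_a, month_b):
    """Distance between two months (1-12) on a circular calendar, from 0 to 6."""
    diff = abs(month_a - month_b) % 12
    return min(diff, 12 - diff)


def triangular(x, left, peak, right):
    """Triangular membership; left == peak or peak == right gives a one-sided triangle."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets over month distance, each paired with a penalty
# expressed in rating points.
RULES = [
    (lambda d: triangular(d, 0, 0, 3), 0.0),  # IF contexts are close THEN no penalty
    (lambda d: triangular(d, 1, 3, 5), 1.0),  # IF moderately apart THEN mild penalty
    (lambda d: triangular(d, 3, 6, 6), 2.0),  # IF far apart THEN strong penalty
]


def context_penalty(user_month, destination_month):
    """Weighted average of the rule penalties, weighted by rule activation."""
    distance = month_distance(user_month, destination_month)
    activations = [(membership(distance), penalty) for membership, penalty in RULES]
    total = sum(a for a, _ in activations)
    return sum(a * p for a, p in activations) / total


def penalize(rating, user_month, destination_month, min_rating=1):
    """Lower the raw rating by the context penalty, never below min_rating."""
    penalized = round(rating - context_penalty(user_month, destination_month))
    return max(min_rating, penalized)


if __name__ == "__main__":
    for dest_month in (1, 3, 4, 7):
        print(dest_month, penalize(5, user_month=1, destination_month=dest_month))
```

The three sets overlap so that every integer distance from 0 to 6 activates at least one rule, which keeps the weighted average well defined.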

D. Sampling Algorithm

The Sampling Algorithm is the core of the SDG and therefore of our methodology too. This algorithm determines which users will evaluate which destinations and applies the Penalization Function according to the profiles’ variables in order to synthesize the “real” rating.

After all the profiles have been created, we need to associate each user with the destinations he evaluated. To do so, we pick a particular user and select, from the entire set of destinations, those which he will evaluate. This must be done under the established constraints on the number of destinations the user evaluated and the number of evaluations each destination received.

The complete algorithm is shown in Figure 2. We begin by defining the total number of users, destinations and evaluations the dataset will have. Then, the Profile Generator sets the users’ and destinations’ profiles; in other words, it assigns values to the profiles’ random variables according to their PDFs. We then move to the Sampling Algorithm, which associates users with destinations and applies the Penalization Function in order to obtain the real rating.

Figure 2: Complete SDG algorithm.
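The sampling step can be sketched in Python as below. The sketch assumes that profiles have already been generated and are passed in as plain dictionaries, that a user picks destinations uniformly at random among those that still accept evaluations, and that the penalization is supplied as a callable. The dictionary keys, the function name and the uniform choice are all assumptions; the methodology leaves the selection behavior configurable.

```python
import random

RATINGS = [1, 2, 3, 4, 5]  # assumed rating scale


def generate_ratings(user_profiles, dest_profiles, penalize, seed=0):
    """Return (user, destination, rating) triples.

    user_profiles: dicts with 'n_items', 'rating_pmf' and 'month' (hypothetical keys).
    dest_profiles: dicts with 'max_evals' and 'month' (hypothetical keys).
    penalize: callable(raw_rating, user_month, destination_month) -> rating.
    """
    rng = random.Random(seed)
    remaining = [dest["max_evals"] for dest in dest_profiles]
    rows = []
    for i, user in enumerate(user_profiles):
        candidates = list(range(len(dest_profiles)))
        rng.shuffle(candidates)  # assumed: uniform choice among destinations
        chosen = 0
        for j in candidates:
            if chosen == user["n_items"]:
                break  # this user has rated enough destinations
            if remaining[j] == 0:
                continue  # destination is full, look for another one
            raw = rng.choices(RATINGS, weights=user["rating_pmf"], k=1)[0]
            rating = penalize(raw, user["month"], dest_profiles[j]["month"])
            rows.append((i, j, rating))
            remaining[j] -= 1
            chosen += 1
    return rows


if __name__ == "__main__":
    users = [{"n_items": 2, "rating_pmf": [0.1, 0.1, 0.2, 0.3, 0.3], "month": 1}
             for _ in range(3)]
    dests = [{"max_evals": 2, "month": m} for m in (1, 7, 1)]
    # Toy penalty: subtract one point when the months differ, floor at 1.
    toy_penalty = lambda r, um, dm: r if um == dm else max(1, r - 1)
    for row in generate_ratings(users, dests, toy_penalty):
        print(row)
```

A user may end up with fewer ratings than requested when too few destinations still have capacity, so a production generator would need either a repair step or profile counts that are checked for feasibility beforehand.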

IV. CONCLUSIONS AND FUTURE WORK

In this work we presented a methodology to generate synthetic data for evaluating Context-Aware Recommender Systems, in particular, one aimed at trip recommendations. We also described an SDG which implements the proposed methodology, generating data for CARS based on users’ and destinations’ profiles. These profiles are modeled by random variables and their PDFs. We hope that with this modeling we are able to simulate the users’ rating behavior and their contexts.

Although simple random variables seem an interesting tool for modeling the user’s behavior, more research in this direction is needed to understand the complex relationship between rating and context. Besides, there are many other statistical tools suited to the task that may outperform our approach, such as Markov chains, complex networks and joint distributions.

We are working on a complete case study in order to show how we can achieve good and reliable evaluations of CARS through our proposed methodology. In order to do so, we expect to apply this approach to enlarge a real dataset, keeping or even improving the recommender system performance.


Nevertheless, we need to discover the PDFs of the real dataset in order to model users and items.

A context-aware movie dataset with only 90 users, 950 items and 1600 ratings is described in [10]. This seems a good candidate for our tests. We intend to extract, from this real dataset, the PDFs that best describe the users’ and items’ behavior and their context attributes. We can do this either by trying known distributions or by a nonparametric approach, using Kernel Density Estimators [11]. The latter seems more appropriate due to the wide range of possibilities for modeling human behavior.

Therefore, our methodology allows not only generating synthetic data and evaluating CARS, but also understanding the users’ behavior and how the context may interfere with this behavior. We hope to make the implementation of our SDG available as a Matlab toolbox as soon as possible.
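For reference, a kernel density estimate with a Gaussian kernel can be written in a few lines. The sketch below uses a fixed, hand-picked bandwidth and toy rating values that are not taken from the dataset in [10]; the function name is hypothetical.

```python
import math


def make_kde(samples, bandwidth):
    """Return a density function estimated from samples with a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Toy ratings from one imaginary user, for illustration only.
toy_ratings = [1, 1, 2, 5, 5, 5]
density = make_kde(toy_ratings, bandwidth=0.5)

if __name__ == "__main__":
    for x in (1, 2, 3, 4, 5):
        print(x, round(density(x), 3))
```

The estimate is sensitive to the bandwidth: small values reproduce every sample as a separate spike, while large values smooth away modes such as the two clusters in the toy data.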

ACKNOWLEDGMENTS

Our thanks to CNPq (Brazilian Funding Agency) and EUBRANEX for funding this research.

REFERENCES

[1] J.B. Schafer, J. Konstan, and J. Riedl, “Recommender systems in e-commerce,” Proceedings of the 1st ACM conference on Electronic commerce, Denver, Colorado, United States: ACM, 1999, pp. 158-166.

[2] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions,” Knowledge and Data Engineering, IEEE Transactions on, vol. 17, 2005, pp. 734-749.

[3] G. Adomavicius, R. Sankaranarayanan, S. Sen, and A. Tuzhilin, “Incorporating contextual information in recommender systems using a multidimensional approach,” ACM Trans. Inf. Syst., vol. 23, 2005, pp. 103-145.

[4] K. Tso and L. Schmidt-Thieme, “Empirical Analysis of Attribute-Aware Recommendation Algorithms with Variable Synthetic Data,” Data Science and Classification, 2006, pp. 271-278.

[5] J.E. Hoag and C.W. Thompson, “A parallel general-purpose synthetic data generator,” SIGMOD Rec., vol. 36, 2007, pp. 19-24.

[6] K. Houkjær, K. Torp, and R. Wind, “Simple and realistic data generation,” Proceedings of the 32nd international conference on Very large data bases, Seoul, Korea: VLDB Endowment, 2006, pp. 1243-1246.

[7] P.J. Lin, B. Samadi, A. Cipolone, D.R. Jeske, S. Cox, C. Rendon, D. Holt, and R. Xiao, “Development of a Synthetic Data Set Generator for Building and Testing Information Discovery Systems,” Proceedings of the Third International Conference on Information Technology: New Generations, IEEE Computer Society, 2006, pp. 707-712.

[8] N. Bruno and S. Chaudhuri, “Flexible database generators,” Proceedings of the 31st international conference on Very large data bases, Trondheim, Norway: VLDB Endowment, 2005, pp. 1097-1107.

[9] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P.J. Weinberger, “Quickly generating billion-record synthetic databases,” Proceedings of the 1994 ACM SIGMOD international conference on Management of data, Minneapolis, Minnesota, United States: ACM, 1994, pp. 243-252.

[10] A. Košir, A. Odić, M. Kunaver, M. Tkalčič, and J.F. Tasič, “Database for contextual personalization,” Elektrotehniški vestnik [English print ed.], vol. 78, no. 5, 2011, pp. 270-274. [COBISS.SI-ID 8871764]

[11] W. Härdle, Applied Nonparametric Regression, Cambridge University Press, 1992.
