Providing Actionable Recommendations:
Design and Evaluation of a Method for Provision of
Recommendations and Effective Explanations thereof.
Dipl.-Oek. Paul Marx
A dissertation thesis submitted in fulfillment
of the requirements for the degree of
Doctor rerum politicarum
of the
Bauhaus-University of Weimar
Faculty of Media
July 2011
DRAFT - final revision to appear in 2012
Acknowledgements
A paradox of a dissertation project is that it is always accomplished by a single person
but actually represents the result of a joint effort of many individuals. I want to thank all
these people for their invaluable contribution to the success of my work – without you, I
would have been facing hard times.
First of all, a great, big thank you goes to my supervisor Thorsten Hennig-Thurau for
awakening my interest in the project and in academic work, along with his exceptional,
never-ending, and encouraging personal example of how hard work with attention to detail
yields fruits and satisfaction as well as professional advancement. Thank you to Tobias
Bauckhage, the CEO of MoviePilot, for providing invaluable data for my experiments.
Thanks are also due to all my teachers, who taught me the value of constant learning
and inspired my curiosity and respect for the unknown. In particular, I would like to thank the
teachers of the Novosibirsk Aerospace Lyceum, the professors of the Aircraft Faculty of the
Novosibirsk State Technical University, and the professors of the Khristianovich Institute of
Theoretical and Applied Mechanics. I am proud to have studied there. Special mention also
goes to Anne Priller and Denis Rechkin for teaching me languages: Anne for English and
Denis for C#.
And of course, my eternal gratitude belongs to my family. Thank you to my parents for
making me who I am, for your absolute love and continuing support throughout my life and
especially during the writing of this thesis. Thank you to my wife Elena for your patience and
understanding. Thank you to my kids Vera and Michael for reminding me that there are other
important things going on out there in the world. Thank you for all your support. I have no
doubt that this thesis would not have been possible without you.
Langenhagen, July 2011 Paul Marx
Table of Contents
Glossary .................................................................................................................................. viii
List of Tables ............................................................................................................................. x
List of Figures .......................................................................................................................... xi
List of Equations ..................................................................................................................... xii
1 Introduction and Motivation ........................................................................................... 1
1.1 Motivation ................................................................................................................... 1
1.2 Objectives .................................................................................................................... 6
1.3 Outline of the Thesis .................................................................................................... 7
2 Background and Related Work ....................................................................................... 8
2.1 Explanations in Recommender Systems ...................................................................... 8
2.1.1 Relevance and Advantages of Explanation Facilities .......................................... 9
2.1.2 Explanation Styles .............................................................................................. 14
2.1.3 Explanations within Recommendation Process ................................................. 18
2.1.4 Summary ............................................................................................................ 21
2.2 Movie Related Preferences and Relevant Movie Characteristics .............................. 23
2.2.1 Operationalizing Preferences: Multiattribute Utility Model
and Weighted Additive Decision Rule ............................................................... 23
2.2.2 Preference Relevant Attributes of Motion Pictures ............................................ 26
2.2.3 Summary ............................................................................................................ 34
2.3 Key Recommendation Techniques ............................................................................ 35
2.3.1 Collaborative Filtering ....................................................................................... 36
2.3.1.1 User-based Approach .................................................................................. 36
2.3.1.2 Item-based Approach .................................................................................. 42
2.3.1.3 Matrix Factorization and Latent Factor Models ......................................... 44
2.3.2 Content-based Filtering ...................................................................................... 49
2.3.2.1 The Principles of Content-based approaches .............................................. 50
2.3.2.2 Exploiting Content Characteristics in Non-textual Item Domains ............. 54
2.3.3 Trade-offs and Problems of Collaborative and Content-based Approaches ...... 60
2.3.3.1 Data Sparsity ............................................................................... 61
2.3.3.2 “Ramp-up”: New User and New Item Problems ........................................ 62
2.3.3.3 Overspecialization ....................................................................................... 63
2.3.3.4 “Gray Sheep”, “Starvation” and Shilling Attacks. ...................................... 64
2.3.3.5 Stability vs. Plasticity .................................................................................. 65
2.3.4 Hybrid Recommender Systems .......................................................................... 66
2.3.4.1 Principles of Hybrid Methods ..................................................................... 66
2.3.4.2 Explanations in Hybrid Approaches ........................................................... 68
2.4 Summary .................................................................................................................... 70
3 Conceptual Framework of a Hybrid Recommender System
that allows for Effective Explanations of Recommendations ...................................... 72
3.1 Modeling User Preferences ....................................................................................... 73
3.1.1 Motivation of the Approach ............................................................................... 73
3.1.2 Basic Model of User Preferences ....................................................................... 74
3.1.3 Accounting for Static Effects beyond the User-Item Interaction ....................... 76
3.1.4 Accounting for Time .......................................................................................... 78
3.2 Estimating Model Parameters .................................................................................... 82
3.2.1 Step 1: Estimation of Initial Parameter Values .................................................. 84
3.2.1.1 Omitted Variable Bias in OLS Models and a Method to Counteract the
Bias .............................................................................................. 85
3.2.1.2 Estimating User and Item Related Effects .................................................. 89
3.2.1.3 Estimating Attribute Part-worths ................................................................ 91
3.2.2 Step 2: Optimization of the Parameters .............................................................. 95
3.3 Hybridization with Collaborative Approaches ........................................................ 100
3.3.1 Motivation for Hybridization ........................................................................... 100
3.3.2 Methods to Hybridize and the Method of Hybridization ................................. 102
4 Empirical Study ............................................................................................................ 105
4.1 Datasets and their Properties ................................................................................... 107
4.2 Measures of Prediction Accuracy ............................................................................ 112
4.3 Employed Algorithms and Benchmarks .................................................................. 114
4.4 Results ..................................................................................................................... 116
4.4.1 Comparison of Prediction Accuracy ................................................................ 116
4.4.2 Provided Explanation Style .............................................................................. 124
4.5 Summary .................................................................................................................. 126
5 Conclusions and Future Work ..................................................................................... 128
5.1 Research Summary, Findings and Contributions .................................................... 128
5.2 Discussion and Implications .................................................................................... 135
5.3 Future Research ....................................................................................................... 136
Bibliography ......................................................................................................................... 139
Appendix A: Sources of Error in Recommender Systems ............................................... 160
Appendix B: List of Preference Relevant Attributes ........................................................ 165
Appendix C: Technical Details of Prediction Accuracy Tests ......................................... 168
Glossary
ACM Association for Computing Machinery
CB Content-Based filtering
CF Collaborative Filtering
CSCW Computer Supported Cooperative Work
DFG Deutsche ForschungsGemeinschaft (German Research Foundation)
DVD Digital Versatile Disk
EBA Elimination By Aspects
esp. Especially
GB GigaByte
GHz GigaHertz
GPS Global Positioning System
IDF Inverse Document Frequency
IMDb Internet Movie Database
kNN k Nearest Neighbor
MAE Mean Absolute Error
MAU Multiattribute Utility
MDS MultiDimensional Scaling
MF Matrix Factorization
NMAE Normalized Mean Absolute Error
NRMSE Normalized Root Mean Squared Error
OLS Ordinary Least Squares
RAM Random-Access Memory
RecSys Recommender Systems
RMSE Root Mean Squared Error
RS Recommender System; Recommender Systems
SD Standard Deviation
SE Standard Error
SVD Singular Value Decomposition
TF Term Frequency
TF-IDF Term Frequency - Inverse Document Frequency
WADD Weighted ADDitive linear model
w.r.t. with respect to
List of Tables
Table 2.1: Reasons and benefits for provision of explanations ................................................ 13
Table 2.2: Summary of motion picture success factors ........................................................... 28
Table 2.3: Summary of preference relevant movie attributes .................................................. 33
Table 2.4: Ratings database for collaborative filtering ............................................................ 37
Table 2.5: Principle of content-based filtering ......................................................................... 51
Table 2.6: Summary of strengths and weaknesses
of different recommendation approaches .............................................................. 60
Table 4.1: Descriptive statistics of the raw rating datasets .................................................... 109
Table 4.2: Descriptive statistics of the datasets employed in the study ................................. 110
Table 4.3: Comparison of the prediction accuracy of different algorithms
for MoviePilot dataset ......................................................................................... 117
Table 4.4: Comparison of the prediction accuracy of different algorithms
for Netflix dataset ................................................................................................ 118
Table 4.5: Distribution parameters of the absolute prediction error
of the optimization step ....................................................................................... 121
Table 4.6: Accuracy improvement of the hybrid method ...................................................... 123
Table 4.7: Provided explanation style .................................................................................... 124
Table C.1: Overview of the employed source code snippets from Press et al. 2007 ............. 169
List of Figures
Figure 2.1: Comparing three user rating profiles ..................................................................... 39
Figure 2.2: Comparing three movie rating profiles .................................................................. 43
Figure 2.3: A simplified illustration of the latent factor approach ........................................... 45
Figure 2.4: Illustration of the extraction of a features vector from a document ...................... 50
Figure 3.1: Decomposition of a time changing measure in three components:
baseline, long-term trend, and short-term fluctuations .......................................... 79
Figure 3.2: Successive minimization with gradient methods ................................................... 96
Figure 3.3: Flowchart of the optimization step ........................................................................ 98
Figure 4.1: Rating scales in user interfaces of recommender systems ................................... 108
List of Equations
(2.1) .......................................................................................................................................... 23
(2.2) .......................................................................................................................................... 24
(2.3) .......................................................................................................................................... 35
(2.4) .......................................................................................................................................... 38
(2.5) .......................................................................................................................................... 38
(2.6) .......................................................................................................................................... 40
(2.7) .......................................................................................................................................... 40
(2.8) .......................................................................................................................................... 40
(2.9) .......................................................................................................................................... 40
(2.10) ........................................................................................................................................ 43
(2.11) ........................................................................................................................................ 44
(2.12) ........................................................................................................................................ 46
(2.13) ........................................................................................................................................ 47
(2.14) ........................................................................................................................................ 47
(2.15) ........................................................................................................................................ 48
(2.16) ........................................................................................................................................ 53
(2.17) ........................................................................................................................................ 53
(2.18) ........................................................................................................................................ 53
(2.19) ........................................................................................................................................ 56
(2.20) ........................................................................................................................................ 57
(3.1) .......................................................................................................................................... 75
(3.2) .......................................................................................................................................... 75
(3.3) .......................................................................................................................................... 76
(3.4) .......................................................................................................................................... 77
(3.5) .......................................................................................................................................... 79
(3.6) .......................................................................................................................................... 81
(3.7) .......................................................................................................................................... 81
(3.8) .......................................................................................................................................... 83
(3.9) .......................................................................................................................................... 85
(3.10) ........................................................................................................................................ 85
(3.11) ........................................................................................................................................ 87
(3.12) ........................................................................................................................................ 87
(3.13) ........................................................................................................................................ 87
(3.14) ........................................................................................................................................ 87
(3.15) ........................................................................................................................................ 88
(3.16) ........................................................................................................................................ 88
(3.17) ........................................................................................................................................ 88
(3.18) ........................................................................................................................................ 88
(3.19) ........................................................................................................................................ 90
(3.20) ........................................................................................................................................ 93
(3.21) ........................................................................................................................................ 94
(3.22) ........................................................................................................................................ 94
(3.23) ........................................................................................................................................ 99
(4.1) ........................................................................................................................................ 112
(4.2) ........................................................................................................................................ 112
(4.3) ........................................................................................................................................ 113
(4.4) ........................................................................................................................................ 113
Chapter 1
Introduction and Motivation
This chapter describes the motivation behind this thesis. The objectives of the thesis
and the subjects covered in this document are briefly explained. The chapter ends by
describing the structure and contents of the thesis.
1.1 Motivation
Recommendations are a part of everyday life. It is natural for people to seek recommen-
dations whenever they are going to make a decision about a particular item or action. We rely
on recommendations coming from different sources such as other people, bestseller lists, trav-
el guides, test reports, technical reviews, restaurant and movie critics and so forth. Personal-
ized recommender systems (RS) are intended to support and augment this natural social pro-
cess by helping their users find the most interesting and valuable items for them in a quick
and efficient way.
On the internet, where service providers are not bound by shelf space and thus
can carry far more inventory than traditional retailers1, the choice task becomes overwhelming
1 For instance, the internet music shop Rhapsody offers 19 times as many songs as Wal-Mart’s stock of 39,000
tunes. Amazon’s offering includes 2.3 million books, while specialized book retailers can carry up to a maximum of
to the customers, making it nearly impossible to arrive at optimal selection decisions – some-
thing referred to as the information overload problem (Jacoby, Speller, and Berning 1974;
Anderson 2004). In such situations people strive to minimize their search effort, i.e. they are
eager not to be overloaded by a vast amount of irrelevant offerings they are not interested in
and want only those items to be presented which are at least potentially valuable for them
(Herlocker et al. 2004). By recommending relevant items (e.g. product offerings such as
books, CDs, movies, etc.) to their users, RS not only largely mitigate the information overload
problem on the users’ side but also support sales at online stores: RS allow e-commerce
providers to increase their up-selling and cross-selling potential (Schafer, Konstan, and Riedl
2001; Bodapati 2008) and help them to better manage customer relationships, which leads
to higher loyalty and greater competitive barriers (Wei, Shaw, and Easley 2002; Ricci,
Rokach, and Shapira 2011). In other words, RS allow both parties to a business transaction
to benefit considerably from it by solving their tasks more efficiently.
Accordingly, recommender systems have already found their way into many commer-
cial applications and established themselves as an important component of online stores
(Shafer et al. 1999; Ansari et al. 2000). Indeed, most internet users have come across a rec-
ommender system in one way or another. A prominent example of a commercial RS is Ama-
zon‟s2 service of offering personalized book recommendations, which is also widely known as
“Customers Who Bought This Item Also Bought”. An online DVD rental and video streaming
service, Netflix3, recommends its subscribers a movie to watch next in the form “If You Liked
This Movie You Will Also Like”. Last.fm4 and Pandora
5 offer their users to create their own
“personalized radio stations” online, which then play songs in accordance with the user‟s
taste. Mendeley6, a researcher community web site, recommends scientific articles to read. A
pure movie recommendation service, Moviepilot7, offers a series of recommendation systems:
one of them produces forecasts of how good a user will find a particular movie, while the se-
cond one suggest a cinema nearby showing the movie, which s/he will like most of all movies
currently running in the cinemas. The third one ranks in the real-time the TV program con-
130,000 book items. The online DVD rental Netflix offers 25,000 DVDs, whereas an average inventory of a conventional store consists of only 3,000 DVDs (Anderson 2004).
2 http://www.amazon.com
3 http://www.netflix.com
4 http://www.last.fm
5 http://www.pandora.com
6 http://www.mendeley.com
7 http://www.moviepilot.com
sistent with the user’s preferences and then recommends a channel to watch. Besides conven-
tional goods and food, other widespread examples of domains where RS are employed
include recommending restaurants, jokes, news, physicians, lawyers, sightseeing places,
vacation resorts, libraries, web sites, acquaintances, sport centers, and even lifestyles. Finally,
the fact that Netflix has recently awarded a one-million-dollar prize to the team that first
succeeded in substantially improving the performance of its own recommender algorithm
(Koren, Bell, and Volinsky 2009) convincingly indicates the importance of RS for the industry.
At the same time, research interest in recommender systems has dramatically increased.
According to the EBSCO Business Source Premier database, over 300 scientific
papers were published explicitly on this topic in the last fifteen years. Conferences and
workshops on RS have become premier annual events8. Sessions dedicated to RS are
frequently included in the more traditional conferences in the area of information systems9.
Furthermore, several noted academic journals have presented special issues covering
research and developments in the area of RS10. The topic of recommender systems is also
frequently tackled in academic publications in the fields of psychology, e-commerce, and
marketing11.
Providing personalized recommendations, however, requires that the RS know some-
thing about its users. Every RS must obtain and maintain a user profile, i.e. data that allows
conclusions to be drawn about what is relevant for the users. Such data may come, for example,
from the users’ purchase history. In this case each purchase act or purchased item can be seen
as an expression of the user’s preference in the item’s domain, thus providing the RS with
information about what the user likes or in which part of the item’s domain his/her tastes or
interests manifest themselves. Another source of information descriptive of users’ preferences
is the users’ explicit ratings of items. Ratings are potentially more informative to RS as they also allow users to
8 We refer specifically to ACM Recommender Systems (RecSys), founded in 2007 and now taking place annually.
9 Among the conferences that included sessions dedicated to RS, the most prominent ones are: ACM Special Interest Group on Information Retrieval (SIGIR), User Modeling, Adaption and Personalization (UMAP), and ACM Special Interest Group on Management of Data (SIGMOD) (Ricci et al. 2011, p. 3).
10 Among the journals that presented special issues on RS are: AI Communications (2008), IEEE Intelligent Systems (2007), International Journal of Electronic Commerce (2006), International Journal of Computer Science and Applications (2006), ACM Transactions on Computer-Human Interaction (2005), and ACM Transactions on Information Systems (2004) (Ricci et al. 2011, p. 3).
11 For example: Hennig-Thurau, Marchand, and Marx (2011), Hennig-Thurau et al. (2010), Bodapati (2008), Aksoy et al. (2006), Yuanping, Feinberg, and Wedel (2006), Fritzmons and Lehmann (2004), Rutkovsky, Senecal, and Nantel (2004), Fairchild and Rijsman (2004), Gershoff, Mukherjee, and Mukhopadhyay (2003), Cooke et al. (2002), Mild and Natter (2002), Ansari, Essagier, and Kohli (2000).
indicate the amount and the direction of the preference they associate with an item, i.e. the
degree to which the item is liked – or even disliked.
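The two profile types just described can be made concrete with a minimal sketch. The data structures and identifiers below are hypothetical illustrations, not taken from the thesis: an implicit profile records only which items were purchased, while an explicit profile additionally carries the direction and magnitude of preference.

```python
# Hypothetical illustration of the two user-profile types described above.

# Implicit profile: each purchase acts as a unary expression of preference --
# it signals interest but carries no direction or magnitude.
purchase_profile = {"user_42": {"item_a", "item_b", "item_c"}}

# Explicit profile: ratings carry both direction and magnitude of preference,
# e.g. on a 1-5 scale where 1 = strongly disliked and 5 = strongly liked.
rating_profile = {"user_42": {"item_a": 5.0, "item_b": 2.0}}

def mean_rating(profile: dict, user: str) -> float:
    """Average rating of a user -- a simple summary statistic of a profile."""
    ratings = profile[user].values()
    return sum(ratings) / len(ratings)

print(mean_rating(rating_profile, "user_42"))  # 3.5
```

Note how only the explicit profile lets the system distinguish liked from disliked items; the purchase set alone cannot express dislike.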
Once the user profiles are acquired, the RS can begin to produce recommendations. This is
usually done by means of numerical algorithms that exploit the data from the user profiles
and the item catalogue. In accordance with the modern literature on recommender systems,
three state-of-the-art recommendation approaches can be distinguished: content-based, col-
laborative, and hybrid approaches (Balabanovic and Shoham 1997; Adomavicius and Tuzhil-
in 2005). In each given case the choice of a recommendation approach depends heavily on the
type of the user profile data and the characteristics of the item domain it is applied to. These
approaches will be discussed in depth in Chapter 2. At this point, it is worth mention-
ing that personalized recommendations arise from a process which relies greatly on the quali-
ty of the input data, i.e. the user and item profiles, and on the characteristics of the underlying
algorithms.
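As a foretaste of the collaborative approach, the following sketch shows a textbook user-based collaborative filtering predictor: Pearson similarity computed over co-rated items, combined into a mean-centered, similarity-weighted prediction. The toy data and all names are illustrative assumptions; the exact formula variants used in this thesis are developed in Chapter 2.

```python
import math

# Toy rating database: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "alice": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "bob":   {"m1": 4.0, "m2": 2.0, "m3": 5.0},
    "carol": {"m1": 1.0, "m2": 5.0},
}

def pearson(u: str, v: str) -> float:
    """Pearson correlation computed over the items both users rated."""
    common = ratings[u].keys() & ratings[v].keys()
    if len(common) < 2:
        return 0.0
    mu = sum(ratings[u][i] for i in common) / len(common)
    mv = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    du = math.sqrt(sum((ratings[u][i] - mu) ** 2 for i in common))
    dv = math.sqrt(sum((ratings[v][i] - mv) ** 2 for i in common))
    return num / (du * dv) if du and dv else 0.0

def predict(user: str, item: str) -> float:
    """Predict a rating as the user's mean rating plus a similarity-weighted
    combination of the other users' mean-centered ratings for the item."""
    base = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, profile in ratings.items():
        if other == user or item not in profile:
            continue
        sim = pearson(user, other)
        other_mean = sum(profile.values()) / len(profile)
        num += sim * (profile[item] - other_mean)
        den += abs(sim)
    return base + num / den if den else base
```

Mean-centering matters here: it corrects for the fact that some users systematically rate higher than others, so only the deviation from each user's own baseline is transferred.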
The numerical algorithms, in turn, are subject to errors that may result from a num-
ber of factors, such as incompleteness of data, data input and profile extraction errors, algo-
rithmic processing errors, and misspecification of the user decision strategy model (Herlocker,
Konstan, and Riedl 2000; Aksoy et al. 2006). By presenting users with erroneously predicted
recommendations, RS risk compromising their credibility and the users’ trust,
which may result in deterring and losing customers (Sinha and Swearingen 2002; Gershoff,
Mukherjee, and Mukhopadhyay 2003; O’Donovan and Smith 2005; Cramer et al. 2008). This
issue raises two questions:
(i) How can the recommendation algorithms be improved in order to reduce the er-
ror rate and magnitude?
(ii) How can the negative effect of inaccurate recommendations on acceptance and
trust be mitigated?
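Question (i) presupposes a way to quantify error magnitude. The two standard accuracy measures listed in the glossary, MAE and RMSE, can be sketched as below; this is a generic textbook illustration with made-up example numbers, not the thesis's own evaluation setup, which is formalized in Section 4.2.

```python
import math

def mae(predicted: list, actual: list) -> float:
    """Mean Absolute Error: the average magnitude of the prediction errors."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted: list, actual: list) -> float:
    """Root Mean Squared Error: like MAE, but penalizes large errors more."""
    return math.sqrt(sum((p - a) ** 2
                         for p, a in zip(predicted, actual)) / len(actual))

predicted = [3.5, 4.0, 2.0, 5.0]  # ratings forecast by the RS
actual    = [4.0, 4.0, 1.0, 3.0]  # ratings the users actually gave
print(mae(predicted, actual))   # 0.875
print(rmse(predicted, actual))  # about 1.146
```

Because RMSE squares each error before averaging, a single badly mispredicted item (here the 5.0 vs. 3.0 pair) dominates it far more than it dominates MAE.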
While the first question is directly related to the numerical algorithms, the second one is
typically addressed in the modern RS literature through the issue of explanations. That is, provid-
ing personalized explanations can reduce the negative effects of inaccurate recommendations,
thus improving the credibility of and trust in the RS (Herlocker, Konstan, and Riedl 2000). Alt-
hough both questions have been studied in the extant literature, there is still room for im-
provement in these research directions. An important shortcoming of the current research
could be seen in the fact that the two research streams have evolved mostly separately from
each other. We argue that an integrative approach to these independent research streams may
be beneficial for the reasons explained below.
Stimulated by Netflix's One Million Dollar Prize competition, research was primarily
concentrated on the accuracy of recommender algorithms. Having provided a movie rating
dataset of more than 100 million date-stamped ratings given by about half a million anonymous
Netflix customers on 17,770 movies (Bennett and Lanning 2007), Netflix indirectly
influenced the research by focusing it on this available data. The concentration solely on the
rating data was additionally aggravated by the limited ability of contemporary information
processing algorithms to automatically extract meaningful attributes descriptive of multimedia
content, i.e. movies (Wei, Shaw, and Easley 2002; Pazzani and Billsus 1997; Lops, de Gemmis,
and Semeraro 2011). Consequently, movie characteristics such as stars, budgets, country
of origin, etc. were not handled adequately by recommender research. The fact that movie
research provides evidence that these characteristics significantly influence movie success
as a result of consumers' preferences (Hennig-Thurau, Houston, and Walsh 2006) was largely
ignored in the RS literature.12 We argue that incorporating such characteristics in the recommendation
process can be fruitful for at least the following reasons:
Firstly, capturing the attribute-related movie preferences offers potentially more information
than the rating data alone. This allows the recommender to address user preferences in a more
flexible way and at a finer level of resolution while generating recommendations, thus leading
to potentially more precise predictions of user ratings, i.e. overall preferences, towards particular
items. Secondly, with the attribute-related preference information at hand, it is possible
to align the recommendation process with the users' preference structures and so to reflect the
users' intrinsic attribute weights and decision strategies within the recommendation generation
procedure. According to Aksoy et al. (2006), this leads to higher choice efficiency on
the user's side. Thirdly, knowing the attribute-related weights that lead to a particular recommendation
allows an RS to provide users with the reasons underlying a recommendation, i.e. personalized
explanations. This increases recommender transparency and credibility (Sinha and
Swearingen 2002; Cramer et al. 2008; Herlocker, Konstan, and Riedl 2000) and offers
further benefits for the users, thus reducing the negative effects of inaccurate recommendations.13
Furthermore, addressing preferences at the attribute level increases the degree of
detail at which explanations can be provided. Moreover, because preference-relevant attributes
can be taken into account, such explanations can emphasize those aspects
of the items which users themselves consider important while evaluating the items. Consequently,
such explanations can be better understood by the users, making them potentially more
valuable and, importantly, actionable. Intuitively, the reliability of attribute-based
explanations depends on the ability of the underlying recommendation algorithm to
handle preferences on the attribute level. We therefore conclude that the questions of improving
the accuracy of recommendation algorithms and of handling inaccurate recommendations
are not mutually independent, but rather complementary. Hence, these questions should
be addressed simultaneously within an integrative approach.
12 When preferences towards movie attributes were used in extant work (e.g. Ying, Feinberg, and Wedel 2006), the choice of the attributes was either based on information availability rather than a thorough study of relevant attributes, or the attributes were used for post-processing of recommendation generation (e.g. Symeonidis, Nanopoulos, and Manolopoulos 2009).
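To make the argument concrete, the following sketch illustrates how attribute-level weights can drive both a rating prediction and a personalized explanation. It is a minimal toy model under strong assumptions: the attribute names, the weights, and the additive linear utility are invented for illustration and do not represent the method developed in this thesis.

```python
# Sketch: attribute-based rating prediction plus explanation.
# Illustrative only: attribute names, weights, and the additive
# model are assumptions, not the method proposed in this thesis.

def predict_rating(attr_weights, item_attrs, base=3.0):
    """Predict a 1-5 star rating as a base value plus the summed
    part-worths of the attributes the item possesses."""
    score = base + sum(attr_weights.get(a, 0.0) for a in item_attrs)
    return max(1.0, min(5.0, score))  # clamp to the rating scale

def explain(attr_weights, item_attrs, top_n=2):
    """Return the item attributes with the largest positive weights,
    i.e. the reasons underlying the prediction."""
    relevant = [(a, attr_weights[a]) for a in item_attrs
                if attr_weights.get(a, 0.0) > 0]
    relevant.sort(key=lambda aw: aw[1], reverse=True)
    return [a for a, _ in relevant[:top_n]]

# Hypothetical preference profile learned from a user's past ratings.
weights = {"drama": 0.8, "directed_by_eastwood": 0.6, "horror": -1.2}
movie = ["drama", "directed_by_eastwood"]

print(predict_rating(weights, movie))   # approximately 4.4
print(explain(weights, movie))          # the reasons behind the score
```

Because the same weights produce both the prediction and the explanation, the explanation is guaranteed to reflect the actual reasoning of the algorithm, which is the integrative property argued for above.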
The considerations exposed above motivate the current thesis and form the basis for
the objectives formulated in the next section.
1.2 Objectives
Following the considerations and reasons provided in the previous section, the current
thesis aims to develop a recommendation method that is capable of providing both
accurately predicted recommendations and actionable explanations of the reasoning behind
them, as well as of aligning the recommendation process with the user preferences.
Contrary to the typical RS research approach of building an explanation facility around
pre-calculated recommendations, we aim to incorporate the ability to provide explanations
directly within the basic framework of the recommendation algorithm.
The stated objectives should be accomplished by means of incorporating attribute-based
preferences into the recommendation process. Through an integrated consideration
of the algorithmic and explanatory issues of RS, we aim to combine the advantages of pure
algorithmic accuracy with the benefits offered by explanation facilities while mitigating the
disadvantages of the respective approaches.
13 These positive effects of explanations will be discussed in Chapter 2.1.
Finally, despite the limited ability of algorithms to automatically process multimedia
content, the recommendation method is developed with the domain of motion pictures
in focus. This represents an additional challenge for our research while
enhancing its contribution to the RS literature.
1.3 Outline of the Thesis
This document is structured in five chapters, with a bibliography section and appendices
at the end.
Chapter 2 presents the research related to the objectives of this thesis.
In particular, it encompasses the research on multiattribute utility, movie preferences,
explanations of recommendations, and an overview of contemporary recommendation
algorithms. This chapter thus provides the information indispensable for designing
our proposals.
Chapter 3 describes our proposed conceptual framework for a recommendation algorithm
which incorporates attribute-based preferences of the users, allows aligning
the recommendation process with the users' preference structures, and provides the
information needed for the generation of detailed and actionable explanations. This represents
the core of the current thesis.
In Chapter 4, the proposed algorithm is empirically tested using real-world data from the
commercial recommendation systems MoviePilot and Netflix. The accuracy of the proposed
method is compared against that of state-of-the-art recommendation algorithms.
Additionally, the level of explanation detail is compared across the different algorithms.
Finally, Chapter 5 concludes the thesis, restating its main contributions and listing avenues
for further work.
Chapter 2
Background and Related Work
This chapter sums up the theoretical background that underlies the proposals of the current
thesis and provides an overview of the work related to our objectives.
Specifically, the first section addresses the questions of why explanations of the reasoning
behind recommendations should be provided and how exactly this should be done.
The second section projects these findings into the domain of motion pictures and elaborates
on the operationalization of movie characteristics for their subsequent use in the process
of recommendation generation. The third section provides an overview of the key recommendation
approaches and presents detailed descriptions of the corresponding recommendation
algorithms, knowledge essential for the development of a new recommendation
method. The fourth section recapitulates the main points of the theoretical discussion and
concludes the chapter.
2.1 Explanations in Recommender Systems
A front-page Wall Street Journal article from 2002, titled "If TiVo Thinks You
Are Gay, Here's How to Set It Straight", describes users' frustration with irrelevant
choices made by their digital video recorder "TiVo", which records programs it assumes its
owner will like, based on shows s/he has chosen to record in the past. For instance, Mr.
Iwanyk suspected that his TiVo thought he was gay, since it inexplicably kept recording
programs with gay themes. Another case described in the article concerns the founder of
Amazon.com, Jeff Bezos. "For a live demonstration before an audience of 500 people, Mr. Bezos
once logged onto [amazon.com] to show how it caters to his interests. The top recommendation
it gave him? The DVD for 'Slave Girls From Beyond Infinity'. That popped up because
he had previously ordered 'Barbarella', starring Jane Fonda, a spokesman explains" (Zaslow
2006). While Mr. Bezos could save the situation by providing a reasonable justification for a
risqué recommendation, Mr. Iwanyk, in the absence of explanations, had to figure out how to set
things straight on his own. These examples already convincingly foreshadow the need to integrate
explanation facilities into recommender systems.
More detailed evidence for providing explanations, as well as the foundations of the
criteria for how explanations should be formed, will be elaborated, with respect to our aims,
in the subsequent sections of this chapter.
2.1.1 Relevance and Advantages of Explanation Facilities
The idea of providing explanations to the users of intelligent systems is not new. Explanations
have repeatedly been a subject of the research dedicated to expert systems (e.g. Buchanan
and Shortliffe 1984; Horvitz, Breese, and Henrion 1988; Andersen, Olsen, and Jensen
1990; Johnson and Johnson 1993; Miller and Larson 1992; Sørmo, Cassens, and Aamodt
2005). For example, the most frequently cited expert system MYCIN,14 designed by
Shortliffe and Buchanan (1975) to assist physicians in prescribing antibiotics, incorporated
an explanation facility as an important component. Having a knowledge base of about 600
rules, it would ask the physician a series of simple yes/no questions in order to identify the
bacteria causing a patient's infection. At the end of the query process, the expert system provided
a list of possible bacteria ranked from high to low based on the probability of each diagnosis
and recommended a course of drug treatment. MYCIN also provided the reasoning
behind its recommendations, i.e. a list of the questions and rules which led to a particular diagnosis
and its rank order.
14 The name MYCIN is not an acronym but was rather derived from the suffix "-mycin" typical of the antibiotics the expert system was intended to prescribe.
Despite MYCIN's success as an expert system, its developers claimed that its power
was related less to the details of the underlying numeric model than to its knowledge
representation and reasoning scheme, i.e. explanations that allowed physicians to verify why
a conclusion was arrived at and how much was known about a certain concept. They concluded
that expert systems that act as decision guides need to provide explanations for their advice
(Buchanan and Shortliffe 1984).
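The explanatory principle of such rule-based systems can be sketched as follows. This is a toy forward-chaining example whose rules and findings are invented for illustration; it in no way resembles MYCIN's actual knowledge base or certainty-factor machinery.

```python
# Toy rule-based inference with a "why" trace, loosely in the spirit of
# rule-based expert systems such as MYCIN. Rules and findings are
# invented for illustration only.

RULES = [
    # (rule id, required findings, conclusion)
    ("R1", {"fever", "stiff_neck"}, "suspect_meningitis"),
    ("R2", {"suspect_meningitis", "gram_negative"}, "suspect_e_coli"),
]

def infer(findings):
    """Forward-chain over the rules; return the derived conclusions plus
    the list of fired rules, which serves as the explanation trace."""
    known = set(findings)
    trace = []
    changed = True
    while changed:
        changed = False
        for rid, required, conclusion in RULES:
            if required <= known and conclusion not in known:
                known.add(conclusion)
                trace.append((rid, sorted(required), conclusion))
                changed = True
    return known - set(findings), trace

conclusions, why = infer({"fever", "stiff_neck", "gram_negative"})
for rid, because, concl in why:
    # Each line answers the physician's "why?" for one conclusion.
    print(f"{concl} because rule {rid} fired on {because}")
```

The trace is the crucial part: the system can answer "why?" by replaying which rules fired on which findings, exactly the kind of scrutability MYCIN's developers credited for its acceptance.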
Since MYCIN, the need to provide explanations of the reasoning behind the recommendations
produced by expert systems has been widely recognized. It has been pointed out that
explanation facilities are required for expert systems to be considered useful and acceptable
because they remove the black box from around the recommendation process, thus raising
confidence in recommendations by providing users with transparency, i.e. an understanding
of the model used and the ability to reassess recommended actions (Moore and
Swartout 1988; Horvitz, Breese, and Henrion 1988; Majchrzak and Gasser 1991; Miller and
Larson 1992; Johnson and Johnson 1993; Brézillon and Pomerol 1996; Doyle, Tsymbal, and
Cunningham 2003; Lacave and Díez 2004).
Because RS and expert systems have common roots and strive for similar goals – the provision
of recommendations that help users make their choices more efficiently – RS can be considered
successors of expert systems. Hence, the arguments supporting the provision of explanations
for recommendations in the domain of expert systems remain valid in the
domain of RS as well (Herlocker, Konstan, and Riedl 2000; see also Tintarev and Masthoff 2008;
Cramer et al. 2008).
As in expert systems, explanations play a crucial role in RS. They bring
transparency into the recommendation process and provide users with an instrument for
handling the errors that come along with recommendations (Herlocker, Konstan, and Riedl 2000).
The importance of such an instrument cannot be overestimated:
Firstly, it is natural for humans to ask for reasoning while handling recommendations.
"Consider how we […] handle suggestions as they are given to us by other humans. We recognize
that other humans are imperfect recommenders. In the process of deciding to accept a
recommendation from a friend, we might consider the quality of previous recommendations
by the friend or we may compare how that friend's general interests compare to ours in the
domain of the suggestion. However, if there is any doubt, we will ask "why?" and let the
friend explain their reasoning behind a suggestion. Then we can analyze the logic of the suggestion
and determine for ourselves if the evidence is strong enough." (Herlocker, Konstan,
and Riedl 2000, p. 242).
Secondly, the recommendations generated by RS are inherently prone to errors. Automated
recommender systems are in essence stochastic processes that infer their recommendations
from heuristic approximations of human processes by means of numeric algorithms.
Their computations are done on extremely sparse and incomplete data. These two
conditions result in recommendations that are often correct and reliable, but also occasionally
very wrong, i.e. the suggestions generated by RS are subject to errors. The errors can be
caused, for example, by misspecification of the employed user model or by inadequate data
(see Appendix A for further details). The chance of receiving an erroneous recommendation
impairs the users' acceptance of and trust in RS. Explanations of the reasoning behind recommendations
provide users with indications of when to trust a recommendation and when to doubt
one. By helping users detect or estimate the likelihood of errors in recommendations, explanations
mitigate and may even recover the loss of acceptance and trust caused by erroneous
recommendations (Herlocker, Konstan, and Riedl 2000).
In contrast to expert systems, the effects of transparency on acceptance and
trust have not yet been extensively explored in the area of RS. To our knowledge, there exists only
one study that examines these effects (Cramer et al. 2008). Unfortunately, this study is limited
to the domain of artworks and operates with a rather small sample of 60 persons divided into
three between-subject experimental conditions, so that the findings can hardly be considered
generalizable. Nevertheless, the study by Cramer et al. provides initial support for the above-stated
suitability of transferring the arguments that justify the provision of explanations for
recommendations from the area of expert systems to the area of RS. Their findings
confirmed that explaining to the user why a recommendation was made (i.e. transparency)
significantly increases the acceptance of recommendations. In this study, trust in the RS itself was
not directly influenced by transparency. However, the results showed that RS that provided
explanations of the reasoning behind recommendations were perceived as more understandable
by the users. Perceived understanding in turn correlated with perceived competence,
trust, and acceptance of the system. This indicates that the effects of transparency on trust in
and perceived competence of the RS might either not have surfaced in this study due to
the small sample size, or that these effects are mediated or moderated by
the perceived understanding of the explanations (which unfortunately was not tested by the
authors). Both of these possible interpretations point to the importance of the transparency
implied by explanations for RS, and specifically for users' trust in RS.
Many other arguments that support the provision of explanations and substantiate
their benefits for users and RS providers can be found in the RS literature. However, most
publications cover just a few arguments at a time, without claiming to
provide a systematic overview of the reasons and benefits. To our knowledge, only two
groups of authors have attempted to develop a systematic classification of the reasons for
providing explanations in recent work. However, they approach the derivation of their
classifications from different perspectives, so that the developed taxonomies are neither
mutually exclusive nor complete: while Herlocker, Konstan, and Riedl (2000), in their classification,
consider the benefits of explanations from the user's point of view and constrain their
considerations to the case of automated collaborative filtering systems, Tintarev and Masthoff
develop their taxonomy from the provider's perspective with an emphasis on the aims of
providing explanations for different kinds of RS (Tintarev 2007; Tintarev and Masthoff
2007, 2011). Furthermore, in Herlocker et al.'s classification all user benefits follow from
transparency, whereas in Tintarev and Masthoff's variant transparency is just one of several coequal
aims.
Table 2.1 summarizes the reasons and benefits for the provision of explanations according
to the classifications of Herlocker, Konstan, and Riedl and of Tintarev and Masthoff, supplemented
with the arguments of Chen (2009) that do not fall into either of the aforementioned
classifications.15 With this classification we still do not claim completeness, but
rather aim to expand our understanding of the topic and to emphasize the need for explanations in
RS, and thus the need for recommendation algorithms that allow the generation of comprehensive
explanations.
15 Other authors have also elaborated on the reasons for providing explanations. However, as mentioned above, their arguments are rather fragmented and are either complementary to the points suggested in Table 2.1 (e.g. Sinha and Swearingen 2002; O'Donovan and Smith 2005; Cramer et al. 2008; Symeonidis, Nanopoulos, and Manolopoulos 2008; Jannach et al. 2011) or have served as a basis for the aforementioned publications. For the sake of brevity we do not refer to the latter works here and kindly ask interested readers to consult Herlocker, Konstan, and Riedl (2000) and Tintarev and Masthoff (2011) for the corresponding references.
Table 2.1: Reasons and benefits for provision of explanations

Herlocker, Konstan & Riedl (2000):
- Justification / Validation: user understanding of the reasoning, so that he may decide how much confidence to place in a recommendation.
- User Involvement: allow the user to add his knowledge and inference skills to complete the decision process.
- Education: help users understand the strengths and limitations of the system and better understand the product domain.
- Acceptance: greater acceptance of the RS, because its strengths and limits are fully visible and its suggestions are justified.

Tintarev (2007); Tintarev & Masthoff (2007, 2011):
- Transparency: explain how the system works, why one item was preferred over another.
- Scrutability: allow users to tell the system it is wrong, justify why additional information is needed.
- Trust and Credibility: increase users' confidence in the system, hence reduce the complexity of decision making in uncertain situations.
- Effectiveness: help users make better decisions.
- Efficiency: help users make decisions faster, reduce decision-making effort, i.e. time needed or cognitive effort.
- Persuasiveness: change users' buying behavior, convince users to try or buy.
- Satisfaction: increase ease of use, enjoyment, and customer return rate.

Chen (2009):
- Address contextual needs: help the user determine whether the recommendation is suitable in the user's given context or situation.
- Uncover hidden criteria: help users uncover important choice criteria they did not perceive as relevant before.
- Solve preference conflicts: make the preferable option more evident due to additional preference-relevant information.
At this point it is, however, important to mention that the reasons and benefits of
providing explanations, although identified in Table 2.1 as distinct, are not mutually independent
and thus may interact. For example, providing explanations for justification may
also help uncover hidden preferences, increase decision efficiency and effectiveness,
and increase satisfaction and trust (Herlocker, Konstan, and Riedl 2000; Tintarev and
Masthoff 2007).
Because of the advantages and benefits discussed above, as well as their positive
interactions, it seems sensible to equip RS with explanation facilities. The next section
elaborates on the question of how explanations should be formed, i.e. what explanation
style our recommendation algorithm should allow for.
2.1.2 Explanation Styles
The capability of providing personalized explanations varies across recommendation
approaches: it is very limited for collaborative filtering approaches and most
informative in the case of content-based ones (Tintarev and Masthoff 2007, 2011; Jannach et
al. 2011, p. 165).
Collaborative filtering (CF) approaches predict their recommendations based solely on
holistic preference data, i.e. ratings of items or buying acts. For this reason, the explanation
ability of these approaches is limited, allowing only for two kinds of rather generalized
statements: (i) "customers who bought item X also bought items Y, Z, …" and (ii)
"item Y is recommended to you because you rated item X" (Symeonidis, Nanopoulos, and
Manolopoulos 2008).16 The first kind of explanation statement mimics the human word-of-mouth
recommendation process (Jannach et al. 2011). It connects the user to whom the recommendations
are presented, i.e. the active user, to other users who have rated the recommended
item. Because in this case the underlying process produces recommendations on the
basis of user profile similarities, i.e. considers only the users who revealed preferences that
16 In the context of movie recommendations these statements can be correspondingly paraphrased as "people who liked movie X also like Y" and "you will like movie Y because you liked movie X".
DRAFT -
final
revisi
on to
appe
ar in
2012
Chapter 2: Background and Related Work 15
are similar to those of the active user, this explanation style is referred to as the "nearest neighbor"
style. In contrast, the second kind of statement connects the recommended item to the
items the same user has bought or rated in the past. In doing so, the system isolates the item
X that influenced the recommendation of Y the most. This explanation style is therefore denoted
as the "influence" style in the literature (Tintarev and Masthoff 2007, 2011; Symeonidis,
Nanopoulos, and Manolopoulos 2008).
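Both CF explanation styles can be illustrated with a minimal sketch over a toy rating matrix. The data and the co-rating heuristics below are invented for illustration and do not correspond to any particular production system:

```python
# Sketch of the two CF explanation styles over a toy rating matrix.
# ratings[user][item] = rating on a 1-5 scale (invented data).

ratings = {
    "alice": {"X": 5, "Y": 4},
    "bob":   {"X": 4, "Y": 5, "Z": 4},
    "carol": {"X": 5, "Z": 5},
}

def nearest_neighbor_explanation(item, active_user):
    """'Customers who liked item X also liked ...' style: list items
    that users who liked `item` also rated highly."""
    also = {i for u, r in ratings.items()
            if u != active_user and r.get(item, 0) >= 4
            for i in r if i != item and r[i] >= 4}
    return f"users who liked {item} also liked {', '.join(sorted(also))}"

def influence_explanation(item, active_user):
    """'Item Y is recommended because you rated item X' style: name the
    item rated by the active user that co-occurs with `item` most often."""
    rated = ratings[active_user]
    def cooccurrence(x):
        return sum(1 for r in ratings.values() if item in r and x in r)
    most_influential = max(rated, key=cooccurrence)
    return f"{item} is recommended because you rated {most_influential}"

print(nearest_neighbor_explanation("X", "dave"))
print(influence_explanation("Z", "alice"))
```

Note that neither statement refers to item content: both are derived purely from the rating matrix, which is precisely why CF explanations cannot go below this level of detail.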
In contrast to CF, content-based (CB) filtering systems utilize attribute-level preferences
for the generation of recommendations.17 They are thus able to explain their recommendations
at a finer level of resolution, where the item attributes that are relevant to the formation of
the users' preferences and choices can be individually addressed. Because the
attributes are typically extracted from the content of the recommended items, the explanations are
said to be presented in the "content-based" (Symeonidis, Nanopoulos, and Manolopoulos 2008;
Jannach et al. 2011; Tintarev and Masthoff 2011) or "keyword" (Bilgic and Mooney 2005;
Tintarev 2007; Tintarev and Masthoff 2011) style.18 An example of such an explanation could be
"This story received a high relevance score, because it contains the words f1, f2 and f3"19
(Billsus and Pazzani 1999).
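A keyword-style explanation can be sketched as follows. Here the feature extraction is reduced to hand-made attribute sets, and the overlap heuristic is an assumption for illustration rather than any published CB algorithm:

```python
# Sketch of a keyword (content-based) explanation: name the attributes of
# the recommended item that also occur in the user's highly rated items.
# Attribute sets and item names are invented illustration data.

liked_items = {
    "Unforgiven": {"drama", "eastwood"},
    "Se7en":      {"drama", "freeman"},
}

def keyword_explanation(item, item_attrs, liked_items):
    """Collect the attributes of `item` that appear in at least one item
    the user liked; these become the 'keywords' of the explanation."""
    liked_attrs = set().union(*liked_items.values())
    shared = sorted(set(item_attrs) & liked_attrs)
    return (f"{item} is recommended because it has the features "
            f"{', '.join(shared)}, which occur in movies you rated high")

print(keyword_explanation("Million Dollar Baby",
                          {"drama", "eastwood", "freeman", "boxing"},
                          liked_items))
```

Unlike the CF styles, the statement is phrased in terms of item attributes, i.e. at exactly the resolution at which users form their preferences.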
To date, only three studies involving real users provide an evaluation of explanation
styles for RS. In light of the goals of this thesis, the results of these studies can
be summarized as follows:
The study by Herlocker, Konstan, and Riedl (2000) examined various implementations
of explanation interfaces in the domain of "MovieLens",20 a CF movie
recommender system. Twenty-one variants of explanation presentation were compared to a
base case in which no explanations were provided. The results showed that the integration of an
explanation facility can, in many cases, significantly increase the acceptance of recommendations
by the users, which generally supports the thesis of Section 2.1.1. However,
acceptance can also decrease when the information provided exceeds the cognitive skills of the
17 For a detailed description of CB approaches see the corresponding section below.
18 As the terms "content-based" and "keyword" explanation style are used largely synonymously, in the further narration we adopt the term "keyword explanation style" to avoid ambiguity.
19 For the domain of movie recommendation, this example of a content-based explanation can be altered to "we recommend you to watch this movie, because Bruce Willis acts in it and it was awarded an Oscar".
20 http://www.movielens.com
users, i.e. cannot be easily understood. Particularly in cases when additional information, such
as a complex graph, the percentage of agreement of the closest neighbors, the number of neighbors
with standard deviation, or the average correlation between the neighbors, was presented, the
acceptance of recommendations decreased below the baseline. That is, although such technical
details undoubtedly increase the transparency of the functioning of an RS, users might not
consider them relevant for forming their decisions. Hence, transparency is only beneficial for
the users if they are able to cognitively handle it, i.e. if they can deduce and understand the
details provided about the way the system produces recommendations. We interpret this
as being consistent with the conclusion of Aksoy and colleagues that RS should "think
like the people they are attempting to help" (Aksoy et al. 2006, p. 310) and argue that this
conclusion maintains its validity with regard to explanations. That is, not only should RS
think like the users they support in decision making, they should also explain their
recommendations in the terms in which users evaluate their choices.
Bilgic and Mooney (2005) criticize Herlocker and colleagues for their overly narrow
concentration on acceptance and their inability to demonstrate that any of the explanation variants
actually increased users' satisfaction with the items they eventually chose. They argue
that "the goal of a good explanation should not be to 'sell' the user on a recommendation, but
rather, to enable the user to make a more accurate judgment of the true quality of an item".
The authors therefore conducted a user study in which they evaluated different explanation
approaches according to how well they allow users to accurately predict their true opinion of
an item. The results showed that users who were presented explanations in the nearest
neighbor style tend to overestimate the quality of the recommended items. The authors claim
that such overestimation leads to mistrust and could cause users to stop using the system.
Keyword-style and influence-style explanations were found to be significantly more effective
at enabling accurate assessments, whereby the keyword style dominated the influence
style, though not significantly.
Symeonidis, Nanopoulos, and Manolopoulos (2008) conducted a survey to measure user
satisfaction with three styles of explanation. Based on the results of Bilgic and Mooney,
they omitted the nearest neighbor explanation style from their study and introduced a new
one that combines the keyword and influence styles and has the following form: "Item X is
recommended, because it contains features a, b, …, which are included in items Z, W, … that
you have already rated".21 In a between-subject experimental design they recommended each
user a movie, justified by one of the three explanation styles. The users were then asked to
rate each explanation style separately to explicitly express their actual preference among the
three styles. The survey showed that the combined explanation style dominated both the keyword
and the influence style at a high significance level. In this study, however, the influence
explanation style performed better than the keyword style. Unfortunately, the authors do
not report the significance of the latter outcome, which might indicate consistency with Bilgic
and Mooney's results showing that the difference between the two styles is not significant.
The authors argue, however, that the keyword explanation style provides advantages in
convenience and effectiveness over the influence style, as it taxes users' inference skills less.
To further understand the advantages of the keyword explanation style, consider two
examples of explanations provided by a movie recommender. A keyword-style explanation
could be: "Million Dollar Baby (2004) is recommended because it is a drama directed by
Clint Eastwood and starring Morgan Freeman, which are features contained in the movies
you rated high." In contrast, an influence-style justification would be: "Million Dollar Baby
(2004) is recommended because you gave high ratings to Unforgiven (1992), Se7en (1995)
and Gran Torino (2008)". The latter style burdens the user with making the connection
between the movies mentioned and understanding their commonalities, e.g. that they are all
dramas, that two of the movies were directed by Clint Eastwood, and that two star Morgan
Freeman. Although a heavy movie watcher may find such commonalities easy to deduce, for
less experienced movie consumers such an effort can be rather discouraging. It can thus
be argued that spelling out the common features simplifies the inference process
for both user types.
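The combined style of Symeonidis et al. can then be sketched by joining the two pieces of information, the shared features and the already-rated items that contain them. This is an illustrative toy reconstruction of the style's wording, not the authors' actual implementation:

```python
# Sketch of the combined keyword + influence explanation style:
# "Item X is recommended because it contains features a, b, ..., which
# are included in items Z, W, ... that you have already rated."
# All data is invented for illustration.

rated_items = {
    "Unforgiven":  {"drama", "eastwood"},
    "Se7en":       {"drama", "freeman"},
    "Gran Torino": {"drama", "eastwood"},
}

def combined_explanation(item, item_attrs):
    """Name both the shared features and the rated items containing them."""
    shared = sorted(set(item_attrs) & set().union(*rated_items.values()))
    sources = sorted(name for name, attrs in rated_items.items()
                     if attrs & set(shared))
    return (f"{item} is recommended because it contains features "
            f"{', '.join(shared)}, which are included in "
            f"{', '.join(sources)} that you have already rated")

print(combined_explanation("Million Dollar Baby",
                           {"drama", "eastwood", "freeman", "boxing"}))
```

By stating the common features explicitly, the combined style removes the inference burden that the pure influence style places on the user, while retaining the familiar rated items as anchors.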
From the observations discussed above it follows that explanations are able to increase
the acceptance of RS and user satisfaction, and can help users make better
choices. The keyword and influence explanation styles lead to both higher user satisfaction
and a better ability of users to accurately judge the true quality of recommended items.
While the combination of both explanation styles leads to the best overall satisfaction with
21 The concrete wording employed in their study was "Recommended movie title: Indiana Jones and the last crusade (1989). The reason for recommendation is the participant Harrison Ford, who appears in 5 movies you have rated."
DRAFT -
final
revisi
on to
appe
ar in
2012
Chapter 2: Background and Related Work 18
recommendation, its keyword part seems to be the most important for the users‟ ability to
efficiently judge the quality of recommendations.
2.1.3 Explanations within the Recommendation Process
Another study worth mentioning in the context of explanation effectiveness, and one we
have already referred to above, is that by Aksoy et al. (2006). This study does not directly
concern explanations but can add to our understanding of how effective explanations
should be formed and which aspects a recommender algorithm should account for when
producing recommendations.
The authors examine the effect of similarity between a RS and a consumer on the quality
of consumer choices. Two dimensions of similarity are considered: One dimension is the
degree to which consumer preferences for different product attributes are incorporated in the
process of generating a recommendation.22 The other dimension is the degree to which the
RS employs decision-making strategies that are similar to those used by consumers.23
Aksoy et al. hypothesize that attribute weight similarity and perceived decision strategy
similarity influence decision quality independently of each other. Surprisingly, the results
of a preliminary study showed that using a RS that was similar in either attribute weights
or decision strategy led to consumer decisions of the same quality as using an agent that
was similar on both of these aspects. Notably, the authors verified this finding in their main
study and successfully replicated the results. This means that it is enough for a RS to be
similar to a user on one of the two similarity dimensions in order to produce
recommendations that significantly increase decision quality and reduce search effort. In
addition, Aksoy et al. showed that web site loyalty and satisfaction also increase regardless
of the dimension on which a RS and the users are similar. On the contrary, dissimilarity in
both attribute weight and decision strategy hurts consumer welfare by increasing perceived
costs, reducing choice quality, and lowering web site loyalty. The latter makes consumers
"believe they make better decisions using no [recommendation] agent at all than using a
doubly dissimilar agent" (Aksoy et al. 2006, p. 311). Based on these findings the authors
conclude that the similarity between RS and consumers matters and that recommendation
agents "should think like the people they are attempting to help if the goal is to assist
consumers in making better choices" (Aksoy et al. 2006, p. 310).

22 Recommendation algorithms differ with regard to the extent to which they incorporate user preferences in the recommendation process. Some recommendation agents, like mySimon.com, provide randomly ordered alternative lists that do not incorporate any information about consumer preferences. Other agents, like Amazon.com, indirectly elicit attribute importance information based on the customer's previous choices, which may or may not be concordant with the consumer's own utility function. Finally, there exist recommendation agents, such as activeBuyersGuide.com, which directly elicit the consumer's attribute importance weights and explicitly use them to rank alternatives (Aksoy et al. 2006; Diehl, Kornish, and Lynch 2003).
23 According to decision-making research, consumers may employ a variety of cognitive strategies when choosing among products. These strategies range from compensatory decision strategies, such as the weighted additive model (WADD), to simplifying heuristics, such as lexicographic decision rules or elimination by aspects (EBA). For a comprehensive review see Bettman, Johnson, and Payne (1991).
In a further study, Aksoy, Cooil, and Lurie (2011) extend the outcomes of Aksoy et al.
(2006) by showing that the relative utility and the sum of attribute values of the chosen
alternative capture the majority of variance in objective decision quality. Combined, these findings
support the suggestion by Ansari, Essegaier, and Kohli (2000) that RS that provide recom-
mendations based on preference models used in marketing (i.e. incorporate individual attrib-
ute importance weights) might lead to higher consumer choice effectiveness than RS that rec-
ommend products according to the preferences of other dissimilar consumers (i.e. through
collaborative filtering).
The results of Aksoy and colleagues (2006, 2011) emphasize the importance of individual
attribute preferences for RS as a whole and specifically for the process of recommendation
generation. In light of this, the finding that it is enough to maintain either attribute weight
or decision strategy similarity allows a recommender algorithm to concentrate
on the first type of similarity, while maintaining reasonable decision quality at the user side.
The concentration on preference attributes within the recommendation process allows
generating explanations that address single attributes, i.e. producing explanations in the
keyword explanation style. As was shown above, this allows users to efficiently judge the
quality of recommendations and increases the quality of the choice outcome. Additionally, the
inferred attribute preference weights make it possible to rank-order the keywords within
explanation statements according to their relative importance to individual users, and thus to
potentially simplify the choice task by emphasizing the most relevant keywords.
Taking into account the reduced need to maintain the similarity of decision strategy, the
algorithm may employ a weighted additive compensatory decision rule (WADD) in order to
ensure the highest quality of recommendations with respect to decision effectiveness. WADD
offers a RS at least four advantages:
Firstly, WADD is capable of processing user preferences at the attribute level. Therefore,
this decision strategy can easily be implemented within a numeric algorithm that strives
to increase users' choice efficiency by addressing attribute-related user preferences in the
process of recommendation generation.
Secondly, WADD has been found to lead to the normatively best consumer decisions when
compared to heuristic decision procedures, i.e. simplifications of the choice process (Payne,
Bettman, and Johnson 1988). Hence, the WADD model should produce the most effective
choices given that the consumer's attribute-related preference weights are known or can be
accurately estimated by a RS. The task of producing efficient recommendations is, therefore,
reduced to the task of eliciting users' attribute-related preference weights. Both employing
attribute preferences that are similar to the user's own and using a decision rule that
leads to the best consumer decisions potentially increase the robustness of recommendations,
i.e. make a RS as a whole potentially more tolerant to violations of the premise of attribute
preference weight similarity, which may be caused, for example, by calculation errors.
Thirdly, it is known from psychology that consumers do not have a stable utility function
(Jannach et al. 2011, p. 195). That is, the decision rule consumers use is subject to change
depending on the context of the choice situation at hand, e.g. mood, cognitive effort, time
pressure, product involvement, consumption environment, uncertainty, etc. (Payne, Bettman,
and Johnson 1993; Bettman, Johnson, and Payne 1991). Under these circumstances, if a RS
pursued decision strategy similarity, it would have to infer the decision strategy the
consumer currently employs each time s/he requests a recommendation. However, the process
of deriving the decision rule is likely to be time consuming and often leads to a cognitive
overload of the respondents. This diminishes the advantages of RS and potentially even
eliminates them. Instead, it seems reasonable for a RS to employ a decision rule that works
best overall (i.e. WADD), while maintaining attribute preference weight similarity. Although
consumer preferences may change over time, they are more likely to be persistent over
longer periods than decision strategies are. Furthermore, changes in preferences
can be tracked automatically and without a need to interfere with the user's interaction with a
RS: The recalculation of the attribute preference weights can be triggered automatically after
each implicit user input, such as a buying act or item rating.
Finally, because WADD is compensatory (i.e. it accounts for preference valence, so that
negative attribute-related preferences can offset positive ones), it allows attributes of
recommended items towards which negative preferences exist to be used as negative cues in
explanation statements. This again offers potential for increasing choice efficiency, since
several researchers have found that consumers tend to place more weight on negative
information when making evaluations (Lutz 1975; Wright 1974; Kanouse and Hanson 1972;
Ito, Larsen, and Cacioppo 1998). The keyword explanation style can thus be extended to a
"pros-and-cons" style, leading to an explanation form such as
"Titanic (1997) is recommended to you because it matches your preferences well.
Pros: High budget Hollywood movie directed by James Cameron.
Cons: You don't like the movie's drama genre and its star Leonardo DiCaprio.
Taking these factors into account, we expect that you will rate this movie 8 out of 10."
This explanation style maintains the advantages of the keyword explanation style with
respect to choice effectiveness and strengthens them by taking advantage of negative
cues. It can be argued that because the item features here are derived directly
from the attributes toward which the user preferences exist, a "pros-and-cons" explanation
involves the terms users actually employ in their evaluations. Hence, this style is informative,
understandable and actionable for the users.
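A minimal sketch of how such a statement could be assembled from signed attribute-level preferences; the labels, weights, and predicted rating below are hypothetical illustrations:

```python
# Sketch of assembling a "pros-and-cons" explanation from signed
# attribute-level preferences. The labels, weights, and predicted rating
# are hypothetical illustrations.

def pros_and_cons(title, attribute_prefs, predicted_rating):
    """attribute_prefs maps an attribute label to a signed preference weight."""
    pros = [a for a, w in attribute_prefs.items() if w > 0]
    cons = [a for a, w in attribute_prefs.items() if w < 0]
    lines = [f"{title} is recommended to you because it matches your preferences."]
    if pros:
        lines.append("Pros: " + ", ".join(pros) + ".")
    if cons:
        lines.append("Cons: " + ", ".join(cons) + ".")
    lines.append("Taking these factors into account, we expect that you "
                 f"will rate this movie {predicted_rating} out of 10.")
    return "\n".join(lines)

print(pros_and_cons(
    "Titanic (1997)",
    {"high budget": 0.8, "director James Cameron": 0.6,
     "drama genre": -0.4, "star Leonardo DiCaprio": -0.3},
    8))
```

The sign of each weight decides whether an attribute appears as a positive or a negative cue.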
2.1.4 Summary
Summarizing the discussion of Section 2.1, we can conclude that it seems sensible to
incorporate an explanation facility into RS because it provides a series of benefits for both
users and RS providers (see Table 2.1). Besides increasing transparency as well as user
acceptance of, trust in, and loyalty to RS, explanations of the reasoning behind
recommendations provide users with an instrument to handle errors in recommendations and
hence mitigate the negative effects of the latter. Furthermore, explanations allow users to form their judgments and
evaluate the recommendations more efficiently, which increases choice quality and
effectiveness.
However, in order for the benefits to surface it seems essential that (i) the explanations
provided are understandable to the users and (ii) the recommendation process is concordant
with the way the users evaluate choice alternatives on at least one dimension: attribute preference
weights or decision strategy. A possible (conceptual) solution that fulfills both requirements
simultaneously is to incorporate user attribute preferences into the recommendation genera-
tion algorithm that employs a weighted additive decision rule (WADD).
This approach has several advantages: On the one hand, it reduces the recommendation
task to the task of eliciting attribute preference weights, without having to derive a decision
model for each user in each given recommendation setting, and thus unifies and simplifies the
problem of recommendation calculation. On the other hand, because the attribute preference
weights in this case are directly involved in the calculation of recommendations, the contribu-
tion of each attribute to every recommendation is known. Hence, this information can be used
straightaway for generating an explanation of the reasoning behind the recommendation in a
keyword explanation style, which is known to be understandable and actionable to the users
as well as to reasonably contribute to choice effectiveness. Finally, the keyword explanation
style can be extended to the "pros-and-cons" style, thus offering the merit of negative cues,
which play an important role in making evaluations, and potentially further increasing decision
effectiveness at the user side. Altogether, the proposed approach offers an integrative view of
the explanatory and algorithmic issues of RS within a common framework.
While the rationale for the advantages of explanations was provided above in Section
2.1, the algorithmic part will be elaborated on in the subsequent chapters. In the context of the
objectives of this thesis it is, however, important to clarify which attributes are relevant for the
preference building and choice making in the domain of movies and to provide background
knowledge about the state-of-the-art recommender algorithms. The following Sections 2.2
and 2.3 are dedicated to these questions.
2.2 Movie Related Preferences and Relevant Movie Characteristics
In the previous Section we proposed an integrative approach to the effective generation
of recommendations and explanations of the reasoning behind recommendations. This ap-
proach incorporates attribute preferences into the recommendation process and employs the
weighted additive model (WADD) for the generation of recommendations. At this point, in
order to develop a numeric algorithm that implements this approach in the movie domain
(which is the objective of the current thesis) and to enable the reader to comprehend its
development, further understanding of the involved topics is needed. The next two Subsections
therefore provide a brief overview of (i) the operationalization of preferences and (ii) the
attributes of motion pictures that are relevant for preference formation in this
domain.
2.2.1 Operationalizing Preferences: Multiattribute Utility Model and
Weighted Additive Decision Rule
The concept of multiattribute utility (MAU) has a long history in the research fields of
psychology, decision-making and marketing (e.g. Edwards 1954; Tversky 1967; Fishburn
1967; Green, Wind, and Jain 1972; Luce 1992; Carroll and Green 1995). This concept relies
on two fundamental notions: the principle of utility maximization and the decomposition
hypothesis. The former asserts that people make choices according to some criteria of worth.
Hence, each alternative is associated with a certain amount of utility (u), so that the
alternative that is considered best or preferred by a consumer over other alternatives has the
highest utility (Tversky 1967). In other words, if alternative A is preferred over alternative B,
then:

u(A) > u(B)   (2.1)
The decomposition hypothesis states that the utility of an alternative can be decomposed
into basic independent components. That is, people are assumed to evaluate alternatives on a
set of their components, i.e. attributes (Tversky 1967). In doing so, they assign partial utilities,
i.e. part-worths, to each of the attributes of an alternative, which are thought to reflect the
amount of preference that a consumer associates with the levels of the attributes that occur
within an evaluated alternative (Bettman, Johnson, and Payne 1991).24 Additionally, because
the relative importance of different attributes may vary with regard to the preference formation
of the consumer, the part-worths are weighted by the relative importance of the respective
attributes (w_j). Hence, the utility of a multiattribute alternative (u) equals the sum of the
part-worths (v_jk) of its attributes weighted by their relative importance. Formally, this yields:

u = ∑_j w_j · v_jk   (2.2)

where
u = utility of a multiattribute alternative
w_j = relative importance of the j-th attribute
v_jk = part-worth of the k-th level of the j-th attribute
j = 1, …, J: attributes of an alternative
k: levels of an attribute, embodied in the alternative
Equation (2.2) specifies an additive composition model of multiattribute utility and
thereby represents an operationalization of preferences, because utility reflects preferences.
That is, the MAU model makes it possible to rank-order a set of alternatives (e.g. products,
such as movies) with regard to a consumer's preference, assuming that all the part-worths and
all the corresponding importance weights are known or can be elicited, for example, by means
of a numeric algorithm.

Such a procedure of rank-ordering corresponds to the weighted additive (WADD)
decision rule (Bettman, Johnson, and Payne 1991; Corner and Kirkwood 1991; Weiss, Weiss,
and Edwards 2009). Concordant with the principle of decomposition described above, WADD
suggests a normative procedure of decision making that involves the consideration of all the
relevant information about the problem. That is, the WADD rule considers the values of each
alternative on all the relevant attributes as well as all the relative importance weights of the
attributes to the individual (Bettman, Johnson, and Payne 1991).

24 To further understand the relations between alternatives, attributes, and attribute levels, consider the example of choosing between different models of cellular phones. Each model represents a (choice) alternative, which may be evaluated on its attributes such as brand, display size, battery durability, price, etc. The levels of the attribute brand may be, e.g., Motorola, Samsung, Siemens, HTC, etc.; the levels of the attribute price may be, e.g., €20, €60, €120, etc. While a consumer may consider price more important than brand (i.e. the attribute importance weight of price is higher than that of brand), s/he might prefer cheaper phones over expensive ones (i.e. the part-worth of €20 is higher than those of €60 and €120).
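The computation prescribed by Equation (2.2), together with the rank-ordering that yields a recommendation, can be sketched as follows; all importance weights, part-worths, and movies below are hypothetical illustrations:

```python
# Minimal sketch of Equation (2.2): the utility of an alternative is the
# importance-weighted sum of the part-worths of its attribute levels.
# All weights, part-worths, and movies below are hypothetical.

importance = {"genre": 0.5, "director": 0.3, "star": 0.2}    # w_j
part_worths = {                                              # v_jk
    "genre":    {"Drama": 0.9, "Comedy": 0.2},
    "director": {"Clint Eastwood": 0.8, "Unknown": 0.1},
    "star":     {"Morgan Freeman": 0.7, "Unknown": 0.0},
}

def wadd_utility(alternative):
    """WADD: sum of w_j * v_jk over the levels embodied in the alternative."""
    return sum(importance[attr] * part_worths[attr][level]
               for attr, level in alternative.items())

movies = {
    "Million Dollar Baby": {"genre": "Drama", "director": "Clint Eastwood",
                            "star": "Morgan Freeman"},
    "Some Comedy":         {"genre": "Comedy", "director": "Unknown",
                            "star": "Unknown"},
}

# Rank-order alternatives by utility; the top one is the recommendation.
ranked = sorted(movies, key=lambda m: wadd_utility(movies[m]), reverse=True)
print(ranked[0])  # → Million Dollar Baby
```

Because the rule considers all attributes and all importance weights, every alternative receives a fully compensatory utility score before ranking.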
Although MAU and WADD have been criticized for their restricted ability to describe
how individuals actually make choices (Simon 1982; Edwards 1961; Luce 1992), and a series
of simplifying heuristics has been suggested that describes actual choice behavior better under
certain circumstances, e.g. time pressure, routine choosing, low-involvement products, etc.
(Kahneman and Tversky 1984; Bettman, Johnson, and Payne 1991; Gigerenzer et al. 1999), in
the normative view of decision analysis WADD has proved to lead to the most effective choices
(Tversky 1967; von Winterfeldt and Edwards 1986; Payne, Bettman, and Johnson 1988;
Aksoy, Cooil, and Lurie 2011). Because our aim is not to describe actual consumer behavior,
but rather to provide consumers with a decision aid that helps them achieve better choices, this
property of WADD is advantageous in the context of RS. Further advantages of WADD for
the generation of explanations and the production of recommendations were discussed in Section
2.1.3.
In view of the objectives of the current thesis and taking into account the discussion
of Section 2.1.3, the MAU model and the WADD rule prescribe how the recommender
algorithm should be constructed: For each user, the algorithm should elicit the individual
attribute level preferences, i.e. part-worths, and the importance weights of the attributes which
are relevant for the user's preference formation. The obtained information can then be
aggregated by means of WADD in order to calculate the utilities of the alternatives. Rank-ordering
the latter then leads directly to a recommendation of the most preferred alternative (or a
set of alternatives ranking high along user preferences).
Within the framework of movie RS, the utility (u) from Equation (2.2) can be
thought of as the rating (e.g. number of stars) that a user gives to a particular movie. The
higher the rating a movie receives from the user, the more the user likes the movie.
Accordingly, a movie with the highest rating possesses the highest utility for the user and thus
is the most preferred one. Hence, the operationalization of utility as a movie rating allows
comparisons of different movies in terms of the user's preferences. Moreover, given an
attribute composition model of user utility, this operationalization allows numeric inference of
the attribute part-worths as well as their later composition to holistic utilities for arbitrary
movies, their rank-ordering, and so, the recommendations.
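Under the linear composition model, this numeric inference can be sketched, for example, as an ordinary least-squares fit of importance-weighted part-worths to observed ratings; the movies, attribute coding, and ratings below are hypothetical:

```python
# Sketch of numerically inferring (importance-weighted) part-worths from a
# user's ratings under a linear WADD-style composition model. The design
# matrix X marks which attribute levels each rated movie embodies; the
# movies, attribute coding, and ratings are hypothetical.
import numpy as np

# Columns: [Drama, Comedy, Clint Eastwood, Morgan Freeman]
X = np.array([
    [1.0, 0.0, 1.0, 0.0],   # rated movie 1: Drama, directed by Eastwood
    [1.0, 0.0, 0.0, 1.0],   # rated movie 2: Drama, starring Freeman
    [0.0, 1.0, 0.0, 0.0],   # rated movie 3: Comedy
    [1.0, 0.0, 1.0, 1.0],   # rated movie 4: Drama, Eastwood, Freeman
])
ratings = np.array([8.0, 7.0, 3.0, 9.0])   # the user's ratings (utilities)

# Least-squares estimate of the weighted part-worths w_j * v_jk.
w, *_ = np.linalg.lstsq(X, ratings, rcond=None)

# Predicted rating (utility) for an unseen Drama starring Morgan Freeman.
new_movie = np.array([1.0, 0.0, 0.0, 1.0])
print(round(float(new_movie @ w), 1))  # → 7.0
```

The fitted coefficients can then be composed into utilities for arbitrary, unrated movies, which is exactly the inference step described above.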
To complete our conception of the movie recommendation algorithm that involves
attribute-related preferences, we now need a notion of which attributes of motion pictures
should be considered within the algorithm, i.e. which movie attributes should be elicited from
the users or their preference data. The next Section is dedicated to this topic.
2.2.2 Preference Relevant Attributes of Motion Pictures
Understanding which attributes of motion pictures drive consumer preferences and
determine their choices is not as trivial a task as it may seem. The extant research on movie
consumption leads to the recognition that addressing comprehensible preference relevant
movie attributes is challenged by the nature of movies:
Movies are hedonic experience goods. That means that, on the one hand, the main
motive for people to consume a movie consists in receiving hedonic value (e.g. pleasure,
thrill) from experiencing it, rather than in fulfilling a utilitarian need (Cooper-Martin 1991,
1992; Holbrook and Hirschman 1982). The nature and the outcomes of hedonic motives are,
however, much more difficult to understand than utilitarian motives (Hennig-Thurau,
Houston, and Walsh 2007), and thus are hard to formalize. On the other hand, the domain of
motion pictures is dominated by experience qualities. This means that the quality of a movie
can be assessed by consumers only when watching it (De Vany and Walls 1999). The latter
forces consumers to rely on proxies called "quasi-search qualities", i.e. movie traits that a
consumer can comprehend before watching a movie, and on movie-related communication for
forming their quality judgments (Hennig-Thurau, Walsh, and Wruck 2001; Hennig-Thurau,
Houston, and Walsh 2007).
Although research on movie consumption has repeatedly put consumers into the
center of interest in recent years (e.g. Hirschman and Holbrook 1982; Austin 1981, 1982, 1989;
Cooper-Martin 1991, 1992; Moon, Bergey, and Iacobucci 2010), it was mainly driven by the
hedonic nature of movies; hence, it mostly concentrated on the unique aspects of the
consumer behavior for this type of goods, rather than on the search for formalizable movie
attributes that would allow assessing the preferences of individual movie watchers.
Accordingly, little is known about which attributes of movies actually form individual movie
watchers' preferences.
To our knowledge, Austin (1989) is the only author who provides a thorough overview
of the reasons why an individual selects a specific movie. Although the author himself
critically questions the general validity of his assertions, we consider it sensible to provide a
brief excerpt thereof. Austin suggests movie genre to be the most influential attribute
determining the choice of a particular movie by moviegoers, though one movie can be
simultaneously classified in several genres. The genre categorization informs the consumer
about the type of the story and the elements of the film's plot, and so narrows down the set of
hedonic qualities which the consumer can anticipate from the film.
Further attributes influencing movie choice are the onscreen and offscreen production
personnel, whose name recognition can affect the attendance decision, i.e. acting stars,
directors, producers, screenwriters, and production companies that are responsible for visual
effects. While acting stars "no doubt contribute much to the audiences' awareness and
knowledge about the film" (Austin 1989, p. 77), only a few persons from the offscreen
personnel gain public recognition that is strong enough to affect movie attendance decisions
(Austin 1989). Hence, a recommender algorithm does not need to consider every name from
the movie industry as a preference relevant movie attribute: We can narrow down the list of
persons to be considered to those who possess star power, i.e. whose names are popular
enough to influence the consumer's movie preference assessment. Such a list can be obtained,
e.g., from analytic web sites that maintain an up-to-date list of movie stars and offscreen
personnel with star power, such as IMDb25 or InsideKino26.
Other factors that, according to Austin, influence movie choice are advertising, trailers,
critic reviews, and word-of-mouth (Austin 1989). These entities, however, can hardly be
classified as movie attributes; rather, they represent additional sources of information. That is,
they influence the process of preference assessment by providing customers with additional
cues about the qualities of a movie. Although additional information may increase
25 http://pro.imdb.com/people
26 http://insidekino.de/Starpower.htm
choice effectiveness, the information sources themselves are unlikely to possess distinct
characteristics toward which an individual may exhibit more or less stable movie relevant
preferences: The utility of an information source depends on the utility of the information it
transfers. In other words, we argue that it is unlikely that a consumer will like all movies
equally more or equally less just because s/he saw a trailer or a TV ad, or heard of a movie from a
particular friend. Hence, we discard the above mentioned entities from the list of preference
relevant movie attributes and from the further discussion thereof.
Further preference relevant movie attributes can be obtained from the stream of movie
research that concerns the economic success of motion pictures. This research stream also
considers consumer preferences, but approaches them from the perspective of the movie
producing industry rather than from the consumers' side. Specifically, the focus of interest
lies here on economic values such as movie profitability and box-office gross (Hennig-Thurau,
Walsh, and Wruck 2001), which are generated through the fees consumers pay,
e.g. for attending a movie in a cinema, acquiring it on DVD, etc. Thus, the "success factors"
are determined by consumers' reactions to studio actions, non-studio factors, as well
as to the characteristics of the movies themselves (Hennig-Thurau, Houston, and Walsh 2007). A
summary of motion picture success factors is provided in Table 2.2.
Table 2.2: Summary of motion picture success factors based on Hennig-Thurau, Walsh, and Wruck (2001)
and Hennig-Thurau, Houston, and Walsh (2007)

Movie characteristics: Genre, Stars, Directors, Budget, Symbolicity, Certification, Sequel, Language, Country of Origin, Movie Length
Post-filming studio actions: Advertising expenditures, Timing of movie release, Number of screens
Non-studio actions: Critical reviews, Awards, Customer-perceived movie quality, Early box-office information, Word-of-mouth
However, consumers enter the analysis of movie success only indirectly – through the
monetary value they generate at the aggregate level. That is, individual customers are not
considered. This means that the empirical evidence for the influence of success factors on the
decision to consume a movie in general cannot simply be interpreted as proof of the relevance
of the success factors, and specifically that of movie characteristics, for the movie preferences
of individuals. Nevertheless, the fact that the success factors significantly influence
consumption decisions in the aggregate can be interpreted as an indication that those factors
may be valid at the level of individual consumers. Hence, we assume that the motion picture
success factors listed in Table 2.2 potentially possess explanatory power for individual
consumers' movie preferences. For our objectives, this assumption has two consequences:
Firstly, it strengthens the support for the relevance of genre and production personnel for
individual preferences. Secondly, it extends the list of movie attributes that are potentially
relevant for an individual's preference assessment.
New to our list of preference relevant movie characteristics, i.e. attributes, are budget,
symbolicity, certification, sequel, language, country of origin, and movie length. In the
following we briefly describe the meaning of these attributes and the motivation for their
inclusion into the list of preference relevant movie attributes to be considered by a
recommendation algorithm:
Movie budgets serve consumers as an indicator of quality, "since the budget indicates
whether the producer has the resources to turn an idea into convincing reality through
acting, artistry, and technology" (Hennig-Thurau, Walsh, and Wruck 2001, p. 11). Thus, the
budget allows consumers to form their expectation of movie quality prior to watching it. In
fact, if we consider the popularity of high-budget movies of recent years (e.g. Avatar, The Lord
of the Rings, Titanic, Godzilla), we notice the tendency of higher budgets to attract more
movie watchers. Hence, although many consumers may not explicitly consider movie budgets
while making their movie consumption decisions (e.g. because such information is not always
available), its indirect influence should nevertheless be acknowledged. A recommender
algorithm can elicit such hidden preferences from data on past movie consumption and use
them for making predictions.
Certifications are intended to classify movies with regard to their potential offensiveness
for audiences and concern subjects such as suitability for children, violence, sex, abusive
language, etc. Although their impact on movies' box-office results remains disputable,
certifications are considered to influence consumers' interest in movies (Hennig-Thurau, Houston, and
Walsh 2007). Thus, we include certifications in the list of preference relevant movie
attributes.
Some movie producing countries are often associated with a specific style of
narration that may be more or less attractive to consumers. For instance, French movies are
expected to be rather arty and Hollywood ones "merely" entertaining (Hennig-Thurau,
Walsh, and Wruck 2001). Hence, the country of origin can be informative about an individual's
movie preferences.
The language spoken in a movie is closely related to the movie's country of origin and may
also influence the consumer's decision to watch the movie. Conventional wisdom tells us that
consumers who are not able to understand foreign languages are unlikely to watch undubbed
movies, whereas other people like to watch movies in the original language. However, in several
non-English speaking countries the original language of movies is less important, as the majority
of foreign movies is either dubbed (e.g. Germany, Russia, France) or subtitled (e.g.
Netherlands, Sweden, Bulgaria; Hennig-Thurau, Walsh, and Wruck 2001). Hence, the
informativeness of movie language with respect to consumer preferences may depend on the
country a recommender algorithm operates in.
Movie length can also be considered to impact consumer movie choice, since a significant number of consumers are not willing to spend more time watching a movie than what can be regarded as the 'critical length' (Hennig-Thurau, Walsh, and Wruck 2001).
Awards given by prestigious institutions such as the Academy of Motion Picture Arts
and Sciences (AMPAS) can be seen as an independent indicator of the aesthetic quality of a
movie (Hennig-Thurau, Walsh, and Wruck 2001; Hennig-Thurau, Houston, and Walsh 2007).
The relevance of awards for consumer behavior was illustrated in the service sector (Dick and
Basu 1994; Hennig-Thurau and Klee 1997) and is suggested to persist in the domain of mo-
tion pictures (Hennig-Thurau, Houston, and Walsh 2007). Although awards are not inherent attributes of movies, they are closely associated with them. Due to the considerations above, we can regard awards as 'exogenous' movie characteristics, and thus as preference-relevant attributes of motion pictures.
Still, not all motion picture success factors listed in Table 2.2 can be considered relevant movie attributes from the viewpoint of our goals, because some of them are not suitable for algorithmic prediction of consumer preferences.
Consider, for instance, "customer-perceived movie quality", which encompasses the movie's experience traits as well as structural qualities such as the movie's budget and personnel (Hennig-Thurau, Walsh, and Wruck 2001). This factor has three serious drawbacks. Firstly, it is a composite factor that contains several entities whose exact composition rule is not specified by previous research; this makes the factor impossible to operationalize within a numeric process. Further, it comprises the movie's budget and personnel. While the movie budget represents a new piece of information, movie personnel is already included in our list of attributes. Considering the latter a second time is unnecessary and can even harm an algorithm through perfect multicollinearity between multiple instances of the same entity. Lastly, besides the previously mentioned reasons, the concept of customer-perceived movie quality implies that a consumer has already seen a movie and can therefore assess his or her preferences towards its experiential traits. This means that a part of the information cannot be made available to an algorithm prior to the consumer's watching of a movie. Thus, movies unknown to consumers would be impossible to recommend. Accordingly, a recommendation algorithm with such a feature would make no sense.
Similar arguments apply to symbolicity, which refers to a movie's potential to be easily categorized by consumers into existing categories that the consumer is familiar with (Hennig-Thurau, Walsh, and Wruck 2001). This categorization is based on the movie's relationship to prior works (e.g. novels, myths, fairy tales, comics, TV programs, computer games, etc.) or its affiliation with a series of movies (Hennig-Thurau, Walsh, and Wruck 2001; Hennig-Thurau, Houston, and Walsh 2007). Accordingly, the property of being a sequel can also be seen as a dimension of the concept of "symbolicity" (Hennig-Thurau, Houston, and Walsh 2007), since sequels are both part of a series of movies and related to their respective predecessors. Whereas reporting the elements of symbolicity can help customers assess their liking of a movie prior to watching it, and so potentially increases decision effectiveness, we doubt its potential as a single attribute (i.e. whether or not a movie is based on prior work) to increase the quality of predictions by a recommender algorithm: Although a consumer may tend to like some set of source works movies can be based on (e.g. Greek myths),
at the same time s/he may dislike a subset of them (e.g. myths about Heracles). Similarly, from the fact that a movie watcher liked some sequel (e.g. Mission Impossible, Matrix), we cannot conclude that s/he generally likes sequels, since at the same time s/he might dislike other sequels (e.g. Batman, Spiderman). Hence, we regard the movie characteristics "symbolicity" and "sequel" as inappropriate to model within our preference-eliciting recommendation algorithm.
Further, although the number of screens, the timing of the movie release, advertising expenditures, and early box-office information influence movie attendance decisions, it can be argued that their impact is concentrated in the period proximate to the movie's release and diminishes over time. Moreover, these factors operate mostly by increasing awareness of a movie, rather than by directly impacting the consumer's preference for the movie itself. Because the value of a recommendation algorithm lies in recommending movies that above all match the user's preferences irrespective of their release times, and especially movies the user is not aware of, we can omit the above-mentioned movie success factors from further consideration.
Analogously, because word-of-mouth and critical reviews are hard to operationalize and
because they do not necessarily mimic the consumer‟s own preferences, we consider them
irrelevant for describing the individual‟s preferences within a recommendation algorithm.
However, completely discarding factors that are proven to reflect aggregate consumers' movie attendance decisions may be dangerous, as it involves the loss of some preference-relevant information that might not necessarily be captured by the remaining movie attributes. We suggest compensating for this information loss by accounting for the movie's box office and admissions (i.e. the number of people that have attended a movie) in our recommendation
algorithm. We propose two arguments to justify this suggestion: Firstly, because within movie success research these quantities are formed through movie watchers' decisions to consume a particular movie, they reflect to some extent the movie's relative popularity. We argue that the popularity of a particular movie may itself be a separate and independent motive to consume it. Hence, preference towards box office and admissions represents a "quasi-search" quality, since it indicates the movie's popularity as a quality judgment of other consumers. Secondly, because the success factors we suggested to omit have a proven influence on the movie box office (Hennig-Thurau, Houston, and Walsh 2007), the latter captures the variance in the former and so can serve as a proxy to assess the experiential qualities of
the movie as well as other omitted factors that would be difficult to operationalize otherwise (e.g. advertising pressure, word-of-mouth qualities, etc.).
Another attribute we propose to include in our list is the movie's year of production. We suggest that, among other factors, the age of a movie may also determine consumers' intention to watch it. Some consumers may prefer only newly released movies; others may have stronger preferences for older, 'mature' films. Hence we assume the year of production to be relevant for the consumer's preference formation towards watching a particular motion picture.
The discussion above provides an overview of movie attributes that are relevant for consumers' assessment of their preferences for movies prior to seeing them, and that thus may be incorporated into a recommendation algorithm that aims to generate recommendations reflecting individual user preferences as well as to provide comprehensive and actionable explanations behind those recommendations. The final list of preference-relevant movie attributes is summarized in Table 2.3.
Table 2.3: Summary of preference relevant movie attributes
Genre
Acting stars
Directors
Producers
Screenwriters
Production companies
Movie Length
Language
Country of origin
Certification
Budget
Admissions
Box-office
Year of production
2.2.3 Summary
Continuing the discussion of Section 2.1, which stresses the role of movie attributes for the provision of actionable explanations that increase decision effectiveness, Section 2.2 presented some insights into how attribute-related consumer preferences can be operationalized at the individual level, so that a numeric algorithm can produce personalized recommendations. Specifically, we utilize multi-attribute utility (MAU) theory as the basis for decomposing a consumer's preferences into attribute part-worths. In doing so, we propose operationalizing the utility of a movie for a consumer as the rating that the consumer assigns to the movie in order to embody his or her preferences. Provided that the attribute part-worths can be elicited by the algorithm, they can be used to calculate ratings, i.e. utilities, of arbitrary movies by means of the weighted additive (WADD) decision rule, and so to rank-order alternatives, i.e. movies, in accordance with the consumer's preferences. The movies with the highest calculated preference ratings represent the actual recommendations.
The list of movie attributes that are relevant for the individual's preferences was elaborated in Section 2.2.2, though the elaboration was challenged by the lack of research on this subject: while research on movie consumption merely suggests a set of movie attributes without proving its explanatory power, research on movie success operates with attribute preferences at the aggregate level. Arguing that empirical evidence behind movie attributes at the aggregate level also confirms their relevance for individual preferences, we combined the suggestions of both research streams. In addition, the suggested attributes were examined regarding their potential descriptive power in the context of RS and their suitability for operationalization within a recommender algorithm. We also suggested one further attribute (year of production), which was not subject to either of the research streams, to be descriptive of consumer movie preferences. The resulting list of preference-relevant movie attributes is presented in
Table 2.3.
At this point, we have at our disposal all the important concepts needed to construct an algorithm for the provision of personalized recommendations that is capable of effective and actionable explanations. Nevertheless, in order to ensure the novelty and conceptual effectiveness of our approach as well as to substantiate our development process, an overview of the
key algorithms employed in contemporary recommendation systems will be given in the following section.
2.3 Key Recommendation Techniques
The goal of RS is to provide users with recommendations of the items they are not
aware of and that are potentially interesting to them. In other words, RS help users to find
useful items. In doing so, RS try to predict user preferences, i.e. ratings, for the yet unseen
items. These predictions are based on preference data, usually ratings, that RS acquire from
their user base. Once the ratings for the yet unrated items are estimated, the system can rec-
ommend the item(s) with the highest estimated rating to the user (Adomavicius and Tuzhilin
2005).
More formally, the recommendation task can be described as follows: Let $U = \{u_1, u_2, \dots, u_n\}$ be the set of all users and let $I = \{i_1, i_2, \dots, i_m\}$ be the set of all items that can be recommended, such as movies, books, CDs, websites, news articles, etc. Let $R$ denote a matrix of ratings $r_{ui}$, with the indexes $u, i$ denoting a particular user-item combination. Finally, let $p: U \times I \rightarrow R$ be a preference function that measures the preference of user $u$ on item $i$, i.e., $p(u, i) = r_{ui}$. Then the recommendation task is: for each user $u \in U$, choose the item $i_u^* \in I$ that corresponds to the maximum of the user's preference:

$$i_u^* = \arg\max_{i \in I} p(u, i) \qquad (2.3)$$
The central problem of RS is, however, that the preference function $p$ is unknown and its mapping onto the rating space $R$ is not defined on the whole space $U \times I$ but only on some subset of it. This means that if a certain user $u$ has not rated an item $i$, the corresponding matrix entry $r_{ui}$ remains empty. Consequently, $p$ needs to be estimated from the non-empty entries of $R$ and then extrapolated to the whole space $U \times I$ in order to predict the unknown ratings (Adomavicius and Tuzhilin 2005; Jannach et al. 2011). Once the predictions are made, recommendations are produced according to (2.3).
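The task in (2.3) can be sketched as follows. The rating matrix and the deliberately naive preference estimator are illustrative stand-ins for the estimation techniques discussed in the following subsections:

```python
# Sketch of the recommendation task (2.3): recommend the unrated item that
# maximizes the estimated preference. R and the estimator are illustrative.
R = {
    "u1": {"i1": 10, "i2": 5},
    "u2": {"i1": 5, "i2": 8, "i3": 7},
    "u3": {"i1": 8, "i2": 5, "i3": 10},
}
items = ["i1", "i2", "i3"]

def estimate(user, item):
    # Naive stand-in for p(u, i): mean rating of the other users on the item.
    others = [r[item] for u, r in R.items() if u != user and item in r]
    return sum(others) / len(others)

def recommend(user):
    # Eq. (2.3): argmax of the estimated preference over the unrated items.
    unrated = [i for i in items if i not in R[user]]
    return max(unrated, key=lambda i: estimate(user, i))
```

For user "u1" the only unrated item is "i3", so the sketch recommends it once its rating is estimated.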
To estimate the ratings of the not-yet-rated items, contemporary RS employ a number of techniques. Although these techniques vary in their implementation details, they can be classified, based on the underlying principle of how recommendations are produced, into three general categories (Balabanovic and Shoham 1997):
Collaborative filtering,
Content-based filtering,
Hybrid approaches.
The approaches of the three categories differ in the strategies they employ, methods
they use, the data basis they rely on, and their inherent strengths and weaknesses. The follow-
ing subsections describe these approaches in more detail.
2.3.1 Collaborative Filtering
The key concept of collaborative filtering (CF) is that the information about the prefer-
ences of the entire user base of an RS can be exploited in order to produce recommendations.
That is, CF methods utilize all ratings from all users to all items available to the system to
predict which items a particular participant of the RS community will most probably like or
be interested in. The fact that every user potentially contributes to a recommendation gives this group of methods its name, i.e. the users are thought to jointly "collaborate" on the recommendation process. The CF methods family encompasses three approaches that differ in
the ways the rating data is used: user-based CF, item-based CF, and matrix factorization. Be-
low, we provide a brief overview of each of these approaches.
2.3.1.1 User-based Approach
The main idea of the user-based CF approaches (e.g. Shardanand and Maes 1995; Kon-
stan et al. 1997; Breese, Heckerman, and Kadie 1998; Nakamura and Abe 1998; Delgado and
Ishii 1999; Herlocker et al. 1999; Jannach et al. 2011) is that those users, who exhibited pref-
erences similar to the ones of the current user in the past, can serve as predictors of the prefer-
ences of the current user on items s/he has not seen yet. That is, the aggregated ratings of such
users (also referred to as peer users or nearest neighbors) are used as predictors of the ratings
of the current user. Accordingly, the algorithm can be broken down into the following steps:
1. From all users in the user base $U$, find a subset $\hat{U}$ of users that are similar to the current user $u$.
2. Aggregate the ratings of these users for the set $\hat{I}$ of items the current user has not rated yet.
3. Recommend the item from $\hat{I}$ that exhibits the highest aggregated rating.
To gain an intuition for how this algorithm works, let us examine Table 2.4, which shows an example of a rating database. The active user, Daniela, for instance, has rated "Sin City" with "10" on a 1-to-10 scale, which means that she strongly liked this movie. Now, the task of our RS is to predict Daniela's rating of "Thor", which she has not seen or rated yet. The system searches the database for users with tastes similar to Daniela's, i.e. who rated the movies similarly, and uses their ratings to predict her liking of "Thor". If the system can predict that Daniela will like "Thor" strongly, then it should recommend this movie to her.
Table 2.4: Ratings database for collaborative filtering

             Daniela   Thorsten   André   Michael   Paul
Sin City          10          5       8         5      1
Titanic            5          8       5         8     10
Memento            8          3       8         1     10
Avatar             8          5       5        10      3
Thor               ?          7      10         6      3
In our simple example, Thorsten's rating profile is the most similar to Daniela's, whereas Paul's profile is the most dissimilar one (see also Figure 2.1). Thus, Thorsten's rating on "Thor" will be used to predict Daniela's liking of this movie.
Various approaches have been proposed for the computation of the similarity $sim(u, v)$
between users of CF systems (Herlocker et al. 1999; Herlocker, Konstan, and Riedl 2002;
Adomavicius and Tuzhilin 2005). Most of them compute the similarity based on the ratings of
items that the users have rated in common. The two most popular similarity measures are
Pearson’s correlation coefficient and cosine similarity (Adomavicius and Tuzhilin 2005;
Jannach et al. 2011). To introduce them, let $I_{uv} = \{i \in I \mid r_{ui} \neq \emptyset \wedge r_{vi} \neq \emptyset\}$ be the set of items commonly rated by both users $u$ and $v$. Then, Pearson's correlation coefficient is defined as (e.g. Resnick et al. 1994; Shardanand and Maes 1995):

$$sim(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2} \sqrt{\sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}} \qquad (2.4)$$
The cosine-based approach (e.g. Breese, Heckerman, and Kadie 1998; Sarwar et al. 2001) treats users as vectors in $m$-dimensional space, with $m = |I_{uv}|$, i.e. the number of items the users have rated in common. The similarity between the users is then computed as the cosine of the angle between both vectors27:

$$sim(u, v) = \cos(\mathbf{r}_u, \mathbf{r}_v) = \frac{\mathbf{r}_u \cdot \mathbf{r}_v}{\|\mathbf{r}_u\| \, \|\mathbf{r}_v\|} = \frac{\sum_{i \in I_{uv}} r_{ui} \, r_{vi}}{\sqrt{\sum_{i \in I_{uv}} r_{ui}^2} \sqrt{\sum_{i \in I_{uv}} r_{vi}^2}} \qquad (2.5)$$

where $\mathbf{r}_u \cdot \mathbf{r}_v$ denotes the dot product28 between the vectors $\mathbf{r}_u$ and $\mathbf{r}_v$, and $\|\cdot\|$ is the second norm of the vector, i.e. the vector's Euclidean length, defined as the square root of the dot product of the vector with itself.
Other metrics such as Spearman's rank correlation coefficient, normalized Euclidean distance, or the mean squared difference measure have also been proposed to determine the proximity between users (Shardanand and Maes 1995; Herlocker et al. 1999, 2002; Adomavicius and Tuzhilin 2005; Jannach et al. 2011). However, empirical analysis provides evidence that for user-based CF systems Pearson's coefficient outperforms other measures of comparing users (Herlocker et al. 1999). For item-based CF systems, which will be de-
scribed in the next section, it has been reported that cosine similarity consistently outperforms
the Pearson correlation metric (Jannach et al. 2011).
27 Here and in the following we use a bold font face to denote vectors and a regular font face to denote scalars.
28 Recall that the dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ in $m$-dimensional Euclidean space is defined as the sum of the pairwise products of the vectors' coordinates, resulting in a scalar: $\mathbf{a} \cdot \mathbf{b} = \sum_{k=1}^{m} a_k b_k$
Figure 2.1: Comparing three user rating profiles (modified from Jannach et al. 2011, p. 15)
Both metrics, Pearson's correlation coefficient and cosine similarity, vary in the interval between +1 and -1. While +1 corresponds to the case of perfect positive correlation, i.e. the user profiles are identical, -1 corresponds to perfect negative correlation, i.e. the user profiles are the exact opposite of each other. A value of zero shows that the user profiles are absolutely unrelated, i.e. dissimilar. Accordingly, the nearer the similarity measure is to +1, the more similar both users are. This property of similarity measures is usually used for weighting the peer users' ratings within the aggregation process, so that the most similar users are given more weight in the prediction of the active user's ratings. The value of the similarity measure, as will be shown below, is directly adopted in the aggregation function as the weight of a user.
Before the ratings of the active user can be predicted, a set $\hat{U}$ of peer users, whose ratings will be considered in the prediction, needs to be defined, i.e. we have to select the most similar users according to some rule. The set of the most similar peers is also referred to as the "k nearest neighbors". Because these neighbors build the basis for predictions, i.e. recommendations, the collaborative approaches are often called k-Nearest Neighbor or kNN approaches.
The value of $k$ can range anywhere from 1 to the number of all users (Adomavicius and Tuzhilin 2005). The question of determining the exact value of $k$, however, remains open until now. Hence, it is usually set heuristically, either by defining a specific minimum similarity threshold (e.g. Shardanand and Maes 1995; Breese, Heckerman, and Kadie 1998) or by choosing some explicit value of $k$ (Herlocker et al. 1999, 2002; Anand and Mobasher
2005; Jannach et al. 2011). Both techniques are, however, problematic: if $k$ is set too high, too many users with limited similarity bring additional "noise" into the predictions; conversely, low values of $k$ can negatively impact the quality of the predictions. On the other hand, a too high similarity threshold can entail a radical reduction of the neighborhood sizes for the users, so that the ratings for many items cannot be predicted. A too low threshold, in contrast, increases the neighborhood size but also raises the amount of "noise". Jannach et al. suggest that "in most real-world situations, a neighborhood of 20 to 50 neighbors seems reasonable" (Herlocker et al. 2002, cited in Jannach et al. 2011, p. 18)29. A more detailed discussion of the problem of selecting the neighborhood size can be found in Herlocker et al. (2002) as well as in Anand and Mobasher (2005).
Once the neighborhood size or similarity threshold is defined, the ratings for the active
user are predicted by means of an aggregation rule. Different functions have been proposed as
an aggregation rule. Some examples of them are (Adomavicius and Tuzhilin 2005; Herlocker
et al. 2002):
$$r_{ui} = \frac{1}{|\hat{U}|} \sum_{v \in \hat{U}} r_{vi} \qquad (2.6)$$

$$r_{ui} = \bar{r}_u + \frac{1}{|\hat{U}|} \sum_{v \in \hat{U}} (r_{vi} - \bar{r}_v) \qquad (2.7)$$

$$r_{ui} = \kappa \sum_{v \in \hat{U}} sim(u, v) \, r_{vi} \qquad (2.8)$$

$$r_{ui} = \bar{r}_u + \kappa \sum_{v \in \hat{U}} sim(u, v) (r_{vi} - \bar{r}_v) \qquad (2.9)$$

where $\hat{U}$ denotes the set of users that are most similar to the current user $u$ and have rated the item $i$. The multiplier $\kappa = 1 / \sum_{v \in \hat{U}} |sim(u, v)|$ serves as a normalizing factor, and the average rating of user $u$ is defined as $\bar{r}_u = \frac{1}{|I_u|} \sum_{i \in I_u} r_{ui}$, with $I_u = \{i \in I \mid r_{ui} \neq \emptyset\}$.
In the simplest case, the aggregation can be a simple average (Adomavicius and
Tuzhilin 2005), as defined by (2.6). Intuitively, because this function does not account for the
degree of similarity of different peers, its predictions are liable to suffer from "noisy" input
from neighbors with limited similarity. Although the latter issue can be compensated for by setting an appropriate similarity threshold, such a countermeasure, as described above, tends to reduce the coverage of the RS. So if the aim is to accurately predict user ratings, function (2.6) might not be the best choice. However, the simplicity of this function is its biggest advantage: such aggregation requires few resources and can be computed quickly, which may be very useful for RS that must provide ad-hoc and real-time recommendations from considerable catalogues of items. Furthermore, in situations when the system does not know enough about the user to produce a personalized prediction (the so-called "new user problem", which will be discussed below in Section 2.3.3), recommendations according to the average rule might be better than no recommendations at all. In this case, however, the condition under the sum sign in (2.6) must be relaxed to $\hat{U} = \{v \in U \mid r_{vi} \neq \emptyset\}$, i.e. all users who have rated item $i$ should be involved in producing recommendations.
29 The authors quote Herlocker et al. (2002) at this point. However, despite careful reading, we could not find this quotation in the referred publication. Hence, we refer here to Jannach et al. (2011).
Equation (2.7) represents a slight modification of the previously discussed aggregation rule by reformulating it in the "deviation form". That is, the aggregation here happens not over the ratings $r_{vi}$ that the users have given to an item $i$, but over the deviations of these ratings from the average ratings $\bar{r}_v$ of the respective users. The resulting sum is then adjusted by the mean rating $\bar{r}_u$ of the active user. By doing so, the modified rule accounts for the fact that different users may use the rating scale differently. For instance, Michael's rating of "6" may correspond to exactly the same amount of preference as André's "8". Moreover, the mean adjustment corrects for the "gap" between user profiles that expose reasonable correlation but are shifted along the rating scale. An example of such profiles can be seen in Figure 2.1, where Daniela's and Thorsten's ratings expose a strong correlation but are "shifted" vertically, so that Thorsten's ratings lie on average some 5 points below Daniela's. Although Thorsten might not necessarily share Daniela's movie taste, his ratings seem to be reliable predictors of Daniela's. However, in order to predict Daniela's preferences appropriately, Thorsten's ratings should be incremented by approximately 5 points; the mean adjustment performs exactly this correction. Herlocker et al. (2002) found that the mean-adjusted average, as defined in (2.7), significantly outperforms (2.6) with respect to prediction accuracy, specifically in the case of non-personalized recommendations. Similarly to (2.6), the deviation-from-mean average does not account for different degrees of user similarity; thus, the merits and shortcomings of the simple average remain for the most part valid for this rule as well.
The most common aggregation approach is, however, the weighted sum as defined in
equation (2.8) (Adomavicius and Tuzhilin 2005; Jannach et al. 2011). As already noted above,
peer users are assigned weights according to their similarity to the current user. Conventional wisdom tells us that users whose tastes are more similar to the tastes of the active user are more credible recommenders, and thus should contribute more to the recommendation. The weighted aggregation procedure achieves exactly this effect. The normalization factor $\kappa$, as introduced above, ensures that the predicted rating is scaled within the scale's interval and does not exceed or fall below the allowed scale limits.
However, equation (2.8), similarly to equation (2.6), does not take differences in the av-
erage rating between different users into account. The mean-adjusted aggregation rule (2.9)
addresses this shortcoming.
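The four aggregation rules can be sketched side by side. The peer similarities, ratings, and mean ratings below are illustrative:

```python
# Sketch of the aggregation rules (2.6)-(2.9). Each peer contributes a
# similarity to the active user, a rating of the target item, and his or
# her own mean rating; all numbers are illustrative.
peers = [
    # (sim(u, v), r_vi, mean rating of v)
    (0.9, 8, 6.0),
    (0.5, 6, 7.0),
]
active_mean = 7.0  # mean rating of the active user u

def simple_average(peers):
    # Eq. (2.6): plain average of the peer ratings.
    return sum(r for _, r, _ in peers) / len(peers)

def deviation_average(peers, r_bar_u):
    # Eq. (2.7): average of the peers' deviations from their own means,
    # added to the active user's mean rating.
    return r_bar_u + sum(r - m for _, r, m in peers) / len(peers)

def weighted_sum(peers):
    # Eq. (2.8): similarity-weighted ratings, normalized by kappa.
    kappa = 1.0 / sum(abs(s) for s, _, _ in peers)
    return kappa * sum(s * r for s, r, _ in peers)

def adjusted_weighted_sum(peers, r_bar_u):
    # Eq. (2.9): similarity-weighted deviations, mean-adjusted.
    kappa = 1.0 / sum(abs(s) for s, _, _ in peers)
    return r_bar_u + kappa * sum(s * (r - m) for s, r, m in peers)
```

On these toy numbers the four rules produce different predictions for the same item, which illustrates why the choice of aggregation function matters.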
After predicting the ratings for the yet unseen items, the item with the highest rating can be recommended to the active user. Alternatively, a set of $N$ items with the highest ratings can be shown to the user. The latter case is often referred to as "top-N recommendation" (e.g. Sarwar et al. 2000; Seyerlehner, Flexer, and Widmer 2009; Zhang 2009).
2.3.1.2 Item-based Approach
Rather than basing recommendations on the similarity between users, item-based col-
laborative filtering relies on the similarity between items (Sarwar et al. 2001; Rashid et al.
2002; Linden, Smith, and York 2003; Zeigler et al. 2005). The item-based CF algorithm can
be broken down into the following steps:
1. From all items, find a subset $\hat{I}$ of items not rated by the current user that are similar to those the user liked most in the past.
2. For each item from $\hat{I}$, use its similarity to the items the current user has rated to weight those ratings for the prediction of the rating of the active user.
3. Recommend the item from $\hat{I}$ that exhibits the highest predicted rating for the active user.
To gain an intuition of how this algorithm works, examine Table 2.4 again. We can see
that the ratings for “Sin City” and “Thor” are distributed similarly among the users (see also
Figure 2.2). Thus “Sin City” is given a high weight for the prediction of Daniela‟s rating on
“Thor”.
Figure 2.2: Comparing three movie rating profiles
As mentioned previously, empirical analysis shows that for item-based CF approaches
the cosine similarity measure performs best with respect to prediction accuracy (Jannach et al.
2011). Hence, this measure is most often employed in item-based predictions (e.g. Sarwar et
al. 2001; Rashid et al. 2002; Linden, Smith, and York 2003; Zeigler et al. 2005). However,
one fundamental difference between user-based and item-based CF in computing the similarity is that the former approaches compute the similarity along the columns of the rating matrix, whereas the latter compute it along the matrix's rows (see Table 2.4), i.e. each pair of co-rated entries corresponds to different users. Thus, computing the similarity between items with a cosine measure analogous to (2.5) has one important drawback in the item-based case: it does not account for differences in the rating scales of different users. The adjusted cosine similarity measure offsets this drawback by subtracting the average rating of the corresponding user from each co-rated pair (Sarwar et al. 2001). According to this scheme, the similarity between items $i$ and $j$ is given by

$$sim(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)(r_{uj} - \bar{r}_u)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)^2} \sqrt{\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_u)^2}} \qquad (2.10)$$

where $U_{ij} = \{u \in U \mid r_{ui} \neq \emptyset \wedge r_{uj} \neq \emptyset\}$ is the set of users that have rated both items $i$ and $j$.
After the similarities between the items are determined, the prediction of the rating of the current user $u$ on item $i$ is computed as a weighted sum of the current user's ratings of the items that are similar to the questioned item, or formally:

$$r_{ui} = \frac{\sum_{j \in \hat{I}} sim(i, j) \, r_{uj}}{\sum_{j \in \hat{I}} |sim(i, j)|} \qquad (2.11)$$

where $\hat{I}$ denotes the set of items that are most similar to the questioned item $i$. That is, the size of the considered neighborhood, as in the user-based case, is limited to a specific number of most similar items.
Just like in the user-based approaches, after making predictions, the item(s) with the
highest rating(s) constitute the recommendation(s).
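The item-based computation can be sketched as follows. The item-by-user ratings and the user mean ratings are illustrative toy values:

```python
# Sketch of the adjusted cosine similarity (2.10) and the item-based
# prediction (2.11). All rating values and user means are illustrative.
from math import sqrt

ratings = {                          # item -> {user: rating}
    "i": {"u1": 8, "u2": 3},
    "j": {"u1": 9, "u2": 2, "u3": 7},
}
user_means = {"u1": 6.0, "u2": 4.0, "u3": 5.0}

def adjusted_cosine(a, b):
    # Eq. (2.10): subtract each co-rating user's mean before the cosine,
    # to offset the users' different rating scales.
    common = [u for u in ratings[a] if u in ratings[b]]
    da = [ratings[a][u] - user_means[u] for u in common]
    db = [ratings[b][u] - user_means[u] for u in common]
    num = sum(x * y for x, y in zip(da, db))
    den = sqrt(sum(x * x for x in da)) * sqrt(sum(y * y for y in db))
    return num / den

def predict(neighbor_ratings, sims):
    # Eq. (2.11): similarity-weighted average of the active user's own
    # ratings of the neighboring items.
    return (sum(s * r for s, r in zip(sims, neighbor_ratings))
            / sum(abs(s) for s in sims))
```

Here items "i" and "j" receive a high adjusted-cosine similarity because their mean-centered rating patterns point in the same direction across the co-rating users.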
2.3.1.3 Matrix Factorization and Latent Factor Models
Another approach within the class of collaborative filtering techniques is matrix factorization (Sarwar et al. 2000, 2002; Goldberg et al. 2001; Canny 2002; Koren, Bell, and Volinsky 2009; Koren and Bell 2011; Jannach et al. 2011). The general idea of this approach is
to exploit the data received from all users of a RS to derive a set of latent factors descriptive
of hidden associations between users and items and then to apply this knowledge for the pro-
duction of recommendations. In other words, matrix factorization (MF) techniques map both
users and items to a multidimensional joint factor space, where user-item interactions are
modeled as inner products of the vectors that represent user and item rating profiles. The la-
tent space tries to explain ratings by characterizing both items and users on factors automatically inferred from the ratings gathered from the user community (Koren and Bell 2011). For
instance, in the domain of motion pictures, such automatically identified factors may corre-
spond to obvious movie aspects such as genre, less well-defined movie dimensions, such as
depth of character development or quirkiness, but they can also be completely uninterpretable
(Koren, Bell, and Volinsky 2009).
Figure 2.3: A simplified illustration of the latent factor approach (source: Koren, Bell, and Volinsky 2009, p. 44)
Figure 2.3 depicts a simplified example of how latent factor models work, provided in
Koren, Bell, and Volinsky (2009). The figure shows where several well-known movies and
some fictitious users may fall on two hypothetical dimensions, i.e. factors, characterized as
serious versus escapist and male- versus female-oriented. In a sense, the interpretation of the
graph is similar to the interpretation of perceptual maps within multidimensional scaling
(MDS) procedures, well-known in marketing (Myers 1996): The relative positions of the us-
ers and items in the two dimensional space characterize the degree to which the user‟s taste
matches the movie‟s characteristics in terms of the derived factors. The further from the origin
the user or the movie is located in the factor‟s direction, the more pronounced is the factor in
the user‟s taste or in the movie‟s properties. The nearer the user is to a movie, the more s/he is
supposed to like it. Accordingly, we can describe Gus‟ as having a strong preference for male-
oriented escapist movies and “The Color Purple” as a serious female-oriented movie. Hence,
in our example we would expect Gus to love “Dumb and Dumber”, to hate “The Color Pur-
ple” and to rate “Braveheart” about the average. Note, however, that some movies, e.g.
“Ocean‟s 11”, and some users, e.g. Dave, would be characterized as fairy neutral on these two
dimensions (Koren, Bell, and Volinsky 2009), meaning that the two factors fail to describe
both Dave‟s movie taste and the properties of “Ocean‟s 11” substantively enough for generat-
ing predictions.
The concept underlying the derivation of the factors is singular value decomposition (SVD; Golub and Kahan 1965), an established technique for identifying latent semantic factors in information retrieval (Koren, Bell, and Volinsky 2009; Jannach et al. 2011). SVD is based on a theorem of linear algebra which states that any matrix M can be decomposed into a product of three matrices as follows:

M = U Σ V^T    (2.12)

where the columns of U and V are called the left and right singular vectors and the diagonal elements of Σ are called the singular values (Jannach et al. 2011; Golub and Kahan 1965; Press et al. 2007). The main point of this decomposition is that it enables us to approximate the full matrix by retaining only the most important features, namely those with the largest singular values (Jannach et al. 2011; Press et al. 2007).
Informally, the SVD technique can be described as follows: The singular values correspond to the eigenvalues of the eigenvectors that span the range of M (Press et al. 2007). Thus, the eigenvectors with the largest singular values capture the biggest portion of the variance in M. These eigenvectors build up the basis, i.e. the set of "factors", of the target factor space. If M is the user-item matrix of ratings (e.g., our example rating dataset from Table 2.4), then U corresponds to the users and V to the catalog of items (Jannach et al. 2011); and if k factors were determined to have non-zero singular values, then the product of the first k columns from U, the first k columns from V, and the k × k diagonal matrix of singular values according to (2.12) yields the best rank-k approximation of M in terms of the least-squares error (Press et al. 2007). Thus, the first k columns of U and V describe the users' and the items' coordinates along the k dimensions of the factor space, i.e. user tastes and item properties in terms of the k determined factors.
However, conventional SVD is undefined when the knowledge about the matrix M is incomplete (Koren, Bell, and Volinsky 2009; Press et al. 2007), which is always the case in RS: If each element of the user-item matrix were known, there would be no reason to predict user ratings, as they would all be known already. To overcome this problem, some earlier works
suggested employing imputation techniques to fill in missing ratings and make the ratings
matrix dense (e.g. Sarwar et al. 2000; Kim and Yum 2005; Ying, Feinberg, and Wedel 2006).
However, the imputation approaches have been criticized for being very expensive with re-
spect to computational resources. Moreover, the data may be considerably distorted due to
inaccurate imputation (Koren, Bell and Volinsky 2009; Koren and Bell 2011). Consequently,
recent works suggested performing decomposition of the user-item matrix on the basis of ob-
served ratings only, while counteracting overfitting through adequate regularization (Canny 2002; Funk 2006; Paterek 2007; Bell, Koren, and Volinsky 2007; Salakhutdinov, Mnih, and Hinton 2007; Koren 2008; Koren and Bell 2011).
In this case, the rating r_ui of the user u for the item i is modeled as an inner product of the vector of movie qualities q_i and the vector of the user's preferences p_u, each described in terms of f latent factor dimensions (Koren, Bell, and Volinsky 2009):

r̂_ui = q_i^T p_u    (2.13)
That is, the rating is thought to be a projection of the result of the interaction of the user's preferences and the item's properties onto their common space. The problem is that neither of the two vectors nor their dimensionality is known. The only information the system can rely on is the results of user-item interactions, i.e. the ratings that users have given to items in the past. The task of the system is thus to recover the knowledge about the users and the items from past ratings, so that this knowledge can be used to predict future ratings using (2.13).
Roughly speaking, the system has to iterate through all ratings and infer which part of each rating comes from the user's preferences and which is due to the item's properties, i.e. to decompose ratings into user and item vectors. The decomposition should additionally be performed such that expression (2.13) remains valid for the whole set of the known ratings.
To learn the vectors q_i and p_u, the algorithm minimizes the regularized squared error on the set of the observed ratings (Koren, Bell, and Volinsky 2009):

min_{q*,p*} ∑_{(u,i)∈K} ( r_ui − q_i^T p_u )² + λ ( ‖q_i‖² + ‖p_u‖² )    (2.14)

where K denotes the "training" set, i.e. the set of (u,i) pairs for which r_ui is known. The constant λ controls the extent of regularization, which aims to counteract the overfitting of the learned parameter values to the data by penalizing their magnitude. The value of λ is usually determined by cross-validation (Koren, Bell, and Volinsky 2009).
The learning of the parameters, i.e. the minimization of the sum (2.14), is typically per-
formed either by alternating least squares (ALS) or by the stochastic gradient descent method.
As the name of the method suggests, ALS alternates between fixing the q_i's and fixing the p_u's. Each time all the q_i's are fixed, the algorithm recomputes the p_u's by solving a least-squares
problem, and vice versa. Each step decreases the value of the objective (2.14), and the alternation continues until convergence (Bell and Koren 2007).
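The alternation can be sketched on invented toy data as follows (the ratings, factor count k, and λ are illustrative choices; each per-row solve uses the regularized normal equations of a small ridge regression):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed ratings as (user, item, rating) triples; the matrix is sparse.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
n_users, n_items, k, lam = 4, 3, 2, 0.1

P = 0.1 * rng.standard_normal((n_users, k))   # user vectors p_u
Q = 0.1 * rng.standard_normal((n_items, k))   # item vectors q_i

def objective():
    fit = sum((r - P[u] @ Q[i]) ** 2 for u, i, r in ratings)
    reg = lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    return fit + reg

def recompute(target, other, by_user):
    # With one side fixed, each vector of the other side is the solution
    # of a small ridge-regression problem on that row's observed ratings.
    for idx in range(target.shape[0]):
        obs = [(i, r) for u, i, r in ratings if u == idx] if by_user else \
              [(u, r) for u, i, r in ratings if i == idx]
        if obs:
            X = np.array([other[j] for j, _ in obs])
            y = np.array([r for _, r in obs])
            # Solve (X^T X + lam I) w = X^T y, the ridge normal equations.
            target[idx] = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

before = objective()
for _ in range(20):
    recompute(P, Q, by_user=True)    # fix the q_i's, re-solve the p_u's
    recompute(Q, P, by_user=False)   # fix the p_u's, re-solve the q_i's

print("objective fell from", round(before, 2), "to", round(objective(), 3))
```

Because every per-row solve minimizes that row's contribution to the objective, each half-step can only decrease it, which is why the alternation converges.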
Another method, stochastic gradient descent, can be traced back to Simon Funk (2006), who popularized it during the one-million-dollar Netflix Prize contest. The simple technique allowed him to reach the top of the contestants list and thus gained extensive attention from RS research (Paterek 2007; Salakhutdinov, Mnih, and Hinton 2007; Takács et al. 2007; Koren 2008; Koren, Bell, and Volinsky 2009; Koren and Bell 2011). Looping through all the ratings in the training set, the algorithm computes for each given rating r_ui its predicted value q_i^T p_u and the associated prediction error e_ui = r_ui − q_i^T p_u. Then it modifies the parameters by a magnitude proportional to the learning rate γ, i.e. the step size, in the opposite direction of the gradient (Koren, Bell, and Volinsky 2009):

q_i ← q_i + γ ( e_ui · p_u − λ · q_i )    (2.15)
p_u ← p_u + γ ( e_ui · q_i − λ · p_u )

The learning is finished when the sum in equation (2.14) cannot be reduced any further, or when the magnitude of its decrease in a given iteration does not exceed some preassigned threshold, say 0.001.
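A minimal sketch of this loop on invented toy ratings (γ, λ, and the stopping threshold are arbitrary choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented observed ratings as (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
n_users, n_items, k = 4, 3, 2
gamma, lam = 0.02, 0.02        # learning rate and regularization constant

P = 0.1 * rng.standard_normal((n_users, k))   # user vectors p_u
Q = 0.1 * rng.standard_normal((n_items, k))   # item vectors q_i

def sse():
    # Squared error on the training set (the fit part of (2.14)).
    return sum((r - P[u] @ Q[i]) ** 2 for u, i, r in ratings)

start = prev = sse()
for epoch in range(2000):
    for u, i, r in ratings:
        e = r - P[u] @ Q[i]                       # prediction error e_ui
        q_old = Q[i].copy()
        Q[i] += gamma * (e * P[u] - lam * Q[i])   # update rule (2.15)
        P[u] += gamma * (e * q_old - lam * P[u])
    cur = sse()
    if prev - cur < 1e-6:      # stop once the decrease becomes negligible
        break
    prev = cur

print("training SSE fell from", round(start, 2), "to", round(cur, 4))
```

Each pass both vectors are nudged against the gradient of the single rating's error term, so the training error shrinks until the stopping threshold is reached.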
The dimensionality of the factor space can either be set based on some considerations, e.g. system performance, or be determined directly in the process of decomposition. In the latter case, another loop wraps around the algorithm. In each iteration of the outer loop the algorithm learns one factor dimension, i.e. one coordinate of the q_i's and p_u's. As soon as no further iteration of the inner loop can decrease the cost function (2.14), one more factor dimension is added and the learning continues on this dimension. The loop proceeds until the addition of further factors no longer decreases the cost function (Funk 2006). The intuition behind this procedure is that in the first iteration the parameters of the factor with the highest explanatory power are learned, so that the first factor captures as much of the variance in the ratings as possible. The second factor tries to capture the majority of the remaining variance, and so on. Hence, the explanatory power of each successive factor decreases. Here, the direct analogy to the principle of SVD can be seen; therefore, matrix factorization techniques are often collectively called "SVD methods".
A comprehensive overview of the recent advances in matrix factorization for CF can be found in Koren and Bell (2011). The authors tackle topics related to computational issues,
aspects of modeling, and parameter estimation, and show how to utilize temporal models and implicit user feedback to improve the model's accuracy. Additionally, they report on some insights from applying these techniques in the Netflix Prize contest.
2.3.2 Content-based Filtering
Content-based (CB) approaches base their predictions on the similarity between items
and the information about past preferences of the active user. Unlike CF, the calculation of
item similarity is based not on the ratings of other users but solely on the content characteris-
tics of the items. The main advantage of CB approaches over CF is that the former require
neither the existence of a large user community nor a considerable rating history to produce
recommendations. In essence, CB methods do not need any knowledge about users other than the one the recommendations are made for (Jannach et al. 2011). The recommendation
task consists of determining the items that are similar to those the active user has liked in the
past (Balabanovic and Shoham 1997; Mladenic 1999; Herlocker et al. 1999).
Historically, CB approaches have been developed for the recommendation of text-based
items, such as e-mail messages or news (Jannach et al. 2011). Accordingly, CB methods
mainly deal with the recommendation of textual documents. Nevertheless, the general idea of
exploiting the object's content can also be expanded to the domains of non-textual products or items. In this case, however, some modifications to the original CB approach must be made. Hence, the current section is divided into two subsections: The first describes the principles and procedures of the "original" text-based CB approaches, whereas the second addresses the specifics of their application in non-textual domains.
2.3.2.1 The Principles of Content-based Approaches
Having their roots in the field of information retrieval and data mining, content-based
approaches mainly deal with recommendations of textual documents (Jannach et al. 2011).
The standard approach is, therefore, to extract a list of relevant keywords from the content of
a document or from a textual description thereof (Balabanovic and Shoham 1997; Ado-
mavicius and Tuzhilin 2005; Lops, de Gemmis and Semeraro 2011; Jannach et al. 2011).
Consequently, each document is described with a vector of dimensionality equal to the num-
ber of relevant keywords (also often called features) maintained in the system. These vectors
are then used to determine the documents, i.e. items, which are similar to the ones that the
user was interested in in the past. Once such items are determined, they can be recommended
to the user.
To gain an intuitive idea of how this works, examine Figure 2.4 and Table 2.5. The fig-
ure illustrates the principle of how the keywords are extracted from documents and how they
constitute a vector representation thereof. In the given example, the vector's elements correspond to the frequency of appearance of the respective words in the document. Other, more comprehensive techniques for constructing the keyword vector will be discussed below. For the moment, to simplify our example, it is enough to know that the elements of a keyword vector represent the presence of a word in the document.
Figure 2.4: Illustration of the extraction of a features vector from a document

Headline: "Emmerich defends Shakespeare film"
Text: "German film director Roland Emmerich admits courting controversy with his film that questions the authorship of Shakespeare's plays."

Extracted keyword frequency vector: Emmerich: 2, film: 2, Aid: 0, ..., director: 1, E. coli: 0
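The counting step behind such a vector can be sketched as follows (the maintained keyword list is assumed, and the resulting counts need not match the figure exactly, since the figure does not specify which parts of the article are counted):

```python
import re

headline = "Emmerich defends Shakespeare film"
body = ("German film director Roland Emmerich admits courting controversy "
        "with his film that questions the authorship of Shakespeare's plays.")
text = headline + " " + body

# The system maintains a fixed list of relevant keywords (features).
vocabulary = ["Emmerich", "film", "Aid", "director", "E. coli"]

# One frequency entry per maintained keyword; a case-sensitive substring
# match is used so that multi-word keywords such as "E. coli" count too.
vector = [len(re.findall(re.escape(keyword), text)) for keyword in vocabulary]

print(dict(zip(vocabulary, vector)))
```

Keywords absent from the article (here "Aid" and "E. coli") simply receive a zero entry, which is what makes the vectors comparable across documents.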
Consider now Table 2.5, which encompasses five article headlines³⁰ and their binary representations as four-element row-vectors. Each element denotes a specific word, with "y" indicating that the word is present in the article. The last column of the table shows the user Thorsten and his preferences for the first four articles. We see that Thorsten liked articles with the keywords "director", "film" and "aid", but he did not like articles with the keywords "E. coli" and "aid". Hence, the system will assign positive weights for "film" and "director" to Thorsten's user profile. "E. coli" will receive a negative weight, and the weight of "aid" will be neutral, because it appears equally frequently in both liked and disliked articles. Based on these considerations, the system will predict that Thorsten will like the last article, covering Tom Hanks' involvement in a new movie of his own, because the article's content includes the keyword "film". If this article also contained the keyword "E. coli", the system would predict a lower rating for Thorsten's liking of it. The precise magnitude of the rating would depend on the relative weighting of the keywords.
Table 2.5: Principle of content-based filtering

                                      aid | director | E. coli | film | Thorsten
Emmerich defends Shakespeare film         |    y     |         |  y   |    +
EU sets E. coli aid at 150m euros      y  |          |    y    |      |    -
E. coli map: How the outbreak looks       |          |    y    |      |    -
Nadir to receive legal aid             y  |          |         |      |    +
Tom Hanks had a 'personal mission'
  with Larry Crowne                       |          |         |  y   |    ?
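The weighting logic of the example can be sketched as follows (the ±1 encoding of likes and the simple additive profile are simplifying assumptions, not the thesis's own method):

```python
# Articles from Table 2.5 as (keyword set, Thorsten's rating) pairs.
rated = [
    ({"director", "film"}, +1),   # Emmerich defends Shakespeare film
    ({"aid", "E. coli"}, -1),     # EU sets E. coli aid at 150m euros
    ({"E. coli"}, -1),            # E. coli map: How the outbreak looks
    ({"aid"}, +1),                # Nadir to receive legal aid
]
keywords = ["aid", "director", "E. coli", "film"]

# Naive user profile: sum the ratings of the articles a keyword occurs in.
profile = {k: sum(r for kws, r in rated if k in kws) for k in keywords}

def score(article_keywords):
    # Predicted liking = sum of the profile weights of the present keywords.
    return sum(profile[k] for k in article_keywords)

print(profile)            # "aid" comes out neutral, "E. coli" negative
print(score({"film"}))    # the unrated Tom Hanks article scores positive
```

The profile reproduces the verbal reasoning above: "aid" cancels out across one liked and one disliked article, "E. coli" accumulates a negative weight, and the unrated article is predicted as a "like" through its "film" keyword.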
Now let us consider the content-based approach in more detail:
The above-noted binary (as shown in Table 2.5) and frequency-based (as shown in Figure 2.4) encodings of keywords are not the only methods to construct vector representations of documents. The need for more comprehensive techniques emerges because of the following shortcomings of the mentioned methods: The binary representation assumes that all keywords have the same importance for characterizing the content of a document. Conventional wisdom tells us, however, that keywords that occur more often in a document are more descriptive of it. Although the frequency-based encoding compensates for this issue, another seri-
³⁰ The article headlines and annotations in Figure 2.4 and Table 2.5 are taken from http://bbc.com, retrieved on 07.06.2011.
ous drawback remains: Longer documents naturally have higher keyword frequencies and a richer vocabulary, so that both the probability of the keyword vector containing a specific word and the keyword weights rise with the length of a document. Consequently, longer documents have a higher probability of being recommended, because their keyword vectors are more likely to overlap with user profiles and because the relevance weights of the keywords are overestimated (Jannach et al. 2011; Lops, de Gemmis, and Semeraro 2011).
A standard approach to counteract these shortcomings is the term frequency - inverse
document frequency (TF-IDF; Salton, Wong and Yang 1975), an established technique from
the field of information retrieval. The main idea of this approach is that the descriptive power of a keyword for a document depends, on the one hand, on how frequently this word appears within the document itself and, on the other hand, on how often this word occurs within the whole corpus of documents. Accordingly, TF-IDF is composed of two measures:
Term frequency (TF) describes the frequency of the keyword's occurrence in a document, assuming that important words occur more often. To account for document lengths and to prevent longer documents from getting higher relevance weights, the word's frequency is normalized (Jannach et al. 2011), typically by relating it to the maximum frequency of other words in the document³¹ (Adomavicius and Tuzhilin 2005; Lops, de Gemmis, and Semeraro 2011).
Inverse document frequency (IDF), on the contrary, assumes that words that occur seldom in the whole set of documents are more descriptive of a document's contents. In other words, generally frequent words are not considered to be very helpful for discriminating among documents (Jannach et al. 2011). Hence, IDF discounts the weights of words that appear frequently.
The product of TF and IDF yields the TF-IDF measure that accounts for both of the as-
pects described above.
More formally, let f_ij be the frequency of the keyword k_i in the document d_j, and let max_z f_zj denote the maximum frequency among all words z occurring in the document. Further, let N be the number of documents in the corpus and let n_i
³¹ Other normalization schemes, optimized for specific cases, can be found in Chakrabarti (2002), Pazzani and Billsus (2007), and Salton and Buckley (1988).
denote the number of documents from the corpus in which k_i appears. Then, in a given document corpus, the TF-IDF measure and its components are defined as follows:

TF_ij = f_ij / max_z f_zj    (2.16)

IDF_i = log( N / n_i )    (2.17)

TF-IDF_ij = TF_ij · IDF_i    (2.18)
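Formulas (2.16) through (2.18) can be sketched directly (the three-document corpus is invented, and the maximum is taken over all words of the document, one common reading of the normalization):

```python
import math

# Toy corpus: each document is a list of already-extracted keywords.
docs = [
    ["film", "film", "director", "controversy"],
    ["aid", "euros", "aid", "outbreak"],
    ["film", "aid", "mission"],
]
N = len(docs)

def tf(word, doc):
    # (2.16): frequency normalized by the maximum frequency in the document.
    return doc.count(word) / max(doc.count(w) for w in set(doc))

def idf(word):
    # (2.17): discount words that occur in many documents of the corpus.
    n_i = sum(1 for doc in docs if word in doc)
    return math.log(N / n_i)

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)          # (2.18)

# "film" occurs twice in the first document but also elsewhere in the
# corpus; the rarer "director" ends up with the higher weight.
print(round(tf_idf("film", docs[0]), 3), round(tf_idf("director", docs[0]), 3))
```

Note how the corpus-wide IDF discount lets the once-occurring but rare "director" outweigh the locally frequent but common "film", which is exactly the behavior motivated above.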
Once TF-IDF vector representations are computed, the similarity between the documents can be determined by means of a similarity measure of choice. Depending on the problem at hand, various similarity measures are possible (Maimon and Rokach 2005; Baeza-Yates and Ribeiro-Neto 1999; Zanker et al. 2006). In the domain of recommendations of textual documents, however, the most common approach is to use the cosine similarity as defined in (2.5) (Adomavicius and Tuzhilin 2005; Jannach et al. 2011; Lops, de Gemmis, and Semeraro 2011).
In essence, the further procedure of recommendation generation in CB approaches is analogous to the item-based technique, with the difference that here only the document ratings of the active user are employed. That is, the most similar items for which a rating of the active user exists "vote" for yet unrated items (Allan et al. 1998; Jannach et al. 2011). Also analogously to CF approaches, in the CB case the number of "voters" can be set explicitly or determined through setting a minimum similarity threshold (Billsus, Pazzani, and Chen 2000; Billsus and Pazzani 1999). The "votes" are then aggregated to predicted ratings, typically by employing a weighting rule that is based on the degree of similarity between the items, i.e. analogously to the aggregation rule (2.8). Once the predictions are made, the item(s) with the highest rating(s) or with the highest similarity to the previously most liked items can be recommended.
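A sketch of this prediction step (the three-dimensional TF-IDF vectors and the user's ratings are invented; cosine similarity plays the role of the measure defined in (2.5)):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# TF-IDF vectors of documents the active user has already rated.
rated = {                     # doc id -> (vector, the user's rating)
    "d1": ([0.9, 0.1, 0.0], 5.0),
    "d2": ([0.8, 0.3, 0.1], 4.0),
    "d3": ([0.0, 0.2, 0.9], 1.0),
}
target = [0.7, 0.2, 0.1]      # vector of the yet unrated document

# The most similar rated documents "vote" for the target, weighted by
# their similarity, analogously to the item-based aggregation rule.
sims = {d: cosine(vec, target) for d, (vec, _) in rated.items()}
voters = sorted(sims, key=sims.get, reverse=True)[:2]
prediction = (sum(sims[d] * rated[d][1] for d in voters)
              / sum(sims[d] for d in voters))

print(round(prediction, 2))   # lies between the ratings of d1 and d2
```

Here the number of voters is set explicitly to two; replacing the top-k cut with a minimum similarity threshold yields the alternative voter-selection scheme mentioned above.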
2.3.2.2 Exploiting Content Characteristics in Non-textual Item Domains
As noted in the introductory part of Section 2.3, the idea to exploit the content charac-
teristics of items for producing recommendations can also be transferred into the domains of
non-textual objects, such as music or movies. The challenge of this transfer, however, lies in the extraction of qualitative characteristics for the representation of user and item profiles.
This is mainly because of the very limited ability of modern content processing algorithms to automatically extract meaningful features that are descriptive of multimedia content (Wei, Shaw, and Easley 2002; Pazzani and Billsus 1997; Lops, de Gemmis, and Semeraro 2011). Hence, recommender algorithms have to rely on rather "technical" characteristics of the content (such as genre, cast, length, etc.), which are either available from the providers or the manufacturers (Jannach et al. 2011) or can be extracted from external information sources, e.g. catalogs or movie critics' web sites (e.g. Alspector, Kolcz, and Karunanithi 1998). Nevertheless,
these technical content characteristics do not always overlap with the qualitative features that determine the consumer's judgment of items: For example, in domains of quality and taste, the reasons that a consumer likes an item are often based on subjective impressions, e.g. of an item's exterior design, rather than being related to certain product characteristics (Jannach et al. 2011). A manual specification of the items' features by domain experts seems to be the only option to address this limitation (Adomavicius and Tuzhilin 2005; Lops, de Gemmis, and Semeraro 2011; Jannach et al. 2011).
The most prominent and exceptional example of the application of the CB approach to manually coded items is the popular internet radio and music recommendation service Pandora.com. Pandora's services rely on the data of the "Music Genome Project"³², which are manually entered by highly trained analysts³³. A song's description often encompasses up to several hundred features³⁴ – "music genes" – such as instrumentation, influences, measures, key tonality, song structure, vocal harmonies, aesthetics, phrasing, the lyrics' mood and emotions, etc.
³² http://www.pandora.com/mgp.shtml
³³ http://blog.pandora.com/faq/contents/506.html
³⁴ http://blog.pandora.com/faq/contents/19.html
At this point it seems reasonable to briefly interrupt our narration and notice that Pandora's approach mirrors the goals of our thesis (see esp. Sections 1.2 and 2.1.4) and practically fulfills them: The song attributes defined by the experts are chosen to potentially influence the preferences of the users and to be understandable to them. Further, the recommendation algorithm incorporates the preference-relevant attributes directly into the process of recommendation generation. Moreover, the employed CB method tries to match recommendations with the user's preferences, i.e. to align recommended songs with the user's attribute preference weights. Hence, Pandora's recommendation engine is both concordant with the way users evaluate choice alternatives and capable of providing actionable and effective explanations behind recommendations. Nevertheless, due to the reasons explained below, we seek an alternative approach for achieving our goals.
However, in most applications, the effort to manually encode item characteristics is
considered to be impractical due to the limitation of resources (Adomavicius and Tuzhilin
2005; Jannach et al. 2011). As stated by the founder of Pandora, Tim Westergren, the "unlocking" of a track's music genes, i.e. its manual annotation, takes a trained musician from about fifteen minutes for a pop song to about one and a half hours for more sophisticated compositions (Tim Westergren cited in Tran-Le 2010).
The latter issue means that in most cases RS employ only those item characteristics (i.e. attributes) that are available in electronic form (Jannach et al. 2011). Even in domains such as motion pictures, where considerable amounts of "technical" attributes are available, typically only a subset of the available attributes is exploited (e.g. Ansari, Essegaier, and Kohli 2000; Kim and Kim 2001; Burke 2002; Melville, Mooney, and Nagarajan 2002; Ying, Feinberg, and Wedel 2006; Gunawardana and Meek 2009; Park and Chu 2009). This is because of the problem of assigning importance weights to the attributes within the vector representations of items, discussed in the preceding subsection:
In the simplest case, the attributes of movies, i.e. genres, actors, directors, etc., would be coded binary, indicating whether the attribute, e.g. a specific actor, is present in the movie. However, in this case all attributes would be equally important for describing movies. Again, conventional wisdom tells us that some attributes may discriminate more strongly than others. For
instance, the acting of a specific star in a movie may signal more than its categorization into a specific genre or its belonging to a specific production company. Unlike in the case of text-based items, in the movie domain the characterization of attribute importance weights by means of their frequencies is not possible, because each movie can be described only once on each attribute. In other words, we cannot state that there is more Clint Eastwood in "For a Few Dollars More" than in "The Good, the Bad and the Ugly", and we cannot assert that one of the two movies is more of a western than the other; at least not without assigning the attribute weights manually. Consequently, the frequency-based TF-IDF measure is also not available for allocating importance weights to the attributes.
Due to the lack of an instrument to assign attribute importance weights according to their ability to differentiate among movies, the typical approach is to maintain the binary movie vector representations as described above. The issue of the different roles that the attributes play in the formation of the user's movie preferences is addressed solely through the user's profile. The latter is thereby represented as a vector whose number of dimensions equals the number of attributes in the movie vector plus one. Each vector dimension represents both the importance weight of the corresponding attribute for the user's discrimination among different movies and the amount of movie preference that the user associates with this attribute; the last dimension represents the user's rating baseline. The values of the vector's entries are estimated by regressing the user's past ratings on the set of available movie attributes (e.g. Ansari, Essegaier, and Kohli 2000; Kim and Kim 2001; Ying, Feinberg, and Wedel 2006). The regression model is typically formulated as follows:
r_uj = β_u0 + ∑_{a=1}^{A} β_ua · x_ja + ε_uj    (2.19)

where r_uj denotes the rating of user u for movie j, the x_ja are binary dummy variables indicating the presence of the a-th attribute among the movie's characteristics, and the β_ua are the respective regression coefficients, with β_u0 being the constant term and ε_uj denoting the estimation error of the regression model.
Notice that the regression coefficients β_ua correspond to the movie attributes and capture the part of the rating that is due to the presence of the attribute in a movie's characteristics vector, i.e. the attribute part-worths. The values of the betas can be positive, indicating an increase in preference when the attribute is present, and they can be negative, indicating a dislike to-
wards the attribute. The baseline estimate β_u0 shows the amount of preference for movies in general, i.e. when no information about a movie's characteristics is available, and it equals the user's mean rating. The latter is due to the specifics of dummy regressions (see Gujarati 2004 for details).
Once the betas are estimated, the ratings of yet unseen movies can be predicted by means of the regression equation (2.19) or, when reformulated in vector form, as the inner product of the user's profile vector β_u and the vector of the movie's attributes x_j:

r̂_uj = β_u^T x_j    (2.20)

Note, however, that in order for expression (2.20) to hold formally, the movie's vector x_j must be complemented with a unity entry at the position that corresponds to the position of the β_u0 entry in the user's vector, so that the baseline estimate is contained in the final sum.
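A sketch of this regression-and-predict step for a single user (the attribute matrix and ratings are invented; `numpy.linalg.lstsq` stands in for the regression estimator):

```python
import numpy as np

# Binary dummy attributes of five rated movies (e.g. genre/actor flags);
# the data is made up for illustration.
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)
ratings = np.array([5.0, 4.0, 1.0, 2.0, 4.0])   # the user's past ratings

# Append a constant column so the baseline term is estimated too (2.19).
X1 = np.hstack([X, np.ones((len(X), 1))])
betas, *_ = np.linalg.lstsq(X1, ratings, rcond=None)

# (2.20): the predicted rating of an unseen movie is the inner product of
# the user's profile vector and the movie's extended attribute vector.
new_movie = np.array([0.0, 1.0, 1.0, 0.0, 1.0])   # attributes plus unity entry
print("predicted rating:", round(float(new_movie @ betas), 2))
```

The fitted betas reproduce the observed ratings on the training movies, and the unity entry appended to each movie vector carries the baseline term into the inner product, exactly as required for (2.20) to hold.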
Analogously to all the previously discussed methods, after the ratings are predicted, the
item(s) with the highest rating(s) can be presented to the user in order to accomplish the rec-
ommendation task.
Let us now return to the assertion we made above that only a fraction of available at-
tributes is typically used within CB approaches for recommendations of non-textual items, and explain the reasons thereof. Although in many non-textual domains – and specifically in the domain of motion pictures – considerable numbers of attributes are often available in electronic form or can easily be extracted from additional information sources, and despite the natural pursuit to include as many of these attributes in the recommendation process as possible in order to increase the "overlap" of the technical item characteristics with the qualitative ones, this cannot be done due to the restrictions of regression analysis.
The issue is that regression analysis in general requires at least one observation per estimated parameter; otherwise the problem (2.19) cannot be solved due to insufficient data (Gujarati 2004). Additionally, the observations are required to be mutually linearly independent in terms of the parameters to avoid multicollinearity, which again would render (2.19) unsolvable unless the parameters causing multicollinearity are omitted from the model or other countermeasures are undertaken (Gujarati 2004). Besides other requirements of regression analysis that can also be violated, these two considerably limit the number of possible parameters, i.e. attributes, that can be introduced into the regression model. In the best case,
when no multicollinearity is present, the upper limit for the number of attributes that can be
considered per user equals the number of the ratings of that user available to the system.
Taking into account that the majority of users in movie RS datasets have typically rated about twenty movies each, the inclusion of a higher number of attributes in the regression model would harm the RS insofar as it would be able to produce recommendations only for a narrow group of its users. Hence, the number of attributes considered within content-based movie recommenders has so far varied between ten (Kim and Kim 2001) and twelve (Ansari, Essegaier, and Kohli 2000; Ying, Feinberg, and Wedel 2006).
The estimation of more than 300 attribute part-worths per user, as suggested in the current thesis (see Section 2.2.2 and Appendix B: List of Preference Relevant Attributes), would be infeasible within the CB approach described above. Discarding a substantial part of potentially relevant attribute knowledge, however, entails that a considerable portion of the preference-relevant variance in the known ratings would not be captured by the model. This, in turn, would lead to larger errors in the predictions and thus to lower prediction accuracy. The latter fact also explains why the majority of works that incorporate movie attributes into recommendation algorithms (e.g. Baudisch 1999; Burke 2002; Melville, Mooney, and Nagarajan 2002; Park and Chu 2009; Gunawardana and Meek 2009) exploit only a small fraction of the available attributes. Moreover, the knowledge about the attributes is not used directly for rating predictions in the content-based manner but is rather utilized as additional information to improve the CF predictions within hybrid models.
A brief overview of hybrid approaches and the motivation thereof will be given in the
subsequent sections. At this point, to conclude the current section, we consider it reasonable to draw the reader's attention to the following two aspects, which were omitted from the main discussion of this section because they would unnecessarily interrupt its natural flow and possibly interfere with the reader's understanding of the discussed topic:
Note that, contrary to the case of textual items, in the case of non-textual items the CB recommendation procedure omits the step of computing similarity between the items. This is due to two specific properties of the non-textual domain: Firstly, the content of the items is described in terms of binary vectors, which was shown to be the only possible way to automatically describe items, because each attribute can be assigned to an item only once. In contrast, in a textual domain each attribute (i.e. keyword) can additionally be characterized by
DRAFT - final revision to appear in 2012
Chapter 2: Background and Related Work 59
the number of its occurrences in the document, which “grants” descriptive power to the quan-
tity of the feature that is basically absent in the non-textual case. Secondly, because of the
simplified representation of the item, the preferences may be fully attributed to the user pro-
file. Thus, recommendation can be made through the direct matching of item profiles with the
user's profile without the need to search for the items that are similar to the most liked ones. The information that in the case of textual items was contained in the concept of similarity can be thought of as absorbed by the user profile in the non-textual case. Note also that the predictions of item ratings are made simply by computing the inner product of the user and item vectors, rather than through the "voting" of similar items for the item in question.
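As a minimal illustration of this prediction rule (the attribute vocabulary and all numeric values below are hypothetical, not taken from this thesis), the content-based rating prediction reduces to a single inner product:

```python
import numpy as np

# Hypothetical attribute vocabulary for binary item descriptions.
attributes = ["action", "comedy", "drama", "sci-fi"]

# Binary item profile: which attributes apply to the movie.
item_profile = np.array([1, 0, 0, 1])        # an action/sci-fi movie

# User profile: estimated part-worth (preference weight) per attribute.
user_profile = np.array([0.8, 0.1, -0.5, 0.9])

# Content-based rating prediction: inner product of the two vectors.
predicted_rating = float(user_profile @ item_profile)
print(predicted_rating)  # sum of the part-worths of the present attributes
```

Only the attributes that are actually present in the item (the nonzero entries) contribute to the prediction, which is exactly why a binary item description lets the user profile absorb all preference information.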
Notice further the existence of conceptual similarity between the regression model
(2.19) and the multiattribute utility or the WADD rule (2.2) as well as between the vector
form of rating prediction (2.20) and the model of the matrix factorization approach (2.13). If we think of the attribute importance weights in (2.2) in terms of the item characteristics presented in the current section, we notice that the only remaining difference between the WADD rule (2.2) and the regression model (2.19) consists in the absence of the baseline constant term in the former expression. In all other respects, both expressions are essentially the same. Further, recall that expressions (2.19) and (2.20) are also in essence the same rating composition rule, only written in two different forms, i.e. algebraic form and vector form. Taking into account that both the MF decomposition rule (2.13) and the vector form of rating composition (2.20) represent nothing more than an inner product of the vector of item properties and the vector of user preference weights, the conceptual similarity between both expressions becomes apparent. The only difference between the two concepts is that the involved vectors consist of attributes that are defined differently. That is, all the mentioned concepts, i.e. multiattribute utility, the content-based approach, and matrix factorization, are in essence different viewpoints on the same idea: the representation of an object in terms of its attributes and the representation of the users in terms of their attribute-related preferences. The distinction between the three variants is constituted solely by the details of implementation and the disciplines the concepts originate from, i.e. marketing, information retrieval, and recommender systems.
2.3.3 Trade-offs and Problems of Collaborative and Content-based Approaches
All of the recommendation techniques introduced in the preceding sections have their merits and limitations, which entail trade-offs when it comes to the question of which approach a particular RS should employ. Some of them, i.e. those related to the issue of the provision of effective explanations, were discussed in Section 2.1. In the current section, we provide a brief overview of the strengths and weaknesses of the CB and CF approaches that influence the functionality of RS in a technical sense, i.e. that impact the ability of RS to provide recommendations. Table 2.6 summarizes the discussion of strengths and weaknesses.
Table 2.6: Summary of strengths and weaknesses of different recommendation approaches
“+” denotes a tendency to exhibit the problem, “–” indicates nonsusceptibility to it,
“±” symbolizes the presence of the problem in a weakened form

Type of problem            User-based   Item-based   Matrix factorization   Content-based
Sparsity                       +            +                 ±                   –
New User                       +            –                 ±                   +
New Item                       –            +                 ±                   –
Overspecialization             –            –                 –                   +
Gray sheep                     +            –                 ±                   –
Starvation                     –            +                 ±                   –
Shilling Attacks               +            +                 ±                   –
Stability vs. Plasticity       +            +                 +                   +
2.3.3.1 Data Sparsity
Perhaps the most common problem of RS, and one that causes almost all other problematic issues, is the sparsity of the underlying database. That is, RS have to produce their recommendations on the basis of a user-item rating matrix, which is typically very far from being dense (Burke 2002). Consider the example of Amazon, which offers millions of items to millions of users. In such a situation, it is unrealistic to expect even some users to have rated a considerable share of the items in Amazon's catalogue. Quite the contrary, it is more realistic to assume that the majority of Amazon's customers have rated only a vanishingly small subset of the offered items. Such scarce data sets are typical for most RS. For instance, in the Netflix Prize dataset, more than 99% of the possible ratings are missing (Koren and Bell 2011). The same problem applies to the publicly available EachMovie and MovieLens data sets (O'Sullivan, Smyth, and Wilson 2004, p. 230) as well as to the data basis of MoviePilot's recommender system.
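The degree of sparsity can be quantified as the fraction of missing entries in the user-item matrix. The following sketch illustrates this on synthetic data with a density of roughly 1%, which is in the order of magnitude reported for the Netflix Prize data; all numbers are simulated, not real ratings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 1000, 500

# Simulate a rating matrix where roughly 1% of entries are observed
# (0 marks a missing rating, as is typical in RS data sets).
mask = rng.random((n_users, n_items)) < 0.01
ratings = np.where(mask, rng.integers(1, 6, (n_users, n_items)), 0)

observed = np.count_nonzero(ratings)
sparsity = 1.0 - observed / ratings.size
print(f"{sparsity:.1%} of the possible ratings are missing")
```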
Although sparsity is problematic for all kinds of recommendation approaches, it is more
an issue for collaborative techniques, especially for item-based and user-based ones. This is
because they base their predictions on neighborhoods of like-minded users or similar items. To form the latter, however, some level of overlap between the user or item profiles is required (Burke 2000). That is, if two users with identical tastes have rated disjoint segments of items, a user-based CF system will fail to detect their similarity because the two user profiles do not share a sufficient number of items. Thus, the system will not be able to recommend the items liked by one of the users to the other one, although their tastes are identical. Analogously, in the item-based approach, if two item profiles do not overlap sufficiently, they cannot be considered similar, even if both entries are duplicates of the same item. Thus, the information contained in one of the item profiles cannot be used to predict the user ratings for the other item.
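The overlap requirement can be made concrete with a small sketch (the movie titles and ratings are invented): two users with effectively identical tastes but disjoint sets of rated items leave the Pearson similarity undefined, so a user-based system cannot connect them:

```python
import numpy as np

# Ratings as {item: rating}. Both users love sci-fi and dislike romance,
# but happened to rate entirely disjoint sets of movies.
user_a = {"Alien": 5, "Blade Runner": 5, "The Notebook": 1}
user_b = {"Star Wars": 5, "Dune": 5, "Titanic": 1}

co_rated = set(user_a) & set(user_b)
print(co_rated)  # empty: no co-rated items at all

def pearson(u, v):
    """Pearson correlation over co-rated items; None if overlap is too small."""
    items = sorted(set(u) & set(v))
    if len(items) < 2:
        return None  # similarity undefined without sufficient overlap
    x = np.array([u[i] for i in items], dtype=float)
    y = np.array([v[i] for i in items], dtype=float)
    x, y = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

print(pearson(user_a, user_b))  # identical tastes go undetected
```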
For MF approaches, the sparsity problem has not been investigated sufficiently in the literature. However, indications are that the sparsity problem is mitigated in MF approaches because they reduce the dimensionality of the space on which recommendations are made by extracting latent factors from the original data (Burke 2000). Nevertheless, intuitively, in order for
factors to capture the variance in the user and item profiles, the latter should exhibit at least some overlap. However, contrary to the item-based and user-based approaches, in the MF case both user and item profiles are involved in the factor extraction simultaneously. Thus, it is enough that the overlap occurs along either of the two profile dimensions, which is more probable than in the situation where each of the profile dimensions is considered separately. Still, sparsity remains a significant problem in domains where many items are available, unless the user base is very large (Burke 2000).
As described earlier, CB approaches do not utilize the ratings of other users for their predictions but rather base them on the content characteristics of items. Moreover, the content descriptions constitute the data basis of CB approaches and are thus available for each item in the catalogue. That is, the item space of a CB recommender is dense, and the density of the user space is irrelevant. Hence, CB approaches are less likely to suffer from sparsity. Nevertheless, the density of the active user's profile still remains an important issue for CB techniques. This, however, manifests itself as a subclass of the next type of problems, the "ramp-up" problem (Konstan et al. 1998).
2.3.3.2 “Ramp-up”: New User and New Item Problems
The “ramp-up” problem (also often called the “cold-start” problem) refers to situations in which RS do not have enough information to make rating predictions (Konstan et al. 1998). Such a situation may arise when (i) a new user or (ii) a new item is introduced to the system. Accordingly, these types of situations are also often referred to as the “new user” and “new item” problems (Konstan et al. 1998; Burke 2002; Adomavicius and Tuzhilin 2005).
The new user problem is mainly an issue in the user-based and content-based approaches. Here, the system must acquire enough knowledge about the user, i.e. user ratings, to be able to find like-minded users (in user-based CF systems) or to detect items that match the user's profile (in CB systems). In these types of systems, new users have to supply some information
about their tastes and preferences, i.e. ratings, in order to establish the basis for future recom-
mendations.
New items are added frequently to the catalogs maintained by RS. Because in CB approaches the items are described in terms of their content, they are ready to be recommended immediately after their introduction to the system. In CF approaches, by contrast, new items need to receive some ratings before they can be recommended. The new item problem is also called the “early rater” problem, since users who rate new items first receive little benefit from doing so, i.e. the early ratings do not increase the user's ability to match against other users (Avery and Zeckhauser 1997). Hence, CF systems have to provide other incentives in order to encourage users to provide ratings (Burke 2002).
MF approaches, as a subclass of CF approaches, suffer from both the new user and the new item problem in equal measure, since from the viewpoint of matrix decomposition it matters little whether a new rating comes in as a new entry in a row or in a column of the user-item matrix. However, MF approaches rely less on the similarity between users or items and rather factorize the matrix entries, in a sense, “independently” of their row or column affiliations. Hence, ceteris paribus, they are likely to need fewer ratings from a new user or for a new item than their user-based and item-based relatives in order to be able to recommend.
As can be seen from the foregoing discussion, all recommendation approaches suffer from the ramp-up problem in one form or another, which makes it necessary for RS to continuously acquire additional data, i.e. ratings, from users in order to improve their ability to recommend as well as the quality of their recommendations.
2.3.3.3 Overspecialization
CB approaches often suffer from uniform recommendations (Zhang, Callan, and Minka 2002; Jannach et al. 2011), a phenomenon that is also often called the “portfolio effect” (Billsus and Pazzani 2000; Linden, Smith, and York 2003; Burke 2002). By recommending items that score highly against a user's profile, CB systems confine users to being recommended items that are similar to those they have already seen (Adomavicius and Tuzhilin 2005). This im-
plies that the recommendations tend to linger within a particular topic of interest, which, in radical cases, causes the user to receive recommendations of different versions of the same item, e.g. of a book or a news article, even if s/he already owns it (Linden, Smith, and York 2003). That is, a user must exhibit an interest in at least one item of a certain topic in order for this topic to become relevant in the user's profile.
CF approaches, in contrast, allow for more diverse recommendations. Because they do not rely on item properties but rather utilize the ratings that users assign to a wide range of items, CF methods tend to be more capable of identifying cross-genre relationships between items (Adomavicius and Tuzhilin 2005; Jannach et al. 2011). Hence, CF techniques are more helpful in discovering items that the users might not have considered otherwise (Burke 2002).
2.3.3.4 “Gray Sheep”, “Starvation” and Shilling Attacks
Whereas user-based CF methods are not affected by portfolio effects and can identify cross-genre niches, they suffer from the so-called “gray sheep” problem: users with “unusual” tastes are difficult to categorize into a neighborhood of like-minded, i.e. similar, users because their rating profiles do not correlate well with the ratings of other users (Rashid et al. 2002; Claypool et al. 1999). Consequently, the generation of recommendations for such users is problematic.
Similarly, items can be “starved” to the benefit of other items. That is, popular items become easier to find as more users rate them: the number of ratings given to a particular item increases the likelihood of it participating in the process of matching user profiles. Because popular items are typically given higher ratings, the probability of them being recommended increases too. In the item-based approaches, in turn, popular items are more likely to exhibit a high similarity in terms of rating profiles and thus also become recommended more often than unpopular ones. For ambiguous items, i.e. items that provoke polarizing attitudes, it may also be problematic to find a neighborhood of similar items that can serve as a “source” for rating predictions. Thus, unpopular and ambiguous items become more difficult to discover (Rashid et al. 2002; McNee et al. 2003).
The latter also makes CF systems susceptible to malicious attacks (also often called shilling attacks), i.e. the injection of ratings that aim to deflate or inflate the popularity of an item (Lam and Riedl 2004; Sandvig, Mobasher, and Burke 2007; Resnick and Sami 2007; Mobasher et al. 2007; Mehta, Hofmann, and Nejdl 2007).
Although MF approaches do not account for the relationships between user or item profiles explicitly, the amount and the character of the ratings in the rows and columns of the matrix to be decomposed influence the “direction” and the information content of the extracted factors. Although this issue has not been studied in prior research, we can logically presume that a higher number of ratings for a popular item causes a factor to twist towards such an item so that it becomes easier to recommend (starvation problem), whereas an unusual pattern in a user vector causes it to exhibit lower factor loadings, which means that the rating predictions for such users become less reliable (gray sheep problem). However, because the reduced dimensionality of the factor solution does not correspond directly to the user and item dimensions, the “distortion” of a factor may be compensated by other factors, which is likely to reduce the extent of both problems in the case of MF approaches.
CB approaches are immune to the problems considered above, since in CB cases neither
the rating profiles of items nor the ratings of other users are relevant.
2.3.3.5 Stability vs. Plasticity
As noted earlier, the ability of CF and CB approaches to recommend improves over time through continuously gaining additional user input, which mitigates the ramp-up problem. The converse of this problem is the “stability vs. plasticity” problem (Burke 2002). According to the latter, RS may become rigid, i.e. insensitive to changes in users' preferences. In some sense, the problem consists in the established knowledge about the user's prior preferences “dominating” the new user input. Suppose that a devoted sci-fi fan all of a sudden begins to rate dramas highly. In this case, the system might not recognize the change in the user's preferences, especially if the new input conflicts with old negative ratings of dramas. Instead, the system is more likely to treat the new positive drama rating as an outlier
and to continue recommending sci-fi movies. As in the ramp-up case, the user would need to provide the system with a substantial number of positive drama ratings to stabilize the system's knowledge about the changed preferences.
To counteract this development, some approaches suggest discounting past user preferences so that older ratings have less influence; but they do so at the risk of losing information about the user's interests that are long-term but only exercised occasionally (Billsus and Pazzani 2000; Schwab, Kobsa, and Koychev 2001; Burke 2002; Tsymbal 2004). Thus, if our sci-fi fan also likes westerns but watches them sporadically, a temporal discount function might gradually “forget” the user's preference for westerns in the course of time and stop recommending them to the user.
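One common form of such discounting is an exponential decay of rating weights with age; the following sketch uses a half-life parameter that is an arbitrary assumption, not a value prescribed by the cited literature:

```python
def decay_weight(age_days, half_life_days=180.0):
    """Weight of a rating that is `age_days` old (exponential forgetting):
    after each half-life, a rating counts half as much."""
    return 0.5 ** (age_days / half_life_days)

# A long-term but rarely exercised interest: a western rated two years ago.
w_western = decay_weight(730)
# A recent drama rating.
w_drama = decay_weight(7)
print(f"western: {w_western:.3f}, drama: {w_drama:.3f}")
```

The trade-off described above is visible directly: the old western rating is discounted to a small fraction of its original weight, so the system effectively "forgets" this stable but rarely exercised interest.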
2.3.4 Hybrid Recommender Systems
2.3.4.1 Principles of Hybrid Methods
In order to circumvent the trade-offs and problems of the individual CF and CB methods, hybrid systems combine both types of recommendation methods to produce their recommendations. Most of the hybrids combine CB techniques with item-based CF (e.g. Balabanovic and Shoham 1997; Basu, Hirsh, and Cohen 1998; Claypool et al. 1999; Pazzani 1999; Soboroff and Nicholas 1999; Tran and Cohen 2000; Melville, Mooney, and Nagarajan 2002; O'Sullivan et al. 2004; Symeonidis, Nanopoulos, and Manolopoulos 2007; Koren 2008). The goal of this combination is to utilize the invulnerability of CB techniques to data sparsity as well as to the new item and starvation problems while avoiding their proneness to overspecialization through the use of CF. Another benefit of these approaches is that the user can be recommended an item not only if it is rated highly by similar users but also if it scores highly directly against the user's profile (Adomavicius and Tuzhilin 2005).
Hybrid approaches differ with respect to how the different methods are combined for rating predictions and how deeply they are integrated with each other:
One way to build a hybrid recommender is to implement each method, i.e. CF and CB, separately and then to combine the individual predictions (Adomavicius and Tuzhilin 2005). The final rating predictions can either be formed as a linear combination of the individual predictions (e.g. Claypool et al. 1999), possibly employing some kind of weighting scheme for the individual methods (e.g. Pazzani 1999), or the rating of a single method can be chosen as the final prediction based on the confidence intervals of the methods employed (e.g. Billsus and Pazzani 2000) or on their consistency with past ratings of a user (e.g. Tran and Cohen 2000).
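The two combination variants just described can be sketched as follows; the weights and confidence values are placeholders, not values from the cited works, and a real system would tune them:

```python
def combine_linear(r_cf, r_cb, w_cf=0.6, w_cb=0.4):
    """Final prediction as a weighted linear combination of the
    CF and CB predictions (weights are illustrative)."""
    return w_cf * r_cf + w_cb * r_cb

def combine_by_confidence(pred_cf, pred_cb):
    """Choose the prediction of the more confident method.
    Each argument is a (rating, confidence) pair."""
    return max(pred_cf, pred_cb, key=lambda p: p[1])[0]

print(combine_linear(4.0, 3.0))                       # weighted average of 4.0 and 3.0
print(combine_by_confidence((4.0, 0.9), (3.0, 0.5)))  # the more confident CF rating
```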
In addition, some works (e.g. Good, Schafer, and Konstan 1999; Melville, Mooney, and Nagarajan 2002) propose augmenting the user-item rating matrix with artificial user rating vectors in order to increase the overlap between user profiles. These augmented vectors are produced by content-analysis agents, the so-called “filterbots”. As a result, users whose rating profiles agree with those of the filterbots may receive better recommendations.
Another approach consists in a more diffuse, deeper integration of the methods. For instance, Balabanovic and Shoham (1997) and Pazzani (1999) suggest a technique of “collaboration via content”. This technique maintains content-based profiles for each user and applies the CF method to this data, rather than to the user-item ratings, to identify the similarity between users. The rating predictions, however, are made by means of the CF aggregation rule applied to the ratings of the users who were identified as similar. Soboroff and Nicholas (1999) propose a method that uses latent semantic indexing to reduce the dimensionality of CB user profiles that are initially represented by term vectors. Then, the collaborative technique is applied to the “reduced” user vectors. Koren's approach (Koren 2008) enriches the MF model with information about item neighborhoods and factorizes the user-item rating matrix based on this extended model.
All hybrid approaches discussed in the literature manage to show better prediction accuracy, better performance, or both, compared to the individual techniques. The approach that won the One Million Dollar Netflix Prize “blends” the results of more than 100 different recommendation algorithms (Bell, Koren, and Volinsky 2007b, 2008). The contribution, i.e. the weights, of the individual algorithms to the final rating is determined by means of linear regression, where the vector of holdout ratings serves as the dependent variable and the vectors of ratings predicted for the same holdout set by the different methods serve as the independent variables.
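This blending step can be sketched on synthetic data as follows; the three "algorithms" below are stand-ins with different noise levels, and the least-squares fit plays the role of the linear regression on the holdout set described above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Holdout ratings and the predictions of three hypothetical algorithms.
y = rng.uniform(1, 5, 200)                    # true holdout ratings
preds = np.column_stack([
    y + rng.normal(0, 0.5, 200),              # algorithm 1: fairly accurate
    y + rng.normal(0, 1.0, 200),              # algorithm 2: noisier
    rng.uniform(1, 5, 200),                   # algorithm 3: uninformative
])

# Blend weights via linear regression (least squares) on the holdout set.
weights, *_ = np.linalg.lstsq(preds, y, rcond=None)
blended = preds @ weights
print(np.round(weights, 2))  # the most accurate algorithm gets the largest weight
```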
2.3.4.2 Explanations in Hybrid Approaches
The ability of hybrid approaches to provide explanations of recommendations depends on the specific approaches a hybrid consists of and varies with how tightly the individual approaches are interlocked with each other. Recall the discussion of Section 2.1.2, where different explanation styles and their correspondence to different recommendation approaches were presented: User-based approaches are capable of the nearest neighbor explanation style, item-based approaches allow for the influence style, and CB approaches can produce explanations in the keyword style.
Consequently, hybrids are able to utilize the explanation styles that are available to the recommendation methods they employ. However, the properties of the corresponding explanation styles, e.g. transparency or effectiveness, only remain valid if the final recommendation is produced solely by one of the constituent methods, i.e. when the predictions of the hybridized methods are not combined and the rating of the best performing method is used.35
Nevertheless, in cases when the recommendation is produced as a mixed result of multiple methods, an explanation of why a particular item was recommended can still be generated. One possible way to do this is to adopt the explanation that would apply if the individual method had recommended this item. For instance, if a hybrid combined user-based CF with a CB technique, the explanation could be formed as a mix of the nearest neighbor style (“… because other users also liked”) and the keyword style (“… because it contains features X, Y, Z”).
However, this method is not applicable when the hybridized recommendation techniques are integrated more tightly with each other, so that the explanations of the individual methods are not accessible, e.g. as in the above-described case of “collaboration via content”. A possible method of generating an explanation in this scenario is to post-process the recommendation results with a CB technique. A concrete implementation of such an approach is described in Symeonidis, Nanopoulos, and Manolopoulos (2008, 2009), the only works that apply the keyword explanation style in the domain of movie recommendations. In
35 The consequences of the opposite case will be discussed below.
these papers, the recommendations are produced by means of item-based CF applied to previously formed biclusters of users and items. The explanations, however, are generated in a content-based manner. To do this, the authors examine the correlations between the item feature profiles and the user ratings and so identify the features that are associated with the movies the user liked most. If the recommended item contains such features, they are highlighted in the explanation of the recommendation.
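A sketch of this post-processing idea follows; the features, item profiles, and ratings are invented for illustration and do not reproduce the cited implementation:

```python
import numpy as np

features = ["action", "comedy", "drama", "A-list star"]

# Binary feature profiles of the movies the user has rated, and the ratings.
item_features = np.array([
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
], dtype=float)
user_ratings = np.array([5, 4, 2, 1, 5], dtype=float)

def corr(x, y):
    """Pearson correlation between a feature column and the ratings."""
    x, y = x - x.mean(), y - y.mean()
    d = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / d) if d else 0.0

# Correlate each feature with the user's ratings to find "liked" features.
scores = [corr(item_features[:, j], user_ratings) for j in range(len(features))]

# Highlight liked features that the recommended item actually contains.
recommended = np.array([1, 0, 0, 1], dtype=float)
explanation = [f for f, s, present in zip(features, scores, recommended)
               if present and s > 0]
print(explanation)
```

Features that correlate positively with the user's past ratings and are present in the recommended item form the keyword-style explanation; negatively correlated features could analogously supply the "cons" side.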
This post-processing method allows the generation of keyword style explanations for virtually all recommendation approaches. Moreover, within such an approach, it is also possible to address negative cues, which makes the “pros-and-cons” explanation style accessible to all recommendation approaches as well, albeit these issues were not addressed by previous research. Although the pros-and-cons style has been shown to be the most effective among the explanation styles considered in previous research (see Section 2.1.2), its post-processing version has one serious drawback: it does not reflect the way the recommendation is actually produced. Hence, the explanations fail to achieve the goal of transparency of the recommendation system, which potentially endangers the users' acceptance of and trust in the RS as a whole as well as their loyalty to the system (see Section 2.1.2). The same also holds for the keyword style, the second best explanation style.
Even more important is that in this case the recommendation process is generally not aligned with the user's preferences.36 Thus, the advantage of the pros-and-cons and the keyword explanation styles in increasing the user's choice effectiveness cannot unfold thoroughly:
Though the explanations might efficiently highlight the reasons why the user may like the recommended item, they are not able to explain why the system considers this item the best for the user, since the recommendation procedure cannot access the user's attribute preferences. As shown previously, a deviation from the user's preference function potentially decreases the user's choice effectiveness, satisfaction with, and loyalty to the RS (Aksoy et al. 2006; see also Section 2.1.3).
36 Recall our discussion in Section 2.1.3, where we deduced that a good, i.e. effective and actionable, explanation should be informative and understandable to the user. On the other hand, such an explanation requires that the underlying recommendation process be aware of and operate in terms of the item attributes that are descriptive of the user's preferences, i.e. of item characteristics that are relevant for the user. In the best case, the explanation additionally utilizes the strengths of negative cues.
2.4 Summary
In this chapter, we provided an overview of the theoretical work related to the objectives of the current thesis and its underlying proposals, which aim at developing a recommendation method that is capable of providing both accurately predicted recommendations and actionable explanations of the reasoning behind them.
In the first section of this chapter, we addressed the question of why explanations of recommendations should be provided, and we deduced a concept of how it should be done. In particular, we concluded that in order to be effective and actionable, the explanations should be aligned with the user preferences. This also increases the user's acceptance of, trust in, and loyalty to the recommender system as a whole. Based on these considerations, we substantiated a new explanation style, the “pros-and-cons” style, which actionably supports the user in choosing a movie and increases choice effectiveness. We have also shown that the generation of such explanations requires the recommendation algorithm to be capable of reflecting the user's attribute preferences and of incorporating them directly into the process of recommendation generation.
Following this idea, in the second section we introduced the concept of multiattribute utility (MAU) and the weighted additive composition rule (WADD), which serve, respectively, as the basis for the operationalization of the user's attribute preferences and as the basis for the derivation of recommendations from these attribute-related preferences. To be able to apply these concepts to the case of motion picture recommendations, we then elaborated on the question of which movie characteristics, i.e. attributes, possess relevance for the formation of consumer preferences for movies. These characteristics are summarized in Table 2.3 and are also presented in detail in Appendix B.
In the third section, we provided an overview of the key recommendation approaches, i.e. collaborative filtering, content-based filtering, and hybrid methods. We have also given detailed descriptions of the recommendation algorithms that are representative of the corresponding approaches. This knowledge allows us to comprehend the principles and details of recommendation generation and the merits and limitations of the different approaches; it also
allows us to understand the problems we may potentially face, and thus should consider and account for, while developing our proposed method.
At this point, we have accumulated all the knowledge indispensable for the development of our proposals. Hence, we proceed to the next chapter, which describes the concepts of the method to achieve our goals.
Chapter 3
3 Conceptual Framework of a Hybrid Recommender System
that allows for Effective Explanations of Recommendations
This chapter presents the actual proposals of the current thesis, i.e. a recommendation method that is capable of providing both accurately predicted recommendations and actionable, effective explanations of the reasoning behind them. As elaborated in the previous chapter, this method integrates the user's attribute preferences directly into the process of recommendation generation and thus aligns the recommendation process with the user preferences.
The chapter is divided into three sections: The first section elaborates on the modeling issues. That is, the model of the user preferences is gradually derived, and the aspects that the model incorporates and accounts for are discussed. The second section concerns the questions of parameter estimation for the derived model. In essence, it presents the core of our proposal: an algorithm that is capable of estimating the users' attribute part-worths on the basis of very scarce data sets, i.e. data sets where the number of parameters to estimate is much greater than the number of data points, which makes an algebraic solution to the estimation problem impossible. The third section motivates the hybridization of our algorithm and discusses the hybridization methodology.
3.1 Modeling User Preferences
3.1.1 Motivation of the Approach
As stated earlier, a recommender algorithm that aims to help users make better choices and to increase their choice efficiency by providing actionable explanations should reflect the user's way of thinking (see Section 2.1.3). This can be done either by conforming the algorithm's model to the user's decision strategy or by accurately estimating the user's attribute preference weights.
As shown by Aksoy et al. (2006), the relationship between both aspects is not additive, so that it suffices for a recommender algorithm to maintain one of the two: either the similarity of the recommendation process to the user's decision strategy or the similarity of the estimated attribute preference weights to the user's actual ones. In our approach, we choose to follow the second path, since it is more generalizable and allows us to handle all users in the same way, namely by applying the additive decision rule to the estimated attribute part-worths. The other alternative, i.e. deriving the users' decision strategies, faces the serious disadvantage that consumers do not have a stable decision function and are likely to rely on simplifying heuristics in a number of situations (e.g. under time pressure). Such a spontaneous strategy change would, from the viewpoint of an RS, seriously impede the recommendation task, since it would challenge the system to adapt to every little change in the user's behavior, which is also hard to track automatically. Moreover, the derivation of decision strategies typically requires knowledge of the attribute part-worths, which would thus complicate the recommendation process while making it more prone to errors. Instead, we suggest relying on the most efficient decision rule, WADD, while concentrating on the accurate estimation of the attribute preference weights.37 This also conforms to our aim of providing the users with an efficient decision aid, rather than obtaining an in-depth understanding of individuals.
On the other hand, the provision of actionable explanations requires them to be under-
standable to the user, i.e. made in terms that are meaningful to the user and relevant for the his
37 Compare Sections 2.1.3 and 2.2.1 for a detailed discussion of the provided arguments.
or her preference formation. As shown in Section 2.2.2, movie attributes build a suitable basis that fulfills these requirements: they are both understandable to the users and relevant for the formation of user preferences. The latter, again, brings the attribute preferences into the foreground and confirms our choice to concentrate our proposals on the reliable estimation of the attribute preference weights, i.e. part-worths.
Consequently, we develop our model of user preferences in terms of the user's attribute part-worths. To this end, we utilize the concept of multi-attribute utility, which connects the consumer's, i.e. user's, preference to the utility that an alternative, i.e. a movie, possesses for the user; a concept which states that this utility can be decomposed into its attribute-related components, i.e. part-worths (see Section 2.2.1). The following subsections present the development of the model in more detail. Each subsection builds upon the preceding one and introduces additional components to the model, thus refining it.
3.1.2 Basic Model of User Preferences
The datasets that recommender systems (RS) operate with typically represent a set of ratings the users of the system have assigned to the items contained in the system's catalog (see Section 2.3). In the context of movie recommendations, the ratings describe the user's enjoyment of a movie, i.e. the degree to which a user has liked a particular film. The higher the rating, the more the user liked the movie. Hence, the ratings can be thought to express the users' preferences for movies or, in other words, the usefulness of movies for the users in terms of liking. In that, we can see a direct analogy to the concept of utility: Indeed, a higher rating corresponds to a higher utility; two movies rated equally are equally 'useful' for the user. Hence, we can argue that ratings are proxy measures for the utility of movies for users and for the preference of the latter for the former.
The preference for a movie can be decomposed into (partial) preferences for its attributes, i.e. the user's attitudes towards the movie's characteristics (see Section 2.2.2). Thus, a rating can be expressed as a sum of the part-worths of the movie's components, or more formally:
$r_{ui} = \sum_{j \in A} p_{uj} x_{ij}$    (3.1)

where $r_{ui}$ is the rating of user $u$ for movie $i$, $p_{uj}$ denotes the preference of the user for the $j$-th attribute of the movie, i.e. the $j$-th part-worth, and $x_{ij}$ denotes a binary variable with 1 indicating the presence of the attribute, e.g. of an actor, in the movie and 0 otherwise. $A$ defines the indexes of the set of attributes which are used to describe movies in the system's dataset. Rewritten in vector form, (3.1) yields

$r_{ui} = \mathbf{x}_i^{T} \mathbf{p}_u$    (3.2)

with $\mathbf{x}_i^{T}$ denoting the transposed binary vector of the movie's characteristics and $\mathbf{p}_u$ being the vector of the user's part-worths of the corresponding attributes.[38]
This first, yet simple, model assumes that the movie ratings are known from the user's past rating records and that the movie characteristics are available, e.g., from the Internet Movie Database (IMDb). The vector $\mathbf{p}_u$ of the user's preferences is to be estimated. Once estimated, the part-worths can be used both for predicting the user's ratings for new and yet unseen movies and for providing the explanations of recommendations.
Note that the model implies that the elements of the part-worth vector are real numbers and thus allows them to take positive as well as negative values. These properties entail the ability to rank-order the attribute part-worths according to their contribution to the final rating. This, on the one hand, allows the provision of explanations in the pros-and-cons style (see Section 2.1.3). On the other hand, it allows highlighting the most important aspects in the explanations that influenced the recommendation in a positive as well as in a negative way, and thus additionally increases the effectiveness of explanations.
[38] Here and in the following we use a bold font face to denote vectors and a regular font face to denote scalars.
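To make the basic model concrete, the prediction in (3.1)/(3.2) can be sketched in a few lines of code. This is purely an illustrative sketch: the attribute names and numeric part-worths below are invented, not taken from the thesis data, and Python is used only as notation.

```python
import numpy as np

# Hypothetical attribute catalog; in the thesis, 318 movie attributes are used.
attributes = ["comedy", "action", "actor_a", "director_b"]

# Part-worth vector p_u of one user: positive values raise the predicted
# rating, negative values lower it (invented numbers).
p_u = np.array([1.5, -0.5, 2.0, 0.0])

# Binary profile x_i of one movie: a comedy starring actor_a.
x_i = np.array([1, 0, 1, 0])

def predict_rating(x_i, p_u):
    """Rating prediction after (3.1)/(3.2): sum of part-worths of the
    attributes present in the movie, i.e. a dot product."""
    return float(x_i @ p_u)

print(predict_rating(x_i, p_u))  # 1.5 + 2.0 = 3.5
```

Because the part-worths are signed real numbers, sorting `p_u * x_i` immediately yields the pros-and-cons ordering used for explanations.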
3.1.3 Accounting for Static Effects beyond the User-Item Interaction
Our model suggested in expressions (3.1) and (3.2), although simple, still requires some complements to improve its efficiency.
The first shortcoming of this model concerns the centering of the part-worths. That is, the part-worth values in the basic model are centered on zero. Although this is advantageous for distinguishing between 'good' and 'bad' attribute preferences, i.e. positive and negative part-worths, the model does not suit the scale of most recommender systems well. Most RS employ rating scales that begin at 1 point or star at the bottom level. In order to produce positive rating values in these systems, the model requires each movie to possess at least one attribute that breeds a positive part-worth high enough to compensate for all the negative ones. Moreover, in order to score above '0', this requirement has to be fulfilled for all users, which seems rather unrealistic. A common way to compensate for this shortcoming is to integrate a constant term into the model, which is often referred to as the 'baseline'. By these means the model parameters are shifted by the value of the constant, which effects the centering of the part-worths on the baseline. A suitable choice for the baseline is the mean value of all movie ratings contained in the system's dataset. The advantage of this choice is that it represents the first moment of the rating distribution in the sample, i.e. in the dataset. Given a high number of ratings, which is often the case in RS, and following the law of large numbers, the sample mean converges to the expected value of the rating of a movie. Accordingly, the model updates to

$r_{ui} = \mu + \mathbf{x}_i^{T} \mathbf{p}_u$    (3.3)

with $\mu$ denoting the mean value of the movie ratings, i.e. the expected rating value for every movie, given no additional information about the movie and the user. If a user does not have any preference for the movie's attributes, i.e. when the user's part-worths for all movie attributes are zero, the rating value $\mu$ is the most probable to occur. The positive and negative attribute part-worths increase and decrease the user's rating value respectively. However, in this context, the meaning of the part-worths is slightly different than in the previous model formulation: In model (3.3) the user's attribute part-worths indicate the amount of the user's
preference change with respect to an average movie, i.e. how much the user's evaluation of a movie becomes better or worse than that of an average movie due to its containing a specific attribute.
Further, expressions (3.1) and (3.2) model the rating solely as an interaction of item attributes and the user's attribute part-worths. However, there are some effects that are independent of this interaction but rather associated with either users or items (Koren 2009). The recommender literature frequently notes that different users may use the rating scale differently: some users tend to systematically give higher ratings than others (e.g. Sarwar et al. 2001; Adomavicius and Tuzhilin 2005; Jannach et al. 2011; see also Sections 2.3.1.1 and 2.3.1.2). This causes the mean rating of individual users to deviate from the overall mean, something we refer to as the user bias. An item bias may result, for example, from the 'appeal to popularity' of mainstream movies, which causes the mean rating of such movies to be higher by trend (Austin 1989; Koren 2009), whereas less popular movies are likely to exhibit lower average ratings. Users may also differ in their reaction to average movie ratings and to movies' popularity: While some users simply adapt to mainstream judgments, others react overly positively, and a third group reacts skeptically, i.e. rates movies against the trend. Although these reactions involve both a user and a movie, it can be argued that they are directed at the movie as a whole, rather than at its specific characteristics. In other words, these reactions happen on a more general level that does not concern the attribute-level interactions, i.e. the changes of the user's movie preferences that are conditioned on the presence of a certain movie characteristic in the movie's profile. Incorporating these effects into the model leads to

$r_{ui} = \mu + b_u + c_u b_i + \mathbf{x}_i^{T} \mathbf{p}_u$    (3.4)

where $b_u$ and $b_i$ denote the user bias and the item bias respectively and are defined as the deviations of a user's and a movie's mean rating value from the overall mean. The user's reactions to the movie bias are captured by the scale factor $c_u$.
3.1.4 Accounting for Time
The model described by (3.4) separates user-item interactions from the effects caused by factors that are not related to the users' preference formation but rather influence the magnitude of the rating through the inherent nature of users and movies. This allows an estimation of the user's attribute preferences, i.e. part-worths, which are actually involved in the emergence of the user's preference for a particular movie. However, this model is static. That is, it does not account for temporal changes of user preferences and rating behavior, or for changes of movie popularity. Since RS, on the one hand, rely on historical data and, on the other hand, depend on the amount of data (see Section 2.3.3), the model requires accounting for time in order not to be prone to the "stability vs. plasticity" problem (see Section 2.3.3.5).
Indeed, time affects all components of the model in one way or another. Movies can become classics over time, e.g., "Casablanca", or fall into oblivion, like "Night of the Creeps". Users may change their rating behavior or adopt new views on genres, actors, directors, etc. Hence, it is crucial to account for time-changing factors (Koren 2009).
Time-changing effects are usually modeled by splitting them into three parts: The first one is a constant term, which represents the effect's baseline. It can be interpreted as the amount of the modeled measure that it exposes at the 'starting' point of time, i.e. at $t = 0$. The second part captures the long-term trend and concerns the component of the temporal changes that develops linearly with the course of time. In other words, it represents a 'drift' of the measure's baseline that happens at a constant rate over time. The third part of the temporal effect captures short-term fluctuations, i.e. deviations from the drifted baseline at a particular point in time. These deviations may happen irregularly or have a periodic basis. For example, Christmas movies become more popular at Christmas time, i.e. periodically; whereas the popularity of an actor increases when a new movie starring the actor is released or when his or her name is mentioned in a considerable number of press reports, which in general has no periodic basis and happens irregularly. Figure 3.1 illustrates the three parts of time-changing effects.
Figure 3.1: Decomposition of a time-changing measure into three components: baseline, long-term trend, and short-term fluctuations
Accordingly, the term $b_u$ in equation (3.4) needs to be replaced with the expression $b_u + \beta_u t + b_{ut}$. Here, $\beta_u$ is the slope of the user's rating trend, $b_{ut}$ is the deviation of the user's mean rating at a point in time $t$, and $b_u$ is redefined as the static part of the user's rating. Analogously, the movie bias and the user reaction factor are to be replaced with $b_i + \beta_i t + b_{it}$ and $c_u + \gamma_u t + c_{ut}$ correspondingly. After the described modifications to (3.4), the model extends to

$r_{ui}(t) = \mu + b_u + \beta_u t + b_{ut} + (c_u + \gamma_u t + c_{ut})(b_i + \beta_i t + b_{it}) + \mathbf{x}_i^{T} \mathbf{p}_u(t)$    (3.5)

As noted earlier, user preferences can also be subject to temporal changes; thus each element of the user's part-worth vector $\mathbf{p}_u(t)$ is to be constructed as $p_{uj}(t) = p_{uj} + \beta_{uj} t + p_{ujt}$, with the index $j$ denoting the associated attribute of the corresponding part-worth values.
With expression (3.5), we have derived a model that incorporates temporal effects and captures user preferences at the finest level of resolution. Provided that all parameters of the model are known, we could accurately estimate the user's ratings of movies. However, aside from the question of a sufficient amount of data, the estimation of this model is challenged
exactly by this finest resolution. That is, the estimation of the parameters which capture short-term fluctuations of the rating is not sensible.
To understand the rationale behind this assertion, let us consider the parameters of model (3.5) individually, while turning our attention to the question of which part of the variance in the ratings the parameters are associated with. Recall that the basis for the parameter estimation is a matrix of past user ratings that incorporates two dimensions: users and items. Hence, the effects integrated into the model can be attributed to the specifics of the rating distribution along the user dimension, the item dimension, or both.
The parameter $\mu$, defined as the overall mean, is thus associated with both the user and the item dimension of the rating distribution and captures the 'roughest' part of the variance in the ratings. The residual variance is to be explained by the remaining parameters. Being a constant term, however, $\mu$ introduces solely a positive affine transformation into the model and only causes the centering of the remaining effects on its value. That is, although $\mu$ affects the values of the parameters, the magnitudes of the actual effects are not affected thereby.
The parameter $b_u$ represents the difference between the overall mean and the mean rating of a user. It captures the static part of the remaining variance that is attributed to a specific user. Analogously to the overall mean, $b_u$ effects a positive affine shift of the centering point for the effects caused by the user-item interaction (model term $\mathbf{x}_i^{T} \mathbf{p}_u(t)$) and adjusts the values of the estimated parameters, leaving the magnitudes of the actual effects unaffected. In doing so, it separates the static part of the effects that are caused solely by the user's specifics from the effects that are due to the user-item interaction.
The model term $\beta_u t$ describes the long-term temporal changes in the rating behavior of a user. In other words, it accounts for the development of $b_u$ over time. That is, at each given point in time, $\beta_u t$ is solely a constant that adjusts the value of $b_u$. Hence, this term is also attributed to the part of the variance which is caused by the user's properties that are beyond the user's preferences.
In the same way as $b_u$ does for users, $b_i$ captures the static part of the variance in the ratings that occurs due to the specifics of an item. Similarly, $\beta_i t$ adjusts the value of the item's baseline over time. Both parameters $b_i$ and $\beta_i$ are thus attributed to the items.
Similar logic applies to the construction of the user's reaction scaling factor and the elements of the user's part-worth vector, with the exception that both parameter groups are involved in a multiplicative relation with the model terms that are associated with items. The model terms $c_u$ and $\gamma_u t$ capture the part of the variance that is due to the user's reaction to a movie's average rating. Notice, however, that $c_u$ and $\gamma_u t$ only address the user dimension of the rating matrix, i.e. they represent another adjustment to the user bias.
The residual variance in the ratings, i.e. the variance that remains after accounting for the previously discussed effects, is then 'decomposed' into parts that are associated with the movie's attributes and caused by the user's evaluation of these attributes. That is, each element of the vector $\mathbf{p}_u(t)$ captures the variance that is caused by the user's preference for the corresponding attribute.
The short-term effects are thus thought to 'compensate' the difference between the actual rating $r_{ui}(t)$ and the rating $\tilde{r}_{ui}(t)$ that would be predicted after accounting for the effects described above. However, on the one hand, this difference is produced by the cumulative effect of all short-term parameters, as defined in (3.6):

$r_{ui}(t) - \tilde{r}_{ui}(t) = b_{ut} + c_{ut}(b_i + \beta_i t) + (c_u + \gamma_u t) b_{it} + c_{ut} b_{it} + \sum_{j \in A} p_{ujt} x_{ij}$    (3.6)

On the other hand, problem (3.6) can only be addressed after all other parameters of the model are estimated. Thus, the model parameters that describe the short-term effects would not help to clear out the associated variance from the initial model but rather only help to explain its error for past cases. This makes them useless for the purpose of rating prediction, which contradicts the aims of providing recommendations. Therefore, we decide to omit the short-term effects from the model, which leads to our final model:

$r_{ui}(t) = \mu + b_u + \beta_u t + (c_u + \gamma_u t)(b_i + \beta_i t) + \mathbf{x}_i^{T} \mathbf{p}_u(t)$    (3.7)

with the elements of the vector $\mathbf{p}_u(t)$ constructed as $p_{uj}(t) = p_{uj} + \beta_{uj} t$.
In the next section, we describe our method to estimate the model parameters.
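The structure of the final model (3.7) can be sketched in code as follows. The symbol names mirror the text, but every numeric value below is invented for illustration; this is a sketch of the functional form only, not of the thesis' fitted model.

```python
import numpy as np

# Static parts and trend slopes of the user/item terms (invented numbers).
mu = 3.0                   # overall mean rating
b_u, beta_u = 0.4, 0.01    # static user bias and its trend slope
b_i, beta_i = 0.6, -0.02   # static item bias and its trend slope
c_u, gamma_u = 1.0, 0.0    # reaction factor and its trend slope

p_u = np.array([1.2, -0.3])      # static part-worths p_uj
beta_uj = np.array([0.0, 0.05])  # their trend slopes
x_i = np.array([1, 1])           # binary attribute profile of the movie

def predict(t):
    """Rating prediction after model (3.7) at time t (short-term terms omitted)."""
    user_term = b_u + beta_u * t
    item_term = (c_u + gamma_u * t) * (b_i + beta_i * t)
    part_worths_t = p_u + beta_uj * t   # p_uj(t) = p_uj + beta_uj * t
    return mu + user_term + item_term + float(x_i @ part_worths_t)

print(round(predict(0.0), 2))  # 3.0 + 0.4 + 0.6 + (1.2 - 0.3) = 4.9
```

At `t = 0` the prediction reduces to the static model (3.4); as `t` grows, the user's bias, the item's bias, and the part-worths drift linearly along their trend slopes.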
3.2 Estimating Model Parameters
Up to this point, we have obtained a model of user preferences, formulated in equation (3.7). This model allows the prediction of user ratings for yet unseen movies, based on the knowledge about the characteristics of a movie and the user's properties. Whereas the knowledge about the movie characteristics can be obtained, e.g., from the Internet Movie Database,[39] the other model parameters have to be learned from the past user ratings available to a recommender system. For the description of our approach to the estimation of the model parameters, we assume that both datasets are available and that the dataset of past user ratings maintains the associations between the ratings, the users, and the movies to which the ratings were given.
As deduced in the previous sections, our model encompasses 643 parameters: 636 of them describe the user's preferences for 318 movie attributes (see Section 2.2.2 for the derivation and Appendix B for the list of the attributes), i.e. 318 pairs of $p_{uj}$ and $\beta_{uj}$ that build the elements of the vectors $\mathbf{p}_u(t)$. One parameter represents the overall mean rating $\mu$. The remaining six parameters describe the effects that are associated with either a user or a movie. Whereas $\mu$ can easily be calculated and thus can be thought of as given by the dataset of user ratings, a set of 642 parameters is to be estimated for each user based on the user's past ratings.
A direct solution to this problem, however, can only be obtained when the available data is sufficient, i.e. when the number of ratings per user contained in the dataset equals the number of parameters to be estimated. Moreover, the data points are required to be linearly independent, i.e. no movie vectors consisting of exactly the same attributes are allowed. In the case of movie recommenders, though, both requirements are not likely to be fulfilled: Linear dependence between the movie vectors may arise, for example, when sequels or series, e.g., the "Matrix" trilogy or "Friends", are included in the database. A more serious problem, however, is that the users of movie recommenders are not likely to rate a sufficient number of items. For instance, the median number of ratings in the MoviePilot and Netflix
[39] The data is available for download at http://www.imdb.com/interfaces. Licensing information is provided at http://www.imdb.com/licensing (for commercial use) and http://www.imdb.com/licensing/noncommercial (for non-commercial use).
databases employed in our study amounts to 25 and 96 ratings per user respectively (see Table 4.1), which is correspondingly more than 25 and more than 6 times less than the number of parameters to be estimated per user. Hence, problem (3.7) can be addressed neither by solving it directly nor by means of statistical techniques such as regression analysis.
In this case, optimization techniques, such as gradient descent, can be applied to learn the model parameters through minimizing the dedicated error function

$E = \sum_{u} \sum_{i \in I_u} (r_{ui} - \hat{r}_{ui})^2$    (3.8)

where $\hat{r}_{ui}$ denotes the predicted rating and $I_u$ designates the set of movies rated by the user. However, optimization methods are strongly dependent on the initial point of the optimization and will be more likely to find local minima rather than achieving a global solution to the problem if the starting point for the optimization is chosen improperly[40] (Press et al. 2007; Paterek 2007; Koren, Bell, and Volinsky 2009). The latter leads to unreliable estimates of the model parameters and consequently to higher errors in the predictions produced by the model. Conversely, if the initial guess, i.e. a suboptimal yet good solution to (3.7), lies near the global optimum of (3.8), optimization techniques are able to determine that optimum and to refine the 'initial' model parameters, so that the predictions made by the model in fact exhibit the lowest possible errors in terms of (3.8).
Accordingly, the task of estimating the parameters of our model of user preferences can be divided into two steps: (i) provision of an accurate guess for an initial solution to (3.7) and (ii) optimization of the model parameters by means of minimizing the dedicated error function (3.8). In the following subsections, we provide a description of our two-step method.
[40] The optimization method and its tendency to find local minima will be described in more detail in Section 3.2.2. For now, let us assume that the estimation by means of optimization is possible.
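A gradient-descent refinement of the kind used in step (ii) can be sketched as follows. This is a deliberately minimal, hypothetical example: only a single user's part-worth vector is optimized, the data are synthetic and noise-free, and the learning rate and step count are arbitrary; the starting point `p0` plays the role of the initial guess from step (i).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 5)).astype(float)  # binary movie profiles x_i
p_true = np.array([1.0, -0.5, 0.3, 0.0, 0.8])       # 'true' part-worths
r = X @ p_true                                      # synthetic ratings r_ui

def refine(p0, lr=0.2, steps=2000):
    """Gradient descent on the (mean) squared prediction error, cf. (3.8)."""
    p = p0.copy()
    for _ in range(steps):
        err = X @ p - r                  # prediction errors
        p -= lr * (X.T @ err) / len(r)   # gradient step
    return p

p_hat = refine(np.zeros(5))
print(np.round(p_hat, 3))
```

With a zero starting vector and clean data the descent recovers the true part-worths; with scarce, noisy data and 642 parameters, the quality of the starting point decides whether the routine lands in the global optimum or in a local minimum, which is exactly why step (i) matters.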
3.2.1 Step 1: Estimation of Initial Parameter Values
As noted above, under the given circumstances, i.e. an insufficient number of ratings per user, neither a precise solution for the values of the parameters in model (3.7) is available, nor is a simultaneous estimation of all parameters possible. Nevertheless, in order to be able to find an efficient approximation of the solution by means of an optimization method, we need an initial guess for the parameter values that defines a point in the parameter space that lies as close to the actual solution as possible.
We propose to employ the OLS regression method to obtain initial estimates for each parameter separately. That is, instead of estimating all model parameters jointly (which is impossible due to data availability restrictions), we suggest running a set of regressions that estimate the individual parameters independently of each other. Although in this case the obtained estimates are likely to be biased, OLS regression provides us with a set of advantages: Alongside the estimates, it provides (i) inference about parameter significance and (ii) access to the confidence limits of the parameters' values, i.e. the interval that most probably includes the true value of a parameter. The latter allows us to interpret the OLS results as interval estimates and to additionally constrain the optimization routine. The search for the parameter values is then performed within the scope of possible solutions that most probably contains the true one, and the search procedure does not leave this scope to find a local minimum that satisfies the restrictions of the error function (3.8) but provides unreliable estimates of the user's preferences in terms of (3.7). The information about the significance of a parameter can be used for dropping parameters that are statistically meaningless for describing the user's movie preferences and for generating and explaining rating predictions, which simplifies the search procedure and reduces the probability of finding local minima.
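A one-parameter-at-a-time regression of this kind can be sketched as a simple OLS of the ratings on a single attribute indicator, returning the slope, its t-value, and approximate confidence limits. Everything below is an illustrative simplification: the data are synthetic, and a fixed critical value `t_crit = 2.0` stands in for the exact Student quantile.

```python
import numpy as np

def simple_ols(y, x, t_crit=2.0):
    """Simple OLS of y on one regressor x: slope, t-value, ~95% conf. limits."""
    n = len(y)
    xc, yc = x - x.mean(), y - y.mean()
    slope = (xc @ yc) / (xc @ xc)
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    var_slope = (resid @ resid) / (n - 2) / (xc @ xc)  # slope variance
    se = np.sqrt(var_slope)
    return slope, slope / se, (slope - t_crit * se, slope + t_crit * se)

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 50).astype(float)    # presence of one attribute
y = 3.0 + 1.5 * x + rng.normal(0, 0.3, 50)  # synthetic ratings

slope, t_val, (lo, hi) = simple_ols(y, x)
print(f"slope={slope:.2f}, t={t_val:.1f}, CI=({lo:.2f}, {hi:.2f})")
```

The t-value supports dropping statistically meaningless attributes, and the interval `(lo, hi)` is what constrains the later optimization step.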
However, compared to a simultaneous estimation of all parameters, estimating the parameters one by one introduces a model specification error to OLS. That is, the regression model becomes underspecified, which may negatively influence the 'quality' of the estimates, particularly when the omitted regressors correlate with the independent variable in the underspecified regression model. In this case, estimating the parameter values and confidence limits, as well as drawing conclusions about their significance, is not straightforward. Therefore, before we present the details of how the individual parameters can be estimated, we make a note in the next subsection on the consequences of the underspecification of OLS and present our method to counteract them in order to achieve more reliable initial parameter estimates.
3.2.1.1 Omitted Variable Bias in OLS Models and a Method to Counteract It
Omitting a relevant variable from the regression model entails that, in the majority of cases, the estimates of the parameters and their corresponding variances are biased (Gujarati 2004). Consequently, since the variance of a parameter in regression analysis serves as the basis for the inference about the parameter's significance, the statements about the latter may become misleading. To further understand the rationale behind these assertions, let us consider the following example:[41]
In order to maintain consistency with the notation commonly used within regression analysis, let us redefine, for the length of this section, the symbols $\alpha$ and $\beta$ as the coefficients of regression equations, $r_{jk}$ as the correlation coefficient between the $j$-th and $k$-th independent variables of a regression model, and $t$ as the $t$-value of the Student's t-test.
Suppose now that the true regression model to estimate is

$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i$    (3.9)

but instead, we omit the relevant variable $X_3$ and fit the model

$Y_i = \alpha_1 + \alpha_2 X_{2i} + v_i$    (3.10)

The consequences of omitting $X_3$ are as follows:[42]
[41] The example and its explanations are based on Gujarati (2004), Chapter 13, esp. pp. 510-513 and 556-557.
[42] The proof of the individual statements lies outside the scope of this thesis and can be found, e.g., in Kmenta (1971) or Johnston and DiNardo (1997).
1. If the omitted variable $X_3$ is correlated with the included variable $X_2$, i.e. the correlation coefficient $r_{23}$ between $X_2$ and $X_3$ is nonzero, the estimates $\hat{\alpha}_1$ and $\hat{\alpha}_2$ will be biased and inconsistent. More formally, this means that $E(\hat{\alpha}_2) \neq \beta_2$ and that $E(\hat{\alpha}_1) \neq \beta_1$. Moreover, the bias does not disappear as the sample size gets larger.
2. If $X_2$ and $X_3$ are not correlated, the constant term $\hat{\alpha}_1$ will be biased, although $\hat{\alpha}_2$ is unbiased in this case.
3. The disturbance variance $\hat{\sigma}^2 = \sum \hat{v}_i^2 / \mathrm{df}$, where $\mathrm{df}$ denotes the degrees of freedom, is incorrectly estimated.
4. The variance of $\hat{\alpha}_2$, $\operatorname{var}(\hat{\alpha}_2) = \hat{\sigma}^2 / \sum x_{2i}^2$, is a biased estimator of the variance of the true estimator $\hat{\beta}_2$.
5. Consequently, the hypothesis-testing procedure, i.e. the t-test, is likely to provide misleading conclusions about the statistical significance of $\beta_2$ and its confidence limits.
For our proposed method this means that:
(i) We may erroneously drop a parameter from our model based on an invalid conclusion about its insignificance.
(ii) The parameter values may be underestimated or overestimated, so that our solution for the starting point of the optimization would be further offset from the global optimum of (3.8), which, in turn, increases the risk of finding a local minimum during the optimization.
(iii) The confidence intervals might not include the true value of the parameter, which, again, would drive our optimized solution away from the optimum.
However, we can counteract the consequences of the OLS model misspecification and thus reduce the risks described above. That is, we can, to some extent, correct the biased parameter values and the biases in the corresponding variances and therefore obtain more efficient initial estimates as well as more reliable confidence limits.
First of all, notice that problems 1-5 as well as their consequences (i)-(iii) only apply to the estimation of the part-worth vectors and do not apply to the user and item biases or the user's scale reaction factor, since the latter are free from correlations with other model variables by definition. That is, they capture effects that are associated with either a user or an item only, which by their nature are not supposed to have sources of influence other than the user's or the item's inherent ones. Hence, the consequences of the OLS model misspecification
are of no concern for estimating these parameters. For these reasons, the discussion below is only relevant for the estimation of the part-worth parameters.
Notice also that we are not interested in the estimates of the regressions' constant terms $\alpha_1$ for the part-worth parameters, since the baseline for the part-worths is provided by the user and item biases. That is, in our underspecified auxiliary regressions, we aim to obtain only the values of the effects of the variables of model (3.7), i.e. the slope coefficients $\beta_2$ and $\beta_3$. Hence, we only need to correct the bias in these parameters and their corresponding variances. The bias of $\hat{\alpha}_1$ affects neither our initial solution nor the subsequent optimization.
To begin, let us consider how the estimate biases can be ruled out: It can be shown that

$E(\hat{\alpha}_2) = \beta_2 + \beta_3 b_{32}$    (3.11)

where $b_{32} = \sum x_{2i} x_{3i} / \sum x_{2i}^2$ is the slope in the regression of the excluded variable $X_3$ on the included variable $X_2$ (Gujarati 2004, Chapter 13). As can be seen from (3.11), $\hat{\alpha}_2$ is biased unless $\beta_3$ or $b_{32}$ or both are zero, i.e. when $X_3$ has no effect on $Y$ or when $X_2$ and $X_3$ are uncorrelated. So the first step in assessing the bias of an estimate is to examine the correlations between the variables. If no correlations can be determined, the estimate of the corresponding parameter and its variance are unbiased.
Otherwise, $\hat{\alpha}_2$ has to be corrected. In our example with two variables this can be done by means of two additional auxiliary regressions: (i) of $X_3$ on $X_2$ and (ii) vice versa:

$X_{3i} = b_{30} + b_{32} X_{2i} + e_{3i}, \quad X_{2i} = b_{20} + b_{23} X_{3i} + e_{2i}$    (3.12)

Using the regression coefficients from (3.12) in expression (3.11) and in its analogue for the regression of $Y$ on $X_3$, we obtain a system of equations:

$\hat{\alpha}_2 = \beta_2 + \beta_3 b_{32}$
$\hat{\alpha}_3 = \beta_3 + \beta_2 b_{23}$    (3.13)

where $\hat{\alpha}_2$ and $\hat{\alpha}_3$ as well as $b_{32}$ and $b_{23}$ are known and $\beta_2$ and $\beta_3$ are the unknowns. Solving the system (3.13) for $\beta_2$ and $\beta_3$ yields their corresponding values:

$\beta_2 = \frac{\hat{\alpha}_2 - \hat{\alpha}_3 b_{32}}{1 - b_{23} b_{32}}, \quad \beta_3 = \frac{\hat{\alpha}_3 - \hat{\alpha}_2 b_{23}}{1 - b_{23} b_{32}}$    (3.14)
These are the unbiased values of the effects of our interest.
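The correction in (3.11)-(3.14) can be checked numerically. In this sketch (synthetic data, invented coefficients), two correlated regressors are each fitted alone, and the biased one-at-a-time slopes are then corrected by solving system (3.13) as in (3.14).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x2 = rng.normal(size=n)
x3 = 0.6 * x2 + rng.normal(size=n)               # X3 correlated with X2
y = 2.0 * x2 + 1.0 * x3 + rng.normal(0, 0.1, n)  # true beta2 = 2.0, beta3 = 1.0

def slope(a, b):
    """OLS slope of a on the single regressor b."""
    bc = b - b.mean()
    return (bc @ (a - a.mean())) / (bc @ bc)

a2, a3 = slope(y, x2), slope(y, x3)       # biased one-at-a-time estimates
b32, b23 = slope(x3, x2), slope(x2, x3)   # auxiliary cross-regressions, cf. (3.12)

# Solving system (3.13) for the unbiased effects, as in (3.14):
beta2 = (a2 - a3 * b32) / (1 - b23 * b32)
beta3 = (a3 - a2 * b23) / (1 - b23 * b32)

print(round(a2, 2), round(beta2, 2))  # a2 is visibly biased upwards, beta2 is not
```

Here the single-regressor slope `a2` absorbs part of the omitted variable's effect, while the corrected `beta2` recovers the true effect up to sampling noise.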
The next step is the correction of the variance estimates of $\beta_2$ and $\beta_3$. This correction is needed because the variance of the estimates is involved in the calculation of both the significance inference statistic, i.e. the $t$-value of the Student's t-test, and the confidence limits (Gujarati 2004, Chapter 8):

$t = \frac{\hat{\beta}_k}{\sqrt{\operatorname{var}(\hat{\beta}_k)}}$   (3.15)

$\hat{\beta}_k - t_{\alpha/2}\sqrt{\operatorname{var}(\hat{\beta}_k)} \le \beta_k \le \hat{\beta}_k + t_{\alpha/2}\sqrt{\operatorname{var}(\hat{\beta}_k)}$   (3.16)
As stated above in consequences 4 and 5, the variance of the regression parameters in OLS with an omitted variable is biased. Consequently, as can be seen from expressions (3.15) and (3.16), the $t$-value and the confidence limits are biased as well. This implies that the Student's t-test, which is used within regression analysis to test a parameter's significance, is likely to provide misleading conclusions. Moreover, both the biased variance and a biased $t$-value entail an error in the calculation of the confidence limits. The latter may cause a shift of the confidence interval, such that the true value of the parameter lies outside of the predicted confidence limits.
One way to counteract this issue is to simply recalculate the variance according to its definition (Gujarati 2004):

$\operatorname{var}(\hat{\beta}_2) = \frac{\hat{\sigma}^2}{\sum x_{2i}^2} \cdot \frac{1}{1 - r_{23}^2}$   (3.17)

where $1/(1 - r_{23}^2)$ is the variance inflation factor (VIF), which quantifies the extent of multicollinearity in OLS. Prior to this recalculation, it is, however, necessary to obtain the value of the residual sum of squares $\sum \hat{u}_i^2 = \sum (Y_i - \hat{Y}_i)^2$ of the "true" OLS model (3.9). Given the unbiased values of $\beta_2$ and $\beta_3$ obtained after (3.14) and the definition of the constant term of the regression as

$\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X}_2 - \hat{\beta}_3 \bar{X}_3$   (3.18)

(Gujarati 2004), we are able to calculate the fitted values $\hat{Y}_i$ and thus the value of $\sum \hat{u}_i^2$. Taking into account the number of degrees of freedom, which equals the number of data points minus the number of regressors minus one, i.e. $n - 3$ in the two-variable case, we can now calculate $\operatorname{var}(\hat{\beta}_2)$ as defined in
(3.17). After this procedure, the bias-corrected $t$-value and the confidence limits are obtained using expressions (3.15) and (3.16). Accordingly, the test of significance can now be performed using the corrected $t$-values.
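The recalculation (3.17) and the subsequent test can be sketched numerically as follows. This is our own illustration with synthetic data: the bias-corrected slopes are assumed to be given (as if obtained via (3.14)), and all names are illustrative:

```python
import numpy as np

# Sketch: VIF-corrected variance (3.17), t-test (3.15), confidence limits (3.16).
rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x3 = 0.7 * x2 + rng.normal(scale=0.5, size=n)   # deliberately correlated
y = 1.0 + 2.0 * x2 + 1.5 * x3 + rng.normal(scale=1.0, size=n)

beta2, beta3 = 2.0, 1.5                      # assume slopes corrected via (3.14)
beta1 = y.mean() - beta2 * x2.mean() - beta3 * x3.mean()   # constant, cf. (3.18)

resid = y - (beta1 + beta2 * x2 + beta3 * x3)
df = n - 2 - 1                               # data points - regressors - 1
sigma2 = (resid ** 2).sum() / df             # residual variance of the 'true' model

r23 = np.corrcoef(x2, x3)[0, 1]
vif = 1.0 / (1.0 - r23 ** 2)                 # variance inflation factor
var_beta2 = sigma2 / ((x2 - x2.mean()) ** 2).sum() * vif   # cf. (3.17)

se = np.sqrt(var_beta2)
t_val = beta2 / se                           # cf. (3.15)
t_crit = 1.96                                # ~5% two-sided, large-sample approx.
ci = (beta2 - t_crit * se, beta2 + t_crit * se)            # cf. (3.16)
significant = abs(t_val) > t_crit
```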
With the discussion above, we presented our method to counteract the problem of underspecified OLS models, which arises from our proposal to estimate the parameters of model (3.7) by means of auxiliary regressions considering only one parameter at a time. The following sections present the details of the estimation of the corresponding parameters.
3.2.1.2 Estimating User and Item Related Effects
The user and item biases as well as the user's popularity reaction scale factor are assumed to be conceptually independent of each other and of the user-item interactions (see Section 3.1.3). Thus, they are unaffected by the omitted variable problem described in the previous section. Although there might be some "technical" correlations with other model variables, these correlations and the corresponding variables are of no concern for the actual effects of interest because, again, they are conceptually unrelated. Hence, we can simply run bivariate auxiliary regressions to determine the corresponding initial parameter values, their significance, and their confidence intervals.
We begin by estimating the user bias parameters. For each user, we run an OLS regression of the user's ratings on time. Whereas the user's rating trend parameter is derived directly from this regression, the baseline is recovered from the regression's constant term by subtracting the overall rating mean. We choose a fixed significance threshold as the cut-off criterion for concluding the significance of regression parameters. For the time resolution, we choose the time variable to denote the number of days passed since the user's first rating, meaning that we do not assume users to change their rating behavior within one day, while allowing it to vary on a daily basis. Further, because new users need some time to become accustomed to the system, we assume their rating behavior to change more rapidly than that of experienced users. Hence, to prevent overfitting to unstable fluctuations of the average user's rating and to increase the reliability of our initial estimates, we require the standard deviation
of the user's rating times to be at least 60. In other words, we require the user to have rated movies over a period of at least 120 days in order to be able to capture his or her drifting rating behavior. For users who do not meet this condition, as well as for users whose trend parameter was found to be insignificant in the auxiliary regressions, the trend parameter is discarded from the model and the baseline is calculated as the mean of the corresponding user's ratings. In such cases, the confidence limits for the baseline are set in accordance with (3.16), with the critical value drawn from the Student's t-distribution for the chosen significance level and degrees of freedom equal to the number of the user's ratings minus one, and with $\sqrt{\sum_i (r_{ui} - \mu)^2 / (n_u - 1)}$ being the standard deviation of the differences between the user's ratings and the overall mean, where $\mu$ denotes the overall rating mean and $n_u$ the number of the user's ratings.
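The per-user estimation just described can be sketched as follows. This is an illustrative simplification: the function names, the 1.96 large-sample cut-off, and the fallback logic are ours, not the thesis's notation:

```python
import numpy as np

# Sketch: per-user baseline and rating trend via a bivariate OLS regression of
# ratings on days since the user's first rating, with a fallback to the user's
# mean rating when the time span or significance requirement is not met.
def user_bias(ratings, days, overall_mean, min_time_sd=60.0):
    """Return (baseline, trend); trend is None if discarded."""
    t = np.asarray(days, dtype=float) - min(days)
    r = np.asarray(ratings, dtype=float)
    if t.std() >= min_time_sd and len(r) > 2:
        # OLS slope and intercept: r = a + b * t
        b = np.cov(t, r, bias=True)[0, 1] / t.var()
        a = r.mean() - b * t.mean()
        # crude significance check via the slope's t-statistic
        resid = r - (a + b * t)
        se_b = np.sqrt((resid ** 2).sum() / (len(r) - 2)
                       / ((t - t.mean()) ** 2).sum())
        if se_b > 0 and abs(b / se_b) > 1.96:   # ~5% level, large-sample approx.
            return a - overall_mean, b
    # fallback: discard the trend, use the user's mean rating
    return r.mean() - overall_mean, None

# toy usage: a long-active user with a clear upward trend vs. a new user
rng = np.random.default_rng(2)
days = np.arange(0, 241, 6.0)
ratings = 3.0 + 0.004 * days + rng.normal(0.0, 0.05, days.size)
base, trend = user_bias(ratings, days, overall_mean=3.5)
short_base, short_trend = user_bias([4.0, 3.0, 5.0, 4.0], [0, 1, 2, 3], 3.5)
```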
The item biases are estimated in the same way, using auxiliary regressions of the item's ratings on time. Again, the time resolution here is set to one day. In contrast to the user bias, we expect movie popularity to change more slowly and thus require the time frame between a movie's first and last ratings to be at least 240 days.
The estimates for the parameters that capture the user's reaction to the movie bias can now be obtained in two steps: First, we fix the user and item parameters in equation (3.7) at their estimated values and ignore the model's part that concerns the user-item interactions, i.e. we set the interaction term to zero. Again, we are allowed to ignore the user-item effects because they are conceptually unrelated to the user and item inherent specifics. Given the fixed parameter values, for each of the user's ratings we calculate the difference between the actual rating and the user's bias, as well as the value of the movie bias. In the second step, we solve the following regression problem of this difference on the movie bias:

(3.19)

The rationale behind this regression is that the difference is intended to capture the part of the rating that is not due to the user bias but varies with the item's bias and with time. Since the movie bias accounts for both of these factors and the difference is cleared of the user's bias, the slope estimate from (3.19) provides precisely this knowledge. The regression's constant term hence captures the stable part of the effect.
Analogously to the user and item bias cases, we discard parameters that do not reach the chosen significance level. Here, we also require the user to have rated for at least 120 days. For users who do not fulfill this requirement, and for those whose regression parameters both turned out to be insignificant, we discard the reaction effect from the model and set the value
of the corresponding parameter to a fixed constant. In this case, both the upper and the lower confidence limits are also set equal to this value, which allows for no variation of the parameter within the optimization process. In other words, if the user does not exhibit any statistically significant reaction to movie average ratings, or cannot be assumed to have one, s/he is also not expected to have it. This is equivalent to dropping the corresponding terms from our model.
At this point, we have obtained the initial estimates for the effects that are not involved in the user-item interaction and have cleared our model of parameters that seem irrelevant for describing the preferences of a particular user. In the next step, the residual variance that is associated with the actual user-item interaction is to be explained by the movie attributes, and the initial values of the corresponding part-worths are to be estimated. The next section is dedicated to these questions.
3.2.1.3 Estimating Attribute Part-worths
Contrary to the user and item biases, user attitudes toward movie characteristics are not necessarily mutually independent. In fact, they may even be thoroughly related to each other. For instance, a moviegoer may perceive different movie attributes as a signal of the same expected "quality" of a movie: Clint Eastwood may be strongly associated with protracted westerns containing little dialog but much disquieting music; Pixar with entertaining, high-quality computer animation; Andrey Tarkovsky with contemplative, surrealistically framed Soviet classics; France with Alain Delon, Gérard Depardieu, and arty plots; and so on.
On the other hand, correlations between movie attributes are inherent to the movie attribute data themselves. Some actors exhibit a tendency to appear in movies of a specific genre, e.g., Bruce Willis is known to act mainly in action movies; directors tend to engage the same stars in their films, e.g., Quentin Tarantino is known to have a stable "team" of actors. Strong correlations may also occur between directors, genres, and producers; between producers and writers; between studios and directors; etc.

Consequently, we inevitably come across the problem of OLS model underspecification and have to account for the bias of the parameter estimates and their variances in our auxiliary regressions (see Section 3.2.1.1). However, although we proposed a method to correct for
the omitted variable bias, there is another problem we may confront during parameter initialization: multicollinearity.

Multicollinearity is associated with the risk of poorly estimated coefficients in the auxiliary regressions (3.12) and, in the extreme case, may preclude their estimation entirely. This is particularly the case when both variables are (nearly) perfectly correlated, i.e. their correlation coefficient equals or is close to ±1. In such cases the solution to (3.13) is either highly biased or indeterminate, so that the effects of two highly correlated variables cannot be reliably separated from each other (for proof see Gujarati 2004, pp. 345-346). Thus, the bias of the parameter estimates and their variances cannot be ruled out, which, again, might lead to wrong conclusions about parameter significance and to an erroneous estimation of the confidence limits.
Nevertheless, the joint effect of two highly correlated variables can still be estimated and is given by expression (3.11) (Gujarati 2004, pp. 347, 511). We utilize this property to mitigate the problem of multicollinearity in our setting. We argue that knowledge about the joint effect of two or more highly correlated variables is enough to describe user preferences in terms of our model (3.7): if some attributes (nearly) always occur in movies jointly, their relative contributions to the user's preference become irrelevant because they always affect the preference jointly. Hence, we examine the database of movie attributes for pairwise correlations. From each pair of highly correlated attributes, we eliminate the one that is less helpful for discriminating between the movies, i.e. the one that exhibits the lower variance in the dataset. Notice that this elimination happens in the global scope of the data and is not done for each user separately.
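The elimination step can be sketched as follows. The 0.9 correlation threshold and all names are illustrative assumptions, not the thesis's actual values:

```python
import numpy as np

# Sketch: from each pair of highly correlated attribute columns, drop the one
# with the lower variance (i.e. the less discriminative one).
def prune_correlated_attributes(X, threshold=0.9):
    """X: (n_movies, n_attributes) binary attribute matrix.
    Returns the sorted indices of the columns to keep."""
    corr = np.corrcoef(X, rowvar=False)
    keep = set(range(X.shape[1]))
    var = X.var(axis=0)
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if i in keep and j in keep and abs(corr[i, j]) > threshold:
                # eliminate the lower-variance attribute of the pair
                keep.discard(i if var[i] < var[j] else j)
    return sorted(keep)

# toy usage: the first two attributes are perfectly correlated duplicates
a = np.array([1, 0, 1, 1, 0, 0])
b = np.array([0, 1, 1, 0, 1, 0])
X = np.column_stack([a, a, b])
kept = prune_correlated_attributes(X)
```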
In the next step, we estimate the regression coefficients of the pairwise auxiliary regressions (3.12). That is, each of the attribute-describing variables is regressed on each of the remaining variables that constitute the movie attribute vector. In this procedure, we zero out the values of insignificant regression coefficients, which means that one variable does not "influence" the other in terms of (3.12) and thus must not be accounted for in the process of bias correction. In doing so, we obtain a set of auxiliary parameters that will later be used for correcting the bias of the estimates and their variances as described in Section 3.2.1.1. This operation is also done in the global scope of the data, not on the individual user level. The rationale behind this is as follows:
On the one hand, equations (3.12) and (3.13) aim to separate the effect on the user's rating of the variable included in (3.10) from the effect of the omitted variable on the included one. Notice, however, that the effect of a movie attribute on the user's rating takes place in the scope of an individual user, i.e. it is relevant only for a specific user, whereas the effect of one attribute on another applies to movie attributes in general. The auxiliary slope can thus be thought of as an adjustment of the "explanatory power" of the included attribute by the part of it that is explained by the omitted one, which, in turn, happens in the global scope.

On the other hand, performing these auxiliary regressions in the global scope allows us to account for the OLS model underspecification and to reduce the issue of multicollinearity while estimating the part-worths of an individual user. Consider an example where a user has only rated the "Lord of the Rings" trilogy. Since all episodes of the trilogy were directed by Peter Jackson and engage a constant set of stars, all attributes describing the episodes are perfectly correlated. This would result in equal estimates for all of the attributes' parameters in our separate auxiliary regressions, which, in turn, would cause (3.13) to have an indeterminate solution. However, because the auxiliary slopes were estimated in the global scope, their values are unlikely to be the same, since Peter Jackson has also directed other films and the stars have acted in other movies as well. Hence, although the biased estimates remain equal, the unequal auxiliary slopes clarify the effects of the omitted variables to different degrees and thus lead to a determinate solution of (3.13), as shown in (3.14).
After the preparations described above, we are now ready to estimate the attribute part-worth parameters. For each user, we run a set of regressions of the form

(3.20)

where the two parameters of interest designate, respectively, the static and the time-dependent components of the part-worth of the $j$th movie attribute for the user; the $j$th component of the movie's characteristics vector is a binary dummy variable with the value of 1 if the corresponding attribute is present in the movie's characteristics and 0 otherwise; the regression additionally contains a constant term; and time is counted in days from the first rating in the dataset. Analogously to the user bias estimation, we require the user to have rated movies for at least 120 days in order to be able to capture the time-changing component of the part-worths. For users who do not fulfill this requirement, we discard the time-dependent component and estimate a simplified regression
(3.21)

This simplified OLS model is also used in cases where the "complete" model (3.20) cannot be estimated due to insufficient data. In such cases, the corresponding time-dependent parameters are also discarded.
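The simplified per-user regression can be sketched as follows. This is our own illustration: the residuals, the attribute names, and the least-squares call stand in for the thesis's notation:

```python
import numpy as np

# Sketch: static part-worths for one user, regressing the user's rating
# residuals (after removing biases) on binary movie attribute dummies,
# i.e. a simplified model without the time-dependent component, cf. (3.21).
def part_worths(residuals, attributes):
    """residuals: (n_ratings,) rating residuals after removing biases.
    attributes: (n_ratings, n_attrs) 0/1 dummy matrix of movie attributes.
    Returns (constant term, part-worth vector) via least squares."""
    X = np.column_stack([np.ones(len(residuals)), attributes])
    coef, *_ = np.linalg.lstsq(X, residuals, rcond=None)
    return coef[0], coef[1:]

# toy example: 'action' adds +1.0, 'comedy' adds -0.5 to the residual
attrs = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])
resid = 0.2 + attrs @ np.array([1.0, -0.5])
c, w = part_worths(resid, attrs)
```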
Then, to correct for the omitted variable bias, the estimated parameters are pooled together with the previously derived auxiliary parameters to form, analogously to (3.13), a system of equations of the form

$\hat{\alpha}_j = \beta_j + \sum_{k \neq j} b_{kj}\,\beta_k$   (3.22)

where $\hat{\alpha}_j$ denotes the estimated value of the $j$th parameter (i.e. a static or a time-dependent part-worth component), $\beta_j$ denotes its unbiased value (see Section 3.2.1.1 for details), and $k$ runs over the indices of the remaining parameters.
This equation system is solved by means of the SVD technique as described in Press et al. (2007, Chapter 2.6).43 We choose to employ SVD because it is capable of handling ill-conditioned44 equation systems and in such cases provides an optimal solution in terms of least squares (Press et al. 2007). Generally, since we have taken precautions against the risk of multicollinearity as described above, we do not expect the system (3.22) to be ill-conditioned. Nevertheless, we cannot ensure this for the vast variety of cases that may occur during the estimation process. Thus, by utilizing SVD, we guarantee that our algorithm obtains a solution for (3.22) in any case.
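The SVD-based solution of a possibly singular system can be sketched as follows; the truncation of near-zero singular values follows the standard approach described in Press et al. (2007, Chapter 2.6), while the matrix and names are illustrative:

```python
import numpy as np

# Sketch: solve A beta = alpha in the least-squares sense via SVD, zeroing
# the reciprocals of negligible singular values to handle ill-conditioning.
def svd_solve(A, alpha, rcond=1e-10):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ alpha))

# Example: a singular 3x3 system (third row duplicates the first) still
# yields the minimum-norm least-squares solution.
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [1.0, 0.5, 0.0]])
alpha = np.array([1.0, 0.8, 1.0])
beta = svd_solve(A, alpha)
```

Because the truncated-SVD solution coincides with the Moore-Penrose pseudoinverse solution, it degrades gracefully from the exact solution (full-rank case) to the minimum-norm least-squares solution (rank-deficient case).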
Using the solution to (3.22), we recalculate the variances of the estimated parameters as described in Section 3.2.1.1 after expression (3.17). Now we are able to perform the test of the parameters' significance as provided by (3.15). The parameters that do not reach the chosen significance level are discarded, and the confidence limits for the remaining parameters are estimated according to (3.16).
43 Since the SVD method is one of the standard methods for solving linear equations, its description lies outside the scope of the current thesis. For a detailed and comprehensive introduction to SVD, we refer to Press et al. (2007).
44 A system of linear equations is ill-conditioned when its underlying matrix is not of full rank, i.e. when linear dependencies are present between the rows or the columns of the equation matrix, that is, between the variables or between the equations or both (e.g. Press et al. 2007).
With the procedure presented above, we finalize the estimation of the initial values for the parameters of our model of user preferences. As described in the introductory part of this chapter, the initial parameter estimates are then passed to an optimization method as the coordinates of the starting point for optimization in multiple dimensions, with the aim of obtaining a parameter solution that is closer to the optimum. The method and the process of optimization are described in the next section.
3.2.2 Step 2: Optimization of the Parameters
In the field of numerical methods, the term "optimization" refers to (usually iterative) mathematical procedures which strive to find the best available values of the parameters of some objective function (e.g. Press et al. 2007; Lange 2010). In our case, we strive to find those parameter values of model (3.7) which yield the minimum possible error for the model's predictions. This aim corresponds to finding the minimum of the quadratic loss function (3.8). We choose the quadratic form in (3.8) because its U-shape ensures that the loss function has a single extremum, i.e. a definite global minimum, and because it penalizes larger errors by magnitude, thus potentially reducing the error in the final solution.
Typical methods to solve such an optimization problem are the method of steepest gradient descent and the conjugate gradient method. Both methods are based on the same idea of iteratively approaching the minimum of the optimized function through stepwise updates of the solution in the direction opposite to the function's gradient, i.e. the direction of the function's fastest descent.45 The difference between the two methods is that steepest gradient descent optimizes only one dimension in each iteration and "steps" in the direction of the dimension that exhibits the highest value of the function's first partial derivative at a given point, whereas the conjugate gradient method considers all dimensions of the function's space when choosing the direction to move in, so that the minimization along one direction is not "spoiled" by subsequent minimization along another. This avoids cycling through a set of directions and hence reduces the number of iterations needed to reach the optimum (Press et al. 2007, Chapter 10.7). Figure 3.2 visualizes the difference between the two methods.

45 Since both methods are well-known standard methods for optimization, we do not discuss them in detail and refer the reader to Press et al. (2007) for an in-depth description.
Figure 3.2: Successive minimization with gradient methods. (a) Steepest gradient descent is less efficient than (b) the conjugate gradient method: it takes more steps to reach the minimum, crossing and re-crossing the principal axis. Graphics adapted from Komarek (2004), p. 11.
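The contrast between the two methods can be reproduced on a small ill-conditioned quadratic. The following sketch is our own illustration (not part of the thesis's algorithm), minimizing f(x) = ½ xᵀAx − bᵀx with both methods:

```python
import numpy as np

# Compare steepest descent (with exact line search) against the linear
# conjugate gradient method on an elongated quadratic 'valley'.
A = np.array([[10.0, 0.0], [0.0, 1.0]])   # ill-conditioned Hessian
b = np.array([1.0, 1.0])

def steepest_descent(x, iters):
    for _ in range(iters):
        g = A @ x - b                     # gradient of f
        step = (g @ g) / (g @ A @ g)      # exact line search along -g
        x = x - step * g
    return x

def conjugate_gradient(x):
    r = b - A @ x
    d = r.copy()
    for _ in range(len(b)):               # exact in at most n steps
        alpha = (r @ r) / (d @ A @ d)
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return x

x_star = np.linalg.solve(A, b)            # true minimum
x_cg = conjugate_gradient(np.zeros(2))    # exact after n = 2 steps
x_sd = steepest_descent(np.zeros(2), 2)   # still far off after 2 steps
```

After the same number of steps, conjugate gradients has reached the minimum exactly, while steepest descent is still zig-zagging across the valley, mirroring the behavior shown in Figure 3.2.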
Although both methods are suitable for the minimization of our loss function and converge to the same solution (Press et al. 2007), we choose to employ the conjugate gradient method because of its higher efficiency. However, due to the specifics of our task, some adjustments have to be made to the method. These adjustments comprise (i) the initialization of the starting point of the optimization, (ii) the restriction of the optimization procedure to the confidence limits of the parameters, and (iii) measures for preventing overfitting.
We initialize the optimization process with the parameter values obtained by means of the auxiliary regressions described in Section 3.2.1. This initialization plays a crucial role for the convergence of the optimization method. Not only does it reduce the number of iterations needed to achieve the optimal solution, but, together with the restriction of the optimization to the parameters' confidence intervals, it also helps to ensure that the solution we achieve is the true one: Recall that for most users our model (3.7) is underdetermined, i.e. the number of parameters to estimate is greater than the number of data points. This fact "relaxes" the optimization procedure by making it possible to have more than one solution. Note that these additional solutions, i.e. "local" optima, are not caused by the form of the loss function, but rather represent a set of possible spatial dispositions of the n-dimensional U-shape that satisfy (3.8). By initializing the optimization with the values obtained through statistical techniques (see Section 3.2.1), we ensure that the starting point of the optimization already lies near the "true" minimum of the loss function (3.8).
By restricting the "area" of optimization to the confidence limits of the parameters, we additionally ensure that the true solution is sought in the region of the parameter space where it is most likely to occur by virtue of our auxiliary regressions. In other words, by not allowing the optimization procedure to leave the area constrained by the confidence limits, we remove the risk of "slipping" into the area of a local minimum.
Another issue caused by the underdetermination of the model is the tendency to overfit (e.g., Koren 2009), i.e. to find parameter values that fit the available data well but exhibit large errors when making predictions for data not included in the optimization process. In order to counteract overfitting, and thus to keep the model generalizable and suitable for predicting future ratings, we utilize a holdout set of six randomly drawn ratings for each user. The ratings contained in the holdout set are completely excluded from the whole procedure of learning the parameter values, i.e. these ratings are used neither in the auxiliary regressions nor for parameter optimization. Instead, they are used in the gradient method for determining the stopping point of the optimization in a way that prevents overfitting. In particular, in each iteration the value of the loss function (3.8) is independently calculated using the holdout data set. The optimization is stopped when neither the error value on the holdout data nor the "original" error value of the method decreases with respect to the corresponding value from the previous iteration. Figure 3.3 shows the flowchart of the optimization step of our algorithm. Our adjustments to the original method are marked in bold.
Figure 3.3: Flowchart of the optimization step (bold font face in the original figure indicates our modifications of the original method):
1. Start with the initial parameter values; set the value of the error function for the training set to e_i = ∞ and for the holdout set to e_h,i = ∞.
2. Calculate the gradient and determine the conjugate direction for optimization as well as the step sizes in each direction.
3. Calculate the value of the loss function for the training set (e_i) and for the holdout set (e_h,i).
4. If neither e_i < e_{i-1} nor e_h,i < e_h,i-1, save the parameter values and stop.
5. Otherwise, for each parameter: adjust the parameter's value according to the method; if the new value lies outside the parameter's confidence limits, set it equal to the boundary value of its confidence interval. Return to step 2.
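The loop of Figure 3.3 can be sketched as follows. This is an illustrative simplification: the model is abstracted into loss and gradient callables, a plain gradient step stands in for the conjugate direction update, and all names are ours:

```python
import numpy as np

# Sketch: gradient optimization with confidence-limit clipping and
# holdout-based early stopping, following the flowchart above.
def optimize(theta, lo, hi, loss_train, loss_holdout, grad, step=0.05,
             max_iter=1000):
    e_prev, eh_prev = np.inf, np.inf
    while max_iter > 0:
        max_iter -= 1
        # adjust parameters, then clip to the confidence-limit box [lo, hi]
        theta_new = np.clip(theta - step * grad(theta), lo, hi)
        e, eh = loss_train(theta_new), loss_holdout(theta_new)
        if not (e < e_prev or eh < eh_prev):   # no improvement anywhere: stop
            break
        theta, e_prev, eh_prev = theta_new, e, eh
    return theta

# toy quadratic loss with minimum at (1, 3); confidence box [0, 2] x [0, 2]
target = np.array([1.0, 3.0])
f = lambda th: float(((th - target) ** 2).sum())
g = lambda th: 2.0 * (th - target)
theta = optimize(np.zeros(2), np.zeros(2), np.full(2, 2.0), f, f, g)
```

In this toy run, the first coordinate converges to its unconstrained optimum (1.0), while the second is held at the boundary of its confidence interval (2.0), exactly the behavior the confidence-limit restriction is meant to enforce.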
The gradient of the loss function is calculated, according to its definition, as the vector of partial derivatives with respect to the corresponding parameters. In each iteration, every parameter $\theta$ of model (3.7) is adjusted by a magnitude proportional to the step size in the direction opposite to the gradient:

$\theta \leftarrow \theta + \gamma_\theta \, e_{ui} \, \frac{\partial \hat{r}_{ui}}{\partial \theta}$   (3.23)

where $\gamma_\theta$ denotes the step size in the direction of the corresponding parameter and $e_{ui}$ designates the prediction error of the user's rating calculated with the parameter values of the current iteration.
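For a single rating, the generic update rule applied to a bias-only stand-in for the model (user bias and item bias only; all names are illustrative assumptions) looks like:

```python
# Sketch: one stochastic update of user and item bias for a single observed
# rating, following theta <- theta + gamma * e * d(r_hat)/d(theta).
# The bias-only model r_hat = mu + b_u + b_i is a simplified stand-in for (3.7).
def sgd_update(r, mu, b_u, b_i, gamma=0.01):
    r_hat = mu + b_u + b_i
    e = r - r_hat                  # prediction error e_ui
    # d(r_hat)/d(b_u) = d(r_hat)/d(b_i) = 1 for this toy model
    return b_u + gamma * e, b_i + gamma * e

mu, b_u, b_i = 3.5, 0.0, 0.0
for _ in range(500):               # repeatedly fit a single rating r = 4.5
    b_u, b_i = sgd_update(4.5, mu, b_u, b_i)
```

With each pass the prediction error shrinks geometrically, so the two biases split the residual 4.5 − 3.5 = 1.0 evenly between them.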
In the procedure described above, we obtain the final estimates for the parameters of the model of user preferences (3.7) and are now able to predict the users' future ratings and thus to provide recommendations to the users. The knowledge of the parameter significance as well as of the parameter values allows the generation of explanations of the recommendations in the pros-and-cons style, as established in Section 2.1.3.

Our task could be seen as complete herewith. However, there is still potential to increase the quality of recommendations for the overall set of the RS's users. Thus, before we proceed to the empirical test of our proposed method, we will discuss this potential and motivate the hybridization of our recommendation algorithm. The next section is dedicated to this topic.
3.3 Hybridization with Collaborative Approaches
3.3.1 Motivation for Hybridization
Recall our discussion in Section 2.2.2, which we initiated with the assertion that movies are experience goods and in which we highlighted the hedonic nature of movie consumption in contradistinction to the consumption of utilitarian goods. Precisely these aspects, together with the problem of automatically extracting meaningful and preference-relevant attributes from multimedia content, complicate the derivation of movie attributes that are descriptive of the preferences of movie consumers. Consequently, the preference-relevant movie attributes that we derived for the operationalization of consumer preferences, although chosen carefully to capture the major part of the latter, might in some cases not fully cover all aspects underlying the emergence of consumer preferences.
For example, such a characteristic of a movie's goodness as the depth and dynamics of character development may be described well through the attributes "actors", "writers" and "directors", since these undoubtedly contribute to character development and tend to exhibit, throughout their work, general tendencies or affinities that correlate with the mentioned characteristic. However, in particular movies these tendencies may not necessarily surface. On the other hand, a consumer for whose preferences character development plays an essential role may not always consider this characteristic good and may, for instance, disfavor protracted stories. What is more, a consumer may have conflicting tastes that depend, for example, on the context in which s/he watches a movie: in some situations the consumer may prefer thoughtful motion pictures with intricate storylines, whereas at other times s/he may be more interested in light-hearted entertainment. Furthermore, a consumer may pay more attention to other aspects of movies that do not correlate with our list of attributes, e.g., an overall "message" that leaves its mark on the soul. Finally, the data on which we base the estimation of the user's attribute preferences may simply be insufficient for our suggested procedure to uncover the user's preference structure. Although we proposed a method that is capable of estimating part-worths under underdetermined conditions, it still cannot extract the preferences from
data that do not contain them. For instance, if a user has a strong attitude toward a particular star but has not rated any movie starring this actor, our algorithm has no basis to deduce the user's preference score for the corresponding attribute. The latter represents the problem of overspecialization that is inherent to content-based techniques (see Section 2.3.3.3).
Some of the aspects mentioned above could be accounted for by introducing interaction effects into our model of user preferences. This, however, would increase the complexity of our already complicated model by an order of magnitude, which may make it practically impossible to estimate the model's parameters reliably. Other aspects, such as the consumption context, cannot be addressed in our approach without the use of additional information, the collection of which entails additional interactions with the user. Not to mention that such interactions potentially decrease the recommendation efficiency radically and thus may cancel out the benefits of a RS to a movie consumer. Additional information, i.e. additional ratings, is also needed for deducing the part-worths of attributes that do not apply to the movies rated by the user, i.e. for counteracting the overspecialization problem.
Recommendation approaches that do not utilize item attributes in the recommendation process can help to counteract the potential problems mentioned above. Because these approaches do not rely on item attributes, they are more likely to capture relationships between ratings and items that go beyond the attribute preferences. Thus, in those cases in which the concepts underlying such relationships are more valuable to a user and cannot be captured by our proposed content-based method to a satisfactory degree, these approaches may produce more reliable predictions of the user's preferences. Furthermore, because these approaches are not subject to the overspecialization problem, they are able to predict ratings for movies with attributes whose part-worths could not be addressed by our approach. This potentially allows us to enrich the set of movies that come into question for recommendation with movies that possess higher predicted ratings than the ones selected by our method, and thus to potentially increase the effectiveness of recommendation.

Hence, it seems sensible to extend our approach with the predictions provided by other recommendation techniques. The two questions that we need to answer in this context are (i) which method(s) we should combine with our algorithm and (ii) how the combination should be accomplished.
3.3.2 Methods to Hybridize and the Method of Hybridization
As discussed in Section 2.3.4, several strategies can be followed for constructing a hybrid recommender. Notice, however, that the majority of hybridization strategies aim to increase solely the accuracy of recommendations, whereas accuracy is only one of the concurrent objectives that the current thesis pursues: In Section 2.1.1 we showed that explanations of recommendations play an important role for the users' perception of the RS's transparency as well as for the users' acceptance of and trust in the RS. Moreover, explanations increase the effectiveness of the users' choices. Hence, we search for a hybridization solution that counteracts the problems described in the previous section while maintaining the advantages of explanations.
Now, recall the discussion of different explanation styles provided in Section 2.1.2. Each recommendation method is associated with a particular explanation style, which is caused by the specifics of the method's process of recommendation generation. Each explanation style, in turn, exhibits a different potential to increase the users' satisfaction with a RS and the users' ability to accurately assess the true quality of recommended items. It was shown that the nearest-neighbor explanation style inherent to user-based CF performs the worst of all the explanation styles discussed; even more, this style may lead to the users' mistrust of the RS. On the contrary, the keyword and influence explanation styles (available to CB methods and to item-based CF, respectively) were found to be effective at enabling accurate assessments. Although there is no overall agreement on whether the keyword style dominates the influence style or vice versa, the combination of both was found to lead to the best results in terms of overall satisfaction and quality assessment. To complete our discussion of the different explanation styles and the corresponding recommendation methods, notice that MF methods allow for no meaningful explanations because they base their recommendations on an uninterpretable factor solution (see Section 2.3.1.3).
Taking into account that the keyword explanation style is a shorter version of the pros-
and-cons style (see Section 2.1.3) that our proposed method implements, and accounting for
our objective to provide effective explanations along with accurate recommendations, the best
performing combination of explanation styles suggests the answer to the question of which
methods to combine within our hybrid recommender. That is, an item-based CF method
should extend our proposed content-based approach.
The answer to the question of how the predictions of both methods should be combined
is easier. Because we want to maintain the explicability of our recommendations, we may not
combine the prediction results of the different recommendation methods through mathemati-
cal operations, e.g., averaging or weighting. Otherwise, we would lose the association be-
tween the recommendation result and the underlying explanation of its 'original' method and
so would be unable to explain the reasoning behind the recommendation. Hence, we have to
use the 'pure' rating predictions of the method that performs best for a given user in terms of
accuracy.
In order to compare the accuracy of the two methods constituting our hybrid, we sug-
gest utilizing the holdout set of six randomly drawn ratings per user that we used for deter-
mining the stop point of the optimization (see Section 3.2.2). We chose to utilize the same
holdout set despite the criticism that may arise that our model of user preference estimation
is already trained to this data. We argue that this is of minor importance for comparing the
accuracy of the two hybridized methods. Firstly, the holdout set was employed to increase
the generalizability of the estimated model parameters and to prevent overfitting to the train-
ing data. Because the ratings of the holdout set were excluded from the actual optimization
procedure and used to calculate the value of the loss function 'externally' with respect to the
training data, the optimization is accomplished such that the model's prediction accuracy for
unseen data should already be very similar to its accuracy on the holdout set. Secondly, even
if our model of user preferences overfits the holdout data, so that the 'combining' algorithm
would prefer our model's predictions over the CF predictions for the final recommendations,
the effect of this preference could only decrease the overall accuracy of the hybrid method; it
therefore provides our hybrid no advantage in the comparison of the prediction accuracy of
the different recommendation algorithms that will be given in Section 4.4 below. Finally,
although we admit that a separate holdout set, not employed in any calculation except the
comparison of the predictive accuracy of the hybridized methods, would be methodically
desirable, we have no other choice: Given the median of 13 ratings per user in the dataset of
MoviePilot (see Table 4.1), and bearing in mind that we need another holdout set for valida-
tion purposes, we would risk exhausting most of the data on holdout sets. In this case, the
users that only have a few ratings would be underrepresented
in our empirical study, which, in turn, would question the generalizability of its results. On the
other hand, in practical settings our proposed recommendation method would then require a
user to rate at least 24 items (18 for the three holdouts and 6 for the predictions) before being
able to recommend. Such an amount of 'warm-up' ratings may be impracticable for a consid-
erable number of users; again, as can be seen from Table 4.1, MoviePilot would not be able to
recommend anything to the majority of its actual users. Thus, in this tradeoff, we decide to
trade the potentially inferior accuracy of our hybrid for the benefit of the generalizability of
the results and a higher attractiveness of our method to practitioners.
To summarize the above: In our hybrid of the proposed algorithm and item-based CF,
we suggest generating predictions of future ratings by means of whichever of the two methods
performs best on the same holdout set that is used in the optimization procedure described in
Section 3.2.2. For determining the best performing method we propose utilizing Student's
t-test for paired samples. That is, the method that exhibits a significantly lower prediction
error on the holdout set is considered best and is used for future predictions. If, however, the
difference between the errors is not significant, we will use the predictions of model (3.7),
even if its error on the holdout set is greater than the error of item-based CF. Again, in doing
this, we trade formal accuracy for more effective explicability.
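To make the switching rule concrete, the per-user decision can be sketched as follows. This is an illustration only, not the thesis' implementation: the function name, the error inputs, and the hard-coded two-sided critical value t(.975, df = 5) ≈ 2.571 for a six-rating holdout are our assumptions.

```python
import math

# Two-sided critical value t(.975, df = 5) for a six-rating holdout;
# the constant 2.571 is an assumption chosen for this illustration.
T_CRIT = 2.571

def choose_method(errors_cb, errors_cf, t_crit=T_CRIT):
    """Switching hybrid: decide which method predicts for a given user.

    errors_cb / errors_cf: absolute prediction errors of the content-based
    model (3.7) and of item-based CF on the same per-user holdout ratings.
    Only a *significantly* more accurate CF switches the user to CF;
    otherwise the content-based model wins, preserving its pros-and-cons
    explanations.
    """
    n = len(errors_cb)
    diffs = [cb - cf for cb, cf in zip(errors_cb, errors_cf)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:                        # identical differences: no test possible
        return "content-based"
    t = mean_d / math.sqrt(var_d / n)     # paired t statistic
    return "item-based CF" if t > t_crit else "content-based"
```

Note the deliberate asymmetry: a non-significant difference defaults to the content-based model, mirroring the rule stated above.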
The discussion of this section closes the description of our proposed conceptual frame-
work of a hybrid recommender system that allows for effective explanations of recommenda-
tions. In the next chapter we present an empirical study which evaluates our proposed method
and compares it with key recommendation algorithms.
Chapter 4
Empirical Study
In the previous chapters, we built theoretical foundations and developed a recommenda-
tion method that achieves our objectives: a method that is capable of providing both (accu-
rately predicted) recommendations and actionable explanations of the reasoning behind them,
as well as aligning the recommendation process with the user's preferences. Whereas the
alignment of the recommendation process with user preferences is given by the design of the
method (see esp. Sections 2.2 and 3.1) and its ability to provide actionable explanations is
justified theoretically (see esp. Sections 2.1, 3.1 and 3.3.2), the claim that the predictions are
accurate still requires proof.
Indeed, our method requires estimating a considerable number of parameters, which in
many cases exceeds the number of data points available for the estimation procedure (see Sec-
tion 3.2). This fact may raise doubts as to whether the estimates produced are capable of reli-
ably predicting user preferences and are good enough in comparison with established recom-
mendation methods. That is, proof is needed that our method is applicable in recommender
systems practice and provides advantages to the latter.
On the other hand, by means of the hybridization of our proposed method, we 'secured'
that the hybrid's predictions are at least as accurate as the predictions of its item-based CF
component (see Section 3.3.2). Thus, another quantity of interest is the number of times our
proposed preference estimation method applies, relative to the number of times the CF com-
ponent is used for generating the final rating predictions. At the same time, this quantity char-
acterizes the relative number of times when explanations
are provided to the users in the most effective pros-and-cons style rather than in the second-
best keyword explanation style.
To answer these questions we conduct an empirical study that tests different recommen-
dation techniques on real-world rating data: the dataset of the German commercial movie rec-
ommendation system MoviePilot.com and the dataset of the US online DVD rental service
Netflix.com. By using two datasets for our tests, we 'secure' the generalizability of the com-
parison results to other cases of movie recommendation and demonstrate the potential for
porting our method to other recommendation domains.
We compare the accuracy of our proposed method with the accuracy of the key collabo-
rative recommendation techniques that were described in detail in Section 2.3, i.e. user-based
CF, item-based CF and the matrix factorization method. As matrix factorization is known to
provide one of the best predictive accuracies among 'pure', i.e. non-hybridized, recommenda-
tion algorithms (e.g. Funk 2006; Paterek 2007; Bell, Koren, and Volinsky 2007b, 2008; Koren
2009), we consider a comparison with this algorithm to be the most informative for judging
the prediction accuracy of other algorithms.
The comparison is made on the basis of holdout data which is not involved in the train-
ing procedure of any algorithm. The task of the algorithms therefore consists in predicting the
ratings of the holdout data. The difference between the predicted ratings and the actual ones
then serves to calculate the accuracy measures. Further details on the comparison procedure
and the data employed, as well as the results of the empirical study, are described in the suc-
ceeding sections.
4.1 Datasets and their Properties
As mentioned above, two real-world datasets are involved in our study. We chose the
Netflix dataset because it underlies the majority of recommender research done in recent
years, driven by the research community's interest in the Netflix competition, which promised
a prize of one million dollars to the individual or team who could top the prediction accuracy
of Netflix's own recommender by 10% with regard to RMSE. Hence, providing the accuracy
measures of our algorithm for this particular dataset makes our results comparable to a variety
of other recommendation methods discussed in the recent literature. MoviePilot's data was
used for the following reasons: The current research is performed within a research project
funded by the German Research Foundation (Deutsche Forschungsgemeinschaft; DFG) in
which MoviePilot acts as a cooperation partner. Through this cooperation, we gained full
access to information which could have influenced the rating data, e.g., changes in the scale
labeling, interface updates, etc. Such information is not available for the Netflix data: alt-
hough it is known that Netflix has altered its scale labels in the past (Koren 2009), no exact
details about the type of alteration or the date when it happened were ever published. From
his analysis of the Netflix data, Koren (2009) infers that the alteration might have happened in
early 2004, where the mean rating makes a sudden jump that would be hard to explain other-
wise. Furthermore, Netflix has made available only a subset of its rating data, stating this data
to be randomly drawn from the original rating dataset. However, as a commercial provider
funding a considerable prize, Netflix could have 'integrated' some artifacts into the published
dataset. For example, one of the users in the dataset has over 17,000 ratings. Assuming an
average movie runtime of one and a half hours, this person would have had to watch movies
without any breaks for almost three years; if this user spent only eight hours a day watching
movies, s/he would need more than eight years to watch them all. This seems rather unrealis-
tic. Netflix provided no comments on this or other artifacts that might have been introduced
into the data artificially. Contrary to Netflix's data, the dataset of MoviePilot we employ is
the complete set of the ratings provided to the recommender system by its users.
Each dataset represents a relational database with two tables. The first table contains the
four fields 'user_id', 'movie_id', 'timestamp' and 'rating', so that each row of the table as-
signs a rating to a concrete user and a concrete movie as well as to the exact date and time
when the rating was recorded by the system. The second table consists of two fields: 'mov-
ie_id' and 'movie_title'. To reduce ambiguity, movie titles are complemented with the mov-
ie's year of production.
The ratings in the Netflix dataset are represented on a 5-point scale with a 1-point step,
where 1 denotes the worst rating of a movie ("Hated It") and 5 indicates the best rating
("Really Liked It"). In the Netflix interface, the scale points correspond to the number of stars
that a user gave to a movie (see Figure 4.1a). MoviePilot presents its users an 11-point scale
varying from 0 ("Hated This Movie") to 10 ("My Favorite Movie") in .5-point steps; the rat-
ings are, however, saved in the database as values from 0 to 100, corresponding to a tenfold of
the rating that a user provides (i.e., a rating of 7.5 points is stored as 75). In the MoviePilot
interface, the ratings are surveyed from the users by means of a horizontal scale bar that sup-
ports a gradient fill effect and changes its caption text according to the currently selected
number of points (see Figure 4.1b).
Figure 4.1: Rating scales in the user interfaces of recommender systems.
(a) Netflix, captions from 1 to 5 stars: "hated it", "didn't like it", "liked it", "really liked it",
"loved it"; (b) MoviePilot, captions altering in 3.5-point intervals: "hated movie", "not inter-
ested", "average", "good", "my favorite movie".
Table 4.1 presents descriptive statistics for the raw datasets of MoviePilot and Netflix.
Table 4.1: Descriptive statistics of the raw rating datasets

                            MoviePilot          Netflix
General characteristics
  Number of ratings         1,389,749           100,480,507
  Number of users           14,528              480,189
  Number of movies          12,762              17,770
  Scale interval            0 – 10 (0 – 100)    1 – 5
  Scale step size           .5 (5)              1
  Time range                19-AUG-2006 –       11-APR-1999 –
                            04-APR-2008         31-DEC-2005
Ratings per user
  Min                       1                   1
  Max                       6,687               17,653
  Mean                      95                  209
  Median                    25                  96
  SD                        214.17              302.33
Ratings per movie
  Min                       1                   3
  Max                       6,546               232,944
  Mean                      108                 5,654
  Median                    13                  561
  SD                        345.62              16,909.67
Ratings per day
  Min                       1                   5
  Max                       78,164              737,570
  Mean                      2,583               46,049
  Median                    1,548               15,499
  SD                        4,498.33            58,558.61
However, in order to be able to perform our tests, both datasets were reduced as follows:
We set aside the six latest ratings per user as a holdout for out-of-sample predictions and the
computation of accuracy measures for the different recommender algorithms (in the follow-
ing, we refer to this holdout as the "validation set"). Another six ratings were drawn randomly
from each user's rating profile to build a holdout for operational purposes of our proposed
algorithm (in the following, the "operation holdout"; see Sections 3.2.2 and 3.3.2). Users for
whom there was not enough data to generate both holdouts were discarded. We further dis-
carded those users for whom fewer than six ratings remained after isolating both holdouts.
The descriptive statistics of the resulting datasets are summarized in
Table 4.2.
Table 4.2: Descriptive statistics of the datasets employed in the study

                        MoviePilot                          Netflix
                        training    operation  validation   training     operation  validation
                        set         holdout    set          set          holdout    set
General characteristics
  Number of ratings     1,140,577   47,610     47,610       93,170,314   2,570,310  2,570,310
  Number of users       7,935       7,935      7,935        428,385      428,385    428,385
  Number of movies      12,246      5,052      5,037        16,543       16,241     16,212
  Time range            19-AUG-06   20-AUG-06  20-AUG-06    11-NOV-99    06-JAN-00  05-JAN-00
                        04-APR-08   04-APR-08  04-APR-08    31-DEC-05    31-DEC-05  31-DEC-05
Ratings per user
  Min                   1           6          6            8            6          6
  Max                   6,535       6          6            16,419       6          6
  Mean                  143         6          6            217          6          6
  Median                59          6          6            101          6          6
  SD                    250.16      0          0            304.50       0          0
Ratings per movie
  Min                   1           1          1            2            1          1
  Max                   4,543       802        677          213,367      15,816     12,354
  Mean                  93          9          9            5,623        158        158
  Median                12          3          2            544          20         18
  SD                    262.90      36.23      35.7161      16,305.89    624.28     603.76
Ratings per day
  Min                   1           1          1            5            1          1
  Max                   56,194      3,413      3,629        703,924      27,936     17,202
  Mean                  2,120       106        104          42,631       1,283      1,242
  Median                1,293       50         52           15,167       38         61
  SD                    3,451.80    201.44     206.64       53,378.06    3,423.97   2,820.47
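The holdout construction described above can be sketched in code as follows. The function name, tuple layout, and the minimum-profile threshold of 18 ratings (6 + 6 held out plus at least 6 remaining for training) are our reading of the discarding rules, not the study's actual implementation.

```python
import random

def split_profile(ratings, rng=None):
    """ratings: one user's list of (timestamp, movie_id, value) tuples.

    Returns (training, operation_holdout, validation_set), or None if the
    user must be discarded for lack of data.
    """
    rng = rng or random.Random(0)
    if len(ratings) < 18:            # 6 + 6 held out, at least 6 left for training
        return None
    chrono = sorted(ratings)         # chronological order (timestamp first)
    validation = chrono[-6:]         # the six latest ratings
    rest = chrono[:-6]
    rng.shuffle(rest)
    operation = rest[:6]             # six randomly drawn ratings
    training = sorted(rest[6:])
    return training, operation, validation
```

Applied per user, this reproduces the constant "6" rows of the holdout columns in Table 4.2.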
The data about movie attributes (genres, acting stars, directors, writers, production
companies, budget, admissions, box office, year of production, country of origin; see Section
2.2.2 for the derivation and Appendix B for a detailed list of attributes) was obtained from
IMDb under a limited, non-commercial license46. The data is provided as a set of text files
that maintain connections between a particular movie title and a list of attributes of a specific
type (e.g. actors, countries of origin, etc.). We converted the text files to a database format
that is more convenient for our calculation purposes, so that each row of the data table repre-
sents a movie and each column represents a specific attribute. Non-metric, i.e. nominal, at-
tributes (such as actors, directors, country of origin, etc.) were coded as binary variables, with
1 denoting the presence of an attribute among a movie's characteristics and 0 otherwise. Met-
ric attributes (admissions, budget, box office, and year of production) were recoded as fol-
lows: Movie budgets and box office values were converted to a common currency (US dol-
lars) in order to unify the measurement units and thus to increase the consistency of the esti-
mation of the corresponding parameters in model (3.7). The movies' years of production were
recoded as the number of years from the current year. This rescales the corresponding varia-
ble's values by three orders of magnitude (e.g. the year 2009 is recoded as 2), which simpli-
fies comparisons of the production year's effect on the user's preference with the effects of
nominal parameters when inspecting parameter values 'manually'. It also alters the interpre-
tation of the parameter's values, so that negative values indicate a preference for newer mov-
ies, whereas positive values display a preference for older ones. That is, the meaning of the
parameter changes to "preference for older movies". Since this recoding represents an affine
transformation of the data (negating the values and adding a constant), it has no effect on ei-
ther the estimations or the predictions made with our algorithm and so serves only for the
convenience of visual inspection of the part-worth values. Since admissions are already
scaled in common measurement units (number of tickets sold at movie theaters), their values
were not modified.

46 Copyright message: "Information courtesy of The Internet Movie Database (http://www.imdb.com). Used
with permission". Licensing information can be obtained at http://www.imdb.com/licensing/ (for commercial
use) and http://www.imdb.com/help/show_leaf?usedatasoftware (for non-commercial and personal use).
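A sketch of this coding scheme may make it concrete. The dictionary layout, the helper name, and the reference year 2011 are our assumptions for illustration only:

```python
REFERENCE_YEAR = 2011  # assumption: the "current year" at data-preparation time

def encode_movie(movie, nominal_vocab):
    """movie: {'nominal': {attr: [values]}, 'budget_usd': ..., 'year': ...}.
    nominal_vocab: all 'attr:value' dummy columns seen in the full dataset."""
    row = {key: 0 for key in nominal_vocab}      # binary dummies, default absent
    for attr, values in movie["nominal"].items():
        for value in values:
            key = f"{attr}:{value}"
            if key in row:
                row[key] = 1                     # attribute present in this movie
    row["budget_usd"] = movie["budget_usd"]      # already converted to US dollars
    row["age"] = REFERENCE_YEAR - movie["year"]  # e.g. the year 2009 -> 2
    return row
```

A negative part-worth on `age` then reads directly as a preference for newer movies, matching the interpretation above.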
After the conversions described above, the IMDb data was merged with the datasets of
MoviePilot and Netflix by matching the movie titles and the corresponding years of produc-
tion contained in all the datasets considered. This step finalizes the preparation of the data for
the actual study and concludes the data description.
The next section introduces the measures that we employ for the comparison of the pre-
diction accuracy of the different recommender algorithms.
4.2 Measures of Prediction Accuracy
Prediction accuracy measures evaluate how close the ratings predicted by a recom-
mender algorithm are to the true user ratings (Herlocker et al. 2004). Two established accura-
cy measures that the majority of works in the research area of recommendation systems em-
ploy are the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). Formal-
ly, these measures are defined as follows:

\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{r}_i - r_i \right|   (4.1)

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{r}_i - r_i \right)^2}   (4.2)

where \hat{r}_i denotes a predicted rating, r_i the corresponding true rating, and N the number
of predictions evaluated.
Whereas MAE measures the average absolute deviation between a predicted rating
\hat{r}_i and a user's true rating r_i, RMSE puts more emphasis on large deviations by squar-
ing the single errors before summing them up. For instance, an error of one point increases
the error sum by one, while an error of two points increases the sum by four. Through empha-
sizing large errors, RMSE puts on par algorithms that constantly make moderate errors for all
ratings with those that predict ratings fairly well most of the time but make large errors in
some cases. As can be seen from equations (4.1) and (4.2), RMSE tends to be greater than
MAE and can never be smaller. RMSE equals MAE only in one specific case, namely when
all predictions contain an error of a constant magnitude, i.e. when |\hat{r}_i - r_i| = c for all i.
The meaning of MAE and RMSE can also be interpreted in statistical terms: Since
MAE is defined as the mean of the absolute errors, it represents the first absolute moment of
the error distribution, i.e. the expected magnitude of the error that an algorithm produces.
RMSE, according to its formal definition, is the square root of the second moment of the al-
gorithm's errors about zero. Hence, RMSE corresponds to the standard deviation of the errors
from the no-error point and therefore informs about the "width" of the error distribution. That
is, assuming a zero-mean normal distribution of the errors, about 68% of them lie in the inter-
val bounded by ±RMSE, about 95% of the errors are in the interval of ±2 RMSE, and the in-
terval of ±3 RMSE accounts for 99.7% of the errors. In other words, MAE and RMSE togeth-
er are informative about the distribution of the prediction errors. Hence, it seems sensible to
report both measures for the evaluation of the predictive accuracy of different algorithms.
Nevertheless, both MAE and RMSE depend on the scale on which the ratings are sur-
veyed from the users. That is, although these measures allow comparisons of different algo-
rithms with respect to their predictive accuracy, such comparisons remain informative only
when the algorithms are tested on the same dataset or when the datasets employ the same rat-
ing scale. The latter does not hold in our case, since MoviePilot and Netflix utilize different
rating scales (see Table 4.1). A typical approach to overcome this limitation and thus to make
the prediction runs on different datasets comparable is to normalize the measures with respect
to the range of rating values (Herlocker et al. 2004; Goldberg et al. 2001). The formal defini-
tions of the Normalized Mean Absolute Error (NMAE) and the Normalized Root Mean
Squared Error (NRMSE) are as follows:

\mathrm{NMAE} = \frac{\mathrm{MAE}}{r_{\max} - r_{\min}}   (4.3)

\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{r_{\max} - r_{\min}}   (4.4)

where r_{\min} and r_{\max} denote respectively the minimum and the maximum ratings of a
recommender system's rating scale.
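Equations (4.1)–(4.4) translate directly into code; a small illustration (function names are ours):

```python
def mae(predicted, actual):
    """Mean Absolute Error, equation (4.1)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root Mean Squared Error, equation (4.2)."""
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)) ** 0.5

def normalize(measure, r_min, r_max):
    """Equations (4.3)/(4.4): divide by the rating-scale range."""
    return measure / (r_max - r_min)
```

For predictions [4, 3, 3] against true ratings [5, 3, 1], MAE is 1.0 while RMSE is √(5/3) ≈ 1.29, illustrating that RMSE exceeds MAE whenever the error magnitudes vary.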
In Section 4.4, which presents the results of our empirical study, we will report all four
accuracy measures introduced above. While the normalized measures allow us to compare the
predictive accuracy of different algorithms across different datasets consistently, the raw, i.e.
non-normalized, measures allow the reader to compare our results with those of other pub-
lished or unpublished works on recommendation systems.
Prior to presenting the results of our study, the next section provides some details on the
algorithms employed therein.
4.3 Employed Algorithms and Benchmarks
In order to provide an informative report on the predictive accuracy of our proposed
method, we ran a series of accuracy tests with some of the key recommender algorithms. Spe-
cifically, we used pure user-based and item-based collaborative filters, each in two variants
that differ with respect to the similarity measure employed, i.e. the Pearson correlation coeffi-
cient and cosine similarity (see Sections 2.3.1.1 and 2.3.1.2 for details). In these collaborative
filters, we used the neighborhood size that provided the best accuracy over all datasets in pre-
liminary test runs.
Another algorithm employed in our study is the Singular Value Decomposition-like ma-
trix factorization algorithm by Funk (2006), the foundation of all matrix factorization recom-
menders discussed in the recent literature. As matrix factorization is known to provide one of
the best predictive accuracies for a single algorithm, we consider a comparison with this basis
algorithm to be informative. However, a note should be made regarding the results of the em-
ployed algorithm: In our preliminary prediction runs it turned out that matrix factorization is
highly sensitive to its parameters, i.e. the number of iterations, the regularization parameter,
the learning rate and the number of factors47. The optimal values of these parameters, in turn,
depend on the underlying data, and thus should be determined individually for each dataset in
order to achieve optimal results. These assertions are also supported by recent research (e.g.
Paterek 2007; Koren 2009; Koren, Bell, and Volinsky 2009; Koren and Bell 2011). There-
fore, in our comparisons we used differently parameterized versions of Funk's algorithm and
correspondingly report the best results of the algorithm for the respective datasets. For the
comparisons on the MoviePilot dataset the factor model of the algorithm was learnt for the
following parameter values: , ,
and . For the Netflix dataset, the cor-
responding parameters are , ,
and .
47 See Section 2.3.1.3 for explanation of the parameters’ meaning.
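As a rough sketch of the kind of algorithm meant here (not the exact implementation used in the study; the default parameter values shown are arbitrary placeholders), Funk-style stochastic gradient descent updates the user and item factor vectors rating by rating:

```python
import random

def funk_sgd(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=1000, seed=1):
    """Funk-style matrix factorization: learn user factors P and item
    factors Q by stochastic gradient descent over the observed ratings.

    ratings: list of (user, item, rating) triples with 0-based ids.
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # L2-regularized gradient step
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q
```

The sensitivity noted above shows up directly here: the learning rate, regularization, factor count and epoch count all interact, which is why they must be retuned per dataset.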
To better judge the relative accuracy improvements of the different algorithms, we in-
troduce two 'benchmarks'. The first benchmark is a simple heuristic that 'predicts' the global
average of a dataset as the value of all future ratings for all users. This benchmark obviously
represents the bottom level of accuracy that a recommender system can provide to its users.
That is, if a recommender algorithm exhibits a lower level of prediction accuracy than the
'global average' method, it makes no sense for a recommender system to employ such an al-
gorithm, since a simple heuristic performs better. The second benchmark is the result of the
algorithm that won the Netflix One Million Dollar Prize. By achieving an RMSE of .8712
(Bell, Koren, and Volinsky 2008), it improved the RMSE of Netflix's own algorithm by 10%
and hence can be considered the most accurate recommendation algorithm in the recommend-
er domain. Therefore, we consider the comparison with this benchmark to be informative.
However, test runs of this algorithm on our data are impeded by the fact that it is, in essence,
the result of blending the predictions generated by more than 100 recommendation algorithms
(Bell, Koren, and Volinsky 2008). Testing this algorithm on our data would imply implement-
ing all of its composite parts and subsequently blending their results, which is not trivial. Not
only would it take a lot of time and resources, but it would also expose our results to errors
and to the criticism that they may be attributable to implementation mistakes. Thus, we sug-
gest using the reported RMSE and the corresponding NRMSE instead; the latter can easily be
calculated by means of equation (4.4) and amounts to .2178. In the tables that present the re-
sults of our study, i.e. in Table 4.3 and Table 4.4, these values are denoted as the "Netflix
Prize winner". Unfortunately, the authors do not report the MAE of their algorithm. Never-
theless, we consider comparisons of the NRMSE of different algorithms with this benchmark
to be informative for judging improvements in prediction accuracy.
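The bottom-level benchmark is trivial to state in code (a sketch; the naming is ours):

```python
def global_average_baseline(training_ratings):
    """Return a predictor that answers every query with the training mean."""
    mu = sum(training_ratings) / len(training_ratings)
    return lambda user_id, movie_id: mu  # ignores who asks and about which movie
```

Any algorithm whose holdout error exceeds that of this constant predictor is not worth deploying.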
Whereas the foregoing sections provided a conceptual description of the design of our
study and of the methods employed therein, details of the technical implementation and exe-
cution of our tests can be found in Appendix C. The next section presents the results.
4.4 Results
This section presents the results and discusses the findings of our empirical study. The
two main questions concerned herein are (i) how well our proposed method predicts future
user ratings and (ii) what proportion of the users receive the explanations behind their rec-
ommendations in the most effective pros-and-cons explanation style. Each of these questions
is addressed in a separate subsection in the following.
4.4.1 Comparison of Prediction Accuracy
The results of the prediction runs of the different algorithms are summarized in Table
4.3 (for the MoviePilot dataset) and Table 4.4 (for the Netflix dataset). In these tables we re-
port the accuracy of our proposed method in three rows: Firstly, the row "Estimation step"
provides the results of the predictions made by using model (3.7) with the parameter values
obtained in the estimation step of our algorithm (see Section 3.2.1). Secondly, the row "Opti-
mization step" reports the accuracy of the predictions of the same model initialized with the
optimized parameter values (see Section 3.2.2). Finally, the row labeled "Hybrid" provides
the accuracy measures obtained through the hybridization of our optimized solution and item-
based CF, as described in Section 3.3.2.
The columns of the tables report the four accuracy measures introduced in Section 4.2,
as well as the percentage improvement achieved by a particular algorithm with respect to the
global average and Netflix Prize winner benchmarks. For the reasons explained in the forego-
ing section, the latter improvement is only reported for the RMSE and NRMSE measures. To
simplify the comparison, two additional columns display the rank order that the compared
algorithms achieve on the corresponding measure. In these columns, lower ranks correspond
to better accuracy.
Table 4.3: Comparison of the prediction accuracy of different algorithms for the MoviePilot dataset

Algorithm              MAE       NMAE     Improvement    Rank # of   RMSE      NRMSE    Improvement    Improvement     Rank # of
                                          w.r.t. global  MAE &                          w.r.t. global  w.r.t. Netflix  RMSE &
                                          average (%)    NMAE                           average (%)    Prize winner(%) NRMSE
Benchmark methods
Global average         21.55641  0.21555    0.0           9          26.34466  0.26345    0.0           -21.0          10
Netflix Prize winner   n/a       n/a        n/a           n/a        n/a       0.21780   17.3             0.0           2
Collaborative filtering methods
User-based, Pearson    16.92135  0.16921   21.5           3          22.15778  0.22158   15.9            -1.7           3
User-based, Cosine     17.37269  0.17373   19.4           5          22.55114  0.22551   14.4            -3.5           6
Item-based, Pearson    16.80697  0.16807   22.0           2          22.17444  0.22174   15.8            -1.8           4
Item-based, Cosine     17.21427  0.17214   20.1           4          22.52112  0.22521   14.5            -3.4           5
Matrix factorization   17.56650  0.17567   18.5           6          22.65087  0.22651   14.0            -4.0           7
Proposed method
Estimation step        18.18684  0.18187   15.6           8          24.28305  0.24283    7.8           -11.5           9
Optimization step      18.16394  0.18164   15.7           7          24.17754  0.24178    8.2           -11.0           8
Hybrid                 16.19231  0.16192   24.9           1          20.66475  0.20665   21.6             5.1           1
Table 4.4: Comparison of the prediction accuracy of different algorithms for the Netflix dataset

Algorithm              MAE       NMAE     Improvement    Rank # of   RMSE      NRMSE    Improvement    Improvement     Rank # of
                                          w.r.t. global  MAE &                          w.r.t. global  w.r.t. Netflix  RMSE &
                                          average (%)    NMAE                           average (%)    Prize winner(%) NRMSE
Benchmark methods
Global average         0.93609   0.22815    0.0           9          1.10899   0.27725    0.0           -27.3          10
Netflix Prize winner   n/a       n/a        n/a           n/a        0.87120   0.21780   21.4             0.0           2
Collaborative filtering methods
User-based, Pearson    0.67940   0.16985   25.6           4          0.87921   0.21980   20.7            -0.9           4
User-based, Cosine     0.69548   0.17387   23.8           5          0.89502   0.22376   19.3            -2.7           6
Item-based, Pearson    0.67726   0.16932   25.8           2          0.87714   0.21929   20.9            -0.7           3
Item-based, Cosine     0.67911   0.16978   25.6           3          0.87975   0.21994   20.7            -1.0           5
Matrix factorization   0.70521   0.17630   22.7           6          0.90324   0.22581   18.6            -3.7           7
Proposed method
Estimation step        0.70718   0.17680   22.5           8          0.91189   0.22797   17.8            -4.7           9
Optimization step      0.70610   0.17653   22.6           7          0.90760   0.22690   18.2            -4.2           8
Hybrid                 0.64053   0.16013   29.8           1          0.82220   0.20555   25.9             5.6           1
First of all, it can be seen that differences between the MoviePilot and the Netflix
datasets impact the magnitudes of the accuracy measures. The normalized accuracy measures
(NMAE and NRMSE) are all greater for the Netflix dataset. The accuracy of the global
average predictions, i.e. our bottom-level benchmark, is impacted most. This is, however,
concordant with Koren's observation of a sudden jump of the mean rating in the Netflix
dataset that happened in early 2004 and may be attributed to the alteration of Netflix's
rating scale labels (Koren 2009). Indeed, such an increase of the mean rating should cause
both an increase of the mean error and an increase of the error variance, which is reflected
in the higher NMAE and NRMSE values of the prediction runs on the Netflix dataset. Whereas
the global average's predictions are impacted by definition48, the other compared methods
exhibit a surprising robustness to the altered meaning of the scale points: The difference
in NMAE between the predictions on the two datasets becomes noticeable only in the fourth
decimal place for the majority of the compared methods. The NRMSE values, however, are
affected one order of magnitude more strongly, so that the difference between the prediction
runs can be seen already in the third decimal place. Consequently, the percentage of accuracy
improvement with respect to our benchmarks is also impacted by these issues, which makes the
corresponding values less consistent across the considered datasets and less informative for
comparison purposes.
Another source of the higher values of the normalized accuracy measures on the Netflix
dataset may simply be its larger size: with more predictions to make, an algorithm has more
opportunities to commit large errors. Nonetheless, irrespective of how much each of these
issues contributes to the higher error measures, it should be recognized that the differences
between the datasets do impact the accuracy measures. Hence, comparisons of accuracy
improvement should be made with care and account for the circumstances described above.
Even so, it can be noted that the accuracy measures of our proposed method are impacted
considerably less by the difference between the datasets. We attribute this to the fact that,
contrary to the other methods, our preference model incorporates temporal effects.
48 Recall that the global average is defined as the mean rating of a dataset.
Hence, it was able to capture the time-varying component of the rating variance to a greater extent than the
competing methods did. This underlines the importance of accounting for temporal changes
within recommendation algorithms.
A more important observation is, nevertheless, that the rank order of the different methods
with respect to their accuracy remains largely consistent across both datasets employed. We
interpret this fact as an indicator of the generalizability of the result summaries provided
in Table 4.3 and Table 4.4, at least concerning the rank order of the algorithms. That is, we
assert that the results obtained are descriptive of the algorithms' performance on the
accuracy measures and that they are generalizable to other datasets. Further discussion of
the results concerning the accuracy of our proposed method is based on these assertions:
It can be seen that our method's predictions, though exhibiting significant accuracy
improvements (over 15%) with respect to the bottom-line benchmark (i.e. the global average),
evidently do not belong to the leaders of the table. Moreover, the results of the optimization
step do not differ substantially from the results of the estimation step. The former achieves
only a marginal improvement (less than 1%) over the latter with respect to both MAE and RMSE,
although the RMSE improvement is about five times larger than the MAE improvement. A yet more
interesting fact is that the proposed hybridization of our method with the item-based CF
produces a sudden jump in accuracy that makes the aggregated method outperform all of its
competitors, even the Netflix Prize winning algorithm. These observations lead us to the two
following conclusions:
Firstly, the superior accuracy of our hybrid over both of its components indicates that
our model (3.7) does not capture all of the user rating variance. That is, the attribute-based
model of user preferences fails to describe the preference formation of some of the users
contained in the datasets. For these users, the item-based CF produces predictions that are
nearer to their true ratings than the predictions of our method. Hence, item-based CF captures
some movie characteristics that are 'hidden' from the attribute-based preference model and go
beyond formal attributes. Such characteristics may be, for example, the depth of character
development, an enthralling storyline or the overall atmosphere of a movie. In other words,
for users who base their preferences on such hard-to-formalize movie characteristics, the
analysis of item similarities is capable of revealing the relationships between movies that
are due to these characteristics. On the other hand, there is also a substantial number of
users whose preference structures are described better by our attribute-based preference
model. Therefore, providing both groups of users with predictions based on the individual method that better
describes their corresponding preferences results in the superior performance of our hybrid
(see Section 3.3.2). It follows that the inferior performance of our preference model is
mostly not due to calculation errors but is rather caused by the model's inability to capture
movie preferences for a certain group of users.
Secondly, connecting this fact with the observation of a merely moderate improvement of
prediction accuracy from the estimation step to the optimization step of our algorithm allows
us to conclude that the former already provides fairly good estimates of the model parameters.
That is, if our model's predictions outperform the predictions of item-based CF for a
substantial group of users, if the lower accuracy for the remaining users is caused by the
model's inadequacy for describing the preference formation of these users, and if the
combination of the pure predictions of the two hybrid components leads to superior overall
results, then this indicates that the model parameters were estimated reliably enough to
substantially reduce the overall prediction error. The five times larger improvement of RMSE
(as compared to the improvement of MAE) indicates that the optimization procedure mainly
reduces the error variance rather than the error's expected value. This, again, testifies
that the adjustment of the point estimates in the optimization step results in a slightly
better fit of the model to user preferences, whereas the model bias (i.e. the expected value
of the error) remains nearly constant. That is, the initial interval estimates of the
estimation step were already obtained reliably.
Now, let us take a closer look at the sudden accuracy improvement caused by the
hybridization of our preference model and the item-based CF. Consider Table 4.5 that
summarizes the distribution parameters of the absolute error after the optimization step.
Table 4.5: Distribution parameters of the absolute prediction error of the optimization step

Dataset      Min   Max   Mean    SD      Mode   Kurtosis   SE of kurtosis   25th percentile   50th percentile   75th percentile
MoviePilot   0     100   18.19   16.36   0      2.434      .022             6.03              13.60             25.48
Netflix      0     5     .706    .624    0      2.527      .018             .2315             .5368             1.44
It can be seen that for both datasets, our algorithm exhibits relatively high positive
kurtosis values (over 2) and a relatively low standard deviation (as compared to the mean
error value). Both facts indicate that the error distribution is highly peaked, i.e. most
error values are concentrated around a particular point rather than being spread over a wide
interval. The analysis of the quantiles (25th, 50th and 75th percentiles) shows that the
error distribution is, in addition, positively skewed, i.e. the distribution's peak is
situated nearer to the point of zero error than to the error mean. The distribution's
peakedness and positive skewness are also confirmed by the fact that the error's standard
deviation around the mean (SD) is lower than the value of RMSE49. Further, it can be seen
that the absolute prediction error mostly stays below the value of the standard deviation
and exceeds it in only about 30% of the cases (see the 50th and the 75th percentiles). This
means that the error measures are mainly constituted by a low number of points with large
deviations rather than by a large number of points with nearly equal deviations. Altogether,
these facts provide evidence that most of the time our model predicts user ratings fairly
accurately and that it fails to do so only in a relatively small number of cases (about 30%).
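The relation between SD and RMSE invoked here can be made precise via the moment decomposition E[e^2] = (E[e])^2 + Var(e), so that RMSE = sqrt(mean^2 + SD^2). A short numerical check of this identity on synthetic absolute errors (illustrative data, not the thesis's prediction errors):

```python
import numpy as np

# For the absolute error e >= 0, E[e^2] = (E[e])^2 + Var(e), hence
# RMSE = sqrt(mean(e)^2 + SD(e)^2): the SD in Table 4.5 lies below the
# corresponding RMSE whenever the mean error is nonzero.
rng = np.random.default_rng(0)
e = np.abs(rng.normal(0.0, 1.0, 10_000))   # synthetic absolute errors

rmse = np.sqrt(np.mean(e ** 2))
mean, sd = e.mean(), e.std()               # population SD (ddof=0)
assert np.isclose(rmse, np.sqrt(mean ** 2 + sd ** 2))
assert sd < rmse
```

Note that the decomposition holds exactly only for the population standard deviation (`ddof=0`); with the sample SD the two sides agree up to a factor of n/(n-1) on the variance term.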
In the latter cases, however, the magnitude of the error is substantially large, ranging
from about 25% to 100% of the RS's rating scale interval. One possible explanation for this
is a systematic nature of the large errors. The source of such systematic errors, in turn,
may be attributed to a range of factors such as the model's quality, calculation errors, or
user and item rating patterns. In order to test this assumption and, where applicable, to
identify the source of the systematic errors, we inspected the ratings for which our
algorithm produced large errors. Indeed, we found that the large errors belong to the same
group of users. This supports our assumption of a systematic nature of the error and allows
us to attribute it to the users. However, we were unable to find patterns that would allow
an a priori identification of users with high prediction errors. That is, these groups of
users do not exhibit any noticeable regularities with respect to the source data, such as a
low number of ratings or a specific rating distribution, that would allow us to discern them
from the users for whom our algorithm produces lower errors. The only sensible explanation
is that the 'problematic' users form their movie preferences on the basis of information
which is not captured by the preference function shown in equation (3.7). This observation
supports our previously stated suggestion that the attribute-based preference model is unable
to capture movie preferences for a certain group of users, which motivates hybridization.
49 Recall that RMSE designates the standard deviation of the error distribution around zero (see Section 4.2).
To further justify the hybridization of our method with item-based CF, we performed
the Kolmogorov-Smirnov test for the equality of distribution functions. The results showed
that the error distribution of the item-based CF significantly differs from the one produced
by the attribute-based preference model (for both the MoviePilot and Netflix datasets).
Consistent with this, both approaches produced unequal errors for most users at the
single-user level (in Student's t-test for equality of means). Again, these results confirm
that the two approaches capture different 'kinds' of variance in the user ratings, each of
them suited to describing the preference formation of a different 'kind' of user. Hence, the
hybridization of the individual predictions of both approaches, as described in Section
3.3.2, is sensible and results in a substantial improvement of the hybrid's predictive accuracy.
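The statistic underlying the Kolmogorov-Smirnov test, the maximum vertical gap between the two empirical distribution functions, can be sketched in plain Python; this is a minimal illustration, and the significance evaluation used in the thesis is omitted.

```python
import bisect

def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical gap
    between the empirical distribution functions of samples a and b."""
    a, b = sorted(a), sorted(b)

    def ecdf(sample, x):
        # fraction of sample values <= x
        return bisect.bisect_right(sample, x) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

# Identical error samples give 0; fully separated samples give 1.
assert ks_two_sample([1, 2, 3], [1, 2, 3]) == 0.0
assert ks_two_sample([0.1, 0.2], [0.8, 0.9]) == 1.0
```

A large statistic on the two components' error samples is what rejects equality of the error distributions and thus supports combining the components.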
Table 4.6 provides a summary of the accuracy improvement of our proposed hybrid method
in comparison with its components (i.e. the individual predictions of the optimization step
and the item-based CF) as well as with the benchmark methods (i.e. the global average and
the Netflix Prize winner algorithm).
It can be seen that the improvements are substantial and consistent with respect to both
(N)MAE and (N)RMSE on both MoviePilot and Netflix datasets.
Table 4.6: Accuracy improvement of the hybrid method
The values indicate the percentage of accuracy improvement of the hybrid method relative to other methods
Algorithm                 MoviePilot            Netflix
                          (N)MAE    (N)RMSE     (N)MAE    (N)RMSE
Global average            24.88%    21.56%      29.81%    25.86%
Optimization step         10.85%    14.52%      9.28%     9.41%
Item-based CF, Pearson    3.65%     6.80%       5.42%     6.26%
Netflix Prize winner      n/a       5.12%       n/a       5.62%
Moreover, our proposed hybrid method outperforms all compared algorithms, even the most
accurate one – the Netflix Prize winning algorithm. This finding allows us to state that one of
our initial objectives, i.e. the development of an accurate recommendation algorithm, is
achieved.
4.4.2 Provided Explanation Style
In the previous section, our hybrid method was shown to outperform all other compared
methods with respect to predictive accuracy. However, the predictions of the hybrid are
produced as a combination of the individual predictions of the hybrid's components, each of
which provides its own explanation style, and these styles differ in their effectiveness for
the user's decision making (see Section 3.3.2): Whereas predictions based on the attribute
preference model allow for the most effective pros-and-cons explanation style, the item-based
predictions provide the less effective influence explanation style. Since the item-based CF
component has substantially contributed to the outstanding predictive accuracy of the hybrid
method, the question arises in what proportion of cases the final recommendations of the
hybrid are produced by means of our user preference model. That is, how many users receive
recommendations explained in the most effective pros-and-cons explanation style?
Table 4.7 answers this question by providing a summary of the number of users for
whom each of the individual explanation styles applies. The numbers for the pros-and-cons
explanation style correspond to the number of cases in which the final recommendations were
produced by means of the user preference model (3.7). The numbers for the influence
explanation style reflect the number of cases in which predictions of the item-based CF
component of the hybrid were used as final recommendations.
Table 4.7: Provided explanation style
                     MoviePilot                         Netflix
Explanation style    Number of users   Percentage      Number of users   Percentage
Pros-and-cons        5,194             65.31%          290,146           67.73%
Influence            2,759             34.69%          138,239           32.27%
Total                7,953             100%            428,385           100%
We can see that the results are consistent across both employed datasets, i.e. they do
not exhibit substantial differences between the datasets. Item-based CF and its inherent
influence explanation style were used for only about 34% of the users. The majority of the
users (about 66%) received explanations for the recommended items in the most detailed and
most effective pros-and-cons explanation style.
Although our hybrid method could not ensure the provision of explanations in the most
detailed explanation style for all users, all users were provided explanations in one of the
most effective styles (see Sections 2.1.2 and 2.1.3). We can therefore state that our second
objective, i.e. the provision of effective and actionable explanations, was achieved: Recall
the discussion of the previous section, where we found that the attribute preference model
cannot capture the preference structure of some users because those users form their
preferences based on factors other than movie attributes. Hence, explanations of
recommendations in terms of movie attributes would not be informative for these users and
thus would not increase the effectiveness of their choice making, simply because they rely
on other kinds of information when making choices. Since the item-based component of our
hybrid substantially increases the predictive accuracy for these users, it appears to capture
the 'right' part of the rating variance for them. Hence, the influence-based explanation
style that highlights the similarity between movies is more informative, and thus more
effective, for users whose preferences are better described by the item similarity model of
item-based CF.
At the same time, about two thirds of the user base is provided with detailed explanations
of the recommendations based on the attribute preference model that effectively captures the
preferences of the corresponding users. In other words, our hybrid method provides each group
of users with explanations of the recommendations in the style that is most effective for
the respective user group.
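As an illustration of how a pros-and-cons explanation could be assembled from estimated attribute preferences, consider the following sketch. The attribute names, the weight encoding and the top-n selection rule are purely hypothetical and are not taken from the thesis.

```python
def pros_and_cons(attribute_weights, movie_attributes, top_n=3):
    """Sketch of a pros-and-cons explanation: among the attributes a movie
    actually has, the user's most positively weighted ones become 'pros'
    and the most negatively weighted ones 'cons'.

    attribute_weights: dict attribute -> estimated user preference weight
    movie_attributes:  set of attributes present in the movie's profile"""
    present = [(a, attribute_weights[a]) for a in movie_attributes
               if a in attribute_weights]
    present.sort(key=lambda kv: kv[1], reverse=True)
    pros = [a for a, w in present[:top_n] if w > 0]
    cons = [a for a, w in present[-top_n:] if w < 0]
    return pros, cons

# Hypothetical weights: this user likes comedies and dislikes horror.
weights = {"genre:comedy": 0.8, "actor:X": 0.5, "genre:horror": -0.9}
pros, cons = pros_and_cons(weights, {"genre:comedy", "genre:horror"})
# pros == ["genre:comedy"], cons == ["genre:horror"]
```

An influence-style explanation, by contrast, would list the already-rated movies most similar to the recommended one rather than its attributes.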
The latter assertion is supported by the research of Bilgic and Mooney (2005) and
Symeonidis, Nanopoulos, and Manolopoulos (2008), or, strictly speaking, by the contrast
between their findings: Among other things, both studies compare the effectiveness of and
the user satisfaction with the influence and the keyword explanation styles within an
experimental framework based on a single recommendation algorithm. Whereas in Bilgic and
Mooney's study the keyword explanation style dominated the influence style, the study of
Symeonidis, Nanopoulos, and Manolopoulos reveals the opposite (see Section 2.1.2). In both
studies, however, the difference between the two explanation styles is not significant.
Recall now that our pros-and-cons explanation style derives from the keyword explanation
style and represents an extension of the latter. In view of our findings, which reveal the
existence of two user groups that form their preferences differently, the contradiction
between the results of the two studies discussed above becomes explainable: Since both user
groups are substantially large, the users of both groups had sufficiently high chances of being assigned to
either of the experimental groups, i.e. the 'keyword' and the 'influence' groups. Hence, the
chances are very high that both experimental groups contained a substantial number of users
of both types. Nevertheless, the exact proportions in which the two user types were
represented in the different experimental groups could differ slightly. This leads to a
difference in the 'mean' judgments of the experimental groups but could not entail a
significant difference in means, because the groups were composed of the two user types in
comparable numbers, though in different proportions. Although this explanation requires
proof, we leave it to future research. At this point, we suggest that this explanation is
convincing and concordant with our findings.
Summarizing the above, we argue that in our empirical study our proposed method
confirmed its ability to provide actionable recommendations that increase the effectiveness
of recommendations. The consistency of the results on two different datasets indicates their
generalizability. This allows us to assert that the second objective of the current thesis
is achieved and thereby to conclude the development of our proposals.
The next section provides a brief summary of the findings of the empirical study.
4.5 Summary
The purpose of the current chapter was to test our theoretically developed algorithm for
providing recommendations and explanations thereof in an empirical setting, i.e. to prove
the portability of our proposition to the real-world operating environment of recommender
systems as well as the compliance of the proposed method with the declared objectives of the
current thesis.
To this end, we conducted an empirical study that employs datasets of user ratings for
movies from two real-world recommender systems. Using these datasets, our proposed
recommendation method was compared with the key recommendation algorithms with respect to
prediction accuracy, i.e. the ability to generate reliable recommendations. Further, the
ability to provide effective and actionable explanations to the users was examined.
The results show that our proposed hybrid recommendation method outperforms collab-
orative filtering approaches and even the state-of-the-art Netflix Prize winning algorithm in
terms of predictive accuracy, while providing all users with explanations of the reasoning
behind recommendations.
The majority of the user base (about two thirds) received explanations in the
pros-and-cons explanation style, which provides detailed, easy-to-understand and actionable
explanations that increase the efficiency of the user's choice. For the smaller fraction of
users (about one third of the user base), however, the explanations are given at a lower
level of detail than the pros-and-cons explanation style can provide. This is due to the
fact that our theoretically founded multi-attribute preference model does not capture the
variance in the ratings of these users, which indicates that they base their preferences on
factors other than the information contained in the formalizable movie attributes. The
item-based CF method, however, was capable of producing reliable rating predictions for such
users. This, in turn, indicates that the similarity between movies can serve as a reliable
descriptor of the preference formation of such users. Hence, to be effective, the
explanations for these users should also be provided in the style that better suits their
preference function, i.e. in the influence explanation style. Thus, both user groups received
explanations that effectively support them in their choice making.
Since different user groups received recommendations provided by an algorithm that
better suits the users‟ preference functions, it can be argued that each user received recom-
mendations generated by a recommendation process that aligns with the user‟s preferences.
Hence, we can assert that the third aspect of our objectives is also achieved.
That is, in our empirical study, our proposed recommendation method has proven capable
of providing both accurately predicted recommendations and actionable explanations of the
reasoning behind them, as well as of aligning the recommendation process with the user
preferences.
The results are consistent for both datasets employed and do not exhibit significant
variations between them. Since both datasets underlying the study exhibit unique
characteristics, the consistency of the results obtained on them indicates the
generalizability of the findings to the domain of movie recommendations as well as to
recommender systems as a whole.
Chapter 5
Conclusions and Future Work
5 Conclusions and Future Work
In this chapter, we summarize our research and its findings, discuss the implications
of the latter, and provide suggestions for future work. The first subsection provides a
brief recapitulation of the course of our analysis and the development of our algorithm and
summarizes our contributions to research. The second subsection highlights the main
implications of our findings for recommender system providers and developers. Finally, the
third subsection concludes our thesis with a discussion of ways to improve our proposed
recommendation method and of avenues for future research.
5.1 Research Summary, Findings and Contributions
The aim of the current thesis was to develop a recommendation method which is capable
of providing both accurately predicted recommendations and actionable explanations of the
reasoning behind them, as well as of aligning the recommendation process with the user
preferences.
In order to provide foundations for our developments, we began with a theoretical dis-
cussion addressing the questions why explanations of the recommendations should be an inte-
gral part of recommender systems and how they should be provided to the users. Prior re-
search has shown evidence that explanations of the reasoning behind recommendations are
capable of establishing the users' acceptance of and trust in recommender systems as well
as of increasing the users' loyalty thereto. Moreover, explanations can further extend the
users' decision effectiveness when choosing among recommended items and raise their
satisfaction with the choice. However, these advantages only pay off if explanations are
understandable and actionable to the users. This not only requires that users comprehend
the explanations but also implies that the explanations are provided in terms relevant to
the users' decision making, i.e. that they comprise the characteristics users actually
employ for judging choice alternatives. On the other hand, the recommendation algorithm
should also reflect the user's way of thinking while producing recommendations. That is, an
algorithm should likewise operate in terms of the characteristics that users employ for
judging choice alternatives. This not only allows an algorithm to provide actionable
explanations to the users but also ensures that the recommendations produced are effective,
i.e. indeed reflect the user's optimal choice. However, for the latter to hold, the
algorithm must be aligned with the users' preference weights for the relevant
characteristics. That is, the importance weights of the characteristics employed in the
algorithm should be similar to the user's actual weights.
These considerations led us to employ a multi-attribute utility (MAU) model as the basis
for our recommendation algorithm. The MAU model 'decomposes' the utility of a choice
alternative into a sum of preferences for the attributes the alternative consists of and is
hence suitable for modeling user preferences that are based on item characteristics.
Further, we suggested using the weighted additive decision rule (WADD), which considers all
attributes of an alternative for the identification of an optimal choice. Although research
provides evidence that in many real-world situations (such as stress, time pressure, etc.)
people rely on simplifying decision procedures and evaluate only a fraction of an item's
attributes, WADD was shown to lead to the most effective choices. Moreover, the choice of
WADD for our algorithm is supported by the work of Aksoy et al. (2006), who provided
evidence that it is sufficient for a recommendation algorithm to maintain the similarity of
its internal representation of the user attribute weights to the 'true' ones, while the
decision strategy issue can be ignored.
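The WADD rule itself is compact; the following minimal sketch uses illustrative attribute names and weights that are not taken from the thesis.

```python
def wadd_utility(weights, attribute_values):
    """Weighted additive (WADD) rule: the utility of an alternative is the
    weighted sum over all of its attribute values, U(x) = sum_j w_j * x_j,
    evaluated over every attribute rather than a simplifying subset."""
    return sum(weights[a] * v for a, v in attribute_values.items())

def best_choice(weights, alternatives):
    """Pick the alternative with the highest WADD utility."""
    return max(alternatives,
               key=lambda name: wadd_utility(weights, alternatives[name]))

# Hypothetical example: two movies described by two attributes.
movies = {
    "A": {"comedy": 1, "star_power": 0.2},
    "B": {"comedy": 0, "star_power": 0.9},
}
w = {"comedy": 0.7, "star_power": 0.3}
# U(A) = 0.7*1 + 0.3*0.2 = 0.76;  U(B) = 0.3*0.9 = 0.27
assert best_choice(w, movies) == "A"
```

Simplifying heuristics such as lexicographic choice would inspect only the most important attribute; WADD evaluates them all, which is what makes it the most effective but also the most demanding rule.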
The choice of MAU and WADD as the basis for our recommendation algorithm made it
necessary to determine a list of attributes to be considered by the algorithm we develop.
Given the domain of motion pictures for our recommendations, we analyzed the movie
research literature in order to derive preference-relevant movie attributes. However, we
faced a lack of research on this particular topic: The existing theoretical discussion of
the preference relevance of movie attributes is neither empirically validated nor does it
claim to provide a complete list of the preference-relevant movie attributes. Hence, we
adopted the existing findings and extended our list with movie attributes that are employed
in the research on movie success factors. The latter research field, however, treats movie
attributes as part of the superordinate concept of 'success factors' and considers them on
an aggregate level, i.e. the relevance of movie attributes for the preference formation of
individual consumers is not analyzed. However, it can be argued that if a factor is found to
be relevant for explaining the choices of a whole population of consumers, it should also be
relevant on the individual level. Following this argumentation, we provided a discussion of
the suitability of the movie success factors for describing the preferences of individuals,
which resulted in a list of 318 movie attributes to consider in our recommendation algorithm.
Further, to ensure the novelty of our approach, we provided an overview of the key
recommendation algorithms as well as insights into the principles of their algorithmic
functioning, their problems and trade-offs. Specifically, we discussed the family of
collaborative recommendation techniques (including user-based CF, item-based CF and matrix
factorization approaches) as well as content-based filtering. We also discussed hybrid
methods, which combine different recommendation techniques in one approach in order to
mitigate the disadvantages of the constituent methods and to utilize their respective
strengths. The discussion also shed light on the mathematical issues of the different
recommendation approaches. In particular, it was shown that the application of content-based
techniques in the domain of multimedia items, such as movies, is impeded by two factors: On
the one hand, the ability of contemporary content processing algorithms to extract
meaningful features from multimedia content is limited, which makes it impossible to compile
a list of preference-relevant attributes automatically, i.e. without involving additional
personnel on the side of the recommender system. This increases the costs of producing
recommendations, which, in most cases, radically reduces the attractiveness of content-based
algorithms for recommender system providers due to economic considerations. On the other
hand, even if a list of preference-relevant item characteristics can be obtained from a
third-party provider, such as IMDb, only a fraction of the attributes can be utilized for
the provision of recommendations: Most users of a recommender system only have a limited
number of ratings in their
user profiles, which (from the algebraic point of view) 'naturally' limits the number of
attributes for which user preference weights can be estimated to the number of the user's
ratings.
Consequently, such a content-based recommender algorithm would not be able to capture a
substantial part of the variance in a user‟s ratings and thus to produce reliable recommenda-
tions for the majority of the recommender system‟s user base.
The latter considerations motivate us to utilize statistical techniques in our approach to estimating users' attribute preferences. Hence, we account for this aspect already in the early
stages of the development of a conceptual framework for our recommendation algorithm:
Firstly, based on the findings of the preceding discussions we develop a model of user movie
preferences in a regression analysis manner. Essentially, our model incorporates four types of
effects:
(i) The very basic effects of movie-user interaction, i.e. preferences of a user to-
wards each of the movie attributes;
Two kinds of effects that are beyond the user-item interaction and due to either users or
items:
(ii) 'Raw' user effects caused, e.g., by a user's perception and handling of the rating scale or a user's reaction to mainstream trends;
(iii) 'Raw' item effects caused, e.g., by the differing popularity of different movies that is not conditioned on the presence of a certain characteristic in the movie's profile;
And, finally,
(iv) temporal changes in the three kinds of effects presented above.
This leads us to a model of user preferences that contains 643 parameters, each of them to be
estimated individually for each user.
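As an illustrative sketch only (all names are hypothetical and the temporal component is simplified to linear drift; this is not the thesis's exact specification), the four effect types can be combined into a single prediction function:

```python
import numpy as np

def predict_rating(item_attrs, attr_weights, user_bias, item_bias,
                   t, attr_drift, user_drift, item_drift):
    """Sketch of the four-effect preference model (names illustrative).

    item_attrs  : vector encoding the movie's attributes (e.g. 318 entries)
    attr_weights: (i) the user's preference weight per movie attribute
    user_bias   : (ii) 'raw' user effect, e.g. rating-scale usage
    item_bias   : (iii) 'raw' item effect, e.g. baseline popularity
    t, *_drift  : (iv) temporal change, here simplified to linear drift
    """
    static = item_attrs @ attr_weights + user_bias + item_bias
    temporal = t * (item_attrs @ attr_drift + user_drift + item_drift)
    return static + temporal

# e.g. a movie with the 1st and 3rd attribute present, 10 time units in
r = predict_rating(np.array([1.0, 0.0, 1.0]),
                   np.array([0.5, 0.2, 0.3]),   # attribute part-worths
                   user_bias=3.0, item_bias=0.2,
                   t=10.0,
                   attr_drift=np.array([0.0, 0.0, 0.01]),
                   user_drift=0.0, item_drift=0.0)
```

In the thesis's full model, every one of these parameters is estimated individually per user, which is what produces the large parameter count.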
Since an estimation of this number of parameters cannot be done in a 'traditional' way
for the majority of users who simply do not have a sufficient number of ratings in their user
profiles, we propose a two-step algorithm that accomplishes the estimation task by means of
statistical techniques. The first step of the algorithm provides interval estimates of the model
parameters, i.e. the estimates of the parameter values and of their confidence limits. These are
obtained in auxiliary regressions that are performed for each model parameter individually,
i.e. separately from other parameters. Such estimation is subject to the so-called omitted vari-
able bias and ignores correlations that may be present between the parameters of the model.
Thus, the estimates of such auxiliary regressions are theoretically unreliable and result in er-
roneous predictions of user ratings. Hence, to reduce the bias and to recover the reliability of
the estimates as well as their validity for predictions we propose a procedure that corrects the
initially obtained estimates for both the omitted variable bias and multicollinearity. The se-
cond step of the algorithm then optimizes the bias-corrected estimates to further increase the
data fit and to reduce prediction errors. The optimization is done by means of conjugate gra-
dient descent method, which was modified so that the parameter values are only allowed to
vary inside their respective confidence intervals. Looking ahead, let us point out that in our empirical study this novel procedure of parameter estimation for an underdetermined regression model proved to provide reliable estimates. Hence, we see this procedure itself as one of the most notable contributions of the current thesis to research.
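The idea of the second step can be sketched as follows. This is a simplified illustration under stated assumptions: plain projected gradient descent stands in for the modified conjugate gradient method described above, and all data and names are hypothetical.

```python
import numpy as np

def refine_within_ci(X, y, w0, ci_lo, ci_hi, lr=0.01, steps=500):
    """Optimize initial estimates w0 for data fit while keeping every
    parameter inside its confidence interval [ci_lo, ci_hi]."""
    w = w0.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad                           # gradient step
        w = np.clip(w, ci_lo, ci_hi)             # project back into the box
    return w

# toy system: true parameters (1, 2) lie inside the confidence box
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([1.0, 2.0])
w = refine_within_ci(X, y, w0=np.array([0.5, 1.5]),
                     ci_lo=np.array([0.0, 0.0]), ci_hi=np.array([3.0, 3.0]))
```

The clipping step is what keeps the optimized values consistent with the interval estimates of the first step.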
In addition to presenting our model and the estimation procedure, we suggested hybridizing our method with item-based CF. We motivated this hybridization by the following
concerns: Although our model covers 318 movie attributes that potentially can capture the
preferences of the majority of users, the hedonic nature of motion pictures may lead some users to judge movies by criteria other than movie attributes, e.g., a less well-defined overall impression or entertainment value. Hence, our model would not be able to capture the preferences of such users to a full extent. On the other hand, collaborative filtering techniques, which
are not concerned with movie attributes and base their recommendations on more general rating patterns, may be better at revealing relations between movies for such "hedonically
oriented” users, and hence, in producing recommendations for them. Taking our objective to
provide users with actionable explanations along with recommendations themselves into ac-
count, we suggested hybridizing our method with item-based CF, because the latter approach
provides the second best explanation style with respect to its potential of increasing user
choice effectiveness. Hence, the hybrid method provides all users with one of the most effec-
tive explanations of recommendations. To ensure the ability to provide explanations, we do
not combine individual predictions of the component methods algebraically (e.g., by means of averaging or weighting). Instead, we choose to use the 'raw' prediction of the component that performs best on withheld data as the final recommendation.
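The per-user selection of the better component can be sketched as follows; the function names and the use of mean absolute error as the selection criterion are illustrative assumptions, not the thesis's exact procedure:

```python
def choose_component(holdout, components):
    """Pick the component method whose raw predictions have the lowest
    mean absolute error on the user's withheld (holdout) ratings.

    holdout    : list of (item_id, true_rating) pairs withheld for this user
    components : dict mapping component name -> predict(item_id) function
    """
    def mae(predict):
        return sum(abs(predict(i) - r) for i, r in holdout) / len(holdout)
    return min(components, key=lambda name: mae(components[name]))

# toy example: the content model tracks this user's withheld ratings better
holdout = [("m1", 4.0), ("m2", 2.0)]
best = choose_component(holdout, {
    "content": lambda i: {"m1": 4.2, "m2": 2.1}[i],
    "item_cf": lambda i: {"m1": 3.0, "m2": 3.5}[i],
})
```

Because the winning component's prediction is used as-is, the explanation style tied to that component remains available for every recommendation.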
To test our proposed method and to locate its place within the family of contemporary
recommendation algorithms, we conducted an empirical study, which involved rating datasets
of two real-world recommender systems, each with its own inherent properties. The study compared the predictive accuracy of the key recommendation methods and also reported results of the most accurate Netflix Prize winning algorithm. The results are consistent for both
datasets, which indicates the generalizability of the findings for the domain of motion pictures
as well as for the domain of recommendation systems as a whole. It was shown that two groups of users indeed exist: The first and larger group (about two thirds of the user base) can be described well by our proposed multi-attribute model of user preferences. Consequently, for these users, the explicit preference modeling outperforms the CF component of our hybrid method, thus providing more precise rating predictions and the most effective pros-and-cons explanation style. The second, smaller group of users (about one third) seem to form their movie preferences based on factors other than movie attributes. For this group of users, item-based CF provides substantially more reliable rating predictions, i.e. item similarity is more descriptive of these users' preferences. The latter also indicates that highlighting the similarity of recommended movies to previously seen ones is more informative for these users as an explanation of recommendations. That is, each group of users received the most precisely predicted recommendations, supported by the most effective explanations.
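As a hedged sketch of how a pros-and-cons explanation might be assembled from a user's estimated attribute part-worths (the attribute names and the simple top-k selection heuristic are our own illustration, not the thesis's exact procedure):

```python
def pros_and_cons(item_attrs, part_worths, k=3):
    """List the item's attributes with the most positive (pros) and most
    negative (cons) estimated part-worths for the active user."""
    present = [(a, part_worths[a]) for a in item_attrs if a in part_worths]
    present.sort(key=lambda aw: aw[1], reverse=True)
    pros = [a for a, w in present[:k] if w > 0]
    cons = [a for a, w in present[-k:] if w < 0]
    return pros, cons

# toy profile: likes westerns and Eastwood, dislikes long running times
pros, cons = pros_and_cons(
    ["western", "long_runtime", "clint_eastwood"],
    {"western": 0.9, "long_runtime": -0.5, "clint_eastwood": 0.4})
```

Reporting both lists is what distinguishes the pros-and-cons style from a purely positive keyword explanation.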
All in all, our content-based hybrid method was shown to outperform collaborative filtering techniques with respect to predictive accuracy, while inherently ensuring the provision
of explanations behind recommendations for each user in the most effective explanation style.
Notably, our hybrid method also outperformed the Netflix Prize winning algorithm, which ranks as the most accurate among published recommendation algorithms but possesses no inherent capability to provide explanations behind recommendations.
This finding constitutes the main contribution of the current thesis to research in the field of
recommendation systems.
Recapitulating the above, our results and contributions to research can be briefly summarized as follows:
(i) We extended the keyword explanation style by integrating negative cues into it and established theoretically that the resulting pros-and-cons explanation
style increases the effectiveness of recommendations for the user‟s decision
making.
(ii) We developed a content based recommendation algorithm for the domain of
multimedia products, i.e. for recommendations of motion pictures. This algo-
rithm outperforms the key recommendation algorithms for the majority of users
and is capable of providing them with explanations of recommendations that ef-
fectively support the users‟ decision making.
(iii) We developed a novel statistical approach for the estimation of highly underde-
termined regression models. The approach employs a set of auxiliary regressions
that estimate one regression parameter at a time. The initial estimates are then corrected for the omitted variable bias and multicollinearity and subsequently optimized for a further reduction of prediction errors.
(iv) We have shown the existence of two substantially large user groups of movie
recommender systems who form their preferences differently. Providing rec-
ommendations for each group of users by means of a method that captures the
preferences of a correspondent user group better leads to a substantial increase in
the prediction accuracy of a recommender system.
(v) We showed that a carefully designed content based hybrid recommendation
method can outperform collaborative filtering algorithms with respect to predic-
tion accuracy.
(vi) We provided empirical support for the findings of previous research arguing that "[recommendation] agents should think like the people they are attempting to help" (Aksoy et al. 2006, p. 310).
The next section discusses implications of our findings.
5.2 Discussion and Implications
Even the most accurate recommendation algorithm is subject to prediction errors.
Hence, recommendation systems that aim at helping users to make better choices should account for factors beyond the rating predictions as such and broaden their scope to encompass aspects of the recommendation process as well as facilities that further increase users' choice efficiency:
On the one hand, recommender system providers should make efforts to increase their understanding of the criteria users base their decisions on and to build into their algorithms an ability to align the generation of recommendations with these criteria: Since different users base their choices on different criteria, a recommender system should employ different recommendation processes that match individual user decision making and incorporate the user's underlying choice criteria into a personalized recommendation
process. That is, rather than employing one algorithm that performs best overall, a recommendation system should handle its users individually: It should be a hybrid of several recommendation methods, each aligned with the choice making criteria of a specific user group, and provide recommendations to a user by means of the component method that reflects the user's choice making best.
On the other hand, efforts should be made to increase the user‟s understanding of rec-
ommendations. That is, an explanation facility should be made an integral part of recom-
mender systems. This facility, however, should be tightly coupled with the recommender al-
gorithm: Provided that the recommendations for different users are produced differently (see
above), the explanations should also reflect the underlying process of producing recommenda-
tions and highlight the aspects of recommendations that are relevant for the user‟s choice
making. This increases the user's choice effectiveness and compensates for algorithmic prediction errors by allowing users to assess the quality and suitability of recommendations before completing their choice. Furthermore, the provision of explanations as additional
decision supporting information allows users to better address the context in which the decision is made, as well as other subtle aspects of the decision's implications. In other words, explanations can make aspects addressable that are hardly addressable by an automated recommendation agent. For instance, if western fan Thorsten chooses a movie to watch after dining out with his spouse Claudia, he is unlikely to choose a protracted Clint Eastwood classic
for this occasion. Instead, he will appreciate a recommendation that hints at an entertaining or
love-story component of a western movie, which will allow him to choose a movie that suits
his decision context best, i.e. a movie that is worth watching for both him and his wife. Not
only can an alignment of explanations with the user's decision relevant characteristics increase the user's confidence in recommendations and choice efficiency, but the provision of explanations that are understandable and actionable also increases his or her trust in and acceptance of a recommendation system as a whole, which in turn increases the user's loyalty to the recommender.
The next section discusses ways to improve the proposals made in the current thesis and outlines directions for future work.
5.3 Future Research
No research publication can ever completely cover a topic with all its facets and nuances. No research project is free of limitations, and ours is no exception. In the following, we discuss the limitations of our research and show ways for its improvement and extension.
In the current thesis, we developed a recommendation algorithm that is capable of
providing explanations alongside with recommendations. Although the proposed explanation
style and its effectiveness for user choice making as well as the ability of the algorithm to
provide such explanations were proven theoretically in previous chapters, we cannot quantify
the degree to which the explanations presented in our proposed style actually increase the
choice effectiveness. This improvement can be substantial or only marginal. Likewise, it can be argued that the effectiveness of the pros-and-cons explanation style may depend on the nuances of the formulation of an explanation. These nuances include the questions of the optimal number of attributes an explanation should report, the valence and the
balance between the positive and negative cues included in an explanation, the wording and
the length of explanations, their place and design in the user interface of a recommender sys-
tem, etc. These topics were not addressed within the current thesis and require additional user
studies, which would provide empirical tests of our theoretically founded propositions and increase the understanding of effective explanations in recommender systems research.
By providing consistent results on real-world datasets with different properties, our proposed recommendation method was shown to generalize to the domain of movie recommendations. We encourage further studies that test the suitability, applicability
and effectiveness of our method in other real-world applications as well as for recommenda-
tions of other types of items and products.
Research directions we would also like to explore concern the modeling side of our
method, such as extending the list of item attributes, adding interaction effects, accounting for
non-linear attribute preference functions and non-linear temporal changes of the preferences. Accounting for these factors potentially increases the explanatory power of the multi-attribute preference model, which in turn can improve the prediction accuracy of our algorithm and allow it to capture the preferences of a greater share of users, i.e. also of users whose preferences were not adequately captured by our model in our study.
Improvements can also be made to the proposed algorithm itself. For example, similarity-based techniques can be employed to enrich the representation of user profiles through the imputation of part-worths. Such an imputation, again, potentially increases the number of users
for whom our algorithm can provide reliable rating predictions by uncovering attribute preferences that are initially 'hidden' from the algorithm. Furthermore, the number of items that can potentially be recommended to a user also increases. For example, if a user who likes
both action movies and westerns has only rated westerns, our content-based algorithm would
not be able to deduce the user's preference for action movies due to the lack of the corresponding data. Hence, such a user would never receive a recommendation of an action movie. In this situation, user-based CF could determine that other users with similar ratings also rate action movies highly. This information could then be used to impute the part-worths for the genre 'action' as well as for other attributes contained in the 'source' users' profiles (e.g. actors, directors, budgets, etc.) into the incomplete profile of the active user. Obviously, such imputation requires great care, so that the enriched user profile remains descriptive of the user's preferences and balanced with respect to the relative importance of different attributes. A possible approach to ensure this is rescaling the part-worths to be imputed based on the values of the part-worths of the known attributes. Another possible approach to imputation can be based on the similarity or correlations of the known part-worths between profiles of different users.
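A minimal sketch of the rescaling idea, with hypothetical user profiles represented as attribute-to-part-worth dictionaries (the scale factor based on mean absolute shared part-worths is our own illustrative choice):

```python
import numpy as np

def impute_part_worths(active, donor):
    """Copy part-worths absent from the active user's profile from a similar
    'donor' user, rescaled so they stay commensurate with the part-worths
    the two profiles share."""
    shared = [a for a in active if a in donor]
    scale = (np.mean([abs(active[a]) for a in shared])
             / np.mean([abs(donor[a]) for a in shared]))
    enriched = dict(active)
    for attr, w in donor.items():
        enriched.setdefault(attr, w * scale)  # only fill missing attributes
    return enriched

# the active user has only rated westerns; a similar user reveals 'action'
profile = impute_part_worths({"western": 0.8},
                             {"western": 0.4, "action": 0.6})
```

The rescaling keeps the imputed weights balanced against the weights already known for the active user, as argued above.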
Further, our empirical study revealed the existence of two substantially large user
groups that form their preferences differently. Whereas the first and larger group could be
reasonably well described by our multi-attribute preference model and thus received recom-
mendations predicted by the model, we used predictions of item-based CF for all users of the
second group. Although the predictions of item-based CF substantially improved the prediction accuracy of our hybrid method, this method does not necessarily provide the
best description of the underlying preference structures for all users of the second group. It is
also possible that the users of this group can be further differentiated with respect to the crite-
ria they base their movie choices on or with respect to a method that predicts their ratings bet-
ter. We argue that further analysis of the users of the second group and application of a rec-
ommendation method that captures the preferences of each user better may be fruitful and
increase both the prediction accuracy and the effectiveness of explanations. Hence, we invite researchers to examine this issue more deeply and encourage recommender system providers to combine several recommendation techniques in their recommendation systems, rather than building a system around the one algorithm that performs best overall.
Finally, a 'by-product' of the current thesis is the mathematical core of our algorithm – a method to estimate the parameters of underdetermined regression models. Recall that already the parameters obtained in the estimation step of the algorithm provided reasonably accurate predictions of user ratings. It is worth mentioning that in many cases the estimation of 636 parameters was done on the basis of as few as 6 data points. Utilizing a further 6 data points as a holdout in the optimization step improved the prediction accuracy by 1% and 5% with respect to MAE and RMSE, respectively. We suggest that these results are notable
and that the estimation method itself deserves attention from other research fields that deal
with the need to estimate many parameters based on a small number of data points. Hence, we
are eager to expand the application of our estimation method to the solution of other types of
problems than recommending items and see this as a potentially fruitful research field for our
future work.
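For reference, the two accuracy measures mentioned above follow their standard definitions (not specific to the thesis):

```python
import math

def mae(preds, truths):
    """Mean absolute error of rating predictions."""
    return sum(abs(p - t) for p, t in zip(preds, truths)) / len(preds)

def rmse(preds, truths):
    """Root mean squared error; penalizes large errors more strongly."""
    return math.sqrt(sum((p - t) ** 2
                         for p, t in zip(preds, truths)) / len(preds))
```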
Bibliography
Adomavicius, Gediminas and Alexander Tuzhilin (2005). "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions", in IEEE Transactions on Knowledge and Data Engineering, pp. 734-749.
Adomavicius, Gediminas, Ramesh Sankaranarayanan, Shahana Sen, and Alexander Tuzhilin
(2005). "Incorporating contextual information in recommender systems using a mul-
tidimensional approach." in ACM Transactions on Information Systems (TOIS), Vol.
23, Issue 1, pp. 103-145.
Adomavicius, Gediminas and Alexander Tuzhilin (2008). "Context-Aware Recommender
Systems", in Proceedings of the 2008 ACM conference on Recommender systems -
RecSys ’08, pp. 335-336.
Aksoy, Lerzan, Paul N. Bloom, Nicholas H. Lurie, and Bruce Cooil (2006). "Should
Recommendation Agents Think Like People?" in Journal of Service Research Vol. 8,
No. 4, pp. 297-315.
Aksoy, Lerzan, Bruce Cooil, and Nicholas H. Lurie (2011). "Decision Quality Measures in
Recommendation Agents Research" in Journal of Interactive Marketing Vol. 25
(2011), pp. 110-122.
Allan, James, Jaime Carbonell, George Doddington, Jonathan Yarmon, and Yiming Yang
(1998), "Topic Detection and Tracking Pilot Study Final Report", in Proceedings of
the DARPA Broadcast News Transcription and Understanding Workshop, pp. 194-
218.
Alspector, Joshua, Aleksander Kolcz, and Nachimuthu Karunanithi (1998), "Comparing Feature-Based and Clique-Based User Models for Movie Selection", in Proceedings of the Third ACM Conference on Digital Libraries, Pittsburgh, PA, pp. 11-18.
Anand, Sarabjot S. and Bamshad Mobasher (2005), “Intelligent Techniques for Web Person-
alization”, in Mobasher, Bamshad and Sarabjot Anand [eds.] “Intelligent Techniques
for Web Personalization”, Lecture Notes in Computer Science, Vol. 3169, Springer,
Heidelberg, Berlin, pp. 1-36.
Andersen, Stig K., Kristian G. Olesen, and Finn V. Jensen (1990). "HUGIN - A Shell for
Building Bayesian Belief Universes for Expert Systems", Morgan Kaufmann Pub-
lishers Inc., San Francisco, CA, USA.
Anderson, Chris (2004). "The Long Tail", in Wired, Issue 12.10, pp. 170-177.
Ansari, Asim, Skander Essegaier, and Rajeev Kohli (2000), "Internet Recommendation Sys-
tems", Journal of Marketing Research 37 (August): 363-376.
Ariely, Dan (2000). "Controlling the information flow: Effects on Consumers' Decision Mak-
ing and Preferences", in Journal of Consumer Research Vol. 27(2), pp. 233-248.
Ariely, Dan, John G. Lynch Jr, Manuel Aparicio IV (2004). "Learning by collaborative and
individual-based recommendation agents", in Journal of Consumer Psychology Vol.
14(1&2), pp. 81–95.
Augistin, Vernon E. (1927), "Motion Pictures Preferences", in Journal of Delinquency Vol 7,
pp. 206-209.
Austin, Bruce A. (1981), "Film Attendance: Why College Students Chose to See Their Most
Recent Film", in Journal of Popular Film and Television, Vol 9, pp. 43-49.
Austin, Bruce A. (1989), "A Factor Analysis Study of Attitudes Toward Motion Pictures", in Journal of Social Psychology, Issue 117, pp. 211-217.
Austin, Bruce A. (1989), "Immediate Seating: A Look at Movie Audiences", Wadsworth, Inc.
Avery, Christopher and Richard Zeckhauser (1997), “Recommender Systems for Evaluating
Computer Messages”, in Communications of the ACM, Vol. 40, Issue 3, pp. 88-89.
Baeza-Yates, Ricardo and Berthier Ribeiro-Neto (1999), "Modern Information Retrieval",
Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA.
Balabanovic, Marko and Yoav Shoham (1997), “Fab: Content-Based, Collaborative Recom-
mendation”, in Communications of the ACM, Vol. 40, No. 3, pp. 66-72.
Baltrunas, Linas (2008). "Exploiting Contextual Information in Recommender Systems", in
Proceedings of the 2008 ACM conference on Recommender systems - RecSys ’08,
pp. 295-298
Baltrunas, Linas and Francesco Ricci (2009). "Context-Dependent Items Generation in Col-
laborative Filtering", in ACM Workshop on Context-aware Recommender Systems
(CARS 2009), pp. 295-298
Basu, Chumki, Haym Hirsh, and William Cohen (1998), "Recommendation as Classification:
Using Social and Content-based Information in Recommendation", in AAAI '98/IAAI
'98 Proceedings of the fifteenth national/tenth conference on Artificial intelli-
gence/Innovative applications of artificial intelligence, pp. 714–720.
Baudisch, Patrick (1999), “Joining Collaborative and Content-based Filtering”, in Proceed-
ings of the ACM Conference on Human Factors in Computing Systems, pp. 1-5.
Bell, Robert and Yehuda Koren (2007) “Scalable Collaborative Filtering with Jointly Derived
Neighborhood Interpolation Weights”, in Proceedings of the 2007 Seventh IEEE In-
ternational Conference on Data Mining (ICDM'07), pp. 43-52.
Bell, Robert, Yehuda Koren, and Chris Volinsky (2007), "Modeling Relationships at multiple
Scales to Improve Accuracy of Large Recommender Systems", in Proceedings of the
13th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD '07), pp. 95-104.
Bell, Robert, Yehuda Koren, and Chris Volinsky (2007b), "The BellKor Solution to the Netflix Prize", http://www2.research.att.com/~volinsky/netflix/ProgressPrize2007BellKorSolution.pdf, [retrieved on 20.06.2011]
Bell, Robert, Yehuda Koren, and Chris Volinsky (2008), "The BellKor 2008 Solution to the
Netflix Prize", http://www2.research.att.com/~volinsky/netflix/Bellkor2008.pdf, [re-
trieved on 20.06.2011]
Bennett, James and Stan Lanning (2007), "The Netflix Prize", in Proceedings of KDD Cup and Workshop, August 12, 2007. www.netflixprize.com
Bettman, James R., Eric J. Johnson, and John W. Payne (1991). “Consumer Decision Mak-
ing.” In Thomas S. Robertson and Harold H. Kassarjian (Eds.) "Handbook of Con-
sumer Behavior", Prentice Hall, pp. 50–84.
Bilgic, Mustafa and Raymond J. Mooney (2005). "Explaining recommendations: Satisfaction
vs. Promotion", in Proceedings of Beyond Personalization 2005: the Workshop on
the Next Stage of Recommender Systems Research at the 2005 International Confer-
ence on Intelligent User Interfaces (IUI'05), pp. 1-6.
Billsus, Daniel and Michael J. Pazzani (1999), "A Personal News Agent that Talks, Learns and Explains", in Proceedings of the 3rd ACM Annual Conference on Autonomous Agents (AGENTS'99), pp. 268-275.
Billsus, Daniel and Michael J. Pazzani (2000), "User Modeling for Adaptive News Access",
in User-Modeling and User-Adapted Interaction Vol. 10(2-3), pp. 147-180.
Billsus, Daniel, Michael J. Pazzani, and James Chen (2000), "A Learning Agent for Wireless News Access", in Proceedings of the 5th ACM International Conference on Intelligent User Interfaces (IUI'00), pp. 33-36.
Bodapati, Anand V. (2008). "Recommendation Systems with Purchase Data", Journal of
Marketing Research, 45 (1), 77-93.
Breese, John S., David Heckerman, and Carl Kadie (1998), “Empirical Analysis of Predictive
Algorithms for Collaborative Filtering”, in Proceedings of the 14th Conference on
Uncertainty in Artificial Intelligence (UAI-98), San Francisco, July 24-26, pp. 43-52.
Brézillon, Patrick J. and Jean-Charles Pomerol (1996). "Misuse and Nonuse of Knowledge-based Systems: The Past Experiences Revisited", in Patrick Humphreys, Liam Bannon, Andrew McCosh, Piero Migliarese and Jean-Charles Pomerol (eds.), "Implementing Systems for Supporting Management Decisions", Chapman and Hall, pp. 44-60.
Buchanan, Bruce G. and Edward H. Shortliffe (1984). "Rule-based Expert Systems: The
MYCIN Experiments of Stanford Heuristic Programming Project", Addison-Wesley,
Reading, MA.
Burke, Robin (2002), "Hybrid recommender systems: Survey and experiments", in User
Modeling and User Adapted Interaction (2002) Vol. 12, Issue: 4, pp. 331–370.
Canny, John (2002), "Collaborative Filtering with Privacy via Factor Analysis", in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'02), pp. 238-245.
Carroll, J. Douglas and Paul E. Green (1995). "Psychometric Methods in Marketing Research:
Part I, Conjoint Analysis", in Journal of Marketing Research, Vol. 32 (4), pp. 385-
391.
Chakrabarti, Soumen (2002), "Mining the Web: Discovering Knowledge from Hypertext Da-
ta", 1st edition, Morgan Kaufmann Publishers, San Francisco.
Chakravarti, Dipankar and John G. Lynch (1983). “A Framework for Examining Context Ef-
fects on Consumer Judgment and Choice”. In R. P. Bagozzi and Alice M. Tybout
(Eds.), “Advances in Consumer Research”, Vol. 10. Ann Arbor, MI: Association of
Consumer Research, pp. 289-297.
Chen, Li (2009). "Adaptive Tradeoff Explanations in Conversational Recommenders", Pro-
ceedings of the third ACM conference on Recommender systems, ACM 225–228.
Claypool, Mark, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry Netes, and Matthew
Sartin (1999), “Combining Content-Based and Collaborative Filters in an Online
Newspaper”, in Proceedings of ACM SIGIR’99 Workshop on Recommender Systems:
Algorithms and Evaluation, pp. 1-8.
Cooke, Alan D. J., Harish Sujan, Mita Sujan, and Barton A. Weitz (2002). "Marketing the Unfamiliar: The Role of Context and Item-Specific Information in Electronic Agent Recommendations", in Journal of Marketing Research, Vol. 39, pp. 488-497.
Cooper-Martin, Elizabeth (1991), “Consumers and Movies: Some Findings on Experiential
Products”, in Advances in Consumer Research 18, pp. 372-378.
Cooper-Martin, Elizabeth (1992), “Consumers and Movies: Information Sources for Experi-
ential Products”, in Advances in Consumer Research 19, pp. 756-761.
Corner, James L. and Craig W. Kirkwood (1991), "Decision Analysis Applications in the Operations Research Literature, 1970-1989", in Operations Research, Vol. 39, Issue 2, pp.
206–219.
Cramer, Henriette, Vanessa Evers, Satyan Ramlal, Maarten Someren, Lloyd Rutledge, Natalia
Stash, Lora Aroyo, Bob Wielinga (2008). "The Effects of Transparency on Trust in
and Acceptance of a Content-based Art Recommender", in User Modeling and User-
Adapted Interaction 18, 5, pp. 455-496.
Das, Abhinandan S., Mayur Datar, Ashutosh Garg, and Shyam Rajaram (2007), "Google
news personalization: scalable online collaborative filtering”, in Proceedings of the
16th international conference on World Wide Web (WWW’07), ACM, New York, pp.
271-280.
Delgado, Joaquin and Naohiro Ishii (1999), “Memory-Based Weighted-Majority Prediction
for Recommender Systems”, in Proceedings of the ACM SIGIR’99, Workshop Rec-
ommender Systems: Algorithms and Evaluation, pp. 1-5.
De Vany, Arthur and David Walls (1999), "Uncertainty in the Movie Industry: Does Star
Power Reduce the Terror of the Box Office?”, in Journal of Cultural Economics, Vol
23, pp. 285-318.
Diehl, Kristin, Laura J. Kornish, and John G. Lynch Jr. (2003), “Smart Agents: When Lower
Search Costs for Quality Information Increase Price Sensitivity,” Journal of Con-
sumer Research, 30 (June), pp. 56-71.
Dick, Alan S. and Kunal Basu (1994), "Customer Loyalty: Toward an Integrated Conceptual Framework", in Journal of the Academy of Marketing Science, Vol. 22, Issue 2, pp. 99-113.
Ding, Yi and Xue Li (2005), "Time Weight Collaborative Filtering", in Proceedings of the
14th ACM International Conference on Information and Knowledge Management,
pp. 485-492.
Doyle, Dónal, Alexey Tsymbal, and Pádraig Cunningham (2003). "A Review of Explanation
and Explanation in Case-based Reasoning", Technical Report, Department of Com-
puter Science, Trinity College, Dublin.
Edwards, Ward (1954), "The theory of decision making", in Psychological Bulletin, Vol. 51,
380-417.
Edwards, Ward (1961), "Behavioral decision theory", in Annual Review of Psychology, Vol.
12, pp. 473–498.
El Helou, Sandy, Christophe Salzmann, Stéphane Sire, and Denis Gillet (2009), "The 3A
contextual ranking system: simultaneously recommending actors, assets, and group
activities", in RecSys '09 Proceedings of the third ACM conference on Recommender
systems, pp. 373-376.
Fishburn, Peter C. (1967), "Methods of Estimating Additive Utilities", in Management
Science, Vol. 13 (7), pp. 435-453.
Fishburn, Peter C. (1968), "Utility Theory", in Management Science, Vol. 14 (5), pp. 335-378.
Fishburn, Peter C. (1970), "Utility Theory for Decision Making", Wiley, New York.
Fishburn, Peter C. (1988), "Nonlinear Preference and Utility Theory", Johns Hopkins
University Press, Baltimore.
Fitzsimons, Gavan J. and Donald R. Lehmann (2004), "Reactance to Recommendations:
When Unsolicited Advice Yields Contrary Responses", in Marketing Science, Vol. 23
(1), pp. 82-94.
Funk, Simon (2006), "Netflix Update: Try this at Home", retrieved at http://sifter.org/~simon/
journal/20061211.html, on 04.06.2011.
Gershoff, Andrew D., Ashesh Mukherjee, and Anirban Mukhopadhyay (2003), "Consumer
acceptance of online agent advice: Extremity and positivity effects", in Journal of
Consumer Psychology, Vol. 13, pp. 161-170.
Gigerenzer, Gerd, Peter M. Todd, and the ABC Research Group (1999), "Simple heuristics
that make us smart", New York: Oxford University Press.
Goldberg, David, David Nichols, Brian M. Oki, and Douglas Terry (1992). "Using
collaborative filtering to weave an information tapestry", in Communications of the
ACM, Vol. 35 (12), pp. 61-70.
Goldberg, Ken, Theresa Roeder, Dhruv Gupta, and Chris Perkins (2001), “Eigentaste: A Con-
stant Time Collaborative Filtering algorithm”, in Information Retrieval, Vol. 4, No. 2,
pp. 133-151.
Golub, Gene H. and William Kahan (1965), "Calculating the Singular Values and Pseudo-
inverse of a Matrix", in Journal of the Society for Industrial and Applied Mathematics,
Series B: Numerical Analysis, Vol. 2, No. 2, pp. 205-224.
Green, Paul E., Yoram Wind, and Arun K. Jain (1972), "Preference Measurement of Item
Collections", in Journal of Marketing Research, Vol. 9, pp. 371-377.
Green, Paul E. and Yoram Wind (1973), "Multiattribute Decisions in Marketing: A Measure-
ment Approach", Hinsdale, IL.
Green, Paul E. and V. Srinivasan (1990), "Conjoint Analysis in Marketing: New Developments
With Implications for Research and Practice", in Journal of Marketing, October
1990, pp. 3-15.
Grudin, Jonathan (1988), "Why CSCW Applications Fail: Problems in the Design and Eval-
uation of Organizational Interfaces", in Proceedings of the 1988 ACM Conference on
Computer-Supported Cooperative Work (CSCW '88), pp. 85-93.
Gunawardana, Asela and Christopher Meek (2009), “A Unified Approach to Building Hybrid
Recommender Systems”, in RecSys '09 Proceedings of the third ACM conference on
Recommender systems, pp. 117-124.
Hennig-Thurau, Thorsten, Christian Friege, Sonja Gensler, Lara Lobschat, Arvind
Rangaswamy, and Bernd Skiera (2010). "The Impact of New Media on Customer
Relationships", in Journal of Service Research, August 11, 2010 Vol. 13, No. 3, pp.
311-330.
Hennig-Thurau, Thorsten, Mark B. Houston, and Gianfranco Walsh (2006), "Differing Roles of
Success Drivers Across Sequential Channels: An Application to the Motion Picture
Industry", in Journal of the Academy of Marketing Science, Vol. 34 (4), pp. 559-575.
Hennig-Thurau, Thorsten, Mark B. Houston, and Gianfranco Walsh (2007), "Determinants of
Motion Picture Box Office and Profitability: an interrelationship approach", in Re-
view of Managerial Science, Vol. 1(1), pp. 65-92.
Hennig-Thurau, Thorsten and Alexander Klee (1997), "The Impact of Customer Satisfaction
and Relationship Quality on Customer Retention: A Critical Reassessment and Mod-
el Development", in Psychology & Marketing, Vol. 14, pp. 737-764.
Hennig-Thurau, Thorsten, André Marchand, and Paul Marx (2011), "Can Automated Recom-
mender Systems Lead to Better Group Decisions?", AMA Winter Educators' Con-
ference, Track 10.
Hennig-Thurau, Thorsten, Gianfranco Walsh, and Oliver Wruck (2001), "An Investigation
into the Success Factors of Motion Pictures”, in Academy of Marketing Science Re-
view, (at amsreview.org/amsrev/theory/hennig06-01.html).
Herlocker, Jonathan, Joseph A. Konstan, Al Borchers, and John T. Riedl (1999), "An algo-
rithmic framework for performing collaborative filtering", in SIGIR '99: Proceedings
of the 22nd Annual International ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, pp. 230-237.
Herlocker, Jonathan L., Joseph A. Konstan, and John T. Riedl (2000), "Explaining Collabo-
rative Filtering Recommendations", in Proceedings of the 2000 ACM conference on
Computer supported cooperative work, ACM New York, NY, USA, pp. 241–250.
Herlocker, Jonathan L., Joseph A. Konstan, and John T. Riedl (2002), "An Empirical Analy-
sis of Design Choices in Neighborhood-based Collaborative Filtering Algorithms", in
Information Retrieval, Vol 5, No. 4, pp. 287–310.
Herlocker, Jonathan L., Joseph A. Konstan, Loren G. Terveen, and John T. Riedl (2004),
"Evaluating Collaborative Filtering Recommender Systems", in ACM Transactions on
Information Systems, Vol. 22 (1), pp. 5-53.
Hill, Will, Larry Stead, Mark Rosenstein, and George Furnas (1995), "Recommending and
Evaluating Choices in a Virtual Community of Use", in Proceedings of ACM CHI’95
Conference on Human Factors in Computing Systems, pp.194–201.
Hirschman, Elizabeth C. and Morris B. Holbrook (1982), "Hedonic Consumption: Emerging
Concepts, Methods and Propositions", in Journal of Marketing, Vol. 46 (Summer),
pp. 92-101.
Holbrook, Morris B. and Hirschman, Elizabeth C. (1982) “The Experiential Aspects of Con-
sumption: Consumer Fantasies, Feelings, and Fun”, in Journal of Consumer Re-
search, Vol. 9 (September), pp. 132-140.
Horvitz, Eric, John Breese, and Max Henrion (1988). "Decision Theory in Expert Systems
and Artificial Intelligence", in International Journal of Approximate Reasoning, Spe-
cial Issue on Uncertainty in Artificial Intelligence, 2 (3), pp. 247-302. Also, Stanford
CS Technical Report KSL-88-13.
Ito, Tiffany A., Jeff T. Larsen, N. Kyle Smith, and John T. Cacioppo (1998). "Negative In-
formation Weighs More Heavily on the Brain: The Negativity Bias in Evaluative
Categorizations", in Journal of Personality and Social Psychology, Vol. 75, No. 4,
pp. 887-900.
Jacoby, Jacob, Donald E. Speller, and Carol Kohn Berning (1974). "Brand choice behavior as
a function of information load: Replication and extension", in Journal of Consumer
Research, 1, 33–42.
Jannach, Dietmar, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich (2011),
"Recommender Systems: An Introduction", Cambridge University Press, New York,
2011.
Johnson, Harry and Peter Johnson (1993). "Explanation Facilities and Interactive Systems", in
IUI '93 Proceedings of the 1st international conference on Intelligent user interfaces,
ACM New York, NY, USA, pp. 159-166.
Johnston, Jack and John DiNardo (1997), "Econometric Methods", 4th edition, McGraw-Hill,
New York.
Kahneman, Daniel and Amos Tversky (1984). "Choices, values, and frames", in American
Psychologist, Vol. 39, pp. 341–350.
Kmenta, Jan (1971), "Elements of Econometrics", Macmillan, New York.
Kanouse, David E. and Reid L. Hanson (1972), "Negativity in Evaluations", in Attribution:
Perceiving the Causes of Behavior, eds. Edward E. Jones and David E. Kanouse,
Hillsdale, NJ: Lawrence Erlbaum Associates, Inc., pp. 47-62.
Keefer, Donald L., Craig W. Kirkwood, and James L. Corner (2002), "Summary of Decision
Analysis Applications in the Operations Research Literature 1990-2001", Technical
Report, Department of Supply Chain Management, Arizona State University
(retrieved at http://www.informs.org/content/download/14833/178547/file/DAAppsSummaryTechReport.pdf, 30.06.2011).
Kim, Dohyun and Bong-Jin Yum (2005), “Collaborative Filtering Based on Iterative Principal
Component Analysis”, in Expert Systems with Applications, Vol. 28, pp. 823-830.
Klein, Noreen M. and Manjit S. Yadav (1989), “Context Effects on Effort and Accuracy in
Choice: An Inquiry into Adaptive Decision Making.” In Journal of Consumer Re-
search, Vol. 15 (4), pp. 411–421.
Komarek, Paul (2004), "Logistic Regression for Data Mining and High-dimensional Classifi-
cation", Doctoral Dissertation, Carnegie Mellon University, Pittsburgh, PA, USA [re-
trieved at http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:
lr_thesis.pdf, on 05.07.2011].
Konstan, Joseph A., Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee R. Gordon,
and John Riedl (1997), "GroupLens: Applying Collaborative Filtering to Usenet
News", in Communications of the ACM, Vol. 40, No. 3, pp. 77-87.
Konstan, Joseph A., John Riedl, Al Borchers, and Jonathan L. Herlocker (1998) “Recom-
mender Systems: A GroupLens Perspective”, in Recommender Systems: Papers from
the 1998 Workshop (AAAI Technical Report WS-98), Vol. 8, pp. 60-64.
Koren, Yehuda (2008), "Factorization Meets the Neighborhood: A Multifaceted Collaborative
Filtering Model", in Proceeding of the 14th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 426-434.
Koren, Yehuda (2009), "Collaborative Filtering with Temporal Dynamics", in Proceedings of
the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 447-456.
Koren, Yehuda (2010), "Factor in the Neighbors: Scalable and Accurate Collaborative Filter-
ing", in ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 4, No.
1, pp. 1-24.
Koren, Yehuda, Robert Bell, and Chris Volinsky (2009), "Matrix Factorization Techniques for
Recommender Systems", in IEEE Computer, Vol. 42, Issue 8, pp. 42-49.
Koren, Yehuda, and Robert Bell (2011), “Advances in Collaborative Filtering”, in Ricci,
Francesco, Lior Rokach, Bracha Shapira, Paul B. Kantor [eds.] (2011). "Recom-
mender Systems Handbook", Springer Science + Business Media LLC, pp. 145 -
186.
Lacave, Carmen and Francisco J. Díez (2004), "A review of explanation methods for heuristic
expert systems", in The Knowledge Engineering Review, Vol. 19, pp. 133-146.
Lange, Kenneth (2010), "Optimization (Springer Texts in Statistics)", Springer Verlag New
York LLC.
Lam, Shyong K. and John Riedl (2004), “Shilling Recommender Systems for Fun and Profit”,
in Proceedings of the 13th international conference on World Wide Web, WWW’04,
pp. 393–402.
Linden, Greg, Brent Smith, and Jeremy York (2003), “Amazon.com Recommendations: Item-
to-item Collaborative Filtering”, in Internet Computing, IEEE, pp. 76-80.
Lops, Pasquale, Marco de Gemmis, and Giovanni Semeraro (2011), "Content-based Recom-
mender Systems: State of the Art and Trends", in Ricci, Francesco, Lior Rokach,
Bracha Shapira, Paul B. Kantor [eds.] (2011). "Recommender Systems Handbook",
Springer Science + Business Media LLC, pp. 73 - 105.
Luce, R. Duncan (1992), "Where does subjective expected utility fail descriptively?", in
Journal of Risk and Uncertainty, Vol. 5, pp. 5-27.
Lutz, Richard J. (1975), "Changing Brand Attitudes through Modification of Cognitive Struc-
ture", in Journal of Consumer Research, 1 (March), pp. 49 - 59.
Maimon, Oded and Lior Rokach (eds.) (2005), "The Data Mining and Knowledge Discovery
Handbook", Springer Science+Business Media Inc.
Majchrzak, Ann and Les Gasser (1991). “On using Artificial intelligence to integrate the de-
sign of organizational and process change in US manufacturing”, AI and society,
Vol. 5, pp 321-338.
McNee, Sean M., Shyong K. Lam, Joseph A. Konstan, and John Riedl (2003), "Interfaces for
Eliciting New User Preferences in Recommender Systems", in The 9th International
Conference on User Modeling (UM'2003), pp. 178–187.
McSherry, David (2005), "Explanation in recommender systems", in Artificial Intelligence
Review, Vol. 24, Issue 2, pp. 179 – 197.
Mehta, Bhaskar, Thomas Hofmann, and Wolfgang Nejdl (2007), "Robust Collaborative Filtering",
in Proceedings of the 2007 ACM conference on Recommender Systems, pp. 49-56.
Melville, Prem, Raymond J. Mooney, and Ramadass Nagarajan (2002), "Content-boosted
Collaborative Filtering", in Proceedings of the 18th National Conference on Artificial
Intelligence (AAAI-2002), pp. 187-192.
Mild, Andreas and Martin Natter (2002). "Collaborative Filtering or Regression Models for
Internet Recommendation Systems?", in Journal of Targeting, Measurement and
Analysis for Marketing, Vol. 10, Issue 4, pp. 304-313.
Miller, Christopher A. and Raymond Larson (1992). "An Explanatory and "Argumentative"
Interface for a Model-based Diagnostic System", in Proceedings of the 5th annual
ACM symposium on User interface software and technology (UIST'92), ACM, pp.
43-52.
Mladenic, Dunja (1999), “Text-learning and Related Intelligent Agents: A Survey”, in IEEE
Intelligent Systems, Vol. 14, No. 4, pp. 44-54.
Mobasher, Bamshad, Robin Burke, Runa Bhaumik, and Chad Williams (2007), "Towards
Trustworthy Recommender Systems: An Analysis of Attack Models and Algorithm
Robustness", in ACM Transactions on Internet Technology, Vol. 7, No. 2, pp.23-60.
Moon, Sangkil, Paul K. Bergey, and Dawn Iacobucci (2010), "Dynamic Effects Among Mov-
ie Ratings, Movie Revenues, and Viewer Satisfaction", in Journal of Marketing, Vol.
74, pp. 108-121.
Mooney, Raymond J. and Loriene Roy (1999), "Content-based Book Recommending Using
Learning for Text Categorization", in Proceedings of the ACM SIGIR'99 Workshop on
Recommender Systems: Algorithms and Evaluation.
Mooney, Raymond J. and Loriene Roy (2000), "Content-based Book Recommending Using
Learning for Text Categorization", in Proceedings of the Fifth ACM Conference on
Digital Libraries, San Antonio, TX, pp. 195-204.
Moore, Johanna D. and William R. Swartout (1988). "Explanation in expert systems: A sur-
vey", Research Report RR-88-228, University of Southern California, Marina Del
Rey, CA, 1988.
Moore, Carolyn A., David Bednall, and Stewart Adam (2005), "Genre, Gender and Interpre-
tation of Movie Trailers: An Exploratory Study", in ANZMAC 2005: Broadening the
boundaries, conference proceedings, ANZMAC, Dunedin, N.Z., pp. 124-130.
Myers, James H. (1996), "Segmentation and Positioning for Strategic Marketing Decisions",
American Marketing Association, Chicago, IL USA, 1996.
Nakamura, Atsuyoshi and Naoki Abe (1998), “Collaborative Filtering Using Weighted Ma-
jority Prediction Algorithms”, in ICML '98: Proceedings of the 15th International
Conference on Machine Learning, pp. 395-403.
Neumann, Andreas W. (2009). "Recommender Systems for Information Providers", Physica-
Verlag Heidelberg
O'Donovan, John and Barry Smyth (2005). "Trust in Recommender Systems", in IUI'05 Pro-
ceedings of the 10th international conference on Intelligent user interfaces, ACM
New York, NY, USA, pp. 167-174.
O'Sullivan, Derry, Barry Smyth, and David C. Wilson (2004), "Preserving Recommender
Accuracy and Diversity in Sparse Datasets", in International Journal on Artificial In-
telligence Tools, Vol. 13, Issue 1, pp. 219-236.
O'Sullivan, Derry, Barry Smyth, David C. Wilson, Kieran McDonald, and Alan Smeaton
(2004), "Improving the quality of the personalized electronic program guide", in Us-
er Modeling and User-Adapted Interaction, Vol. 14, Issue 1, pp. 5-36.
Park, Seung-Taek and Wei Chu (2009), “Pairwise preference regression for cold-start recom-
mendation”, in RecSys '09 Proceedings of the third ACM conference on Recom-
mender systems, pp. 21-28.
Paterek, Arkadiusz (2007), "Improving Regularized Singular Value Decomposition for Col-
laborative Filtering", in Proceedings of KDD Cup Workshop at SIGKDD'07, 13th
ACM International Conference on Knowledge Discovery and Data Mining, pp. 39-
42.
Payne, John W., James R. Bettman, and Eric Johnson (1988), "Adaptive Strategy Selection in
Decision Making", in Journal of Experimental Psychology: Learning, Memory and
Cognition, Vol. 14 (July), pp. 534-552.
Payne, John W., James R. Bettman, and Eric Johnson (1993), The Adaptive Decision Maker.
Cambridge, UK: Cambridge University Press.
Pazzani, Michael J. (1999), "A Framework for Collaborative, Content-Based, and Demo-
graphic Filtering", in Artificial Intelligence Review - Special issue on data mining on
the Internet, pp. 393-408.
Pazzani, Michael J. and Daniel Billsus (2007), "Content-based Recommendation Systems", in
The Adaptive Web, pages 325-341.
Prag, Jay and James Casavant (1994) “An Empirical Study of the Determinants of Revenues
and Marketing Expenditures in the Motion Picture Industry”. in Journal of Cultural
Economics, Vol. 18, pp. 217-235.
Press, William H., Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery (2007), "Nu-
merical Recipes: The Art of Scientific Computing", Cambridge University Press, 3rd
edition.
Rashid, Al Mamunur, Istvan Albert, Dan Cosley, Shyong K. Lam, Sean M. McNee, Joseph A.
Konstan, and John Riedl (2002), “Getting to Know You: Learning New User Prefer-
ences in Recommender Systems”, in Proceedings of the International Conference on
Intelligent User Interfaces, pp. 127–134.
Resnick, Paul, Neophytos Iakovou, Mitesh Sushak, Peter Bergstrom, and John Riedl (1994),
“GroupLens: An Open Architecture for Collaborative Filtering of Netnews”, in Pro-
ceedings of ACM CSCW’94 Conference on Computer Supported Cooperative Work,
pp. 175-186.
Resnick, Paul and Rahul Sami (2007), "The Influence Limiter: Provably Manipulation-resistant
Recommender Systems", in Proceedings of the 2007 ACM conference on Recom-
mender systems RecSys'07, pp. 25-32.
Ricci, Francesco, Lior Rokach, Bracha Shapira (2011), "Introduction to Recommender Sys-
tems Handbook", in Ricci, Francesco, Lior Rokach, Bracha Shapira, Paul B. Kantor
[eds.] (2011). "Recommender Systems Handbook", Springer Science + Business
Media LLC, pp. 1 - 35.
Rutkowski, Anne-Francoise, Alea Fairchild, and John B. Rijsman (2004), "Group Decision Sup-
port Systems and Patterns of Interpersonal Communication to Improve Ethical Nego-
tiation in Dyads", in European Journal of Social Psychology, Vol. 9, pp. 11-30.
Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton (2007), "Restricted Boltzmann
Machines for Collaborative Filtering", in Proceedings of the 24th International Con-
ference on Machine Learning, pp. 791-798.
Salton, Gerard, Anita Wong, and Chung-Shu Yang (1975), "A Vector Space Model for Infor-
mation Retrieval", in Journal of the American Society for Information Science, Vol.
18, No. 11, pp. 613-620.
Salton, Gerard and Christopher Buckley (1988), "Term-weighting Approaches in Automatic
Text Retrieval", in Information Processing and Management, Vol. 25, No. 5, pp.
513-523.
Sandvig, J. J., Bamshad Mobasher, and Robin Burke (2007), "Robustness of Collaborative
Recommendation Based on Association Rule Mining", in Proceedings of the 2007
ACM conference on Recommender systems, pp. 105-111.
Sarwar, Badrul M., George Karypis, Joseph A. Konstan, and John T. Riedl (2000), "Applica-
tion of Dimensionality Reduction in Recommender System - A Case Study", in
ACM WebKDD 2000 Web Mining for E-Commerce Workshop, pp. 285-289.
Sarwar, Badrul M., George Karypis, Joseph Konstan, and John T. Riedl (2001), "Item-Based
Collaborative Filtering Recommendation Algorithms”, in WWW '01 Proceedings of
the 10th International Conference on World Wide Web ACM New York, NY, USA,
pp. 285-295.
Sarwar, Badrul M., George Karypis, Joseph Konstan, and John T. Riedl (2002), "Incremental
Singular Value Decomposition algorithms for Highly Scalable Recommender Sys-
tems”, in ICCIT '02 Proceedings of the 5th International Conference on Computer
and Information Technology, pp. 399-404.
Sawhney, Mohanbir S. and Jehoshua Eliashberg (1996) “A Parsimonious Model of Forecast-
ing Gross Box-Office Revenues of Motion Pictures”, in Marketing Science, Vol. 15,
Issue 2, pp. 113-131.
Seyerlehner, Klaus, Arthur Flexer, and Gerhard Widmer (2009), “On the Limitations of
Browsing Top-N Recommender Systems”, in Proceedings of the third ACM confer-
ence on Recommender systems, pp. 321-324.
Shardanand, Upendra and Patti Maes (1995), "Social Information Filtering: Algorithms for
Automating 'Word of Mouth'", in Proceedings of ACM CHI'95 Conference on Hu-
man Factors in Computing Systems, pp. 210-217.
Schafer, Ben J., Joseph A. Konstan, and John Riedl (1999). "Recommender Systems in E-
Commerce", Proceedings of the First ACM Conference on Electronic Commerce,
Denver, CO, 158-166.
Schafer, Ben J., Joseph A. Konstan, and John Riedl (2001). "E-Commerce Recommendation
Applications", Data mining and Knowledge Discovery. 5 (1-2), 115-153.
Schafer, Joseph L. and John W. Graham (2002), "Missing Data: Our View of the State of the
Art", in Psychological Methods Vol. 7, No. 2, pp. 147-177.
Schwab, Ingo, Alfred Kobsa, and Ivan Koychev (2001), “Learning User Interests through
Positive Examples Using Content Analysis and Collaborative Filtering”, in User
Modeling and User-Adapted Interaction.
Senecal, Sylvain and Jacques Nantel (2004), "The Influence of Online Product Recommenda-
tions on Consumers' Online Choices", in Journal of Retailing, Vol. 80 (2), pp. 159-169.
Simon, Herbert A. (1982), "Models of bounded rationality", Cambridge, MA: MIT Press.
Sinha, Rashmi and Kirsten Swearingen (2002), "The Role of Transparency in Recommender
Systems", in Conference on Human Factors in Computing Systems, ACM New York,
NY, USA, pp. 830-831.
Shortliffe, Edward H. and Bruce G. Buchanan (1975). "A model of inexact reasoning in med-
icine". Mathematical Biosciences Vol. 23 (3-4), pp. 351–379.
Soboroff, Ian M. and Charles Nicholas (1999), “Combining Content and Collaboration in
Text Filtering”, in Proceedings of the IJCAI-99 Workshop on Machine Learning for
Information Filtering, Vol. 99, pp. 86-91.
Sørmo, Frode, Jörg Cassens, and Agnar Aamodt (2005). "Explanation in Case-Based Reason-
ing: Perspectives and Goals", in Artificial Intelligence Review, Volume 24 Issue 2,
Kluwer Academic Publishers, pp. 145-161.
Symeonidis, Panagiotis, Alexandros Nanopoulos, and Yannis Manolopoulos (2007), "Feature-
weighted User Model for Recommender Systems", in UM '07 Proceedings of the
11th international conference on User Modeling, pp. 97-106.
Symeonidis, Panagiotis, Alexandros Nanopoulos, and Yannis Manolopoulos (2008), "Providing
Justifications in Recommender Systems", IEEE Transactions on Systems, MAN, and
Cybernetics, Vol. 38, No. 6, pp. 1262-1272.
Symeonidis, Panagiotis, Alexandros Nanopoulos, and Yannis Manolopoulos (2009), "MoviEx-
plain: a recommender system with explanations", in Proceedings of the third ACM con-
ference on Recommender systems, pp. 317-320.
Takács, Gábor, István Pilászy, Bottyán Németh, and Domonkos Tikk (2007), "Major Compo-
nents of the Gravity Recommendation System", in SIGKDD Explorations, Vol. 9,
No. 2., pp. 80-84.
Tang, Tiffany Ya, Pinata Winoto, and Keith C. C. Chan (2003) "On the Temporal Analysis
for Improved Hybrid Recommendations", in WI '03 Proceedings of the 2003
IEEE/WIC International Conference on Web Intelligence. pp. 214-220.
Terveen, Loren, Jessica McMackin, Brian Amento, and Will Hill (2002), "Specifying prefer-
ences based on user history", in Proceedings of the Conference on Human Factors in
Computing Systems, pp. 315-322.
Tintarev, Nava (2007), "Explanations of recommendations", in Proceedings of the 2007 ACM
Conference on Recommender Systems (RecSys'07), Minneapolis, MN, pp. 203-206.
Tintarev, Nava and Judith Masthoff (2007), "Effective Explanations of Recommendations:
User-Centered Design", in Proceedings of the 2007 ACM conference on Recom-
mender systems, ACM New York, NY, USA 153–156.
Tintarev, Nava and Judith Masthoff (2008), "The Effectiveness of Personalized Movie Ex-
planations: An Experiment Using Commercial Meta-data", in AH '08 Proceedings of
the 5th international conference on Adaptive Hypermedia and Adaptive Web-Based
Systems, pp. 204-213.
Tintarev, Nava and Judith Masthoff (2011), "Designing and Evaluating Explanations for
Recommender Systems", in Ricci, Francesco, Lior Rokach, Bracha Shapira, Paul B.
Kantor [eds.] (2011). "Recommender Systems Handbook", Springer Science + Busi-
ness Media LLC, pp. 479-510.
Thompson, Clive (2008), "If You Liked This, You're Sure to Love That", in The New York Times,
November 23, 2008, http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html.
Tran, Thomas and Robin Cohen (2000), “Hybrid Recommender Systems for Electronic
Commerce”, in Knowledge-Based Electronic Markets, Papers from the AAAI Work-
shop, AAAI Technical Report WS-00-04, AAAI Press, pp. 78-83.
Tran-Le, Esther (2010), "NYC Pandora Listener Meet Up", blog entry, March 22, 2010,
http://esthertranle.com/wordpress/2010/03/23/nyc-pandora-listener-meet-up, re-
trieved on 15.06.2011.
Tsymbal, Alexey (2004), “The Problem of Concept Drift: Definitions and Related Work”,
Technical Report TCD-CS-2004-15, Trinity College Dublin.
Tversky, Amos (1967), "Additivity, Utility, and Subjective Probability", in Journal of Math-
ematical Psychology, Vol 4, pp. 175-201.
Uchyigit, Gulden and Matthew Y. Ma [Eds.] (2008), "Personalization Techniques and Rec-
ommender Systems: Series in Machine Perception and Artificial Intelligence - Vol.
70", World Scientific Publishing Co. Pte. Ltd. 2008
von Winterfeldt, Detlof and Ward Edwards (1986), "Decision analysis and behavioral re-
search", New York: Cambridge University Press.
Wei, Chang-Ping, Michael J. Shaw, and Robert F. Easley (2002), "A Survey of Recommenda-
tion Systems in Electronic Commerce", in Roland T. Rust and P.K. Kannan [eds.], "e-
Service: New Directions in Theory and Practice" (2002), M.E. Sharpe, Armonk,
New York, London, England.
Weiss, Jie W., David J. Weiss, and Ward Edwards (2009), "A Descriptive Multi-attribute
Utility Model for Everyday Decisions", in Theory and Decision, Vol. 68, Issues (1-
2), pp. 101-114.
Wright, Peter (1974), "The Harassed Decision Maker: Time Pressures, Distractions, and the
Use of Evidence", in Journal of Applied Psychology, 59 (October), pp. 555-561.
Ying, Yuanping, Fred Feinberg, Michel Wedel (2006). "Leveraging Missing Ratings to Im-
prove Online Recommendation Systems", in Journal of Marketing Research, Vol.
XLIII, August, pp. 355-365.
Zanker, Markus, Sergiu Gordea, Markus Jessenitschnig, and Michael Schnabl (2006), "A Hy-
brid Similarity Concept for Browsing Semi-structured Product Items", in Proceed-
ings of 7th International Conference on Electronic Commerce and Web Technologies
(EC-Web), Springer 2006 (LNCS, 4082), pp. 21-30.
Zaslow, Jeffrey (2002), "If TiVo Thinks You Are Gay, Here's How to Set It Straight", in Wall
Street Journal - Eastern Edition, 11/26/2002, Vol. 240, Issue 105, p. A1.
Zhan, Sinan, Fengrong Gao, Chunxiao Xing, and Lizhu Zhou (2006), "Addressing Concept
Drift Problem in Collaborative Filtering Systems", in Proceedings of the 17th Euro-
pean Conference on Artificial Intelligence, pp. 34-39.
Zhang, Yi, Jamie Callan, and Thomas Minka (2002), “Novelty and Redundancy Detection in
Adaptive Filtering”, in Proceedings of the 25th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval SIGIR '02, pp.
81-88.
Zhang, Mi (2009), "Enhancing Diversity in Top-N Recommendation", in Proceedings of the
third ACM conference on Recommender systems, pp. 397-400.
Zhao, Yangchang, Chengqi Zhang, and Shichao Zhang (2005), "A Recent-biased Dimension
Reduction Technique for Time Series Data", in ACM Proceedings of the 9th Pacific-
Asia Conference on Knowledge Discovery and Data Mining (PAKDD'05), pp. 751-
758.
Ziegler, Cai-Nicolas, Sean M. McNee, Joseph A. Konstan, and Georg Lausen (2005), "Im-
proving Recommendation Lists Through Topic Diversification", in Proceedings of
the International World Wide Web Conference WWW'05, pp. 22-32.
Appendix A: Sources of Error in Recommender Systems
Automated recommender systems are in essence stochastic processes that infer their
recommendations from heuristic approximations of human decision processes by means of
numerical algorithms, and their computations are performed on extremely sparse and
incomplete data. Together, these two conditions yield recommendations that are often correct
and reliable but occasionally very wrong; that is, the suggestions generated by RS are subject
to errors. According to Herlocker, Konstan, and Riedl (2000), the sources of error can be
roughly grouped into two categories: model/process errors and data errors. We agree with this
classification and extend its understanding below.
MODEL/PROCESS ERRORS
Model or process errors occur when the computational process employed by the RS for
generating recommendations does not appropriately reflect the user's intrinsic decision pro-
cess and thus does not match his or her requirements. This can happen, for example, due to:
Multiattribute preferences. Multiattribute utility (MAU) models have a long history in
the research fields of decision making and marketing (e.g., Edwards 1954; Tversky 1967;
Green, Wind, and Jain 1972; Green and Wind 1973; Luce 1992). According to MAU theory,
people make choices using an intrinsic utility function that sums the attribute-related
preferences for the items under consideration, i.e., those contained in the evoked set of choice
alternatives. The item with the highest utility for a given consumer has the highest probability
of being chosen. Although research on motion picture success factors has shown that movie
attributes such as actors, directors, genres, budgets, country of origin, and awards
significantly influence a movie's success as an expression of consumer preferences (Hennig-
Thurau, Houston, and Walsh 2006), contemporary movie recommender systems still fail to
adequately incorporate such attribute characteristics and to account for attribute-related
consumer preferences within the recommendation process. The reason is the limited ability of
information-processing algorithms to automatically extract meaningful attributes descriptive
of multimedia content (Wei, Shaw, and Easley 2002; Pazzani and Billsus 1997; Lops, de
Gemmis, and Semeraro 2011). Where preferences for movie attributes have been used in
extant work, the choice of attributes was either based on information availability rather than
on a thorough study of relevant attributes (e.g., Ying, Feinberg, and Wedel 2006), or the
attributes were used only for post-processing of the generated recommendations (e.g.,
Symeonidis, Nanopoulos, and Manolopoulos 2009). It follows that RS fail to model users'
attribute-related preferences to the full extent, and the recommendation process can thus lead
to erroneous recommendations.
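To make the additive utility idea concrete, the following sketch scores a small evoked set of movies against one consumer's attribute part-worths. It is an illustrative toy, not the dissertation's model; all attribute levels, part-worth values, and movie data are hypothetical.

```python
# Minimal illustrative sketch of an additive multiattribute utility (MAU)
# choice rule. All attribute levels and part-worth values are hypothetical.

# One consumer's attribute-related preferences (part-worths).
PART_WORTHS = {
    "genre": {"drama": 0.8, "comedy": 0.3, "horror": -0.5},
    "director": {"Spielberg": 0.6},
}

def additive_utility(attributes, part_worths):
    """Sum the consumer's part-worths over the item's attribute levels;
    unknown attributes or levels contribute nothing."""
    return sum(
        part_worths.get(attr, {}).get(level, 0.0)
        for attr, level in attributes.items()
    )

def choose(evoked_set, part_worths):
    """Pick the alternative with the highest additive utility."""
    return max(evoked_set, key=lambda m: additive_utility(m["attributes"], part_worths))

evoked_set = [
    {"title": "A", "attributes": {"genre": "drama", "director": "Spielberg"}},
    {"title": "B", "attributes": {"genre": "horror", "director": "Bay"}},
]
best = choose(evoked_set, PART_WORTHS)  # A: 0.8 + 0.6 = 1.4; B: -0.5
```

Here, movie A accumulates a utility of 1.4 and is chosen over B. A real RS would have to estimate such part-worths from behavioral data rather than assume them, which is precisely the capability the text above finds lacking.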
Concept or interest drift. It is not uncommon for people to change their interests. This is especially evident in the domain of movie recommendations: movies go in and out of fashion, and users may adopt new views on actors, genres, directors, etc. In the RS literature, this phenomenon is referred to as "concept drift" (Billsus and Pazzani 2000) or "interest drift" (Burke 2002). Traditional RS, however, do not consider user interest drift and thus cannot reflect changes in user preferences (Zhan et al. 2006). To our knowledge, only a few studies have focused on this problem. Tang, Winoto, and Chan (2003), for example, suggested that a movie's production year reflects the situational environment in which the movie was filmed and thus might significantly affect users' feature preferences. For this reason, they propose discounting user preferences for earlier movies while boosting those for newer ones in the recommendation process, i.e., assigning higher weights to user ratings for newer movies. Similarly, other works suggest using the date on which the ratings were collected as a basis for the weight assignment. Accordingly, greater weights are assigned to recent data, while older data is either decayed or completely removed from the computational process (Terveen et al. 2002; Zhao, Zhang, and Zhang 2005; Ding and Li 2005). Zhan et al. (2006) proposed an iterative data weighting method that can also capture recurring user interests. However, these weighting methods remain open to model and process errors caused by interest drift, as they rely solely on time as a descriptor of interest drift at an aggregated level and do not consider changes in attribute preferences.
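The time-based weighting schemes discussed above can be illustrated with a minimal sketch. Python is used here purely for illustration (our own implementation, as noted in Appendix C, is in C#), and the half-life of 180 days is an arbitrary assumption, not a parameter taken from the cited works:

```python
def decay_weight(age_days, half_life_days=180.0):
    """Exponential time-decay weight: a rating loses half of its
    influence every half_life_days days (illustrative choice)."""
    return 0.5 ** (age_days / half_life_days)

def weighted_mean_rating(ratings):
    """Predict an item's rating as the decay-weighted mean of
    (rating, age_in_days) pairs, so that recent ratings dominate."""
    num = sum(r * decay_weight(age) for r, age in ratings)
    den = sum(decay_weight(age) for _, age in ratings)
    return num / den

# A recent 5-star rating outweighs a two-year-old 1-star rating:
print(weighted_mean_rating([(5.0, 0), (1.0, 720)]))  # ≈ 4.76, not 3.0
```

With such a scheme, recently observed preferences dominate the prediction, which is the intended behavior under interest drift; recurring interests, as targeted by Zhan et al. (2006), would require a more elaborate, non-monotone weighting.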
Contextual factors. Traditional RS assume homogeneity of context, i.e., the decision on what to recommend does not depend on when the recommendation is requested (Adomavicius et al. 2005). Behavioral research in marketing, however, has shown that consumer decision making, rather than being invariant, depends on the context in which the decision is made: one and the same consumer may prefer different products or brands and even employ different decision-making strategies in different contexts (Chakravarti and Lynch 1983; Klein and Yadav 1989; Bettman, Johnson, and Payne 1991). Because of the huge variety of imaginable user contexts, however, it seems impossible for RS to collect all the data needed to suitably account for them. Implicit collection of context information, though possible, is constrained to the information that is automatically available to the system or can be queried from real-time databases, e.g., daytime, season, weather conditions, the traffic situation, or the user's GPS coordinates. Actively querying users for additional information about their contexts would contradict one of the main principles of RS, namely simplifying the users' choice making by minimizing the amount of user-system interaction rather than overwhelming them with long questionnaires.50 Although concepts of RS that incorporate contextual information have been elaborated in the recent RS literature, they either do not go beyond the concept level (e.g., Adomavicius et al. 2005; Adomavicius and Tuzhilin 2008) or employ only a very limited amount of contextual information (e.g., Baltrunas 2008; Baltrunas and Ricci 2008; El Helou et al. 2009). Many of the contextual factors that influence the decision-making process, such as motives, the anticipated complexity of the decision task, the need to justify the decision to others or to account for somebody else's preferences, time pressure, and prior knowledge (Bettman, Johnson, and Payne 1991), can hardly be formalized for either explicit or implicit data collection and thus cannot be properly accounted for in the models underlying the recommendations. Consequently, the computational process of an RS fails to fully reflect the user context, leaving room for errors, especially in cases where contextual factors dominate over user preferences.
Scale granularity. RS typically make use of discrete integer-valued rating scales for collecting user preferences towards the items, e.g., movies, contained in the catalog, or they utilize a binary 0/1 scale for the implicit collection of purchase acts or other events (such as clicking on a hyperlink or reading an article) that represent meaningful data input for the recommendation process in the corresponding item domains. This raises two problems that may lead to errors in the computational process. Firstly, it cannot be guaranteed that all users perceive the scale points identically and express a given amount of preference on a given scale equally. For instance, if two persons find a certain movie equally good, one of them may rate it with 5 of 5 points, while the other may give it only 4 of 5 points. In such a situation, a recommendation process may not be able to determine that the same amount of preference was meant in both cases and would thus treat the assigned scores differently, in accordance with its internal representation of the meanings of the scale points. Hence, the difference in the ratings received from the considered users introduces an error into the recommendation process. Secondly, as described in Chapter 2.3, the algorithms employed in RS typically operate on rational numbers. For this reason, the results of averaging or weighting that may be employed within the recommendation process will often also be rational numbers, which, however, have to be represented as integers at least for the evaluation of the prediction accuracy. This either introduces potential errors caused by rounding or makes the accuracy evaluation per se error-prone.

50 This thesis is supported by early studies in the research area of CSCW, which revealed that people are not ready to explicitly express their preferences and priorities and perceive such actions as extrinsic to their actual task and as requiring extra effort (e.g., Grudin 1988).
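A common remedy for the first problem, the unequal perception of scale points, is to normalize each user's ratings before processing, e.g., by z-scoring them. The following sketch is purely illustrative and is not the normalization used in this thesis:

```python
def zscore_normalize(ratings):
    """Map one user's raw ratings to z-scores, so that users who
    use the rating scale differently become comparable
    (an illustrative remedy for the scale-perception problem)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / n
    std = var ** 0.5
    if std == 0:          # user rates every item identically
        return [0.0] * n
    return [(r - mean) / std for r in ratings]

# Two users who "mean the same" but use the scale differently:
generous = [5, 5, 4, 3]   # rates the same movies one point higher
strict   = [4, 4, 3, 2]
print(zscore_normalize(generous) == zscore_normalize(strict))  # True
```

After normalization, a "generous" and a "strict" rater who rank the same movies identically become directly comparable; the rounding problem, by contrast, is inherent to integer scales and can only be controlled, not eliminated.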
Algorithmic processing errors. Finally, the computational procedure itself represents a potential source of errors. Even with a perfect model of user choice making, numeric algorithms will still be error-prone due to the possibilities of overfitting, rounding errors, and other types of miscalculation. Not least, the quality of the data determines the outcome of the calculations.
DATA ERRORS
Data errors result from inadequacies of the data employed in the calculation of recommendations. These inadequacies usually fall into three classes: not enough data, poor or bad data, and high-variance data (Herlocker, Konstan, and Riedl 2000).
Not enough data. RS base their computations on extremely sparse and incomplete data. Indeed, if the data were complete, there would be no reason for RS to predict the missing data points. The estimation of missing data itself is known to raise computational challenges and to be prone to errors (e.g., Schafer and Graham 2002). In the context of RS, the latter problem is further aggravated for items and users that have recently entered the system – an issue we addressed earlier in this chapter as the new item and new user problems.
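The severity of the sparsity problem can be quantified by the share of the user-item rating matrix that is empty. The sketch below uses the proportions of the public MovieLens 100K data set as an example; the figures are not those of our own data:

```python
def sparsity(num_users, num_items, num_ratings):
    """Fraction of the user-item rating matrix with no observation."""
    return 1.0 - num_ratings / (num_users * num_items)

# MovieLens-100K-like proportions: 943 users, 1682 items,
# 100,000 observed ratings.
print(f"{sparsity(943, 1682, 100_000):.4f}")  # prints 0.9370
```

In other words, even in a classic benchmark data set, roughly 94 percent of all user-item combinations are unobserved and have to be predicted.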
Poor or bad data. Even in cases where considerable amounts of data about users and items are available, some of the data may still contain errors. These errors may result from accidentally erroneous user inputs or may even be fraudulently generated through shilling attacks by malicious web robots that favor or disfavor a particular item (Mobasher et al. 2007; Sandvig, Mobasher, and Burke 2007). Another part of the inconsistent data points is produced by natural variability in users' perception of the scale points, i.e., when users provide different ratings for the same item at different times (Hill et al. 1995; Herlocker et al. 2004) or when different users associate different ratings with the same amount of preference.
High variance data. High-variance data is not necessarily bad data for recommendation algorithms. However, it can cause recommendation errors (Herlocker, Konstan, and Riedl 2000). Especially for interest-polarizing items, such as the comedy movie "Napoleon Dynamite," which can "be either loved or despised" (Thompson 2008), it can be hard to predict the preference rating for a given user. In such cases, a proper prediction is probably not an average rating, although an average is typically what an RS will predict (Herlocker, Konstan, and Riedl 2000).
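The difficulty that polarizing items pose for mean-based predictors can be made concrete with a small, purely illustrative sketch:

```python
def mean_and_spread(ratings):
    """Mean rating and mean absolute deviation from that mean."""
    m = sum(ratings) / len(ratings)
    mad = sum(abs(r - m) for r in ratings) / len(ratings)
    return m, mad

# A polarizing item ("loved or despised") vs. a consensus item:
polarizing = [1, 1, 1, 5, 5, 5]
consensus  = [3, 3, 3, 3, 3, 3]
print(mean_and_spread(polarizing))  # (3.0, 2.0): mean matches nobody
print(mean_and_spread(consensus))   # (3.0, 0.0): mean matches everybody
```

For the polarizing item, the mean rating of 3 lies two points away from every single observed rating, so predicting it is wrong for every user, even though both items share the same mean.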
As we have shown above, many factors can cause misleading recommendations. The chance of receiving an erroneous recommendation impairs users' acceptance of and trust in RS. Explanations of the reasoning behind the recommendations provide users with indications of when to trust a recommendation and when to doubt one. This provides an instrument for handling errors in recommendations and helps recover users' trust in and acceptance of RS (Herlocker, Konstan, and Riedl 2000).
Appendix B: List of Preference Relevant Attributes
Genres (26)
Action
Adult
Adventure
Animation
Biography
Comedy
Crime
Documentary
Drama
Family
Fantasy
Film-Noir
History
Horror
Music
Musical
Mystery
News
Reality-TV
Romance
Sci-Fi
Short
Sport
Thriller
War
Western
Actors (87)
Affleck, Ben
Allen, Tim
Bale, Christian
Banderas, Antonio
Black, Jack
Bleibtreu, Moritz
Bloom, Orlando
Broderick, Matthew
Cage, Nicolas
Caine, Michael
Carrey, Jim
Chan, Jackie
Clooney, George
Connery, Sean
Costner, Kevin
Craig, Daniel
Crowe, Russell
Cruise, Tom
Cusack, John
Damon, Matt
De Niro, Robert
Depp, Johnny
DiCaprio, Leonardo
Diesel, Vin
Douglas, Michael
Downey Jr., Robert
Dreyfuss, Richard
Eastwood, Clint
Farrell, Colin
Ford, Harrison
Foxx, Jamie
Fraser, Brendan
Freeman, Morgan
Gere, Richard
Gibson, Mel
Grant, Hugh
Gyllenhaal, Jake
Hanks, Tom
Hartnett, Josh
Hoffman, Dustin
Hopkins, Anthony
Ice Cube
Jackman, Hugh
Jackson, Samuel L.
Kutcher, Ashton
LaBeouf, Shia
Law, Jude
Lawrence, Martin
Ledger, Heath
Maguire, Tobey
Marsden, James
Martin, Steve
McConaughey, Matthew
McGregor, Ewan
McKellen, Ian
Murphy, Eddie
Murray, Bill
Myers, Mike
Newman, Paul
Nicholson, Jack
Norton, Edward
Owen, Clive
Pacino, Al
Phoenix, Joaquin
Pitt, Brad
Quaid, Dennis
Redford, Robert
Reeves, Keanu
Reynolds, Ryan
Russell, Kurt
Sandler, Adam
Schwarzenegger, Arnold
Schweiger, Til
Scott, Seann William
Smith, Will
Snipes, Wesley
Stallone, Sylvester
Statham, Jason
Stiller, Ben
Travolta, John
Tucker, Chris
Waalkes, Otto
Wahlberg, Mark
Washington, Denzel
Williams, Robin
Willis, Bruce
Wilson, Owen
Wood, Elijah
Actresses (46)
Adams, Amy
Aniston, Jennifer
Barrymore, Drew
Berry, Halle
Blanchett, Cate
Bullock, Sandra
Curtis, Jamie Lee
Diaz, Cameron
Dunst, Kirsten
Fonda, Jane
Foster, Jodie
Hathaway, Anne
Hawn, Goldie
Hewitt, Jennifer Love
Hudson, Kate
Hunt, Helen
Johansson, Scarlett
Jolie, Angelina
Keaton, Diane
Kidman, Nicole
Knightley, Keira
Lopez, Jennifer
Moore, Demi
Moore, Julianne
Paltrow, Gwyneth
Pfeiffer, Michelle
Portman, Natalie
Potente, Franka
Riemann, Katja
Roberts, Julia
Russo, Rene
Ryan, Meg
Ryder, Winona
Sarandon, Susan
Stiles, Julia
Streep, Meryl
Streisand, Barbra
Swank, Hilary
Theron, Charlize
Thurman, Uma
Weaver, Sigourney
Weisz, Rachel
Winslet, Kate
Witherspoon, Reese
Zellweger, Renée
Zeta-Jones, Catherine
Directors (106)
Abrahams, Jim
Allen, Woody
Amiel, Jon
Anderson, Paul W. S.
Annaud, Jean-Jacques
Apted, Michael
Bay, Michael
Besson, Luc
Boyle, Danny
Brest, Martin
Brooks, James L.
Burton, Tim
Cameron, James
Campbell, Martin
Carpenter, John
Coen, Joel
Cohen, Rob
Columbus, Chris
Coppola, Francis Ford
Craven, Wes
Crowe, Cameron
Dante, Joe
Davis, Andrew
de Bont, Jan
Demme, Jonathan
del Toro, Guillermo
De Palma, Brian
DeVito, Danny
Dörrie, Doris
Donner, Richard
Dugan, Dennis
Eastwood, Clint
Emmerich, Roland
Ephron, Nora
Farrelly, Peter
Farrelly, Bobby
Fincher, David
Forster, Marc
Gilliam, Terry
Gosnell, Raja
Gray, F. Gary
Hallström, Lasse
Hanson, Curtis
Harlin, Renny
Herek, Stephen
Hoblit, Gregory
Howard, Ron
Jackson, Peter
Johnston, Joe
Lee, Ang
Lee, Spike
Levant, Brian
Levinson, Barry
Levy, Shawn
Lucas, George
Lyne, Adrian
Mann, Michael
Marshall, Garry
Marshall, Penny
McTiernan, John
Miller, George
Newell, Mike
Nichols, Mike
Nolan, Christopher
Noyce, Phillip
Oz, Frank
Petersen, Wolfgang
Pollack, Sydney
Raimi, Sam
Ramis, Harold
Ratner, Brett
Reiner, Rob
Reitman, Ivan
Reynolds, Kevin
Roach, Jay
Rodriguez, Robert
Russell, Chuck
Schumacher, Joel
Scorsese, Martin
Scott, Ridley
Scott, Tony
Segal, Peter
Shadyac, Tom
Shankman, Adam
Shyamalan, M. Night
Singer, Bryan
Singleton, John
Smith, Kevin
Soderbergh, Steven
Sommers, Stephen
Sonnenfeld, Barry
Spielberg, Steven
Stone, Oliver
Tarantino, Quentin
Thomas, Betty
Turteltaub, Jon
Tykwer, Tom
Verbinski, Gore
Vilsmaier, Joseph
Weir, Peter
Woo, John
Wortmann, Sönke
Zemeckis, Robert
Zucker, David
Zucker, Jerry
Zwick, Edward
Producers (4)
Apatow, Judd
Bruckheimer, Jerry
Rudin, Scott
Silver, Joel
Writers (5)
Crichton, Michael
Curtis, Richard
Dick, Philip K.
Grisham, John
King, Stephen
Production Firms (6)
Imagine
Nickelodeon
Pixar
Revolution
Section
Spyglass
Countries of Origin (38)
Australia
Austria
Argentina
Belgium
Brazil
Canada
China
Czech Republic
Czechoslovakia
Denmark
East Germany
France
Finland
Germany
Hong Kong
Iceland
India
Ireland
Israel
Italy
Japan
Mexico
Netherlands
New Zealand
Norway
Poland
Russia
South Africa
South Korea
Soviet Union
Spain
Sweden
Switzerland
Thailand
Turkey
UK
USA
West Germany
Appendix C: Technical Details of Prediction Accuracy Tests
Whereas in Chapter 4 of this thesis we describe our tests of predictive accuracy conceptually, in this appendix we provide insights into the details of the technical implementation and execution of the tests. By providing this information, we ensure that a critical reader can verify the methodological correctness of the process by which the results were obtained, understand our course of action more deeply, and, if necessary, replicate our results as well as use our procedure for his or her own studies and for building his or her own recommender system.
For the calculation of the prediction accuracy of the global average as well as of all variants of the user-based and item-based collaborative filtering algorithms, we utilized the open source library of recommender system algorithms "MyMediaLite".51 This library was recommended for use in real-world recommender systems as well as for research purposes at the 4th ACM Conference on Recommender Systems, RecSys 2010 (personal communication with Francesco Ricci, Gediminas Adomavicius, and Xavier Amatriain).52
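For orientation, the global average baseline and the standard accuracy measures used in such tests can be sketched in a few lines. This is an illustrative re-implementation in Python, not MyMediaLite's API and not our C# code:

```python
import math

def global_average(train_ratings):
    """Baseline predictor: every unseen rating is predicted as the
    mean of all ratings observed in the training data."""
    return sum(train_ratings) / len(train_ratings)

def mae_rmse(predictions, actuals):
    """Mean absolute error and root mean squared error, the two
    standard prediction-accuracy metrics."""
    errors = [p - a for p, a in zip(predictions, actuals)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

pred = global_average([4, 5, 3, 4])         # 4.0
mae, rmse = mae_rmse([pred, pred], [5, 3])  # hold-out set of two ratings
print(mae, rmse)  # 1.0 1.0
```

Every personalized algorithm under test has to beat this trivial baseline to justify its additional complexity.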
The matrix factorization algorithm was implemented based on Simon Funk's (2006) description53 of the approach that brought him to the fourth position on the Netflix Prize leaderboard in the fall of 2006. The surprising performance of Funk's algorithm attracted enormous attention from the Netflix Prize community, which made the matrix factorization approach popular in recommender system research. Although Funk's approach was never published in an academic journal, his blog entry describing his stochastic gradient descent method for matrix factorization has been widely cited in the recent literature and serves as the basis for all published matrix factorization approaches (e.g., Paterek 2007; Koren 2009; Linden 2009; Koren and Bell 2011).
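The core of Funk's stochastic gradient descent method can be sketched as follows. This is a minimal Python illustration of the technique, not our Chapter 4 implementation; the latent dimensionality, learning rate, regularization weight, and epoch count are arbitrary assumptions chosen for the toy example:

```python
import random

def funk_sgd(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, epochs=2000):
    """Minimal Funk-style matrix factorization: learn user factors P
    and item factors Q by stochastic gradient descent on the observed
    (user, item, rating) triples only."""
    random.seed(0)
    P = [[random.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # regularized step
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Toy data: user 0 loves item 0 and dislikes item 1; user 1 the inverse.
data = [(0, 0, 5), (0, 1, 1), (1, 0, 1), (1, 1, 5)]
P, Q = funk_sgd(data, n_users=2, n_items=2)
print(round(predict(P, Q, 0, 0), 1))
```

On the toy data, the learned factors reproduce the observed ratings closely; on the Netflix data, Funk additionally trained the factors one at a time with early stopping, as described in his blog entry.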
The program for our approach, described in Chapter 4, was implemented using source code snippets from Press et al. (2007), a widely acknowledged source of numerical methods for scientific computing. Table C.1 provides an overview of the employed procedures, their short descriptions, and information about their roles in our algorithm.
51 http://www.ismll.uni-hildesheim.de/mymedialite/index.html
52 http://recsys.acm.org/2010/
53 http://sifter.org/~simon/journal/20061211.html
All algorithms employed in our study are implemented in the programming language C#. The tests were performed on an Intel® Core™2 Quad Q9400 2.67 GHz machine with 8 GB RAM running 64-bit Windows Server® 2008 Standard Edition with Service Pack 2.
Table C.1: Overview of the employed source code snippets from Press et al. 2007
(Method or function name – Description – Role for the algorithm)

Fitab – Object for fitting a straight line to a set of points, with or without available errors. – Solving regression problems for one regression parameter, Section 3.2.1.

invxlogx; Erf; Normaldist:Erf; Lognormaldist:Erf; Gauleg18; Beta:Gauleg18; Gamma:Gauleg18; Studenttdist:Beta; Fdist:Beta – Classes and functions providing distributional statistics and statistical tests for the Beta, Gamma, Gauss, logarithmic, Student-t, and F-distributions. – Performing tests for significance, Section 3.2.1.

SVD – Object for the singular value decomposition of a matrix. – Correction for omitted variable bias; solving equation system (3.22), Section 3.2.1.3.

SVD::solve – Solves an equation system for a vector using the pseudoinverse of the matrix. – Correction for omitted variable bias; solving equation system (3.22), Section 3.2.1.3.

Bracketmethod – Base class for one-dimensional minimization routines; provides a routine to bracket a minimum and several utility functions. – Optimizing initial parameter values, Section 3.2.2.

Brent:Bracketmethod – Isolates the minimum using Brent's method. – Optimizing initial parameter values, Section 3.2.2.

F1dim – Performs one-dimensional minimization. – Optimizing initial parameter values, Section 3.2.2.

Linemethod – Base class for line minimization algorithms. – Optimizing initial parameter values, Section 3.2.2.

Frprmn:Linemethod – Multidimensional minimization by the Fletcher-Reeves-Polak-Ribiere method. – Optimizing initial parameter values, Section 3.2.2.
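The bracketing-based one-dimensional minimization routines in Table C.1 can be illustrated with golden-section search, a simpler relative of Brent's method. This sketch is not the Press et al. code, merely an illustration of minimizing a function on a bracketing interval:

```python
def golden_section_minimize(f, a, b, tol=1e-8):
    """One-dimensional minimization of f on a bracketing interval
    [a, b] by golden-section search: repeatedly shrink the interval
    around the smaller of two interior probe points."""
    invphi = (5 ** 0.5 - 1) / 2               # 1/phi ≈ 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):                        # minimum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:                                  # minimum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# The minimum of (x - 2)^2 + 1 on [0, 5] lies at x = 2.
x_min = golden_section_minimize(lambda x: (x - 2) ** 2 + 1, 0.0, 5.0)
print(round(x_min, 6))  # 2.0
```

Brent's method, as used in our implementation via Press et al.'s routines, accelerates this scheme with parabolic interpolation steps where they are safe.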
Declaration of Honor (Ehrenwörtliche Erklärung)

I hereby declare on my honor that I have prepared the present work without the impermissible help of third parties and without the use of any aids other than those stated. Data and concepts taken directly or indirectly from other sources are marked with a reference to the source.

In the selection and evaluation of the following material, the persons listed below helped me, with or without remuneration, in the manner described in each case: none.

No further persons were involved in the substantive preparation of the present work. In particular, I have not made use of the paid assistance of placement or consulting services (doctoral consultants or other persons). Nobody has received from me, directly or indirectly, any monetary benefits for work connected with the content of the submitted dissertation.

The work has not previously been submitted, in the same or a similar form, to any other examination authority in Germany or abroad.

I affirm that, to the best of my knowledge, I have told the pure truth and have concealed nothing.

Langenhagen, 29.07.2011