Recommender systems evaluation: a 3D benchmark (presented at the RUE 2012 workshop at ACM RecSys 2012)
TRANSCRIPT
Recommender systems evaluation: a 3D benchmark
Alan Said1, Domonkos Tikk2, Yue Shi3, Martha Larson3, Klára Stumpf2, Paolo Cremonesi4
1: TU Berlin, 2: Gravity R&D, 3: TU Delft, 4: Politecnico di Milano/Moviri
Motivation
• Current recsys evaluation benchmarks are insufficient
– mostly focused on IR measures (RMSE, MAP@X, precision/recall)
– do not consider the needs of all stakeholders (users, content providers, recsys vendors)
– technological and business requirements are mostly overlooked
• 3D Recommender System Benchmarking Model
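The IR measures named on this slide can be made concrete with a minimal sketch. This is an illustrative toy implementation (hypothetical function names and data, not from any benchmark toolkit):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over {item: rating} dicts with shared keys."""
    errs = [(predicted[i] - actual[i]) ** 2 for i in actual]
    return math.sqrt(sum(errs) / len(errs))

def precision_recall_at_k(ranked, relevant, k):
    """Precision@k and recall@k of a ranked item list against a relevant set."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k, hits / len(relevant)

# Toy data: one user's predicted vs. true ratings, and one ranked list.
pred = {"a": 4.5, "b": 2.0, "c": 3.5}
true = {"a": 5.0, "b": 1.0, "c": 4.0}
error = rmse(pred, true)                                        # ~0.707
prec, rec = precision_recall_at_k(["a", "c", "b", "d"], {"a", "d"}, k=2)
```

Note how each measure looks only at prediction quality; none of them captures the business or technical dimensions the slides argue for.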
Recent benchmarks (1)
• pros:
– large scale
– very well organized
• cons:
– qualitative assessment of recommendation simplified to RMSE
– rating prediction (not ranking)
– no focus on direct business and technical parameters (scalability, robustness, reactivity)
Recent benchmarks (2)
• pros:
– constraints on training and response time
– real traffic (only planned)
– major driver: revenue increase
• cons:
– only business goals, but otherwise unclear optimization criteria
– user needs are neglected
– organization
Recent benchmarks (3)
• pros:
– availability of additional metadata (compared to KDD Cup 2011)
– not rating based (implicit feedback)
– ranking-based evaluation metric (MAP@500)
• cons:
– offline evaluation
– size does not matter anymore (lower interest)
– no business requirements or technical constraints
User requirements
• functional (quality-related)
– relevant, interesting, novel, diverse, serendipitous, context-aware, ethical, etc.
• non-functional (technology-related)
– real-time
– usability-related
Business requirements
• Business model
– for-profit: revenue stream
– non-profit: award driven (reputation, community building)
• KPIs depend on the application area
– revenue increase
– CTR
– raise awareness of content or service
Technical constraints
• data driven
– availability of user feedback (e.g. satellite TV)
• system driven
– hardware/software limitations (device-dependent)
• scalability
– typical response time
• robustness
Example
• VoD recommendation scenario (TV)
– user: easy content exploration, context-awareness (time, viewer identification)
– business: increase VoD sales & awareness (user base)
– technical: middleware, HW/SW of the provider, response time
Conclusion
• Recommendation tasks have many aspects that are typically overlooked
• The task defines the important user, business, and technical quality measures
– all of them must be fulfilled at a certain level
– trade-offs are usually required
• Proposal: our 3D evaluation concept enables more comprehensive evaluation