Balancing Discovery and Continuation in Recommendations (Hossein Taghavi, Netflix)
Hossein Taghavi
With: Ashok Chandrashekar, Linas Baltrunas, and Justin Basilico
Balancing Discovery and Continuation in Recommendations
RecSysTV 2016
Outline
§ Background: Netflix recommendations
§ Recommending for different modes of watching
§ Case study: Continue Watching row
§ Conclusions
Netflix Scale
§ > 83M members
§ > 190 countries
§ > 1000 device types
§ > 3.7B hours of content streamed every month
§ 36% of peak US downstream traffic
§ Recommendations through predicted star rating
§ Contest:
  § Accuracy measured by root mean squared error (RMSE)
  § Improve by 10% = $1 million!
§ Data size:
  § 100M ratings (back then “almost massive”)
Recommendation System: Ideal State
Turn on Netflix, and the absolute best content for you would automatically start playing
Meanwhile…
Create a page of recommendations where the titles you are most likely to watch and enjoy are shown on the most visible parts of the page
Everything is a Recommendation
[Figure labels: Title Ranking; Row Selection & Ordering]
Recommendations are driven by machine learning algorithms
Over 80% of what members watch comes from our recommendations
How the Homepage is Built
§ The titles are organized as rows
§ Ordering of titles within rows depends on the row type
§ Selection and ordering of rows:
  § Personalized page generation algorithm
  § Also some business rules and constraints
§ Balance thematic coherence, relevance, and diversity
Various Types of Member Interactions/Feedback
§ Plays
  § How long, pause, rewind, skip, etc.
§ Rating and social
  § Rate, like, share
§ Context
  § Time, location, device, language
§ Interactions
  § Scrolling, opening a title page, search, list add
Building the Recommendations is Data Driven
§ Try an idea offline using historical data to see if it would have made better recommendations
  § Offline metrics: AUC, nDCG, Recall, …
§ If it did, deploy a live A/B test to see if it performs well in production
  § Primary metric: Member retention
[Diagram labels: Idea / Problem; Data; Algorithm; Model; Metrics; A/B Testing]
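As a rough illustration of two of the offline metrics named on this slide, the sketch below computes recall@k and nDCG@k for a single ranked list with binary relevance. The function names and the toy data are assumptions made for illustration; the talk does not specify how Netflix computes these metrics.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical data): titles "b" and "d" were actually played.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(recall_at_k(ranked, relevant, 2))
print(ndcg_at_k(ranked, relevant, 4))
```

Averaging such per-session values over historical sessions gives one offline number to compare a candidate against the current ranking before any A/B test.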
For More Reading
§ Netflix tech blog:
  § bit.ly/beyondfivestars
  § bit.ly/learnapage
  § bit.ly/sparktimetravel
What Is the Most Likely Title You Will Watch?
The same one you watched last time!
§ A large portion of watching hours is spent in continue-watching mode
Different Modes of Watching
§ Continuation: Resume a recently watched TV show or movie
§ List: Play a title previously added to My List
§ Rewatch: Rewatch a title enjoyed in the past
§ Discovery: Discover a new title to watch
Recommending for Different Modes: Approach 1
§ Build one unified model for ranking the titles in each row and one for ranking rows
  § Optimized for the likelihood of play/enjoyment from the page
§ Benefits:
  § Fewer models to maintain
  § Fewer A/B tests
Approach 1: Challenges
§ Members behave differently in different modes
  § Different row types are designed for different behaviors
  § Hard to capture and balance all that in one objective
  § E.g., simply ranking titles by likelihood of play will fill the page with already-watched titles → Poor member experience
§ Recommendations for different modes have different sensitivities to member actions
  § Continuation recs may react immediately to watching activities, My List recs may react to My List add/remove activities, etc.
Approach 2: Dedicated Models + Blend
§ Build separate models for each mode
§ Blend the results on the page
  § Blending can be done through a model trained offline, or a parameter tuned online
  § E.g., one or more dedicated rows for each mode
§ Pro:
  § More modular, provides more intuitive knobs for balancing
§ Con:
  § Less elegant, more maintenance
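One way to picture the "parameter tuned online" option is a per-mode weight that scales each dedicated model's row score before the rows are ordered. The sketch below is only an assumption about how such a blend could look; `blend_rows`, the row names, the scores, and the weights are all hypothetical.

```python
def blend_rows(rows, mode_weights):
    """Order candidate rows by mode-specific score scaled by a per-mode weight.

    rows: list of (row_name, mode, score) tuples, one score per dedicated model.
    mode_weights: hypothetical tuning knobs, e.g. set through online experiments.
    """
    weighted = [(mode_weights.get(mode, 1.0) * score, name)
                for name, mode, score in rows]
    # Highest weighted score first; Python's sort is stable, so ties keep input order.
    return [name for _, name in sorted(weighted, key=lambda pair: pair[0], reverse=True)]

rows = [
    ("Continue Watching", "continuation", 0.6),
    ("My List", "list", 0.5),
    ("Top Picks", "discovery", 0.7),
]
print(blend_rows(rows, {"continuation": 1.0}))  # discovery row leads
print(blend_rows(rows, {"continuation": 1.5}))  # continuation row leads
```

The weights act as the "intuitive knobs" from the slide: raising the continuation weight moves the Continue Watching row up without retraining any of the dedicated models.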
Continue Watching Row: The Past
§ CW row was shown on some devices
  § Videos sorted by recency of last watch
  § Row appearance on page by business rules
  § On the website, only a single CW title
§ A very significant fraction of plays are continuations
  § CW deserved a better treatment
Objective
§ Unify the CW row across devices
§ Optimize the row in two dimensions:
  § Row position on page
    § Place it higher when the member is more likely to resume a video
  § Re-order the titles within the CW row
    § By their likelihood to be resumed in the current session
Some Intuitive Patterns
§ Member may be more likely to want to
  § Resume a video if:
    § In the middle of binging a TV show
    § Partially watched a movie recently
    § Often watched it around this time of the day, location, or on the current device
  § Discover a new title if:
    § Just finished a movie or completed all episodes of a show
    § Hasn’t watched anything recently
    § Is a relatively new member
Building a Recommendation Model for CW
§ Feature Brainstorm
§ Training Data
§ Models and Metrics
§ Implementation
Feature Ideas
§ Member-level:
  § Member’s subscription: tenure, country, language
  § How active has the member been recently
  § Member past ratings, genre preferences, etc.
Feature Ideas
§ Video and member’s previous interactions with it:
  § How recently was the video added to the catalog, watched, ...
  § How much of the movie/show watched
  § Video metadata:
    § Type and genre of video, # episodes
    § E.g., kids titles may be re-watched more
  § What else is on the catalog
  § Popularity and relevance of the video
  § How often do members resume this video
Feature Ideas
§ Contextual:
  § Time of the day and day of the week
  § Location at various resolutions
  § Device
Title Ranking Model
§ Training data
  § Continuation sessions
  § Look at which of the recently watched titles were played
§ Model
  § Learn-to-rank: Linear/ensembles/…
  § Optimize for how well we rank the played title among other titles
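To make "optimize for how well we rank the played title among other titles" concrete, here is a minimal sketch of a linear pairwise ranker trained with a perceptron-style update. This is a stand-in technique chosen for brevity, not the model described in the talk; the two features (recency, mid-series flag) and the toy sessions are assumptions.

```python
def score(weights, features):
    """Linear score of one candidate title."""
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise(sessions, n_features, epochs=20, lr=0.1):
    """sessions: list of (candidates, played_index); candidates are feature vectors."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for candidates, played in sessions:
            pos = candidates[played]
            for i, neg in enumerate(candidates):
                if i == played:
                    continue
                # Update only when the played title is not scored strictly higher.
                if score(weights, pos) <= score(weights, neg):
                    weights = [w + lr * (p - n) for w, p, n in zip(weights, pos, neg)]
    return weights

def rank(weights, candidates):
    """Candidate indices ordered from highest to lowest score."""
    return sorted(range(len(candidates)),
                  key=lambda i: score(weights, candidates[i]), reverse=True)

# Hypothetical features per title: [recency (1.0 = most recent), is_mid_series (0 or 1)]
sessions = [
    ([[1.0, 0.0], [0.5, 1.0]], 1),
    ([[1.0, 1.0], [0.5, 0.0]], 0),
    ([[1.0, 0.0], [0.2, 1.0]], 1),
    ([[1.0, 0.0], [0.3, 0.0]], 0),
]
weights = train_pairwise(sessions, n_features=2)
print(rank(weights, [[1.0, 0.0], [0.4, 1.0]]))
```

On this toy data the learned weights favor both recency and the mid-series flag, so a less recent title can outrank the most recent one when the member is in the middle of a show.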
Title Ranking Model: Performance
§ Baseline: Ranking by recency of last play
  § Recency rank was also an important feature in the model
§ Metrics significantly higher than the baseline
  § E.g., significant lift in precision
  § A/B testing also showed improvements
Row Placement Model
§ Objective
  § Estimate the likelihood of continuation vs. discovery
  § Map that likelihood to a position on the page
§ Simplification:
  § Fix two candidate positions on the page and apply a threshold
  § Tune the threshold to optimize some accuracy metric
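The simplification on this slide amounts to a one-line decision rule. The sketch below assumes two fixed row indices and a default threshold of 0.5; all three values and the function name are hypothetical placeholders, since the talk does not give the actual positions or threshold.

```python
def place_cw_row(p_continuation, threshold=0.5, top_position=0, lower_position=3):
    """Pick one of two fixed row positions from the predicted continuation likelihood."""
    if not 0.0 <= p_continuation <= 1.0:
        raise ValueError("p_continuation must be a probability in [0, 1]")
    # At or above the threshold the member is treated as likely to resume something.
    return top_position if p_continuation >= threshold else lower_position

print(place_cw_row(0.8))  # likely continuation: row goes to the top position
print(place_cw_row(0.2))  # likely discovery: row goes to the lower position
```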
Row Placement Model: Training
§ Training data
  § Randomly select sessions with plays globally
§ Model
  § Binary classification of continuation vs. discovery sessions
  § Evaluated using classification and ranking metrics
Row Placement Model: Performance
§ Metrics
  § Achieved high classification metrics for predicting continuation vs. discovery
  § Error types:
    § False positive → CW occupies top of the page unnecessarily
    § False negative → Difficult for member to find the CW title
§ Placing the row
  § Threshold trades off FP and FN → Hard to tune offline
  § Tuned the threshold by A/B testing
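The FP/FN trade-off can be seen by sweeping the threshold over a handful of scored sessions. The sketch below uses made-up probabilities and labels purely for illustration; it counts errors offline, whereas the talk reports that the final threshold was tuned by A/B testing.

```python
def confusion_at_threshold(probs, labels, threshold):
    """Return (false_positives, false_negatives).

    labels: 1 = continuation session, 0 = discovery session.
    A session is predicted "continuation" when its probability >= threshold.
    """
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return fp, fn

# Hypothetical model outputs and ground truth for six sessions.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for t in (0.2, 0.5, 0.7):
    print(t, confusion_at_threshold(probs, labels, t))
```

Lower thresholds push the CW row to the top more often (more false positives), while higher thresholds hide it more often (more false negatives). Which error costs more is a product question that offline counts alone cannot answer.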
Reusing the Title Ranking Model
§ Use the title-level scores
  § Calibrate scores to get probability $P_t$ of continuation for each CW title $t$
  § Aggregate into an overall probability of continuation
  § E.g., assuming independence:
    $P_{CW} = 1 - \prod_{t \in CW} (1 - P_t)$
§ Pro: Avoids maintaining two separate models
§ Con: Not as accurate as a dedicated model
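The independence-based aggregation on this slide is short enough to write out directly. The sketch below assumes the per-title probabilities are already calibrated; the function name is hypothetical.

```python
import math

def prob_any_continuation(title_probs):
    """P_CW = 1 - prod(1 - P_t): chance that at least one CW title is resumed,
    assuming the per-title continuation events are independent."""
    return 1.0 - math.prod(1.0 - p for p in title_probs)

# Two titles, each with a 50% chance of being resumed.
print(prob_any_continuation([0.5, 0.5]))
```

With an empty CW row the product is 1, so the aggregate probability is 0, which matches the intuition that there is nothing to continue.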
Context Awareness
§ A title ranks highest at the same time of day and on the same device as its last play
§ Experiment:
  § Played “Sid the Science Kid” on iPhone
  § Played “Narcos” on the website
  → Different ranking on iPhone and Web
Serving the CW Row in Production
§ Score cannot be precomputed → Real- or near-real-time
  § Some features are context dependent
  § Row should refresh each time a member watches a title
  § Need to push updates to clients to keep the row fresh
§ Latency bottleneck: Data transfers from the cache to computation backend
  § Requires careful backend engineering
  § Fallback strategy: If computation fails, can use recency ranking
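The fallback strategy can be sketched as a wrapper that tries the model-based ranking and drops back to recency ordering on any failure. The record fields (`id`, `last_watched`) and the `score_fn` callback are assumptions for illustration and do not reflect the actual production interfaces.

```python
def rank_cw_titles(titles, score_fn):
    """Rank CW titles by model score, falling back to recency if scoring fails.

    titles: list of dicts with hypothetical keys 'id' and 'last_watched' (timestamp).
    score_fn: callable returning a model score for one title; may raise.
    """
    try:
        scores = {t["id"]: score_fn(t) for t in titles}
        return sorted(titles, key=lambda t: scores[t["id"]], reverse=True)
    except Exception:
        # Fallback: most recently watched title first (the pre-model baseline).
        return sorted(titles, key=lambda t: t["last_watched"], reverse=True)

def failing_scorer(title):
    raise TimeoutError("feature fetch timed out")

titles = [{"id": "a", "last_watched": 100}, {"id": "b", "last_watched": 200}]
print([t["id"] for t in rank_cw_titles(titles, failing_scorer)])
```

Scoring all titles before sorting means a failure on any single title triggers the fallback for the whole row, so the member never sees a partially scored ordering.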
Conclusions
§ Important to understand different modes of behavior
§ Continuation is a key driver of streaming hours
  § Improving CW recommendations improves member experience
  § A/B testing showed significant boost in user engagement
§ Future:
  § Incorporate the placement of CW row (and others) into the main page construction model
  § When can we automatically start resuming a title?