Balancing Discovery and Continuation in Recommendations (Hossein Taghavi, Netflix)


Hossein Taghavi
With: Ashok Chandrashekar, Linas Baltrunas, and Justin Basilico

Balancing Discovery and Continuation in Recommendations

RecSysTV 2016

Outline

§ Background: Netflix recommendations

§ Recommending for different modes of watching

§ Case study: Continue Watching row

§ Conclusions


Evolution of Netflix

(Images: Netflix in 2006 vs. 2016)

Netflix Scale

§ > 83M members

§ > 190 countries

§ > 1000 device types

§ > 3.7B hours of content streamed every month

§ 36% of peak US downstream traffic


The Netflix Prize

§ Recommendations through predicted star rating

§ Contest:
  § Accuracy measured by root mean squared error (RMSE)
  § Improve by 10% = $1 million!

§ Data size:
  § 100M ratings (back then "almost massive")

Recommendation System: Ideal State

Turn on Netflix, and the absolute best content for you would automatically start playing.

Meanwhile…

Create a page of recommendations where the titles you are most likely to watch and enjoy are shown on the most visible parts of the page.

Everything is a Recommendation

(Diagram: homepage with Title Ranking within each row, and Row Selection & Ordering across the page)

Recommendations are driven by machine learning algorithms

Over 80% of what members watch comes from our recommendations

How the Homepage is Built

§ The titles are organized as rows

§ Ordering of titles within rows depends on the row type

§ Selection and ordering of rows:
  § Personalized page generation algorithm
  § Also some business rules and constraints

§ Balance thematic coherence, relevance, and diversity (see the sketch below)
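The balance in the last bullet can be made concrete with a toy greedy row-selection sketch. This is not Netflix's actual page generation algorithm; the relevance scores, genre sets, and the diversity penalty weight `lam` are all invented for illustration.

```python
# Toy sketch: greedy row selection trading off relevance against
# diversity. Scores, genre sets, and the penalty weight are made up.

def select_rows(candidate_rows, num_slots, lam=0.5):
    """Fill page slots one at a time, penalizing each candidate row by
    its genre overlap with rows already placed on the page."""
    page, remaining = [], list(candidate_rows)
    for _ in range(num_slots):
        def adjusted(row):
            overlap = sum(len(row["genres"] & r["genres"]) for r in page)
            return row["relevance"] - lam * overlap
        best = max(remaining, key=adjusted)
        page.append(best)
        remaining.remove(best)
    return page

rows = [
    {"name": "Continue Watching", "relevance": 0.9, "genres": set()},
    {"name": "Crime Dramas", "relevance": 0.8, "genres": {"crime", "drama"}},
    {"name": "Gritty Dramas", "relevance": 0.75, "genres": {"drama"}},
    {"name": "Comedies", "relevance": 0.6, "genres": {"comedy"}},
]
print([r["name"] for r in select_rows(rows, num_slots=3)])
```

With `lam = 0.5` the overlap penalty pushes "Comedies" past the more relevant but redundant "Gritty Dramas" into the third slot: the coherence/relevance/diversity trade-off in miniature.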

Various Types of Member Interactions/Feedback

§ Plays
  § How long, pause, rewind, skip, etc.

§ Rating and social
  § Rate, like, share

§ Context
  § Time, location, device, language

§ Interactions
  § Scrolling, opening a title page, search, list add

Building the Recommendations is Data Driven

§ Try an idea offline using historical data to see if it would have made better recommendations
  § Offline metrics: AUC, nDCG, Recall, … (see the sketch below)

§ If it did, deploy a live A/B test to see if it performs well in production
  § Primary metric: Member retention

(Diagram: Idea/Problem → Data → Algorithm → Model → Metrics → A/B Testing, in a loop)
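To make the offline side of this loop concrete, here is a minimal sketch of two of the metrics named above, Recall@k and nDCG@k, for a single ranked list; the ranking and relevance labels are made up.

```python
import math

# Recall@k and nDCG@k for one ranked list. `ranking` is the recommended
# order; `relevant` is the set of titles the member actually played.

def recall_at_k(ranking, relevant, k):
    hits = sum(1 for item in ranking[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranking, relevant, k):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranking[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

ranking = ["title_a", "title_b", "title_c", "title_d"]
relevant = {"title_b", "title_d"}
print(recall_at_k(ranking, relevant, k=3))  # 0.5
print(ndcg_at_k(ranking, relevant, k=3))    # ~0.387
```

In an offline experiment these would be averaged over many sessions and compared between the candidate idea and the incumbent algorithm.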

For More Reading

§ Netflix tech blog:
  § bit.ly/beyondfivestars
  § bit.ly/learnapage
  § bit.ly/sparktimetravel

Building recommendation algorithms that are balanced for different modes of watching


What Is the Most Likely Title You Will Watch?

§ The same one you watched last time!

§ A large portion of watching hours are spent in continue watching mode

Different Modes of Watching

§ Continuation: Resume a recently-watched TV/Movie

§ List: Play a title previously added to My List

§ Rewatch: Rewatch a title enjoyed in the past

§ Discovery: Discover a new title to watch


Recommending for Different Modes: Approach 1

§ Build one unified model for ranking the titles in each row and one for ranking rows
  § Optimized for the likelihood of play/enjoyment from the page

§ Benefits:
  § Fewer models to maintain
  § Fewer A/B tests

Approach 1: Challenges

§ Members behave differently in different modes
  § Different row types are designed for different behaviors
  § Hard to capture and balance all that in one objective
  § E.g., simply ranking titles by likelihood of play will fill the page with already-watched titles → Poor member experience

§ Recommendations for different modes have different sensitivities to member actions
  § Continuation recs may react immediately to watching activity, My List recs may react to My List add/remove activity, etc.

Approach 2: Dedicated Models + Blend

§ Build separate models for each mode

§ Blend the results on the page
  § Blending can be done through a model trained offline, or a parameter tuned online
  § E.g., one or more dedicated rows for each mode

§ Pro:
  § More modular, provides more intuitive knobs for balancing (see the sketch below)

§ Con:
  § Less elegant, more maintenance
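A toy version of this blending step might look like the following, assuming each dedicated model emits per-row scores and a single weight per mode is the online-tuned parameter; the mode names, scores, and weights are illustrative, not the production design.

```python
# Toy blend of dedicated per-mode models into one page-wide row order.
# The weights are the "intuitive knobs": they could come from an
# offline-trained model or be tuned online, as the slide suggests.

mode_weights = {"continuation": 1.2, "list": 1.0, "discovery": 1.0}

def blend(rows_by_mode):
    """rows_by_mode: {mode: [(row_name, model_score), ...]} -> row names
    ordered by weighted score across all modes."""
    scored = [(mode_weights[mode] * score, name)
              for mode, rows in rows_by_mode.items()
              for name, score in rows]
    return [name for _, name in sorted(scored, reverse=True)]

print(blend({
    "continuation": [("Continue Watching", 0.9)],
    "discovery": [("Trending Now", 0.7), ("Because You Watched Narcos", 0.6)],
    "list": [("My List", 0.5)],
}))
```

Raising or lowering a single mode weight shifts that mode's rows up or down the page, which is exactly the kind of balancing knob the unified model of Approach 1 lacks.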

Case Study: Continue Watching Row


Continue Watching Row: The Past

§ The CW row was shown on some devices
  § Videos sorted by recency of last watch
  § Row appearance on the page governed by business rules
  § On the website, only a single CW title

§ A very significant fraction of plays are continuations
  § CW deserved better treatment

Objective

§ Unify the CW row across devices

§ Optimize the row in two dimensions:
  § Row position on the page
    § Place it higher when the member is more likely to resume a video
  § Re-order the titles within the CW row
    § By their likelihood of being resumed in the current session

Some Intuitive Patterns

§ A member may be more likely to want to:
  § Resume a video if they:
    § Are in the middle of binging a TV show
    § Partially watched a movie recently
    § Often watched it around this time of day, at this location, or on the current device
  § Discover a new title if they:
    § Just finished a movie or completed all episodes of a show
    § Haven't watched anything recently
    § Are a relatively new member

(A rule-of-thumb version of these patterns is sketched below.)
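Before the learned models on the following slides, these intuitions could be written down directly as rules of thumb; the session fields and the 7-day and 30-day thresholds below are invented for illustration.

```python
# Hand-written rule-of-thumb version of the slide's intuitive patterns.
# All session fields and thresholds are hypothetical.

def likely_to_resume(session):
    if session["mid_series"]:                       # mid-binge of a show
        return True
    partial = session["partial_movie_days_ago"]
    if partial is not None and partial <= 7:        # recent partial movie
        return True
    return session["usual_context_match"]           # same time/place/device

def likely_to_discover(session):
    if session["just_finished_title"]:              # finished movie/series
        return True
    if session["days_since_last_play"] > 7:         # inactive lately
        return True
    return session["member_tenure_days"] < 30       # relatively new member
```

The learned models described next effectively replace these hand-picked rules and thresholds with features and trained weights.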

Building a Recommendation Model for CW

§ Feature Brainstorm

§ Training Data

§ Models and Metrics

§ Implementation


Feature Ideas

§ Member-level:
  § Member's subscription: tenure, country, language
  § How active the member has been recently
  § Member's past ratings, genre preferences, etc.

Feature Ideas

§ Video and the member's previous interactions with it:
  § How recently the video was added to the catalog, watched, …
  § How much of the movie/show was watched
  § Video metadata:
    § Type and genre of video, # of episodes
    § E.g., kids' titles may be rewatched more
  § What else is in the catalog
  § Popularity and relevance of the video
  § How often members resume this video

Feature Ideas

§ Contextual:
  § Time of day and day of week
  § Location at various resolutions
  § Device

(A sketch assembling these feature ideas appears below.)
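Pulling the three slides of brainstormed features together, a flat feature vector for one (member, title, context) triple might be assembled like this; every field name is a hypothetical stand-in rather than a real Netflix feature.

```python
# Hypothetical feature assembly for one (member, title, context) triple.
# All field names are illustrative stand-ins.

def build_features(member, title, context):
    return {
        # Member-level
        "tenure_days": member["tenure_days"],
        "recent_activity_hours": member["recent_activity_hours"],
        # Video and the member's interactions with it
        "days_since_last_watch": title["days_since_last_watch"],
        "fraction_watched": title["fraction_watched"],
        "is_kids_title": title["is_kids_title"],
        "resume_rate": title["resume_rate"],  # how often members resume it
        # Contextual
        "hour_of_day": context["hour_of_day"],
        "day_of_week": context["day_of_week"],
        "device_type": context["device_type"],
    }
```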

Title Ranking Model

§ Training data
  § Continuation sessions
  § Look at which of the recently watched titles was played

§ Model
  § Learning-to-rank: linear models, ensembles, …
  § Optimize for how well we rank the played title among the other titles (see the sketch below)
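Under the slide's framing, a minimal pointwise variant of this setup could look like the following; the logistic-regression choice, the three features, and the toy data are assumptions, since the talk only names linear models and ensembles.

```python
# Pointwise learning-to-rank sketch for the CW title ranking model:
# score each recently watched candidate in a continuation session and
# rank by predicted resume probability. Data and features are made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (session, candidate title):
# [recency_rank, fraction_watched, hour_of_day_match], label = was played
X = np.array([[1, 0.5, 1],
              [2, 0.9, 0],
              [1, 0.2, 0],
              [2, 0.7, 1]])
y = np.array([1, 0, 0, 1])

model = LogisticRegression().fit(X, y)

# Rank a new session's CW candidates, best first.
candidates = np.array([[1, 0.3, 0], [2, 0.8, 1]])
scores = model.predict_proba(candidates)[:, 1]
print(np.argsort(-scores))
```

A pairwise or listwise objective, or an ensemble in place of the linear model, would slot into the same scaffold.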

Title Ranking Model: Performance

§ Baseline: Ranking by recency of last play
  § Recency rank was also an important feature in the model

§ Metrics significantly higher than the baseline
  § E.g., significant lift in precision
  § A/B testing also showed improvements

Row Placement Model

§ Objective
  § Estimate the likelihood of continuation vs. discovery
  § Map that likelihood to a position on the page

§ Simplification:
  § Fix two candidate positions on the page and apply a threshold
  § Tune the threshold to optimize some accuracy metric

Row Placement Model: Training

§ Training data
  § Randomly select sessions with plays globally

§ Model
  § Binary classification of continuation vs. discovery sessions (see the sketch below)
  § Evaluated using classification and ranking metrics
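A minimal sketch of this classifier, combined with the two-position simplification from the previous slide, could look like this; the features, the 0.6 threshold, and the two page positions are illustrative assumptions.

```python
# Row placement sketch: classify the session as continuation (1) vs.
# discovery (0), then map the probability to one of two fixed page
# positions via a threshold. Features and values are made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per session: [mid_series, days_since_last_play, tenure_days]
X = np.array([[1, 0, 400], [0, 14, 20], [1, 1, 90], [0, 30, 10]])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)

def cw_row_position(session_features, threshold=0.6):
    p_continue = clf.predict_proba([session_features])[0, 1]
    return 0 if p_continue >= threshold else 5  # top slot vs. lower slot

print(cw_row_position([1, 0, 200]))
```

As the next slide notes, the threshold is what trades off false positives against false negatives, and it was ultimately tuned by A/B testing rather than offline.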

Row Placement Model: Performance

§ Metrics
  § Achieved high classification metrics for predicting continuation vs. discovery
  § Error types:
    § False positive → CW occupies the top of the page unnecessarily
    § False negative → Difficult for the member to find the CW title

§ Placing the row
  § Threshold trades off FP and FN → Hard to tune offline
  § Tuned the threshold by A/B testing

Reusing the Title Ranking Model

§ Use the title-level scores
  § Calibrate scores to get a probability P_t of continuation for each CW title t
  § Aggregate into an overall probability of continuation
  § E.g., assuming independence:

    P_CW = 1 − ∏_{t ∈ CW} (1 − P_t)

§ Pro: Avoids maintaining two separate models

§ Con: Not as accurate as a dedicated model
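A one-line implementation of that independence-based aggregation; the calibrated per-title probabilities are made-up inputs.

```python
# P_CW = 1 - prod over CW titles t of (1 - P_t), from the slide.

from math import prod

def p_continue_watching(p_titles):
    """Probability that at least one CW title is resumed, assuming the
    per-title resume events are independent."""
    return 1 - prod(1 - p for p in p_titles)

print(p_continue_watching([0.5, 0.3, 0.1]))  # 1 - 0.5*0.7*0.9 = 0.685
```

This P_CW can then feed the same threshold-based placement as the dedicated model, at the cost of some accuracy.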

Context Awareness

§ A title ranks highest at the same time of day and on the same device as its last play

§ Experiment:
  § Played "Sid the Science Kid" on iPhone
  § Played "Narcos" on the website
  § → Different rankings on iPhone and on the web

Serving the CW Row in Production

§ Scores cannot be precomputed → Real-time or near-real-time computation
  § Some features are context dependent
  § The row should refresh each time a member watches a title
  § Need to push updates to clients to keep the row fresh

§ Latency bottleneck: Data transfers from the cache to the computation backend
  § Requires careful backend engineering
  § Fallback strategy: If computation fails, use recency ranking (see the sketch below)
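The fallback strategy could be wired up as simply as the following; the scoring-service interface, timeout, and data shapes are hypothetical, and only the fall-back-to-recency idea comes from the slide.

```python
# Serve the model-ranked CW row; if real-time scoring fails or times
# out, fall back to the legacy recency-of-last-watch ordering.
# `score_service` and all field names are hypothetical.

def serve_cw_row(member_id, cw_titles, score_service, timeout_s=0.05):
    try:
        scores = score_service.score(member_id, cw_titles, timeout=timeout_s)
        return sorted(cw_titles, key=lambda t: scores[t["id"]], reverse=True)
    except Exception:
        return sorted(cw_titles, key=lambda t: t["last_watched_ts"],
                      reverse=True)
```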

Conclusions and Future Directions


Conclusions

§ Important to understand different modes of behavior

§ Continuation is a key driver of streaming hours
  § Improving CW recommendations improves the member experience
  § A/B testing showed a significant boost in user engagement

§ Future:
  § Incorporate the placement of the CW row (and others) into the main page construction model
  § When can we automatically start resuming a title?

Questions?

Upcoming blog post on this topic at: techblog.netflix.com

Job openings: jobs.netflix.com
