Show Me the Problem: Our Insights Journey at Netflix
Suudhan Rangarajan, Senior Software Engineer, Playback Features
@suudhan
On Feb 26th...
Our Insights Journey
Before Elasticsearch (ES)
Today: Elasticsearch (ES) + Kibana
The Future: Taking Insights to the Next Level
Motivation
Why is Insights a critical part of our Service?
DVDs and IFO Files
VIDEO_TS
VIDEO_TS.VOB, VIDEO_TS.BUP, VIDEO_TS.IFO
PLAYBACK CONTEXT (Tracks + Track URLs)
Our Service
NETFLIX OPENCONNECT CDN URLS
Many Many Dimensions
PLAYBACK CONTEXT
COUNTRY
USER PREFERENCES
TITLE METADATA
DEVICE
NETWORK
Tens of Millions of custom DVDs
Errors Happen
Before ES
Distributed Grep
Log-enabled Clusters
Instances with Verbose Logging
Before ES
Diagnostics REST Endpoints
Before ES
Incident To Resolution Time
10 min of Detection
2+ Hours of Analysis
5 min of Resolution
An Incident Review
Before ES
Hours and even Days of Debugging time
High Incident-To-Resolution Time
No Big picture View
No insights into QoE
Our Insights Journey
Before ES
● Hours and even Days of Debugging time
● High Incident-To-Resolution Time
● No Big picture View
● No insights into QoE
Today: ES + Kibana
The Future: Taking Insights to the Next Level
What are Elasticsearch & Kibana?
Now: ES + Kibana
Log Essential Data for all Requests
Now: ES + Kibana
Find Specific Request Fast
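As a rough illustration of finding one request quickly, an exact-match lookup can be written as an Elasticsearch `term` query. The field name `request_id` and the helper `find_request_query` are hypothetical; the slides do not show the actual schema.

```python
from typing import Any, Dict


def find_request_query(request_id: str) -> Dict[str, Any]:
    """Build a search body that matches one request by its exact ID.

    Assumes a hypothetical `request_id` field mapped as a keyword
    (not analyzed), which is what a `term` query needs to match exactly.
    """
    return {
        "size": 1,  # only one document is expected for a unique request ID
        "query": {"term": {"request_id": request_id}},
    }


body = find_request_query("req-123")
```

The resulting dictionary is the JSON body that would be sent to the index's `_search` endpoint.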
Now: ES + Kibana
Interactive Exploration
Now: ES + Kibana
Top N queries
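A Top N question maps naturally onto a `terms` aggregation. The sketch below builds such a request body and reads the buckets back; the field name `error_code` and the sample response are made up for illustration and are not taken from the talk.

```python
from typing import Any, Dict, List, Tuple


def top_n_body(field_name: str, n: int = 5) -> Dict[str, Any]:
    """Request the n most frequent values of field_name."""
    return {
        "size": 0,  # only aggregation buckets are needed, not the hits
        "aggs": {"top_values": {"terms": {"field": field_name, "size": n}}},
    }


def top_n_from_response(response: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Extract (value, count) pairs from a terms aggregation response."""
    buckets = response["aggregations"]["top_values"]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


# Hand-written stand-in for a search response; not real data.
sample_response = {
    "aggregations": {
        "top_values": {
            "buckets": [
                {"key": "E1001", "doc_count": 42},
                {"key": "E2002", "doc_count": 7},
            ]
        }
    }
}
top_errors = top_n_from_response(sample_response)
```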
Our Insights Philosophy
Keep It Simple, Stupid
INPUTS, DECISIONS, OUTPUTS
Just Log Essential Data
For every feature, generate insights:
● Input parameters
● Decision factors
● Results
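One way to picture "inputs, decisions, results" is as a small structured log document per request and feature. The class `InsightEvent`, its field names, and the sample values below are all assumptions made for this sketch; the talk does not show the real log schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class InsightEvent:
    """One log document per request and feature (hypothetical schema)."""
    request_id: str
    feature: str
    inputs: Dict[str, Any] = field(default_factory=dict)     # input parameters
    decisions: Dict[str, Any] = field(default_factory=dict)  # decision factors
    results: Dict[str, Any] = field(default_factory=dict)    # what was returned

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable across runs
        return json.dumps(asdict(self), sort_keys=True)


event = InsightEvent(
    request_id="req-123",
    feature="audio_track_selection",
    inputs={"country": "US", "device": "ios", "preferred_language": "en"},
    decisions={"default_audio_rule": "user_preference"},
    results={"default_audio": "en", "audio_tracks_returned": 3},
)
doc = event.to_json()  # this string is what would be indexed
```

Keeping the three groups as separate objects makes it easy to ask later whether a bad result came from an unexpected input or from the decision logic.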
Micro-analytics
Key Observation
This customer has a problem playing this title
Our device partner is not able to test the new HEVC encodes - we don’t seem to be returning those streams
Our latest iOS client always plays Spanish audio by default.
Macro-analytics
Key Observation
How many requests had a max Video Resolution of 720p?
Are we returning Chinese audio for this title in all these countries?
With this feature roll-out, how many unique customers are impacted? Should we roll-back or fix-forward?
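The "how many unique customers are impacted" question can be approximated with a `cardinality` aggregation, which returns an approximate distinct count. In this sketch the field names `feature_variant` and `customer_id` and the sample response are hypothetical.

```python
from typing import Any, Dict


def impacted_customers_body(feature_field: str, feature_value: str) -> Dict[str, Any]:
    """Count distinct customers among requests that hit a given feature variant."""
    return {
        "size": 0,
        "query": {"term": {feature_field: feature_value}},
        "aggs": {
            # cardinality is an approximate distinct count
            "unique_customers": {"cardinality": {"field": "customer_id"}}
        },
    }


def impacted_customers(response: Dict[str, Any]) -> int:
    """Read the distinct-customer estimate out of the response."""
    return response["aggregations"]["unique_customers"]["value"]


body = impacted_customers_body("feature_variant", "new_rollout")

# Hand-written stand-in for a search response; not real data.
sample_response = {"aggregations": {"unique_customers": {"value": 1250}}}
count = impacted_customers(sample_response)
```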
A Success Story
Incident To Resolution Time
Before ES: 10 min of Detection, 2+ Hours of Analysis, 5 min of Resolution
Today: ES + Kibana: 10 min of Detection, 10 min of Analysis, 5 min of Resolution
With ES and Kibana
Fast Root-Cause Analysis (Minutes and Seconds)
Quick Incident Resolution
Macro-Analytics
Still Manual
Our Insights Journey
Before ES
● Hours and even Days of Debugging time
● High Incident-To-Resolution Time
● No Big picture View
● No insights into Quality of Experience
Today: ES + Kibana
● Fast Root-Cause Analysis (minutes and seconds)
● Quick Incident Resolution
● Macro-Analytics
● Still Manual
The Future: Automated RCA
Today’s problems
When Developers are focused on Innovation and Creative Problem Solving, a context-switch becomes very costly
Automated Root Cause Analysis
Taking Insights Further
The Runbook
Look up trends in Kibana
Identify dimensions causing the issue
Figure out resolution options
Identifying Repetition
Alert fires
Show me the Problem
Alert fires
Send out Resolution options
Awesome Service for Prod Incident REsolution (ASPIRE)
Don’t Repeat Yourself
ASPIRE Workflow
ES Aggregations FTW
1. Start an ES Query with the Alert Dimension
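A minimal sketch of step 1, assuming the alert names a single dimension and value (for example a title or a device type). The helper `alert_query`, the field names, and the 15-minute window are illustrative choices, not details given in the talk.

```python
from typing import Any, Dict


def alert_query(dimension: str, value: str,
                window: str = "now-15m",
                timestamp_field: str = "@timestamp") -> Dict[str, Any]:
    """Scope the search to documents matching the alert dimension, recently."""
    return {
        "bool": {
            # filter clauses match exactly and do not affect scoring
            "filter": [
                {"term": {dimension: value}},
                {"range": {timestamp_field: {"gte": window}}},
            ]
        }
    }


query = alert_query("title_id", "title-42")
```

Later steps can reuse this query as the scope for their aggregations.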
ASPIRE Workflow
ES Aggregations FTW
2. Combine it with a Significant Terms Aggregation on Error Codes
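Step 2 could look roughly like the following. A `significant_terms` aggregation compares how often each error code appears in the documents matched by the query against how often it appears in the index as a whole. The field name `error_code`, the scoping query, and the sample response are hypothetical.

```python
from typing import Any, Dict, List


def significant_errors_body(query: Dict[str, Any],
                            error_field: str = "error_code") -> Dict[str, Any]:
    """Attach a significant_terms aggregation to an alert-scoped query."""
    return {
        "size": 0,
        "query": query,  # foreground set: documents matching the alert
        "aggs": {
            "significant_errors": {"significant_terms": {"field": error_field}}
        },
    }


def significant_error_codes(response: Dict[str, Any]) -> List[str]:
    """Return error codes ordered from most to least significant."""
    buckets = response["aggregations"]["significant_errors"]["buckets"]
    ranked = sorted(buckets, key=lambda b: b["score"], reverse=True)
    return [b["key"] for b in ranked]


body = significant_errors_body({"term": {"title_id": "title-42"}})

# Hand-written stand-in for a search response; not real data.
sample_response = {
    "aggregations": {
        "significant_errors": {
            "buckets": [
                {"key": "UNEXPECTED_EXCEPTION", "doc_count": 40, "score": 0.8, "bg_count": 60},
                {"key": "MATURITY_ERROR", "doc_count": 300, "score": 2.5, "bg_count": 320},
            ]
        }
    }
}
codes = significant_error_codes(sample_response)
```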
ASPIRE Workflow
ES Aggregations FTW
3. Cardinality Aggregation on Top Dimensions → Sort on % distinctness
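The slide does not define "% distinctness". One plausible reading, used in this sketch, is the number of distinct values a dimension takes among the failing requests as a percentage of the distinct values it takes in baseline traffic. A low percentage then suggests the failures are concentrated in a few values of that dimension. The dimension names and counts are made up.

```python
from typing import Any, Dict, List, Tuple


def cardinality_body(query: Dict[str, Any], dimensions: List[str]) -> Dict[str, Any]:
    """One cardinality aggregation per top-level dimension."""
    return {
        "size": 0,
        "query": query,
        "aggs": {f"distinct_{d}": {"cardinality": {"field": d}} for d in dimensions},
    }


def rank_by_distinctness(error_resp: Dict[str, Any],
                         baseline_resp: Dict[str, Any],
                         dimensions: List[str]) -> List[Tuple[str, float]]:
    """Sort dimensions by distinct-in-errors / distinct-in-baseline, ascending."""
    ranked = []
    for d in dimensions:
        err = error_resp["aggregations"][f"distinct_{d}"]["value"]
        base = baseline_resp["aggregations"][f"distinct_{d}"]["value"]
        pct = 100.0 * err / base if base else 0.0
        ranked.append((d, pct))
    return sorted(ranked, key=lambda item: item[1])


dims = ["country", "device", "request_type"]

# Hand-written stand-ins for two search responses; not real data.
error_resp = {"aggregations": {
    "distinct_country": {"value": 2},
    "distinct_device": {"value": 40},
    "distinct_request_type": {"value": 3},
}}
baseline_resp = {"aggregations": {
    "distinct_country": {"value": 190},
    "distinct_device": {"value": 50},
    "distinct_request_type": {"value": 4},
}}
ranked = rank_by_distinctness(error_resp, baseline_resp, dims)
```

With these sample numbers, `country` ranks first because the failures touch only 2 of 190 countries.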
ASPIRE Workflow
ES Aggregations FTW
4. Terms or Cardinality Aggregation on specific Sub-Dimensions → Sort on % distinctness
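Step 4 might narrow the search to the top-dimension values found in step 3 and then break the failures down by sub-dimension. The sketch below uses `terms` aggregations and reports each value's share of documents. The field names echo the title-alert example later in the deck, but whether they match the real index is an assumption, and the counts are invented.

```python
from typing import Any, Dict, List, Tuple


def sub_dimension_body(top_field: str, top_values: List[str],
                       sub_fields: List[str], n: int = 5) -> Dict[str, Any]:
    """Terms aggregations on sub-dimensions, scoped to the top-dimension values."""
    return {
        "size": 0,
        # a "terms" query keeps documents whose field matches any listed value
        "query": {"terms": {top_field: top_values}},
        "aggs": {f"by_{f}": {"terms": {"field": f, "size": n}} for f in sub_fields},
    }


def bucket_shares(agg: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Percentage of documents per bucket, largest share first."""
    total = sum(b["doc_count"] for b in agg["buckets"])
    total += agg.get("sum_other_doc_count", 0)  # documents outside the top buckets
    if total == 0:
        return []
    shares = [(b["key"], 100.0 * b["doc_count"] / total) for b in agg["buckets"]]
    return sorted(shares, key=lambda item: item[1], reverse=True)


body = sub_dimension_body("country", ["US", "BR"],
                          ["titleMaturityLevel", "customerMaturityLevel"])

# Hand-written stand-in for one aggregation result; not real data.
sample_agg = {
    "sum_other_doc_count": 0,
    "buckets": [
        {"key": "TV-14", "doc_count": 10},
        {"key": "TV-Y7", "doc_count": 90},
    ],
}
shares = bucket_shares(sample_agg)
```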
ASPIRE Workflow
ES Aggregations FTW
5. Collect all results and email
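The final step can be sketched with the standard library's `email.message.EmailMessage`. The helper `build_report`, the addresses, and the subject format are placeholders; the talk does not describe the actual report layout. Sending is left out on purpose.

```python
from email.message import EmailMessage
from typing import Dict, List


def build_report(alert: str, findings: Dict[str, List[str]],
                 sender: str, recipient: str) -> EmailMessage:
    """Collect the findings from the earlier steps into one plain-text email."""
    lines = [f"Alert: {alert}", ""]
    for section, items in findings.items():
        lines.append(f"{section}:")
        lines.extend(f"  - {item}" for item in items)

    msg = EmailMessage()
    msg["Subject"] = f"[RCA] {alert}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines))
    return msg


report = build_report(
    alert="Title Alert",
    findings={
        "Significant errors": ["Maturity Error"],
        "Top dimensions": ["countries: US, BR"],
    },
    sender="rca-bot@example.com",
    recipient="oncall@example.com",
)
# Delivery is not shown; smtplib.SMTP(host).send_message(report) is one option.
```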
The 80% Use-case
Title Alert
Maturity Error is statistically significant
Top Dimensions [countries: US, BR]
Sub Dimensions [titleMaturityLevel: TV-Y7, customerMaturityLevel: Age<=6]
Title Alert
What Caused the Alert (Error scenario)?
Is it Specific to a Country, a Device or RequestType?
What Changed to cause the Alert?
A Complex Use-case
Device Alert
“All Video tracks are filtered out” is one statistically significant error
Sub Dimensions:
● Available Video Tracks
● Filtrations on Video Tracks
Unexpected Exception is another statistically significant error
Sub Dimensions:
● Exception Stack Trace
● Server instances
Incident To Resolution Time
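Before ES: 10 min of Detection, 2+ Hours of Analysis, 5 min of Resolution
Today: ES + Kibana: 10 min of Detection, 10 min of Analysis, 5 min of Resolution
Tomorrow: ASPIRE: 2 min of Detection, 2 min of Analysis, 5 min of Resolution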
ASPIRE
Automated and Scalable RCA
Cut the Slow Middle-Man
Increased Developer Productivity
Our Insights Journey
Before ES
● Hours and even Days of Debugging time
● High Incident-To-Resolution Time
● No Big picture View
Today: ES + Kibana
● Fast Root-Cause Analysis (minutes and seconds)
● Quick Incident Resolution
● Still Manual
The Future: Automated RCA
● Scalable RCA
● Cut the slow Middle-man
● Increased Developer Productivity
Big Takeaways
Invest in a micro-&-macro analytics tool for your service
Empower your runbook automation with ES aggregations
Discussion
What’s your Insights Story?
Parting Thought
Imagine you are deeply engaged in designing the next big thing for your team. Production pages start firing, but problems are getting analyzed and routed to the teams who can fix them. You can focus on your deep thinking while the machines take care of themselves.