Show me the Problem: Our Insights Journey at Netflix. Suudhan Rangarajan, Senior Software Engineer, Playback Features (@suudhan)

Upload: suudhan-rangarajan

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Show me the problem- Our insights journey at Netflix

Show me the Problem Our Insights Journey at Netflix

Suudhan Rangarajan Senior Software Engineer, Playback Features

@suudhan

Page 2: Show me the problem- Our insights journey at Netflix

On Feb 26th...

Page 3: Show me the problem- Our insights journey at Netflix

Our Insights Journey

Before Elasticsearch (ES)

Today: Elasticsearch (ES) + Kibana

The Future: Taking Insights to Next Level

Page 4: Show me the problem- Our insights journey at Netflix

Motivation

Why is Insights a critical part of our Service?

Page 5: Show me the problem- Our insights journey at Netflix

DVDs and IFO Files

VIDEO_TS

VIDEO_TS.VOB, VIDEO_TS.BUP, VIDEO_TS.IFO

Page 6: Show me the problem- Our insights journey at Netflix

PLAYBACK CONTEXT (Tracks + Track Urls)

Our Service

NETFLIX OPENCONNECT CDN URLS

Page 7: Show me the problem- Our insights journey at Netflix

Many Many Dimensions

PLAYBACK CONTEXT

COUNTRY

USER PREFERENCES

TITLE METADATA

DEVICE

NETWORK

Page 8: Show me the problem- Our insights journey at Netflix

Tens of Millions of custom DVDs

Page 9: Show me the problem- Our insights journey at Netflix

Errors Happen

Page 10: Show me the problem- Our insights journey at Netflix

Before ES

Distributed Grep

Page 11: Show me the problem- Our insights journey at Netflix

Log-enabled Clusters

Instances with Verbose Logging

Before ES

Page 12: Show me the problem- Our insights journey at Netflix

Diagnostics REST Endpoints

Before ES

Page 13: Show me the problem- Our insights journey at Netflix

Incident To Resolution Time

10 min of Detection

2+ Hours of Analysis

5 min of Resolution

Page 14: Show me the problem- Our insights journey at Netflix

An Incident Review

Page 15: Show me the problem- Our insights journey at Netflix

Before ES

Hours and even Days of Debugging time

High Incident-To-Resolution Time

No Big picture View

No insights into QoE

Page 16: Show me the problem- Our insights journey at Netflix

Our Insights Journey

Before ES
● Hours and even Days of Debugging time
● High Incident-To-Resolution Time
● No Big picture View
● No insights into QoE

Today: ES + Kibana

The Future: Taking Insights to Next Level

Page 17: Show me the problem- Our insights journey at Netflix

What are Elasticsearch & Kibana?

Page 18: Show me the problem- Our insights journey at Netflix

Now: ES + Kibana

Log Essential Data for all Requests

Page 19: Show me the problem- Our insights journey at Netflix

Now: ES + Kibana

Find Specific Request Fast

Page 20: Show me the problem- Our insights journey at Netflix

Now: ES + Kibana

Interactive Exploration

Page 21: Show me the problem- Our insights journey at Netflix

Now: ES + Kibana

Top N queries

Page 22: Show me the problem- Our insights journey at Netflix

Our Insights Philosophy

Keep It Simple, Stupid

INPUTS, DECISIONS, OUTPUTS

Just Log Essential Data

For every feature, generate insights:
- Input parameters
- Decision factors
- Results
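
A minimal sketch of what "log inputs, decision factors, and results" could look like for one request. The function name, the feature, and every field name and value here are hypothetical; the slides do not show the actual log schema.

```python
import json
import time


def build_insight_event(feature, inputs, decisions, results):
    """Bundle one request's inputs, decision factors, and results into one record."""
    return {
        "timestamp_ms": int(time.time() * 1000),
        "feature": feature,
        "inputs": inputs,
        "decisions": decisions,
        "results": results,
    }


# Hypothetical request: every field name and value below is made up.
event = build_insight_event(
    feature="audio_track_selection",
    inputs={"country": "US", "device": "ios", "preferredLanguage": "en"},
    decisions={"defaultLanguageSource": "user_preference"},
    results={"selectedAudioLanguage": "en"},
)

# One JSON document per request, ready to be shipped to a search index.
print(json.dumps(event, sort_keys=True))
```

Keeping the three groups as separate keys makes it possible to filter on any input and aggregate on any decision or result later.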

Page 23: Show me the problem- Our insights journey at Netflix

Micro-analytics

Key Observation

This customer has a problem playing this title

Our device partner is not able to test the new HEVC encodes - we don’t seem to be returning those streams

Our latest iOS client always plays Spanish audio by default.

Page 24: Show me the problem- Our insights journey at Netflix

Macro-analytics

Key Observation

How many requests had a max Video Resolution of 720p?

Are we returning Chinese audio for this title in all these countries?

With this feature roll-out, how many unique customers are impacted? Should we roll-back or fix-forward?
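
Questions like these map fairly directly onto Elasticsearch request bodies. The sketch below only builds the bodies as Python dicts; the field names (`maxVideoResolution`, `featureRollout`, `customerId`) and values are assumptions, not the real index schema. Note that the `cardinality` aggregation returns an approximate distinct count.

```python
def requests_matching(field, value):
    """Search body that filters on one field and returns no hits, only the total."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [{"term": {field: value}}]}},
    }


def unique_customers_matching(field, value, customer_field="customerId"):
    """Same filter, plus an approximate count of distinct customers."""
    body = requests_matching(field, value)
    body["aggs"] = {
        "unique_customers": {"cardinality": {"field": customer_field}}
    }
    return body


# Hypothetical field names and values, for illustration only.
resolution_query = requests_matching("maxVideoResolution", "720p")
impact_query = unique_customers_matching("featureRollout", "new_audio_default")
```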

Page 25: Show me the problem- Our insights journey at Netflix

A Success Story

Page 26: Show me the problem- Our insights journey at Netflix

Incident To Resolution Time

                    Detection   Analysis   Resolution
Before ES           10 min      2+ Hours   5 min
Today: ES + Kibana  10 min      10 min     5 min

Page 27: Show me the problem- Our insights journey at Netflix

With ES and Kibana

Fast Root-Cause-Analysis (Minutes and Seconds)

Quick Incident Resolution

Macro-Analytics

Still Manual

Page 28: Show me the problem- Our insights journey at Netflix

Our Insights Journey

Before ES
● Hours and even Days of Debugging time
● High Incident-To-Resolution Time
● No Big picture View
● No insights into Quality of Experience

Today: ES + Kibana
● Fast Root-cause-Analysis (minutes and seconds)
● Quick Incident Resolution
● Macro-Analytics
● Still Manual

The Future: Automated RCA

Page 29: Show me the problem- Our insights journey at Netflix

Today’s problems

When Developers are focused on Innovation and Creative Problem Solving, a context-switch becomes very costly

Page 30: Show me the problem- Our insights journey at Netflix

Automated Root Cause Analysis

Taking Insights Further

Page 31: Show me the problem- Our insights journey at Netflix

Identifying Repetition

Alert fires

The Runbook Lookup

trends in Kibana

Identify dimensions causing the issue

Figure out resolution options

Page 32: Show me the problem- Our insights journey at Netflix

Show me the Problem

Alert fires

Send out Resolution options

Awesome Service for Prod Incident REsolution (ASPIRE)

Don’t Repeat Yourself

Page 33: Show me the problem- Our insights journey at Netflix

ASPIRE Workflow

ES Aggregations FTW

1. Start an ES Query with the Alert Dimension
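
A rough sketch of this first step, building the request body as a Python dict. The dimension name `titleId`, the `@timestamp` field, and the 15-minute window are assumptions; the slides do not show the actual query.

```python
def alert_base_query(dimension, value, window="now-15m", time_field="@timestamp"):
    """Filter requests to the alerting dimension within a recent time window."""
    return {
        "size": 0,  # only aggregations are needed later, not the raw hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {dimension: value}},
                    {"range": {time_field: {"gte": window}}},
                ]
            }
        },
    }


# Hypothetical title alert.
base = alert_base_query("titleId", "12345")
```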

Page 34: Show me the problem- Our insights journey at Netflix

ASPIRE Workflow

ES Aggregations FTW

2. Combine it with a Significant Terms Aggregation on Error Codes
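
One way this step might look: attach a `significant_terms` aggregation to the alert query and read back the error codes that are unusually frequent in the alerting subset compared with the index as a whole. The field name `errorCode` and the sample response values are invented for illustration.

```python
import copy


def with_significant_errors(base_query, error_field="errorCode", size=5):
    """Return a copy of the query with a significant_terms aggregation attached."""
    body = copy.deepcopy(base_query)
    body.setdefault("aggs", {})["significant_errors"] = {
        "significant_terms": {"field": error_field, "size": size}
    }
    return body


def significant_error_codes(response):
    """Extract (error code, score) pairs from the aggregation buckets."""
    buckets = response["aggregations"]["significant_errors"]["buckets"]
    return [(bucket["key"], bucket["score"]) for bucket in buckets]


# Hypothetical alert query and a hand-written response in the usual bucket shape.
alert_query = {"size": 0, "query": {"term": {"titleId": "12345"}}}
body = with_significant_errors(alert_query)
sample_response = {
    "aggregations": {
        "significant_errors": {
            "buckets": [
                {"key": "MATURITY_ERROR", "doc_count": 180, "score": 4.2, "bg_count": 200}
            ]
        }
    }
}
errors = significant_error_codes(sample_response)
```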

Page 35: Show me the problem- Our insights journey at Netflix

ASPIRE Workflow

ES Aggregations FTW

3. Cardinality Aggregation on Top Dimensions → Sort on % distinctness
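
The slides do not define "% distinctness". The sketch below assumes one plausible reading: the number of distinct values of a dimension divided by the number of matching requests, so that a low percentage means the errors are concentrated in a few values. Dimension names and sample numbers are made up.

```python
def distinctness_aggs(dimensions):
    """One cardinality aggregation per candidate dimension."""
    return {f"distinct_{dim}": {"cardinality": {"field": dim}} for dim in dimensions}


def rank_by_distinctness(response, dimensions):
    """Sort dimensions by distinct values per 100 matching requests, lowest first."""
    total = response["hits"]["total"]
    if isinstance(total, dict):  # newer Elasticsearch versions wrap the total in an object
        total = total["value"]
    if total == 0:
        return []
    ranked = []
    for dim in dimensions:
        distinct = response["aggregations"][f"distinct_{dim}"]["value"]
        ranked.append((dim, 100.0 * distinct / total))
    return sorted(ranked, key=lambda row: row[1])


# Hypothetical dimensions and a hand-written response.
dims = ["country", "deviceType", "customerId"]
aggs = distinctness_aggs(dims)
sample_response = {
    "hits": {"total": 1000},
    "aggregations": {
        "distinct_country": {"value": 2},
        "distinct_deviceType": {"value": 40},
        "distinct_customerId": {"value": 950},
    },
}
ranked = rank_by_distinctness(sample_response, dims)
```

In this made-up sample, `country` ranks first because only two distinct countries account for all failing requests, which makes it the most promising dimension to drill into.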

Page 36: Show me the problem- Our insights journey at Netflix

ASPIRE Workflow

ES Aggregations FTW

4. Terms or Cardinality Aggregation on specific Sub-Dimensions → Sort on % distinctness
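
For the drill-down, a `terms` aggregation returns the most frequent values of each sub-dimension. This sketch keeps only values that cover at least half of the failing requests; that threshold and the sample counts are assumptions. The field names are borrowed from the title-alert example later in the deck.

```python
def sub_dimension_aggs(fields, size=5):
    """One terms aggregation per sub-dimension, returning its most frequent values."""
    return {f"top_{field}": {"terms": {"field": field, "size": size}} for field in fields}


def dominant_values(response, fields, total, min_share=0.5):
    """Keep values whose share of the failing requests is at least min_share."""
    if total <= 0:
        return {}
    found = {}
    for field in fields:
        buckets = response["aggregations"][f"top_{field}"]["buckets"]
        found[field] = [
            bucket["key"]
            for bucket in buckets
            if bucket["doc_count"] / total >= min_share
        ]
    return found


# Hand-written response; the counts are invented.
fields = ["titleMaturityLevel", "customerMaturityLevel"]
aggs = sub_dimension_aggs(fields)
sample_response = {
    "aggregations": {
        "top_titleMaturityLevel": {
            "buckets": [{"key": "TV-Y7", "doc_count": 190}, {"key": "TV-14", "doc_count": 10}]
        },
        "top_customerMaturityLevel": {
            "buckets": [{"key": "Age<=6", "doc_count": 185}, {"key": "Age<=12", "doc_count": 15}]
        },
    }
}
dominant = dominant_values(sample_response, fields, total=200)
```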

Page 37: Show me the problem- Our insights journey at Netflix

ASPIRE Workflow

ES Aggregations FTW

5. Collect all results and email
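
A small sketch of the final step using Python's standard `email` package. The addresses, subject format, and findings layout are placeholders; how ASPIRE actually formats or sends its mail is not shown in the slides.

```python
from email.message import EmailMessage


def build_rca_email(alert, findings, sender, recipient):
    """Render collected findings as a plain-text email message."""
    msg = EmailMessage()
    msg["Subject"] = f"[RCA] {alert}"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"Alert: {alert}", ""]
    for section, items in findings.items():
        lines.append(f"{section}:")
        lines.extend(f"  - {item}" for item in items)
    msg.set_content("\n".join(lines))
    return msg


# Findings shaped like the title-alert example; addresses are placeholders.
findings = {
    "Significant errors": ["Maturity Error"],
    "Top dimensions": ["countries: US, BR"],
    "Sub dimensions": ["titleMaturityLevel: TV-Y7", "customerMaturityLevel: Age<=6"],
}
msg = build_rca_email("Title Alert", findings, "aspire@example.com", "oncall@example.com")
# Sending would be a separate step, e.g. smtplib.SMTP(host).send_message(msg).
```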

Page 38: Show me the problem- Our insights journey at Netflix

The 80% Use-case

Title Alert

What Caused the Alert (Error scenario)?

Is it Specific to a Country, a Device or RequestType?

What Changed to cause the Alert?

Title Alert

Maturity Error is statistically significant

Top Dimensions [countries: US, BR]

Sub Dimensions [titleMaturityLevel: TV-Y7, customerMaturityLevel: Age<=6]

Page 39: Show me the problem- Our insights journey at Netflix

A Complex Use-case

Device Alert

“All Video tracks are filtered out” is one statistically significant error

Sub Dimensions:
● Available Video Tracks
● Filtrations on Video Tracks

Unexpected Exception is another statistically significant error

Sub Dimensions:
● Exception Stack Trace
● Server instances

Page 40: Show me the problem- Our insights journey at Netflix

Incident To Resolution Time

                    Detection   Analysis   Resolution
Before ES           10 min      2+ Hours   5 min
Today: ES + Kibana  10 min      10 min     5 min
Tomorrow: ASPIRE    2 min       2 min      5 min

Page 41: Show me the problem- Our insights journey at Netflix

ASPIRE

Automated and Scalable RCA

Cut the Slow Middle-Man

Increased Developer Productivity

Page 42: Show me the problem- Our insights journey at Netflix

Our Insights Journey

Before ES
● Hours and even Days of Debugging time
● High Incident-To-Resolution Time
● No Big picture View

Today: ES + Kibana
● Fast Root-cause-Analysis (minutes and seconds)
● Quick Incident Resolution
● Still Manual

The Future: Automated RCA
● Scalable RCA
● Cut the slow Middle-man
● Increased Developer Productivity

Page 43: Show me the problem- Our insights journey at Netflix

Big Takeaways

Invest in a micro-&-macro analytics tool for your service

Empower your runbook automation with ES aggregations

Page 44: Show me the problem- Our insights journey at Netflix

Discussion

What’s your Insights Story?

Page 45: Show me the problem- Our insights journey at Netflix

Parting Thought

Imagine you are deeply engaged in designing the next big thing for your team. Production pages start firing, but the problems are analyzed and routed to the teams who can fix them. You can focus on your deep thinking while the machines take care of themselves.