the new netflix api

The new Netflix API: Why more complexity must lead to more simplicity. Katharina Probst, DevNexus 2017

Upload: katharina-probst

Post on 12-Apr-2017

TRANSCRIPT

Page 1: The new Netflix API

The new Netflix API

Why more complexity must lead to more simplicity

Katharina Probst, DevNexus 2017

Page 2: The new Netflix API
Page 3: The new Netflix API

Today’s architecture

[Diagram: Gateway, then the API Server JVM containing scripts and client libs A … Z, then Netflix micro-services, with network boundaries between the tiers. Language labels on the diagram: JS (mostly), Groovy, Java.]

Page 4: The new Netflix API

What is the Netflix API?

Page 5: The new Netflix API

Raison d’Être

Page 6: The new Netflix API

Raison d’Être

Is the API just one gigantic translation layer?

Is it a routing layer?

If it’s too complex, can we just get rid of it?

Page 7: The new Netflix API

Raison d’Être

1. Orchestration

2. Availability protection

3. Abstraction

Page 8: The new Netflix API

1. Orchestration

Page 9: The new Netflix API

Simple example: search

Page 10: The new Netflix API
Page 11: The new Netflix API

Related Terms

Page 12: The new Netflix API

People

Page 13: The new Netflix API

Titles

Page 14: The new Netflix API

Search request → response

● Search service provides related search terms
● Search service provides IDs for videos and people
  ○ IDs depend on various factors, e.g., different catalogs in different countries
● For each ID, we need metadata
  ○ Titles
  ○ Images
  ○ Names
  ○ Ratings
  ○ etc.
● ..., which depend on
  ○ Country
  ○ A/B tests user is in
  ○ etc.

Response:
❏ Hydrated videos
❏ People names
❏ Query suggestions
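The search flow above can be sketched roughly as follows. This is a minimal illustration, not Netflix's code: `SearchService`, `MetadataService`, `SearchResponse` and all method names are hypothetical stand-ins for the downstream services and the API's response.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: every type and method name here is hypothetical, not Netflix's API.
final class SearchOrchestration {

    /** Stand-in for the search service, which returns IDs and related terms only. */
    interface SearchService {
        List<String> videoIds(String query, String country);
        List<String> personIds(String query, String country);
        List<String> relatedTerms(String query);
    }

    /** Stand-in for the metadata services that hydrate an ID. */
    interface MetadataService {
        String videoTitle(String videoId, String country);
        String personName(String personId);
    }

    /** The merged response: hydrated videos, people names, query suggestions. */
    static final class SearchResponse {
        final List<String> hydratedVideos = new ArrayList<>();
        final List<String> peopleNames = new ArrayList<>();
        final List<String> querySuggestions = new ArrayList<>();
    }

    static SearchResponse search(String query, String country,
                                 SearchService search, MetadataService metadata) {
        SearchResponse response = new SearchResponse();
        // Step 1: ask the search service for IDs (these depend on the country's catalog).
        // Step 2: hydrate each ID with metadata that also depends on the country.
        for (String id : search.videoIds(query, country)) {
            response.hydratedVideos.add(id + ": " + metadata.videoTitle(id, country));
        }
        for (String id : search.personIds(query, country)) {
            response.peopleNames.add(metadata.personName(id));
        }
        // Step 3: merge the related search terms into the same response.
        response.querySuggestions.addAll(search.relatedTerms(query));
        return response;
    }
}
```

The point of the sketch is that the API layer, not the device, owns the order of operations: IDs first, then hydration, then the merge.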

Page 15: The new Netflix API

Orchestration

● Own order of operations
● Provide whatever info clients/services need
  ○ From other clients/libraries/services
  ○ From request
● Merge partial results
● Filter results
● Retrieve more info if necessary
● Support mutations (e.g., profile switch)
● Support complex transactions in a limited way

Page 16: The new Netflix API
Page 17: The new Netflix API

2. Availability protection

Page 18: The new Netflix API

Prevent this as much as possible

Page 19: The new Netflix API

What do customers want?

● No personalized recommendations, or no ability to stream?
● No search, or no ability to continue watching the movie you started last night?
● No cutting-edge A/B experiment experience, or no ability to stream?

Page 20: The new Netflix API

Top priority: customer experience

● Top priority of top priority: customer can stream videos
● This means API cannot go down entirely
  ○ If it does, we have an outage
● But some services are not critical to this mission
  ○ A/B - if we don’t know what A/B tests you’re in, you can still get the default experience
  ○ Search - if you can’t search, you can still browse

Page 21: The new Netflix API

Exposure to failures

● As your app grows, your set of dependencies is much more likely to get bigger, not smaller

● Overall uptime = (Dep uptime)^(num deps)
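A small worked example of the uptime formula, as a sketch. It assumes every dependency is required for a request to succeed, that dependencies fail independently, and that they all have the same uptime; the numbers (30 dependencies at 99.99%) are illustrative and not taken from the slides. Under those assumptions the overall uptime comes out at roughly 99.7%.

```java
// Sketch: evaluates overall uptime = (dep uptime)^(num deps).
// The class name and the sample numbers are illustrative assumptions.
final class UptimeEstimate {

    /** Overall uptime when all numDeps dependencies must be up, each with the same uptime. */
    static double overallUptime(double depUptime, int numDeps) {
        return Math.pow(depUptime, numDeps);
    }

    public static void main(String[] args) {
        double overall = overallUptime(0.9999, 30);
        System.out.printf("overall uptime: %.4f%%%n", overall * 100);

        // Translate the lost uptime into minutes over a 30-day month.
        double minutesPerMonth = 30 * 24 * 60;
        System.out.printf("expected downtime: %.0f minutes per month%n",
                (1 - overall) * minutesPerMonth);
    }
}
```

Each extra dependency multiplies in another factor below 1, which is why growth in dependencies needs explicit availability protection.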

Page 22: The new Netflix API

Hystrix

● Fault-tolerance pattern as a library

● Provides operational insights in real-time

● Automatic load-shedding under pressure
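To make the fault-tolerance pattern concrete, here is a minimal circuit-breaker sketch in plain Java. It is not Hystrix's API and leaves out most of what Hystrix provides (metrics, thread isolation, a half-open state); `SimpleCircuitBreaker`, its threshold and its cool-down are hypothetical names and parameters chosen for illustration.

```java
import java.util.function.Supplier;

// A minimal circuit-breaker sketch; NOT Hystrix's API. All names are hypothetical.
final class SimpleCircuitBreaker<T> {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Runs the primary call unless the circuit is open; falls back on failure. */
    T call(Supplier<T> primary, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // shed load: do not touch the failing dependency
        }
        try {
            T result = primary.get();
            recordSuccess();
            return result;
        } catch (RuntimeException ex) {
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized boolean isOpen() {
        if (openedAt < 0) {
            return false;
        }
        if (System.currentTimeMillis() - openedAt >= openMillis) {
            // Cool-down elapsed: close the circuit and start counting failures afresh.
            openedAt = -1;
            consecutiveFailures = 0;
            return false;
        }
        return true;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis();
        }
    }
}
```

Once the threshold of consecutive failures is reached, callers get the fallback immediately instead of queueing behind a dependency that is already struggling.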

Page 23: The new Netflix API

Availability protection

[Diagram: Gateway and API; inside the API, scripts and the Search, Ratings, Cust and other client libs (B … Z); across a network boundary, the Search, Ratings, Customers and other services.]

Page 24: The new Netflix API


Page 25: The new Netflix API


Page 26: The new Netflix API

If you don’t plan for failure

[Diagram: Gateway and API; inside the API, scripts and the Search, Ratings, Cust and other client libs (B … Z); across a network boundary, the Search, Ratings, Customers and other services.]

Page 27: The new Netflix API

If you do plan for failure

[Diagram: Gateway and API; inside the API, scripts and the Search, Ratings, Cust and other client libs (B … Z); across a network boundary, the Search, Ratings, Customers and other services.]

No search results >> no Netflix

Page 28: The new Netflix API

Fallbacks

[Diagram: Gateway and API; inside the API, scripts and the Search, Ratings, Cust and other client libs (B … Z); across a network boundary, the Search, Ratings, Customers and other services.]

Return static or stale rating

Page 29: The new Netflix API

How to handle errors

return getRatings(id);

Page 30: The new Netflix API

How to handle errors

try {
    return getRatings(id);
} catch (Exception ex) {
    // static value
    return null;
}

Page 31: The new Netflix API

How to handle errors

try {
    return getRatings(id);
} catch (Exception ex) {
    // TODO What to return here?
}

Page 32: The new Netflix API

Handle errors with fallbacks

● Some options for fallbacks

○ Static value

○ Value from in-memory

○ Value from cache

○ Value from network

○ Throw

○ Code

● Make error-handling explicit

● Applications have to work in the presence of either fallbacks or rethrown exceptions
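One way to make the fallback explicit, following the "return static or stale rating" idea from the earlier slide, is sketched below. The wrapper class, the `DEFAULT_RATING` value and the in-memory map of last known values are assumptions made for illustration; only the `getRatings(id)` name comes from the slides.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of an explicit fallback chain: fresh value, else stale value, else static value.
// RatingsWithFallback and DEFAULT_RATING are hypothetical names.
final class RatingsWithFallback {
    static final double DEFAULT_RATING = 3.0; // static fallback value (assumed)

    private final Function<String, Double> ratingsService; // stand-in for the remote call
    private final Map<String, Double> lastKnown = new ConcurrentHashMap<>();

    RatingsWithFallback(Function<String, Double> ratingsService) {
        this.ratingsService = ratingsService;
    }

    double getRatings(String id) {
        try {
            double fresh = ratingsService.apply(id);
            lastKnown.put(id, fresh); // remember the value for later fallbacks
            return fresh;
        } catch (RuntimeException ex) {
            // Explicit error handling: a stale value if we have one, else the static default.
            return lastKnown.getOrDefault(id, DEFAULT_RATING);
        }
    }
}
```

Callers never see `null` here; they get a usable rating, and the choice between stale and static is visible in one place.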

Page 33: The new Netflix API
Page 34: The new Netflix API

Availability protection beyond Hystrix

● Throttling

● Retries

● Timeouts

● Canaries

● Regional rollouts

● Traffic shifting

● Outlier detection (and elimination)

● Advanced load balancing
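Two items from this list, timeouts and retries, can be sketched with the standard `CompletableFuture` API. The helper name `callWithRetry`, the attempt count and the timeout are illustrative assumptions; a production version would also want backoff between attempts and a bounded executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Sketch: bounded retries with a per-attempt timeout. Names are hypothetical.
final class TimeoutAndRetry {

    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long timeoutMillis, T fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Run the call on the common pool so the caller can stop waiting on it.
            CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException ex) {
                // Give up on this attempt; cancel() marks the future as cancelled
                // but does not interrupt a task that is already running.
                future.cancel(true);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return fallback; // all attempts failed or timed out
    }
}
```

Retries multiply load on a dependency that may already be unhealthy, so the attempt count is kept small and the final answer is a fallback rather than an error.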

Page 35: The new Netflix API

3. Abstraction

Page 36: The new Netflix API

Abstraction goals

● Shield all device teams from every single mid-tier change … at least for a time. Allows us to move more independently
● Shield all device teams from every single platform/infrastructure change
● Provide APIs not provided by downstream services
  ○ Find all movies that...
  ○ Length of movie
● Implementation flexibility, e.g.,
  ○ Caching
  ○ Batch APIs
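As a sketch of what such an abstraction might look like, the facade below offers a "find all movies that..." call on top of a downstream service that is assumed to expose only single-ID lookups, and caches results behind the same interface. `MovieFacade`, `Movie` and the lookup function are hypothetical; the slides do not describe the real implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch: a higher-level API plus caching, hidden behind one facade. Names are hypothetical.
final class MovieFacade {

    static final class Movie {
        final String id;
        final int lengthMinutes;

        Movie(String id, int lengthMinutes) {
            this.id = id;
            this.lengthMinutes = lengthMinutes;
        }
    }

    private final Function<String, Movie> downstreamLookup; // assumed single-ID downstream call
    private final Map<String, Movie> cache = new ConcurrentHashMap<>();

    MovieFacade(Function<String, Movie> downstreamLookup) {
        this.downstreamLookup = downstreamLookup;
    }

    /** Cached lookup: the downstream service is called at most once per known ID. */
    Movie byId(String id) {
        return cache.computeIfAbsent(id, downstreamLookup);
    }

    /** "Find all movies that...": an API the downstream service does not offer itself. */
    List<Movie> findAllMovies(List<String> candidateIds, Predicate<Movie> matches) {
        List<Movie> result = new ArrayList<>();
        for (String id : candidateIds) {
            Movie movie = byId(id);
            if (movie != null && matches.test(movie)) {
                result.add(movie);
            }
        }
        return result;
    }
}
```

Device teams would only see `findAllMovies`; whether the data comes from a cache, a batch call or a new mid-tier service can change without touching them.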

Page 37: The new Netflix API

Abstraction challenges

● Tech debt
● Device teams can have black-box view (“api == cloud”)
● But isn’t the API team the bottleneck?
  ○ Yes, sometimes. But organizational structure makes this less of a problem than m mid-tier teams dealing with n device teams
● But: separation of concerns

Page 38: The new Netflix API

Server-side logic

Page 39: The new Netflix API

Reminder: Today’s architecture

[Diagram: Gateway, then the API containing scripts and client libs A … Z, then Netflix micro-services, with network boundaries between the tiers. Label on the diagram: ~2100 active.]

Page 40: The new Netflix API

Device teams write server-side logic

● Decoupling teams → better velocity
● UI teams are empowered to
  ○ Change presentation
  ○ Filter
  ○ Add users to A/B tests, which then leads to, e.g., a different layout

Page 41: The new Netflix API

What if we didn’t have an API?

Page 42: The new Netflix API

What if? Implications for device teams

[Diagram: Gateway, scripts and client libs A … Z, and Netflix micro-services, with network boundaries between the tiers.]

Device teams own client-side applications …

Page 43: The new Netflix API

What if? Implications for device teams

[Diagram: Gateway, scripts and client libs A … Z, and Netflix micro-services, with network boundaries between the tiers.]

...and groovy scripts

Page 44: The new Netflix API

What if? Implications for device teams

● Each device team would have to own
  ○ Orchestration
  ○ Frequent dependency updates (currently done (attempted) daily)
  ○ Implement higher level APIs (all movies that…)
  ○ Fallbacks and other resiliency protection (e.g., timeouts, retries)
● Recent example
  ○ Library upgrade caused a lot of NPEs -- why?
  ○ Worked with team to find out why
  ○ When fixed, no more NPEs, but instead performance degradation
● Should all teams be dealing with this?

Page 45: The new Netflix API

What if? Implications for service teams

[Diagram: Gateway, scripts and client libs A … Z, and Netflix micro-services, with network boundaries between the tiers.]

Service teams own services...

Page 46: The new Netflix API

What if? Implications for service teams

[Diagram: Gateway, scripts and client libs A … Z, and Netflix micro-services, with network boundaries between the tiers.]

...and client libraries

Page 47: The new Netflix API

What if? Implications for service teams

● Can only make breaking changes if all device teams who use their service upgrade
● Don’t get resiliency protection (e.g., timeouts, load balancing, retries, fallbacks) unless all device teams who use their service provide it
● Should all teams be dealing with this?

Page 48: The new Netflix API

What if? Implications for Netflix

● Lower velocity due to tight coupling between many mid-tier teams and many device teams

Page 49: The new Netflix API

OR: THE DOWNSIDE OF CENTRALIZATION

Page 50: The new Netflix API

Where are we today?

● Principle: don’t repeat logic
  ○ It’s better to do it once in API than do it n times for n devices.
● Principle is good, but leads to complexity

Page 51: The new Netflix API

What complexity challenges do we have?

Page 52: The new Netflix API

Complexity challenges

● Frequent (not always canaried) updates to a critical service in production
● Difficulty of debugging (esp. for groovy script writers)
● Slow server startup times
● Lack of operational insights into script resource consumption
● Difficulty of performance profiling
● Lack of feedback loop
● Decoupled code versioning and transitive dependencies

Page 53: The new Netflix API

Where are we going next?

Page 54: The new Netflix API

Top priorities

● Move groovy scripts out
● Split up API

Page 55: The new Netflix API

New architecture: Edge PaaS

[Diagram: Gateway and EAS in front of Node apps (NodeQuark) running on Titus; client libs A … Z; Netflix micro-services; network boundaries between the tiers.]

Page 56: The new Netflix API

New architecture: Edge PaaS

[Diagram: Gateway and EAS in front of Node apps (NodeQuark) running on Titus; client libs A … Z; Netflix micro-services; network boundaries between the tiers.]

Edge Auth Service
● Auth termination
● Centralized place for auth

Edge PaaS:
● Platform for node scripts
● Developer tooling for entire SDLC
● Remote API with optimized data access (Falcor)

Page 57: The new Netflix API

Two APIs

Page 58: The new Netflix API

Two (or more) APIs

Split API by function

[Diagram: Gateway and EAS in front of Node apps (NodeQuark) running on Titus. Behind them the API is split into DNA, PB and Shared groups, each with its own clients (A … Z) and services (A … Z), with network boundaries between the tiers.]

Page 59: The new Netflix API

NodeQuark Platform

Page 60: The new Netflix API

NodeQuark Platform

Platform for node scripts

[Diagram: Zuul and EAS in front of Node apps (NodeQuark) running on Titus; Java client libs A … Z; Netflix micro-services; network boundaries between the tiers.]

Page 61: The new Netflix API

Edge PaaS: Node Platform

● Node apps run in containers on Titus platform
● Node Platform provides
  ○ Integration into Netflix ecosystem (e.g., discovery)
  ○ Logging
  ○ Dashboards, metrics out of the box with option to customize
  ○ Support for mocking and testing
● Titus provides
  ○ Scheduling
  ○ Autoscaling

Page 62: The new Netflix API

Developer experience

Page 63: The new Netflix API

New architecture: Edge PaaS

Developer tooling for entire SDLC

[Diagram: Gateway and EAS in front of Node apps (NodeQuark) running on Titus; Java client libs A … Z; Netflix micro-services; network boundaries between the tiers.]

Page 64: The new Netflix API

Edge PaaS: Developer tooling

● Command line tool for node apps
  ○ Setup
  ○ Starting apps
  ○ Deploying apps
● Local development and debugging of node apps
● UI for lifecycle management, e.g., version management
● One-click rollouts, one-click rollbacks
● Versioning

Page 65: The new Netflix API

Remote API

Page 66: The new Netflix API

New architecture: Edge PaaS

Remote API with optimized data access

[Diagram: Zuul and EAS in front of Node apps (NodeQuark) running on Titus; client libs A … Z; Netflix micro-services; network boundaries between the tiers.]

Page 67: The new Netflix API

Edge PaaS: Remote API

● API still takes care of
  ○ Orchestration
  ○ Resiliency protection
  ○ Abstraction
● Optimized access with Falcor
  ○ “RESTful composition” with caching
● Binary transport
● Future: channel support

Page 68: The new Netflix API

Greater simplicity

Page 69: The new Netflix API

Isolated failures: Scripts don’t affect each other (usually)

API

Temporarily unavailable!

Page 70: The new Netflix API

Independent root causing

API

Latency spike after push: 150ms

Average latency: 10ms

Page 71: The new Netflix API

Independent autoscaling

API

Page 72: The new Netflix API

Independent insights

API

Average latency: 50ms

Average latency: 10ms

Page 73: The new Netflix API

Better regression/performance testing

API

Tests not affected by other scripts eating up resources on the same JVM

Page 74: The new Netflix API

Conclusion

Page 75: The new Netflix API

Complexity and simplicity

● Product has become much more complex
  ○ Scripts (more scripts, more complex scripts)
  ○ Features
  ○ Number of downstream services to integrate
  ○ More personalization
  ○ etc.
● Complexity of API service is high → Need to optimize for simplicity now
  ○ Process isolation
  ○ Cleaner developer experience

Page 76: The new Netflix API

END