Maintaining the Front Door to Netflix: The Netflix API

Upload: daniel-jacobson

Post on 08-Sep-2014

Category: Technology


DESCRIPTION

This presentation was given to the engineering organization at Zendesk. In it, I talk about the challenges that the Netflix API faces in supporting 1000+ different device types, millions of users, and billions of transactions. The topics include resiliency, scale, API design, failure injection, continuous delivery, and more.

TRANSCRIPT

  • Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson

  • There are copious notes attached to each slide in this presentation. Please read those notes to get the full context of the presentation.
  • Global Streaming Video for TV Shows and Movies
  • More than 44 Million Subscribers; More than 40 Countries
  • Netflix Accounts for ~33% of Peak Internet Traffic in North America; Netflix subscribers are watching more than 1 billion hours a month
  • Team Focus: Build the Best Global Streaming Product
  • Three aspects of the Streaming Product: Non-Member; Discovery; Streaming
  • Key Responsibilities: Broker data between services and UIs; Maintain a resilient front-door; Scale the system vertically and horizontally; Maintain high velocity
  • But Before Streaming
  • Monolithic Application In Netflix Data Centers
  • The bigger the ship the slower it turns
  • Distributed Architecture
  • 1000+ Device Types
  • [Diagram: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]
  • Dozens of Dependencies [Diagram: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, API, Reviews, A/B Test Engine]
  • Dependency Relationships
  • 2,000,000,000 Requests Per Day to the Netflix API
  • 30 Distinct Dependent Services for the Netflix API
  • ~500 Dependency jars Slurped into the Netflix API
  • 14,000,000,000 Netflix API Calls Per Day to those Dependent Services
  • 0 Dependent Services with 100% SLA
  • 99.99%^30 = 99.7%; 0.3% of 2B = 6M failures per day; 2+ Hours of Downtime Per Month
  • 99.9%^30 = 97%; 3% of 2B = 60M failures per day; 20+ Hours of Downtime Per Month
  • [Diagram, shown on several consecutive slides: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, API, Reviews, A/B Test Engine]
  • Circuit Breaker Dashboard: Call Volume and Health / Last 10 Seconds; Call Volume / Last 2 Minutes; Successful Requests; Successful, But Slower Than Expected; Short-Circuited Requests, Delivering Fallbacks; Timeouts, Delivering Fallbacks; Thread Pool & Task Queue Full, Delivering Fallbacks; Exceptions, Delivering Fallbacks; Error Rate; Status of Fallback Circuit; Requests per Second, Over Last 10 Seconds; SLA Information
  • Error Rate: (# + # + # + #) / (# + # + # + # + #) = Error Rate
  • [Diagram, shown on several consecutive slides: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, API, Reviews, A/B Test Engine; the last two slides add the label "Fallback"]
  • Scaling the Distributed System
  • AWS Cloud
  • Autoscaling
  • Amazon Auto Scaling Limitations: Hard to fit policies to variable traffic patterns (weekday vs weekend); Limited control over capacity adjustments (absolute value or %)
  • The Impact of AAS Limitations: Traffic drop can lead to scale downs during outage; Performance degradation between new instance launch and taking traffic; Excess capacity at peak and trough
  • Scryer : Predictive Auto Scaling (Not yet)
  • Typical Traffic Patterns Over Five Days
  • Predicted RPS Compared to Actual RPS
  • Scaling Plan for Predicted Workload
  • What is Scryer Doing?
    Evaluating needs based on historical data; Week over week, month over month metrics; Adjusts instance minimums based on algorithms; Relies on Amazon Auto Scaling for unpredicted events
  • Results
  • Results : Load Average (Reactive, Predictive)
  • Results : Response Latencies (Reactive, Predictive)
  • Results : Outage Recovery
  • Results : AWS Costs
  • Scaling Globally
  • More than 44 Million Subscribers; More than 40 Countries
  • Zuul: Gatekeeper for the Netflix Streaming Application
  • Zuul *: Multi-Region Resiliency; Insights; Stress Testing; Canary Testing; Dynamic Routing; Load Shedding; Security; Static Response Handling; Authentication (* Most closely resembles an API proxy)
  • Isthmus
  • All of these approaches are designed to prevent failures
  • But sometimes the best way to prevent failures is to force them!
  • Chaos Monkey: I randomly terminate instances in production to identify dormant failures.
  • Chaos Gorilla: I simulate an outage of an entire Amazon availability zone.
  • Chaos Kong: I simulate an outage in an AWS region.
  • Conformity Monkey: I find instances that don't adhere to best practices.
  • Security Monkey: I extend Conformity Monkey to find security violations.
  • Doctor Monkey: I detect unhealthy instances and remove them from service.
  • Janitor Monkey: I clean up the clutter and waste that runs in the cloud.
  • Latency Monkey: I induce artificial delays and errors into services to determine how upstream services will respond.
  • Deployments in the Cloud
  • Dependency Relationships
  • Testing Philosophy: Act Fast, React Fast
  • That Doesn't Mean We Don't Test
  • Automated Delivery Pipeline
  • Cloud-Based Deployment Techniques
  • [Diagram: Current Code In Production; API Requests from the Internet]
  • [Diagram: Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic); Current Code In Production; API Requests from the Internet]
  • [Diagram: Canary Analysis Automation; Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic); Current Code In Production; API Requests from the Internet]
  • Error!
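The canary slides above describe routing around 1% or less of production traffic to a single instance running new code and automating the analysis of the result. The slides do not say how that analysis works, so the following is only a minimal sketch under stated assumptions: it compares error rates between the canary and the current production code, and the names `InstanceStats`, `canary_verdict`, `tolerance`, and `min_requests` are hypothetical, not part of any Netflix tool.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    """Request counters for one group of instances (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when no traffic has arrived yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: InstanceStats, canary: InstanceStats,
                   tolerance: float = 0.005, min_requests: int = 1000) -> str:
    """Return "wait", "rollback", or "promote" for a canary instance.

    Assumption: the only signal is the error rate, and the canary may be
    worse than the baseline by at most `tolerance` (an absolute difference).
    """
    if canary.requests < min_requests:
        return "wait"  # too little traffic to judge
    if canary.error_rate > baseline.error_rate + tolerance:
        return "rollback"  # the "Error!" outcome in the slides
    return "promote"  # the "Perfect!" outcome in the slides


baseline = InstanceStats(requests=990_000, errors=990)
bad_canary = InstanceStats(requests=10_000, errors=200)
good_canary = InstanceStats(requests=10_000, errors=12)
print(canary_verdict(baseline, bad_canary))
print(canary_verdict(baseline, good_canary))
```

A real analysis would likely weigh many more signals (latency, CPU, per-dependency failures); a single error-rate check is used here only to keep the decision logic visible.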
  • [Diagram, shown on several consecutive slides: Current Code In Production; API Requests from the Internet]
  • Perfect!
  • Stress Test with Zuul
  • [Diagram, shown on two slides: Current Code In Production; API Requests from the Internet; New Code Getting Prepared for Production]
  • Error!
  • [Diagram, shown on two slides: Current Code In Production; API Requests from the Internet; New Code Getting Prepared for Production]
  • [Diagram: Current Code In Production; API Requests from the Internet]
  • Perfect!
  • Stress Test with Zuul
  • [Diagram, shown on two slides: Current Code In Production; API Requests from the Internet; New Code Getting Prepared for Production]
  • [Diagram: API Requests from the Internet; New Code Getting Prepared for Production]
  • Brokering Data to 1,000+ Device Types
  • Screen Real Estate
  • Controller
  • Technical Capabilities
  • One-Size-Fits-All API [Diagram: Request, Request, Request] (Image courtesy of South Florida Classical Review)
  • Resource-Based API vs. Experience-Based API
  • Resource-Based Requests: /users//ratings/title; /users//queues; /users//queues/instant; /users//recommendations; /catalog/titles/movie; /catalog/titles/series; /catalog/people
  • REST API [Diagram: RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border]
  • OSFA API [Diagram: same services; Network Border; SERVER CODE; CLIENT CODE]
  • OSFA API [Diagram: same services; Network Border; DATA GATHERING, FORMATTING, AND DELIVERY; USER INTERFACE RENDERING]
  • Experience-Based Requests: /ps3/homescreen
  • JAVA API [Diagram: Network Border; RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Groovy Layer]
  • JAVA API [Diagram: same services; SERVER CODE; CLIENT CODE; CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER); Network Border]
  • JAVA API [Diagram: same services; DATA GATHERING; DATA FORMATTING AND DELIVERY; USER INTERFACE RENDERING; Network Border]
  • https://www.github.com/Netflix
  • Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson
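The resource-based and experience-based request slides contrast many granular calls (ratings, queues, recommendations) with a single device-specific call such as /ps3/homescreen, served by client adapter code that runs on the server. The slides name a Groovy layer for those adapters; the sketch below uses Python only to illustrate the shape of the idea. Every function, data value, and the `ADAPTERS` table is a hypothetical stand-in, not the Netflix API.

```python
from typing import Callable, Dict, List

# In-memory stand-ins for backend services. All data here is invented.
_RECOMMENDATIONS: Dict[str, List[str]] = {"user-1": ["t1", "t2"]}
_MOVIE_DATA: Dict[str, str] = {"t1": "Movie One", "t2": "Movie Two"}
_RATINGS: Dict[str, Dict[str, int]] = {"user-1": {"t1": 5, "t2": 3}}


def get_recommendations(user_id: str) -> List[str]:
    return _RECOMMENDATIONS.get(user_id, [])


def get_movie_name(title_id: str) -> str:
    return _MOVIE_DATA[title_id]


def get_rating(user_id: str, title_id: str) -> int:
    return _RATINGS[user_id][title_id]


def ps3_homescreen(user_id: str) -> dict:
    """Hypothetical device adapter: gathers and formats data for one screen.

    In a resource-based design the device would make each of these calls
    itself, across the network border, and stitch the results together.
    """
    rows = [
        {"name": get_movie_name(title_id), "rating": get_rating(user_id, title_id)}
        for title_id in get_recommendations(user_id)
    ]
    return {"rows": rows}


# One endpoint per device experience, mapped to the adapter that serves it.
ADAPTERS: Dict[str, Callable[[str], dict]] = {"/ps3/homescreen": ps3_homescreen}


def handle(path: str, user_id: str) -> dict:
    """Dispatch a single experience-based request to its adapter."""
    return ADAPTERS[path](user_id)


print(handle("/ps3/homescreen", "user-1"))
```

The design point this illustrates is that the fan-out to backend services happens on the server side of the network border, so the device makes one request and receives a payload already shaped for its screen.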