Netflix in the Cloud at SVForum
DESCRIPTION
Talk given at SVForum, Sunnyvale CA, March 27th 2012
TRANSCRIPT
Cloud Architecture at Netflix: How Netflix Built a Scalable Java-oriented PaaS Running on AWS
SVForum, March 27th, 2012
Adrian Cockcroft @adrianco #netflixcloud
http://www.linkedin.com/in/adriancockcroft
Adrian Cockcroft
• Director, Architecture for Cloud Systems, Netflix Inc.
  – Previously Director for Personalization Platform
• Distinguished Availability Engineer, eBay Inc. 2004-7
  – Founding member of eBay Research Labs
• Distinguished Engineer, Sun Microsystems Inc. 1988-2004
  – 2003-4 Chief Architect High Performance Technical Computing
  – 2001 Author: Capacity Planning for Web Services
  – 1999 Author: Resource Management
  – 1995 & 1998 Author: Sun Performance and Tuning
  – 1996 Japanese edition of Sun Performance and Tuning
    • SPARC & Solaris Performance Tuning (SunSoft Press series)
• More
  – Twitter @adrianco
  – Blog http://perfcap.blogspot.com
  – Presentations at http://www.slideshare.net/adrianco
What kind of Cloud?
• Software as a Service – SaaS
  – Replaces in-house applications
  – Targets end users
• Platform as a Service – PaaS
  – Replaces in-house operations functions
  – Targets developers
• Infrastructure as a Service – IaaS
  – Replaces in-house datacenter capacity
  – Targets developers and ITops
What Netflix Did
• Moved to SaaS
  – Corporate IT – OneLogin, Workday, Box, Evernote…
  – Tools – Pagerduty, AppDynamics, Elastic MapReduce
• Built our own PaaS <- today's focus
  – Customized to make our developers productive
  – When we started, we had little choice
• Moved incremental capacity to IaaS
  – No new datacenter space since 2008 as we grew
  – Moved our streaming apps to the cloud
Netflix could not build new datacenters fast enough
Capacity growth is accelerating, unpredictable
Product launch spikes – iPhone, Wii, PS3, Xbox
International – Canada, Latin America, UK/Ireland
[Chart: streaming capacity growth curve, with datacenter capacity shown as a small fixed slice]
Netflix.com is now ~100% Cloud
A few small back-end data sources still in progress
All international product is cloud based
USA-specific logistics remains in the datacenter
Working on SOX, PCI as scope starts to include AWS
Netflix Choice was AWS, with our own platform and tools
Unique platform requirements and extreme scale, agility and flexibility
Leverage AWS scale, "the biggest public cloud", and AWS investment in features and automation
Use AWS zones and regions for high availability, scalability and global deployment
But isn't Amazon a competitor?
Many products that compete with Amazon run on AWS
We are a "poster child" for the AWS architecture
Netflix is one of the biggest AWS customers
Co-opetition – competitors are also partners
Could Netflix use another cloud?
Would be nice; we use three interchangeable CDN vendors, but no one else has the scale and features of AWS
"You have to be this tall to ride this ride…" Maybe in 2-3 years?
We want to use clouds, we don't have time to build them
Public cloud for agility and scale. We use electricity too, but don't want to build our own power station…
AWS because they are big enough to allocate thousands of instances per hour when we need to
What about other PaaS?
• CloudFoundry – open source by VMWare
  – Developer-friendly, easy to get started
  – Missing scale and some enterprise features
• Rightscale
  – Widely used to abstract away from AWS
  – Creates its own lock-in problem…
• AWS is growing into this space
  – We didn't want a vendor between us and AWS
  – We wanted to build a thin PaaS that gets thinner
Netflix Deployed on AWS
(Service groups, with the year each moved to AWS)
• Content (2009) – Video Masters, EC2, S3, CDNs
• Logs (2009) – S3, EMR Hadoop, Hive, Business Intelligence
• Play (2010) – DRM, CDN routing, Bookmarks, Logging
• WWW (2010) – Sign-Up, Search, Movie Choosing, Ratings
• API (2010) – Metadata, Device Config, TV Movie Choosing, Social Facebook
• CS (2011) – International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
Goals
• Faster
  – Lower latency than the equivalent datacenter web pages and API calls
  – Measured as mean and 99th percentile
  – For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
  – Avoid needing any more datacenter capacity as subscriber count increases
  – No central vertically scaled databases
  – Leverage AWS elastic capacity effectively
• Available
  – Substantially higher robustness and availability than datacenter services
  – Leverage multiple AWS availability zones
  – No scheduled downtime, no central database schema to change
• Productive
  – Optimize agility of a large development team with automation and tools
  – Leave behind complex tangled datacenter code base (~8 year old architecture)
  – Enforce clean layered interfaces and re-usable components
Datacenter Anti-Patterns
What do we currently do in the datacenter that prevents us from meeting our goals?
Netflix Datacenter vs. Cloud Architecture
Central SQL Database            ->  Distributed Key/Value NoSQL
Sticky In-Memory Session        ->  Shared Memcached Session
Chatty Protocols                ->  Latency Tolerant Protocols
Tangled Service Interfaces      ->  Layered Service Interfaces
Instrumented Code               ->  Instrumented Service Patterns
Fat Complex Objects             ->  Lightweight Serializable Objects
Components as Jar Files         ->  Components as Services
Software Architecture Patterns
• Object Models
  – Basic and derived types, facets, serializable
  – Pass by reference within a service
  – Pass by value between services
• Computation and I/O Models
  – Service execution using Best Effort / Futures
  – Common thread pool management
  – Circuit breakers to manage and contain failures (see the sketch below)
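As a rough illustration of the computation model above, here is a minimal sketch of a dependent service call wrapped in a shared thread pool with a short timeout and a trip-after-N-failures circuit breaker; the class, thresholds, and remoteLookup call are hypothetical, not Netflix's actual platform code.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch: best effort / futures execution with a simple
    // circuit breaker to contain failures. Names and thresholds are made up.
    public class GuardedServiceCall {
        private static final ExecutorService POOL = Executors.newFixedThreadPool(20);
        private static final int TRIP_THRESHOLD = 5;
        private final AtomicInteger consecutiveFailures = new AtomicInteger();

        public String fetch(final String key, String fallback) {
            if (consecutiveFailures.get() >= TRIP_THRESHOLD) {
                return fallback;              // circuit open: fail fast
            }
            Future<String> f = POOL.submit(new Callable<String>() {
                public String call() throws Exception {
                    return remoteLookup(key); // the dependent service call
                }
            });
            try {
                String result = f.get(50, TimeUnit.MILLISECONDS); // keep timeouts short
                consecutiveFailures.set(0);   // success closes the circuit
                return result;
            } catch (Exception e) {
                f.cancel(true);
                consecutiveFailures.incrementAndGet();
                return fallback;              // degrade instead of cascading
            }
        }

        private String remoteLookup(String key) { return key; } // stand-in for a real client
    }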
Model Driven Architecture
• Traditional Datacenter Practices
  – Lots of unique hand-tweaked systems
  – Hard to enforce patterns
  – Some use of Puppet to automate changes
• Model Driven Cloud Architecture
  – Perforce/Ivy/Jenkins based builds for everything
  – Every production instance is a pre-baked AMI
  – Every application is managed by an Autoscaler (sketch below)
Every change is a new AMI
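The "every application is managed by an Autoscaler" step could look roughly like this with the AWS Java SDK; the group name, AMI id, and sizes are invented for illustration, and in practice Netflix drives this through its own console and bakery rather than raw SDK calls.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
    import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
    import com.amazonaws.services.autoscaling.model.CreateLaunchConfigurationRequest;

    // Sketch: register a freshly baked AMI and put an autoscaler in front of it.
    public class DeployNewAmi {
        public static void main(String[] args) {
            AmazonAutoScalingClient asg =
                new AmazonAutoScalingClient(new BasicAWSCredentials("accessKey", "secretKey"));

            asg.createLaunchConfiguration(new CreateLaunchConfigurationRequest()
                .withLaunchConfigurationName("api-v123")   // hypothetical name
                .withImageId("ami-12345678")               // the pre-baked AMI
                .withInstanceType("m1.xlarge"));

            asg.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                .withAutoScalingGroupName("api-v123")
                .withLaunchConfigurationName("api-v123")
                .withAvailabilityZones("us-east-1a", "us-east-1c", "us-east-1d")
                .withMinSize(3)
                .withMaxSize(30));
        }
    }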
Netflix PaaS Principles
• Maximum Functionality
  – Developer productivity and agility
• Leverage as much of AWS as possible
  – AWS is making huge investments in features/scale
• Interfaces that isolate Apps from AWS
  – Avoid lock-in to specific AWS API details
• Portability is a long term goal
  – Gets easier as other vendors catch up with AWS
Netflix Global PaaS
• Architecture Features and Overview
• Portals and Explorers
• Platform Services
• Platform APIs
• Platform Frameworks
• Persistence
• Scalability Benchmark

Global PaaS? Toys are nice, but this is the real thing…
• Supports all AWS Availability Zones and Regions
• Supports multiple AWS accounts {test, prod, etc.}
• Cross Region/Acct Data Replication and Archiving
• Internationalized, Localized and GeoIP routing
• Security is fine grain, dynamic AWS keys
• Autoscaling to thousands of instances
• Monitoring for millions of metrics
• Productive for 100s of developers on one product
• 23M+ users USA, Canada, Latin America, UK, Eire
Basic PaaS Entities
• AWS Based Entities
  – Instances and Machine Images, Elastic IP Addresses
  – Security Groups, Load Balancers, Autoscale Groups
  – Availability Zones and Geographic Regions
• Netflix PaaS Entities
  – Applications (registered services)
  – Clusters (versioned Autoscale Groups for an App)
  – Properties (dynamic hierarchical configuration)
Core PaaS Services
• AWS Based Services
  – S3 storage, to 5TB files, parallel multipart writes
  – SQS – Simple Queue Service, messaging layer (sketch below)
• Netflix Based Services
  – EVCache – memcached based ephemeral cache
  – Cassandra – distributed data store
• External Services
  – GeoIP lookup interfaced to a vendor
  – Keystore HSM in Netflix Datacenter
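For the SQS messaging layer, basic use via the AWS Java SDK looks like the sketch below; the queue name and message body are made up.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.*;

    // Sketch: create a queue, send a message, poll for messages.
    public class SqsExample {
        public static void main(String[] args) {
            AmazonSQSClient sqs =
                new AmazonSQSClient(new BasicAWSCredentials("accessKey", "secretKey"));
            String queueUrl = sqs.createQueue(new CreateQueueRequest("example-events")).getQueueUrl();

            sqs.sendMessage(new SendMessageRequest(queueUrl, "user-123 signed up"));

            for (Message m : sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)).getMessages()) {
                System.out.println(m.getBody()); // consumers delete messages after processing
            }
        }
    }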
Instance Architecture
[Layered stack diagram, top to bottom:]
• Linux Base AMI (CentOS or Ubuntu)
• Optional Apache frontend, memcached, non-java apps
• Monitoring – log rotation to S3, AppDynamics machine agent, Epic
• Java (JDK 6 or 7) – AppDynamics app agent, GC and thread dump logging
• Tomcat – application servlet, base server, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
Security Architecture
• Instance Level Security baked into base AMI
  – Login: ssh only allowed via portal (not between instances)
  – Each app type runs as its own userid app{test|prod}
• AWS Security, Identity and Access Management
  – Each app has its own security group (firewall ports)
  – Fine grain user roles and resource ACLs
• Key Management
  – AWS keys dynamically provisioned, easy updates
  – High grade app specific key management support
Portals and Explorers
• Netflix Application Console (NAC)
  – Primary AWS provisioning/config interface
• AWS Usage Analyzer
  – Breaks down costs by application and resource
• Cassandra Explorer
  – Browse clusters, keyspaces, column families
• Base Server Explorer
  – Browse service endpoints configuration, perf
Platform Services
• Discovery – service registry for "Applications" (usage sketch below)
• Introspection – Entrypoints
• Cryptex – dynamic security key management
• Geo – geographic IP lookup
• Platformservice – dynamic property configuration
• Localization – manage and look up local translations
• Evcache – ephemeral volatile cache
• Cassandra – cross zone/region distributed data store
• Zookeeper – distributed coordination (Curator)
• Various proxies – access to old datacenter stuff
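To make the Discovery entry concrete, a client might consume it along these lines; this interface is hypothetical, sketched only from the description above (the real Netflix client was not public at the time of this talk).

    import java.util.List;
    import java.util.Random;

    // Hypothetical discovery client: apps register by name, callers look up
    // healthy instances instead of hard-coding hosts.
    interface DiscoveryClient {
        void register(String appName, String hostAndPort);
        List<String> getInstances(String appName); // healthy instances only
    }

    class DiscoveryExample {
        static String pickInstance(DiscoveryClient discovery, String appName) {
            List<String> instances = discovery.getInstances(appName);
            return instances.get(new Random().nextInt(instances.size())); // naive load balancing
        }
    }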
Metrics Framework
• System and Application
  – Collection, aggregation, querying and reporting
  – Non-blocking logging, avoids log4j lock contention
  – Honu – streaming -> S3 -> EMR -> Hive
• Performance, Robustness, Monitoring, Analysis
  – Tracers, Counters – explicit code instrumentation log
  – Real Time Tracers/Counters
  – SLA – service level response time percentiles
  – Servo annotated JMX extract to Cloudwatch (sketch below)
• Latency Monkey Infrastructure
  – Inject random delays into service responses
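The Servo bullet above can be illustrated with its annotation style: fields tagged @Monitor are registered and exposed over JMX, from where a poller can forward them to CloudWatch. A small sketch, with illustrative class and metric names:

    import java.util.concurrent.atomic.AtomicLong;
    import com.netflix.servo.annotations.DataSourceType;
    import com.netflix.servo.annotations.Monitor;
    import com.netflix.servo.monitor.Monitors;

    // Sketch: annotated counters/gauges exported via JMX by Servo.
    public class RequestStats {
        @Monitor(name = "requestCount", type = DataSourceType.COUNTER)
        private final AtomicLong requestCount = new AtomicLong();

        @Monitor(name = "activeConnections", type = DataSourceType.GAUGE)
        private final AtomicLong activeConnections = new AtomicLong();

        public RequestStats() {
            Monitors.registerObject("requestStats", this); // expose over JMX
        }

        public void onRequest() { requestCount.incrementAndGet(); }
    }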
Netflix Platform Persistence
• Ephemeral Volatile Cache – evcache
  – Discovery-aware memcached based backend
  – Client abstractions for zone aware replication
  – Option to write to all zones, fast read from local (sketch below)
• Cassandra
  – Highly available and scalable (more later…)
• MongoDB
  – Complex object/query model for small scale use
• MySQL
  – Hard to scale, legacy and small relational models
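The evcache write-to-all-zones / read-local pattern could be sketched directly on top of spymemcached as below; this toy wrapper is hypothetical, and the real EVCache client (discovery-aware, with richer failure handling) is considerably more sophisticated.

    import java.util.List;
    import net.spy.memcached.MemcachedClient;

    // Toy sketch of zone aware replication: one memcached client per zone.
    public class ZoneAwareCache {
        private final MemcachedClient localZone;
        private final List<MemcachedClient> allZones;

        public ZoneAwareCache(MemcachedClient localZone, List<MemcachedClient> allZones) {
            this.localZone = localZone;
            this.allZones = allZones;
        }

        public void set(String key, int ttlSeconds, Object value) {
            for (MemcachedClient zone : allZones) {
                zone.set(key, ttlSeconds, value); // replicate the write to every zone
            }
        }

        public Object get(String key) {
            Object v = localZone.get(key);        // fast read from the local zone
            if (v == null) {
                for (MemcachedClient zone : allZones) {
                    v = zone.get(key);            // fall back to other zones on a miss
                    if (v != null) break;
                }
            }
            return v;
        }
    }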
Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink Cassandra "ring"
Astyanax
Available at http://github.com/netflix
• Cassandra java client
• API abstraction on top of Thrift protocol
• "Fixed" Connection Pool abstraction (vs. Hector)
  – Round robin with failover
  – Retry-able operations not tied to a connection
  – Netflix PaaS Discovery service integration
  – Host reconnect (fixed interval or exponential backoff)
  – Token aware to save a network hop – lower latency
  – Latency aware to avoid compacting/repairing nodes – lower variance
• Batch mutation: set, put, delete, increment (example below)
• Simplified use of serializers via method overloading (vs. Hector)
• ConnectionPoolMonitor interface for counters and tracers
• Composite Column Names replacing deprecated SuperColumns
Astyanax Query Example
Paginate through all columns in a row:

    // Page through every column of row "A", ten columns at a time.
    ColumnList<String> columns;
    int pageSize = 10;
    try {
        RowQuery<String, String> query = keyspace
            .prepareQuery(CF_STANDARD1)
            .getKey("A")
            .setIsPaginating()
            .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
        while (!(columns = query.execute().getResult()).isEmpty()) {
            for (Column<String> c : columns) {
                // process each column here
            }
        }
    } catch (ConnectionException e) {
        // connection pool gave up; handle or rethrow
    }
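The batch mutation feature listed earlier follows the same style; keyspace and CF_STANDARD1 are assumed to be set up as in the query example, and the row keys and column values here are invented.

    // Requires com.netflix.astyanax.MutationBatch and
    // com.netflix.astyanax.connectionpool.exceptions.ConnectionException.
    MutationBatch m = keyspace.prepareMutationBatch();
    m.withRow(CF_STANDARD1, "A")
        .putColumn("firstname", "john", null)   // null = no TTL
        .putColumn("lastname", "smith", null);
    m.withRow(CF_STANDARD1, "B")
        .deleteColumn("obsolete");
    try {
        m.execute();                            // one round trip for all rows
    } catch (ConnectionException e) {
        // handle or retry
    }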
High Availability
• Cassandra stores 3 local copies, 1 per zone
  – Synchronous access, durable, highly available
  – Read/Write One: fastest, least consistent – ~1ms
  – Read/Write Quorum: 2 of 3, consistent – ~3ms (sketch below)
• AWS Availability Zones
  – Separate buildings
  – Separate power etc.
  – Fairly close together
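A sketch of how the two consistency choices above map onto Astyanax calls, reusing the keyspace and column family from the earlier example; the row key and column are invented, and both execute() calls throw ConnectionException, caught as in the query example.

    // Requires com.netflix.astyanax.model.ConsistencyLevel.
    ColumnList<String> row = keyspace.prepareQuery(CF_STANDARD1)
        .setConsistencyLevel(ConsistencyLevel.CL_ONE)     // fastest, least consistent (~1ms)
        .getKey("A")
        .execute().getResult();

    MutationBatch m = keyspace.prepareMutationBatch()
        .setConsistencyLevel(ConsistencyLevel.CL_QUORUM); // 2 of 3 replicas must ack (~3ms)
    m.withRow(CF_STANDARD1, "A").putColumn("status", "active", null);
    m.execute();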
"Traditional" Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Not Token Aware
[Diagram: non token aware clients in front of six Cassandra nodes, two per zone (A, B, C)]
1. Client writes to any Cassandra node
2. Coordinator node replicates to nodes and zones
3. Nodes return ack to coordinator
4. Coordinator returns ack to client
5. Data written to internal commit log disk (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously.
Astyanax – Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
[Diagram: token aware clients writing directly to six Cassandra nodes, two per zone (A, B, C)]
1. Client writes to nodes and zones
2. Nodes return ack to client
3. Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously.
Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
[Diagram: US and EU client clusters, each writing to six Cassandra nodes across zones A/B/C, with 100+ms latency between regions]
1. Client writes to local replicas
2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
Cassandra Backup
[Diagram: ring of Cassandra nodes backing up to S3]
• Full Backup
  – Time based snapshot
  – SSTable compress -> S3
• Incremental
  – SSTable write triggers compressed copy to S3
• Archive
  – Copy cross region
ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
  – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
  – High throughput raw SSTable processing
  – Re-normalizes many clusters to a consistent view
  – Extract, Transform, then Load into Teradata
Cassandra Archive
Appropriate level of paranoia needed…
• Archive could be un-readable
  – Restore S3 backups weekly from prod to test, and daily ETL
• Archive could be stolen
  – PGP encrypt archive
• AWS East region could have a problem
  – Copy data to AWS West
• Production AWS account could have an issue
  – Separate archive account with no-delete S3 ACL
• AWS S3 could have a global problem
  – Create an extra copy on a different cloud vendor…
Tools and Automation
• Developer and Build Tools
  – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
  – Builds, creates .war file, .rpm, bakes AMI and launches
• Custom Netflix Application Console
  – AWS features at enterprise scale (hide the AWS security keys!)
  – Auto Scaler Group is unit of deployment to production
• Open Source + Support
  – Apache, Tomcat, Cassandra, Hadoop
  – Datastax support for Cassandra, AWS support for Hadoop via EMR
• Monitoring Tools
  – Alert processing gateway into Pagerduty
  – AppDynamics – developer focus for cloud http://appdynamics.com
Open Source Strategy
• Release PaaS components git-by-git
  – Source at github.com/netflix
  – Intros and techniques at techblog.netflix.com
  – Blog post or new code every week or so
• Motivations
  – Give back to Apache licensed OSS community
  – Motivate, retain, hire top engineers
  – Create a community that adds features and fixes
Current OSS Projects and Posts
(Github / Techblog releases, grouped by related Apache project where applicable)
• Apache Cassandra – Priam, Astyanax, CassJMeter, Aegisthus
• Apache Zookeeper – Exhibitor, Curator
• Standalone – EVCache, Servo, Autoscaling scripts, Honu
• Techblog post – Circuit Breaker
Scalability Testing
• Cloud Based Testing – frictionless, elastic
  – Create/destroy any sized cluster in minutes
  – Many test scenarios run in parallel
• Test Scenarios
  – Internal app specific tests
  – Simple "stress" tool provided with Cassandra
• Scale test, keep making the cluster bigger
  – Check that tooling and automation works…
  – How many ten column row writes/sec can we do?
Scale-Up Linearity
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
[Chart: Client Writes/s by node count, Replication Factor = 3 – roughly linear scaling through the measured points 174,373; 366,828; 537,172; and 1,099,837 writes/s as the cluster grows toward ~300 nodes]
Chaos Monkey
• Computers (datacenter or AWS) randomly die
  – Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
  – Allow any instance to fail without customer impact
• Chaos Monkey hours
  – Monday-Thursday 9am-3pm random instance kill
• Application configuration option
  – Apps now have to opt out from Chaos Monkey (toy sketch below)
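As a toy illustration of the idea in the list above, the core of a Chaos Monkey can be as small as the sketch below, using the AWS Java SDK; the real service selects victims per autoscaling group, honors the opt-out configuration, and only runs during the hours listed above.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    // Toy sketch: terminate one randomly chosen instance from a candidate list.
    public class TinyChaosMonkey {
        public static void main(String[] args) {
            List<String> candidates = Arrays.asList(args); // instance ids to consider
            if (candidates.isEmpty()) return;

            AmazonEC2Client ec2 =
                new AmazonEC2Client(new BasicAWSCredentials("accessKey", "secretKey"));
            String victim = candidates.get(new Random().nextInt(candidates.size()));
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
            System.out.println("Chaos Monkey terminated " + victim);
        }
    }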
Responsibility and Experience
• Make developers responsible for failures
  – Then they learn and write code that doesn't fail
• Use incident reviews to find gaps to fix
  – Make sure it's not about finding "who to blame"
• Keep timeouts short, fail fast
  – Don't let cascading timeouts stack up
• Make configuration options dynamic (sketch below)
  – You don't want to push code to tweak an option
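A hypothetical minimal version of a dynamic configuration option is sketched below: callers read the current value on every use, so an operator can change it at runtime without a code push. Netflix's platformservice layers persistence, scoping and change notification on top of this idea.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy sketch of dynamic properties: values can be updated live.
    public class DynamicProperties {
        private static final Map<String, String> PROPS =
            new ConcurrentHashMap<String, String>();

        public static void set(String name, String value) { PROPS.put(name, value); }

        public static int getInt(String name, int defaultValue) {
            String v = PROPS.get(name);
            return v == null ? defaultValue : Integer.parseInt(v);
        }
    }

    // Usage: read the timeout on every call so it can be tuned live.
    // int timeoutMs = DynamicProperties.getInt("service.call.timeout.ms", 50);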
Resilient Design – Circuit Breakers
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
PaaS Operational Model
• Developers
  – Provision and run their own code in production
  – Take turns to be on call if it breaks (pagerduty)
  – Configure autoscalers to handle capacity needs
• DevOps and PaaS (aka NoOps)
  – DevOps is used to build and run the PaaS
  – PaaS constrains Dev to use automation instead
  – PaaS puts more responsibility on Dev, with tools
What's Left for Corp IT?
• Corporate Security and Network Management
  – Billing and remnants of streaming service back-ends in DC
• Running Netflix' DVD Business
  – Tens of Oracle instances
  – Hundreds of MySQL instances
  – Thousands of VMWare VMs
  – Zabbix, Cacti, Splunk, Puppet
• Employee Productivity
  – Building networks and WiFi
  – SaaS OneLogin SSO Portal
  – Evernote Premium, Safari Online Bookshelf, Dropbox for Teams
  – Google Enterprise Apps, Workday HCM/Expense, Box.com
  – Many more SaaS migrations coming…
Corp WiFi Performance
Implications for IT Operations
• Cloud is run by the developer organization
  – Product group's "IT department" is the AWS API and PaaS
  – CorpIT handles billing and some security functions
• Cloud capacity is 10x bigger than datacenter
  – Datacenter oriented IT didn't scale up as we grew
  – We moved a few people out of IT to do DevOps for our PaaS
• Traditional IT roles and silos are going away
  – We don't have SA, DBA, Storage, Network admins for cloud
  – Developers deploy and "run what they wrote" in production
Netflix PaaS Organization
Developer org reporting into Product Development, not ITops
Netflix Cloud Platform Team:
• Cloud Ops Reliability Engineering – alert routing, incident lifecycle (PagerDuty)
• Architecture – future planning, security arch (AWS VPC, Hyperguard), efficiency, Powerpoint ☺
• Build Tools and Automation – Perforce, Jenkins, Artifactory, JIRA, Base AMI, Bakery, Netflix App Console (drives the AWS API)
• Platform and Database Engineering – platform jars, key store, Zookeeper, Cassandra (on AWS instances)
• Cloud Performance – Cassandra benchmarking, JVM GC tuning, Wiresharking (on AWS instances)
• Cloud Solutions – Monitoring, Monkeys, Entrypoints (on AWS instances)
Roadmap for 2012
• Readiness for global Netflix launches
• More resiliency and improved availability
• More automation, orchestration
• "Hardening" the platform
• Lower latency for web services and devices
• Working towards IPv6 support
• More open sourced components
Wrap Up
Answer your remaining questions…
What was missing that you wanted to cover?
Next up – Jason Chan on Security Architecture
Takeaway
Netflix has built and deployed a scalable global Platform as a Service.
Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud