Netflix in the Cloud at SVForum
DESCRIPTION
Talk given at SVForum, Sunnyvale CA, March 27th 2012
TRANSCRIPT
Cloud Architecture at Netflix: How Netflix Built a Scalable Java-oriented PaaS Running on AWS
SVForum, March 27th, 2012
Adrian Cockcroft @adrianco #netflixcloud
http://www.linkedin.com/in/adriancockcroft
Adrian Cockcroft
• Director, Architecture for Cloud Systems, Netflix Inc.
  – Previously Director for Personalization Platform
• Distinguished Availability Engineer, eBay Inc. 2004-7
  – Founding member of eBay Research Labs
• Distinguished Engineer, Sun Microsystems Inc. 1988-2004
  – 2003-4 Chief Architect High Performance Technical Computing
  – 2001 Author: Capacity Planning for Web Services
  – 1999 Author: Resource Management
  – 1995 & 1998 Author: Sun Performance and Tuning
  – 1996 Japanese edition of Sun Performance and Tuning
    • SPARC & Solaris Performance Tuning (SunSoft Press series)
• More
  – Twitter @adrianco
  – Blog http://perfcap.blogspot.com
  – Presentations at http://www.slideshare.net/adrianco
What kind of Cloud?
• Software as a Service – SaaS
  – Replaces in-house applications
  – Targets end users
• Platform as a Service – PaaS
  – Replaces in-house operations functions
  – Targets developers
• Infrastructure as a Service – IaaS
  – Replaces in-house datacenter capacity
  – Targets developers and ITops
What Netflix Did
• Moved to SaaS
  – Corporate IT – OneLogin, Workday, Box, Evernote…
  – Tools – Pagerduty, AppDynamics, Elastic MapReduce
• Built our own PaaS <- today's focus
  – Customized to make our developers productive
  – When we started, we had little choice
• Moved incremental capacity to IaaS
  – No new datacenter space since 2008 as we grew
  – Moved our streaming apps to the cloud
Netflix could not build new datacenters fast enough
Capacity growth is accelerating, unpredictable
Product launch spikes – iPhone, Wii, PS3, Xbox
International – Canada, Latin America, UK/Ireland
[Chart: streaming capacity growth curve, with datacenter capacity shown as a small fixed slice]
Netflix.com is now ~100% Cloud
A few small back-end data sources still in progress
All international product is cloud based
USA-specific logistics remains in the datacenter
Working on SOX, PCI as scope starts to include AWS
Netflix Choice was AWS, with our own platform and tools
Unique platform requirements and extreme scale, agility and flexibility
Leverage AWS scale, "the biggest public cloud", and AWS investment in features and automation
Use AWS zones and regions for high availability, scalability and global deployment
But isn't Amazon a competitor?
Many products that compete with Amazon run on AWS
We are a "poster child" for the AWS architecture
Netflix is one of the biggest AWS customers
Co-opetition – competitors are also partners
Could Netflix use another cloud?
Would be nice; we use three interchangeable CDN vendors, but no one else has the scale and features of AWS
"You have to be this tall to ride this ride…" Maybe in 2-3 years?
We want to use clouds, we don't have time to build them
Public cloud for agility and scale. We use electricity too, but don't want to build our own power station…
AWS because they are big enough to allocate thousands of instances per hour when we need to
What about other PaaS?
• CloudFoundry – open source by VMWare
  – Developer-friendly, easy to get started
  – Missing scale and some enterprise features
• Rightscale
  – Widely used to abstract away from AWS
  – Creates its own lock-in problem…
• AWS is growing into this space
  – We didn't want a vendor between us and AWS
  – We wanted to build a thin PaaS that gets thinner
Netflix Deployed on AWS
(Service groups, with the year each moved to AWS)
• Content (2009) – Video Masters, EC2, S3, CDNs
• Logs (2009) – S3, EMR Hadoop, Hive, Business Intelligence
• Play (2010) – DRM, CDN routing, Bookmarks, Logging
• WWW (2010) – Sign-Up, Search, Movie Choosing, Ratings
• API (2010) – Metadata, Device Config, TV Movie Choosing, Social Facebook
• CS (2011) – International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
Goals
• Faster
  – Lower latency than the equivalent datacenter web pages and API calls
  – Measured as mean and 99th percentile
  – For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
  – Avoid needing any more datacenter capacity as subscriber count increases
  – No central vertically scaled databases
  – Leverage AWS elastic capacity effectively
• Available
  – Substantially higher robustness and availability than datacenter services
  – Leverage multiple AWS availability zones
  – No scheduled downtime, no central database schema to change
• Productive
  – Optimize agility of a large development team with automation and tools
  – Leave behind complex tangled datacenter code base (~8 year old architecture)
  – Enforce clean layered interfaces and re-usable components
Datacenter Anti-Patterns
What do we currently do in the datacenter that prevents us from meeting our goals?
Netflix Datacenter vs. Cloud Architecture
Central SQL Database            ->  Distributed Key/Value NoSQL
Sticky In-Memory Session        ->  Shared Memcached Session
Chatty Protocols                ->  Latency Tolerant Protocols
Tangled Service Interfaces      ->  Layered Service Interfaces
Instrumented Code               ->  Instrumented Service Patterns
Fat Complex Objects             ->  Lightweight Serializable Objects
Components as Jar Files         ->  Components as Services
Software Architecture Patterns
• Object Models
  – Basic and derived types, facets, serializable
  – Pass by reference within a service
  – Pass by value between services
• Computation and I/O Models
  – Service execution using Best Effort / Futures
  – Common thread pool management
  – Circuit breakers to manage and contain failures (see the sketch below)
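As a rough illustration of the computation model above, here is a minimal sketch of a dependent service call wrapped in a shared thread pool with a short timeout and a trip-after-N-failures circuit breaker; the class, thresholds, and remoteLookup call are hypothetical, not Netflix's actual platform code.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch: best effort / futures execution with a simple
    // circuit breaker to contain failures. Names and thresholds are made up.
    public class GuardedServiceCall {
        private static final ExecutorService POOL = Executors.newFixedThreadPool(20);
        private static final int TRIP_THRESHOLD = 5;
        private final AtomicInteger consecutiveFailures = new AtomicInteger();

        public String fetch(final String key, String fallback) {
            if (consecutiveFailures.get() >= TRIP_THRESHOLD) {
                return fallback;              // circuit open: fail fast
            }
            Future<String> f = POOL.submit(new Callable<String>() {
                public String call() throws Exception {
                    return remoteLookup(key); // the dependent service call
                }
            });
            try {
                String result = f.get(50, TimeUnit.MILLISECONDS); // keep timeouts short
                consecutiveFailures.set(0);   // success closes the circuit
                return result;
            } catch (Exception e) {
                f.cancel(true);
                consecutiveFailures.incrementAndGet();
                return fallback;              // degrade instead of cascading
            }
        }

        private String remoteLookup(String key) { return key; } // stand-in for a real client
    }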
Model Driven Architecture
• Traditional Datacenter Practices
  – Lots of unique hand-tweaked systems
  – Hard to enforce patterns
  – Some use of Puppet to automate changes
• Model Driven Cloud Architecture
  – Perforce/Ivy/Jenkins based builds for everything
  – Every production instance is a pre-baked AMI
  – Every application is managed by an Autoscaler (sketch below)
Every change is a new AMI
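The "every application is managed by an Autoscaler" step could look roughly like this with the AWS Java SDK; the group name, AMI id, and sizes are invented for illustration, and in practice Netflix drives this through its own console and bakery rather than raw SDK calls.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
    import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
    import com.amazonaws.services.autoscaling.model.CreateLaunchConfigurationRequest;

    // Sketch: register a freshly baked AMI and put an autoscaler in front of it.
    public class DeployNewAmi {
        public static void main(String[] args) {
            AmazonAutoScalingClient asg =
                new AmazonAutoScalingClient(new BasicAWSCredentials("accessKey", "secretKey"));

            asg.createLaunchConfiguration(new CreateLaunchConfigurationRequest()
                .withLaunchConfigurationName("api-v123")   // hypothetical name
                .withImageId("ami-12345678")               // the pre-baked AMI
                .withInstanceType("m1.xlarge"));

            asg.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                .withAutoScalingGroupName("api-v123")
                .withLaunchConfigurationName("api-v123")
                .withAvailabilityZones("us-east-1a", "us-east-1c", "us-east-1d")
                .withMinSize(3)
                .withMaxSize(30));
        }
    }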
Netflix PaaS Principles
• Maximum Functionality
  – Developer productivity and agility
• Leverage as much of AWS as possible
  – AWS is making huge investments in features/scale
• Interfaces that isolate Apps from AWS
  – Avoid lock-in to specific AWS API details
• Portability is a long term goal
  – Gets easier as other vendors catch up with AWS
Netflix Global PaaS
• Architecture Features and Overview
• Portals and Explorers
• Platform Services
• Platform APIs
• Platform Frameworks
• Persistence
• Scalability Benchmark

Global PaaS? Toys are nice, but this is the real thing…
• Supports all AWS Availability Zones and Regions
• Supports multiple AWS accounts {test, prod, etc.}
• Cross Region/Acct Data Replication and Archiving
• Internationalized, Localized and GeoIP routing
• Security is fine grain, dynamic AWS keys
• Autoscaling to thousands of instances
• Monitoring for millions of metrics
• Productive for 100s of developers on one product
• 23M+ users USA, Canada, Latin America, UK, Eire
Basic PaaS Entities
• AWS Based Entities
  – Instances and Machine Images, Elastic IP Addresses
  – Security Groups, Load Balancers, Autoscale Groups
  – Availability Zones and Geographic Regions
• Netflix PaaS Entities
  – Applications (registered services)
  – Clusters (versioned Autoscale Groups for an App)
  – Properties (dynamic hierarchical configuration)
Core PaaS Services
• AWS Based Services
  – S3 storage, to 5TB files, parallel multipart writes
  – SQS – Simple Queue Service, messaging layer (sketch below)
• Netflix Based Services
  – EVCache – memcached based ephemeral cache
  – Cassandra – distributed data store
• External Services
  – GeoIP lookup interfaced to a vendor
  – Keystore HSM in Netflix Datacenter
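For the SQS messaging layer, basic use via the AWS Java SDK looks like the sketch below; the queue name and message body are made up.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.*;

    // Sketch: create a queue, send a message, poll for messages.
    public class SqsExample {
        public static void main(String[] args) {
            AmazonSQSClient sqs =
                new AmazonSQSClient(new BasicAWSCredentials("accessKey", "secretKey"));
            String queueUrl = sqs.createQueue(new CreateQueueRequest("example-events")).getQueueUrl();

            sqs.sendMessage(new SendMessageRequest(queueUrl, "user-123 signed up"));

            for (Message m : sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)).getMessages()) {
                System.out.println(m.getBody()); // consumers delete messages after processing
            }
        }
    }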
Instance Architecture
[Layered stack diagram, top to bottom:]
• Linux Base AMI (CentOS or Ubuntu)
• Optional Apache frontend, memcached, non-java apps
• Monitoring – log rotation to S3, AppDynamics machine agent, Epic
• Java (JDK 6 or 7) – AppDynamics app agent, GC and thread dump logging
• Tomcat – application servlet, base server, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
Security Architecture
• Instance Level Security baked into base AMI
  – Login: ssh only allowed via portal (not between instances)
  – Each app type runs as its own userid app{test|prod}
• AWS Security, Identity and Access Management
  – Each app has its own security group (firewall ports)
  – Fine grain user roles and resource ACLs
• Key Management
  – AWS keys dynamically provisioned, easy updates
  – High grade app specific key management support
Portals and Explorers
• Netflix Application Console (NAC)
  – Primary AWS provisioning/config interface
• AWS Usage Analyzer
  – Breaks down costs by application and resource
• Cassandra Explorer
  – Browse clusters, keyspaces, column families
• Base Server Explorer
  – Browse service endpoints configuration, perf
Platform Services
• Discovery – service registry for "Applications" (usage sketch below)
• Introspection – Entrypoints
• Cryptex – dynamic security key management
• Geo – geographic IP lookup
• Platformservice – dynamic property configuration
• Localization – manage and look up local translations
• Evcache – ephemeral volatile cache
• Cassandra – cross zone/region distributed data store
• Zookeeper – distributed coordination (Curator)
• Various proxies – access to old datacenter stuff
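To make the Discovery entry concrete, a client might consume it along these lines; this interface is hypothetical, sketched only from the description above (the real Netflix client was not public at the time of this talk).

    import java.util.List;
    import java.util.Random;

    // Hypothetical discovery client: apps register by name, callers look up
    // healthy instances instead of hard-coding hosts.
    interface DiscoveryClient {
        void register(String appName, String hostAndPort);
        List<String> getInstances(String appName); // healthy instances only
    }

    class DiscoveryExample {
        static String pickInstance(DiscoveryClient discovery, String appName) {
            List<String> instances = discovery.getInstances(appName);
            return instances.get(new Random().nextInt(instances.size())); // naive load balancing
        }
    }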
Metrics Framework
• System and Application
  – Collection, aggregation, querying and reporting
  – Non-blocking logging, avoids log4j lock contention
  – Honu – streaming -> S3 -> EMR -> Hive
• Performance, Robustness, Monitoring, Analysis
  – Tracers, Counters – explicit code instrumentation log
  – Real Time Tracers/Counters
  – SLA – service level response time percentiles
  – Servo annotated JMX extract to Cloudwatch (sketch below)
• Latency Monkey Infrastructure
  – Inject random delays into service responses
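The Servo bullet above can be illustrated with its annotation style: fields tagged @Monitor are registered and exposed over JMX, from where a poller can forward them to CloudWatch. A small sketch, with illustrative class and metric names:

    import java.util.concurrent.atomic.AtomicLong;
    import com.netflix.servo.annotations.DataSourceType;
    import com.netflix.servo.annotations.Monitor;
    import com.netflix.servo.monitor.Monitors;

    // Sketch: annotated counters/gauges exported via JMX by Servo.
    public class RequestStats {
        @Monitor(name = "requestCount", type = DataSourceType.COUNTER)
        private final AtomicLong requestCount = new AtomicLong();

        @Monitor(name = "activeConnections", type = DataSourceType.GAUGE)
        private final AtomicLong activeConnections = new AtomicLong();

        public RequestStats() {
            Monitors.registerObject("requestStats", this); // expose over JMX
        }

        public void onRequest() { requestCount.incrementAndGet(); }
    }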
Netflix Platform Persistence
• Ephemeral Volatile Cache – evcache
  – Discovery-aware memcached based backend
  – Client abstractions for zone aware replication
  – Option to write to all zones, fast read from local (sketch below)
• Cassandra
  – Highly available and scalable (more later…)
• MongoDB
  – Complex object/query model for small scale use
• MySQL
  – Hard to scale, legacy and small relational models
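The evcache write-to-all-zones / read-local pattern could be sketched directly on top of spymemcached as below; this toy wrapper is hypothetical, and the real EVCache client (discovery-aware, with richer failure handling) is considerably more sophisticated.

    import java.util.List;
    import net.spy.memcached.MemcachedClient;

    // Toy sketch of zone aware replication: one memcached client per zone.
    public class ZoneAwareCache {
        private final MemcachedClient localZone;
        private final List<MemcachedClient> allZones;

        public ZoneAwareCache(MemcachedClient localZone, List<MemcachedClient> allZones) {
            this.localZone = localZone;
            this.allZones = allZones;
        }

        public void set(String key, int ttlSeconds, Object value) {
            for (MemcachedClient zone : allZones) {
                zone.set(key, ttlSeconds, value); // replicate the write to every zone
            }
        }

        public Object get(String key) {
            Object v = localZone.get(key);        // fast read from the local zone
            if (v == null) {
                for (MemcachedClient zone : allZones) {
                    v = zone.get(key);            // fall back to other zones on a miss
                    if (v != null) break;
                }
            }
            return v;
        }
    }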
Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink Cassandra "ring"
Astyanax
Available at http://github.com/netflix
• Cassandra java client
• API abstraction on top of Thrift protocol
• "Fixed" Connection Pool abstraction (vs. Hector)
  – Round robin with failover
  – Retry-able operations not tied to a connection
  – Netflix PaaS Discovery service integration
  – Host reconnect (fixed interval or exponential backoff)
  – Token aware to save a network hop – lower latency
  – Latency aware to avoid compacting/repairing nodes – lower variance
• Batch mutation: set, put, delete, increment (example below)
• Simplified use of serializers via method overloading (vs. Hector)
• ConnectionPoolMonitor interface for counters and tracers
• Composite Column Names replacing deprecated SuperColumns
Astyanax Query Example
Paginate through all columns in a row:

    // Page through every column of row "A", ten columns at a time.
    ColumnList<String> columns;
    int pageSize = 10;
    try {
        RowQuery<String, String> query = keyspace
            .prepareQuery(CF_STANDARD1)
            .getKey("A")
            .setIsPaginating()
            .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
        while (!(columns = query.execute().getResult()).isEmpty()) {
            for (Column<String> c : columns) {
                // process each column here
            }
        }
    } catch (ConnectionException e) {
        // connection pool gave up; handle or rethrow
    }
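The batch mutation feature listed earlier follows the same style; keyspace and CF_STANDARD1 are assumed to be set up as in the query example, and the row keys and column values here are invented.

    // Requires com.netflix.astyanax.MutationBatch and
    // com.netflix.astyanax.connectionpool.exceptions.ConnectionException.
    MutationBatch m = keyspace.prepareMutationBatch();
    m.withRow(CF_STANDARD1, "A")
        .putColumn("firstname", "john", null)   // null = no TTL
        .putColumn("lastname", "smith", null);
    m.withRow(CF_STANDARD1, "B")
        .deleteColumn("obsolete");
    try {
        m.execute();                            // one round trip for all rows
    } catch (ConnectionException e) {
        // handle or retry
    }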
High Availability
• Cassandra stores 3 local copies, 1 per zone
  – Synchronous access, durable, highly available
  – Read/Write One: fastest, least consistent – ~1ms
  – Read/Write Quorum: 2 of 3, consistent – ~3ms (sketch below)
• AWS Availability Zones
  – Separate buildings
  – Separate power etc.
  – Fairly close together
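A sketch of how the two consistency choices above map onto Astyanax calls, reusing the keyspace and column family from the earlier example; the row key and column are invented, and both execute() calls throw ConnectionException, caught as in the query example.

    // Requires com.netflix.astyanax.model.ConsistencyLevel.
    ColumnList<String> row = keyspace.prepareQuery(CF_STANDARD1)
        .setConsistencyLevel(ConsistencyLevel.CL_ONE)     // fastest, least consistent (~1ms)
        .getKey("A")
        .execute().getResult();

    MutationBatch m = keyspace.prepareMutationBatch()
        .setConsistencyLevel(ConsistencyLevel.CL_QUORUM); // 2 of 3 replicas must ack (~3ms)
    m.withRow(CF_STANDARD1, "A").putColumn("status", "active", null);
    m.execute();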
"Traditional" Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Not Token Aware
[Diagram: non token aware clients in front of six Cassandra nodes, two per zone (A, B, C)]
1. Client writes to any Cassandra node
2. Coordinator node replicates to nodes and zones
3. Nodes return ack to coordinator
4. Coordinator returns ack to client
5. Data written to internal commit log disk (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously.
Astyanax – Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
[Diagram: token aware clients writing directly to six Cassandra nodes, two per zone (A, B, C)]
1. Client writes to nodes and zones
2. Nodes return ack to client
3. Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously.
Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
[Diagram: US and EU client clusters, each writing to six Cassandra nodes across zones A/B/C, with 100+ms latency between regions]
1. Client writes to local replicas
2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
Cassandra Backup
[Diagram: ring of Cassandra nodes backing up to S3]
• Full Backup
  – Time based snapshot
  – SSTable compress -> S3
• Incremental
  – SSTable write triggers compressed copy to S3
• Archive
  – Copy cross region
ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
  – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
  – High throughput raw SSTable processing
  – Re-normalizes many clusters to a consistent view
  – Extract, Transform, then Load into Teradata
Cassandra Archive
Appropriate level of paranoia needed…
• Archive could be un-readable
  – Restore S3 backups weekly from prod to test, and daily ETL
• Archive could be stolen
  – PGP encrypt archive
• AWS East region could have a problem
  – Copy data to AWS West
• Production AWS account could have an issue
  – Separate archive account with no-delete S3 ACL
• AWS S3 could have a global problem
  – Create an extra copy on a different cloud vendor…
Tools and Automation
• Developer and Build Tools
  – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
  – Builds, creates .war file, .rpm, bakes AMI and launches
• Custom Netflix Application Console
  – AWS features at enterprise scale (hide the AWS security keys!)
  – Auto Scaler Group is unit of deployment to production
• Open Source + Support
  – Apache, Tomcat, Cassandra, Hadoop
  – Datastax support for Cassandra, AWS support for Hadoop via EMR
• Monitoring Tools
  – Alert processing gateway into Pagerduty
  – AppDynamics – developer focus for cloud http://appdynamics.com
Open Source Strategy
• Release PaaS components git-by-git
  – Source at github.com/netflix
  – Intros and techniques at techblog.netflix.com
  – Blog post or new code every week or so
• Motivations
  – Give back to Apache licensed OSS community
  – Motivate, retain, hire top engineers
  – Create a community that adds features and fixes
Current OSS Projects and Posts
(Github / Techblog releases, grouped by related Apache project where applicable)
• Apache Cassandra – Priam, Astyanax, CassJMeter, Aegisthus
• Apache Zookeeper – Exhibitor, Curator
• Standalone – EVCache, Servo, Autoscaling scripts, Honu
• Techblog post – Circuit Breaker
Scalability Testing
• Cloud Based Testing – frictionless, elastic
  – Create/destroy any sized cluster in minutes
  – Many test scenarios run in parallel
• Test Scenarios
  – Internal app specific tests
  – Simple "stress" tool provided with Cassandra
• Scale test, keep making the cluster bigger
  – Check that tooling and automation works…
  – How many ten column row writes/sec can we do?
Scale-Up Linearity
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
[Chart: Client Writes/s by node count, Replication Factor = 3 – roughly linear scaling through the measured points 174,373; 366,828; 537,172; and 1,099,837 writes/s as the cluster grows toward ~300 nodes]
Chaos Monkey
• Computers (datacenter or AWS) randomly die
  – Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
  – Allow any instance to fail without customer impact
• Chaos Monkey hours
  – Monday-Thursday 9am-3pm random instance kill
• Application configuration option
  – Apps now have to opt out from Chaos Monkey (toy sketch below)
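As a toy illustration of the idea in the list above, the core of a Chaos Monkey can be as small as the sketch below, using the AWS Java SDK; the real service selects victims per autoscaling group, honors the opt-out configuration, and only runs during the hours listed above.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    // Toy sketch: terminate one randomly chosen instance from a candidate list.
    public class TinyChaosMonkey {
        public static void main(String[] args) {
            List<String> candidates = Arrays.asList(args); // instance ids to consider
            if (candidates.isEmpty()) return;

            AmazonEC2Client ec2 =
                new AmazonEC2Client(new BasicAWSCredentials("accessKey", "secretKey"));
            String victim = candidates.get(new Random().nextInt(candidates.size()));
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
            System.out.println("Chaos Monkey terminated " + victim);
        }
    }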
Responsibility and Experience
• Make developers responsible for failures
  – Then they learn and write code that doesn't fail
• Use incident reviews to find gaps to fix
  – Make sure it's not about finding "who to blame"
• Keep timeouts short, fail fast
  – Don't let cascading timeouts stack up
• Make configuration options dynamic (sketch below)
  – You don't want to push code to tweak an option
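A hypothetical minimal version of a dynamic configuration option is sketched below: callers read the current value on every use, so an operator can change it at runtime without a code push. Netflix's platformservice layers persistence, scoping and change notification on top of this idea.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy sketch of dynamic properties: values can be updated live.
    public class DynamicProperties {
        private static final Map<String, String> PROPS =
            new ConcurrentHashMap<String, String>();

        public static void set(String name, String value) { PROPS.put(name, value); }

        public static int getInt(String name, int defaultValue) {
            String v = PROPS.get(name);
            return v == null ? defaultValue : Integer.parseInt(v);
        }
    }

    // Usage: read the timeout on every call so it can be tuned live.
    // int timeoutMs = DynamicProperties.getInt("service.call.timeout.ms", 50);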
Resilient Design – Circuit Breakers
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
PaaS Operational Model
• Developers
  – Provision and run their own code in production
  – Take turns to be on call if it breaks (pagerduty)
  – Configure autoscalers to handle capacity needs
• DevOps and PaaS (aka NoOps)
  – DevOps is used to build and run the PaaS
  – PaaS constrains Dev to use automation instead
  – PaaS puts more responsibility on Dev, with tools
What's Left for Corp IT?
• Corporate Security and Network Management
  – Billing and remnants of streaming service back-ends in DC
• Running Netflix' DVD Business
  – Tens of Oracle instances
  – Hundreds of MySQL instances
  – Thousands of VMWare VMs
  – Zabbix, Cacti, Splunk, Puppet
• Employee Productivity
  – Building networks and WiFi
  – SaaS OneLogin SSO Portal
  – Evernote Premium, Safari Online Bookshelf, Dropbox for Teams
  – Google Enterprise Apps, Workday HCM/Expense, Box.com
  – Many more SaaS migrations coming…
Corp WiFi Performance
Implications for IT Operations
• Cloud is run by the developer organization
  – Product group's "IT department" is the AWS API and PaaS
  – CorpIT handles billing and some security functions
• Cloud capacity is 10x bigger than datacenter
  – Datacenter oriented IT didn't scale up as we grew
  – We moved a few people out of IT to do DevOps for our PaaS
• Traditional IT roles and silos are going away
  – We don't have SA, DBA, Storage, Network admins for cloud
  – Developers deploy and "run what they wrote" in production
Netflix PaaS Organization
Developer org reporting into Product Development, not ITops
Netflix Cloud Platform Team:
• Cloud Ops Reliability Engineering – alert routing, incident lifecycle (PagerDuty)
• Architecture – future planning, security arch (AWS VPC, Hyperguard), efficiency, Powerpoint ☺
• Build Tools and Automation – Perforce, Jenkins, Artifactory, JIRA, Base AMI, Bakery, Netflix App Console (drives the AWS API)
• Platform and Database Engineering – platform jars, key store, Zookeeper, Cassandra (on AWS instances)
• Cloud Performance – Cassandra benchmarking, JVM GC tuning, Wiresharking (on AWS instances)
• Cloud Solutions – Monitoring, Monkeys, Entrypoints (on AWS instances)
Roadmap for 2012
• Readiness for global Netflix launches
• More resiliency and improved availability
• More automation, orchestration
• "Hardening" the platform
• Lower latency for web services and devices
• Working towards IPv6 support
• More open sourced components
Wrap Up
Answer your remaining questions…
What was missing that you wanted to cover?
Next up – Jason Chan on Security Architecture
Takeaway
Netflix has built and deployed a scalable global Platform as a Service.
Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud