DevOps at Netflix (re:Invent)

Post on 12-Jan-2015

Category:

Technology

DESCRIPTION

How Netflix operates for maximum freedom and agility. Video here: https://www.youtube.com/watch?v=s0rCGFetdtM

TRANSCRIPT

Rainmakers

How Netflix Operates Clouds for Maximum Freedom and Agility

Jeremy Edberg, Reliability Architect, Netflix

Tweet @jedberg with feedback!

Do you have...

• A release engineer?

• A QA department?

• Chef or Puppet to manage your systems?

Do you have...

• Upwards of 100 releases a day?

With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the Nordics, Netflix is the world's leading internet subscription service for enjoying movies and TV programs streamed over the internet to PCs, Macs and TVs.

Source: http://ir.netflix.com

The Netflix Way

• Everything is “built for three”

• Fully automated build tools to test and make packages

• Fully automated machine image bakery

• Fully automated image deployment

• Independent teams responsible for both Dev and Ops

Philosophy

Automate all the things!

• Application startup

• Configuration

• Code deployment

• System deployment

Automation

• Standard base image

• Tools to manage all the systems

• Automated code deployment

Shared state should be stored in a shared service.

Data on an instance should be replicated to other instances.

“Build for Three”

We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.

Netflix on AWS

[Figure; recoverable labels: 2012, IPv6, Open Connect]

Highly aligned, loosely coupled

• Services are built by different teams who work together to figure out what each service will provide.

• The service owner publishes an API that anyone can use.

Advantages to a Service Oriented Architecture

• Easier auto-scaling

• Easier capacity planning

• Identify problematic code-paths more easily

• Narrow down the effects of a change

• More efficient local caching

Freedom and Responsibility

• Developers deploy when they want

• They also manage their own capacity and autoscaling

• And fix anything that breaks at 4am!

All systems choices assume some part will fail at some point.

The Monkey Theory

• Simulate things that go wrong

• Find things that are different

Execution

Photo from I, Robot, copyright 20th Century Fox

Netflix built a global PaaS

• Service Oriented Architecture

• HTTP/REST interfaces between services

Netflix PaaS features

• Supports all regions and zones

• Multiple accounts

• Cross region/account replication

• Internationalized, localized and GeoIP routed

• Advanced key management

• Autoscaling with 1000s of instances

• Monitoring and alerting on millions of metrics

What AWS Provides

• Instances

• Machine Images

• Elastic IPs

• Load Balancers

• Security groups / Autoscaling groups

• Availability zones and regions

[Diagram: instance image contents, Tomcat variant]

• Linux Base AMI (CentOS or Ubuntu)

• Java (JDK 6 or 7)

• Tomcat

• Optional Apache

• Application war file, base servlet, platform, interface jars for dependent services

• Healthcheck, status servlets, JMX interface, Servo autoscale

• Monitoring: Appdynamics Machine Agent, Appdynamics App Agent monitoring

• Logging: log rotation to S3, GC and thread dump logging

The Netflix Platform

• Discovery (Eureka)

• Entrypoints (Edda)

• Configuration (Archaius)

• Zookeeper (Exhibitor)

• Logging (Blitz4j & Honu)

• NIWS

• Geo

• Base

• Circuit Breakers (Hystrix)

• Cassandra (Priam & Astyanax & CassJMeter)

• Cryptex

• AKMS

• EVCache

• Proxies

• i18n

• L10n

[Diagram legend: Open Source]

Open Source at Netflix

[Figure: timeline of open source releases by month, November through October, with 2012 marked. Projects shown: Curator, Astyanax, Servo, Priam, CassJMeter, Exhibitor, Archaius, Asgard, Chaos Monkey, Eureka, Governator, Blitz4j, Edda, Hystrix.]

Finding things

• Discovery (Eureka)

  • Application to instance mapping

  • Heartbeat to keep track of health

• Entrypoints (Edda)

  • Local database of AWS resources

• NIWS (Netflix Internal Web Service)

  • On instance software load balancer

  • Handles retry logic

• Geo (Geolocation library)

  • Provides IP to Lat/Lon mapping for any service that needs it.
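
The discovery bullets above (application-to-instance mapping kept fresh by heartbeats) can be illustrated with a minimal in-memory sketch. This is not Eureka's API: the `Registry` class, its `ttl` parameter and its method names are hypothetical, chosen only to show the idea.

```python
import time


class Registry:
    """Toy application-to-instance registry with heartbeat expiry (hypothetical, not Eureka)."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl          # seconds an instance counts as healthy without a heartbeat
        self.clock = clock      # injectable clock so the sketch can be tested
        self._last_seen = {}    # (app, instance_id) -> time of the last heartbeat

    def heartbeat(self, app, instance_id):
        """Register the instance, or renew it if it is already known."""
        self._last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        """Return the instance IDs of `app` whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            iid for (name, iid), seen in self._last_seen.items()
            if name == app and now - seen <= self.ttl
        )
```

A caller would ask `registry.instances("api")` before each request and only see instances that have checked in recently.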

Entrypoints (Edda)

• REST API

  • GET /REST/v2/instance/$id

• Keeps track of all resources

  • Autoscaling groups, EIPs, Instances, Applications, Clusters, History

Entrypoints Exploration

Find all active instances

GET /REST/v2/view/instances

Find all instances in a cluster

GET /REST/v2/group/clusters

Show only ASG name, instance ID and health

/v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))

Which ASG contains a particular instance?

/v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
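
The last two queries attach arguments to the path itself with `;` (a field selector `_pp:(...)` and a filter `name=value`). A small sketch of composing such URLs follows. The host `edda.example.com`, the `/REST` prefix on these two paths, and both function names are assumptions for illustration, and nothing here contacts a server.

```python
EDDA_BASE = "http://edda.example.com/REST"  # hypothetical host, not from the talk


def asg_fields_url(asg_name, fields):
    """URL for one ASG that keeps only the listed fields (the `_pp:(...)` selector)."""
    return f"{EDDA_BASE}/v2/aws/autoScalingGroups/{asg_name};_pp:({fields})"


def asg_for_instance_url(instance_id):
    """URL that filters ASGs by a nested field, as in the final query above."""
    return f"{EDDA_BASE}/v2/aws/autoScalingGroups;instances.instanceId={instance_id}"
```

For example, `asg_for_instance_url("i-96f3ca3a")` reproduces the path of the final query above under the assumed base URL.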

Keeping it all Straight

• Configuration (Archaius)

  • Global variables (Fast properties)

• Base

  • Base system. Prod vs. Test, etc.

• Zookeeper (Curator)

  • Locks, other similar coordination

• Logging (Blitz4j and Honu)

  • Keep track of what happened and store it for post analysis.
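
As a rough illustration of layered configuration (global properties that can be changed at runtime on top of per-environment defaults), here is a sketch using the standard library's `ChainMap`. It is not the Archaius API. The property names, the layer names and the rule that runtime overrides win are all assumptions.

```python
from collections import ChainMap

# Hypothetical property sets; none of these keys come from the talk.
base = {"env": "test", "db.pool.size": 10}
prod_overrides = {"env": "prod", "db.pool.size": 50}
fast_properties = {}  # runtime overrides, consulted first

# Lookups walk the layers left to right and return the first match.
config = ChainMap(fast_properties, prod_overrides, base)


def get_property(name, default=None):
    """Look up a property; earlier layers win over later ones."""
    return config.get(name, default)
```

Because `ChainMap` keeps references to the underlying dictionaries, writing to `fast_properties` changes what `get_property` returns without a restart.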

Keeping it Secure

• Cryptex

  • Service for key management

  • High, medium and low value keys

• AKMS (Amazon Key Management System)

  • Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance

For more info, see SEC201: Security Panel

Storing it

• Cassandra (Priam, Astyanax)

  • Configure and access Cassandra

  • Provide OO abstractions; handle connection pooling and discovery of hosts

• EVCache (Eccentric Volatile Cache)

  • Wrapper for memcached to handle zone awareness and replication

• Proxies

  • Get data out of the datacenter and into the cloud.
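
The "zone awareness and replication" idea can be sketched as a wrapper that writes to every zone and reads from the local zone first. This is not the EVCache API: plain dictionaries stand in for memcached clusters, and the class name, method names and fallback rule are assumptions.

```python
class ZoneAwareCache:
    """Toy zone-aware cache: write to every zone, read from the local zone first."""

    def __init__(self, local_zone, zones):
        self.local_zone = local_zone
        # One dict per zone stands in for that zone's memcached cluster.
        self.stores = {zone: {} for zone in zones}

    def set(self, key, value):
        for store in self.stores.values():  # replicate the write to all zones
            store[key] = value

    def get(self, key):
        local = self.stores[self.local_zone]
        if key in local:
            return local[key]
        for zone, store in self.stores.items():  # fall back to other zones on a local miss
            if zone != self.local_zone and key in store:
                return store[key]
        return None
```

With this shape, losing the local zone's cache costs latency on the next read rather than losing the data.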

Data

What do we do with it all?

We store it!

• Cache (memcached)

• Cassandra

• RDS (MySQL)

Cassandra

Why Cassandra?

• Availability over consistency

• Writes over reads

• We know Java

• Open source + support

Using Cassandra at Netflix

• Priam

  • Zero touch auto-config

  • State management

  • Token assignment

  • Node replacement

  • Backup/restore to/from S3

• Astyanax

  • OO abstraction to Cassandra

  • Multi-region support
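
Token assignment is one of the tasks listed for Priam. The sketch below shows the general idea of spacing tokens evenly around the ring, assuming a token space of 2**127 as used by Cassandra's RandomPartitioner. Priam's actual scheme is not described in the talk and may differ, and the function name is hypothetical.

```python
RING_SIZE = 2 ** 127  # token space assumed here (Cassandra RandomPartitioner)


def assign_tokens(node_count):
    """Return evenly spaced tokens, one per node."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return [i * RING_SIZE // node_count for i in range(node_count)]
```

Even spacing gives each node an equal share of the key space, which is why automating it matters when nodes are replaced.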

Cassandra Architecture

For more info, see DAT202: Optimizing your Cassandra Database on AWS

Tools

• Asgard

• AWS usage

• Atlas

• Chronos

• Build system

• Explorers (Cassandra and SimpleDB)

[Diagram; components shown: Auto Scaling Group, Launch Configuration, Security Group, Amazon Machine Image, Instances, Elastic Load Balancer]

api-usprod-v007

api-frontend

api-usprod-v008

Netflix has moved the granularity from the instance to the cluster
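
The names `api-usprod-v007` and `api-usprod-v008` suggest that each deployment is a new, numbered group within a cluster. A sketch of deriving the next name follows, assuming the three-digit suffix is a version counter. The function is hypothetical and is not taken from Asgard.

```python
import re


def next_asg_name(current):
    """Given a name like 'api-usprod-v007', return the name of the next group in the cluster."""
    match = re.fullmatch(r"(?P<cluster>.+)-v(?P<version>\d{3})", current)
    if match is None:
        raise ValueError(f"unexpected ASG name: {current!r}")
    version = int(match.group("version")) + 1
    return f"{match.group('cluster')}-v{version:03d}"
```

Treating the whole group as the unit means a release is a new group next to the old one, not a change to individual instances.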

Why Bake?

Traditional (Generic AMI → Instance):

• launch OS

• install packages

• install app

Netflix (App AMI → Instance):

• launch OS+app

Getting Baked

[Diagram: build pipeline. Labels shown: Perforce / Git (source, libraries); Jenkins (sync, resolve, compile, build, test, report, publish); Ant targets; Ivy; Groovy all over; Artifactory (snapshot / release libraries / apps, app bundles).]

Base Image Baking

[Diagram: base image bakery. Labels shown: Yum / Apt; Linux: CentOS, Fedora, Ubuntu; RPMs: Apache, Java...; AWS; ec2 slave instances; S3 / EBS; foundation AMI; Bakery (mount, install, snapshot); base AMI; Ready for app bake.]

App Image Baking

[Diagram: app image bakery. Labels shown: Jenkins / Yum / Artifactory; Linux, Apache, Java, Tomcat; app bundle; AWS; ec2 slave instances; S3 / EBS; base AMI; Bakery (mount, install, snapshot); app AMI; Ready to launch!]

[Diagram: instance image contents, Tomcat variant]

• Linux Base AMI (CentOS or Ubuntu)

• Java (JDK 6 or 7)

• Tomcat

• Optional Apache

• Application war file, base servlet, platform, interface jars for dependent services

• Healthcheck, status servlets, JMX interface, Servo autoscale

• Monitoring: Appdynamics Machine Agent, Appdynamics App Agent monitoring

• Logging: log rotation to S3, GC and thread dump logging

[Diagram: instance image contents, JBoss variant]

• Linux Base AMI (CentOS or Ubuntu)

• Java (JDK 6 or 7)

• JBoss

• Optional Apache

• Application war file, base servlet, platform, interface jars for dependent services

• Healthcheck, status servlets, JMX interface, Servo autoscale

• Monitoring: Appdynamics Machine Agent, Appdynamics App Agent monitoring

• Logging: log rotation to S3, GC and thread dump logging

[Diagram: instance image contents, Python variant]

• Linux Base AMI (CentOS or Ubuntu)

• Python

• Django

• Optional Apache

• Application file, base server, platform, interface libs for dependent services

• Monitoring: Appdynamics Machine Agent

• Logging: log rotation to S3

The Monkey Theory

• Simulate things that go wrong

• Find things that are different

The simian army

• Chaos -- Kills random instances

• Chaos Gorilla -- Kills zones

• Chaos Kong -- Kills regions

• Latency -- Degrades network and injects faults

• Conformity -- Looks for outliers

• Circus -- Kills and launches instances to maintain zone balance

• Doctor -- Fixes unhealthy resources

• Janitor -- Cleans up unused resources

• Howler -- Yells about bad things like Amazon limit violations

• Security -- Finds security issues and expiring certificates

For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
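
The first monkey in the list, "kills random instances", can be sketched as a selection step. The helper below only picks victims and makes no AWS calls. Its name, its input format and the one-victim-per-group rule are assumptions and are not taken from the Simian Army code.

```python
import random


def pick_victims(groups, rng=None):
    """Pick one random instance from each non-empty group (hypothetical helper)."""
    rng = rng or random.Random()
    return {name: rng.choice(instances) for name, instances in groups.items() if instances}
```

Passing a seeded `random.Random` makes a run reproducible, which helps when reviewing what was terminated.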

What’s going on?!

Atlas

{
  "clusters": [
    "epic_aggregator",
    "epic_aggregator-dev"
  ],
  "alerts": [
    // you can use javascript style comments in the config
    {
      "metricName": "EpicPlugin_NumDropped",
      "applyTo": "cluster",
      "condition": {
        "type": "StaticThreshold",
        "max": 0.0
      },
      "severity": "major",
      "description": "plugin is dropping metrics"
    },
    {
      "metricName": "EpicPlugin_NumDropped_Instance",
      "applyTo": "instance",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "overrides": {
        "service_key_override": "12345",
        "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
        "email_override": "devnull@netflix.com"
      },
      "severity": "minor"
    },
    {
      "metricName": "EpicPlugin_MetricCount",
      "applyTo": "instance",
      "description": "${instanceId} is reporting too many metrics",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "additionalDetails": {
        "statusUrl": "http://${publicDnsName}:7001/Status",
        "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
      },
      "overrides": {
        "subject": "${instanceId} is reporting too many metrics",
        "incident_key": "${metricName}:${instanceId}",
        "service_key_override": "12345",
        "email_override": "devnull@netflix.com"
      },
      "severity": "minor"
    }
  ]
}

Example Alert Config
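
The config nests a `StaticThreshold` inside a `NumOccurrences` condition. One plausible reading is sketched below. The talk does not define these semantics, so the rules in the docstring are assumptions and the function name is hypothetical.

```python
def violates(condition, values):
    """Return True if `values` (oldest first) violate `condition`.

    Assumed semantics, not confirmed by the talk:
    - StaticThreshold: the latest value exceeds "max".
    - NumOccurrences: the nested condition held at each of the last "num" values.
    """
    kind = condition["type"]
    if kind == "StaticThreshold":
        return bool(values) and values[-1] > condition["max"]
    if kind == "NumOccurrences":
        num = condition["num"]
        if len(values) < num:
            return False
        inner = condition["condition"]
        # Re-evaluate the nested condition as of each of the last `num` samples.
        return all(
            violates(inner, values[: i + 1])
            for i in range(len(values) - num, len(values))
        )
    raise ValueError(f"unknown condition type: {kind}")
```

Under this reading, requiring several occurrences in a row keeps a single noisy sample from paging anyone.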

Alert Tuning

Alert Systems

[Diagram; components shown: Atlas, Appdynamics, alerting, api, CORE Event Gateway, Paging Service, Amazon SES, CORE Agent, Other Team’s Agent]

Chronos

Data Collection Pipeline

Data Processing Pipeline

For more info, see BDT303: Data Science with Elastic MapReduce

Chukwa/Honu messages / min

63 billion messages a day

Best Practices

Incident Reviews

Ask the key questions:

• What went wrong?

• How could we have detected it sooner?

• How could we have prevented it?

• How can we prevent this class of problem in the future?

• How can we improve our behavior for next time?

Best Practices for Data

• Have multiple copies of all data

• Keep those copies in multiple AZs

• Avoid keeping state on a single instance

• Take frequent snapshots of EBS disks

• No secret keys on the instance

Netflix autoscaling

[Figure: autoscaling graph; labels: Traffic Peak, Deployment]

AWS Usage

Dollar amounts have been carefully removed

Going multi-zone

Benefits of Amazon’s Zones

• Loosely connected

• Low latency between zones

• 99.95% uptime guarantee per region

Going Multi-region

Leveraging Multi-region

• 100% uptime is theoretically possible.

• You have to replicate your data

• This will cost money
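
As a back-of-the-envelope illustration of why more regions help, the sketch below combines the 99.95% per-region figure quoted earlier. It assumes regions fail independently and that one healthy region is enough to serve traffic. Both are simplifying assumptions, and the function name is hypothetical.

```python
def combined_availability(per_region, regions):
    """Probability that at least one region is up, assuming independent failures."""
    return 1.0 - (1.0 - per_region) ** regions
```

With two regions at 0.9995 each, this idealized model gives roughly 0.99999975, and each added region moves the figure closer to 100%.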

Circuit Breakers (Hystrix)

Be liberal in what you accept, strict in what you send
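
The circuit breaker pattern can be sketched in a few lines. This is a toy version of the general pattern, not the Hystrix API: the class name, the parameters `max_failures` and `reset_after`, and the fallback convention are assumptions.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker (hypothetical, not the Hystrix API)."""

    def __init__(self, max_failures=3, reset_after=10.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before the circuit opens
        self.reset_after = reset_after    # seconds to wait before trying the call again
        self.clock = clock                # injectable clock so the sketch can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        """Run func(); on failure, or while the circuit is open, return fallback()."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: skip the dependency entirely
            self.opened_at = None         # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                 # a success closes the circuit
        return result
```

While the circuit is open the failing dependency is not called at all, which gives it room to recover and keeps callers responsive.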

Just a quick reminder...

• (Some of) Netflix is open source:

• https://github.com/netflix

We are sincerely eager to hear your feedback on this presentation and on re:Invent.

Please fill out an evaluation form when you have a chance.

Questions?

Getting in touch

Email: jedberg@{gmail,netflix}.com

Twitter: @jedberg

Web: www.jedberg.net

Facebook: facebook.com/jedberg

LinkedIn: www.linkedin.com/in/jedberg
