SATURN 2014. Engineering Velocity: Continuous Delivery at Netflix

Post on 17-Oct-2014

DESCRIPTION

At Netflix, we realize that there’s a tension between the availability of our service and our speed of innovation. If we move slowly, we can be very available, but that’s not a good business proposition. If we move super fast, we risk downtime, and that might annoy our customers. But what if we could increase our velocity without significantly impacting availability? How can we shift that curve so that we’re moving faster without dropping any of those coveted 9’s? How can we engineer velocity by weaving together tooling and culture with software development to expose and elevate highly effective practices?

This talk describes various components of Netflix’s continuous delivery platform, much of which is available in open source. I’ll show how these pieces fit together and allow us to build scaffolding that makes us comfortable with software developers deciding to push the button for prod deployment, and that helps them recover if necessary. As a result, we can run fast, trusting our tooling and our culture. I’ll also describe how we test our resiliency by simulating failure, unleashing the monkeys (Simian Army) on our production environment. Because if you’re afraid of cute little monkeys, imagine how afraid you’ll be of a production environment that offers those same risks but doesn’t give you an opportunity to test your response to those dangers.

Throughout this talk, I hope that you will challenge yourself to consider how your company can "shift the curve" through tooling to achieve a high-velocity environment without negatively impacting reliability.

TRANSCRIPT

Engineering Velocity: Continuous Delivery at Netflix

Dianne Marsh SATURN 2014

en-gi-neer-ing + ve-loc-i-ty: applying science and technology to designing and building speed into a system

Availability vs. Rate of Change

[Chart: availability (in 9’s, 0 to 6) plotted against rate of change (0, 10, 100, 1000)]
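To make the "availability in 9’s" axis concrete, here is a small sketch that converts a count of nines into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration; they do not come from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    One nine means 90% available, two means 99%, three means 99.9%, and so on.
    """
    if nines < 0:
        raise ValueError("nines must be non-negative")
    unavailability = 10.0 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why holding the same number of nines at a higher rate of change is hard.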

Shift the Curve

[Chart: availability (in 9’s, 0 to 6) plotted against rate of change (0, 10, 100, 1000, 10000)]

http://www.slideshare.net/reed2001/culture-1798664

Manager’s Role

Context, not Control

Loosely coupled, Tightly aligned

And hire well!

Get out of the Way

Freedom to Innovate

Support Experimentation

How We Built a Predictive Autoscaling Engine

http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html

Support Independent Paths of Exploration

Don’t Prematurely Optimize!

Blameless Culture

Developers Deploy Their Code

Run What You Wrote

• Rapid Innovation

• Rapid Detection

• Rapid Response

= Freedom + Responsibility

Support with Tools

Jenkins Job DSL

Configuration as Code

Groovy Script

Scripts go in Version Control

http://www.slideshare.net/quidryan/configuration-as-code

Aminator

Create AMI from Base AMI

Image contains service and everything needed to run it

Unit of Deployment for Test and Prod

Abstracts Cloud Details

http://techblog.netflix.com/2013/03/ami-creation-with-aminator.html

Asgard

Deploys Netflix to the Cloud

Red/Black push

Developed to address delays in rollback

http://www.infoq.com/presentations/asgard

Red/Black Push

• Scale up new instances

• Run canary analysis

• Turn on traffic to new ASG

• Turn off traffic to old ASG

• Wait … analyze … continue
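The steps above can be sketched as a small in-memory simulation. This is not Asgard's implementation: the `Asg` class, the `red_black_push` function, and the two health-check callbacks are hypothetical names invented for illustration. The sketch assumes the old ASG keeps running with traffic disabled, so rolling back only means switching traffic again.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Asg:
    """Hypothetical stand-in for an Auto Scaling Group."""
    name: str
    instances: int = 0
    taking_traffic: bool = False


def red_black_push(
    old: Asg,
    new: Asg,
    canary_ok: Callable[[Asg], bool],
    healthy_after_switch: Callable[[Asg], bool],
) -> Asg:
    """Run the push and return the ASG that ends up serving traffic."""
    # Scale up new instances alongside the old ones.
    new.instances = old.instances

    # Run canary analysis before the new ASG takes over.
    if not canary_ok(new):
        new.instances = 0
        return old

    # Turn on traffic to the new ASG, then turn it off for the old ASG.
    new.taking_traffic = True
    old.taking_traffic = False

    # Wait, analyze, continue. The old ASG still has its instances, so a
    # rollback is just a traffic switch (an assumption of this sketch).
    if not healthy_after_switch(new):
        old.taking_traffic = True
        new.taking_traffic = False
        return old
    return new
```

Passing the checks in as callbacks keeps the push logic separate from whatever metrics a team uses to judge a canary.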

Workflow

Continuous Delivery Engine

Judges between Stages

Represent Best Practices

http://techblog.netflix.com/2013/09/glisten-groovy-way-to-use-amazons.html

One Click Deployment?

Regional Isolation

Limit Impact of Human Error

• Stagger Deployments?

• Canary Testing per Region?

Know your Service!

Multi-Region Consistency

Build Tooling to:

• Schedule Deployments

• Prefer Off-Peak

• Choose Next Available Region

• Provide Visibility by Region
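A minimal sketch of how tooling might combine "Prefer Off-Peak" with "Choose Next Available Region" follows. The off-peak windows are invented placeholder values, and `is_off_peak` and `next_region` are hypothetical helpers, not part of any Netflix tool.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical off-peak windows as (start_hour, end_hour) in UTC.
# These numbers are placeholders for illustration only.
OFF_PEAK_UTC: Dict[str, Tuple[int, int]] = {
    "us-east-1": (7, 11),
    "us-west-2": (10, 14),
    "eu-west-1": (1, 5),
}


def is_off_peak(region: str, hour_utc: int) -> bool:
    """True if the given UTC hour falls inside the region's off-peak window."""
    start, end = OFF_PEAK_UTC[region]
    return start <= hour_utc < end


def next_region(hour_utc: int, deployed: Set[str], busy: Set[str]) -> Optional[str]:
    """Pick the next region to deploy to, or None if none qualifies right now.

    A region qualifies if it has not been deployed yet, has no deployment
    in progress, and is currently off-peak.
    """
    for region in sorted(OFF_PEAK_UTC):
        if region in deployed or region in busy:
            continue
        if is_off_peak(region, hour_utc):
            return region
    return None
```

Calling `next_region(10, {"us-east-1"}, set())` returns `"us-west-2"` under these placeholder windows; a scheduler would call it again after each regional deployment finishes.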

Simian Army

• Chaos Monkey

• Latency Monkey

• Conformity Monkey

• Janitor Monkey (and more)

http://www.infoq.com/presentations/netflix-resiliency-failure-cloud

Chaos Monkey

Kills Running Instances

• Simulates failures inherent to running in the cloud

• In Production
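The core idea, picking running instances at random and killing them, can be sketched as below. This is not the real Chaos Monkey code: `pick_victims`, the group-to-instances mapping, and the per-group probability are assumptions made for illustration, and actually terminating an instance is left out.

```python
import random
from typing import Dict, List, Optional


def pick_victims(
    groups: Dict[str, List[str]],
    probability: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Choose at most one instance per group to terminate.

    `groups` maps a group name to its running instance ids. Each non-empty
    group is selected with the given probability, and one of its instances
    is then chosen uniformly at random.
    """
    rng = rng or random.Random()
    victims: Dict[str, str] = {}
    for group, instances in groups.items():
        if instances and rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims
```

Limiting the choice to one instance per group keeps each experiment small: the service should survive losing a single instance, and the run shows whether it does.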

Latency Monkey

Introduces Latency between services

Conformity Monkey

Have Deployments Diverged?

• Balance Regional Consistency with Regional Isolation

• Build Best Practices into Tooling and Reporting
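One simple reading of "Have Deployments Diverged?" is to compare the version deployed in each region against the most common one. The sketch below does only that; `diverged_regions` and the version strings are hypothetical, and it says nothing about how Conformity Monkey itself works.

```python
from collections import Counter
from typing import Dict


def diverged_regions(versions_by_region: Dict[str, str]) -> Dict[str, str]:
    """Return the regions whose deployed version differs from the most common one."""
    if not versions_by_region:
        return {}
    # The most common version is treated as the reference (an assumption).
    majority, _count = Counter(versions_by_region.values()).most_common(1)[0]
    return {
        region: version
        for region, version in versions_by_region.items()
        if version != majority
    }
```

For example, with `{"us-east-1": "1.4", "us-west-2": "1.4", "eu-west-1": "1.3"}` the function reports `{"eu-west-1": "1.3"}`. A report like this shows divergence without forcing regions to deploy in lockstep.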

Janitor Monkey

Reduce Cognitive Load and Cost

• Remove unused instances

• Uniform way to clean up
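A uniform cleanup rule could look like the sketch below. The rule itself (an instance outside any ASG that is older than a few days) and the names `Instance` and `cleanup_candidates` are assumptions for illustration, not Janitor Monkey's actual rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Instance:
    """Hypothetical record describing a running instance."""
    instance_id: str
    in_asg: bool
    launched_at: datetime


def cleanup_candidates(
    instances: List[Instance],
    now: datetime,
    max_age: timedelta = timedelta(days=3),
) -> List[str]:
    """Return ids of instances that look unused under the assumed rule.

    An instance is a candidate if it belongs to no ASG and was launched
    more than `max_age` ago.
    """
    return [
        inst.instance_id
        for inst in instances
        if not inst.in_asg and now - inst.launched_at > max_age
    ]
```

Returning candidates instead of deleting them right away leaves room to notify owners first.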

Shifting the Curve with Tooling

• Value Self-Service

• Test Everywhere

• Awareness of Multiple Regions

• Best Practices Represented in Tooling

• Recover Quickly and Easily

• Be Cloud Native

Shifting the Curve with Culture

• Context not Control

• Freedom to Experiment

• Blameless Culture

ArsTechnica, November 2012

“As the number of applications and the scale of the campaign's AWS infrastructure use climbed, the DevOps team shifted to using Asgard—an open-source tool developed by Netflix to manage cloud deployments.”

Thanks!

Dianne Marsh (@dmarsh)

dmarsh@netflix.com
