How Netflix thinks of DevOps. Spoiler: we don’t

Dianne Marsh Director of Engineering @dmarsh

Upload: dianne-marsh

Post on 15-Jan-2017


TRANSCRIPT

Page 1: How Netflix thinks of DevOps. Spoiler: we don’t

Dianne Marsh Director of Engineering

@dmarsh

Page 2: How Netflix thinks of DevOps. Spoiler: we don’t

DevOps

Photo Credit: https://www.facebook.com/theprincessbride/photos_stream

Page 3: How Netflix thinks of DevOps. Spoiler: we don’t

DevOps in Three Acts

Page 4: How Netflix thinks of DevOps. Spoiler: we don’t

Driven by Scale

Page 5: How Netflix thinks of DevOps. Spoiler: we don’t

Empowered by Culture

Page 6: How Netflix thinks of DevOps. Spoiler: we don’t

Supported by Tools

Page 7: How Netflix thinks of DevOps. Spoiler: we don’t

Approaching Global Reach

October - Spain, Portugal, Italy
Early 2016 - Korea, Taiwan, Singapore, Hong Kong

65m members → 100m
~60 countries → 200

Page 8: How Netflix thinks of DevOps. Spoiler: we don’t

Netflix ecosystem
• 100s of microservices
• 1000s of daily production changes
• 10,000s of instances
• 100,000s of customer interactions/minute
• 1,000,000s of customers
• 1,000,000,000s of metrics
• 10,000,000,000 hours of streamed

Page 9: How Netflix thinks of DevOps. Spoiler: we don’t

Yet …
• 10s of Operations Engineers
• No NOC

Page 10: How Netflix thinks of DevOps. Spoiler: we don’t

You Build It, You Run It

Page 11: How Netflix thinks of DevOps. Spoiler: we don’t

Outages

24/7

Page 12: How Netflix thinks of DevOps. Spoiler: we don’t

• Developers
• Critical Operations/Reliability Engineering team (CORE)
• Crisis Response Manager

   

Page 13: How Netflix thinks of DevOps. Spoiler: we don’t
Page 14: How Netflix thinks of DevOps. Spoiler: we don’t

“Get rid of the safeguards. Enable the most knowledgeable people to do their job effectively.”

Page 15: How Netflix thinks of DevOps. Spoiler: we don’t

Blameless Culture

Page 16: How Netflix thinks of DevOps. Spoiler: we don’t

Production Ready

• Identify critical services
• Provide context, assistance
• Keep number small

Page 17: How Netflix thinks of DevOps. Spoiler: we don’t

Conformity Monkey
• Identify best practices
• Notify service owners

Page 18: How Netflix thinks of DevOps. Spoiler: we don’t

Automation and Tools

Page 19: How Netflix thinks of DevOps. Spoiler: we don’t

It’s Complicated …

Page 20: How Netflix thinks of DevOps. Spoiler: we don’t
Page 21: How Netflix thinks of DevOps. Spoiler: we don’t

Common Runtime Services and Libraries

Eureka
Ribbon
Hystrix
Zuul

Page 22: How Netflix thinks of DevOps. Spoiler: we don’t

Hystrix: Automate Recovery
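
Hystrix is a Java library built around the circuit-breaker pattern with fallbacks. The sketch below illustrates that general pattern in Python; it is not Hystrix's API, and the class name, thresholds, and `clock` parameter are hypothetical choices made for the example.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: fail fast to a fallback after repeated errors."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # errors before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency until the timeout has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Recovery is automatic in the sense that callers keep getting a fallback while the dependency is unhealthy, and traffic resumes on its own once a trial call succeeds.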

Page 23: How Netflix thinks of DevOps. Spoiler: we don’t

Delivery Tools

Aminator
Spinnaker

Page 24: How Netflix thinks of DevOps. Spoiler: we don’t

• Cloud Management
• Delivery Engine
• Automation Platform

Page 25: How Netflix thinks of DevOps. Spoiler: we don’t

Global Cloud Management

Page 26: How Netflix thinks of DevOps. Spoiler: we don’t

Delivery Pipelines

Page 27: How Netflix thinks of DevOps. Spoiler: we don’t

Automated Global Delivery

Page 28: How Netflix thinks of DevOps. Spoiler: we don’t

Insight

Atlas
Edda
Vector

Page 29: How Netflix thinks of DevOps. Spoiler: we don’t

Atlas: Telemetry Platform

Page 30: How Netflix thinks of DevOps. Spoiler: we don’t

Insight

Page 31: How Netflix thinks of DevOps. Spoiler: we don’t

Insight (Dashboards)

Page 32: How Netflix thinks of DevOps. Spoiler: we don’t

What did you expect?

Page 33: How Netflix thinks of DevOps. Spoiler: we don’t

Been Throttled?

Page 34: How Netflix thinks of DevOps. Spoiler: we don’t

Performance Monitoring

Page 35: How Netflix thinks of DevOps. Spoiler: we don’t

Vector

Page 36: How Netflix thinks of DevOps. Spoiler: we don’t

Anomaly Detection

• DES on time series data
• Predict the future based on history
• Favor recent history
• Threshold-based alerts
• 6-8 minute delay

Alert!
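
A minimal sketch of the approach on this slide, assuming "DES" stands for double exponential smoothing (Holt's linear method). The function names, the initialization from the first two points, and the default `alpha` and `beta` are illustrative assumptions, not details from the talk.

```python
def des_one_step(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts from double exponential smoothing.

    Returns {t: forecast for series[t]}, where each forecast uses only
    series[:t]. Larger alpha/beta weight recent history more heavily.
    """
    if len(series) < 2:
        raise ValueError("need at least two points to initialize")
    level = series[1]
    trend = series[1] - series[0]
    predictions = {}
    for t in range(2, len(series)):
        predictions[t] = level + trend  # made before seeing series[t]
        prev_level = level
        level = alpha * series[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return predictions


def threshold_alerts(series, threshold, alpha=0.5, beta=0.5):
    """Indices where the actual value is further than `threshold` from its forecast."""
    predictions = des_one_step(series, alpha, beta)
    return [t for t, p in predictions.items() if abs(series[t] - p) > threshold]
```

For example, `threshold_alerts([10, 12, 14, 16, 5, 20], threshold=5)` flags the sudden drop at index 4, because the forecast there continues the earlier upward trend.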

Page 37: How Netflix thinks of DevOps. Spoiler: we don’t
Page 38: How Netflix thinks of DevOps. Spoiler: we don’t

Finer Granularity, Shorter Time Windows

Page 39: How Netflix thinks of DevOps. Spoiler: we don’t

Ensemble Learning

Page 40: How Netflix thinks of DevOps. Spoiler: we don’t

Median Absolute Deviation

IQR

Least Squares

HDI

Voting
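
The voting idea can be sketched as several independent detectors whose flags are combined by majority. The sketch below implements three of the listed techniques (median absolute deviation, IQR, and least-squares residuals) and omits HDI; all cutoffs and function names are assumptions chosen for illustration, not values from the talk.

```python
import statistics


def mad_flags(values, cutoff=3.5):
    """Flag points whose MAD-based modified z-score exceeds `cutoff`."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)  # no spread to measure against
    return [abs(0.6745 * (v - med) / mad) > cutoff for v in values]


def iqr_flags(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in values]


def least_squares_flags(values, k=2.0):
    """Flag points whose residual from a fitted line exceeds k residual std devs."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    residuals = [y - (intercept + slope * t) for t, y in enumerate(values)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return [False] * n
    return [abs(r) > k * spread for r in residuals]


def ensemble_outliers(values, min_votes=2):
    """Indices flagged by at least `min_votes` of the detectors."""
    votes = zip(mad_flags(values), iqr_flags(values), least_squares_flags(values))
    return [i for i, flags in enumerate(votes) if sum(flags) >= min_votes]
```

Requiring agreement between detectors trades a little sensitivity for fewer false alarms, since each technique has different blind spots.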

Page 41: How Netflix thinks of DevOps. Spoiler: we don’t

Alert Sooner

Alert!

From 6-8 minutes to < 1 minute

Page 42: How Netflix thinks of DevOps. Spoiler: we don’t

Action was an Alert

Page 43: How Netflix thinks of DevOps. Spoiler: we don’t

Getting the Humans Out of the Equation is BETTER

Page 44: How Netflix thinks of DevOps. Spoiler: we don’t

Outlier Detection & Remediation

Page 45: How Netflix thinks of DevOps. Spoiler: we don’t

Kepler
• Unsupervised machine learning
• Density-based clustering algorithm
• Actions
  – Email, page
  – OOS, detach, terminate
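
DBSCAN is one well-known density-based clustering algorithm; whether Kepler uses exactly this variant is an assumption here. The sketch below finds the points DBSCAN would label as noise, meaning points that are neither in a dense neighborhood nor adjacent to one. The function name, `eps`, and `min_pts` are illustrative.

```python
import math


def dbscan_noise(points, eps, min_pts):
    """Return indices of points that DBSCAN would label as noise.

    A point is a core point if at least `min_pts` points (itself included)
    lie within `eps` of it. Noise points are neither core points nor
    within `eps` of any core point.
    """
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    return [
        i
        for i, nb in enumerate(neighborhoods)
        if i not in core and not any(j in core for j in nb)
    ]
```

With each server represented as a tuple of normalized metrics, a server returned by `dbscan_noise` would be a candidate for the actions listed above, starting with the least disruptive.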

Page 46: How Netflix thinks of DevOps. Spoiler: we don’t

An ounce of prevention…

Page 47: How Netflix thinks of DevOps. Spoiler: we don’t

Canary Release Process

Diagram: Customers, Load Balancer, Metrics
• Old Version (v1.0): 100 Servers, 95%
• New Version (v1.1): 5 Servers, 5%

Page 48: How Netflix thinks of DevOps. Spoiler: we don’t

Canary Release Process

Diagram: Customers, Load Balancer, Metrics
• Old Version (v1.0): 0 Servers
• New Version (v1.1): 100 Servers, 100%

Page 49: How Netflix thinks of DevOps. Spoiler: we don’t

Automated Canary Analysis

Define
• Metrics
• A threshold

Every n minutes
• Classify metrics
• Compute score
• Make a decision
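
The classify, score, decide loop above can be sketched as follows. This is a simplified illustration, not the actual analysis used at Netflix: it assumes every metric is "lower is better", classifies by comparing means against a relative tolerance, and uses hypothetical names and thresholds throughout.

```python
from statistics import mean


def classify_metric(baseline, canary, tolerance=0.10):
    """'pass' if the canary mean is at most `tolerance` above the baseline mean."""
    return "pass" if mean(canary) <= mean(baseline) * (1 + tolerance) else "fail"


def canary_score(baseline_metrics, canary_metrics, tolerance=0.10):
    """Return (percentage of passing metrics, per-metric labels)."""
    labels = {
        name: classify_metric(baseline_metrics[name], canary_metrics[name], tolerance)
        for name in baseline_metrics
    }
    passed = sum(1 for label in labels.values() if label == "pass")
    return 100.0 * passed / len(labels), labels


def decide(score, threshold=80.0):
    """Continue the rollout only if the score meets the threshold."""
    return "continue" if score >= threshold else "rollback"
```

In a real pipeline this would run on a schedule (the "every n minutes" on the slide), with each run reading fresh samples for both the baseline and the canary.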

Page 50: How Netflix thinks of DevOps. Spoiler: we don’t

Chaos Engineering: the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Page 51: How Netflix thinks of DevOps. Spoiler: we don’t

Cluster A Cluster D

Edge Cluster

Cluster B

Cluster C

Imagine a monkey loose in your data center…
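
As a toy illustration of the idea, the sketch below removes one randomly chosen instance from each cluster in an in-memory model. It does not call any cloud API and is not the real Chaos Monkey; the function name and data shapes are hypothetical.

```python
import random


def unleash_monkey(clusters, rng=None):
    """Terminate one random instance in each non-empty cluster.

    `clusters` maps a cluster name to a list of instance ids and is
    modified in place. Returns {cluster name: terminated instance id}.
    """
    rng = rng or random.Random()
    terminated = {}
    for name, instances in clusters.items():
        if not instances:
            continue  # nothing to terminate in an empty cluster
        victim = rng.choice(instances)
        instances.remove(victim)
        terminated[name] = victim
    return terminated
```

Passing a seeded `random.Random` makes a run reproducible, which helps when an experiment needs to be replayed.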

Page 52: How Netflix thinks of DevOps. Spoiler: we don’t

A State of Xen – Chaos Monkey & Cassandra

• Xen Hypervisor vulnerability – 9/25/14
• 218 out of 2700+ Cassandra nodes rebooted
• 22 did not reboot successfully
• Automation recovered those

Page 53: How Netflix thinks of DevOps. Spoiler: we don’t

Fault-Injection Testing (FIT)

Diagram labels: Device, Internet, ELB, Edge (Zuul), FIT, Service A, Service B, Service C

• Simulate service failures
• Override by device or account
• % of member traffic
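
The three bullets above can be sketched as a small decision function: an explicit account or device override always injects the failure, and otherwise a stable hash places each account in a bucket so that only a chosen percentage of traffic is affected. Everything here (`FaultScenario`, `should_inject`, the bucketing scheme) is hypothetical and not FIT's actual interface.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FaultScenario:
    service: str                                 # service whose calls should fail
    percent: float = 0.0                         # share of member traffic, 0-100
    accounts: set = field(default_factory=set)   # explicit account overrides
    devices: set = field(default_factory=set)    # explicit device overrides


def bucket(account_id):
    """Map an account id to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def should_inject(scenario, service, account_id, device_id):
    """Decide whether this request should see a simulated failure."""
    if service != scenario.service:
        return False
    if account_id in scenario.accounts or device_id in scenario.devices:
        return True
    return bucket(account_id) < scenario.percent


def call_service(scenario, service, account_id, device_id, real_call):
    """Raise a simulated failure when selected; otherwise make the real call."""
    if should_inject(scenario, service, account_id, device_id):
        raise RuntimeError(f"injected failure for {service}")
    return real_call()
```

Hashing the account id rather than drawing a random number per request keeps a given member consistently inside or outside the experiment.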

Page 54: How Netflix thinks of DevOps. Spoiler: we don’t


Page 55: How Netflix thinks of DevOps. Spoiler: we don’t

More Chaos

• Monkey – Single Instance
• Gorilla – Availability Zone
• Kong – Region

Page 56: How Netflix thinks of DevOps. Spoiler: we don’t

US-East US-West

AZ1

EU-West

Global Traffic Management

Page 57: How Netflix thinks of DevOps. Spoiler: we don’t
Page 58: How Netflix thinks of DevOps. Spoiler: we don’t

Exercise Regularly

Page 59: How Netflix thinks of DevOps. Spoiler: we don’t

DevOps at Netflix

Page 60: How Netflix thinks of DevOps. Spoiler: we don’t

How do you think about DevOps?

Page 61: How Netflix thinks of DevOps. Spoiler: we don’t

Roll the Credits

Netflix.github.io

Dianne Marsh, Director of Engineering

dmarsh@netflix.com

@dmarsh