lessons from etsy: avoiding kitchen nightmares - #chefconf 2012

Upload: patrick-mcdonnell

Post on 20-May-2015

DESCRIPTION

Talk by Patrick McDonnell (@mcdonnps) at #ChefConf 2012. Chef makes it so easy to change configuration en masse that it can be dangerous if not used with certain precautions and in accordance with a well-thought-out testing workflow. In our use of Chef at Etsy, we have devised many in-house best practices in response to failures, which have helped greatly in avoiding catastrophic outages. This talk will focus on mistakes we've made and how we've avoided repeating them by enforcing standards in cookbooks, testing changes before rollout through the use of environments in conjunction with the Spork plugin for Knife, and linting cookbooks with Foodcritic. I'll also talk about using handlers intelligently to monitor Chef runs and how to generate reports from the myriad data available in CouchDB.

TRANSCRIPT

Page 1: Lessons from Etsy: Avoiding Kitchen Nightmares - #ChefConf 2012

Image Service Outage

postrotate
    /bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true

NEVER TEST IN PRODUCTION!

It only takes one tiny mistake

How Do You Enforce This?

• Documented standards and communicated best practices

• Robust testing workflow

  • Environments

  • Knife Plugins

• Linting with rules derived from standards

  • Foodcritic

Testing Workflow

How We Use Environments

• Three environments: production, development, testing

• Testing is unconstrained

• Test nodes are depooled and “flipped” to the testing environment, then repooled and analyzed

• Test nodes are then flipped back to production

Working with Environments

• knife-flip by Etsy engineer Jon Cowie (https://github.com/jonlives/knife-flip)

% knife node flip somenode.etsy.com testing

% knife role flip SomeRole testing

• knife-bulkchangeenvironment (https://github.com/jonlives/knife-bulkchangeenvironment)

% knife node bulk_change_environment testing production

Keeping Environments in Sync

• knife-env-diff by Etsy engineer John Goulah

• Get it at https://github.com/jgoulah/knife-env-diff

% knife environment diff development production

diffing environment development against production

cookbook: hadoop
 development version: = 0.1.0
 production version: = 0.1.8

cookbook: mysql
 development version: = 0.2.4
 production version: = 0.2.5
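
The comparison above amounts to diffing the cookbook version constraints of two environments. A minimal Ruby sketch of that idea follows, assuming each environment's constraints are already loaded as a hash. The diff_constraints helper is hypothetical and is not taken from knife-env-diff; in the sample data, hadoop and mysql mirror the output above, while ldap is an added illustrative entry.

```ruby
# Hypothetical helper (not part of knife-env-diff): compare the cookbook
# version constraints of two environments and return every cookbook whose
# constraint differs, as { cookbook => [left_constraint, right_constraint] }.
def diff_constraints(left, right)
  (left.keys | right.keys).sort.each_with_object({}) do |cookbook, diff|
    next if left[cookbook] == right[cookbook]
    diff[cookbook] = [left[cookbook], right[cookbook]]
  end
end

# Illustrative data; the ldap entry is an assumption added for contrast.
development = { "hadoop" => "= 0.1.0", "mysql" => "= 0.2.4", "ldap" => "= 0.1.27" }
production  = { "hadoop" => "= 0.1.8", "mysql" => "= 0.2.5", "ldap" => "= 0.1.27" }

diff_constraints(development, production).each do |cookbook, (dev, prod)|
  puts "cookbook: #{cookbook}"
  puts " development version: #{dev || 'none'}"
  puts " production version: #{prod || 'none'}"
end
```

Cookbooks pinned identically in both environments are skipped, so only drift between the two is reported.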

Introducing Knife Spork

• Knife plugin providing a testing/versioning workflow

• Authored by Jon Cowie

• Get it at https://github.com/jonlives/knife-spork

Spork Features

• Four-stage process

  • Check: Look at versioning info for a cookbook

  • Bump: Automatically increment the cookbook’s version number

  • Upload: Knife upload and freeze

  • Promote: Set environment constraints equal to specified version
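
As a rough illustration of the Bump stage, the Ruby sketch below increments the patch level of a version string and rewrites the version line of a metadata.rb body. The bump_patch and bump_metadata helpers are hypothetical names chosen for this example; this is not knife-spork's actual implementation.

```ruby
# Hypothetical helper: increment the patch level of a "major.minor.patch" string.
def bump_patch(version)
  # Base 10 is given explicitly so a part such as "08" is not read as octal.
  major, minor, patch = version.split(".").map { |part| Integer(part, 10) }
  "#{major}.#{minor}.#{patch + 1}"
end

# Hypothetical helper: rewrite the version line of a metadata.rb body,
# passed in as a string, leaving every other line untouched.
def bump_metadata(metadata)
  metadata.sub(/^(version\s+["'])(\d+\.\d+\.\d+)(["'])/) do
    "#{$1}#{bump_patch($2)}#{$3}"
  end
end

metadata = %(name "foodcritic"\nversion "0.0.4"\n)
puts bump_metadata(metadata)
```

Only the patch component changes here; a major or minor bump would need its own rule for resetting the lower components.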

Spork Config

• /path/to/chef-repo/config/spork-config.yml

• /etc/spork-config.yml

• ~/.chef/spork-config.yml

git:
  enabled: true
irccat:
  enabled: true
  server: irccat.mycompany.com
  port: 12345
  channel: "#chef"
graphite:
  enabled: true
  server: graphite.mycompany.com
  port: 2003
gist:
  enabled: true
  in_chef: true
  chef_path: cookbooks/gist/files/default/gist
  path: /usr/bin/gist
foodcritic:
  enabled: true
  fail_tags: [any]
  tags: [foo]
  include_rules: [/home/me/myrules]
default_environments: [ production, development ]

% knife spork check foodcritic

Checking versions for cookbook foodcritic...

Current local version: 0.0.4

Remote versions (Max. 5 most recent only):
*0.0.4, frozen
0.0.3, frozen
0.0.2, unfrozen
0.0.1, frozen

DANGER: Your local cookbook has same version number as the starred version above!

Please bump your local version or you won't be able to upload.

% knife spork bump foodcritic

Loaded config file /home/pmcdonnell/git/chef-repo/config/spork-config.yml...

Loaded config file /etc/spork-config.yml...

Pulling latest changes from git

Pulling latest changes from git submodules (if any)

Bumping patch level of the foodcritic cookbook from 0.0.4 to 0.0.5

Git add'ing /home/pmcdonnell/git/chef-repo/cookbooks/foodcritic/metadata.rb

% knife spork upload foodcritic

Loaded config file /home/pmcdonnell/git/chef-repo/config/spork-config.yml...

Loaded config file /etc/spork-config.yml...

Uploading and freezing foodcritic [0.0.5]

upload complete

% knife spork promote foodcritic --remote

Pulling latest changes from git

Checking that foodcritic version 0.0.5 exists on the server before promoting (any error means it hasn't been uploaded yet)...

foodcritic version 0.0.5 found on server!

Environment: production
Adding version constraint foodcritic = 0.0.5
Saving changes into production.json

Git add'ing /home/pmcdonnell/git/chef-repo/environments/production.json

Uploading production to server

WARNING: You're about to promote changes to several cookbooks:
logrotate: = 0.1.24 changed to = 0.1.23
foodcritic: = 0.0.4 changed to = 0.0.5

Are you sure you want to continue? (Y/N) n

You said no, so I'm done here.

Would you like to reset your local production.json to match the server?? (Y/N) y

Git add'ing /home/pmcdonnell/git/chef-repo/environments/production.json

production.json reset.
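
In the warning above, logrotate would move backwards from 0.1.24 to 0.1.23, which is the kind of unintended change the prompt is there to catch. A minimal Ruby sketch of flagging such downgrades automatically is shown here, using Gem::Version from RubyGems for the comparison. The downgrades helper is hypothetical, and nothing in the slides says knife-spork performs this check itself.

```ruby
require "rubygems" # Gem::Version; already loaded by default in current Ruby

# Hypothetical check: given { cookbook => [old_constraint, new_constraint] },
# return the cookbooks whose pinned version would move backwards.
# Assumes every constraint is an exact pin of the form "= x.y.z".
def downgrades(changes)
  changes.select do |_cookbook, (from, to)|
    # Strip the "= " prefix so only the version number is compared.
    Gem::Version.new(to.delete("= ")) < Gem::Version.new(from.delete("= "))
  end.keys
end

# The two changes listed in the warning above.
changes = {
  "logrotate"  => ["= 0.1.24", "= 0.1.23"],
  "foodcritic" => ["= 0.0.4", "= 0.0.5"],
}

puts downgrades(changes).inspect
```

Comparing with Gem::Version rather than as plain strings keeps versions such as 0.1.9 and 0.1.10 in the right order.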

Spork’s Logging Mechanisms

• Irccat: Logs to IRC channel (https://github.com/RJ/irccat)

Environment production uploaded at 2012-05-15 18:35:42 UTC by pmcdonnell

Constraints updated on server in this version:
ldap: = 0.1.26 changed to = 0.1.27

• Gist: Added to irccat notifications on promote --remote

• Graphite: promote --remote sends to deploys.chef metric

[11:35:33] <irccat> CHEF: pmcdonnell uploaded and froze cookbook ldap version 0.1.27
[11:35:43] <irccat> CHEF: pmcdonnell uploaded environment production https://github.etsycorp.com/gist/376967
[11:35:43] <irccat> CHEF: pmcdonnell uploaded environment development https://github.etsycorp.com/gist/376968

Linting

Foodcritic

• A lint tool for Chef cookbooks written by Andrew Crump (http://acrmp.github.com/foodcritic/)

• Comes with a good set of default rules and is very easily extensible

• To enable in spork config:

foodcritic:
  enabled: true
  fail_tags: [any]
  tags: [foo]
  include_rules: [/home/me/myrules]

Etsy’s Rules

• A work in progress, but newly open-sourced at https://github.com/etsy/foodcritic-rules

• Our rules are “style”-tagged rules that serve to enforce what we consider to be best practices in our environment

• ETSY001 - Package or yum_package resource used with :upgrade action

• ETSY002 - Execute resource used to run git commands

• ETSY003 - Execute resource used to run curl or wget commands

• ETSY004 - Execute resource defined without conditional or action :nothing

• ETSY005 - Action :restart sent to a core service

Rule Resulting from Image Outage

• ETSY005 - Action :restart sent to a core service

• Trippable services include httpd, mysql, memcached, postgresql-server

% foodcritic -t etsy -I ~/git/chef-repo/config/rules.rb ~/git/chef-repo/cookbooks/apache

ETSY005: Action :restart sent to a core service: /home/pmcdonnell/git/chef-repo/cookbooks/apache/recipes/default.rb:39

Rule Resulting from Image Outage

30 template "/etc/httpd/conf/httpd.conf" do
31   source "httpd-conf.erb"
32   owner "root"
33   group "root"
34   mode 00644
35   variables(
36     :fqdn => node[:fqdn],
37     :port => "80"
38   )
39   notifies :restart, resources(:service => "httpd")
40 end
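
To make the idea behind ETSY005 concrete, here is a simplified Ruby sketch that scans recipe source for :restart notifications aimed at a core service. It is a plain regular-expression scan written for illustration, not the Foodcritic rule DSL and not Etsy's published rule (Foodcritic rules match against the parsed recipe rather than raw lines). The restart_violations helper is a hypothetical name; the service list comes from the slide above.

```ruby
# Services treated as "core", taken from the slide above.
CORE_SERVICES = %w[httpd mysql memcached postgresql-server].freeze

# Hypothetical, simplified check: return the 1-based line numbers on which a
# :restart notification targets a core service. This only recognises the
# resources(:service => "name") notation used in the recipe above.
def restart_violations(recipe_source)
  pattern = /notifies\s+:restart,\s*resources\(\s*:service\s*=>\s*["']([^"']+)["']\s*\)/
  recipe_source.each_line.with_index(1).filter_map do |line, number|
    match = pattern.match(line)
    number if match && CORE_SERVICES.include?(match[1])
  end
end

# A shortened version of the template resource shown above.
recipe = <<~RECIPE
  template "/etc/httpd/conf/httpd.conf" do
    source "httpd-conf.erb"
    notifies :restart, resources(:service => "httpd")
  end
RECIPE

puts restart_violations(recipe).inspect
```

A line-based scan like this would miss notifications written in other notations or split across lines, which is one reason a lint tool that parses the recipe is the more robust choice.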

Memcache Outage

02:27 < jallspaw> [Sat, 10 Jul 2010 01:45:01 +0000] INFO: Upgrading package[memcached] version from 1.4.2-1.fc10 to 1.4.5-1.el5

Don’t leave “known unknowns” lying in wait

Resulting Foodcritic Rule

• ETSY001 - Package or yum_package resource used with :upgrade action

• Enforces always using :install

% foodcritic -t etsy -I ~/git/chef-repo/config/rules.rb ~/git/chef-repo/cookbooks/memcache

ETSY001: Package or yum_package resource used with :upgrade action: /home/pmcdonnell/git/chef-repo/cookbooks/memcache/recipes/default.rb:20

Resulting Foodcritic Rule

20 package "memcached" do
21   action :upgrade
22 end

Changed to:

20 package "memcached" do
21   version "1.4.2-1.fc10"
22   action :install
23 end

Reporting and Monitoring

Using Handlers

• Etsy’s handlers (https://github.com/etsy/chef-handlers)

• Log failures to IRC

• Graph aggregated metrics with Graphite

• Graph chef “deploys”

[10:52:03] <irccat> Chef run failed on dev-dbtasks01.ny4dev.etsy.com
[10:52:03] <irccat> https://github.etsycorp.com/gist/371229

Graph with Graphite

• Metrics reporting made possible by knife-lastrun, authored by John Goulah (https://github.com/jgoulah/knife-lastrun)

• Provides a handler and knife plugin for reporting on the most recent chef run, storing data as node attributes

• Elapsed, starting, and ending time

• Exit code status

• Backtrace/exception information
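
As a sketch of how run data like this could reach Graphite, the Ruby below formats one run's elapsed time as a line in Graphite's plaintext protocol ("metric.path value timestamp"). The graphite_line helper and the chef.runs prefix are assumptions made for this example; they are not the metric names or code used by Etsy's handlers or by knife-lastrun. The sample values are rounded from the lastrun output shown later in the deck.

```ruby
# Hypothetical formatter: turn one run's data into a Graphite plaintext line
# of the form "metric.path value timestamp\n". The "chef.runs" prefix is an
# illustrative assumption.
def graphite_line(node_name, elapsed, end_time, prefix: "chef.runs")
  # Dots separate path components in Graphite, so replace them in the hostname.
  safe_name = node_name.tr(".", "_")
  "#{prefix}.#{safe_name}.elapsed #{elapsed} #{end_time.to_i}\n"
end

line = graphite_line("sandboxmisc01.ny4.etsy.com", 211.78604,
                     Time.utc(2012, 5, 15, 7, 46, 50))
puts line

# Sending it would be a plain TCP write to the Graphite listener, for example:
#   require "socket"
#   TCPSocket.open("graphite.mycompany.com", 2003) { |socket| socket.write(line) }
```

The network write is left commented out so the sketch runs without a Graphite server; the host and port are the placeholder values from the Spork config slide.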

% dsh -g all -c -M 'grep "Chef Run complete in" /var/log/chef/client.log | head -n 3' 2>&1 | tee /tmp/tee && grep 'Chef Run complete' /tmp/tee | sort -n -k +13 | tail -5

dn0035.doop: [Mon, 14 May 2012 03:21:07 +0000] INFO: Chef Run complete in 512.936813012 seconds
dn0004.doop: [Mon, 14 May 2012 04:28:03 +0000] INFO: Chef Run complete in 677.423964906 seconds
dn0006.doop: [Mon, 14 May 2012 04:29:51 +0000] INFO: Chef Run complete in 770.231469266 seconds
dn0025.doop: [Mon, 14 May 2012 04:26:13 +0000] INFO: Chef Run complete in 787.183615612 seconds
dn0030.doop: [Mon, 14 May 2012 04:30:42 +0000] INFO: Chef Run complete in 848.586507872 seconds
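
The same "slowest runs" ranking can be sketched in Ruby by parsing the elapsed time out of each collected log line, which avoids depending on field positions in sort. The slowest_runs helper and the RUN_COMPLETE pattern are hypothetical names for this example; the sample lines are copied from the output above.

```ruby
# Matches lines of the form "host: [...] INFO: Chef Run complete in X seconds".
RUN_COMPLETE = /\A(?<host>[^:]+):.*Chef Run complete in (?<seconds>\d+(?:\.\d+)?) seconds/

# Hypothetical helper: return the n slowest runs as [host, seconds] pairs,
# slowest first. Lines that do not match the pattern are ignored.
def slowest_runs(lines, n = 5)
  lines.filter_map { |line|
    match = RUN_COMPLETE.match(line)
    [match[:host], match[:seconds].to_f] if match
  }.max_by(n) { |_host, seconds| seconds }
end

lines = [
  "dn0035.doop: [Mon, 14 May 2012 03:21:07 +0000] INFO: Chef Run complete in 512.936813012 seconds",
  "dn0030.doop: [Mon, 14 May 2012 04:30:42 +0000] INFO: Chef Run complete in 848.586507872 seconds",
  "dn0004.doop: [Mon, 14 May 2012 04:28:03 +0000] INFO: Chef Run complete in 677.423964906 seconds",
]

slowest_runs(lines, 2).each { |host, seconds| puts "#{host} #{seconds}" }
```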

Finding Run Time Outliers

• Knife doesn’t currently support Lucene’s NumericRangeQuery

• Elapsed time is a floating point number, but we can only match it as a string due to query limitations in knife

• Work around it with knife search -a

% knife search node 'elapsed:[200 TO 225]' -a lastrun.runtimes.elapsed

4 items found

id: cent6-vmtemplate.ny4dev.etsy.com
lastrun.runtimes.elapsed: 21.642378406

id: sandboxmisc01.ny4.etsy.com
lastrun.runtimes.elapsed: 211.749555

id: smardenfeld.vm.ny4dev.etsy.com
lastrun.runtimes.elapsed: 22.184596

id: bob0120.vm.ny4dev.etsy.com
lastrun.runtimes.elapsed: 21.348335354
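
The search returns nodes with runs of about 21 and 22 seconds alongside the one real outlier, which is what a string comparison of the range bounds would produce. The Ruby sketch below reproduces that effect with the four values above and then applies the numeric filter that the -a output makes possible. It only illustrates the string-versus-number difference, under the assumption that the range is compared character by character; it does not show how the Chef server evaluates the query.

```ruby
# Elapsed times from the search output above, kept as strings.
elapsed = {
  "cent6-vmtemplate.ny4dev.etsy.com" => "21.642378406",
  "sandboxmisc01.ny4.etsy.com"       => "211.749555",
  "smardenfeld.vm.ny4dev.etsy.com"   => "22.184596",
  "bob0120.vm.ny4dev.etsy.com"       => "21.348335354",
}

# String comparison: "21.6..." sorts between "200" and "225" because the
# comparison runs character by character, not by numeric value.
string_matches = elapsed.select { |_node, value| value.between?("200", "225") }

# Numeric comparison after converting to Float: only the true outlier remains.
numeric_matches = elapsed.select { |_node, value| value.to_f.between?(200, 225) }

puts "string range matches: #{string_matches.size}"
puts "numeric range matches: #{numeric_matches.keys.inspect}"
```

Post-filtering the attribute values numerically, as in the second select, is the step that turns the broad string match into a usable outlier list.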

% knife node lastrun sandboxmisc01.ny4.etsy.com

Status        failed
Elapsed Time  211.78604
Start Time    2012-05-15 07:43:18 +0000
End Time      2012-05-15 07:46:50 +0000

Backtrace
Omitted for brevity

Exception
Chef::Exceptions::Package: package[diffutils] (installerz::diffutils line 1) had an error: Yum failed - #<Process::Status: pid 21293 exit 1> - returns: ["yum-dump Repository Error: Cannot retrieve repository metadata (repomd.xml) for repository: PostgreSQL-8.3-x86_64. Please verify its path and try again\n"]

What Did Chef Just Do?

• chefrecentupdates by Etsy engineer Laurie Denness (https://github.com/lozzd/ChefScripts)

% chefrecentupdates
...
1 resources updated in /var/log/chef/client.log-20120505.gz:
[Fri, 04 May 2012 17:49:42 +0000] INFO: cookbook_file[/usr/bin/gist]
...

Preventative Measures

Knife Preflight

% knife preflight memcache::datacache

Searching for nodes containing memcache::datacache in their expanded run_list...
4 Nodes found

datacache03.ny4.etsy.com
datacache04.ny4.etsy.com
datacache01.ny4.etsy.com
datacache02.ny4.etsy.com

Searching for roles containing memcache::datacache in their run_list...
1 Roles found

Datacache

Found 4 nodes and 1 roles using the specified search criteria

• By Jon Cowie (https://github.com/jonlives/knife-preflight)

Continuous Chef

• Using Jenkins and base virtual machine images

“Out-of-Band” Management

• dsh (distributed shell) works even if Chef server is down

• Etsy’s dsh groups are managed by Chef and generated from the list of nodes corresponding to each role

Configs Bundled with Packages

• Be careful with configs distributed with packages overwriting Chef configs

• They must be replaced by Chef before restarting services, so watch out for resource order

Jon will be at Velocity!

• Workshop: Michelin Starred Cooking with Chef

• 11:00am Monday, 06/25/2012

• Topics

• Team-wide familiarity and understanding

• Critical approach and experimentation with workflows

• Plugin writing 101

We’re Hiring!

• TONS of engineering positions open!

• Especially looking for a talented network engineer; referrals welcome!

http://www.etsy.com/careers