Disaster Recovery Strategies with Config Management


Upload: mandi-walls

Post on 08-May-2015

DESCRIPTION

Presented at CfgMgmtCamp, Ghent, BE. 3 FEB 2014.

TRANSCRIPT

Page 1: Disaster Recovery Strategies with Config Management

DR Strategies with CM

Mandi Walls

CfgMgmtCamp 3 FEB 2014

Monday, February 3, 14

Page 2: Disaster Recovery Strategies with Config Management

whoami

• Mandi Walls

• Technical Practice Manager, CHEF

[email protected]

• @lnxchk

Page 3: Disaster Recovery Strategies with Config Management

What is Disaster Recovery

http://www.flickr.com/photos/61617934@N03/6196510705/sizes/z/in/photostream/

Page 4: Disaster Recovery Strategies with Config Management

Reasons to Make DR Plans

• Your business insurance requires it

• Things are going to happen, whether you are ready or not

Page 5: Disaster Recovery Strategies with Config Management

Tornado Events in Loudoun County, VA

http://www.tornadohistoryproject.com/tornado/Virginia/Loudoun/map

Page 20: Disaster Recovery Strategies with Config Management

Hurricane Sandy, NYC, October 2012

Photo: Iwan Baan and New York Magazine

33 Whitehall

375 Pearl

65 Broadway

25 Broadway

My Apartment

Bitches in BPC with newer infrastructure

75 Broad

121 Varick

60 Hudson

111 8th

Page 21: Disaster Recovery Strategies with Config Management

Current State of DR

• Event horizon for modern DR was 9/11

• Same neighborhood as Hurricane Sandy

• Most of the literature reflects the state of IT at that time

Page 22: Disaster Recovery Strategies with Config Management

Goals of DR Planning

• Name staff and services that are key to business continuity

• Provide clear guidance for making decisions in real time

• Set rules for escalation, communication, participation

• Document all of these things, publish the results, keep them updated on a regular basis

Page 23: Disaster Recovery Strategies with Config Management

Advantages of CM when Planning DR

• Topology and service definition

• Settings and relationships

• Documentation

• Tooling and workflows

Page 24: Disaster Recovery Strategies with Config Management

Old Rules that Still Apply

• Accessible off-site backups, with periodically tested restores

• Documentation should also be available if your normal services are not

• Documents need to be updated on a regular schedule, and personnel should be trained on their potential roles

Page 25: Disaster Recovery Strategies with Config Management

New Rules

http://www.flickr.com/photos/26058810@N02/5650149188/sizes/z/in/photostream/

Page 26: Disaster Recovery Strategies with Config Management

Rule 1: Your availability is your responsibility

• Cloud / managed hosting allows us to outsource a number of worries

  • Bandwidth, power, cooling

• That’s awesome, but does your vendor care as much about your customers or users as you do?

• You must assess your tolerance for risk vs cost

• No longer entirely dependent on getting budget for full scale “DR sites”

Page 27: Disaster Recovery Strategies with Config Management

Rule 1: To the Cloud!

• Justifying DR planning is much easier without justifying massive quantities of capital for emergency capacity

• If your applications are not tightly coupled to custom services by your IaaS provider, your flexibility in outage events is increased

• Commonly missed items include

  • Keeping passwords in a single location that may be inaccessible in outages

  • Not having the most correct information about operating systems or server capacities that will be needed, and how to translate among providers

  • Not engaging with security and network teams to ensure all access is ok

Page 28: Disaster Recovery Strategies with Config Management

Knife Plugins

$ knife rackspace server create (options)
$ knife linode server create (options)
$ knife ec2 server create (options)
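Since all three plugins share the same `server create` subcommand, a failover script can treat the provider as a parameter. The following is a minimal sketch in plain Ruby, not part of the talk: the preference order, the `pick_provider` helper, and the `knife_create_command` helper are all hypothetical. It only builds the command string and passes options through untouched, because each plugin defines its own flags.

```ruby
# Providers named on the slide; the ordering is a hypothetical preference list.
PROVIDERS = %w[ec2 rackspace linode].freeze

# Build (but do not run) the knife command for one provider.
# `options` is passed through as-is because each plugin has its own flags.
def knife_create_command(provider, options = [])
  raise ArgumentError, "unknown provider: #{provider}" unless PROVIDERS.include?(provider)

  (["knife", provider, "server", "create"] + options).join(" ")
end

# Return the first preferred provider that is not in the outage list,
# or nil when every provider is unavailable.
def pick_provider(unavailable)
  PROVIDERS.find { |p| !unavailable.include?(p) }
end

# Example: the primary provider is down, so fall through to the next one.
provider = pick_provider(["ec2"])
puts knife_create_command(provider)
```

Keeping the provider list in one place makes it easier to review which accounts and credentials need to stay current.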

Page 29: Disaster Recovery Strategies with Config Management

Rule 2: Assessing realistic risk

http://badassoftheweek.com/godzilla.html

• Do not bikeshed all possible events along all potential space-time continua

• Assess risk based on affected services

Page 30: Disaster Recovery Strategies with Config Management

Rule 2: Planning for the Extent of an Event

• Service level

• Datacenter level

• Regional level

• National level

Page 31: Disaster Recovery Strategies with Config Management

Service-Level and Datacenter-Level Events

• These are the easiest to deal with when you’re using CM!

• If your infrastructure is in code, move services to new blades of grass by redeploying

Page 32: Disaster Recovery Strategies with Config Management

Spiceweasel

• https://github.com/mattray/spiceweasel

• Define groups of infrastructure in Ruby, JSON, or YAML

• Spiceweasel will translate into knife commands to recreate the running infrastructure

Page 33: Disaster Recovery Strategies with Config Management

Spiceweasel

nodes:
- serverA:
    run_list: role[base]
    options: -i ~/.ssh/mray.pem -x user --sudo
- serverB serverC:
    run_list: role[base]
    options: -i ~/.ssh/mray.pem -x user --sudo -E production
- windows_winrm winboxA:
    run_list: role[base],role[iisserver]
    options: -x Administrator -P 'super_secret_password'
- windows_ssh winboxB winboxC:
    run_list: role[base],role[iisserver]
    options: -x Administrator -P 'super_secret_password'
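To make the translation idea concrete, here is a simplified sketch in plain Ruby of how a manifest like this could be expanded into one `knife bootstrap` command per host. This is an illustration only, not Spiceweasel's implementation: the `bootstrap_commands` helper is hypothetical, it handles only the two Linux entries, and the real tool's output may differ (for example for the `windows_winrm` and `windows_ssh` prefixes).

```ruby
require "yaml"

# A cut-down copy of the manifest (the two Linux entries only).
MANIFEST = <<~YAML
  nodes:
  - serverA:
      run_list: role[base]
      options: -i ~/.ssh/mray.pem -x user --sudo
  - serverB serverC:
      run_list: role[base]
      options: -i ~/.ssh/mray.pem -x user --sudo -E production
YAML

# Expand every node entry into one bootstrap command per host name.
# A key such as "serverB serverC" names several hosts sharing one definition.
def bootstrap_commands(manifest)
  YAML.safe_load(manifest).fetch("nodes").flat_map do |entry|
    entry.flat_map do |names, settings|
      names.split.map do |host|
        "knife bootstrap #{host} #{settings['options']} -r '#{settings['run_list']}'"
      end
    end
  end
end

puts bootstrap_commands(MANIFEST)
```

The point for DR is that the manifest, not anyone's memory, records which hosts exist and how each one is built.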

Page 34: Disaster Recovery Strategies with Config Management

Regional Events

• Storms, volcanoes, large telecom cuts, worker strikes, etc

• When regional civil infrastructure is affected

• May provide more warning - hurricanes may take several days to form

• Your staff may be without power or the ability to be physically present in your office or datacenter

• Prioritization of services, training of backup staff

Page 35: Disaster Recovery Strategies with Config Management

National Events

• Political unrest

• Other large natural disasters

• Decide if you even need a strategy for these cases

• If your service is down, but all of your customers are also offline, does it make sense to pursue an extensive plan?

Page 36: Disaster Recovery Strategies with Config Management

Kind of a Bummer

http://i.imgur.com/CH5J6Uz.jpg

Page 37: Disaster Recovery Strategies with Config Management

Rule 3: Comprehensive plans require all players

• You may find yourself faced with an event in which your organization is able to provide only Minimum Viable Product-level services

• Scaling back services to only critical core components requires decision making and planning by product, dev, ops, security, etc

• Minimize the need to also bring along extraneous services like VPNs and specialized gear

Page 40: Disaster Recovery Strategies with Config Management

Getting an MVP Up

App LBs

App Servers

DB slaves

Cache

DB Cache

DBs

Maintain Interfaces?

Baseline Capacity

Baseline Capacity

Page 41: Disaster Recovery Strategies with Config Management

Tackling a Reduced Topology

• Container for metadata related to the DR topology

• Chef environment, data bags for storing new info

• Separate from existing infrastructure metadata

http://www.flickr.com/photos/psd/9626226855/sizes/z/in/photostream/

Page 42: Disaster Recovery Strategies with Config Management

DR Environment

• In Chef, an environment is a logical grouping for nodes

• Environments belonging to the same organization share other Chef components like cookbooks and role definitions

• The environment allows you to customize settings for the nodes that live in the environment

Page 43: Disaster Recovery Strategies with Config Management

DR Environment

$ cat environments/dr.rb
name "dr-app1"
description "DR for App1"
override_attributes(
  :app1 => {
    :db_conn => "ro"
  }
)
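The effect of the override can be sketched in plain Ruby without a Chef server. This is a simplified model, not Chef's actual attribute resolution, which has more precedence levels: the `DEFAULTS` hash, its `pool_size` key, the `"rw"` value, and the `deep_merge` helper are hypothetical, and `"ro"` is assumed to mean a read-only database connection.

```ruby
# Hypothetical cookbook defaults for app1; only :db_conn appears on the slide.
DEFAULTS = { app1: { db_conn: "rw", pool_size: 10 } }.freeze

# Override attributes taken from the dr-app1 environment file.
DR_OVERRIDES = { app1: { db_conn: "ro" } }.freeze

# Recursively merge two hashes, letting the override win on conflicts
# while keeping any keys the override does not mention.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

effective = deep_merge(DEFAULTS, DR_OVERRIDES)
puts effective[:app1][:db_conn]   # value a node in the DR environment would see
puts effective[:app1][:pool_size] # untouched defaults carry over
```

Only the settings that differ in DR need to live in the environment; everything else is inherited from the shared cookbooks and roles.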

Page 44: Disaster Recovery Strategies with Config Management

Rule 4: Prioritize

• Determine the hierarchy of all critical services

• Your list may have a different order depending on:

  • Day of week / month / quarter - is accounting software P1 on the 10th of the month?

  • Length of outage - can a service be down a short time with fewer risks?

  • Amount of time necessary to recover - how long will it take your data analytics system to catch up after an outage of N hours? More than N additional hours?
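Two of these questions lend themselves to a small worked example. The sketch below is illustrative only: the service names, base priorities, and the day 8 to 12 "monthly close" window are hypothetical, and the catch-up estimate assumes constant arrival and processing rates.

```ruby
require "date"

# Hypothetical services; priority 1 is most critical.
Service = Struct.new(:name, :base_priority, :monthly_close)

SERVICES = [
  Service.new("storefront", 1, false),
  Service.new("analytics", 3, false),
  Service.new("accounting", 4, true)
].freeze

# Assumed rule: monthly-close services become P1 on days 8 through 12.
def priority_on(service, date)
  return 1 if service.monthly_close && (8..12).cover?(date.day)

  service.base_priority
end

# Recovery order for a given date; ties are broken alphabetically.
def recovery_order(date)
  SERVICES.sort_by { |s| [priority_on(s, date), s.name] }.map(&:name)
end

# Hours needed to drain the backlog built up during an outage.
# The backlog is outage_hours * arrival_rate and drains at the spare rate.
def catch_up_hours(outage_hours, arrival_rate, capacity)
  spare = capacity - arrival_rate
  return Float::INFINITY if spare <= 0

  outage_hours * arrival_rate.to_f / spare
end

puts recovery_order(Date.new(2014, 2, 3)).inspect
puts recovery_order(Date.new(2014, 2, 10)).inspect
puts catch_up_hours(4, 100, 150)
```

Under these assumptions a 4-hour outage takes 8 hours to work off. Whenever capacity is less than twice the arrival rate, catching up takes longer than the outage itself.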

Page 45: Disaster Recovery Strategies with Config Management

User Behavior

[Chart: App 1 usage against the App 1 average by time of day, 0600 through 0600 the next day; vertical axis 0 to 150]

Page 46: Disaster Recovery Strategies with Config Management

Managing Complexity

• Your CM tool is composed of atomic units representing your infrastructure

• Rely on those to help you manage the additional complexity of instantiating new resources in emergencies

• All relationships should be well defined and encoded in the CM tools

• Eliminate the need for specialized knowledge for your DR planning

Page 47: Disaster Recovery Strategies with Config Management

Rule 5: Don’t plan for heroism

• When catastrophic events occur, safety of your people is primary

• Large events affect the availability of people resources

• If your staff has reason to be concerned for their welfare, or the welfare of their families, those are priorities

Page 48: Disaster Recovery Strategies with Config Management

DR for People

• Resist the urge to hide your config management from different teams

• You can’t predict which members of your team will be able to help

Page 49: Disaster Recovery Strategies with Config Management

Checklist

• Identify providers to be used in the case of an outage

  • Are you going to use AWS? Use idle or underutilized infrastructure in other locations? Will there be DNS changes, etc?

• Make sure all accounts, billing, and personnel access are up to date

  • Check this on a regular basis. Add new staff to access lists promptly.

• All new service deployments must include an emergency plan

• Plan for your primary folks to be unavailable

Page 50: Disaster Recovery Strategies with Config Management

TL;DR

• Start with baseline

• Add components over time

• Rebuild and return to initial infrastructure if / when possible

Page 61: Disaster Recovery Strategies with Config Management

Other Stuff to Take into Consideration

• SaaS solutions for temporary infrastructures

• Monitoring and metrics, CDNs, code repositories

• Also for backoffice: email services, document storage

• Often scary for security and compliance folks

• Speed time to recovery in large-loss events

Page 62: Disaster Recovery Strategies with Config Management

fin

• Time to rewrite DR practices for a new generation of tools and services

• Send me your stories if you can share [email protected]

http://i.imgur.com/KdRnwZK.jpg
