StartOps: Growing an ops team from 1 founder

Upload: server-density

Post on 03-Dec-2014

DESCRIPTION

Bootstrapped startups don't have the luxury of a full team of ops engineers available to respond to issues 24/7, so how can you survive on your own? This talk will tell the story of how to run your infrastructure as a single founder through to growing that into a team of on call engineers. It will include some interesting war stories as well as tips and suggestions for how to run ops at a startup. Presented at DevOpsDays London 2013 by David Mytton.

TRANSCRIPT

Page 1: StartOps: Growing an ops team from 1 founder

StartOps: Growing an ops team from 1 founder

- Lot of knowledge online but it usually assumes you have a team, lots of time and money
- That is the goal but it doesn’t start like that so I’m going to talk about the stages to achieve that
- Tips and tools to help along the way
- Use my own company and gratuitous photos of Japan to illustrate the point

Page 2: StartOps: Growing an ops team from 1 founder

David Mytton

Woop Japan!

Page 3: StartOps: Growing an ops team from 1 founder
Page 4: StartOps: Growing an ops team from 1 founder
Page 5: StartOps: Growing an ops team from 1 founder

Bootstrapping sometimes means leaving things to the last minute.

Photo: dannychoo.com

- First tip
- Limited resources, people, time

Page 6: StartOps: Growing an ops team from 1 founder

April 2009

- Quick development
- Experience with PHP + MySQL
- Slicehost was cheap
- Problems with MySQL so moved to MongoDB

Page 7: StartOps: Growing an ops team from 1 founder

Why?

• Replication

Page 8: StartOps: Growing an ops team from 1 founder

Why?

• Replication

• Official drivers

Page 9: StartOps: Growing an ops team from 1 founder

Why?

• Replication

• Official drivers

• Easy deployment

Page 10: StartOps: Growing an ops team from 1 founder

Why?

• Replication

• Official drivers

• Easy deployment

• Fast out of the box (sort of)

1 = changes to WriteConcern

Page 11: StartOps: Growing an ops team from 1 founder

david@pan ~: df -a
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1            156882796 148489776    423964 100% /
proc                         0         0         0   -  /proc
none                         0         0         0   -  /dev/pts
none                   2097260         0   2097260   0% /dev/shm
none                         0         0         0   -  /proc/sys/fs/binfmt_misc
david@pan ~: df -ah
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             150G  142G  415M 100% /
proc                     0     0     0   -  /proc
none                     0     0     0   -  /dev/pts
none                  2.1G     0  2.1G   0% /dev/shm
none                     0     0     0   -  /proc/sys/fs/binfmt_

- Needed to upgrade a machine
- Resize = downtime
- Resyncing finished just in time
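A disk that reaches 100% like the one above is the kind of thing a small scheduled check can flag early. The following is a minimal sketch, not something shown in the talk: the 90% threshold and the names `percent_used` and `disk_alert` are assumptions made for illustration. The percentage is computed as used / (used + available), which is roughly how df arrives at its Use% column.

```python
import shutil


def percent_used(used: int, available: int) -> float:
    """Used space as a percentage of used + available (hypothetical helper)."""
    if used + available <= 0:
        raise ValueError("used + available must be positive")
    return 100.0 * used / (used + available)


def disk_alert(path: str = "/", threshold: float = 90.0) -> bool:
    """True when the filesystem holding `path` is at or above the threshold.

    The 90% default is an assumption; pick whatever leaves enough time
    to resize or resync before the disk fills.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.used, usage.free) >= threshold


if __name__ == "__main__":
    # Used and Available figures for /dev/sda1 from the df output, in 1K blocks.
    print(round(percent_used(148489776, 423964), 1))
    print(disk_alert("/"))
```

Run from cron or a monitoring agent, a check like this buys time to plan the resize rather than racing it.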

Page 12: StartOps: Growing an ops team from 1 founder

MongoDB at Server Density

• 27 nodes

Page 13: StartOps: Growing an ops team from 1 founder

• 27 nodes

MongoDB at Server Density

• 17TB data per month

Page 14: StartOps: Growing an ops team from 1 founder

MongoDB at Server Density

Queues

Primary data store

Time series

Page 15: StartOps: Growing an ops team from 1 founder

It also means trying to find the quickest way.

david@asriel ~: scp david@stelmaria:~/local/local.11 .
local.11                      100% 2047MB   6.8MB/s   05:01

- Needed to resync a database server across the US
- Would take too long; oplog not large enough
- Fast internal network but slow internet

Page 16: StartOps: Growing an ops team from 1 founder

1d, 1h, 58m

11.22MB/s
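The transfer figures above can be sanity-checked with simple arithmetic. This is a sketch with hypothetical function names: 2047MB at 6.8MB/s works out to about five minutes, which agrees with the 05:01 that scp reports. If the duration and rate on this slide describe a single transfer (an assumption, the slides do not say so), they imply a little over a million megabytes moved, roughly 1TB.

```python
def transfer_seconds(size_mb: float, rate_mb_s: float) -> float:
    """Seconds needed to move size_mb megabytes at rate_mb_s MB/s."""
    if rate_mb_s <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_s


def implied_size_mb(duration_s: float, rate_mb_s: float) -> float:
    """Megabytes moved in duration_s seconds at rate_mb_s MB/s."""
    return duration_s * rate_mb_s


def format_duration(seconds: float) -> str:
    """Render a duration as days, hours, minutes and seconds."""
    total = int(round(seconds))
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    # One 2047MB file at 6.8MB/s, the figures in the scp output.
    print(format_duration(transfer_seconds(2047, 6.8)))
    # 1d 1h 58m expressed in seconds, at 11.22MB/s.
    long_run_s = 1 * 86400 + 1 * 3600 + 58 * 60
    print(round(implied_size_mb(long_run_s, 11.22)))
```

Estimating the copy time before starting makes it easier to judge whether the oplog window will be large enough to cover it.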

Page 17: StartOps: Growing an ops team from 1 founder

• Roaming is expensive

Hacking traveling

- Wifi hotspot
- Prepaid SIM
- Euro data cap

Page 18: StartOps: Growing an ops team from 1 founder

Hacking traveling

• Starbucks free wifi + power

Page 19: StartOps: Growing an ops team from 1 founder

Hacking traveling

• Travel light

- Buying things locally

Page 20: StartOps: Growing an ops team from 1 founder

Hacking traveling

• Don’t update

- Like no deploy Friday
- Server updates
- Local OS updates

Page 21: StartOps: Growing an ops team from 1 founder

Let other people help

- Summer 2009 moved to several managed servers with Rackspace.

Page 22: StartOps: Growing an ops team from 1 founder

Let other people help

• Managed hosts

- Rackspace managed hosting
- SoftLayer charge $1/ticket

Page 23: StartOps: Growing an ops team from 1 founder

Let other people help

• Managed hosts

• Support contracts

- Depending on the level of support you buy
- Expensive
- There are ways to work around that: getting involved with projects

Page 24: StartOps: Growing an ops team from 1 founder

Outsourcing

- Engineers terrible at valuing their own time
- “Why pay for something I can build/install/configure myself?”
- Can pay a trusted company/individual to do things
- Lots of little things that need doing
- Examples

Page 25: StartOps: Growing an ops team from 1 founder

Service access list

Outsourcing

- List of services employees have access to
- Revoking credentials
- Adding new users
- Password management
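A service access list can be as simple as a mapping from each service to the people who hold credentials for it. The sketch below is one hypothetical shape for it (the `AccessList` class and the service names are invented for illustration, not taken from the talk); the point is that revoking access for a leaver becomes a single lookup instead of a memory exercise.

```python
from collections import defaultdict


class AccessList:
    """Tracks which people have access to which services (illustrative only)."""

    def __init__(self):
        self._services = defaultdict(set)  # service name -> set of user names

    def grant(self, user, service):
        """Record that `user` has credentials for `service`."""
        self._services[service].add(user)

    def services_for(self, user):
        """Sorted list of services `user` can currently access."""
        return sorted(s for s, users in self._services.items() if user in users)

    def revoke_all(self, user):
        """Remove `user` everywhere; return the services needing credential changes."""
        revoked = self.services_for(user)
        for service in revoked:
            self._services[service].discard(user)
        return revoked


if __name__ == "__main__":
    acl = AccessList()
    acl.grant("alice", "hosting control panel")  # hypothetical service names
    acl.grant("alice", "dns")
    acl.grant("bob", "dns")
    print(acl.revoke_all("alice"))
```

The list returned by `revoke_all` doubles as the to-do list for rotating any shared passwords.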

Page 26: StartOps: Growing an ops team from 1 founder

PCI certification

Outsourcing

- Paperwork / checklist

Page 27: StartOps: Growing an ops team from 1 founder

CDN research

Outsourcing

- Paperwork / checklist

Page 28: StartOps: Growing an ops team from 1 founder

Is it time consuming?

Outsourcing

Page 29: StartOps: Growing an ops team from 1 founder

Is it time consuming?

Boring?

Outsourcing

Page 30: StartOps: Growing an ops team from 1 founder

Is it time consuming?

Boring?

Measurable improvement?

Outsourcing

Page 31: StartOps: Growing an ops team from 1 founder

2010 - 2011

And then there were 3

- Added a new engineer at the end of 2009 and the team stayed at 3 until the start of 2011.
- With more than 1 person you have to start thinking properly

Page 32: StartOps: Growing an ops team from 1 founder

Dealing with humans

- As much as we’d like an API to life, managing human issues becomes important for scaling

Page 33: StartOps: Growing an ops team from 1 founder

Dealing with humans

Automate as much as possible

- You want to remove humans from as much as possible
- Prevents mistakes, makes things easier and faster
- Keeps a log of what happened
- Ideally you only ever want to manually do something once
- Even with just 1 person, setting up config management is a minimum

Page 34: StartOps: Growing an ops team from 1 founder

Dealing with humans

Silo’d information

- Small team so usually 1 person responsible for a lot of code
- Not reasonable to have to ask that person every time there’s a problem with that bit

Page 35: StartOps: Growing an ops team from 1 founder

Dealing with humans

Up to date docs

- Every component should be fully documented
- Consider appliance manuals with the troubleshooting tables they have at the back
- Table of potential failures and how to deal with them
- Vendor contact information
- Team contact information
- Have someone responsible for keeping them up to date

Page 36: StartOps: Growing an ops team from 1 founder

Dealing with humans

Checklists

- Stolen from the Checklist Manifesto / airline industry
- Any manual steps, however trivial, should be checklisted
- Failover, backup recovery, incident handling
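A checklist does not need special tooling; even a tiny script that refuses to skip a step captures the idea. The following is a rough sketch under stated assumptions: the `Checklist` class is hypothetical and the failover steps are invented examples, not the procedure used at Server Density.

```python
class Checklist:
    """An ordered list of manual steps that must be ticked off in sequence."""

    def __init__(self, name, steps):
        if not steps:
            raise ValueError("a checklist needs at least one step")
        self.name = name
        self.steps = list(steps)
        self.done = 0  # number of steps completed so far

    def next_step(self):
        """Return the next pending step, or None when everything is done."""
        return self.steps[self.done] if self.done < len(self.steps) else None

    def complete(self, step):
        """Tick off `step`; refuse if it is not the next one in order."""
        if step != self.next_step():
            raise ValueError(f"expected {self.next_step()!r}, got {step!r}")
        self.done += 1

    def finished(self):
        """True once every step has been ticked off."""
        return self.done == len(self.steps)


if __name__ == "__main__":
    # Invented example steps for a database failover.
    failover = Checklist("database failover", [
        "confirm replica is in sync",
        "stop writes on primary",
        "promote replica",
        "repoint application",
    ])
    while not failover.finished():
        step = failover.next_step()
        print("TODO:", step)
        failover.complete(step)
```

Enforcing the order is the design choice that matters here: under incident pressure, the step most likely to be skipped is the one that looks trivial.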

Page 37: StartOps: Growing an ops team from 1 founder

Dealing with humans

Force scripting

- Takes a bit of extra time but the ROI is massive
- Disallow direct access to things e.g. database queries
- Better to push a button and get a guaranteed result than risk mistakes
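One way to read "disallow direct access" is to expose only a fixed set of named, reviewed operations and to record every run. The sketch below illustrates that idea with an in-memory registry; `operation`, `run`, `count_devices` and the audit log structure are all hypothetical names, and the operation body is a placeholder rather than a real query.

```python
from datetime import datetime, timezone

OPERATIONS = {}  # operation name -> function; the only things operators may run
AUDIT_LOG = []   # (timestamp, operator, operation name, parameters)


def operation(func):
    """Register `func` as an approved, named operation."""
    OPERATIONS[func.__name__] = func
    return func


@operation
def count_devices(account_id):
    # Placeholder body: a real version would run one fixed, reviewed query.
    return {"account_id": account_id, "devices": 0}


def run(operator, name, **params):
    """Run an approved operation by name and record who ran it."""
    if name not in OPERATIONS:
        raise KeyError(f"{name!r} is not an approved operation")
    AUDIT_LOG.append((datetime.now(timezone.utc), operator, name, params))
    return OPERATIONS[name](**params)


if __name__ == "__main__":
    print(run("david", "count_devices", account_id=42))
    print(len(AUDIT_LOG))
```

Anything not in the registry simply cannot be run, so an ad hoc query has to become a reviewed script before anyone can use it.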

Page 38: StartOps: Growing an ops team from 1 founder

2012 - 2013

Growing to 12

- 12 people, 11 of whom are technical
- Now have the luxury of being able to spread things out
- Proper on call schedule

Page 39: StartOps: Growing an ops team from 1 founder

On-call

Dealing with humans

- Sharing out the responsibility
- Determining level of response: 24/7 real monitoring or first responder
- 24/7 real monitoring for HA environments, real people at a screen at all times
- First responder: people at the end of a phone

Page 40: StartOps: Growing an ops team from 1 founder

On-call 1) Ops engineer

Dealing with humans

- During working hours our dedicated ops engineers take the first level
- Avoids interrupting product engineers for initial fire fighting

Page 41: StartOps: Growing an ops team from 1 founder

On-call 1) Ops engineer

2) All engineers

Dealing with humans

- Out of hours we rotate every engineer, product and ops
- Rotation every 7 days on a Tuesday
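A weekly rotation that changes hands on a Tuesday can be computed from the date alone. This is a minimal sketch assuming a fixed, ordered list of engineers and an arbitrary starting date (`epoch`); the function names and the example names are hypothetical, and real schedules also have to handle swaps and holidays.

```python
from datetime import date, timedelta
from typing import Sequence


def rotation_start(day: date) -> date:
    """Most recent Tuesday on or before `day` (Monday is weekday 0)."""
    return day - timedelta(days=(day.weekday() - 1) % 7)


def on_call(day: date, engineers: Sequence[str], epoch: date) -> str:
    """Engineer on call on `day`, rotating weekly from the week containing `epoch`."""
    if not engineers:
        raise ValueError("need at least one engineer")
    weeks = (rotation_start(day) - rotation_start(epoch)).days // 7
    return engineers[weeks % len(engineers)]


if __name__ == "__main__":
    team = ["ana", "ben", "cho"]  # invented names
    start = date(2013, 1, 1)      # a Tuesday, chosen arbitrarily
    for offset in range(0, 28, 7):
        day = start + timedelta(days=offset)
        print(day.isoformat(), on_call(day, team, start))
```

Deriving the schedule from the calendar means there is no state to keep in sync: anyone can work out who is on call for any given week.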

Page 42: StartOps: Growing an ops team from 1 founder

On-call 1) Ops engineer

2) All engineers

3) Ops engineer

Dealing with humans

- Always have a secondary
- This is always an ops engineer
- Thinking is if the issue needs to be escalated then it’s likely a bigger problem that needs additional systems expertise

Page 43: StartOps: Growing an ops team from 1 founder

On-call 1) Ops engineer

2) All engineers

3) Ops engineer

4) Others

Dealing with humans

- Next month we’re launching a major new product into beta
- Support from design / frontend engineering
- Have to press a button to get them involved

Page 44: StartOps: Growing an ops team from 1 founder

Off-call

Dealing with humans

- Responders to an incident get next 24 hours off-call
- Social issues to deal with
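The 24 hour off-call rule can be expressed as a filter over the pool of possible responders. This is a sketch under assumptions: the `eligible` function and the `last_response` mapping (engineer name to the time they last responded to an incident) are hypothetical structures, not something described in the talk.

```python
from datetime import datetime, timedelta

OFF_CALL = timedelta(hours=24)  # rest period after responding to an incident


def eligible(engineers, last_response, now):
    """Engineers who have not responded to an incident in the last 24 hours.

    `last_response` maps a name to the datetime of that person's most
    recent incident response; people with no entry are always eligible.
    """
    return [
        name for name in engineers
        if name not in last_response or now - last_response[name] >= OFF_CALL
    ]


if __name__ == "__main__":
    now = datetime(2013, 3, 15, 9, 0)
    responded = {"ana": datetime(2013, 3, 15, 2, 30)}  # paged overnight
    print(eligible(["ana", "ben", "cho"], responded, now))
```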

Page 45: StartOps: Growing an ops team from 1 founder

On-call CEO

Dealing with humans

- I receive push notifications + e-mails for all outages

Page 46: StartOps: Growing an ops team from 1 founder

Uptime reporting

Dealing with humans

- Weekly internal report on G+
- Gives visibility to entire company about any incidents
- Allows us to discuss incidents to get to that 100% uptime
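The headline number in a weekly uptime report is straightforward to derive from outage durations. A minimal sketch, assuming outages are recorded in seconds and do not overlap; `uptime_percent` is a hypothetical helper. For scale, a single 30 minute outage in a week already brings the figure down to roughly 99.7%.

```python
WEEK_S = 7 * 24 * 3600  # seconds in a week


def uptime_percent(outages_s, period_s=WEEK_S):
    """Percentage of the period with no outage, given outage durations in seconds."""
    downtime = sum(outages_s)
    if period_s <= 0:
        raise ValueError("period must be positive")
    if downtime > period_s:
        raise ValueError("downtime exceeds the reporting period")
    return 100.0 * (period_s - downtime) / period_s


if __name__ == "__main__":
    # One 30 minute outage during the week.
    print(round(uptime_percent([30 * 60]), 3))
```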

Page 47: StartOps: Growing an ops team from 1 founder

Social issues

Dealing with humans

- How quickly can you get to a computer?
- Are they out drinking on a Friday?
- What happens if someone is ill?
- What if there’s a sudden emergency: accident? family emergency?
- Do they have enough phone battery?
- Can you hear the ringtone?

Page 48: StartOps: Growing an ops team from 1 founder

Backup responder

Dealing with humans

- Backup responder
- Time out the initial responder
- Escalate difficult problems
- Essentially human redundancy: phone provider, geographic area, internet connectivity
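Timing out the initial responder amounts to walking down an escalation chain as an alert goes unacknowledged. The sketch below assumes each level has its own timeout in minutes; the level names and the 15/15/30 minute values are invented for illustration and are not the timeouts used at Server Density.

```python
from typing import List, Tuple


def current_level(elapsed_min: float, chain: List[Tuple[str, float]]) -> str:
    """Who should hold the page after `elapsed_min` minutes without an acknowledgement.

    `chain` is an ordered list of (level name, timeout in minutes).
    Once every timeout has passed, the alert stays with the last level.
    """
    if not chain:
        raise ValueError("escalation chain must not be empty")
    deadline = 0.0
    for name, timeout in chain:
        deadline += timeout
        if elapsed_min < deadline:
            return name
    return chain[-1][0]


if __name__ == "__main__":
    # Hypothetical chain and timeouts.
    chain = [("primary", 15), ("secondary ops engineer", 15), ("others", 30)]
    for minutes in (0, 15, 45):
        print(minutes, current_level(minutes, chain))
```

The same structure covers the human redundancy point: the backup should not share the primary's failure modes, such as the same phone provider or the same location.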

Page 49: StartOps: Growing an ops team from 1 founder

Expected

Dealing with outages

- Outages are going to happen, especially at the beginning
- Costs money for redundancy
- How you deal with them

Page 50: StartOps: Growing an ops team from 1 founder

Dealing with outages

Externally

Communication

- Telling people what is happening
- Frequently
- Dependent on audience: we can go into more detail because our customers are techies
- GitHub do a good job of providing incident writeups but don’t provide a good idea of what is happening right now
- Generally Amazon and Heroku are good and go into more detail

Page 51: StartOps: Growing an ops team from 1 founder

Communication

Dealing with outages

Internally

- Open Skype conferences between the responders
- Usually mostly silence or the sound of the keyboard, but simulates being in the situation room
- Faster than typing

Page 52: StartOps: Growing an ops team from 1 founder

Really test your vendors

Dealing with outages

- Shows up flaws in vendor support processes
- Frustrating when waiting on someone else
- You want as much information as possible
- Major outage? Everyone will be calling them

Page 53: StartOps: Growing an ops team from 1 founder

Simulations

Dealing with outages

- Try and avoid unnecessary problems
- Do servers come back up from boot?
- Can hot spares handle the load?
- Test failover: databases, HA firewalls
- Regularly reboot servers
- Wargames can happen at another stage: startups are usually too focused on building things first
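Before running a live failover test, the "can hot spares handle the load?" question can be checked on paper. This is a rough sketch assuming load and capacity are measured in the same unit (for example requests per second) and that a spare takes over exactly one failed node; the function name, node names and figures are all invented.

```python
def nodes_spare_cannot_cover(node_loads, spare_capacity):
    """Names of nodes whose current load is more than the spare could absorb.

    `node_loads` maps node name -> current load; `spare_capacity` is the
    most a single hot spare can serve, in the same unit.
    """
    return sorted(
        name for name, load in node_loads.items() if load > spare_capacity
    )


if __name__ == "__main__":
    # Invented figures, requests per second.
    loads = {"db1": 900, "db2": 1400, "db3": 700}
    print(nodes_spare_cannot_cover(loads, spare_capacity=1000))
```

A non-empty result means a failover of that node would overload the spare, which is worth knowing before the simulation rather than during a real outage.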

Page 54: StartOps: Growing an ops team from 1 founder

You want your own team

- The ones who care the most
- Know the most
- Can fix things fastest

Page 55: StartOps: Growing an ops team from 1 founder
Page 56: StartOps: Growing an ops team from 1 founder
Page 57: StartOps: Growing an ops team from 1 founder

Monitoring tools

Server Density

Page 58: StartOps: Growing an ops team from 1 founder
Page 59: StartOps: Growing an ops team from 1 founder

Woop Japan!

www.serverdensity.com/dd

Page 60: StartOps: Growing an ops team from 1 founder

David Mytton

@davidmytton

Woop Japan!

www.serverdensity.com