startops: growing an ops team from 1 founder
DESCRIPTION
Bootstrapped startups don't have the luxury of a full team of ops engineers available to respond to issues 24/7, so how can you survive on your own? This talk will tell the story of how to run your infrastructure as a single founder through to growing that into a team of on call engineers. It will include some interesting war stories as well as tips and suggestions for how to run ops at a startup. Presented at DevOpsDays London 2013 by David Mytton.
TRANSCRIPT
StartOps: Growing an ops team from 1 founder
- Lot of knowledge online but it usually assumes you have a team, lots of time and money
- That is the goal but it doesn’t start like that so I’m going to talk about the stages to achieve that
- Tips and tools to help along the way
- Use my own company and gratuitous photos of Japan to illustrate the point
David Mytton
Woop Japan!
Bootstrapping sometimes means leaving things to the last minute.
Photo: dannychoo.com
- First tip
- Limited resources, people, time
April 2009
- Quick development
- Experience with PHP + MySQL
- Slicehost was cheap
- Problems with MySQL so moved to MongoDB
Why?
• Replication
• Official drivers
• Easy deployment
• Fast out of the box (sort of)
1 = changes to WriteConcern
david@pan ~: df -a
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 156882796 148489776 423964 100% /
proc 0 0 0 - /proc
none 0 0 0 - /dev/pts
none 2097260 0 2097260 0% /dev/shm
none 0 0 0 - /proc/sys/fs/binfmt_misc
david@pan ~: df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 150G 142G 415M 100% /
proc 0 0 0 - /proc
none 0 0 0 - /dev/pts
none 2.1G 0 2.1G 0% /dev/shm
none 0 0 0 - /proc/sys/fs/binfmt_
- Needed to upgrade a machine
- Resize = downtime
- Resyncing finished just in time
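A note on the df output above: 142G used out of 150G reads as 100%, not 95%. As far as I know, GNU df works out Use% as used / (used + available), rounded up, which leaves out the blocks reserved for root. A minimal sketch of that arithmetic, using the numbers from the output; the function names are made up for illustration and the formula is my understanding of df, not something stated in the talk:

```python
import math


def df_use_percent(used_kb: int, available_kb: int) -> int:
    """Use% as df appears to report it (assumption): used / (used + available), rounded up."""
    usable = used_kb + available_kb
    if usable == 0:
        return 0
    return math.ceil(used_kb * 100 / usable)


def naive_use_percent(used_kb: int, total_kb: int) -> float:
    """Use% against the raw filesystem size, which ignores reserved blocks."""
    return used_kb * 100 / total_kb


# Figures taken from the /dev/sda1 line in the df -a output.
print(df_use_percent(148489776, 423964))
print(round(naive_use_percent(148489776, 156882796), 1))
```

The practical point is that an alert threshold set on the raw size would fire too late; the space you can actually write to runs out first.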
MongoDB at Server Density
•27 nodes
MongoDB at Server Density
•17TB data per month
MongoDB at Server Density
Queues
Primary data store
Time series
It also means trying to find the quickest way.
david@asriel ~: scp david@stelmaria:~/local/local.11 .
local.11 100% 2047MB 6.8MB/s 05:01
- Needed to resync a database server across the US
- Would take too long; oplog not large enough
- Fast internal network but slow internet
1d, 1h, 58m
11.22MB/s
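A quick way to sanity check numbers like these before starting a copy: transfer time is size divided by sustained throughput. A minimal sketch; the function names are illustrative, and the 2047MB at 6.8MB/s inputs come from the scp output above:

```python
def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Estimated transfer time in seconds at a sustained rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_per_s


def format_duration(seconds: float) -> str:
    """Render a duration as days, hours and whole minutes, e.g. '1d, 1h, 58m'."""
    total_minutes = int(seconds // 60)
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return f"{days}d, {hours}h, {minutes}m"


# One 2047MB file at 6.8MB/s, as in the scp line: about 301 seconds (05:01).
print(round(transfer_seconds(2047, 6.8)))
print(format_duration(transfer_seconds(2047, 6.8)))
```

If the 1d, 1h, 58m and 11.22MB/s figures describe the same copy (an assumption; the slides do not say), running the formula in reverse gives roughly 1TB moved.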
• Roaming is expensive
Hacking traveling
- Wifi hotspot
- Prepaid SIM
- Euro data cap
Hacking traveling
•Starbucks free wifi + power
Hacking traveling
• Travel light
- Buying things locally
Hacking traveling
• Don’t update
- Like no deploy Friday
- Server updates
- Local OS updates
Let other people help
- Summer 2009 moved to several managed servers with Rackspace.
Let other people help
• Managed hosts
- Rackspace managed hosting
- Softlayer charge $1/ticket
Let other people help
• Managed hosts
• Support contracts
- Depending on the level of support you buy
- Expensive
- There are ways to work around that, e.g. getting involved with projects
Outsourcing
- Engineers terrible at valuing their own time
- “Why pay for something I can build/install/configure myself?”
- Can pay a trusted company/individual to do things
- Lots of little things that need doing
- Examples
Service access list
Outsourcing
- List of services employees have access to
- Revoking credentials
- Adding new users
- Password management
PCI certification
Outsourcing
- Paperwork / checklist
CDN research
Outsourcing
- Paperwork / checklist
Is it time consuming?
Boring?
Measurable improvement?
Outsourcing
2010 - 2011
And then there were 3
- Added a new engineer at the end of 2009 and the team stayed at 3 until the start of 2011.
- With more than 1 person you have to start thinking properly
Dealing with humans
- As much as we’d like an API to life, managing human issues becomes important for scaling
Dealing with humans
Automate as much as possible
- You want to remove humans from as much as possible
- Prevents mistakes, makes things easier and faster
- Keeps a log of what happened
- Ideally you only ever want to manually do something once
- Even with just 1 person, setting up config management is a minimum
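To make "do it manually once, then automate" concrete, here is a minimal sketch of the kind of idempotent step a config management tool runs for you: it changes the file only if needed and logs what happened either way. The ensure_line helper, the logger name and the example setting are all hypothetical; in practice you would use a real config management tool and not hand-roll this.

```python
import logging
from pathlib import Path

log = logging.getLogger("ops-sketch")  # hypothetical logger name


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if the file changed."""
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        log.info("unchanged: %s already has %r", path, line)
        return False
    # Avoid gluing the new line onto an existing last line without a newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(f"{separator}{line}\n")
    log.info("changed: added %r to %s", line, path)
    return True
```

Running it twice with the same arguments changes the file once and reports "unchanged" the second time, which is what makes it safe to run on every server, every time.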
Dealing with humans
Silo’d information
- Small team so usually 1 person responsible for a lot of code
- Not reasonable to have to ask that person every time there’s a problem with that bit
Dealing with humans
Up to date docs
- Every component should be fully documented
- Consider appliance manuals with the troubleshooting tables they have at the back
- Table of potential failures and how to deal with them
- Vendor contact information
- Team contact information
- Have someone responsible for keeping them up to date
Dealing with humans
Checklists
- Stolen from the Checklist Manifesto / airline industry
- Any manual steps, however trivial, should be checklisted
- Failover, backup recovery, incident handling
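A checklist does not need tooling, but even a tiny structure that forces steps to be done in order and records who did them helps. A minimal sketch; the Checklist class and the failover steps are invented for illustration and are not our actual procedure:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Checklist:
    name: str
    steps: list
    log: list = field(default_factory=list)  # (step, who, when) for each completed step

    def complete_next(self, who: str) -> str:
        """Tick off the next step in order and record who did it and when."""
        if self.is_complete():
            raise RuntimeError(f"checklist {self.name!r} is already complete")
        step = self.steps[len(self.log)]
        self.log.append((step, who, datetime.now(timezone.utc)))
        return step

    def remaining(self) -> list:
        return self.steps[len(self.log):]

    def is_complete(self) -> bool:
        return len(self.log) == len(self.steps)


# Hypothetical steps, for illustration only.
failover = Checklist(
    "database failover",
    ["Confirm primary is really down", "Promote secondary", "Update status page"],
)
failover.complete_next("david")
print(failover.remaining())
```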
Dealing with humans
Force scripting
- Takes a bit of extra time but the ROI is massive
- Disallow direct access to things e.g. database queries
- Better to push a button and get a guaranteed result than risk mistakes
2012 - 2013
Growing to 12
- 12, 11 of which are technical
- Now have the luxury of being able to spread things out
- Proper on call schedule
On-call
Dealing with humans
- Sharing out the responsibility
- Determining level of response: 24/7 real monitoring or first responder
- 24/7 real monitoring for HA environments, real people at a screen at all times
- First responder: people at the end of a phone
On-call 1) Ops engineer
Dealing with humans
- During working hours our dedicated ops engineers take the first level
- Avoids interrupting product engineers for initial fire fighting
On-call 1) Ops engineer
2) All engineers
Dealing with humans
- Out of hours we rotate every engineer, product and ops
- Rotation every 7 days on a Tuesday
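A 7 day rotation that hands over on a Tuesday is easy to compute, so nobody has to maintain a calendar by hand. A minimal sketch; the engineer names and the starting Tuesday are placeholders, and this is not the scheduling tool we actually use:

```python
from datetime import date, timedelta


def rotation_start(day: date) -> date:
    """Most recent Tuesday on or before `day` (Monday is weekday 0, Tuesday is 1)."""
    return day - timedelta(days=(day.weekday() - 1) % 7)


def on_call(day: date, engineers: list, first_tuesday: date) -> str:
    """Who is on call on `day`, given that engineers[0] started on `first_tuesday`."""
    weeks = (rotation_start(day) - rotation_start(first_tuesday)).days // 7
    return engineers[weeks % len(engineers)]


# Placeholder names; 1 January 2013 was a Tuesday.
team = ["alice", "bob", "carol"]
print(on_call(date(2013, 1, 7), team, date(2013, 1, 1)))  # Monday, still the first week
print(on_call(date(2013, 1, 8), team, date(2013, 1, 1)))  # Tuesday, handover
```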
On-call 1) Ops engineer
2) All engineers
3) Ops engineer
Dealing with humans
- Always have a secondary
- This is always an ops engineer
- Thinking is if the issue needs to be escalated then it’s likely a bigger problem that needs additional systems expertise
On-call 1) Ops engineer
2) All engineers
3) Ops engineer
4) Others
Dealing with humans
- Next month we’re launching a major new product into beta
- Support from design / frontend engineering
- Have to press a button to get them involved
Off-call
Dealing with humans
- Responders to an incident get next 24 hours off-call
- Social issues to deal with
On-call CEO
Dealing with humans
- I receive push notifications + e-mails for all outages
Uptime reporting
Dealing with humans
- Weekly internal report on G+
- Gives visibility to entire company about any incidents
- Allows us to discuss incidents to get to that 100% uptime
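The headline number in a weekly report is simple arithmetic: downtime as a share of the 10,080 minutes in a week. A minimal sketch with illustrative function names; how downtime is actually measured is not covered here:

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080


def weekly_uptime_percent(downtime_minutes: float) -> float:
    """Uptime for one week as a percentage, given total minutes of downtime."""
    if not 0 <= downtime_minutes <= MINUTES_PER_WEEK:
        raise ValueError("downtime must be between 0 and one week")
    return 100 * (1 - downtime_minutes / MINUTES_PER_WEEK)


def weekly_downtime_budget(target_percent: float) -> float:
    """Minutes of downtime per week allowed by an uptime target."""
    return (1 - target_percent / 100) * MINUTES_PER_WEEK


print(weekly_uptime_percent(30))
print(weekly_downtime_budget(99.9))
```

Turning a target into a budget makes the discussion easier: 99.9% leaves about ten minutes a week, which is less than one slow reboot.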
Social issues
Dealing with humans
- How quickly can you get to a computer?
- Are they out drinking on a Friday?
- What happens if someone is ill?
- What if there’s a sudden emergency: accident? family emergency?
- Do they have enough phone battery?
- Can you hear the ringtone?
Backup responder
Dealing with humans
- Backup responder
- Time out the initial responder
- Escalate difficult problems
- Essentially human redundancy: phone provider, geographic area, internet connectivity
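Timing out the initial responder can be described as a chain of tiers, each with a window to acknowledge before the next tier is paged as well. A minimal sketch of that logic; the tier names and the 15 minute windows are assumptions for illustration, not our real settings:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    ack_timeout_min: int  # minutes this tier has to acknowledge before escalating


def paged_tiers(chain: list, minutes_since_alert: float) -> list:
    """Names of every tier that should have been paged if nobody has acknowledged yet."""
    paged = []
    escalate_at = 0  # minute at which the current tier gets paged
    for tier in chain:
        if minutes_since_alert < escalate_at:
            break
        paged.append(tier.name)
        escalate_at += tier.ack_timeout_min
    return paged


# Hypothetical chain and timeouts.
chain = [Tier("primary", 15), Tier("backup", 15), Tier("ceo", 15)]
print(paged_tiers(chain, 0))
print(paged_tiers(chain, 20))
```

The redundancy only works if the tiers fail independently, which is why the backup should ideally be on a different phone provider, in a different area, on a different internet connection.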
Expected
Dealing with outages
- Outages are going to happen, especially at the beginning
- Costs money for redundancy
- How you deal with them
Dealing with outages
Externally
Communication
- Telling people what is happening
- Frequently
- Dependent on audience - we can go into more detail because our customers are techies
- Github do a good job of providing incident writeups but don’t provide a good idea of what is happening right now
- Generally Amazon and Heroku are good and go into more detail
Communication
Dealing with outages
Internally
- Open Skype conferences between the responders
- Usually mostly silence or the sound of the keyboard, but simulates being in the situation room
- Faster than typing
Really test your vendors
Dealing with outages
- Shows up flaws in vendor support processes
- Frustrating when waiting on someone else
- You want as much information as possible
- Major outage? Everyone will be calling them
Simulations
Dealing with outages
- Try and avoid unnecessary problems
- Do servers come back up from boot?
- Can hot spares handle the load?
- Test failover: databases, HA firewalls
- Regularly reboot servers
- Wargames can happen at another stage: startups are usually too focused on building things first
You want your own team
- The ones who care the most
- Know the most
- Can fix things fastest
Monitoring tools
Server Density
Woop Japan!
www.serverdensity.com/dd