system, heal thy self….. the major difference between a thing which might go wrong, and a thing...

System, heal thy self….

The major difference between a thing which might go wrong, and a thing which cannot possibly go wrong, is that, when a thing which cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or repair.

- un-attributed, posted in Microsoft conf. room

Failure is always an option…

At (cloud) scale…

• Hardware fails just as much as, and often more than, software

• Build world-class server hardware with an MTBF of 30 years, buy 10,000 of these. Watch one fail each day

• The internet is not “five nines” (99.999%). At most it’s two nines in any given location (very geography dependent)

• Expect it to be down for at least ~4 (whole) days out of the year (1% of 365 days is 3.65 days)

• It’s impossible for humans to monitor, detect and react to issues at scale

• Ops : server ratio is 1:50 – 1:150 [1]

• Google has ~1M servers [2] == 6,500+ ops?

[1] – http://serverfault.com/questions/77374/whats-the-server-to-admin-ratio-at-your-workplace

[2] – http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/
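The numbers on this slide can be checked with some back-of-the-envelope arithmetic. The sketch below assumes a constant failure rate across the fleet and a 365-day year; the function names are made up for illustration.

```python
# Back-of-the-envelope checks for the scale numbers above.
# Assumptions: constant failure rate, 365-day year, hypothetical helper names.
DAYS_PER_YEAR = 365


def expected_failures_per_day(servers: int, mtbf_years: float) -> float:
    """Expected server failures per day for a fleet with the given MTBF."""
    return servers / (mtbf_years * DAYS_PER_YEAR)


def downtime_days_per_year(availability: float) -> float:
    """Days of downtime per year implied by an availability fraction."""
    return (1.0 - availability) * DAYS_PER_YEAR


def ops_headcount(servers: int, servers_per_admin: int) -> int:
    """Admins needed at a given servers-per-admin ratio (rounded up)."""
    return -(-servers // servers_per_admin)  # ceiling division


if __name__ == "__main__":
    print(f"failures/day, 10,000 servers at 30y MTBF: "
          f"{expected_failures_per_day(10_000, 30):.2f}")
    print(f"downtime at two nines: {downtime_days_per_year(0.99):.2f} days/year")
    print(f"downtime at five nines: "
          f"{downtime_days_per_year(0.99999) * 24 * 60:.1f} minutes/year")
    print(f"ops for 1M servers at 1:150: {ops_headcount(1_000_000, 150)}")
    print(f"ops for 1M servers at 1:50: {ops_headcount(1_000_000, 50)}")
```

At 1:150 the headcount lands in the same range as the "6,500+" on the slide; at 1:50 it is three times that.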

When a system gets large/complex enough, failure is not an option; it’s a fact of everyday life

Livesite first

Traditional way of reacting to livesite issues/outages

• Page support staff, open support tickets for technicians, email programmers

• This is way too slow!

Service-level (network-based) failover is common

• Load balancers, GTM routing, hot-hot BCP

Software failover is less common

• Code has much more context on whether something is “wrong” or not

Let your DEV team be the ops

• Have them feel the pain!

• Often, they are ultimately the people that will have to fix the problem anyway

• Most ops activities should be scripted/coded – make fat-fingers less likely

Ain’t nothing like the real thing…

How much of any given codebase is validation/verification and error handling?

How much testing is performed on a system under real-world conditions/load?

• How easy is this even to simulate?

How much (time/money) are we spending on QA?

• How many servers/network/etc for test, staging, vs. production?

• How often is it fully utilized?

What if we rolled the whole lot together…

• TiP – Testing in Production

…and didn’t worry about the spectre of “quality”

• Let bugs run free!

Towards Nirvana

How do you get there – how much does this cost?

This isn’t a small undertaking – it took MSFT 12+ years to get here, through many guises and lessons learned

• BRSs, SPoFs, wiring “oddities”

• The feedback/learning cycle is getting smaller though – Facebook has only been around for 8 years

You need a service-orientated culture

• It’s not about shipping software

You need team(s) focused on the “platform” and/or “fabric”

• These are absolutely critical and not cheap to build

Buy vs. build?

• IaaS, PaaS, SaaS

• Azure, AppEngine, AWS

[Diagram labels: Deploy, Monitoring, Rules; Service Routing; Failure Domain]

System, heal thy self

Don’t sweat the hard stuff

• Allow things to crash – let the system pick up the pieces

• Encourage things to crash – chaos monkey

Simplicity in programming

• Shared and hardened services are key – take as much thought/effort out of individual developers’ hands as possible

Not everything is equally important

• Maslow's hierarchy of needs – acceptable losses!

Automate as many repair actions as possible

• Humans can’t be around all the time and can be slow to react (which sometimes is a good thing)

• Tier 3 isn't 24/7, and for the major things that happen at 2:00am you just know those are the people you need
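"Allow things to crash, let the system pick up the pieces" can be illustrated with a toy supervisor. This is a minimal in-memory sketch, not how any real chaos monkey or repair system is built; `Supervisor` and `chaos_monkey` are hypothetical names, and "restart" here is just flipping a flag.

```python
import random


class Supervisor:
    """Tracks a set of service instances and restarts any that are down.

    Hypothetical sketch: a real system would probe health and respawn
    processes or machines rather than flip a boolean.
    """

    def __init__(self, names):
        self.running = {name: True for name in names}
        self.restarts = {name: 0 for name in names}

    def crash(self, name):
        """Mark an instance as crashed."""
        self.running[name] = False

    def heal(self):
        """Restart every instance that is down; return the names repaired."""
        repaired = [name for name, up in self.running.items() if not up]
        for name in repaired:
            self.running[name] = True
            self.restarts[name] += 1
        return repaired


def chaos_monkey(supervisor, rng):
    """Deliberately crash one randomly chosen instance and return its name."""
    victim = rng.choice(sorted(supervisor.running))
    supervisor.crash(victim)
    return victim


if __name__ == "__main__":
    sup = Supervisor(["frontdoor-1", "frontdoor-2", "index-1"])
    victim = chaos_monkey(sup, random.Random())
    print("killed:", victim)
    print("repaired:", sup.heal())
    print("all running:", all(sup.running.values()))
```

The point of the design is that nobody is paged: the crash is expected, and the repair loop is the normal path rather than the exception.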

How does Bing protect itself?

Getting to Bing.com

[Diagram labels: World wide web; Akamai; Front Door (×3); SERP; Images; …; AS (×4)]

DoS/Bot/Load protection in Bing

• We are more concerned about protecting the “Good Guys” (Our carbon based users) than we are about blocking the “Bad Guys” (‘Synthetic’ traffic)

• Some amount of synthetic traffic is ok

• We have agreements in place to be scraped

• We scrape our own content

Crash Protection Service – Watson for services

If query is “expensive” or causes crash(es)

• Cache and/or block the request

• “Bucket” crashes/errors and turn off features/flights if there’s a pattern

Can help with complex scenarios

• 80/20 rule – is there a small set of bugs responsible for most issues?

• Gather data on bugs with large cause-effect chasm

• Catch (and respond to) things not seen before
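The crash-protection idea above (block a query that keeps crashing, bucket crashes by feature, turn a feature off when a pattern shows up) might be sketched as below. The class name, thresholds, and bucketing key are all assumptions for illustration; the slides do not describe the real service's interface.

```python
from collections import Counter


class CrashGuard:
    """Hypothetical sketch of bucketing crashes and reacting to patterns."""

    def __init__(self, query_limit=2, feature_limit=3):
        self.query_limit = query_limit        # crashes before a query is blocked
        self.feature_limit = feature_limit    # crashes before a feature is disabled
        self.crashes_by_query = Counter()
        self.crashes_by_feature = Counter()
        self.blocked_queries = set()
        self.disabled_features = set()

    def record_crash(self, query, feature):
        """Bucket a crash by query and by feature, then apply the limits."""
        self.crashes_by_query[query] += 1
        self.crashes_by_feature[feature] += 1
        if self.crashes_by_query[query] >= self.query_limit:
            self.blocked_queries.add(query)
        if self.crashes_by_feature[feature] >= self.feature_limit:
            self.disabled_features.add(feature)

    def allow(self, query, feature):
        """True if the query may still be served with this feature enabled."""
        return (query not in self.blocked_queries
                and feature not in self.disabled_features)


if __name__ == "__main__":
    guard = CrashGuard()
    guard.record_crash("bad query", "maps")
    guard.record_crash("bad query", "maps")
    print(guard.allow("bad query", "maps"))    # query is now blocked
    print(guard.allow("other query", "maps"))  # feature still enabled
```

Counting per bucket is also what makes the 80/20 question answerable: `crashes_by_feature.most_common()` shows whether a few buckets account for most crashes.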

Experimentation/Flighting

Limit exposure, and impact, to a small subset of users

• Some may not like it, some may really like it

• If there are issues, can eject users from a flight (implicitly or explicitly), or stop the flight altogether

Roll out changes gradually

• Allow systems to “warm up”

• Manage demand

• Allow for roll-back – N+1 / N / N-1 versions

A/B Testing

• Control group, treatment group. Look for differences

• Let users define acceptance criteria – scorecard off some key metrics

• In other complex systems, this is common – the FDA do this
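One common way to put a small, stable subset of users into a flight is to hash the user id into a bucket. The sketch below assumes that approach (the slides do not say how Bing assigns users) and uses hypothetical names throughout; it also shows ejecting a user and stopping the flight.

```python
import hashlib


class Flight:
    """Hypothetical flight: exposes a feature to a percentage of users."""

    def __init__(self, name, percent):
        self.name = name
        self.percent = percent   # 0..100, share of users exposed
        self.active = True
        self.ejected = set()

    def _bucket(self, user_id):
        # Hash flight name + user id so each flight gets its own stable split.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def includes(self, user_id):
        """True if this user should see the flighted behaviour."""
        if not self.active or user_id in self.ejected:
            return False
        return self._bucket(user_id) < self.percent

    def eject(self, user_id):
        """Remove a single user from the flight."""
        self.ejected.add(user_id)

    def stop(self):
        """Stop the flight for everyone."""
        self.active = False


if __name__ == "__main__":
    flight = Flight("new-serp-layout", percent=10)
    users = [f"user{i}" for i in range(1000)]
    exposed = [u for u in users if flight.includes(u)]
    print(f"{len(exposed)} of {len(users)} users in flight")
    flight.stop()
    print(any(flight.includes(u) for u in users))
```

Raising `percent` step by step gives the gradual roll-out described above, and the users outside the flight serve as the control group for an A/B comparison.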

Resiliency through redundancy

• Assume failure

• Assume machines/services will crash

• Have enough redundancy to continue to operate

• Disable faulty services/software

• Roll back to a “known good” state

WARNING: Computer pr0n ahead!

Autopilot

[Diagram labels: Application Services (Frontdoor; Web Index; Other service); SU1 SU2 SU3 SU4; Core Autopilot Services (Collection Service; Cockpit; Watchdog; Device Manager; Provisioning Service; Repair Service; Deployment Service)]

Failure by design - Autopilot

• Using Autopilot means Bing has failure “designed in”

• All systems/services are designed such that any instance can be killed unexpectedly without destabilizing the rest of the system

• If service/machine(s) are failing, they will a) be restarted, then b) reimaged, then c) RMA’ed

• Also roll-back to previous “known good” version

• Allows for simpler development

• Don’t worry (too much) about failure cases / clean-up code

• “Crash early, [crash often]”

• Fork (“T-ed”) real traffic into pre-release and scale units during roll-out

• Customers are helping us test our v-next product without knowing it

• Allows for some simple security management

• Out of spec with current configuration – reimage.

• More info at http://research.microsoft.com/pubs/64604/osr2007.pdf


http://sharepoint/sites/autopilot/
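The restart, then reimage, then RMA escalation on this slide can be sketched as a simple ladder. This is an illustration only: `RepairService` here is a hypothetical class that borrows the name from the diagram, and the real Autopilot interfaces are not shown in the slides.

```python
# Escalating repair actions, in the order given on the slide.
REPAIR_LADDER = ["restart", "reimage", "rma"]


class RepairService:
    """Hypothetical sketch: picks the next repair action for a failing machine."""

    def __init__(self):
        self.attempts = {}  # machine -> number of repair actions already tried

    def next_action(self, machine):
        """Return the next step up the ladder; stay on the last step once reached."""
        step = self.attempts.get(machine, 0)
        action = REPAIR_LADDER[min(step, len(REPAIR_LADDER) - 1)]
        self.attempts[machine] = step + 1
        return action

    def mark_healthy(self, machine):
        """Reset escalation once the machine passes its health checks again."""
        self.attempts.pop(machine, None)


if __name__ == "__main__":
    repair = RepairService()
    for _ in range(3):
        print(repair.next_action("machine-42"))
    repair.mark_healthy("machine-42")
    print(repair.next_action("machine-42"))
```

Each step is cheaper and faster than the next, so the ladder only spends a reimage or a hardware return on machines that a restart did not fix.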

Learning some painful lessons on the way

Learning from previous mistakes: something the software industry really should understand by now.

Closing Guidance

As software “experts”, we know abstraction is good.

• Abstract failure away from developers – allow (most of) them to think that the environment they are writing code for is perfect

• THEY DO THIS ANYWAY!

The best BCP is no BCP

• You have systems that are on, and are being paid for – make use of them.

Aim towards Testing in Production

• Monitoring, reliability, and QA become the same thing

• Services have to harden against each other

Let the system regulate itself

• It’s way quicker at identifying issues, triaging problems, debugging, and performing repair actions

• MTTR is more important than MTTF

Granted, this isn’t easy, and there can be painful lessons

• Humans are still needed – we’re not looking for a sentient system

• This isn’t necessarily one-off, bespoke – if you build it, they will come
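One way to see why MTTR deserves the attention is the standard steady-state availability formula, MTTF / (MTTF + MTTR). The figures below are invented for illustration: automated repair that cuts recovery from an hour to a minute buys the same availability as hardware that fails 60 times less often.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any real service.
    baseline = availability(mttf_hours=1000, mttr_hours=1)           # human repair
    fast_repair = availability(mttf_hours=1000, mttr_hours=1 / 60)   # automated repair
    better_hw = availability(mttf_hours=60_000, mttr_hours=1)        # 60x better hardware

    print(f"baseline:        {baseline:.6f}")
    print(f"1-minute repair: {fast_repair:.6f}")
    print(f"60x better MTTF: {better_hw:.6f}")
```

The two improvements are equivalent on paper, but shortening repair time is usually the one within reach of software.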

© 2010 Microsoft Corporation. All rights reserved. Microsoft and Bing are trademarks of the Microsoft group of companies. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Thank You

Outlook: Sunny, with a chance of clouds

Mike Andrews - [email protected]

Backup/thoughts/ideas
