pagerduty | oscon 2016 failure testing

Post on 15-Jan-2017

TRANSCRIPT

@alperkokmen

Failure Testing AUTOMATING A SERIES OF UNFORTUNATE EVENTS

#OSCON

Alper Kokmen PRESENT

Software Engineer at PagerDuty

Surrounded by smart people

PAST

Start-ups, Microsoft

Surrounded by smart people

Goals

Start manually injecting failures.

Start automating your manual tests.

CHAOS ENGINEERING

“[T]he discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

Principles of Chaos Engineering http://principlesofchaos.org

Netflix Simian Army DIFFERENT SIMIANS FOR DIFFERENT FAILURES

PagerDuty Simian Army?

Multiple cloud providers (AWS and Azure)

Experimentation

Application-specific failure scenarios

PagerDuty Simian Human Army FAILURE FRIDAY

Time-boxed recurring meeting

Pre-announced agenda

Break things

Sign-off from service owners

Attendance

GROUND RULES

Keep monitoring & alerting

Abort if needed

Failure Friday: Agenda

Failure Friday: Process

Inject Failure

Monitor

Repeat
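
The inject, monitor, repeat cycle, combined with the "abort if needed" ground rule, can be sketched in plain Ruby. This is only an illustration under stated assumptions: `Scenario`, `run_failure_friday`, and the `healthy` callback are hypothetical names that do not come from the talk, and the health check stands in for whatever dashboards and alerts the team watches during the session.

```ruby
# Hypothetical sketch of the inject / monitor / repeat loop.
# Scenario, its fields, and run_failure_friday are illustrative names only.
Scenario = Struct.new(:name, :inject, :revert)

# Runs each scenario in order. `healthy` is a callable standing in for the
# monitoring and alerting that stays enabled during the session.
# Returns the names of the scenarios that completed without an abort.
def run_failure_friday(scenarios, healthy:)
  completed = []
  scenarios.each do |scenario|
    scenario.inject.call
    begin
      # Ground rule: abort the session if the system does not cope.
      break unless healthy.call(scenario.name)
      completed << scenario.name
    ensure
      # Always undo the injected failure, even when aborting.
      scenario.revert.call
    end
  end
  completed
end
```

The `ensure` clause is the important design choice here: an aborted session still reverts the failure that triggered the abort.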

Failure Friday: Monitoring

2 Years Later BENEFITS

System design

Knowledge sharing

Incident response training

2 Years Later ACCOMPLISHMENTS

Whole DC outages

Target multiple services at once

Distribute failure testing to teams

Automation (in progress)

Automation: Rationale

“MANY” HOSTS

- Distribute tasks to multiple people and keep executing manually.

- Watch the Operations team with envy while they use Chef and knife.

- Start automating.

PagerDuty/blender A MODULAR ORCHESTRATION ENGINE

Ruby DSL

Host Discovery (blender-chef, blender-serf)

Ranjib Dey (@RanjibDey)

PagerDuty/blender CODE

# example.rb
ssh_task 'update' do
  execute 'sudo apt-get update -y'
  members ['ubuntu01', 'ubuntu02', 'ubuntu03']
end

PagerDuty/blender EXECUTION

blend -f example.rb

Run[example.rb] started
3 job(s) computed using 'Default' strategy
Job 1 [update on ubuntu01] finished
Job 2 [update on ubuntu02] finished
Job 3 [update on ubuntu03] finished
Run finished (42.228923876 s)

PagerDuty/smoothie A SIMPLE LIBRARY OF BLENDER RECIPES

Chef Integration

Recipes for Disaster

CLI to Specify Recipes

PagerDuty/smoothie REBOOT RECIPE

def recipe__reboot(hosts)
  ssh_task 'reboot' do
    members hosts
    execute 'sudo /sbin/reboot'
    # shutdown will break ssh connection.
    ignore_failure true
  end
end

PagerDuty/smoothie UNICORN SUSPEND & RESUME RECIPES

def recipe__unicorn_suspend_master(hosts)
  ssh_task 'suspend unicorn[master] immediately' do
    members hosts
    execute 'sudo kill -s STOP `cat /u/.../pids/unicorn.pid`'
  end
end

def recipe__unicorn_resume_master(hosts)
  ssh_task 'resume unicorn[master] immediately' do
    members hosts
    execute 'sudo kill -s CONT `cat /u/.../pids/unicorn.pid`'
  end
end

PagerDuty/smoothie LATENCY RECIPE

def recipe__tc_add_latency(hosts)
  ssh_task 'add network latency using tc' do
    members hosts
    execute 'sudo tc qdisc add dev eth0 root netem delay 500ms 100ms loss 20%'
  end
end

def recipe__tc_remove_latency(hosts)
  ssh_task 'remove network latency using tc' do
    members hosts
    execute 'sudo tc qdisc del dev eth0 root netem'
  end
end
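
In the latency recipe, `delay 500ms 100ms` asks netem for a 500 ms delay with 100 ms of jitter, and `loss 20%` drops roughly one packet in five. A minimal sketch of how those values could be made configurable follows. The helper names and keyword arguments are hypothetical and not part of smoothie; the defaults simply reproduce the commands shown on the slide.

```ruby
# Hypothetical helpers; not part of PagerDuty/smoothie.
# Builds the netem command that adds delay, jitter, and packet loss.
def netem_add_command(dev: 'eth0', delay_ms: 500, jitter_ms: 100, loss_pct: 20)
  "sudo tc qdisc add dev #{dev} root netem " \
    "delay #{delay_ms}ms #{jitter_ms}ms loss #{loss_pct}%"
end

# Builds the command that removes the netem qdisc again.
def netem_del_command(dev: 'eth0')
  "sudo tc qdisc del dev #{dev} root netem"
end

puts netem_add_command                               # the command from the slide
puts netem_add_command(delay_ms: 200, jitter_ms: 50, loss_pct: 5)
puts netem_del_command
```

Keeping the add and remove commands side by side makes it harder to inject latency without a matching way to take it away.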

PagerDuty/smoothie EXECUTION

HOSTFILTER=app1 RECIPE=reboot blend -f smoothie.rb

def recipe__reboot(hosts)

PagerDuty/smoothie EXECUTION

ZONE=us-west-2a RECIPE=reboot blend -f smoothie.rb

def recipe__reboot(hosts)
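
The execution slides pair `RECIPE=reboot` with `def recipe__reboot(hosts)`, which suggests a naming convention between the environment variable and the recipe method. One way such a lookup could work is sketched below. It is an assumption for illustration, not smoothie's actual code: `Recipes`, `run_recipe`, and the stand-in recipe body are all hypothetical.

```ruby
# Hypothetical sketch of mapping RECIPE=<name> to a recipe__<name> method.
# The real smoothie implementation may work differently.
module Recipes
  # Stand-in recipe: returns a description instead of running ssh tasks.
  def self.recipe__reboot(hosts)
    hosts.map { |host| "reboot #{host}" }
  end
end

# Looks up the recipe named in env['RECIPE'] and calls it with the hosts.
# `env` can be ENV or any Hash with string keys.
def run_recipe(env, hosts, recipes: Recipes)
  name = env.fetch('RECIPE') { raise ArgumentError, 'RECIPE is required' }
  method_name = "recipe__#{name}"
  unless recipes.respond_to?(method_name)
    raise ArgumentError, "unknown recipe: #{name}"
  end
  recipes.public_send(method_name, hosts)
end
```

Rejecting unknown recipe names up front matters more than usual here, since a typo should never fall through to some other destructive action.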

Failure Friday: Blender

ZONE=us-west-2a ROLE=web-app RECIPE=monit_unmonitor

ZONE=us-west-2a ROLE=web-app RECIPE=monit_monitor

ZONE=us-west-2a ROLE=web-app RECIPE=unicorn_stop_master_gracefully

ZONE=us-west-2b ROLE=web-app RECIPE=unicorn_suspend_master

ZONE=us-west-2b ROLE=web-app RECIPE=unicorn_resume_master

ZONE=us-west-2c ROLE=web-app RECIPE=reboot

ZONE=us-west-2a ROLE=web-app RECIPE=iptables_network_isolate

ZONE=us-west-2a ROLE=web-app RECIPE=iptables_rebuild

ZONE=us-west-2b ROLE=web-app RECIPE=tc_add_latency

ZONE=us-west-2b ROLE=web-app RECIPE=tc_remove_latency
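
The run list above can also be kept as data and turned into commands. The sketch below assumes that each line is completed with `blend -f smoothie.rb`, as on the earlier execution slides; that suffix is not shown on this slide. `Step` and `blend_command` are hypothetical names.

```ruby
# Hypothetical plan format: each step names a zone, a role, and a recipe.
Step = Struct.new(:zone, :role, :recipe)

# Builds the shell command for one step, following the
# `VAR=value ... blend -f smoothie.rb` shape from the execution slides.
def blend_command(step, file: 'smoothie.rb')
  "ZONE=#{step.zone} ROLE=#{step.role} RECIPE=#{step.recipe} blend -f #{file}"
end

# An inject step followed by the step that reverts it.
plan = [
  Step.new('us-west-2b', 'web-app', 'tc_add_latency'),
  Step.new('us-west-2b', 'web-app', 'tc_remove_latency')
]

commands = plan.map { |step| blend_command(step) }
puts commands
```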

Future AUTOMATION

Build more automation for service-specific scenarios.

Scheduled runs (similar to Netflix).

Future CHATOPS

Inject failures by invoking chat commands.

Share metrics and graphs to help people follow along.

Collect TODOs during Failure Fridays and generate a report.
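
As a rough idea of what injecting failures from chat might involve, the sketch below parses a chat message into the same environment settings the smoothie commands use. The `!failure` command, its `key=value` syntax, and the allowed keys are all invented for illustration; the talk does not describe a concrete chat interface.

```ruby
# Hypothetical chat syntax: "!failure <recipe> key=value ..."
# Neither the command name nor this parser comes from the talk.
ALLOWED_KEYS = %w[zone role hostfilter].freeze

# Returns a Hash of environment settings, or nil if the message is not a
# well-formed failure command.
def parse_failure_command(message)
  words = message.strip.split
  return nil unless words.shift == '!failure'

  recipe = words.shift
  return nil if recipe.nil?

  env = { 'RECIPE' => recipe }
  words.each do |word|
    key, value = word.split('=', 2)
    # Reject anything that is not an explicitly allowed key=value pair.
    return nil unless ALLOWED_KEYS.include?(key) && value && !value.empty?
    env[key.upcase] = value
  end
  env
end
```

For example, `parse_failure_command('!failure reboot zone=us-west-2c role=web-app')` yields settings for `RECIPE`, `ZONE`, and `ROLE`, while any message with an unrecognized key is ignored.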

Future NEW TYPES OF FAILURES

Distributed Denial of Service (DDoS) attacks for services.

Impediments that come up during Incident Response.

Summary FAILURES WILL HAPPEN

Anything that can go wrong, will go wrong.

Proactively test failure handling now.

Start simple.

PROPOSED EDIT

“Experiments that aren’t introducing new insights should be automated and used to monitor ongoing health of the system. New experiments should be devised to continue to push the bounds of the system.”

Culture From Chaos by @dougbarth https://speakerdeck.com/dougbarth/culture-from-chaos

Thank you.

@alperkokmen