Freeing the Whale: How to Fail at Scale
Oliver Gould, CTO, Buoyant
QConSF, November 9, 2016


Page 1: QConSF, November 9, 2016 Freeing the Whale...Freeing the Whale How to Fail at Scale ... freeing the whale photo: Johanan Ottensooser. mesos.apache.org UC Berkeley, 2010 Twitter, 2011

Freeing the Whale How to Fail at Scale

oliver gould

cto, buoyant

QConSF, November 9, 2016

from

Page 2:

2010: A FAILWHALE ODYSSEY

Page 3:
Page 4:
Page 5:

Twitter, 2010

10^7 users

10^7 tweets/day

10^2 engineers

10^1 ops eng

10^1 services

10^1 deploys/week

10^2 hosts

0 datacenters

10^1 user-facing outages/week

https://blog.twitter.com/2010/measuring-tweets

Page 6:

objective

reliability

flexibility

Page 7:

objective

reliability

flexibility

solution

platform

SOA + devops

i.e. “microservices”

Page 8:

Resilience is an imperative: our software runs on the truly dismal computers we call datacenters. Besides being heinously complex… they are unreliable and prone to operator error.

Marius Eriksen @marius

RPC Redux

Page 9:

software you didn’t write

hardware you can’t touch

network you can’t trace

break in new and surprising ways

and your customers shouldn’t notice

Page 10:

freeing the whale

photo: Johanan Ottensooser

Page 11:

mesos.apache.org

UC Berkeley, 2010

Twitter, 2011

Apache, 2012

Abstracts compute resources

Promise: don’t worry about the hosts

Page 12:

aurora.apache.org

Twitter, 2011

Apache, 2013

Schedules processes on Mesos

Promise: no more puppet, monit, etc

Page 13:

[Diagram: Aurora (or Marathon, or …) running on Mesos across six hosts. Services: timelines, users, notifications. Instance counts shown: x800, x300, x1000.]

Page 14:

[Diagram: Aurora (or Marathon, or …) running on Mesos, now across five hosts, with one host lost (🔥). Services: timelines, users, notifications. Instance counts shown: x800, x300, x1000.]

Page 15:

service discovery

timelines users

zookeeper

create ephemeral /svc/users/node_012345
{“host”: “host-abc”, “port”: 4321}

Page 16:

service discovery

timelines users

zookeeper

watch /svc/users/*

Page 17:

service discovery

timelines users

zookeeper

GetUser(olix0r)

Page 18:

service discovery

timelines users

zookeeper

uh oh.

GetUser(olix0r)

Page 19:

service discovery

timelines users

zookeeper

client caches results

GetUser(olix0r)

Page 20:

service discovery

timelines users

zookeeper

GetUser(olix0r)

zookeeper serves empty results?!

Page 21:

service discovery

timelines users

zookeeper

service discovery is advisory

GetUser(olix0r)
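
The service discovery slides above suggest treating discovery results as advisory: a client keeps its last known-good address set instead of trusting an empty or failed lookup. A minimal Python sketch of that idea follows. The `AdvisoryResolver` class and the `lookup` callable are hypothetical names for illustration and are not part of Finagle or ZooKeeper.

```python
from typing import Callable, FrozenSet, Tuple

Address = Tuple[str, int]  # (host, port)


class AdvisoryResolver:
    """Caches the last non-empty address set returned by service discovery.

    Hypothetical helper for illustration; `lookup` stands in for a real
    discovery query such as a ZooKeeper watch on /svc/users/*.
    """

    def __init__(self, lookup: Callable[[], FrozenSet[Address]]) -> None:
        self._lookup = lookup
        self._last_good: FrozenSet[Address] = frozenset()

    def addresses(self) -> FrozenSet[Address]:
        try:
            fresh = self._lookup()
        except Exception:
            # Discovery is unreachable: fall back to the cached set.
            return self._last_good
        if fresh:
            # Only a non-empty answer replaces the cache.
            self._last_good = fresh
        return self._last_good


def demo() -> list:
    answers = iter([
        frozenset({("host-abc", 4321)}),  # healthy answer
        frozenset(),                      # discovery serves empty results
    ])
    resolver = AdvisoryResolver(lambda: next(answers))
    return [resolver.addresses(), resolver.addresses()]
```

In `demo()`, the second lookup returns an empty set, yet the resolver still hands back the address it learned from the first lookup.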

Page 22:

github.com/twitter/finagle

RPC library (JVM)

asynchronous

built on Netty

scala

functional

strongly typed

first commit: Oct 2010

Page 23:

datacenter

[1] physical

[2] link

[3] network

[4] transport

aws, azure, digitalocean, gce, …

canal, weave, …

kubernetes, mesos, swarm, …

rpc

[5] session: http/2, mux, …

[6] presentation: json, protobuf, thrift, …

business

[7] application: languages, libraries

Page 24:

“It’s slow” is the hardest problem you’ll ever debug.

Jeff Hodges @jmhodges

Notes on Distributed Systems for Young Bloods

Page 25:

observability

counters (e.g. client/users/failures)

histograms (e.g. client/users/latency/p99)

tracing
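
To make the counter and histogram examples above concrete, here is a small Python sketch of an in-memory metrics registry. The `Metrics` class and its methods are hypothetical, and the nearest-rank percentile is one of several common definitions. The metric names follow the `client/users/...` pattern shown on the slide.

```python
import math
from collections import defaultdict
from typing import Dict, List


class Metrics:
    """Toy in-memory metrics registry (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = defaultdict(int)
        self.samples: Dict[str, List[float]] = defaultdict(list)

    def incr(self, name: str, delta: int = 1) -> None:
        self.counters[name] += delta

    def observe(self, name: str, value: float) -> None:
        self.samples[name].append(value)

    def percentile(self, name: str, p: float) -> float:
        """Nearest-rank percentile over all recorded samples."""
        ordered = sorted(self.samples[name])
        if not ordered:
            raise ValueError(f"no samples recorded for {name}")
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


def demo() -> Metrics:
    m = Metrics()
    # 100 calls to the users service: latencies 1ms..100ms, two failures.
    for latency_ms in range(1, 101):
        m.observe("client/users/latency", float(latency_ms))
    m.incr("client/users/failures", 2)
    return m
```

A production histogram would normally use fixed buckets or a sketch structure rather than storing every sample; the list here only keeps the example short.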

Page 26:

tracing

Page 27:

timeouts & retries

[Diagram: web, timelines, users, and db services, with per-hop settings listed top to bottom:]

timeout=400ms retries=3

timeout=400ms retries=2

timeout=200ms retries=3

Page 28:

timeouts & retries

[Diagram: web, timelines, users, and db services, with per-hop settings listed top to bottom:]

timeout=400ms retries=3

timeout=400ms retries=2

timeout=200ms retries=3

800ms!

600ms!
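
The "800ms!" and "600ms!" callouts are consistent with a hop's worst case being its timeout multiplied by its number of attempts. The arithmetic is sketched below, on the assumption that `retries=N` on the slide means N total attempts; the helper name `worst_case_ms` is made up for this example.

```python
def worst_case_ms(timeout_ms: int, attempts: int) -> int:
    """Longest a caller can wait if every attempt runs to its timeout.

    Assumes `retries=N` on the slide means N total attempts, which is
    the reading that reproduces the 800ms and 600ms callouts.
    """
    return timeout_ms * attempts


# Per-hop settings as listed on the slide, top to bottom.
hops = [(400, 3), (400, 2), (200, 3)]
worst = [worst_case_ms(t, n) for t, n in hops]
```

Under that reading, a hop configured with `timeout=200ms retries=3` can take 600ms, which is longer than the 400ms timeout configured on the other hops, so a caller may give up while work is still in progress further down.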

Page 29:

deadlines

[Diagram: web, timelines, users, and db services, with the remaining budget shrinking at each hop:]

timeout=400ms

77ms elapsed

deadline=323ms

113ms elapsed

deadline=210ms
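
The numbers on the deadlines slide follow from subtracting the elapsed time at each hop from the remaining budget (400 - 77 = 323, then 323 - 113 = 210). A minimal Python sketch of that propagation is shown here. The `Deadline` type is hypothetical and does not mirror any particular library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deadline:
    """Remaining time budget carried along with a request (illustrative)."""

    remaining_ms: int

    def after(self, elapsed_ms: int) -> "Deadline":
        """Budget to hand to the next hop once `elapsed_ms` has been spent."""
        return Deadline(max(0, self.remaining_ms - elapsed_ms))

    @property
    def expired(self) -> bool:
        return self.remaining_ms == 0


edge = Deadline(400)                # timeout=400ms set at the edge
second_hop = edge.after(77)         # 77ms elapsed, 323ms left
third_hop = second_hop.after(113)   # 113ms elapsed, 210ms left
```

A service that receives an expired deadline can refuse the request immediately instead of doing work nobody is waiting for.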

Page 30:

retries

typical:

retries=3

Page 31:

retries

typical:

retries=3

worst-case: 300% more load!!!

Page 32:

budgets

typical: retries=3

worst-case: 300% more load!!!

better: retryBudget=20%

worst-case: 20% more load
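
A retry budget caps retries as a fraction of original requests, which is where the "20% more load" bound comes from. The sketch below is a simplified illustration under that assumption. `SimpleRetryBudget` is a made-up class, not Finagle's or linkerd's implementation, and it omits the time window a real budget would usually apply.

```python
class SimpleRetryBudget:
    """Allows retries only up to a fixed fraction of original requests.

    Simplified sketch: real implementations typically also expire old
    deposits over a time window, which is omitted here.
    """

    def __init__(self, ratio: float) -> None:
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_retry(self) -> bool:
        """Return True and spend budget if a retry is currently allowed."""
        if self.retries + 1 <= self.ratio * self.requests:
            self.retries += 1
            return True
        return False


def simulate_outage(n_requests: int, ratio: float) -> int:
    """Every request fails and asks to retry; return retries actually sent."""
    budget = SimpleRetryBudget(ratio)
    for _ in range(n_requests):
        budget.record_request()
        budget.try_retry()
    return budget.retries
```

With a 20% budget, a total outage across 1000 requests produces 200 retries, so the extra load stays bounded no matter how many calls fail.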

Page 33:

load shedding via cancellation

[Diagram: web, timelines, users, and db services; label: timeout!]

Page 34:

load shedding via cancellation

[Diagram: web, timelines, users, and db services; label: timeout!]

Page 35:

backpressure

[Diagram: web, timelines, users, and db services; labels: 1000 requests, 100 requests, 1000 requests]

Page 36:

backpressure

[Diagram: web, timelines, users, and db services; labels: 1000 failed, 💀, 1000 failed]

Page 37:

backpressure

timelines

users

web

db

100 ok

100 ok

100 ok + 900 failed/redirected/etc
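
One way to read the backpressure slides: a service that can handle 100 requests at a time admits exactly that many and pushes back on the rest, rather than accepting all 1000 and failing every one of them. Below is a minimal Python sketch using a concurrency limit. The `ConcurrencyLimiter` class is hypothetical, and a concurrency limit is only one of several possible backpressure mechanisms.

```python
class ConcurrencyLimiter:
    """Admits at most `limit` in-flight requests and rejects the rest."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.limit:
            return False  # push back instead of queueing unbounded work
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1


def burst(n_requests: int, limit: int) -> tuple:
    """Offer a burst of simultaneous requests; return (admitted, rejected)."""
    limiter = ConcurrencyLimiter(limit)
    admitted = sum(1 for _ in range(n_requests) if limiter.try_acquire())
    return admitted, n_requests - admitted
```

Calling `burst(1000, 100)` admits 100 requests and rejects 900, matching the "100 ok + 900 failed/redirected/etc" outcome; the caller then decides whether to fail, redirect, or retry the rejected ones elsewhere.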

Page 38:

request-level load balancing

lb algorithms:

• round-robin

• fewest connections

• queue depth

• exponentially-weighted moving average (ewma)

• aperture
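
As an illustration of the EWMA entry in the list above, the following Python sketch keeps an exponentially-weighted moving average of observed latency per endpoint and sends each request to the endpoint with the lowest average. `EwmaBalancer` is a hypothetical, deliberately simplified class; it is not Finagle's balancer and leaves out details such as decaying stale measurements over time.

```python
from typing import Dict


class EwmaBalancer:
    """Routes each request to the endpoint with the lowest latency EWMA.

    Simplified illustration; it is not Finagle's balancer.
    """

    def __init__(self, endpoints, alpha: float = 0.5) -> None:
        self.alpha = alpha  # weight given to the newest sample
        self.ewma: Dict[str, float] = {e: 0.0 for e in endpoints}

    def pick(self) -> str:
        # min() returns the first endpoint among ties.
        return min(self.ewma, key=self.ewma.get)

    def report(self, endpoint: str, latency_ms: float) -> None:
        old = self.ewma[endpoint]
        self.ewma[endpoint] = self.alpha * latency_ms + (1 - self.alpha) * old


def demo() -> list:
    lb = EwmaBalancer(["a", "b"])
    latency = {"a": 100.0, "b": 10.0}  # "a" is the slow replica
    picks = []
    for _ in range(6):
        target = lb.pick()
        picks.append(target)
        lb.report(target, latency[target])
    return picks
```

After a single slow response from `a`, every later request in `demo()` goes to `b`, which is the behavior that lets a latency-aware balancer route around a degraded instance without any explicit health check.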

Page 39:
Page 40:

So just rewrite everything in Finagle!?

Page 41:

linkerd

Page 42:

github.com/buoyantio/linkerd

service mesh proxy

built on finagle & netty

suuuuper pluggable

http, thrift, …

etcd, consul, kubernetes, marathon, zookeeper, …

Page 43:

Linkers and Loaders, John R. Levine, Academic Press

Page 44:

linker for the datacenter

Page 45:

logical naming

applications refer to logical names

requests are bound to concrete names

delegations express routing

/s/users

/#/io.l5d.zk/prod/users /#/io.l5d.zk/staging/users

/s => /#/io.l5d.zk/prod
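
The delegation on this slide can be read as a prefix rewrite: `/s/users` matches the `/s` rule and is bound to `/#/io.l5d.zk/prod/users`. The Python sketch below shows that rewriting step. The `bind` function is a hypothetical simplification that ignores wildcards, alternates, and weights, and the rule that later entries win is an assumption stated in the code.

```python
from typing import List, Tuple

Dentry = Tuple[str, str]  # (prefix, replacement)


def bind(name: str, dtab: List[Dentry], max_depth: int = 16) -> str:
    """Rewrite a logical name by prefix until no delegation applies.

    Later entries take precedence over earlier ones (an assumption made
    for this sketch). Simplified: no wildcards, alternates, or weights.
    """
    for _ in range(max_depth):
        for prefix, replacement in reversed(dtab):
            if name == prefix or name.startswith(prefix + "/"):
                name = replacement + name[len(prefix):]
                break
        else:
            return name  # no rule matched: the name is fully bound
    raise RuntimeError("delegation did not terminate")


prod = [("/s", "/#/io.l5d.zk/prod")]
staging = [("/s", "/#/io.l5d.zk/staging")]
```

The application always asks for `/s/users`; swapping the `prod` table for the `staging` one changes where the request lands without touching application code.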

Page 46:

per-request routing: staging

GET / HTTP/1.1
Host: mysite.com
l5d-dtab: /s/B => /s/B2

Page 47:

per-request routing: debug proxy

GET / HTTP/1.1
Host: mysite.com
l5d-dtab: /s/E => /s/P/s/E
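
Per-request routing can be pictured as appending the entries from the `l5d-dtab` header to the base delegation table for that one request, so they shadow the defaults. The Python sketch below works under that assumption. `parse_dtab` and `route` are hypothetical helpers, the `;` separator and later-entries-win rule are assumptions of this sketch, and the base table reuses the `/s => /#/io.l5d.zk/prod` delegation from the logical naming slide.

```python
from typing import List, Tuple

Dentry = Tuple[str, str]  # (prefix, destination)


def parse_dtab(header: str) -> List[Dentry]:
    """Parse 'prefix => dest; prefix => dest' into (prefix, dest) pairs."""
    entries = []
    for part in header.split(";"):
        if part.strip():
            prefix, dest = part.split("=>")
            entries.append((prefix.strip(), dest.strip()))
    return entries


def route(name: str, base: List[Dentry], override_header: str = "") -> str:
    """Resolve `name`, letting per-request entries shadow the base table."""
    dtab = base + parse_dtab(override_header)
    for _ in range(16):  # bound the number of rewrites
        for prefix, dest in reversed(dtab):  # later entries win
            if name == prefix or name.startswith(prefix + "/"):
                name = dest + name[len(prefix):]
                break
        else:
            return name
    raise RuntimeError("delegation did not terminate")


base = [("/s", "/#/io.l5d.zk/prod")]
```

Without a header, `/s/B` resolves to the production instance of B; with `l5d-dtab: /s/B => /s/B2`, the same request is sent to B2 instead, and no other traffic is affected.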

Page 48:

linkerd service mesh

transport security

service discovery

circuit breaking

backpressure

deadlines

retries

tracing

metrics

keep-alive

multiplexing

load balancing

per-request routing

service-level objectives

[Diagram: Service A, Service B, and Service C instances, each paired with its own linkerd.]

Page 49:

demo: gob’s microservice

Page 50:

[Diagram: services web and wordgen, with three l5d instances.]

Page 51:

[Diagram: services web, wordgen, and gen-v2, with four l5d instances.]

Page 52:

[Diagram: services web, wordgen, and gen-v2, with four l5d instances and namerd.]

Page 53:

github.com/buoyantio/linkerd-examples

Page 54:

linkerd roadmap

• Battle test HTTP/2

• TLS client certs

• Deadlines

• Dark Traffic

• All configurable everything

Page 55:

more at linkerd.io

slack: slack.linkerd.io

email: [email protected]

twitter:

• @olix0r

• @linkerd

thanks!