How does the Cloud Foundry Diego Project Run at Scale? and updates on .NET Support

Upload: pivotal

Post on 07-Aug-2015


TRANSCRIPT

Page 1: How does the Cloud Foundry Diego Project Run at Scale?

How does the Cloud Foundry Diego Project Run at Scale?

and updates on .NET Support

Page 2: How does the Cloud Foundry Diego Project Run at Scale?

Who’s this guy?

• Amit Gupta

• https://akgupta.ca

• @amitkgupta84

Page 3: How does the Cloud Foundry Diego Project Run at Scale?

Who’s this guy?

• Berkeley math grad school… dropout

• Rails consulting… deserter

• now I do BOSH, Cloud Foundry, Diego, etc.

Page 4: How does the Cloud Foundry Diego Project Run at Scale?

Testing Diego Performance at Scale

• current Diego architecture
• performance testing approach
• test specifications
• test implementation and tools
• results
• bottom line
• next steps

Page 5: How does the Cloud Foundry Diego Project Run at Scale?

Current Diego Architecture


Page 6: How does the Cloud Foundry Diego Project Run at Scale?

Current Diego Architecture

What’s new-ish?
• consul for service discovery
• receptor (API) to decouple from CC
• SSH proxy for container access
• NATS-less auction
• garden-windows for .NET applications

Page 7: How does the Cloud Foundry Diego Project Run at Scale?

Current Diego Architecture

Main components:

• etcd: ephemeral data store
• consul: service discovery
• receptor: Diego API
• nsync: sync CC desired state w/Diego
• route-emitter: sync with gorouter
• converger: health mgmt & consistency
• garden: containerization
• rep: sync garden actual state w/Diego
• auctioneer: workload scheduling

Page 8: How does the Cloud Foundry Diego Project Run at Scale?

Performance Testing Approach

• full end-to-end tests
• do a lot of stuff:
  – is it correct, is it performant?
• kill a lot of stuff:
  – is it correct, is it performant?
• emit logs and metrics (business as usual)
• plot & visualize
• fix stuff, repeat at higher scale*

Page 9: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

[Figure: 2×2 grid of four numbered items, #1 to #4]

Page 10: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

[Figure: the #1 to #4 grid repeated at scale factors n = x 1, x 2, x 5, x 10]

Page 11: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

• Diego does tasks and long-running processes
• launch 10n, …, 400n tasks:
  – workload distribution?
  – scheduling time distribution?
  – running time distribution?
  – success rate?
  – growth rate?
• launch 10n, …, 400n-instance LRP:
  – same questions…

Page 12: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

• Diego+CF stages and runs apps
• > cf push
• upload source bits
• fetch buildpack and stage droplet (task)
• fetch droplet and run app (LRP)
• dynamic routing
• streaming logs

Page 13: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

• bring up n nodes in parallel
  – from each node, push a apps in parallel
  – from each node, repeat this for r rounds
• a is always ≈ 20
• r is always = 40
• n starts out = 1

Page 14: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

• the pushed apps have varying characteristics:
  – 1-4 instances
  – 128M-1024M memory
  – 1M-200M source code payload
  – 1-20 log lines/second
  – crash never vs. every 30 s
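The push pattern (n nodes, a apps in parallel, r rounds) and the app mix can be sketched as a small load generator. This is only an illustration, not the team's actual harness: `AppSpec`, `random_app`, `fake_push`, `run_node` and `run_experiment` are hypothetical names, the uniform sampling, the power-of-two memory sizes and the 10% crash share per app are assumptions, and `fake_push` stands in for a real `cf push`.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class AppSpec:
    name: str
    instances: int          # 1-4
    memory_mb: int          # 128-1024
    payload_mb: int         # 1-200
    log_lines_per_sec: int  # 1-20
    crashes: bool           # False = never, True = every 30 s


def random_app(rng: random.Random, name: str) -> AppSpec:
    # Assumption: uniform sampling; the slides only give the ranges.
    return AppSpec(
        name=name,
        instances=rng.randint(1, 4),
        memory_mb=rng.choice([128, 256, 512, 1024]),
        payload_mb=rng.randint(1, 200),
        log_lines_per_sec=rng.randint(1, 20),
        crashes=rng.random() < 0.1,
    )


def fake_push(app: AppSpec) -> str:
    # Hypothetical stand-in for `cf push`; returns the pushed app's name.
    return app.name


def run_node(node: int, a: int = 20, r: int = 40, seed: int = 0) -> list[str]:
    # One test-lab node: r rounds, each pushing a apps in parallel.
    rng = random.Random(seed + node)
    pushed: list[str] = []
    for rnd in range(r):
        apps = [random_app(rng, f"node{node}-round{rnd}-app{i}") for i in range(a)]
        with ThreadPoolExecutor(max_workers=a) as pool:
            pushed.extend(pool.map(fake_push, apps))
    return pushed


def run_experiment(n: int, a: int = 20, r: int = 40) -> list[str]:
    # n nodes run their rounds in parallel with each other.
    with ThreadPoolExecutor(max_workers=n) as pool:
        per_node = pool.map(lambda node: run_node(node, a, r), range(n))
        return [name for names in per_node for name in names]
```

With a = 20 and r = 40, each node pushes 20 × 40 = 800 apps, so `run_experiment(1)` returns 800 names.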

Page 15: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

• starting with n=1:
  – app instances ≈ 1k
  – instances/cell ≈ 100
  – memory utilization across cells ≈ 90%
  – app instances crashing (by-design) ≈ 10%

Page 16: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

• evaluate:
  – workload distribution
  – success rate of pushes
  – success rate of app routability
  – times for all the things in the push lifecycles
  – crash recovery behaviour
  – all the metrics!

Page 17: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

• kill 10% of cells
  – watch metrics for recovery behaviour
• kill moar cells… and etcd
  – does system handle excess load gracefully?
• revive everything with > bosh cck
  – does system recover gracefully…
  – with no further manual intervention?

Page 18: How does the Cloud Foundry Diego Project Run at Scale?

Test Specifications

– Figure Out What’s Broke –

– Fix Stuff –

– Move On, Scale Up & Repeat –

Page 19: How does the Cloud Foundry Diego Project Run at Scale?

Test Implementation and Tools

• S3: log, graph, plot backups
• ginkgo & gomega: testing DSL
• BOSH: parallel test-lab deploys
• tmux & ssh: run test suites remotely
• papertrail: log archives
• datadog: metrics visualizations
• cicerone (custom): log visualizations

Page 20: How does the Cloud Foundry Diego Project Run at Scale?

Results

400 tasks’ lifecycle timelines, dominated by container creation

Page 21: How does the Cloud Foundry Diego Project Run at Scale?

Results

Maybe some cells’ gardens were running slower?

Page 22: How does the Cloud Foundry Diego Project Run at Scale?

Results

Grouping by cell shows uniform container creation slowdown

Page 23: How does the Cloud Foundry Diego Project Run at Scale?

Results

So that’s not it…
Also, what’s with the blue steps?

Let’s visualize logs a couple more ways
Then take stock of the questions raised

Page 24: How does the Cloud Foundry Diego Project Run at Scale?

Results

Let’s just look at scheduling (ignore container creation, etc.)

Page 25: How does the Cloud Foundry Diego Project Run at Scale?

Results

Scheduling again, grouped by which API node handled the request

Page 26: How does the Cloud Foundry Diego Project Run at Scale?

Results

And how about some histograms of all the things?

Page 27: How does the Cloud Foundry Diego Project Run at Scale?

Results

From the 400-task request from “Fezzik”:
• only 3-4 (out of 10) API nodes handle reqs?
• recording task reqs take increasing time?
• submitting auction reqs sometimes slow?
• later auctions take so long?
• outliers wtf?
• container creation takes increasing time?

Page 28: How does the Cloud Foundry Diego Project Run at Scale?

Results

• only 3-4 (out of 10) API nodes handle reqs?
  – when multiple requests look up the same address concurrently, Golang returns one DNS response to all of them; this results in only 3-4 API endpoint lookups for the whole set of tasks
• recording task reqs take increasing time?
  – API servers use an etcd client with throttling on # of concurrent requests
• submitting auction reqs sometimes slow?
  – auction requests require API node to look up auctioneer address in etcd, using throttled etcd client
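Why a throttled client makes later requests look slower can be shown with a small queueing model. This is a sketch under stated assumptions, not Diego's etcd client: `completion_times` is a hypothetical helper, all requests are assumed to arrive at t = 0, and the limit of 25 concurrent requests and the 10 ms service time used in the note below are made-up numbers.

```python
import heapq


def completion_times(service_times_ms, max_concurrent):
    """Finish time of each request when at most max_concurrent run at once.

    All requests are assumed to be submitted at t = 0, in list order.
    """
    # Each heap entry is the time at which one client slot becomes free.
    slots = [0] * max_concurrent
    heapq.heapify(slots)
    finished = []
    for service_ms in service_times_ms:
        start = heapq.heappop(slots)   # wait for the earliest free slot
        finish = start + service_ms
        heapq.heappush(slots, finish)
        finished.append(finish)
    return finished
```

For 400 requests of 10 ms each with a limit of 25, the first 25 finish at 10 ms and the last 25 at 160 ms, so observed latency climbs with position in the burst even though every call costs the same. With a limit of 400, all of them finish at 10 ms.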

Page 29: How does the Cloud Foundry Diego Project Run at Scale?

Results

• later auctions take so long?
  – reps were taking longer to report their state to auctioneer, because they were making expensive calls to garden, sequentially, to determine current resource usage
• outliers wtf?
  – combination of missing logs due to papertrail lossiness, + cicerone handling missing data poorly
• container creation takes increasing time?
  – garden team tasked with investigation
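The cost of sequential usage calls can be illustrated by comparing a one-at-a-time loop with a bounded worker pool. The slides do not say this is how the rep was changed; concurrent fetching is a common remedy swapped in here for illustration. `usage_sequential`, `usage_concurrent` and the `fetch` callback are hypothetical names, and `fetch` stands in for whatever call returns one container's resource usage.

```python
from concurrent.futures import ThreadPoolExecutor


def usage_sequential(container_ids, fetch):
    # Total time is roughly the sum of every fetch call.
    return {cid: fetch(cid) for cid in container_ids}


def usage_concurrent(container_ids, fetch, workers=8):
    # Total time is closer to the slowest batch of `workers` calls.
    ids = list(container_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(ids, pool.map(fetch, ids)))
```

Both functions return the same mapping; only the wall-clock time differs, and it grows with the number of containers per cell in the sequential case.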

Page 30: How does the Cloud Foundry Diego Project Run at Scale?

Results

Problems can come from:

• our software
  – throttled etcd client
  – sequential calls to garden
• software we consume
  – garden container creation
• “experiment apparatus” (tools and services):
  – papertrail lossiness
  – cicerone sloppiness
• language runtime
  – Golang’s DNS behaviour

Page 31: How does the Cloud Foundry Diego Project Run at Scale?

Results

Fixed what we could control, and now it’s all garden

Page 32: How does the Cloud Foundry Diego Project Run at Scale?

Results

Okay, so far, that’s just been

[Figure: the #1 to #4 grid at scale factors x 1, x 2, x 5, x 10]

Page 33: How does the Cloud Foundry Diego Project Run at Scale?

Results

Next, the timelines of pushing 1k app instances

Page 34: How does the Cloud Foundry Diego Project Run at Scale?

Results

• for the fastest pushes
  – dominated by red, blue, gold
  – i.e. upload source & CC emit “start”, staging process, upload droplet
• pushes get slower
  – growth in green, light blue, fuchsia, teal
  – i.e. schedule staging, create staging container, schedule running, create running container
• main concern: why is scheduling slowing down?

Page 35: How does the Cloud Foundry Diego Project Run at Scale?

Results

• we had a theory (blame app log chattiness)
• reproduced experiment in BOSH-Lite
  – with chattiness turned on
  – with chattiness turned off
• appeared to work better
• tried it on AWS
• no improvement

Page 36: How does the Cloud Foundry Diego Project Run at Scale?

Results

• spelunked through more logs
• SSH’d onto nodes and tried hitting services
• eventually pinpointed it:
  – auctioneer asks cells for state
  – cell reps ask garden for usage
  – garden gets container disk usage (the bottleneck)

Page 37: How does the Cloud Foundry Diego Project Run at Scale?

Results

Garden stops sending disk usage stats, scheduling time disappears

Page 38: How does the Cloud Foundry Diego Project Run at Scale?

Results

Let’s let things stew between [figure] and [figure]

Page 39: How does the Cloud Foundry Diego Project Run at Scale?

Results

Right after all app pushes, decent workload distribution

Page 40: How does the Cloud Foundry Diego Project Run at Scale?

Results

… an hour later, something pretty bad happened

Page 41: How does the Cloud Foundry Diego Project Run at Scale?

Results

• cells heartbeat their presence to etcd
• if ttl expires, converger reschedules LRPs
• cells may reappear after their workloads have been reassigned
• they remain underutilized
• but why do cells disappear in the first place?
• added more logging, hope to catch in n=2 round
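The heartbeat-and-reschedule behaviour described above can be sketched with a TTL-based presence registry. This is a toy model, not Diego's converger: `PresenceRegistry` and `converge` are hypothetical names, the clock is passed in explicitly, and placing orphaned LRPs on the least-loaded live cell is an assumption standing in for the real auction.

```python
class PresenceRegistry:
    """Tracks which cells are alive, based on heartbeats with a TTL."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._expires_at = {}

    def heartbeat(self, cell_id, now):
        # Each heartbeat pushes the cell's expiry ttl_s seconds ahead.
        self._expires_at[cell_id] = now + self.ttl_s

    def live_cells(self, now):
        return {cell for cell, t in self._expires_at.items() if t > now}


def converge(assignments, registry, now):
    """Return new LRP -> cell assignments, moving LRPs off expired cells."""
    live = sorted(registry.live_cells(now))
    if not live:
        return dict(assignments)  # nowhere to move anything
    load = {cell: 0 for cell in live}
    for cell in assignments.values():
        if cell in load:
            load[cell] += 1
    result = {}
    for lrp, cell in sorted(assignments.items()):
        if cell in load:
            result[lrp] = cell  # cell is still present, leave it alone
        else:
            # Assumption: pick the least-loaded live cell (ties by name).
            target = min(live, key=lambda c: (load[c], c))
            load[target] += 1
            result[lrp] = target
    return result
```

In this model a cell that misses its TTL loses its LRPs on the next convergence pass. When it heartbeats again it is live but holds nothing, because nothing moves work back to it, which matches the underutilization noted above.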

Page 42: How does the Cloud Foundry Diego Project Run at Scale?

Results

With the one lingering question about cell disappearance, on to n=2

[Figure: the #1 to #4 grid at scale factors x 1, x 2, x 5, x 10; the x 1 grid is marked with four ✓ and one ?]

Page 43: How does the Cloud Foundry Diego Project Run at Scale?

Results

With 800 concurrent task reqs, found container cleanup garden bug

Page 44: How does the Cloud Foundry Diego Project Run at Scale?

Results

With 800-instance LRP, found API node request scheduling serially

Page 45: How does the Cloud Foundry Diego Project Run at Scale?

Results

• we added a story to the garden backlog
• the serial request issue was an easy fix
• then, with n=2 parallel test-lab nodes, we pushed 2x the apps
  – things worked correctly
  – system was performant as a whole
  – but individual components showed signs of scale issues

Page 46: How does the Cloud Foundry Diego Project Run at Scale?

Results

Our “bulk durations” doubled

Page 47: How does the Cloud Foundry Diego Project Run at Scale?

Results

• nsync fetches state from CC and etcd to make sure CC desired state is reflected in Diego

• converger fetches desired and actual state from etcd to make sure things are consistent

• route-emitter fetches state from etcd to keep gorouter in sync

• bulk loop times doubled from n=1
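What one pass of such a bulk loop computes can be sketched as a diff between desired and actual state. This is a simplified illustration, not the nsync or converger code: `diff_state` is a hypothetical helper, and state is modelled as a plain mapping from process GUID to instance count.

```python
def diff_state(desired, actual):
    """Compare desired and actual instance counts per process GUID.

    Returns (to_start, to_stop): how many instances to add or remove.
    """
    to_start = {}
    to_stop = {}
    # Visit every GUID that appears on either side.
    for guid in desired.keys() | actual.keys():
        delta = desired.get(guid, 0) - actual.get(guid, 0)
        if delta > 0:
            to_start[guid] = delta
        elif delta < 0:
            to_stop[guid] = -delta
    return to_start, to_stop
```

Each pass touches every entry on both sides, so the work in this sketch grows linearly with the number of apps. That is at least in line with loop times doubling when the app count doubled.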

Page 48: How does the Cloud Foundry Diego Project Run at Scale?

Results

… and this happened again

Page 49: How does the Cloud Foundry Diego Project Run at Scale?

Results

– the etcd and consul story –

Page 50: How does the Cloud Foundry Diego Project Run at Scale?

Results

Fast-forward to today

[Figure: the #1 to #4 grid at scale factors x 1, x 2, x 5, x 10, annotated with ✓ and ? marks]

Page 51: How does the Cloud Foundry Diego Project Run at Scale?

Bottom Line

At the highest scale:

• 4000 concurrent tasks ✓
• 4000-instance LRP ✓
• 10k “real app” instances @ 100 instances/cell:
  – etcd (ephemeral data store) ✓
  – consul (service discovery) ? (… it’s a long story)
  – receptor (Diego API) ? (bulk JSON)
  – nsync (CC desired state sync) ? (because of receptor)
  – route-emitter (gorouter sync) ? (because of receptor)
  – garden (containerizer) ✓
  – rep (garden actual state sync) ✓
  – auctioneer (scheduler) ✓

Page 52: How does the Cloud Foundry Diego Project Run at Scale?

Next Steps

• Security
  – mutual SSL between all components
  – encrypting data-at-rest
• Versioning
  – handle breaking API changes gracefully
  – production hardening
• Optimize data models
  – hand-in-hand with versioning
  – shrink payload for bulk reqs
  – investigate faster encodings; protobufs > JSON
  – initial experiments show 100x speedup
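The payload argument for a binary encoding can be illustrated with the standard library alone. This sketch uses Python's `struct` as a stand-in for protobufs, which it is not, and the record layout (GUID, instance count, memory, state) is a made-up example rather than Diego's data model. It shows payload size only and does not attempt to reproduce the 100x figure.

```python
import json
import struct

# Hypothetical fixed layout: 16-byte GUID, uint16 instances,
# uint16 memory_mb, uint8 state. "<" means no padding: 21 bytes/record.
RECORD = struct.Struct("<16sHHB")


def encode_json(records):
    return json.dumps(records).encode("utf-8")


def encode_binary(records):
    return b"".join(
        RECORD.pack(
            bytes.fromhex(r["guid"]), r["instances"], r["memory_mb"], r["state"]
        )
        for r in records
    )


def decode_binary(blob):
    return [
        {
            "guid": guid.hex(),
            "instances": instances,
            "memory_mb": memory_mb,
            "state": state,
        }
        for guid, instances, memory_mb, state in RECORD.iter_unpack(blob)
    ]
```

The binary form drops the repeated field names and the text encoding of numbers, which is where most of the JSON bytes go in a bulk response with many similar records.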

Page 54: How does the Cloud Foundry Diego Project Run at Scale?

Updates on .NET Support

• what’s currently supported?
  – ASP.NET MVC
  – nothing too exotic
  – most CF/Diego features, e.g. security groups
  – Visual Studio plugin, similar to the Eclipse CF plugin for Java
• what are the limitations?
  – some newer Diego features, e.g. SSH
  – in α/β stage, dev-only

Page 55: How does the Cloud Foundry Diego Project Run at Scale?

Updates on .NET Support

• what’s coming up?
  – make it easier to deploy Windows cell
  – more Visual Studio plugin features
  – hardening testing/CI
• further down the line?
  – remote debugging
  – the “Spring experience”

Page 56: How does the Cloud Foundry Diego Project Run at Scale?

Updates on .NET Support

• shout outs
  – CenturyLink
  – HP
• feedback & questions?
  – Mark Kropf (PM): [email protected]
  – David Morhovich (Lead): [email protected]