DockerCon EU 2015: Placing a container on a train at 200mph
TRANSCRIPT
Placing a container on a train at 200mph
Casper S. Jensen
Software Engineer, Uber
About Me
● Joined Uber in January 2015, Compute Platform team, Aarhus office, Denmark
● PhD in CS (on a completely unrelated topic)
● Linux aficionado
● Docker “user” since February
About UBER: Why all the fuss?
The UBER app
339 Cities
61 Countries
2,000,000+ Trips/day
4,000+ Employees
Not that hard...
You just have to handle
● 24/7 availability across the globe
● Very different markets
● 1000s of developers and teams
● Adding new features like there’s no tomorrow
  ○ UberPOOL, UberKITTEN, UberICECREAM, UberEATS, UberWHATEVERYOUCANIMAGINE
● Hypergrowth in all dimensions
  ○ Datacenters, servers, infrastructure, etc.
Basically, you have to make magic happen every time a user opens the application
Software Development: The old UBER way
A fair amount of frustration
1) Write service RFC
2) Wait for feedback
3) Do all necessary scaffolding by hand
4) Start developing your service
5) Wait for infra team to write service scaffolding
6) Wait for IT to allocate servers
7) Wait for infra team to provision servers
8) Deploy to development servers and test
9) Deploy to production
10) Monitor and iterate
Steps 5–7 could take days or weeks...
It's just not scalable
But you have to start somewhere
“Make it easier for service owners to manage their local service environments.”
—Internal e-mail, February 2015
New development process
1) Write service RFC
2) Wait for feedback
3) Do all necessary scaffolding using tools
4) Start developing your service
5) Deploy to development servers and test
6) Deploy to production
7) Monitor and iterate
No silver bullets
All the things you did not consider
● Routing
● Dynamic service discovery
● Deployment
● Placement engine
● Logging and tracing
● Dual build environments
● Handling of secrets
● Security updates
● Private repositories
● Replicating images across multiple datacenters
Also, how much freedom do you really want to give your developers?
Change all the things! Let’s go through some examples
uDeploy
● Rolling upgrades
● Automatic rollbacks on failure (see the sketch below)
● Health checks, stats, exceptions
  ○ Load- and system-tests
● Service building
● Build replication
● 4,000+ upgrades/week
● 3,000+ builds/week
● 300+ rollbacks/week
● 600+ managed services
Our in-house deployment/cluster management system
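uDeploy itself is internal and no code was shown in the talk. As a minimal sketch only, here is the rolling-upgrade-with-automatic-rollback idea; deploy_to and health_check are hypothetical placeholders, not uDeploy's API:

def deploy_to(host, build):
    # Hypothetical placeholder: start the given build on a host
    # (in reality, pulling and running a container).
    print(f"deploying {build} to {host}")

def health_check(host):
    # Hypothetical placeholder: probe the service's health endpoint.
    return True

def rolling_upgrade(hosts, new_build, old_build, batch_size=2):
    """Upgrade hosts in small batches; if a batch fails its health
    checks, restore the old build on every host touched so far."""
    upgraded = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            deploy_to(host, new_build)
            upgraded.append(host)
        if not all(health_check(h) for h in batch):
            # Automatic rollback across all upgraded hosts
            for host in upgraded:
                deploy_to(host, old_build)
            raise RuntimeError("health checks failed; rolled back")

rolling_upgrade(["web1", "web2", "web3", "web4"], "build-42", "build-41")

Upgrading in small batches is what keeps the 300+ rollbacks/week cheap: a bad build is caught while most hosts still run the old one.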
Moving to docker with zero downtime
Build multiplexing
We want to keep on trucking while migrating to docker
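The multiplexer's code was not shown; a minimal sketch of the idea, with the legacy packaging script and registry name made up for illustration, could look like:

import subprocess

def build_all(service_dir, version):
    """Build multiplexing: one build request produces both the legacy
    artifact and a Docker image, so either deploy path stays usable
    while services migrate at their own pace."""
    # Legacy path (hypothetical packaging script from the pre-Docker days)
    subprocess.run(["./package-legacy.sh", version],
                   cwd=service_dir, check=True)
    # Docker path: a standard image build tagged with the same version
    subprocess.run(
        ["docker", "build", "-t", f"registry.local/my-service:{version}", "."],
        cwd=service_dir, check=True)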
Build process & scaffolding
Declarative build scripts
● Service configuration in git
● Preset service frameworks
● Many options
● Generator creating (see the sketch after the example config)
  ○ Dockerfile
  ○ Health checks
  ○ Entry point scripts inside container
  ○ In general, all glue between host and service
● Possible to supply custom Dockerfile
service_name: test-uber-service
owning_team: udeploy
backend_port: 123
frontend_port: 456
service_type: clay_wheel
clay_wheel:
  celeries:
    - queue: test-uber-service
      has_celerybeat: true
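The generator is internal, but the mechanism is plain templating. A minimal sketch, assuming PyYAML and a made-up Dockerfile template (the real generator also emits health checks and entry-point scripts):

import yaml  # PyYAML

CONFIG = """\
service_name: test-uber-service
backend_port: 123
"""

# Hypothetical template, for illustration only
DOCKERFILE_TEMPLATE = """\
FROM python:2.7
COPY . /app/{service_name}
WORKDIR /app/{service_name}
EXPOSE {backend_port}
CMD ["./entrypoint.sh"]
"""

def generate_dockerfile(config_text):
    # Fill the template from the declarative service config
    cfg = yaml.safe_load(config_text)
    return DOCKERFILE_TEMPLATE.format(**cfg)

print(generate_dockerfile(CONFIG))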
Image replication
● Multiple datacenters
● Images must be stored within DCs
● Build once, replicate everywhere
● Traffic restrictions: push but not pull

Current setup
● Stock docker registry
● File back-end
● Docker-mover
● Syncing images using pull/push
● Use notification API to speed up replication (see the sketch below)
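Docker-mover is internal, but the registry's notification API (the registry POSTs JSON push events to a configured webhook) is public. A simplified sketch of notification-driven replication, with the registry hostnames made up and the event payload reduced to the fields used here:

from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import subprocess

LOCAL = "dc1-registry.local:5000"   # hypothetical names
PEERS = ["dc2-registry.local:5000", "dc3-registry.local:5000"]

def replicate(repo, tag):
    # Pull the fresh image from the local registry, then push it out to
    # each peer DC (traffic rules allow push, but not pull, across DCs).
    src = f"{LOCAL}/{repo}:{tag}"
    subprocess.run(["docker", "pull", src], check=True)
    for peer in PEERS:
        dst = f"{peer}/{repo}:{tag}"
        subprocess.run(["docker", "tag", src, dst], check=True)
        subprocess.run(["docker", "push", dst], check=True)

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The registry posts an envelope of events; react to pushes only.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        for event in json.loads(body).get("events", []):
            if event.get("action") == "push":
                replicate(event["target"]["repository"],
                          event["target"].get("tag", "latest"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8080), Handler).serve_forever()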
Service discovery & routing
● Previously, we used HAProxy + scripts to do this
● Now, we use Hyperbahn + TChannel RPC
  https://github.com/uber/{hyperbahn|tchannel}
  ○ Used for docker and legacy services
  ○ Required in order to move containers around in seconds
  ○ Dynamic routing, circuit breaking, retries, rate limiting, load balancing (toy illustration below)
  ○ Completely dynamic, no fixed ports
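Hyperbahn and TChannel are the real projects (linked above); the toy below is not their API, just an illustration of why dynamic registration plus circuit breaking lets containers move without fixed ports:

import random

class Router:
    """Toy router, for illustration only; see the repos above."""
    def __init__(self, max_failures=3):
        self.instances = {}   # service name -> ["host:port", ...]
        self.failures = {}    # instance -> consecutive failure count
        self.max_failures = max_failures

    def register(self, service, instance):
        # Containers announce themselves on start-up, so no config file
        # or fixed port assignment has to be edited when they move.
        self.instances.setdefault(service, []).append(instance)

    def pick(self, service):
        # Load balance across healthy instances; instances whose
        # circuit has tripped are skipped until they recover.
        healthy = [i for i in self.instances.get(service, [])
                   if self.failures.get(i, 0) < self.max_failures]
        if not healthy:
            raise RuntimeError(f"no healthy instance of {service}")
        return random.choice(healthy)

    def report(self, instance, ok):
        # Circuit breaking: trip after repeated failures, reset on success.
        self.failures[instance] = 0 if ok else self.failures.get(instance, 0) + 1

router = Router()
router.register("geo", "10.4.1.7:31025")  # port chosen by the scheduler
print(router.pick("geo"))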
Key Take-Aways
The good & the bad

The good:
● Remove team dependencies
● More freedom
● Not tied to specific frameworks or versions (hi, Python 3)
● Easy to experiment with new technologies

The bad:
● Too much freedom
● Non-trivial integrating with a large running system
● Infrastructure must be dynamic throughout
● Containers are only a minor part of the infrastructure, don't forget that
Current and future wins
● Today, 30% of all services in docker
● Soon-ish, 100%
● Great improvements in provisioning time (done)
● Framework and service owners can manage their own environment (done)
● Faster and automatic scaling of capacity (in progress)
Thank you!
Casper S. Jensen
[email protected]