TRANSCRIPT
Building Reliable Systems From Unreliable Parts
You might think that the problems at large scale are different from the problems at small scale, but it's failure all the time either way, and it only gets worse at large scale.
Plan for it.
All systems fail
Tell the story about last week's failure, when a developer pushed a small Java warmup script and took out all of Netflix's API servers.
Jonah Horowitz
(Site Reliability Engineer at Netflix and elsewhere)
Home-built BBS in 1990
Some NOC/Helpdesk work
Walmart.com in 2000
BSEE from the Univ of Cincinnati
Music startup in 2005
Telecom startup in 2007
Advertising companies
Netflix
Talk about Netflix scale: 100k servers, 80M users, 30TB/s network traffic, 800 microservices, 1500 engineers.
Chaos is your friend
Talk about Chaos Monkey, Chaos Kong
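The Chaos Monkey idea can be sketched in a few lines. This is a toy model for illustration only, not Netflix's actual implementation (the real tool terminates live EC2 instances through the AWS API); the cluster and instance names are hypothetical.

```python
import random

def chaos_monkey(clusters, rng=random):
    """Terminate one random instance per cluster, so every cluster
    must be able to survive losing any single instance."""
    killed = {}
    for name, instances in clusters.items():
        if instances:
            victim = rng.choice(sorted(instances))
            instances.remove(victim)  # simulate terminating the instance
            killed[name] = victim
    return killed

clusters = {"api": {"api-1", "api-2", "api-3"},
            "edge": {"edge-1", "edge-2"}}
print(chaos_monkey(clusters))  # one random instance gone from each cluster
```

Chaos Kong applies the same idea at the level of an entire AWS region rather than a single instance.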
Stateless services are awesome
Most of the 800 microservices at Netflix are stateless, which means any instance can fail without losing data.
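Here's why statelessness makes failure cheap: since no instance holds session state, a failed call can simply be retried against any other instance. A minimal sketch, where `send` and the instance names are hypothetical stand-ins for a real RPC layer:

```python
import random

def call_with_retry(instances, request, send, retries=3, pick=random.choice):
    """Try the request on up to `retries` instances; any peer can serve it."""
    last_error = None
    for _ in range(retries):
        instance = pick(instances)
        try:
            return send(instance, request)
        except ConnectionError as err:
            last_error = err  # that instance is down; retry elsewhere
    raise last_error

# Simulated backend: one instance is dead, the others respond normally.
def send(instance, request):
    if instance == "api-2":
        raise ConnectionError("api-2 is down")
    return f"{instance} handled {request}"
```

With a stateful service this pattern breaks down, because the retry has to land on the instance that holds the session.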
Store state somewhere
Globally replicated Cassandra database rings with a massive number of nodes, but you should still have 2 copies of your database.
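As a rough sketch of why losing a node doesn't lose data, here's a toy consistent-hashing ring in the style Cassandra uses: each key hashes to a position, and the data is written to the next `replication_factor` nodes clockwise. The node names and the replication factor of 2 are illustrative assumptions, not Netflix's configuration.

```python
import bisect
import hashlib

def token(name):
    """Map a node or key name to a position on the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication_factor=2):
        self.nodes = sorted(nodes, key=token)   # nodes ordered by ring position
        self.tokens = [token(n) for n in self.nodes]
        self.rf = replication_factor

    def replicas(self, key):
        """Return the rf nodes clockwise from the key's ring position."""
        i = bisect.bisect(self.tokens, token(key)) % len(self.nodes)
        return [self.nodes[(i + k) % len(self.nodes)] for k in range(self.rf)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:42"))  # the two nodes holding this key
```

Because every key lives on multiple nodes, any single node can be rebooted or lost while reads and writes continue against its replicas.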
Repair Automatically
Never have an engineer do something that can be done automatically. Computers are better at pushing buttons than you are.
Talk about rebootageddon: zero downtime even though 1/3 of our Cassandra servers were rebooted over 48 hours.
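The "repair automatically" principle can be sketched as a remediation loop: poll each instance's health and replace the dead ones without paging a human. `check_health` and `replace` here are hypothetical hooks, not a real Netflix API.

```python
def repair_pass(instances, check_health, replace):
    """Replace every unhealthy instance in place; return what was repaired."""
    repaired = []
    for inst in list(instances):        # copy, since we mutate the fleet
        if not check_health(inst):
            instances.remove(inst)      # drop the failed instance
            instances.append(replace(inst))  # bring up a fresh one
            repaired.append(inst)
    return repaired
```

Run on a schedule, a loop like this is what lets a third of a fleet be rebooted over 48 hours with zero downtime and no human in the loop.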
Even in open source projects.
Culture is important
Jonah Horowitz
Site Reliability Engineer
Netflix's lawyers didn't approve my talk, so everything I said was my own opinion.
The speakers here were really inspiring.