Autobots @ realestate.com.au
Automate ALL THE THINGS (or at least the things that really matter)
Technical Lead @ realestate.com.au
Who am I?
@ggiesemann @geekle
Some newcomers mistake me for a sysadmin :(
Geoffrey Giesemann
(and what the hell am I talking about?)
What's the problem?
WTF is going on?!?
I hope my wife still recognises me...
(deployments are *hard*)
(just the usual suspects)
Why so hard?
(and who do we blame?)
How do we fix this?
deploybot
with_down_in_load_balancer(server) do
  with_down_in_nagios(server) do
    puppet(server)
    raise "borked!" unless server.working?
  end
end
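One way the wrapper methods in the snippet above could be built is sketched below. This is not the real deploybot code: the EVENTS recorder stands in for the load balancer and Nagios APIs, and the `deploy` method and its `working:` flag are hypothetical. The sketch assumes a server is only put back into service when the block succeeds, so a borked server stays out of rotation.

```ruby
# Hypothetical recorder standing in for the real load balancer / Nagios calls.
EVENTS = []

# Take the server out of the load balancer, run the block, put it back.
def with_down_in_load_balancer(server)
  EVENTS << [:lb_down, server]
  result = yield
  EVENTS << [:lb_up, server]      # skipped if the block raises
  result
end

# Silence Nagios alerts for the server while the block runs.
def with_down_in_nagios(server)
  EVENTS << [:nagios_down, server]
  result = yield
  EVENTS << [:nagios_up, server]  # skipped if the block raises
  result
end

# Hypothetical driver mirroring the slide; `working` replaces server.working?
def deploy(server, working:)
  with_down_in_load_balancer(server) do
    with_down_in_nagios(server) do
      EVENTS << [:puppet, server]  # stand-in for the puppet run
      raise "borked!" unless working
    end
  end
end
```

Wrapping the re-enable step in `ensure` instead would always restore the server, even after a failed deploy. Which behaviour is right depends on whether the load balancer's own monitor can be trusted to keep a broken server out.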
schemabot
What's worked well?
Service Discovery!
$ grep -r 'aa01' nagios/ | wc -l
17
$ grep -r 'aa01' puppet/ | wc -l
4
$ grep -r 'aa01' deploybot/ | wc -l
0
> show lb vserver agentadmin-prod
agentadmin-prod (125.56.204.120:80) - HTTP Type: ADDRESS State: UP
... blah blah ...
Bound Service Groups:
1) Group Name: agentadmin
1) agentadmin (192.168.25.1: 80) - HTTP State: UP Weight: 1
Persistence Cookie Value : my_random_str=9999
2) agentadmin (192.168.25.2: 80) - HTTP State: UP Weight: 1
Persistence Cookie Value : my_random_str=9999
... etc etc ...
> show servicegroup agentadmin
agentadmin - HTTP
State: ENABLED Monitor Threshold : 0
Max Conn: 0 Max Req: 0 Max Bandwidth: 0 kbits
Monitor Name: http-diagnositic-warmup State: ENABLED Weight: 1
1) 192.168.25.1:80 State: UP Server Name: 192.168.25.1
Probes: 131205 Failed [Total: 2525 Current: 0]
Last response: Success - HTTP response code 200 received.
2) 192.168.25.2:80 State: UP Server Name: 192.168.25.2
Probes: 131322 Failed [Total: 2428 Current: 0]
Last response: Success - HTTP response code 200 received.
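Since deploybot contains no hard-coded hostnames, the member list has to come from somewhere, and the load balancer output above is one candidate. A minimal sketch of turning that output into a server list follows. The method name and the regular expression are assumptions based only on the sample shown here, not on any documented output format.

```ruby
# Extract service group members from `show servicegroup` style output.
# Matches lines such as "1) 192.168.25.1:80 State: UP Server Name: ...".
MEMBER_LINE = /^\s*\d+\)\s+(\d{1,3}(?:\.\d{1,3}){3}):(\d+)\s+State:\s+(\S+)/

def parse_servicegroup(output)
  output.scan(MEMBER_LINE).map do |ip, port, state|
    { ip: ip, port: port.to_i, state: state }
  end
end

# Sample copied from the output above.
SAMPLE_OUTPUT = <<~OUT
  agentadmin - HTTP
  State: ENABLED Monitor Threshold : 0
  Monitor Name: http-diagnositic-warmup State: ENABLED Weight: 1
  1) 192.168.25.1:80 State: UP Server Name: 192.168.25.1
  Probes: 131205 Failed [Total: 2525 Current: 0]
  2) 192.168.25.2:80 State: UP Server Name: 192.168.25.2
  Probes: 131322 Failed [Total: 2428 Current: 0]
OUT

members = parse_servicegroup(SAMPLE_OUTPUT)
members.each { |m| puts "#{m[:ip]}:#{m[:port]} is #{m[:state]}" }
```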
Standardised Monitoring!
$ curl http://aa01/diagnostic/status/nagios
OK - the application is functioning correctly
$ curl http://ar01/diagnostic/status/nagios
OK - the application is functioning correctly
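With every app exposing the same status page, a single check can cover them all. The sketch below assumes the page follows the Nagios plugin convention of starting with "OK" when healthy, as in the curl examples above. The method names are hypothetical, and anything that fails to respond is treated as unhealthy.

```ruby
require "net/http"
require "uri"

# True when a status body starts with "OK" (Nagios plugin convention).
def healthy_status?(body)
  body.to_s.strip.start_with?("OK")
end

# Fetch the status page for a host and interpret it.
# The path comes from the curl examples; the helper itself is an assumption.
def app_healthy?(host)
  body = Net::HTTP.get(URI("http://#{host}/diagnostic/status/nagios"))
  healthy_status?(body)
rescue StandardError
  false  # unreachable or erroring servers count as unhealthy
end
```

A deploy step could then call `app_healthy?("aa01")` after the puppet run instead of carrying per-application checks.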
● custom app health checks
  ○ https://github.com/tribune/is_it_working
  ○ https://github.com/blythedunham/health_monitor
● monitor requests in munin
  ○ https://github.com/pka/rack-monitor
Better Error Handling!
● Don't bother trying to handle errors - just make sure you can recover from them quickly!
● As long as you have some app servers alive things will work out!
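One reading of the two points above is a rolling deploy that gives up at the first failure instead of trying to repair it, leaving the untouched servers to carry the traffic. The sketch below illustrates that idea only. `rolling_deploy` and the `deploy_one` callable are hypothetical names, not part of the original tooling.

```ruby
# Deploy to servers one at a time. Stop at the first failure so the
# servers not yet touched keep serving traffic on the old version.
def rolling_deploy(servers, deploy_one)
  deployed = []
  servers.each do |server|
    begin
      deploy_one.call(server)
      deployed << server
    rescue StandardError => e
      # No attempt to fix the error: report it and leave the rest alone.
      return { deployed: deployed, failed: server, error: e.message,
               untouched: servers - deployed - [server] }
    end
  end
  { deployed: deployed, failed: nil, error: nil, untouched: [] }
end

# Usage with a fake deploy step that breaks on the second server.
fake_deploy = ->(server) { raise "borked!" if server == "aa02" }
result = rolling_deploy(%w[aa01 aa02 aa03], fake_deploy)
puts "failed on #{result[:failed]}, still serving: #{result[:untouched].join(', ')}"
```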
FIN
● http://markerguru.deviantart.com/
● http://www.quickmeme.com/
● http://www.doingitwrong.com/
● Everyone who was part of @rea_autobots
Thanks!