atmosphere 2013
DESCRIPTION
Wikia is one of the largest websites in the world. We serve millions of webpages every day. This talk presents an overview of the top 5 issues Wikia Operations encountered with our production setup in 2012, what we learned, and what we improved.
TRANSCRIPT
Reliability at scale, lessons learned at Wikia
Łukasz Jagiełło, Paweł Rein
Wikia by numbers
○ 1.5B Global Monthly Page Views,
○ 92M Global Unique Visitors,
○ 20M Pages of Content,
○ 16M Global Mobile Unique Visitors,
○ 300K Wikis in over 200 languages,
○ comScore Top 100 and Quantcast Top 50
even a small change can have a big impact
monitoring
200k ganglia metrics for main colo
4500 service checks in nagios
Pingdom, Websitepulse, Keynote
NewRelic
Pagerduty
distributed team
communication is king
(cc) http://www.flickr.com/photos/markusschoepke/90287837/
communication
○ irc
○ opslog
○ hand-off emails
○ code reviews
This slide is stolen from Dave Zwieback's presentation seen at Velocity Conference.
failures easy to handle
this slide is stolen from Artur Bergman's presentation
automated failover
descriptive alerts
readable code
Lessons learned
"I don't believe in Google Index"
anonymous Wikia developer
noindex, nofollow
○ 150-200M page views lost
○ 1 week to get back into Google
lessons learned
- started monitoring metatags
- additional step to QA checklist
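A metatag check of the kind mentioned above could look roughly like the sketch below. It is not the check Wikia actually runs; the function name `blocking_directives` and the set of directives treated as blocking are assumptions made for illustration. It parses a page's HTML and reports any robots directives that would keep the page out of a search index.

```python
from html.parser import HTMLParser

# Directives that tell crawlers not to index or follow a page (assumed set).
BLOCKING = {"noindex", "nofollow", "none"}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() not in ("robots", "googlebot"):
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def blocking_directives(html):
    """Return the sorted list of blocking robots directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return sorted(parser.directives & BLOCKING)
```

A service check could fetch a handful of representative article pages, call `blocking_directives` on each response body, and raise a critical alert whenever the result is not empty.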
Iowa DC outage
(cc) http://www.flickr.com/photos/daquellamanera/143760074/
(cc) http://www.flickr.com/photos/pingu1963/3595010054/
lessons learned
- a backup DC failing once per year for 3h is OK, but if you want active-active, that is no longer OK
unexpected PHP upgrade
lessons learned
- use internal repos
- merge with upstream manually
- in Chef, declare version in package resource
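Pinning versions is easier to trust when something also verifies the pins. The sketch below is one possible complement to the lessons above, not something described in the talk: it compares a map of pinned versions with a map of installed versions and reports any drift. The function name, the package name, and the version strings are all hypothetical.

```python
def find_version_drift(pinned, installed):
    """Return {package: (pinned_version, installed_version)} for every mismatch.

    A package that is pinned but not installed is reported with None
    as its installed version.
    """
    drift = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        if found != wanted:
            drift[name] = (wanted, found)
    return drift


# Hypothetical package name and versions, for illustration only.
pinned = {"php5": "5.3.10-1"}
installed = {"php5": "5.4.0-1"}
print(find_version_drift(pinned, installed))
```

An empty result means every pinned package is at its declared version; anything else is worth an alert before the next deploy.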
leap second of death
http://www.flickr.com/photos/iandavid/2676696330/
lessons learned
- expect the unexpected ;)
null code release
lessons learned
- buy it and build monitoring and backup for it
\018 instead of \021
All varnishes crashed within minutes...
lessons learned
- inline C functions moved away from our vcl's
- change to C function needs to be requested
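The slides do not say exactly how `\018` broke the inline C, so the following is only one plausible reading. In C string literals, and in Python ones, an octal escape stops at the first character that is not an octal digit. `8` is not an octal digit, so `\018` is parsed as the byte `\01` followed by a literal `8`, while `\021` is a single byte with value 17. The snippet shows the difference in Python, whose octal escape rule matches C's in this respect.

```python
bad = "\018"   # parsed as "\01" followed by the character "8"
good = "\021"  # a single character with octal value 21 (decimal 17)

print(len(bad), [hex(ord(c)) for c in bad])    # 2 ['0x1', '0x38']
print(len(good), [hex(ord(c)) for c in good])  # 1 ['0x11']
```

A literal that silently becomes one byte longer than intended is the sort of thing that can overrun a fixed-size buffer in C, which would be consistent with every Varnish instance crashing, though that link is an assumption here.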
consistently?
lessons learned
- buy it and build monitoring for it
do your homework
- always do post-mortems and blameless RCAs
http://www.wikia.com/Careers
Thank you!