atmosphere 2013

Post on 07-Dec-2014

DESCRIPTION

Wikia is one of the largest websites in the world. We serve millions of webpages every day. This talk presents an overview of the top 5 issues Wikia Operations encountered with our production setup in 2012, along with what we learned and the improvements we made.

TRANSCRIPT

Reliability at scale, lessons learned at Wikia

Łukasz Jagiełło, Paweł Rein

Wikia by numbers

○ 1.5B Global Monthly Page Views,

○ 92M Global Unique Visitors,

○ 20M Pages of Content,

○ 16M Global Mobile Unique Visitors,

○ 300K Wikis with over 200 languages,

○ comScore Top 100 and Quantcast Top 50

even a small change can have a big impact

monitoring

200k ganglia metrics for main colo

4500 service checks in nagios

Pingdom, Websitepulse, Keynote

NewRelic

Pagerduty

distributed team

communication is king

(cc) http://www.flickr.com/photos/markusschoepke/90287837/

communication

○ irc

○ opslog

○ hand-off emails

○ code reviews

This slide is stolen from Dave Zwieback's presentation seen at Velocity Conference.

failures easy to handle

this slide is stolen from Artur Bergman's presentation

automated failover

descriptive alerts

readable code

Lessons learned

"I don't believe in Google Index"

anonymous Wikia developer

noindex, nofollow

○ 150-200 million page views lost

○ 1 week to get back into Google's index

lessons learned

- started monitoring metatags

- additional step to QA checklist
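A metatag check like the one mentioned above could look roughly like the following. This is a minimal sketch, not Wikia's actual check: the names `RobotsMetaParser` and `blocked_directives` are hypothetical, and it assumes the page HTML has already been fetched by whatever monitoring system runs the check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def blocked_directives(html):
    """Return the sorted robots directives that would hide a page from search engines."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return sorted(parser.directives & {"noindex", "nofollow", "none"})
```

A service check could run this against a handful of representative production pages and alert whenever the returned list is non-empty.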

Iowa DC outage

(cc) http://www.flickr.com/photos/daquellamanera/143760074/

(cc) http://www.flickr.com/photos/pingu1963/3595010054/

lessons learned

- a backup DC failing once per year for 3h is OK, but that is no longer acceptable if you want active-active

unexpected PHP upgrade

lessons learned

- use internal repos

- merge with upstream manually

- in Chef, declare version in package resource
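The slides only name the Chef package resource as the place to pin versions. As a complementary idea that is not taken from the talk, the sketch below compares a pinned manifest against what is actually installed, so an unexpected upgrade shows up as an alert rather than as an outage. The function name `find_version_drift`, the package names, and the version strings are all hypothetical.

```python
def find_version_drift(pinned, installed):
    """Return {package: (expected, actual)} for each pinned package that drifted.

    `actual` is None when a pinned package is not installed at all.
    """
    drift = {}
    for name, expected in pinned.items():
        actual = installed.get(name)
        if actual != expected:
            drift[name] = (expected, actual)
    return drift


# Hypothetical data: in practice `installed` would come from the package manager.
pinned = {"php5": "5.3.10-1", "varnish": "3.0.2-1"}
installed = {"php5": "5.4.0-1", "varnish": "3.0.2-1"}
drifted = find_version_drift(pinned, installed)
```

Packages that are installed but not pinned are deliberately ignored here; the check only covers versions someone chose to declare.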

leap second of death

lessons learned

- expect the unexpected ;)

null code release

lessons learned

- buy it and build monitoring and backup for it

\018 instead of \021

All varnishes crashed within minutes...

lessons learned

- inline C functions moved out of our VCLs

- any change to a C function needs to be requested

consistently?

lessons learned

- buy it and build monitoring for it

do your homework

- always do post-mortems and blameless RCAs

http://www.wikia.com/Careers

Thank you!
