atmosphere 2013

Reliability at scale, lessons learned at Wikia Łukasz Jagiełło, Paweł Rein

Upload: lukasz-jagiello

Post on 07-Dec-2014

DESCRIPTION

Wikia is one of the largest websites in the world. We serve millions of webpages every day. This talk presents an overview of the top 5 issues Wikia Operations encountered with our production setup in 2012, along with our learnings and improvements.

TRANSCRIPT

Page 1: Atmosphere 2013

Reliability at scale, lessons learned at Wikia

Łukasz Jagiełło, Paweł Rein

Page 2: Atmosphere 2013
Page 3: Atmosphere 2013

Wikia by numbers

○ 1.5B Global Monthly Page Views,

○ 92M Global Unique Visitors,

○ 20M Pages of Content,

○ 16M Global Mobile Unique Visitors,

○ 300K Wikis with over 200 languages,

○ comScore Top 100 and Quantcast Top 50

Page 4: Atmosphere 2013
Page 5: Atmosphere 2013
Page 6: Atmosphere 2013

even a small change can have a big impact

Page 7: Atmosphere 2013
Page 8: Atmosphere 2013

monitoring

○ 200k ganglia metrics for main colo

○ 4500 service checks in nagios

○ Pingdom, Websitepulse, Keynote

○ NewRelic

Page 9: Atmosphere 2013

NewRelic

Page 10: Atmosphere 2013

Pagerduty

Page 11: Atmosphere 2013

distributed team

Page 12: Atmosphere 2013

communication is king

(cc) http://www.flickr.com/photos/markusschoepke/90287837/

Page 13: Atmosphere 2013

communication

○ irc

○ opslog

○ hand-off emails

○ code reviews

Page 14: Atmosphere 2013
Page 15: Atmosphere 2013
Page 16: Atmosphere 2013

This slide is stolen from Dave Zwieback's presentation seen at Velocity Conference.

Page 17: Atmosphere 2013

failures easy to handle

Page 18: Atmosphere 2013

this slide is stolen from Artur Bergman's presentation

Page 19: Atmosphere 2013

automated failover

Page 20: Atmosphere 2013

descriptive alerts

Page 21: Atmosphere 2013

readable code

Page 22: Atmosphere 2013

Lessons learned

Page 23: Atmosphere 2013
Page 24: Atmosphere 2013

"I don't believe in Google Index"

anonymous Wikia developer

Page 25: Atmosphere 2013

noindex, nofollow

○ 150-200 million page views lost

○ 1 week to get back into Google

Page 26: Atmosphere 2013
Page 27: Atmosphere 2013

lessons learned

- started monitoring meta tags

- added a step to the QA checklist
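
A meta tag check of the kind this lesson describes could look roughly like the sketch below. It is not Wikia's actual monitoring: the class and function names are hypothetical, and it assumes the page HTML has already been fetched by whatever check runner is in use. It only reports robots directives that would hide a page from search engines.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def blocking_directives(html):
    """Return the robots directives that would keep the page out of search results."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return sorted(parser.directives & {"noindex", "nofollow", "none"})


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(blocking_directives(page))
```

A monitoring check would alert whenever `blocking_directives` returns a non-empty list for a page that is supposed to be indexed.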

Page 28: Atmosphere 2013

Iowa DC outage

Page 29: Atmosphere 2013
Page 30: Atmosphere 2013

(cc) http://www.flickr.com/photos/daquellamanera/143760074/

Page 31: Atmosphere 2013

(cc) http://www.flickr.com/photos/pingu1963/3595010054/

Page 32: Atmosphere 2013

lessons learned

- a backup DC failing once a year for 3h is OK, but that is no longer acceptable if you want active-active

Page 33: Atmosphere 2013

unexpected PHP upgrade

Page 34: Atmosphere 2013
Page 35: Atmosphere 2013

lessons learned

- use internal repos

- merge with upstream manually

- in Chef, declare version in package resource
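
The version pin itself belongs in the Chef cookbook. As a complementary sketch in Python, a host check could compare the versions a machine reports against the pinned ones and flag any drift. Everything here is an assumption made for illustration: the function name, the package names and the version strings are made up, and how the installed versions are collected is left out.

```python
def find_version_drift(pinned, installed):
    """Return {package: (expected, actual)} for every package that is off its pin.

    `actual` is None when the package is not installed at all.
    """
    drift = {}
    for name, expected in pinned.items():
        actual = installed.get(name)
        if actual != expected:
            drift[name] = (expected, actual)
    return drift


if __name__ == "__main__":
    # Hypothetical pins and host state, for illustration only.
    pinned = {"php5": "1.0.0-1", "varnish": "2.0.0-1"}
    installed = {"php5": "1.0.1-1", "varnish": "2.0.0-1"}
    print(find_version_drift(pinned, installed))
```

An empty result means the host matches its pins; anything else points at a package that was upgraded outside the intended release path.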

Page 36: Atmosphere 2013

leap second of death

Page 38: Atmosphere 2013
Page 39: Atmosphere 2013
Page 40: Atmosphere 2013

lessons learned

- expect the unexpected ;)
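
The slides give no detail on what actually failed. As a small illustration of why a leap second catches software off guard, and assuming the slide refers to the leap second inserted at the end of 30 June 2012, the Python sketch below shows two standard-library APIs disagreeing about whether 23:59:60 is a valid time. The function names are hypothetical.

```python
import time
from datetime import datetime


def parse_leap_second():
    """time.strptime accepts a seconds value of 60, so this timestamp parses."""
    return time.strptime("2012-06-30 23:59:60", "%Y-%m-%d %H:%M:%S")


def build_leap_second_datetime():
    """datetime only allows seconds 0..59, so the same instant is rejected."""
    try:
        datetime(2012, 6, 30, 23, 59, 60)
    except ValueError as exc:
        return f"rejected: {exc}"
    return "accepted"


if __name__ == "__main__":
    print(parse_leap_second().tm_sec)
    print(build_leap_second_datetime())
```

Code that passes a parsed timestamp from the first API into the second works every day of the year except during a leap second.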

Page 41: Atmosphere 2013

null code release

Page 42: Atmosphere 2013
Page 43: Atmosphere 2013

lessons learned

- buy it and build monitoring and backup for it

Page 44: Atmosphere 2013

\018 instead of \021

All varnishes crashed within minutes...
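
The slide does not explain why this one-character difference was fatal, so the following is only background on the escapes themselves, not a root cause analysis. In C, and likewise in Python string literals, an octal escape stops at the first digit that is not 0-7. `\021` is therefore a single character, while `\018` is the character `\01` followed by a literal `8`. The helper name `describe` is hypothetical.

```python
valid = "\021"   # one character: octal 21, i.e. 0x11
broken = "\018"  # two characters: "\01" (0x01) followed by the digit "8"


def describe(text):
    """Return the code point of each character as a hex string."""
    return [hex(ord(char)) for char in text]


if __name__ == "__main__":
    print(len(valid), describe(valid))    # 1 ['0x11']
    print(len(broken), describe(broken))  # 2 ['0x1', '0x38']
```

Any code that expects a one-character string at that position gets a different value and a different length than intended.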

Page 45: Atmosphere 2013
Page 46: Atmosphere 2013

lessons learned

- inline C functions moved out of our VCLs

- changes to C functions need to be requested

Page 47: Atmosphere 2013

consistently?

Page 48: Atmosphere 2013
Page 49: Atmosphere 2013
Page 50: Atmosphere 2013
Page 51: Atmosphere 2013

lessons learned

- buy it and build monitoring for it

Page 52: Atmosphere 2013

do your homework

- always do post-mortems and blameless RCAs

Page 53: Atmosphere 2013

http://www.wikia.com/Careers

Page 54: Atmosphere 2013
Page 55: Atmosphere 2013

Thank You!