scaling at showyou - operations · introductionstorageprocessingmonitoringreview challenges...

55
Introduction Storage Processing Monitoring Review Scaling at Showyou Operations September 26, 2011

Upload: others

Post on 07-Aug-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Scaling at ShowyouOperations

September 26, 2011

Page 2: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

I’m Kyle Kingsbury

Handle aphyrCode http://github.com/aphyrEmail [email protected] Backend, API, ops

Page 3: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

What the hell is Showyou?

Page 4: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 5: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Nontrivial complexity

Page 6: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Challenges

� Scanning social networks� Feeds� Search� Trends� Responsive client experience

� Everything fails all the time

Page 7: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Challenges

� Scanning social networks� Feeds� Search� Trends� Responsive client experience� Everything fails all the time

Page 8: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 9: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Storage

Page 10: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 11: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

We left MySQL

� Changing the schema requires downtime� Crashes� Master-slave lag� Slow restarts� Node replacements difficult� Fully normalized queries complex, slow

Page 12: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

We left MySQL

� Changing the schema requires downtime� Crashes� Master-slave lag� Slow restarts� Node replacements difficult� Fully normalized queries complex, slow

Page 13: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

MySQL does scale

But there are tradeoffs

Page 14: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Riak

� Key/value store� Homogenous� Scales linearly with nodes� Excellent durability/recoverability� Eventually consistent

Page 15: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

We use Riak as our durable datastore

� Users, feeds, videos, etc� Highly denormalized� Limited MR queries (feeds, etc)

� Latency-bounded MR jobs are Erlang� Hot-deployable

� Extensive use of conflict resolution� Made possible by Risky

Page 16: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Riak at Showyou

� 51 million keys (153 M replicated)� 100 GB of data (300 GB replicated)� 260 gets/sec (baseline)� 75 puts/sec (baseline)� Capable of over 3000 ops/sec

Page 17: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

SSDs are amazing

WD 7200RPM

� 100 ops/sec� 95%: 100-300ms

Micron RealSSD P300

� 1000+ ops/sec� 95%: 3-5ms

Page 18: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

When Riak fails,

� Another node takes up the slack� Clients connected to that node reconnect to others� Typically no service interruption

� However, latencies may rise� Especially for MR jobs

Page 19: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Riak has downsides

� Difficult to debug� Membership changes are dangerous� Significantly slower than MySQL� (Bitcask) All keys must fit in memory� Mapreduce is only appropriate for known keys� List-keys can take down your cluster

Long story short: it’s only a KV store

Page 20: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

+Redis

Page 21: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

We use Redis for fast, temporary state

� List of users� List of videos� Counters� Queues

Incredibly fast, excellent primitives

Page 22: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

When Redis fails,

� Daemons using those indexes pause� Frontend service continues� Bitcask scanners and incremental updaters repair

any lost data

Eventually consistent.

Page 23: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

When Redis fails,

� Daemons using those indexes pause� Frontend service continues� Bitcask scanners and incremental updaters repair

any lost data

Eventually consistent.

Page 24: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

We also use SOLR extensively

� Supplements Riak� Complex indices� Full-text search� Analytics

More on that later. . .

Page 25: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Processing

Page 26: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 27: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Do one thing well

Lots of small processes handling well-defined tasks

� Easier to debug� Easier to test� Simplifies parallelism� Simplifies error handling� Less likely to cause total system failure

Page 28: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Minimize Shared State

� Vector clocks for concurrent modification� Queues for message passing� Riak for durable storage� Redis for fast synchronous state

Page 29: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Crash by Default

� Someone else will take your work� Repair constantly� Assume everybody is out to kill you

Page 30: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Distribute

� Multiple threads, processes, hosts� Failover IPs with Heartbeat� Rolling restarts mean frequent deploys and nobody

notices� Losing a node is no big deal� Scaling out is easy

Page 31: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 32: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Monitoring

Page 33: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 34: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

UState: A state aggregator

Page 35: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Receive states over protobufs

Host backend1.showyou.comService feed merger rate

Time unix epoch secondsState ok

Metric 12.5Description 12.5 feed items/sec

Page 36: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Query states

� state = "warning" or state = "critical"� service =∼ "api %" and host != null

Page 37: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

� Combine states together (sum, average, . . . )� Send email on changes� Forward to another UState server� Forward to Graphite� Dashboard

Page 38: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 39: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 40: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Understand application behavior

Page 41: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 42: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

When can we. . . ?

Page 43: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 44: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

It’s 23:15 PST.

Do you know where YOUR database is?

Page 45: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

It’s 23:15 PST.

Do you know where YOUR database is?

Page 46: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 47: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 48: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 49: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 50: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 51: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 52: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails
Page 53: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

http://github.com/aphyr/ustate

Page 54: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Recap

� Robust, discrete components� Highly distributed� Message passing� Eventual consistency� Comprehensive monitoring

Page 55: Scaling at Showyou - Operations · IntroductionStorageProcessingMonitoringReview Challenges Scanning social networks Feeds Search Trends Responsive client experience Everything fails

Introduction Storage Processing Monitoring Review

Thanks!

� Basho (esp. Pharkmillups!)� Formspring� Bump