monitoring at facebook - ran leibman, facebook - devopsdays tel aviv 2015


Upload: devopsdays-tel-aviv

Post on 22-Jan-2018


TRANSCRIPT

Page 1: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Monitoring @ Facebook

Ran Leibman, Production Engineer

Monitoring Tools, Components, & Mentality at Facebook

Page 2: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Who Am I?

Page 3: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Agenda

• Problems in today’s monitoring, solutions & approaches
• Facebook Monitoring Architecture
• Dive into each component
• Show use cases

Page 4: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

The Problems

Page 6: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Problems - Nagios Checks

• Good for binary checks
• Monitoring is based on a “point in time”, the moment the script executes
• Can’t monitor over a time window (you can do clowny checks with a temp file, but that is not so elegant …)
• Perf data
  • Not all plugins implement this
  • Data gathering is coupled with the way you want to alert
  • Hard to aggregate perf data from multiple checks

Page 7: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Problems - Cron Scripts

• Difficult to put these scripts in “maintenance mode” while deploying code or acknowledging an issue
• How do you know that your script actually runs?
• Who stopped crond!?
• Error handling?
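One common way to answer "how do you know that your script actually runs?" is to have the job record a heartbeat on success and to alert on a stale heartbeat from somewhere else. The slide only raises the question; the sketch below is a generic pattern, not what Facebook uses, and `run_with_heartbeat`, `is_stale`, and `record_heartbeat` are hypothetical names.

```python
import time
from typing import Callable, Optional, Tuple


def run_with_heartbeat(
    job: Callable[[], None],
    record_heartbeat: Callable[[float], None],
) -> Tuple[bool, Optional[str]]:
    """Run a job and record a heartbeat timestamp only if it succeeds."""
    try:
        job()
    except Exception as exc:
        # Return the error to the caller instead of losing it in cron mail.
        return False, repr(exc)
    record_heartbeat(time.time())
    return True, None


def is_stale(
    last_heartbeat: float,
    max_age_seconds: float,
    now: Optional[float] = None,
) -> bool:
    """Return True if the last successful run is older than max_age_seconds."""
    if now is None:
        now = time.time()
    return now - last_heartbeat > max_age_seconds
```

The staleness check should run on a different host or system than the job, so that a stopped crond or a dead machine is still detected.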

Page 8: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Problems - Metrics Are 2nd-Class Citizens

• We do not always treat our metric store the way we treat our application data
  • Storing metrics in clowny temp files or some unmaintained MySQL
• Metrics are DATA like any other and we should treat them as such!
  • How & from where are we going to query it?
  • What is the best way to store it?
  • Retention?

Page 9: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Problems - Ops Ownership

• Usually only the ops team owns monitoring
• Even if developers want to add metrics and alerts, they face a steep learning curve to get there

Page 10: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Facebook Monitoring Architecture

Page 11: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Operational Data Store - ODS

Page 12: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Operational Data Store - ODS

• “key -> float” type of metrics, associated with an entity
  • system.load1, system.cpu-user, system.n_eth-txbyt
  • chef.run_sucess
  • chef.last_run_time
  • myapp.request_num
  • myapp.request_median

Entity / Key-Value
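The entity / key -> float model can be pictured as a map from (entity, key) pairs to timestamped float values. This is a minimal in-memory sketch of that shape only; `MetricStore` and its methods are hypothetical names, and the entity `web001` is made up for illustration. It says nothing about how ODS is actually implemented.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class MetricStore:
    """Toy store: each (entity, key) pair owns a list of (timestamp, value)."""

    def __init__(self) -> None:
        self._series: Dict[Tuple[str, str], List[Tuple[int, float]]] = defaultdict(list)

    def add(self, entity: str, key: str, timestamp: int, value: float) -> None:
        # Every metric value is stored as a float, as in "key -> float".
        self._series[(entity, key)].append((timestamp, float(value)))

    def latest(self, entity: str, key: str) -> Optional[float]:
        """Return the value with the newest timestamp, or None if unknown."""
        points = self._series.get((entity, key))
        if not points:
            return None
        return max(points, key=lambda point: point[0])[1]

    def keys_for(self, entity: str) -> List[str]:
        """List the metric keys reported for one entity."""
        return sorted(key for (ent, key) in self._series if ent == entity)
```

For example, `store.add("web001", "system.load1", 60, 0.5)` records one data point for the key `system.load1` on the entity `web001`.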

Page 13: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Operational Data Store - ODS

• Uses Gorilla, an in-memory TSDB, for short-term data (24 hours)
• Stores permanent data in HBase on top of HDFS
• Aggregation
  • by rack / cluster / tier
  • by custom tags - app, tier name, etc.
  • cross-datacenter aggregations

Data Store
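Aggregating by rack, cluster, tier, or a custom tag amounts to grouping samples by one tag and reducing each group. The sketch below assumes each sample is a dict with a `tags` mapping and a float `value`; that layout and the function name `aggregate_by_tag` are illustrative assumptions, not the ODS API.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List


def aggregate_by_tag(
    samples: Iterable[dict],
    tag: str,
    reducer: Callable[[List[float]], float] = mean,
) -> Dict[str, float]:
    """Group sample values by the given tag and reduce each group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for sample in samples:
        tag_value = sample["tags"].get(tag)
        if tag_value is not None:  # skip samples that lack this tag
            groups[tag_value].append(sample["value"])
    return {name: reducer(values) for name, values in groups.items()}
```

Passing `max` or `sum` as the reducer gives per-rack peaks or totals from the same samples.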

Page 14: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Operational Data Store - ODS

Collecting Metrics

• API modules for every imaginable language
• Thrift endpoint - https://thrift.apache.org
• Implements fb303 counters
• fb303 counters are collected from the service by FBAgent
• FBAgent submits the metrics to ODS

Page 15: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Operational Data Store - ODS

Retention

• Metrics are A LOT of data!
• All ODS metrics are rolled up in the same way
  • Daily - 2 Weeks (depends on you)
  • Weekly - 2 Weeks (1min)
  • Monthly - 1 Month (1h)
  • Yearly - until the end of time (1h)
• How do I solve the data loss?

Page 16: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Operational Data Store

Aggregation

• Aggregate important metrics to preserve spikes
  • p50, p90, p99
  • top(N)
  • count
  • min, max
• Aggregate by cluster | rack | custom
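To see why storing several aggregates softens the data loss from rollups, consider rolling 1-minute points up into 1-hour buckets. An average alone flattens a short spike, while max and p99 still show it. This is a minimal sketch assuming points are `(unix_timestamp, value)` pairs and using a nearest-rank percentile; the function names are hypothetical and this is not the ODS rollup code.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def roll_up(
    points: Iterable[Tuple[int, float]],
    bucket_seconds: int = 3600,
) -> Dict[int, Dict[str, float]]:
    """Downsample points into fixed buckets, keeping several aggregates."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of its bucket.
        buckets[timestamp - timestamp % bucket_seconds].append(value)

    result: Dict[int, Dict[str, float]] = {}
    for start, values in sorted(buckets.items()):
        values.sort()
        result[start] = {
            "count": len(values),
            "min": values[0],
            "max": values[-1],
            "avg": sum(values) / len(values),
            "p50": percentile(values, 50),
            "p90": percentile(values, 90),
            "p99": percentile(values, 99),
        }
    return result
```

With 59 one-minute points at 10.0 and a single point at 500.0 in the same hour, the hourly average stays below 20 while `max` and `p99` both report 500.0, so the spike survives the rollup.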

Page 17: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Scuba - Real-Time Deep Dive Log Monitoring

Page 18: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Scuba - Real-Time Log Monitoring

What is going on RIGHT NOW with my service?

• Started as a hackathon project; today we can’t live without it
• Combines application logs from all servers & containers into a single table
• Data is stored in memory
• Very small lag (<1min)
• Super fast queries (median of ~300ms)
• SQL-like query syntax

Page 19: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Scuba - Real-Time Log Monitoring

Logging to Scuba

• Libraries for every imaginable language
  • PHP, Python, C++, Bash, etc.
• Scuba supports:
  • Strings
  • Ints
  • Sets of strings
  • Stacks of strings (usually for stack traces)

Page 20: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Scuba - Real-Time Log Monitoring

So… What’s the catch?

• Strict quota policy
  • by size
  • by time
• Use sampling to reduce load
• Not good for pipelining data
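Sampling to reduce load usually means logging only about one row in N and storing N on the row, so that counts can be scaled back up at query time. The sketch below shows that generic technique with hypothetical names (`maybe_log`, `estimated_count`, a plain list as the sink); it is not the Scuba client API.

```python
import random
from typing import Any, Dict, List


def maybe_log(
    row: Dict[str, Any],
    sample_rate: int,
    sink: List[Dict[str, Any]],
    rng: random.Random,
) -> bool:
    """Log roughly 1 in sample_rate rows and record the rate on the row."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be >= 1")
    if rng.random() < 1.0 / sample_rate:
        sink.append({**row, "sample_rate": sample_rate})
        return True
    return False


def estimated_count(rows: List[Dict[str, Any]]) -> int:
    """Estimate the original number of events from the sampled rows."""
    # Each stored row stands for sample_rate original events.
    return sum(row["sample_rate"] for row in rows)
```

A sample rate of 1 logs every row; higher rates trade precision on rare events for a smaller table, which helps stay within the size quota.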

Page 26: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Alarm System

Page 28: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Alarm System

What is an Alert?

• Creating an alert does not mean you’ll be notified!
• An alert is an event stating that something happened
  • Can’t ssh to a server
  • p50 of request time is slower than 100ms for 80% of the last 10min
  • The application tier in us-east is 50% down
• The alert should contain ALL relevant data about the event
• Alerts can be suppressed during maintenance
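A windowed condition such as "p50 slower than 100ms for 80% of the last 10min" can be evaluated over stored samples, producing an alert event that carries its own context and a suppression flag. This is a sketch under assumptions: samples are `(unix_timestamp, p50_ms)` pairs, and the `Alert` fields and the `evaluate_window` name are hypothetical, not the structure of Facebook's alarm system.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class Alert:
    """An event stating that something happened, with its context attached."""
    name: str
    entity: str
    timestamp: int
    details: Dict[str, float] = field(default_factory=dict)
    suppressed: bool = False  # True while the entity is in maintenance


def evaluate_window(
    samples: Iterable[Tuple[int, float]],
    now: int,
    entity: str,
    window_seconds: int = 600,
    threshold_ms: float = 100.0,
    min_fraction: float = 0.8,
    in_maintenance: bool = False,
) -> Optional[Alert]:
    """Return an Alert if enough samples in the window breach the threshold."""
    recent = [value for ts, value in samples if now - window_seconds < ts <= now]
    if not recent:
        return None
    fraction = sum(1 for value in recent if value > threshold_ms) / len(recent)
    if fraction < min_fraction:
        return None
    return Alert(
        name="p50_request_time_slow",
        entity=entity,
        timestamp=now,
        details={
            "fraction_breaching": fraction,
            "threshold_ms": threshold_ms,
            "window_seconds": float(window_seconds),
            "worst_ms": max(recent),
        },
        suppressed=in_maintenance,
    )
```

Keeping suppression as a field on the event, rather than skipping evaluation, means the alert still exists as data even when nobody is notified about it.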

Page 29: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Alarm System - Alert Structure

Page 34: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

FBAR - Facebook Auto-Remediation

Page 35: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

FBAR

Automation, Automation, Automation!

• Most alarms can be auto-remediated without human intervention
• Code it once, never do it again
• Doing the work of 136,000 engineering hours (29/04/2015)
  • 136,000 hours / 8 hours per workday = 17,000 engineers working a full day!

Page 36: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

FBAR

Page 37: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

FBAR

Page 38: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

FBAR

Page 39: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Notifications & Subscriptions

Page 42: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Notification & Subscriptions

What should I be paged about?

• Actionable alerts
• Impactful alerts
• Before you subscribe to an alert ask yourself:
  • Can I automate this? (FBAR!)
  • Is this actionable?
  • Should an engineer wake up because of this?

Page 45: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Notification & Subscriptions

Page 46: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Dashboards

Page 47: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Dashboards

What are they good for?

• Awesome tool for debugging production issues
• Making your case:
  • We need to fix this code path
  • If we had the BLABLA tool it would reduce this by a factor of X
  • Since we deployed the last release, engagement dropped by 10% in Western Europe
• Dashboards are cool to look at =) They are the best way to understand what is going on with the service

Page 48: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Cubism

based on Cubism.js

Page 52: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

So, what did we have?

Page 53: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Use data!

Page 54: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Think before you alert

Page 55: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Surfacing problems you never thought existed

Page 56: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Monitoring is not an “Ops Job”

Page 57: Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015

Questions?

Ran Leibman, Production Engineer
