Monitoring Using Open Source Technologies
Utkarsh Bhatnagar
• Senior Software Engineer @ Sony Interactive Entertainment (PlayStation)
• An active contributor to Grafana
• Project initiator for wizzy, a user-friendly CLI tool for Grafana
GitHub: https://github.com/utkarshcmu
Email: [email protected]
GrafanaCon 2016 Speaker - https://www.youtube.com/watch?v=llRhdvV25rg
Monitoring using
@
PlayStation Outage!
Hi, I am Jack.
About 2 years ago…
POC on Monitoring
Requirements:
• 50,000 unique metrics from one source
• Data points every minute
• About 72 million data points per day
• Data retention: 60 days
• User-friendly UI that can be customized
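The daily volume in these requirements follows directly from the metric count and the one-minute interval. A quick check in Python is sketched below. The retained-point total is a derived figure, not one from the talk, and assumes every metric reports every minute for the full 60 days; the function names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 reporting slots per day at a one-minute interval


def points_per_day(unique_metrics: int, interval_minutes: int = 1) -> int:
    """Data points written per day when every metric reports at a fixed interval."""
    return unique_metrics * (MINUTES_PER_DAY // interval_minutes)


def points_retained(unique_metrics: int, retention_days: int,
                    interval_minutes: int = 1) -> int:
    """Data points held at once, assuming no downsampling during retention."""
    return points_per_day(unique_metrics, interval_minutes) * retention_days


daily = points_per_day(50_000)
retained = points_retained(50_000, retention_days=60)
print(f"{daily:,} data points per day")
print(f"{retained:,} data points held at 60-day retention")
```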
Monitoring Stack
METRIC SOURCE → Time Series Database → Visualization Layer
Choosing the technology!
POC Design & Architecture
METRIC SOURCE
POC Completed!
Mission accomplished!
• 1 metrics source
• 50,000 unique metrics
• 72 million data points per day
Metrics Onboarding
Team 1 Requirements:
• 100,000 unique metrics
• About 200 million data points per day
Team 2 Requirements:
• 400,000 unique metrics
• About 600 million data points per day
Team 3 Requirements:
• 500,000 unique metrics
• About 2 billion data points per day
Team 4 Requirements:
• 800,000 unique metrics
• About 5 billion data points per day
And more…
How to Scale?
• Should he continue with Graphite?
• Should he ask teams to reduce metrics or data points?
• How to dynamically scale Graphite?
• Does Grafana support other data sources?
• OpenTSDB / InfluxDB / KairosDB / Prometheus?
• Can the infrastructure scale to support a variable load of metrics?
Challenges:
• Multiple teams
• Millions of unique metrics
• More than 10 billion data points a day
• Process 3 million logs every minute and generate metrics
• Reprocessing of metrics and logs if needed
• Provide real-time monitoring for all of the above using Grafana!
Strategy
Divide & Conquer
Team 1 Requirements:
• 100,000 unique metrics
• About 200 million data points per day
Team 2 Requirements:
• 500,000 unique metrics
• About 2 billion data points per day
Team 3 Requirements:
• 3 million logs a minute
• Generate metrics in real time
And more…
Team 1 Requirements:
• 100,000 unique metrics
• About 200 million data points per day
Design & Architecture
POC METRIC SOURCE
POC works for:
• 1 metrics source
• 50,000 unique metrics
• 72 million data points per day
Team 1 requirements:
• 1 metrics source
• 100,000 unique metrics
• 200 million data points per day
TEAM 1 METRIC SOURCE
Team 1 Conquered!
This strategy works! Bring it on!
Team 2 Requirements:
• 500,000 unique metrics
• About 2 billion data points per day
Design & Architecture
POC METRIC SOURCE
Team 2 requirements:
• 1 metrics source
• 500,000 unique metrics
• 2 billion data points per day
TEAM 1 METRIC SOURCE
TEAM 2 METRIC SOURCE
Team 2 Conquered!
Scaling Graphite
Clustering Graphite
CARBON RELAY
CARBON CACHE + WHISPER + GRAPHITE WEB | CARBON CACHE + WHISPER + GRAPHITE WEB | CARBON CACHE + WHISPER + GRAPHITE WEB | . . .
GRAPHITE WEB | GRAPHITE WEB
LOAD BALANCER
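In a layout like this, the relay decides which cache node receives each metric, and every point for a given metric name has to land on the same node so that its Whisper file lives in one place. The sketch below illustrates that routing idea with a small consistent-hash ring. It is a simplified illustration, not carbon-relay's actual implementation; the node names, the `HashRing` class and the replica count are all hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring mapping metric names to storage nodes."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring many times (virtual nodes) so that
        # metrics spread more evenly across the cluster.
        self._ring = sorted(
            (self._position(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _position(key):
        # Any stable hash works here; SHA-256 is an arbitrary choice.
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, metric_name):
        # Walk clockwise to the first ring entry past the metric's position,
        # wrapping around at the end of the ring.
        idx = bisect.bisect(self._positions, self._position(metric_name))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical node names standing in for the cache/Whisper nodes above.
nodes = ["graphite-node-1", "graphite-node-2", "graphite-node-3"]
ring = HashRing(nodes)
for metric in ["team2.api.latency.p99", "team2.api.requests.count", "team2.db.connections"]:
    print(metric, "->", ring.node_for(metric))
```

Because the mapping depends only on the metric name and the node list, any number of relay processes built with the same node list route a metric identically, and adding a node moves only a fraction of the metrics.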
Design & Architecture
POC METRIC SOURCE
TEAM 1 METRIC SOURCE
TEAM 2 METRIC SOURCE
Team 2 requirements:
• 1 metrics source
• 500,000 unique metrics
• 2 billion data points per day
Graphite cluster: CR (carbon relay) → G G G . . . (Graphite nodes) → GW GW (Graphite web) → LB (load balancer)
Team 2 Conquered!
But… happiness lasted only for a month.
Scalable Alternatives to Graphite
Team 2 Conquered!
Finally!
Team 3 Requirements:
• 3 million logs a minute
• Generate metrics in real time
How to process logs at scale?
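One common way to turn a log stream into metrics is to bucket events by minute, count them, and emit one data point per bucket. The sketch below does this in memory for a handful of lines and prints the result in Graphite's plaintext format (`path value timestamp`). The log line layout, the `logs.<service>.<level>` naming and the helper names are assumptions for illustration; the talk does not specify the log format or the pipeline used, and a single process like this would likely not keep up with 3 million logs a minute.

```python
from collections import Counter
from datetime import datetime, timezone


def parse(line):
    """Parse '<ISO-8601 UTC timestamp> <service> <LEVEL> <message>' (assumed layout)."""
    ts, service, level, _message = line.split(" ", 3)
    when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return when, service, level.lower()


def count_per_minute(lines):
    """Count log lines per (metric path, minute bucket)."""
    counts = Counter()
    for line in lines:
        when, service, level = parse(line)
        minute = int(when.timestamp()) // 60 * 60  # floor to the start of the minute
        counts[(f"logs.{service}.{level}", minute)] += 1
    return counts


def to_graphite_lines(counts):
    """Render counts as Graphite plaintext lines: 'path value timestamp'."""
    return [f"{path} {value} {minute}" for (path, minute), value in sorted(counts.items())]


# Made-up sample lines in the assumed layout.
sample = [
    "2017-01-01T00:00:05Z store ERROR payment timeout",
    "2017-01-01T00:00:40Z store ERROR payment timeout",
    "2017-01-01T00:00:59Z store INFO checkout ok",
    "2017-01-01T00:01:02Z store ERROR payment timeout",
]
counts = count_per_minute(sample)
for out in to_graphite_lines(counts):
    print(out)
```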
Design & Architecture
POC METRIC SOURCE
TEAM 1 METRIC SOURCE
TEAM 2 METRIC SOURCE
Team 3 requirements:
• Over 5,000 log sources
• 3 million logs per minute
LOGS SOURCES
Team 3 Conquered!
But… one day…
Design & Architecture
POC METRIC SOURCE
TEAM 1 METRIC SOURCE
TEAM 2 METRIC SOURCE
LOGS SOURCES
Design & Architecture
METRIC SOURCE 1
METRIC SOURCE 2
METRIC SOURCE 3
METRIC SOURCE N
LOGS SOURCES
LB
Alerting
Metrics & Logs Sources
• Graphite stats: apps using a stats library written by Alexander Filipchik
• Custom metrics: from other sources
Lessons Learned
Strategy
Divide & Conquer
Look for alternatives!
Choose scalable components!
(Subject to effort and time)
Automation
Some numbers
• More than 3 million unique metrics supported
  - creation and deletion happen all the time
• More than 11 billion data points written per day
  - across all TSDBs
• Processing about 40 billion events per day
  - log and metric events in near real time (within 30 seconds)
• More than 3,000 requests per minute to Grafana dashboards
  - around 7,000 requests during outages
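For capacity planning it can help to restate these daily totals as average per-second rates. The conversion below is plain arithmetic on the figures above. It assumes load is spread evenly across the day, which real traffic is not, so peaks will be higher; the daily figures are also lower bounds ("more than"), so the averages are lower bounds too.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(per_day: int) -> float:
    """Average per-second rate for a daily total, assuming an even spread."""
    return per_day / SECONDS_PER_DAY


figures = {
    "data points written": 11_000_000_000,
    "events processed": 40_000_000_000,
}
for label, per_day in figures.items():
    print(f"{label}: about {per_second(per_day):,.0f} per second on average")

# Dashboard traffic is quoted per minute rather than per day.
print(f"dashboard requests: about {3000 / 60:.0f} per second on average")
```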
Monitoring Stack @ Sony PlayStation
METRIC SOURCE 1
METRIC SOURCE 2
METRIC SOURCE 3
METRIC SOURCE N
LOGS SOURCES
LB
Alerting
Grafana
A metrics visualization and alerting tool
Supports multiple time series databases
Supports multiple panel types
https://grafana.net/plugins
Supports multiple notification channels for alerting
Other features:
• Alert lists
• Drilldown links
• Template variables
• Dashboard snapshots
• Grafana.net community
• Grafana CLI
Grafana links!
http://grafana.org/
http://docs.grafana.org/
https://github.com/grafana/grafana
https://raintank.slack.com
• Move
• Copy
• Extract
• Insert
• Remove

• Rows
• Panels
• Template variables
• Dashboard tags
Version Control
• Dashboards
• Datasources
• Orgs
• Rows
• Panels
• Template variables
• Dashboard tags
Grafana in multiple environments
• Production
• Staging
• Testing
• Development
Generate GIFs of important dashboards
• Last 24 hours
• By a dashboard tag
• Customized dashboard list
External features
• Upload dashboards to, store them in, and download them from AWS S3
• Search and download community dashboards from Grafana.net
wizzy links!
https://utkarshcmu.github.io/wizzy-site/
https://utkarshcmu.github.io/wizzy-site/home/
https://github.com/utkarshcmu/wizzy
https://raintank.slack.com/messages/wizzy/